TPN: Using positive-only learning to deal with the heterogeneity of labeled and unlabeled data
نویسندگان
چکیده
This paper introduces TPN, the runner up method in both tasks of the ECML-PKDD Discovery Challenge 2006 on personalized spam filtering. TPN is a classifier training method that bootstraps positive-only learning with fully-supervised learning, in order to make the most of labeled and unlabeled data, under the assumption that the two are drawn from significantly different distributions. Furthermore, the unlabeled data themselves are separated into subsets that are assumed to be drawn from multiple distributions. For that reason, TPN trains a different classifier for each subset, making use of all unlabeled data each time.
منابع مشابه
Detecting Concept Drift in Data Stream Using Semi-Supervised Classification
Data stream is a sequence of data generated from various information sources at a high speed and high volume. Classifying data streams faces the three challenges of unlimited length, online processing, and concept drift. In related research, to meet the challenge of unlimited stream length, commonly the stream is divided into fixed size windows or gradual forgetting is used. Concept drift refer...
متن کاملAssessing binary classifiers using only positive and unlabeled data
Assessing the performance of a learned model is a crucial part of machine learning. Most evaluation metrics can only be computed with labeled data. Unfortunately, in many domains we have many more unlabeled than labeled examples. Furthermore, in some domains only positive and unlabeled examples are available, in which case most standard metrics cannot be computed at all. In this paper, we propo...
متن کاملکاهش ابعاد دادههای ابرطیفی به منظور افزایش جداییپذیری کلاسها و حفظ ساختار داده
Hyperspectral imaging with gathering hundreds spectral bands from the surface of the Earth allows us to separate materials with similar spectrum. Hyperspectral images can be used in many applications such as land chemical and physical parameter estimation, classification, target detection, unmixing, and so on. Among these applications, classification is especially interested. A hyperspectral im...
متن کاملSemi-Supervised Self-training Approaches for Imbalanced Splice Site Datasets
Machine Learning algorithms produce accurate classifiers when trained on large, balanced datasets. However, it is generally expensive to acquire labeled data, while unlabeled data is available in much larger amounts. A cost-effective alternative is to use Semi-Supervised Learning, which uses unlabeled data to improve supervised classifiers. Furthermore, for many practical problems, data often e...
متن کاملPositive and Unlabeled Examples Help Learning
In many learning problems, labeled examples are rare or expensive while numerous unlabeled and positive examples are available. However, most learning algorithms only use labeled examples. Thus we address the problem of learning with the help of positive and unlabeled data given a small number of labeled examples. We present both theoretical and empirical arguments showing that learning algorit...
متن کامل