Fast Uncertainty Sampling for Labeling Large E-mail Corpora
Authors
Abstract
One of the biggest challenges in building effective anti-spam solutions is designing systems to defend against the ever-evolving bag of tricks spammers use to defeat them. Because of this, spam filters that work well today may not work well tomorrow. The adversarial nature of the spam problem makes large, up-to-date, and diverse e-mail corpora critical for the development and evaluation of new anti-spam filtering technologies. Gathering large collections of messages can be quite easy, especially in a large corporate or ISP environment. The challenge is therefore not so much collecting enough mail as collecting a representative distribution of mail types as seen “in the wild” and then accurately labeling the hundreds of thousands or millions of accumulated messages as spam or non-spam. In machine learning, Uncertainty Sampling is a well-known Active Learning algorithm that uses a collaborative model to minimize the human effort required to label large datasets. While conventional Uncertainty Sampling has been shown to be very effective, it is also computationally very expensive, since the learner must reclassify all the unlabeled instances during each learning iteration. We propose a new algorithm, Approximate Uncertainty Sampling (AUS), which is nearly as effective as Uncertainty Sampling but has substantially lower computational complexity. The reduced computational cost allows Approximate Uncertainty Sampling to be applied to larger datasets and also makes it possible to update the learned model more frequently. Approximate Uncertainty Sampling encourages the building of larger, more topical, and more realistic example e-mail corpora for evaluating new anti-spam filters.
While we focus on the binary labeling of large volumes of e-mail messages, as with Uncertainty Sampling, Approximate Uncertainty Sampling can be used with a wide range of underlying classification algorithms for a variety of categorization tasks.
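To make the cost of conventional Uncertainty Sampling concrete, the following is a minimal sketch of the conventional loop only, not of the Approximate Uncertainty Sampling algorithm, whose details the abstract does not give. Everything in it is an illustrative assumption: the one-dimensional toy data, the nearest-centroid model, and the names `train_centroids`, `uncertainty`, `uncertainty_sampling`, and `oracle` are all hypothetical. The counter `rescored` shows how the whole unlabeled pool is reclassified on every iteration.

```python
import random


def train_centroids(labeled):
    """Fit a toy nearest-centroid model: the mean feature value per class.

    Assumes `labeled` contains at least one example of each class (0 and 1).
    """
    means = {}
    for label in (0, 1):
        xs = [x for x, y in labeled if y == label]
        means[label] = sum(xs) / len(xs)
    return means


def uncertainty(means, x):
    """Return a score that is larger when the model is less certain about x.

    The margin is the gap between the distances to the two class centroids;
    a margin of zero means x lies exactly on the decision boundary.
    """
    margin = abs(abs(x - means[0]) - abs(x - means[1]))
    return -margin


def uncertainty_sampling(pool, oracle, seed_labeled, budget):
    """Conventional uncertainty sampling on a toy problem (illustrative only).

    `oracle` stands in for the human labeler. Returns the labeled set and the
    total number of instance scorings performed.
    """
    labeled = list(seed_labeled)
    unlabeled = list(pool)
    rescored = 0
    for _ in range(budget):
        if not unlabeled:
            break
        means = train_centroids(labeled)
        # The expensive step: every remaining unlabeled instance is rescored.
        scores = [uncertainty(means, x) for x in unlabeled]
        rescored += len(unlabeled)
        # Ask the labeler about the instance the model is least sure of.
        pick = max(range(len(unlabeled)), key=scores.__getitem__)
        x = unlabeled.pop(pick)
        labeled.append((x, oracle(x)))
    return labeled, rescored


rng = random.Random(0)
pool = [rng.random() for _ in range(200)]   # hypothetical unlabeled "messages"
oracle = lambda x: int(x > 0.5)             # hypothetical human labeler
seed = [(0.0, 0), (1.0, 1)]                 # one seed example per class

labeled, rescored = uncertainty_sampling(pool, oracle, seed, budget=10)
print(len(labeled), rescored)
```

With a pool of n instances and a budget of b queries, this loop performs on the order of n times b scorings, which is the cost the abstract identifies as the bottleneck for corpora of hundreds of thousands or millions of messages.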
Similar resources
A Genre Analysis of Reprint Request E-mails Written by EFL and Physics Professionals
The present study aimed to analyze reprint request e-mail messages written by postgraduates (MA students) of two fields of study, namely Physics and EFL, to realize the differences and similarities between the two email types. To investigate the purpose of the study, a sample of 100 e-mail messages, 50 Physics and 50 EFL, were analyzed according to Swales’ (1990) model for reprint requests and ...
Improving the performance of Naive Bayes multinomial in e-mail foldering by introducing distribution-based balance of datasets
E-mail foldering or e-mail classification into user predefined folders can be viewed as a text classification/categorization problem. However, it has some intrinsic properties that make it more difficult to deal with, mainly the large cardinality of the class variable (i.e. the number of folders), the different number of e-mails per class state and the fact that this is a dynamic problem, in th...
Propagation of large uncertainty sets in orbital dynamics by automatic domain splitting
Current approaches to uncertainty propagation in astrodynamics mainly refer to linearized models or Monte Carlo simulations. Naive linear methods fail in nonlinear dynamics, whereas Monte Carlo simulations tend to be computationally intensive. Differential algebra has already proven to be an efficient compromise by replacing thousands of pointwise integrations of Monte Carlo runs with the fast ...
Localization aware sampling and connection strategies for incremental motion planning under uncertainty
We present efficient localization aware sampling and connection strategies for incremental sampling-based stochastic motion planners. For sampling, we introduce a new measure of localization ability of a sample, one that is independent of the path taken to reach the sample and depends only on the sensor measurement at the sample. Using this measure, our sampling strategy puts more samples in re...
Towards Improving E-mail Content Classification for Spam Control: Architecture, Abstraction, and Strategies
This dissertation discusses techniques to improve the effectiveness and the efficiency of spam control. Specifically, layer-3 e-mail content classification is proposed to allow e-mail pre-classification (for fast spam detection at receiving e-mail servers) and to allow distributed processing at network nodes for fast spam detection at spam control points, e.g., at e-mail servers. Fast spam dete...