Exploiting Concept Clumping for Efficient Incremental E-Mail Categorization

Authors

  • Alfred Krzywicki
  • Wayne Wobcke
Abstract

We introduce a novel approach to incremental e-mail categorization based on identifying and exploiting “clumps” of messages that are classified similarly. Clumping reflects the local coherence of a classification scheme and is particularly important in a setting where the classification scheme is dynamically changing, such as in e-mail categorization. We propose a number of metrics to quantify the degree of clumping in a series of messages. We then present a number of fast, incremental methods to categorize messages and compare the performance of these methods with measures of the clumping in the datasets to show how clumping is being exploited by these methods. The methods are tested on 7 large real-world e-mail datasets of 7 users from the Enron corpus, where each message is classified into one folder. We show that our methods perform well and provide accuracy comparable to several common machine learning algorithms, but with much greater computational efficiency.
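The abstract does not define the clumping metrics or the categorization methods, so the following is only a minimal sketch under stated assumptions, not the authors' approach. It treats a message as a hypothetical (sender, folder) pair, measures one simple notion of clumping (how often a message lands in the same folder as the previous message from the same sender), and pairs it with a trivial incremental classifier that predicts the folder last seen for that sender. The names `clumping_ratio`, `LastSeenClassifier`, and `incremental_accuracy` are illustrative inventions.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical simplified message: (sender, folder). Real e-mail has many
# more features; the sender is used here purely for illustration.
Message = Tuple[str, str]


def clumping_ratio(messages: Iterable[Message]) -> float:
    """Fraction of messages filed in the same folder as the previous
    message from the same sender (an assumed, illustrative measure)."""
    last_folder: Dict[str, str] = {}
    repeats = 0
    comparable = 0
    for sender, folder in messages:
        if sender in last_folder:
            comparable += 1
            if last_folder[sender] == folder:
                repeats += 1
        last_folder[sender] = folder
    return repeats / comparable if comparable else 0.0


class LastSeenClassifier:
    """Predicts the folder most recently observed for a sender.
    Both prediction and update are constant-time dictionary operations."""

    def __init__(self) -> None:
        self._last: Dict[str, str] = {}

    def predict(self, sender: str) -> Optional[str]:
        return self._last.get(sender)

    def update(self, sender: str, folder: str) -> None:
        self._last[sender] = folder


def incremental_accuracy(messages: Iterable[Message]) -> float:
    """Predict each message before learning from it (test-then-train)."""
    clf = LastSeenClassifier()
    correct = 0
    total = 0
    for sender, folder in messages:
        if clf.predict(sender) == folder:
            correct += 1
        clf.update(sender, folder)
        total += 1
    return correct / total if total else 0.0


stream = [("a", "x"), ("a", "x"), ("b", "y"), ("a", "z"), ("b", "y")]
ratio = clumping_ratio(stream)
accuracy = incremental_accuracy(stream)
```

In this toy stream the classifier is correct only where the stream is clumped, which is the kind of relationship between clumping measures and accuracy that the abstract says the paper examines.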


Related resources

Exploiting Concept Clumping for Efficient Incremental News Article Categorization

In this paper, we introduce efficient methods for incremental multilabel categorization of documents. We use concept clumping to efficiently categorize news articles into a hierarchical structure of categories. Concept clumping is a phenomenon of local coherences occurring in the data and it has been previously used for fast, incremental e-mail classification. We extend the definition of clumpi...


TRESTLE: Incremental Learning in Structured Domains using Partial Matching and Categorization

We present TRESTLE, an incremental algorithm for probabilistic concept formation in structured domains that builds on prior concept learning research. TRESTLE works by creating a hierarchical categorization tree that can be used to predict missing attribute values and cluster sets of examples into conceptually meaningful groups. It is able to update its knowledge by partially matching novel str...


Scalable packet classification with controlled cross-producting

doi:10.1016/j.comnet.2008.11.017. Packet classification is central among traffic classification techniques that categorize packets with a traffic descriptor or with user-defined criteria. This categor...


A Hybrid Framework for Building an Efficient Incremental Intrusion Detection System

In this paper, a boosting-based incremental hybrid intrusion detection system is introduced. This system combines incremental misuse detection and incremental anomaly detection. We use a boosting ensemble of weak classifiers to implement the misuse intrusion detection system. It can identify new types of intrusions that do not exist in the training dataset for incremental misuse detection. As...




Publication date: 2010