Sequential sampling models of human text classification

نویسندگان

  • Michael D. Lee
  • Elissa Y. Corlett
چکیده

Text classification involves deciding whether or not a document is about a given topic. It is an important problem in machine learning, because automated text classifiers have enormous potential for application in information retrieval systems. It is also an interesting problem for cognitive science, because it involves real world human decision making with complicated stimuli. This paper develops two models of human text document classification based on random walk and accumulator sequential sampling processes. The models are evaluated using data from an experiment where participants classify text documents presented one word at a time under task instructions that emphasize either speed or accuracy, and rate their confidence in their decisions. Fitting the random walk and accumulator models to these data shows that the accumulator provides a better account of the decisions made, and a “balance of evidence” measure provides the best account of confidence. Both models are also evaluated in the applied information retrieval context, by comparing their performance to established machine learning techniques on the standard Reuters-21578 corpus. It is found that they are almost as accurate as the benchmarks, and make decisions much more quickly because they only need to examine a small proportion of the words in the document. In addition, the ability of the accumulator model to produce useful confidence measures is shown to have application in prioritizing the results of classification decisions. © 2002 Cognitive Science Society, Inc. All rights reserved.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researches to use...

متن کامل

k-Nearest Neighbor Classification Algorithm for Multiple Choice Sequential Sampling

Decision making from sequential sampling, especially when more than two alternative choices are possible, requires appropriate stopping criteria to maximize accuracy under time constraints. Optimal conditions for stopping have previously been investigated for modeling human decision making processes. In this work, we show how the k-nearest neighbor classification algorithm in machine learning c...

متن کامل

A New Approach for Text Documents Classification with Invasive Weed Optimization and Naive Bayes Classifier

With the fast increase of the documents, using Text Document Classification (TDC) methods has become a crucial matter. This paper presented a hybrid model of Invasive Weed Optimization (IWO) and Naive Bayes (NB) classifier (IWO-NB) for Feature Selection (FS) in order to reduce the big size of features space in TDC. TDC includes different actions such as text processing, feature extraction, form...

متن کامل

A Sequential Algorithm for Training

The ability to cheaply train text classiiers is critical to their use in information retrieval, content analysis, natural language processing, and other tasks involving data which is partly or fully textual. An algorithm for sequential sampling during machine learning of statistical classiiers was developed and tested on a newswire text categorization task. This method, which we call uncertaint...

متن کامل

An Improved Flower Pollination Algorithm with AdaBoost Algorithm for Feature Selection in Text Documents Classification

In recent years, production of text documents has seen an exponential growth, which is the reason why their proper classification seems necessary for better access. One of the main problems of classifying text documents is working in high-dimensional feature space. Feature Selection (FS) is one of the ways to reduce the number of text attributes. So, working with a great bulk of the feature spa...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Cognitive Science

دوره 27  شماره 

صفحات  -

تاریخ انتشار 2003