Two-Step Feature Selection and Neural Network Classification for the TREC-8 Routing

نویسندگان

  • Mathieu Stricker
  • Frantz Vichot
  • Gérard Dreyfus
  • Francis Wolinski
چکیده

At the Caisse des Dépôts et Consignations (CDC), the Agence France-Presse (AFP) news releases are filtered continuously according to the users' interests. Once a user has specified a topic of interest, a filter is customized to fit this user's profile. Until now, these filters would rely on rule-based methods, whose efficiency is proven [Vichot et al., 1999], but which require a large amount of work for each specific filter. This drawback can be avoided by using statistical methods which have the ability to learn from examples of relevant documents. Recently, we have developed a methodology for the AFP corpus. This paper presents its application to the TREC-8 corpus. For the TREC-8 routing, one specific filter is built for each topic. Each filter is a classifier trained to recognize the documents that are relevant to the topic. When presented with a document, each classifier estimates the probability for the document to be relevant to the topic for which it has been trained. Since the procedure for building a filter is topic-independent, the system is fully automatic. Therefore, we describe it for one topic; the procedure is repeated 50 times. By making use of a sample of documents that have previously been evaluated as relevant or not relevant to a particular topic, a term selection is performed, and a neural network is trained. Each document is represented by a vector of frequencies of a list of selected terms. This list depends on the topic to be filtered; it is constructed in two steps. The first step defines the characteristic words used in the relevant documents of the corpus; the second one chooses, among the previous list, the most discriminant ones. The length of the vector is optimized automatically for each topic. At the end of the term selection, a vector of typically 25 words is defined for the topic, so that each document which has to be processed is represented by a vector of term frequencies. This vector is subsequently input to a classifier that is trained from the same sample. After training, the classifier estimates for each document of a test set its probability of being relevant; for submission to TREC, the top 1000 documents are ranked in order of decreasing relevance.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Two Steps Feature Selection and Neural Network Classification for the TREC-8 Routing

At the Caisse des Dépôts et Consignations (CDC), the Agence France-Presse (AFP) news releases are filtered continuously according to the users' interests. Once a user has specified a topic of interest, a filter is customized to fit this user's profile. Until now, these filters would rely on rule-based methods, whose efficiency is proven [Vichot et al., 1999], but which require a large amount of...

متن کامل

Determining the effective features in classification of heart sounds using trained intelligent network and genetic algorithm

Heart diseases are among the most important causes of mortality in the world, especially in industrial countries. Using heart sounds and the features extracted from them are among the non-aggressive diagnosis and prognosis methods for heart diseases. In this study, the time-scale, Cepstral, frequency, temporal and turbulence features are saved and extracted from the heart sounds, and then they ...

متن کامل

Application of Artificial Neural Networks in a Two-step Classification for Acute Lymphocytic Leukemia Diagnosis by Blood Lamella Images

Introduction: This study aimed to present a system based on intelligent models that can enhance the accuracy of diagnostic systems for acute leukemia. The three parts including preprocessing, feature extraction, and classification network are considered as associated series of actions. Therefore, any dysfunction or poor accuracy in each part might lead in general dysfunction of...

متن کامل

Accurate Fault Classification of Transmission Line Using Wavelet Transform and Probabilistic Neural Network

Fault classification in distance protection of transmission lines, with considering the wide variation in the fault operating conditions, has been very challenging task. This paper presents a probabilistic neural network (PNN) and new feature selection technique for fault classification in transmission lines. Initially, wavelet transform is used for feature extraction from half cycle of post-fa...

متن کامل

Training Context-Sensitive Neural Networks with Few Relevant Examples for the TREC-9 Routing

The present paper describes our second participation to the routing task; it features improvements over our previous approach [Stricker et al., 2000]. Our former model used a "bag of words" for text representation with a feature selection, and a neural network without hidden neuron (i.e. a logistic regression), to estimate the probability of relevance of each document. This approach was close t...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1999