Statistical Consistency of Keywords Dictionary Parameters

Author

  • Grigorij Martynenko
Abstract

Constructing an optimal keyword dictionary is one of the most important tasks in developing thesaurus-based information retrieval systems and some other NLP applications. The question is which parameters of keyword dictionaries can be regarded as statistically consistent, that is, independent of sample size. An analysis of recent scientific work, together with the results of our own investigations, allowed us to compile a fairly complete list of parameters that may be used to describe texts and language resources. Each of these parameters was subjected to a consistency test. The methodology for the test was developed using the method of least squares with a number of principal modifications. Our main results are the following: 1) Theoretically, every analyzed parameter has either an upper or a lower limit, which means that in principle all of them are statistically consistent. For most parameters, however, actual consistency is reached only at very large sample sizes. 2) The most consistent parameters turned out to be the order coefficient, the logarithmic concentration coefficient, entropy, the rank golden section, and the rank mean. Their rapid convergence to the limit values makes it possible to perform classification effectively on data of arbitrary size. 3) The proposed approximation model makes it possible to forecast the values of all parameters for any sample size.
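To make the idea of a consistency test concrete, the following is a minimal sketch and not the author's procedure. It computes two of the parameters named in the abstract (entropy and rank mean) from a token list and fits, by ordinary least squares, a hypothetical convergence model y(n) = a + b/n, where a is the extrapolated limit value. The model form, the exact parameter definitions, and the function names `entropy`, `rank_mean`, and `fit_limit` are all assumptions made for illustration; the abstract does not specify the approximation model or its modifications.

```python
import math
from collections import Counter


def entropy(tokens):
    """Shannon entropy (in bits) of the word-frequency distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def rank_mean(tokens):
    """Frequency-weighted mean rank; rank 1 is the most frequent word.

    Assumed definition, chosen for illustration only.
    """
    counts = sorted(Counter(tokens).values(), reverse=True)
    total = len(tokens)
    return sum(rank * c / total for rank, c in enumerate(counts, start=1))


def fit_limit(sizes, values):
    """Ordinary least squares for the assumed model y = a + b / n.

    Returns (a, b); a is the value the parameter approaches as n grows.
    """
    xs = [1.0 / n for n in sizes]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(values) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b
```

In use, one would compute a parameter on nested samples of increasing size and pass the sizes and values to `fit_limit`. Under this assumed model, a small |b| relative to a indicates that the parameter settles near its limit quickly.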

Similar resources

A Novel Face Detection Method Based on Over-complete Incoherent Dictionary Learning

In this paper, face detection problem is considered using the concepts of compressive sensing technique. This technique includes dictionary learning procedure and sparse coding method to represent the structural content of input images. In the proposed method, dictionaries are learned in such a way that the trained models have the least degree of coherence to each other. The novelty of the prop...

A New Dictionary Construction Method in Sparse Representation Techniques for Target Detection in Hyperspectral Imagery

Hyperspectral data in remote sensing, gathered at fine spectral resolution (about 10 nanometers), contain a plethora of spectral bands (roughly 200 bands). Since precious information about the spectral features of target materials can be extracted from these data, they have been used extensively in hyperspectral target detection. One of the problems associated with the detect...

Hidden Jargon: Everyday Words with Meanings Specific to Statistics

If students of statistics or our collaborators from other disciplines do not immediately understand the terms “probit regression” or “kriging”, we are not surprised and are happy to carefully explain these advanced statistical terms and concepts. A different class of words has one or more distinct statistical meanings in addition to their standard English definitions and is ripe for confusion a...

Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy

Automatic indexing is one of the important technologies used for Textual Data Analysis applications. Standard document indexing techniques usually identify the most relevant keywords in the documents. This paper presents an alternative approach that aims at performing document indexing by associating concepts with the document to index instead of extracting keywords out of it. The concepts are ...

Dictionary of Abstract and Concrete Words of the Russian Language: A Methodology for Creation and Application

The paper describes the first stage of a project on creating an electronic dictionary with numerical estimates of the degree of abstractness and concreteness of Russian words. Our approach is to integrate data obtained from several different sources: text corpora, psycholinguistic experiments, published dictionaries, markers of abstractness (certain suffixes) and a translation of a similar dict...

Publication date: 2000