Keyword Extraction for Text Characterization

Authors

  • Ingrid Renz
  • Andrea Ficzay
  • Holger Hitzler
Abstract

Keywords are a valuable means of characterizing texts. To extract keywords, we propose an efficient and robust, language- and domain-independent approach based on small word parts (quadgrams). The basic algorithm can be improved by re-examining and re-ranking keywords using edit distance (i.e., Levenshtein distance) and an algorithm based on the relativistic addition of velocities (here: weights). For evaluation, we compare our approach to frequency-based keyword extraction on an example text collection of 45,000 intranet documents in German and English.

The analysis of huge text collections usually aims at finding relevant texts (text retrieval with search engines) or text groups (supervised grouping such as categorization or classification, unsupervised grouping such as clustering). Thus, all these text mining tasks result in retrieved texts or text groups. For an overview of text mining tasks, see [Ch00]. But scanning all retrieved items is a tedious task for any information-seeking user. To facilitate this task, most text mining systems characterize their resulting texts with various kinds of annotations. They link texts to external topic schemes or derive relevant concepts from the texts themselves. Items from external topic schemes as well as text-based concepts can be presented as keywords, which are a helpful characterization of textual content. [Ho02] gives an extensive survey of summarization. In it, topic identification, as the simplest type of summary, also subsumes keyword extraction.
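
The abstract names three building blocks without giving details: quadgrams, Levenshtein distance, and a weight combination modeled on the relativistic addition of velocities. The following Python sketch illustrates each one in isolation. It is not the authors' algorithm: the function names, the `max_dist` threshold, and the merging strategy in `merge_similar` are assumptions made for illustration, and weights are assumed to lie in [0, 1] with the speed of light normalized to 1.

```python
def quadgrams(word: str) -> list[str]:
    """Return all overlapping 4-character substrings of a word."""
    w = word.lower()
    return [w[i:i + 4] for i in range(len(w) - 3)]


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def combine_weights(u: float, v: float) -> float:
    """Relativistic velocity addition with c = 1: (u + v) / (1 + u * v).

    For u, v in [0, 1] the result also stays in [0, 1].
    """
    return (u + v) / (1.0 + u * v)


def merge_similar(weights: dict[str, float], max_dist: int = 2) -> dict[str, float]:
    """Hypothetical re-ranking step: fold near-duplicate keywords together.

    Keywords are visited in order of decreasing weight. A keyword within
    max_dist edits of an already kept keyword adds its weight to that
    keyword via combine_weights; otherwise it is kept as a new entry.
    """
    merged: dict[str, float] = {}
    for word, w in sorted(weights.items(), key=lambda kv: -kv[1]):
        for rep in merged:
            if levenshtein(word, rep) <= max_dist:
                merged[rep] = combine_weights(merged[rep], w)
                break
        else:
            merged[word] = w
    return merged
```

For example, `quadgrams("keyword")` yields `["keyw", "eywo", "ywor", "word"]`, and `combine_weights(0.5, 0.5)` gives 0.8 rather than 1.0. This bounded behavior is presumably what makes the velocity-addition formula attractive for accumulating evidence: two pieces of evidence reinforce each other, but the combined weight never exceeds 1.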

Similar resources

Intelligent Text Processing Techniques for Textual-Profile Gene Characterization

We present a suite of Machine Learning and knowledge-based components for textual-profile based gene prioritization. Most genetic diseases are characterized by many potential candidate genes that can cause the disease. Gene expression analysis typically produces a large number of co-expressed genes that could be potentially responsible for a given disease. Extracting prior knowledge from text-b...

Automatic Keyword Extraction for News Finder

Newspapers are one of the most challenging domains for information retrieval systems: new articles appear every day, written in different languages and with multimedia content, and the news repositories may be updated in a matter of hours, so information extraction is crucial to the metadata contents of the news. Further approaches of “smart retrieval” have to cope with multimedia and multilingual fe...

Patent Keyword Extraction Algorithm Based on Distributed Representation for Patent Classification

Many text mining tasks such as text retrieval, text summarization, and text comparisons depend on the extraction of representative keywords from the main text. Most existing keyword extraction algorithms are based on discrete bag-of-words type of word representation of the text. In this paper, we propose a patent keyword extraction algorithm (PKEA) based on the distributed Skip-gram model for p...

Automatic Keyword Extraction for Text Summarization: A Survey

In recent times, data has been growing rapidly in every domain, such as news, social media, banking, education, etc. Due to this excess of data, there is a need for an automatic summarizer that is capable of summarizing the data, especially textual data, in the original document without losing any critical information. Text summarization has emerged as an important research area in the recent past. In this re...

Analysis of Statistical Keyword Extraction Methods for Incremental Clustering

Incremental clustering is a very useful approach to organize dynamic text collections. Due to the time/space restrictions for incremental clustering, the textual documents must be preprocessed to maintain only their most important information. Statistical keyword extraction methods from single documents are useful in this scenario. However, different statistical methods have different assumptio...

Publication date: 2003