Keyword Extraction for Text Characterization
Abstract
Keywords are a valuable means of characterizing texts. To extract keywords, we propose an efficient and robust, language- and domain-independent approach based on small word parts (quadgrams). The basic algorithm can be improved by re-examining and re-ranking keywords using edit distance (i.e. Levenshtein distance) and an algorithm based on the relativistic addition of velocities (here: weights). For evaluation, we compare our approach to frequency-based keyword extraction on an exemplary text collection of 45,000 intranet documents in German and English.

The analysis of huge text collections usually aims at finding relevant texts (text retrieval with search engines) or text groups (supervised grouping such as categorization or classification, unsupervised grouping such as clustering). Thus, all these text mining tasks result in retrieved texts or text groups. For an overview of text mining tasks see [Ch00]. Scanning all retrieved items, however, is tedious for any information-seeking user. To facilitate this task, most text mining systems characterize their resulting texts with various kinds of annotations: they link texts to external topic schemes or find relevant concepts in the texts themselves. Both the items of external topic schemes and the text-based concepts can be presented as keywords, which are a helpful characterization of textual content. [Ho02] gives an extensive survey of summarization. There, topic identification, the simplest type of summary, also subsumes keyword extraction.
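The three building blocks named in the abstract (quadgrams, Levenshtein distance, and the relativistic addition of weights) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the tokenization, and the choice of c = 1 in the addition formula are assumptions made here for illustration only.

```python
def quadgrams(text: str, n: int = 4) -> list[str]:
    """Return the overlapping character n-grams (n = 4: quadgrams) of each word.

    Assumption: words are split on whitespace and stripped of simple
    punctuation; the paper's actual tokenization is not given in the abstract.
    """
    grams = []
    for word in text.lower().split():
        word = word.strip(".,;:!?()[]\"'")
        # Words shorter than n yield an empty range and contribute nothing.
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def relativistic_add(w1: float, w2: float) -> float:
    """Combine two weights like velocities, with the limit speed set to 1.

    Assumption: weights lie in [0, 1); the result then also stays below 1.
    """
    return (w1 + w2) / (1.0 + w1 * w2)
```

For example, `quadgrams("keyword")` yields `keyw`, `eywo`, `ywor`, and `word`, and `levenshtein("kitten", "sitting")` is 3. The appeal of the relativistic formula for re-ranking is that combined weights grow with each contribution but never exceed the upper bound, unlike a plain sum.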
Similar resources
Intelligent Text Processing Techniques for Textual-Profile Gene Characterization
We present a suite of Machine Learning and knowledge-based components for textual-profile based gene prioritization. Most genetic diseases are characterized by many potential candidate genes that can cause the disease. Gene expression analysis typically produces a large number of co-expressed genes that could be potentially responsible for a given disease. Extracting prior knowledge from text-b...
Automatic Keyword Extraction for News Finder
Newspapers are one of the most challenging domains for information retrieval systems: new articles appear every day, written in different languages and with multimedia content, and news repositories may be updated in a matter of hours, so information extraction is crucial to the metadata contents of the news. Further approaches of “smart retrieval” have to cope with multimedia and multilingual fe...
Patent Keyword Extraction Algorithm Based on Distributed Representation for Patent Classification
Many text mining tasks such as text retrieval, text summarization, and text comparisons depend on the extraction of representative keywords from the main text. Most existing keyword extraction algorithms are based on discrete bag-of-words type of word representation of the text. In this paper, we propose a patent keyword extraction algorithm (PKEA) based on the distributed Skip-gram model for p...
Automatic Keyword Extraction for Text Summarization: A Survey
In recent times, data has been growing rapidly in every domain, such as news, social media, banking, and education. Because of this excess of data, there is a need for an automatic summarizer capable of summarizing the data, especially textual data, in the original document without losing any critical purposes. Text summarization has emerged as an important research area in the recent past. In this re...
Analysis of Statistical Keyword Extraction Methods for Incremental Clustering
Incremental clustering is a very useful approach to organize dynamic text collections. Due to the time/space restrictions for incremental clustering, the textual documents must be preprocessed to maintain only their most important information. Statistical keyword extraction methods from single documents are useful in this scenario. However, different statistical methods have different assumptio...