Clustering Abstracts Instead of Full Texts

نویسندگان

  • Pavel Makagonov
  • Mikhail Alexandrov
  • Alexander F. Gelbukh
چکیده

Accessibility of digital libraries and other web-based repositories has caused the illusion of accessibility of the full texts of scientific papers. However, in the majority of cases such an access (at least free access) is limited only to abstracts having no more then 50-100 words. Traditional keyword-based approach for clustering this type of documents gives unstable and imprecise results. We show that they can be easy improved with more adequate keyword selection and document similarity evaluation. We suggest simple procedures for this. We evaluate our approach on the data from two international conferences. One of our conclusions is the suggestion for the digital libraries and other repositories to provide document images of full texts of the papers along with their abstracts for open access via Internet.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

An Approach to Clustering

Free access to full-text scientific papers in major digital libraries and other web repositories is limited to only their abstracts consisting of no more than several dozens of words. Current keyword-based techniques allow for clustering such type of short texts only when the data set is multi-category, e.g., some documents are devoted to sport, others to medicine, others to politics, etc. Howe...

متن کامل

Full Text Clustering and Relationship Network Analysis of Biomedical Publications

Rapid developments in the biomedical sciences have increased the demand for automatic clustering of biomedical publications. In contrast to current approaches to text clustering, which focus exclusively on the contents of abstracts, a novel method is proposed for clustering and analysis of complete biomedical article texts. To reduce dimensionality, Cosine Coefficient is used on a sub-space of ...

متن کامل

Functional gene clustering via gene annotation sentences, MeSH and GO keywords from biomedical literature

Gene function annotation remains a key challenge in modern biology. This is especially true for high-throughput techniques such as gene expression experiments. Vital information about genes is available electronically from biomedical literature in the form of full texts and abstracts. In addition, various publicly available databases (such as GenBank, Gene Ontology and Entrez) provide access to...

متن کامل

Sense Cluster Based Categorization and Clustering of Abstracts

This paper focuses on the use of sense clusters for classification and clustering of very short texts such as conference abstracts. Common keyword-based techniques are effective for very short documents only when the data pertain to different domains. In the case of conference abstracts, all the documents are from a narrow domain (i.e., share a similar terminology), that increases the difficult...

متن کامل

Lexical Cohesion in English and Persian Abstracts

This study compares and contrasts lexical cohesion in English and Persian abstracts of Iranian medical students’ theses to appreciate textualization processes in the two languages. For this purpose, one hundred English and Persian abstracts were selected randomly and analyzed based on Seddigh and Yarmohamadi’s (1996) lexical cohesion framework, a version of Halliday and Hasan’s (1976) and Halli...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2004