Mining Linguistically Interpreted Texts

Authors

  • Cassiana Fagundes Da Silva
  • Renata Vieira
  • Fernando Santos Osorio
  • Paulo Quaresma
Abstract

This paper proposes and evaluates the use of linguistic information in the pre-processing phase of text mining tasks. We present several experiments comparing our proposal, which selects terms based on linguistic knowledge, with the usual techniques applied in the field. The results show that part-of-speech information is useful in the pre-processing phase of text categorization and clustering, as an alternative to stop-word removal and stemming.
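As a rough illustration of the idea in the abstract, the sketch below contrasts term selection by part-of-speech tag with a plain stop-word filter. It is not the authors' implementation: the pre-tagged input, the tag names (`NOUN`, `VERB`, and so on), the tiny stop list, and the choice to keep only nouns are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical pre-tagged document: (token, part-of-speech tag) pairs.
# The sentence and the tag names are illustrative assumptions only.
TAGGED_DOC = [
    ("the", "DET"), ("results", "NOUN"), ("show", "VERB"), ("that", "CONJ"),
    ("the", "DET"), ("tagger", "NOUN"), ("improves", "VERB"),
    ("the", "DET"), ("results", "NOUN"),
]

# Tiny illustrative stop list; a real one would be much longer.
STOP_WORDS = {"the", "that"}


def select_by_pos(tagged_tokens, keep_tags):
    """Keep only the tokens whose tag belongs to keep_tags."""
    return [tok.lower() for tok, tag in tagged_tokens if tag in keep_tags]


def select_by_stop_list(tagged_tokens, stop_words):
    """Baseline: drop tokens found in the stop list and ignore the tags."""
    return [tok.lower() for tok, _ in tagged_tokens if tok.lower() not in stop_words]


def term_frequencies(terms):
    """Count how often each selected term occurs in the document."""
    return Counter(terms)


# Assumed configuration: keep nouns only.
pos_terms = select_by_pos(TAGGED_DOC, {"NOUN"})
baseline_terms = select_by_stop_list(TAGGED_DOC, STOP_WORDS)

print(term_frequencies(pos_terms))
print(term_frequencies(baseline_terms))
```

Either term list could then feed a categorization or clustering step. Which grammatical categories to keep is a design choice that the abstract does not specify.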

Similar resources

Building a Linguistically Interpreted Corpus of Bulgarian: the BulTreeBank

In the field of Human Language Technology (HLT), the existence of linguistically interpreted real-world texts provides the license necessary for a given language to enter the area of high-tech applications. The significance of BulTreeBank is the granting of an HLT license to a “less processed” language like Bulgarian which, until recently, has been formally modelled and processed mainly on the ...

Pattern Mining with Natural Language Processing: An Exploratory Approach

Pattern mining derives from the need to discover hidden knowledge in very large amounts of data, regardless of the form in which it is presented. Natural Language Processing (NLP), in turn, arose from the human need to be understood by computers. In this paper we present an exploratory approach that aims at bringing together the best of both worlds. Our goal is to disco...

Multilayer model for Arabic text compression

This article describes a multilayer model-based approach for text compression. It uses linguistic information to develop a multilayer decomposition model of the text in order to achieve better compression. This new approach is illustrated for the case of the Arabic language, where the majority of words are generated according to the Semitic root-and-pattern scheme. Text is split into three ling...

Back to the Roots of Genres: Text Classification by Language Function

The term “genre” covers different aspects of both texts and documents, and it has led to many classification schemes. This makes different approaches to genre identification incomparable and the task itself unclear. We introduce the linguistically motivated text classification task language function analysis, LFA, which focuses on one well-defined aspect of genres. The aim of LFA is to determin...

Bridging the Gap between Domain-Oriented and Linguistically-Oriented Semantics

This paper compares domain-oriented and linguistically-oriented semantics, based on the GENIA event corpus and FrameNet. While the domain-oriented semantic structures are direct targets of Text Mining (TM), their extraction from text is not straightforward due to the diversity of linguistic expressions. The extraction of linguistically-oriented semantics is more straightforward, and has been stud...



Publication date: 2004