Ontology-driven Conceptual Document Classification

Authors

  • Gordana Pavlovic-Lazetic
  • Jelena Graovac
Abstract

Document classification based on the lexical-semantic network WordNet is presented. Two types of document classification in Serbian have been experimented with: classification based on chosen concepts from the Serbian WordNet (SWN), and proper names-based classification. Conceptual document classification criteria are constructed from hierarchies rooted in a set of chosen concepts (first case) or from hierarchies rooted in some of the proper names' hypernyms (second case). A classifier of the first type is trained and then tested on the indexed and already classified Ebart corpus of Serbian newspapers (476,917 articles). Precision, recall and F-measure show that this type of classification is promising, although incomplete, mainly because SWN itself is incomplete. In the context of proper names-based classification, a proper names ontology based on the SWN is presented in the paper. A distance-based similarity measure is defined, based on Euclidean and Manhattan distances. Classification of a subset of the Contemporary Serbian Language Corpus is presented.
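
As an illustration of the quantities named in the abstract, the following minimal Python sketch computes Euclidean and Manhattan distances between two concept-frequency vectors, and precision, recall and F-measure for a set of predicted labels. The abstract does not give the paper's actual similarity formula, so the mapping 1 / (1 + distance) and all function names here are assumptions made for illustration only.

```python
import math


def euclidean(u, v):
    """Euclidean (L2) distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length numeric vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def similarity(u, v, distance=euclidean):
    """Hypothetical distance-to-similarity mapping (not taken from the paper).

    Returns 1.0 for identical vectors and approaches 0 as distance grows.
    """
    return 1.0 / (1.0 + distance(u, v))


def precision_recall_f(predicted, actual):
    """Precision, recall and balanced F-measure for two sets of items."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    # Made-up concept-frequency vectors for a document and a category profile.
    doc = [3, 0, 1]
    category = [1, 0, 4]
    print(euclidean(doc, category), manhattan(doc, category))
    print(similarity(doc, category, distance=manhattan))
    print(precision_recall_f({"d1", "d2", "d3"}, {"d2", "d3", "d4"}))
```

Passing the distance function as a parameter keeps the two metrics interchangeable, which is one simple way to compare how each affects the resulting ranking of categories.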

Similar resources

Automatic Workflow Generation and Modification by Enterprise Ontologies and Documents

This article presents a novel method and development paradigm that proposes a general template for an enterprise information structure and allows for the automatic generation and modification of enterprise workflows. This dynamically integrated workflow development approach utilises a conceptual ontology of domain processes and tasks, enterprise charts, and enterprise entities. It also suggests...

Text Categorization and Classification in Terms of Multi-Attribute Concepts for Enriching Existing Ontologies

In this paper, we propose a novel comprehensive architecture for recognizing different kinds of documents and then using an appropriate component to extract document information and feed it to an existing ontology. By ontology, we mean a multi-attribute conceptual graph with different types of relations. This novel architecture includes two types of text processors: Information Ext...

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing the semantic concepts of text motivated researchers to use...

Automatic ontology extraction for document classification

The amount of information in the world is enormous. Millions of documents in electronic libraries, thousands of them on each personal computer waiting for the expert to organize this information, to be assigned to appropriate categories. Automatic classification can help. However, synonymy, polysemy and word usage patterns problems usually arise. Modern knowledge representation mechanisms such ...

Publication date: 2010