Media monitoring and information extraction for the highly inflected agglutinative language Hungarian

Authors

  • Júlia Pajzs
  • Ralf Steinberger
  • Maud Ehrmann
  • Mohamed Ebrahim
  • Leonida Della Rocca
  • Stefano Bucci
  • Eszter Simon
  • Tamás Váradi
Abstract

The Europe Media Monitor (EMM) is a fully automatic system that analyses written online news by gathering articles in over 70 languages and applying text analysis software to 21 of them at present, without using linguistic tools such as parsers, part-of-speech taggers or morphological analysers. In this paper, we describe the effort of adding Hungarian text mining tools to EMM for news gathering; document categorisation; named entity recognition and classification for persons, organisations and locations; name lemmatisation; quotation recognition; and cross-lingual linking of related news clusters. The major challenge of dealing with the Hungarian language is its high degree of inflection and agglutination. We present several experiments in which we apply linguistically light-weight methods to deal with inflection, and we propose a method to overcome the challenges. We also present detailed frequency lists of Hungarian person and location name suffixes, as found in real-life news texts. These empirical data can be used to draw further conclusions and to improve existing Named Entity Recognition software. Within EMM, the solutions described here will also be applied to other morphologically complex languages such as those of the Slavic language family. The media monitoring and analysis system EMM is freely accessible online via the web page http://emm.newsbrief.eu/overview.html.
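As a rough illustration of what a linguistically light-weight treatment of inflected names can look like, the sketch below strips a known case suffix from a token and checks whether the remainder is an already known name. It is not the method described in the paper: the function name `lemmatise_name`, the `known_names` set and the short `SUFFIXES` list are assumptions made for this example, and the suffix list is only a small subset of common Hungarian case endings, not the frequency lists the authors report.

```python
# Hypothetical sketch: suffix stripping against a list of known names.
# SUFFIXES is a small illustrative subset of Hungarian case endings,
# not the empirical suffix frequency list presented in the paper.
SUFFIXES = [
    "ban", "ben", "nak", "nek", "val", "vel", "ból", "ből",
    "ról", "ről", "tól", "től", "hoz", "hez", "höz", "nál", "nél",
    "ért", "ra", "re", "ba", "be", "on", "en", "ön", "ig", "t",
]


def lemmatise_name(token, known_names):
    """Return the known base form of an inflected name, or None.

    The token is accepted as-is if it is already a known name.
    Otherwise each suffix is tried, longest first, and the stem is
    returned only if it appears in known_names.
    """
    if token in known_names:
        return token
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix):
            stem = token[: -len(suffix)]
            if stem in known_names:
                return stem
    return None


if __name__ == "__main__":
    names = {"Budapest", "Orbán", "Párizs"}
    for form in ["Budapesten", "Budapestről", "Orbánnak", "Párizsban", "London"]:
        print(form, "->", lemmatise_name(form, names))
```

Requiring the stem to be a known name keeps the sketch from stripping endings that merely look like suffixes (the final "on" of "London", for instance). It does not handle stem changes or consonant assimilation, such as the instrumental form "Orbánnal", which a real system would need to address.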

Similar resources

A XML-Based Term Extraction Tool for Basque

This project combines linguistic and statistical information to develop a term extraction tool for Basque. Because Basque is an agglutinative and highly inflected language, the treatment of morphosyntactic information is vital. In addition, because the language was unified late, texts show greater term dispersion than in a highly normalized language. The result is a semiautomatic ...

The Production of Nominal and Verbal Inflection in an Agglutinative Language: Evidence from Hungarian

The contrast between regular and irregular inflectional morphology has been useful in investigating the functional and neural architecture of language. However, most studies have examined the regular/irregular distinction in non-agglutinative Indo-European languages (primarily English) with relatively simple morphology. Additionally, the majority of research has focused on verbal rather than no...

Multilingual Media Monitoring and Text Analysis - Challenges for Highly Inflected Languages

We present the highly multilingual news analysis system Europe Media Monitor (EMM), which gathers an average of 175,000 online news articles per day in tens of languages, categorises the news items and extracts named entities and various other information from them. We also give an overview of EMM’s text mining tool set, focusing on the issue of how the software deals with highly inflected lang...

Morphosyntactic structure of terms in Basque for automatic terminology extraction

This paper describes the morphosyntactic patterns of technical terms in Basque and presents an architecture for a term-extracting tool. As Basque is a highly inflected agglutinative language, part-of-speech information is not enough to define term patterns. The use of morphological and syntactic information is essential to considerably reduce the number of structures. For example, a noun, an adv...

Multi-granularity Word Alignment and Decoding for Agglutinative Language Translation

The lexical sparsity problem is much more serious for agglutinative language translation due to the multitude of inflected variants of lexicons. In this paper, we propose a novel optimization strategy to ease sparseness by multi-granularity word alignment and translation for agglutinative languages. Multiple alignment results are combined to catch the complementary information for alignments, and rules ...

Publication date: 2014