From raw corpus to word lattices: robust pre-parsing processing

Authors

  • Benoît Sagot
  • Pierre Boullier
Abstract

We present a robust, full-featured architecture for preprocessing text before parsing. It converts a raw, noisy corpus into a word lattice that is used as input by a parser. It applies, in sequence, named-entity recognition, tokenization and sentence-boundary detection, lexicon-aware named-entity recognition, spelling correction, and non-deterministic multi-word processing, re-accentuation, and de-/recapitalization. Although our system currently handles French, almost all components are in fact language-independent, and the others can be straightforwardly adapted to almost any inflectional language. The output is a lattice of words that are all present in the lexicon. The system has been applied on a large scale during a French parsing evaluation campaign, showing both extreme efficiency and very good precision and recall.
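To make the notion of a word lattice concrete, here is a minimal Python sketch of the non-deterministic multi-word step only. It is not the authors' implementation: the toy lexicon, the function names `build_lattice` and `paths`, the `max_len` window, and the use of lowercasing as a crude stand-in for decapitalization are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy lexicon (assumption); a real system would use a full French lexicon.
LEXICON = {"il", "mange", "la", "pomme", "de", "terre", "pomme de terre"}


def build_lattice(tokens, lexicon, max_len=3):
    """Return lattice edges (start, end, word) over token positions.

    An edge may span several tokens. A multi-word unit is added next to
    its single-word readings, so the ambiguity is left to the parser.
    Only forms found in the lexicon produce an edge.
    """
    edges = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            # Lowercasing is a crude stand-in for decapitalization (assumption).
            candidate = " ".join(tokens[start:end]).lower()
            if candidate in lexicon:
                edges.append((start, end, candidate))
    return edges


def paths(edges, n):
    """Enumerate every complete path from position 0 to position n."""
    by_start = defaultdict(list)
    for start, end, word in edges:
        by_start[start].append((end, word))

    def walk(pos):
        if pos == n:
            yield []
            return
        for end, word in by_start[pos]:
            for rest in walk(end):
                yield [word] + rest

    return list(walk(0))


tokens = "Il mange la pomme de terre".split()
lattice = build_lattice(tokens, LEXICON)
readings = paths(lattice, len(tokens))
for reading in readings:
    print(reading)
```

In this sketch the lattice keeps both the word-by-word reading and the reading in which "pomme de terre" is a single lexical unit, which is the sense in which the multi-word processing is non-deterministic.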


Similar resources

From Raw Corpus to Word Lattices: Robust Pre-parsing Processing with SXPipe

We present a robust, full-featured architecture for preprocessing text before parsing. This architecture, called SXPipe, converts raw noisy corpora into word lattices, one per sentence, that can be used as input by a parser. It applies, in sequence, named-entity recognition, tokenization and sentence-boundary detection, lexicon-aware named-entity recognition, spelling correction, and non-determinis...


The Impact of Morphology on Dependency Parsing of Persian

Data-driven systems can easily be adapted to different languages and domains. Applying this trend to dependency parsing has led to the introduction of data-driven approaches. The only prerequisite for such approaches is the availability of appropriate corpora containing sentences and their associated dependency trees. Despite obtaining highly accurate results for the dependency parsing task in the English langu...


We Thank the ACL/DCI (Data Collection Initiative), the Collins Publishing Company, the Wall Street Journal, for Providing Invaluable On-line Data, and the Treebank Project for Providing Tagged Corpus for Reference. Pre-processing: the Big Picture Re-processing up Against?

Thematic analysis is best manifested by contrasting collocations such as “shipping pacemakers” vs. “shipping departments”. While in the first pair, the pacemakers are being shipped, in the second one, the departments are probably engaged in some shipping activity, but are not being shipped. Text pre-processors, intended to inject corpus-based intuition into the parsing process, must adequately...


Enriching ASR Lattices with POS Tags for Dependency Parsing

Parsing speech requires a richer representation than 1-best or n-best hypotheses, e.g. lattices. Moreover, previous work shows that part-of-speech (POS) tags are a valuable resource for parsing. In this paper, we therefore explore a joint modeling approach of automatic speech recognition (ASR) and POS tagging to enrich ASR word lattices. To that end, we manipulate the ASR process from the prono...


From Czech Morphology through Partial Parsing to Disambiguation

This paper deals with a complex system of processing raw Czech texts. Several modules were implemented which perform different levels of processing. These modules can easily be incorporated into many other linguistic applications and some of them are already exploited in this way. The first level of processing raw texts represents a reliable morphological analysis – we give a survey of the effe...



Publication date: 2005