From raw corpus to word lattices: robust pre-parsing processing
Authors
Abstract
We present a robust, full-featured architecture for preprocessing text before parsing. It converts a raw, noisy corpus into a word lattice that is used as input by a parser. It applies, in sequence, named-entity recognition, tokenization and sentence boundary detection, lexicon-aware named-entity recognition, spelling correction, and non-deterministic multi-word processing, re-accentuation, and de-/recapitalization. Though our system currently deals with French, almost all components are in fact language-independent, and the others can be straightforwardly adapted to almost any inflectional language. The output is a lattice of words that are present in the lexicon. The system has been applied on a large scale during a French parsing evaluation campaign, showing both extreme efficiency and very good precision and recall.
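To make the notion of a word lattice concrete, here is a minimal Python sketch of non-deterministic multi-word processing. It is not the system's actual implementation: the toy lexicon, the `MAX_MWU_LEN` bound, and the functions `build_lattice` and `paths` are all hypothetical names introduced for illustration. The idea shown is only that a multi-word unit and its component words are kept as alternative paths, and that every edge carries a form present in the lexicon.

```python
from collections import defaultdict

# Hypothetical toy lexicon: single words plus one multi-word unit.
LEXICON = {"la", "pomme", "de", "terre", "pomme de terre"}
MAX_MWU_LEN = 3  # assumed upper bound on multi-word unit length, in tokens


def build_lattice(tokens, lexicon=LEXICON, max_len=MAX_MWU_LEN):
    """Return lattice edges as {start: [(end, form), ...]} over token positions.

    Every span of 1..max_len tokens whose joined form is in the lexicon
    becomes an edge, so a multi-word unit and its component words coexist
    as alternative paths instead of one reading being chosen early.
    """
    edges = defaultdict(list)
    n = len(tokens)
    for start in range(n):
        for end in range(start + 1, min(start + max_len, n) + 1):
            form = " ".join(tokens[start:end])
            if form in lexicon:
                edges[start].append((end, form))
    return dict(edges)


def paths(edges, start, goal):
    """Enumerate every sequence of lexicon forms leading from start to goal."""
    if start == goal:
        yield []
        return
    for end, form in edges.get(start, []):
        for rest in paths(edges, end, goal):
            yield [form] + rest


tokens = "la pomme de terre".split()
lattice = build_lattice(tokens)
all_paths = list(paths(lattice, 0, len(tokens)))
```

With this toy input, `all_paths` holds two readings: the four single words, and `la` followed by the multi-word unit `pomme de terre`. A parser consuming the lattice can then choose between them using syntactic context.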
Related resources
From Raw Corpus to Word Lattices: Robust Pre-parsing Processing with SXPipe
We present a robust, full-featured architecture for preprocessing text before parsing. This architecture, called SXPipe, converts raw noisy corpora into word lattices, one per sentence, that can be used as input by a parser. It includes sequentially named-entity recognition, tokenization and sentence boundary detection, lexicon-aware named-entity recognition, spelling correction, and non-determinis...
The Effect of Morphemes on Dependency Parsing of Persian
Data-driven systems can be adapted to different languages and domains easily. Following this trend, data-driven approaches were introduced into dependency parsing. The only prerequisite for data-driven approaches is the existence of appropriate corpora that contain sentences and their associated dependency trees. Despite obtaining highly accurate results for the dependency parsing task in English langu...
3we Thank a Cl/dc1 (data Collection Initiative), the Collins Publishing Company, the Wall Street Journal, for Providing Invaluable On-line Data, and the Treebank Project for Providing Tagged Corpus for Reference. Pre-processing: the Big Picture Re-processing up Against?
Thematic analysis is best manifested by contrasting collocations such as “shipping pacemakers” vs. “shipping departments”. While in the first pair, the pacemakers are being shipped, in the second one, the departments are probably engaged in some shipping activity, but are not being shipped. Text pre-processors, intended to inject corpus-based intuition into the parsing process, must adequately...
Enriching ASR Lattices with POS Tags for Dependency Parsing
Parsing speech requires a richer representation than 1-best or n-best hypotheses, e.g. lattices. Moreover, previous work shows that part-of-speech (POS) tags are a valuable resource for parsing. In this paper, we therefore explore a joint modeling approach of automatic speech recognition (ASR) and POS tagging to enrich ASR word lattices. To that end, we manipulate the ASR process from the prono...
From Czech Morphology through Partial Parsing to Disambiguation
This paper deals with a complex system of processing raw Czech texts. Several modules were implemented which perform different levels of processing. These modules can easily be incorporated into many other linguistic applications and some of them are already exploited in this way. The first level of processing raw texts represents a reliable morphological analysis – we give a survey of the effe...