Training and Evaluation of POS Taggers on the French MULTITAG Corpus
نویسندگان
چکیده
The explicit introduction of morphosyntactic information into statistical machine translation approaches is receiving an important focus of attention. The current freely available Part of Speech (POS) taggers for the French language are based on a limited tagset which does not account for some flectional particularities. Moreover, there is a lack of a unified framework of training and evaluation for these kind of linguistic resources. Therefore in this paper, three standard POS taggers (Treetagger, Brill’s tagger and the standard HMM POS tagger) are trained and evaluated in the same conditions on the French MULTITAG corpus. This POS-tagged corpus provides a tagset richer than the usual ones, including gender and number distinctions, for example. Experimental results show significant differences of performance between the taggers. According to the tagging accuracy estimated with a tagset of 300 items, taggers may be ranked as follows: Treetagger (95.7% ), Brill’s tagger (94.6%), HMM tagger (93.4%). Examples of translation outputs illustrate how considering gender and number distinctions in the POS tagset can be relevant.
منابع مشابه
Part of Speech Tagging for New Words (Étiquetage morpho-syntaxique pour des mots nouveaux) [in French]
Part-of-speech (POS) taggers are more or less robust with respect to the labeling of unknown words not found in the training corpus. It is important to know precisely how these tools perfom when we target part-of-speech tagging for formal neologisms. Indeed, grammatical category is an important criterion for both their identification and documentation. We present an evaluation and comparison of...
متن کاملDétection et correction automatique d'erreurs d'annotation morpho-syntaxique du French TreeBank (Detecting and Correcting POS Annotation in the French TreeBank) [in French]
Detecting and correcting POS annotation in the French TreeBank The quality of the Part-Of-Speech (POS) annotation in a corpus has a large impact on training and evaluating POS taggers. In this paper, we present a series of experiments that we have conducted on automatically detecting and correcting annotation errors in the French TreeBank. Two methods are used. The first simply relies on identi...
متن کاملTCOF-POS : un corpus libre de français parlé annoté en morphosyntaxe (TCOF-POS : A Freely Available POS-Tagged Corpus of Spoken French) [in French]
TCOF-POS : A Freely Available POS-Tagged Corpus of Spoken French This article details the creation of TCOF-POS, the first freely available corpus of spontaneous spoken French. We present here the methodology that was followed in order to obtain the best possible quality in the final resource. This corpus already is freely available and can be used as a training/validation corpus for NLP tools, ...
متن کاملThe importance of high-quality input for WSD: an application-oriented comparison of part-of-speech taggers
In this paper, we present an applicationoriented evaluation of three Part-ofSpeech (PoS) taggers in a word sense disambiguation (WSD) system. Following the intuition that high quality input is likely to influence the final results of a complex system, we test whether the more accurate taggers also produce better results when integrated into the WSD system. For this purpose, a stand-alone evalua...
متن کاملWeb-Based Bengali News Corpus for Lexicon Development and POS Tagging
Lexicon development and Part of Speech (POS) tagging are very important for almost all Natural Language Processing (NLP) applications. The rapid development of these resources and tools using machine learning techniques for less computerized languages requires appropriately tagged corpus. We have used a Bengali news corpus, developed from the web archive of a widely read Bengali newspaper. The ...
متن کامل