Intégration de ressources lexicales riches dans un analyseur syntaxique probabiliste. (Integration of lexical resources in a probabilistic parser)

Author

  • Anthony Sigogne
Abstract

This thesis focuses on integrating French lexical and syntactic resources into two fundamental tasks of Natural Language Processing (NLP): probabilistic part-of-speech tagging and probabilistic parsing. For French, a large amount of lexical and syntactic data is available, created either by automatic processes or by linguists. In addition, a number of experiments have shown the value of using such resources in tagging and parsing, since they can significantly improve system performance. In this thesis, we use these resources to address two problems, described briefly below: data sparseness and automatic segmentation of texts.

Thanks to increasingly sophisticated parsing algorithms, parsing accuracy is improving for many languages, including French. However, several problems are inherent in the mathematical formalisms that statistically model the task (grammars, discriminative models, ...). Data sparseness is one of them, and is mainly caused by the small size of the annotated corpora available for the language. It is the difficulty of estimating the probability of syntactic phenomena that appear in the texts to be analyzed but are rare or absent in the corpus used to train parsers. Moreover, sparseness has been shown to be partly a lexical problem: the richer the morphology of a language, the sparser the lexicons built from a treebank for that language. Our first problem is therefore to mitigate the negative impact of lexical data sparseness on parsing performance. To this end, we investigated a method called word clustering, which consists in grouping the words of the corpus and of the texts into clusters. These clusters reduce the number of unknown words, and therefore the number of rare or unknown lexicon-related syntactic phenomena, in the texts to be analyzed.
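
The effect of word clustering on unknown words can be illustrated with a minimal sketch. It is not the thesis's method: the cluster labels, the toy lexicon `TOY_LEXICON_CLUSTERS`, and the helper names are all invented for illustration, and a real system would derive clusters from French syntactic lexicons.

```python
# Hypothetical toy lexicon: maps a word form to a cluster label built from
# syntactic information (the labels below are invented for illustration).
TOY_LEXICON_CLUSTERS = {
    "mange": "V_transitive",
    "dévore": "V_transitive",
    "dort": "V_intransitive",
    "ronfle": "V_intransitive",
    "chat": "N_common",
    "chien": "N_common",
}

def to_cluster(word, clusters):
    """Replace a word by its cluster label; keep the word if it has none."""
    return clusters.get(word, word)

def unknown_rate(train_tokens, test_tokens):
    """Share of test tokens that never occur in the training tokens."""
    seen = set(train_tokens)
    unknown = sum(1 for tok in test_tokens if tok not in seen)
    return unknown / len(test_tokens)

# Toy "training corpus" and "text to be analyzed".
train = ["le", "chat", "mange", "le", "chien", "dort"]
test = ["le", "chien", "dévore", "le", "chat", "ronfle"]

# Unknown-word rate on raw word forms.
raw_rate = unknown_rate(train, test)

# Unknown-word rate after mapping both sides to cluster labels.
clustered_rate = unknown_rate(
    [to_cluster(w, TOY_LEXICON_CLUSTERS) for w in train],
    [to_cluster(w, TOY_LEXICON_CLUSTERS) for w in test],
)

print(raw_rate, clustered_rate)
```

In this toy setting, "dévore" and "ronfle" are unseen as word forms but fall into clusters that were observed in training, which is the sparseness reduction the abstract describes.
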
Our goal is to propose word clustering methods based on syntactic information from French lexicons, and to observe their impact on parser accuracy.

Furthermore, most evaluations of probabilistic tagging and parsing have been performed with a perfect segmentation of the text, identical to that of the evaluated corpus. In real applications, however, the segmentation of a text is rarely available, and automatic segmentation tools fall short of producing a high-quality segmentation because of the many multi-word units present (compound words, named entities, ...). In this thesis, we focus on continuous multi-word units, called compound words, which form lexical units to which a part-of-speech tag can be assigned. The task of finding compound words can be seen as text segmentation. Our second problem is therefore the automatic segmentation of French texts and its impact on the performance of automatic processes. To address it, we adopted an approach that couples, in a single probabilistic model, the recognition of compound words with another task, in our case parsing or tagging. Compound words are thus recognized within the probabilistic process rather than in a preliminary phase. Our goal is to propose innovative strategies for integrating compound-word resources into processes that combine probabilistic tagging or parsing with text segmentation.
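
To make the link between compound recognition and segmentation concrete, the sketch below enumerates the candidate segmentations that a compound lexicon licenses for a sentence. It is an illustration under stated assumptions, not the thesis's model: the lexicon `TOY_COMPOUNDS` and the function `segmentations` are hypothetical, and the probabilistic scoring that a joint tagger or parser would use to choose among candidates is deliberately left out.

```python
# Hypothetical toy lexicon of continuous compound words (tuples of tokens).
TOY_COMPOUNDS = {
    ("pomme", "de", "terre"),
    ("bien", "que"),
}

def segmentations(tokens, compounds, max_len=4):
    """Yield every segmentation of tokens into simple words and lexicon compounds."""
    if not tokens:
        yield []
        return
    # Option 1: the first token stands alone as a simple word.
    for rest in segmentations(tokens[1:], compounds, max_len):
        yield [tokens[0]] + rest
    # Option 2: the first n tokens form a compound listed in the lexicon.
    for n in range(2, min(max_len, len(tokens)) + 1):
        if tuple(tokens[:n]) in compounds:
            for rest in segmentations(tokens[n:], compounds, max_len):
                yield ["_".join(tokens[:n])] + rest

sentence = ["il", "mange", "une", "pomme", "de", "terre"]
candidates = list(segmentations(sentence, TOY_COMPOUNDS))

for candidate in candidates:
    print(candidate)
```

A pipeline that segments in a preliminary phase would commit to one of these candidates before tagging or parsing; the coupled approach described above instead leaves the choice to the probabilistic model.
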

Similar resources

Intégration des données d'un lexique syntaxique dans un analyseur syntaxique probabiliste (Integration of data from a syntactic lexicon into a probabilistic parser) [in French]

Abstract: This article presents the results of an evaluation of the integration of data from a syntactic lexicon, the Lexique-Grammaire, into a parser. We show that by modifying the tagset of verbs and predicative nouns, an unlexicalized probabilistic parser achieves improved performance on French. Keywords: syntactic parsing pro...

Building a Tree-Bank of Modern Hebrew Text

This paper describes the process of building the first tree-bank for Modern Hebrew texts. A major concern in this process is the need for reducing the cost of manual annotation by the use of automatic means. To this end, the joint utility of an automatic morphological analyzer, a probabilistic parser and a small manually annotated tree-bank was explored. An initial tree-bank that consists of 50...

Reinforcing Parser Preferences through Tagging

Lexical ambiguity is an important source of inefficiency for wide-coverage HPSG parsing. In this paper, we propose a lexical analysis filter which removes unlikely lexical categories. The filter is implemented as a straightforward HMM n-gram POS-tagger, which computes the ’a posteriori’ probability of each lexical category. A lexical category is removed if a competing lexical category is suffic...

Vers un analyseur syntaxique du wolof (Towards a syntactic analyzer of Wolof) [in French]

Mar Ndiaye (1), Cherif Mbodj (2). (1) Ecole supérieure de commerce Dakar, 7 av. Faidherbe BP21354 Dakar; (2) Centre de linguistique appliquée de Dakar (UCAD). Abstract: In this article we present our project for a syntactic analyzer of Wolof, a language spoken in Séné...

Playing with parsers (Jouer avec des analyseurs syntaxiques) [in French]

Abstract. We present DYALOG-SR, a statistical dependency parser developed for the SPMRL 2013 task, which covers a set of 9 very different languages. DYALOG-SR implements a transition-based parsing algorithm (à la MALT), extended with beams and dynamic programming techniques. One distinctive feature of DYALOG-SR comes from ...

Publication date: 2012