Automatically identifying implicit discourse relations using annotated data and raw corpora (Identification automatique des relations discursives « implicites » à partir de données annotées et de corpus bruts) [in French]
Abstract
This paper presents a system for identifying "implicit" discourse relations (that is, relations that are not marked by a discourse connective). Given the small amount of annotated data available for this task, our system also draws on additional automatically labeled data in which unambiguous connectives have been removed and used as relation labels, a method introduced by Marcu and Echihabi (2002). As shown by Sporleder and Lascarides (2008) for English, this approach does not generalize well to implicit relations as annotated by humans. We show that the same conclusion holds for French, owing to substantial distributional differences between the two types of data. Consequently, we propose several simple methods, all inspired by work on domain adaptation, for better combining annotated data and artificial data. We evaluate these methods in experiments on the ANNODIS corpus: our best system reaches a labeling accuracy of 45.6%, a significant gain of 5.9% over a system trained only on manually labeled data.
KEYWORDS: discourse analysis, implicit relations, machine learning.
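The artificial-data step described in the abstract (removing an unambiguous connective and using it as the relation label) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the connective-to-relation map, the function name `make_artificial_example`, and the rule of splitting the sentence at the connective are hypothetical and are not taken from the paper, which works on French data and may segment arguments differently.

```python
import re

# Hypothetical connective -> relation map (illustrative only, not the paper's lexicon).
CONNECTIVES = {
    "because": "explanation",
    "however": "contrast",
    "then": "narration",
}


def make_artificial_example(text, connectives=CONNECTIVES):
    """Turn an explicit example into an artificial implicit one.

    Returns (arg1, arg2, label) when the text contains exactly one known
    connective with material on both sides; otherwise returns None.
    """
    hits = []
    for conn, label in connectives.items():
        pattern = r"\b" + re.escape(conn) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match, label))

    # Keep only texts with a single connective occurrence.
    if len(hits) != 1:
        return None

    match, label = hits[0]
    # Drop the connective itself; what remains on each side is one argument.
    arg1 = text[:match.start()].strip(" ,;")
    arg2 = text[match.end():].strip(" ,;.")
    if not arg1 or not arg2:
        return None
    return arg1, arg2, label


example = make_artificial_example("The match was cancelled because it rained.")
```

In this sketch a sentence-initial connective yields no example, since the first argument would lie in the preceding sentence; a fuller version would need to handle that case.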
Similar resources
Identifier les relations discursives implicites en combinant données naturelles et données artificielles
This paper presents the first experiments on French in the automatic identification of implicit discourse relations (i.e. relations that lack an overt connective). Our systems exploit hand-labeled implicit examples, along with artificial implicit examples obtained from explicit examples by suppressing their connective, following Marcu and Echihabi (2002). Previous work on English shows that the use ...
Modèles d'Ordonnancement pour l'Annotation Automatique d'Images dans les Réseaux Sociaux
ABSTRACT. We propose a ranking model for relational data that automatically learns to annotate images on social image-sharing sites. The model learns to associate an ordered list of tags with an image by jointly considering content information (text/image) and relational information between images. It is able to ...
Wide-Coverage Semantics for Spatio-Temporal Reasoning
In this article, we describe our research on wide-coverage semantics for French-language texts and on its application to produce detailed semantic descriptions of itineraries. Using a categorial grammar semi-automatically extracted from the French Treebank and a manually constructed semantic lexicon, the resulting parser computes discourse representation structures representing the meaning of ar...
Street-Level Geolocation From Natural Language Descriptions
In this article, we describe the TEGUS system for mining geospatial path data from natural language descriptions. TEGUS uses natural language processing and geospatial databases to recover path coordinates from user descriptions of paths at street level. We also describe the PURSUIT Corpus — an annotated corpus of geospatial path descriptions in spoken natural language. PURSUIT includes the spo...
Vers une annotation automatique de corpus audio pour la synthèse de parole (Towards Fully Automatic Annotation of Audio Books for Text-To-Speech (TTS) Synthesis) [in French]
ABSTRACT Building speech corpora is a crucial step for any text-to-speech synthesis system. Statistical models now require very large corpora, which must be recorded, transcribed, annotated and segmented before they can be used. The variety of corpora needed for current applications (content, style, etc...