Partial Parsing of Spontaneous Spoken French

Authors

  • Olivier Blanc
  • Matthieu Constant
  • Anne Dister
  • Patrick Watrin
Abstract

This paper describes the process and the resources used to automatically annotate a French corpus of spontaneous speech transcriptions with super-chunks. Super-chunks are enhanced chunks that can contain lexical multiword units. This partial parsing relies on a preprocessing stage that reformats and tags the utterances that break the syntactic structure of the text, such as disfluencies. The specific features of spoken language were formalized through a systematic linguistic study of a 40-hour speech transcription corpus. The chunker uses large-coverage, fine-grained language resources for general written language, augmented with resources specific to spoken French. It iteratively applies finite-state lexical and syntactic resources and outputs a finite automaton representing all possible chunk analyses. The best path is then selected by a hybrid disambiguation stage. We show that our system reaches scores comparable with state-of-the-art results in the field.
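
The abstract does not say how the hybrid disambiguation stage scores competing analyses. As a rough illustration only, the sketch below treats the automaton of chunk analyses as an acyclic lattice over token positions and picks the path with the fewest chunks, a simple longest-match heuristic that is an assumption here and not the authors' method. The function name `best_path`, the edge format, the chunk labels, and the example sentence are all hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical edge format: (start, end, label) is a candidate chunk
# covering tokens[start:end], with start < end.
Edge = Tuple[int, int, str]

def best_path(n_tokens: int, edges: List[Edge]) -> List[Edge]:
    """Return the path from position 0 to n_tokens that uses the fewest chunks.

    Assumption: "fewest chunks" stands in for the paper's hybrid
    disambiguation, which the abstract does not specify.
    """
    inf = float("inf")
    cost = [inf] * (n_tokens + 1)
    back: List[Optional[Edge]] = [None] * (n_tokens + 1)
    cost[0] = 0
    # Edges always move forward, so visiting them in order of start
    # position guarantees cost[start] is final when an edge is relaxed.
    for start, end, label in sorted(edges):
        if cost[start] + 1 < cost[end]:
            cost[end] = cost[start] + 1
            back[end] = (start, end, label)
    if cost[n_tokens] == inf:
        raise ValueError("no complete chunk analysis covers the utterance")
    # Follow the back-pointers from the final position to position 0.
    path: List[Edge] = []
    pos = n_tokens
    while pos > 0:
        edge = back[pos]
        assert edge is not None
        path.append(edge)
        pos = edge[0]
    path.reverse()
    return path

# Illustrative lattice for: il a mangé une pomme de terre
TOKENS = ["il", "a", "mangé", "une", "pomme", "de", "terre"]
EDGES: List[Edge] = [
    (0, 1, "NP"),  # il
    (1, 2, "VP"),  # a
    (2, 3, "VP"),  # mangé
    (1, 3, "VP"),  # a mangé
    (3, 5, "NP"),  # une pomme
    (5, 7, "PP"),  # de terre
    (3, 7, "NP"),  # une pomme de terre (contains a multiword unit)
]

if __name__ == "__main__":
    for start, end, label in best_path(len(TOKENS), EDGES):
        print(label, " ".join(TOKENS[start:end]))
```

On this toy lattice the heuristic prefers the single chunk spanning "une pomme de terre" over the split into "une pomme" and "de terre", which mirrors the idea of a super-chunk that keeps a multiword unit together.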


Similar resources

Word order phenomena in conversational spoken French: A study on task-oriented dialogue corpora and its consequences on language processing

This paper presents a corpus study that investigates word order variations (WOV) in spontaneous spoken French and their consequences for the parsing techniques used in Natural Language Processing. We have studied four task-oriented spoken dialogue corpora which concern different application tasks (air transport or tourism information, switchboard calls). Two corpora concern...


Robust dependency parsing for spoken language understanding of spontaneous speech

We describe in this paper a syntactic parser for spontaneous speech geared towards the identification of verbal subcategorization frames. The parser proceeds in two stages. The first stage is based on generic syntactic resources for French. The second stage is a reranker which is specially trained for a given application. The parser is evaluated on the MEDIA corpus.


A hybrid approach to spoken dialogue understanding: prosody, statistics and partial parsing

Linguistic processing in spoken dialogue systems has to be robust against a large number of phenomena such as recognizer errors, spontaneous speech phenomena and out-of-vocabulary (OOV) words. A commonly used solution to this problem is partial parsing, which aims to detect only the parts of sentences/utterances that are vital for the respective task of the parser. In our paper we present a frame...


Adapting dependency parsing to spontaneous speech for open domain spoken language understanding

Parsing human-human conversations consists in automatically enriching text transcription with semantic structure information. We use in this paper a FrameNet-based approach to semantics that, without needing a full semantic parse of a message, goes further than a simple flat translation of a message into basic concepts. FrameNet-based semantic parsing may follow a syntactic parsing step, howeve...


Semantic tree unification grammar: a new formalism for spoken language processing

In this paper we present the Semantic Tree Unification Grammar (STUG), a new formalism for parsing spoken language. The main motivation of this formalism is to combine the robustness and simplicity of classical semantic grammars with the depth of traditional syntactic formalisms. The key properties of STUG are: the direct linearization of the semantic structure, an econo...



Publication date: 2010