Multi-level annotation for spoken language corpora

Authors

  • Philippe Blache
  • Daniel Hirst
Abstract

The constitution of multi-level databases integrating, for example, both prosodic and morphosyntactic levels of representation presents a number of problems, some specific to the individual domains, and others concerning the integration of the two domains. It is argued that the formalism of annotation graphs provides an adequate solution to these problems, which can be implemented in an XML representation. It is further argued that a generic query language, DQL, currently being developed, will provide a satisfactory framework both for querying and for manipulating documents of this type.
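As a rough illustration of the idea, an annotation graph can be pictured as a set of time-anchored nodes shared by all levels, with labeled arcs for each level spanning pairs of nodes. The sketch below is not the authors' format: the element names (`annotationGraph`, `node`, `arc`), the tier names, the labels, and the times are all hypothetical, chosen only to show how a prosodic level and a morphosyntactic level can share anchors and be written out as XML.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    start: str  # id of the node where the annotation begins
    end: str    # id of the node where the annotation ends
    tier: str   # annotation level (hypothetical names)
    label: str  # annotation value (hypothetical values)


# Nodes are time anchors (in seconds) shared by every annotation level.
nodes = {"n0": 0.00, "n1": 0.21, "n2": 0.58}

arcs = [
    Arc("n0", "n1", "morphosyntax", "Det"),
    Arc("n1", "n2", "morphosyntax", "Noun"),
    Arc("n0", "n2", "prosody", "intonation-unit"),
]


def arcs_within(span, tier):
    """Return the arcs on `tier` whose time span lies inside the arc `span`."""
    lo, hi = nodes[span.start], nodes[span.end]
    return [
        a for a in arcs
        if a.tier == tier and lo <= nodes[a.start] and nodes[a.end] <= hi
    ]


def to_xml():
    """Serialize the nodes and arcs to an XML string (illustrative schema)."""
    root = ET.Element("annotationGraph")
    for node_id, time in nodes.items():
        ET.SubElement(root, "node", id=node_id, time=f"{time:.2f}")
    for a in arcs:
        ET.SubElement(root, "arc", start=a.start, end=a.end,
                      tier=a.tier, label=a.label)
    return ET.tostring(root, encoding="unicode")
```

Because both levels refer to the same nodes, a cross-level question such as "which morphosyntactic units fall inside this prosodic unit" reduces to comparing anchor times, as `arcs_within(arcs[2], "morphosyntax")` does here.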

Related resources

A multi-level multimedia concordancer for spoken language corpora (Un concordancier multi-niveaux et multimédia pour des corpus oraux) [in French]

Concordances have always played an important role in the analysis of language corpora, for studies in humanities, literature, linguistics, translation and language teaching. However, very few of the available systems support multi-level queries against a richly-annotated, sound-aligned spoken corpus. The rapid growth in the development of spoken corpora, particularly for French, increases the n...

Detecting Annotation Errors in Spoken Language Corpora

Consistency of corpus annotation is an essential property for the many uses of annotated corpora in computational and theoretical linguistics. While some research addresses the detection of inconsistencies in part-of-speech and other positional annotation (van Halteren, 2000; Eskin, 2000; Dickinson and Meurers, 2003a), more recently work has also started to address errors in syntactic and other...

Towards Detecting Annotation Errors in Spoken Language Corpora

The issue: Consistency of corpus annotation is an essential property for the many uses of annotated corpora in computational and theoretical linguistics. While some research addresses the detection of inconsistencies in part-of-speech and other positional annotation (van Halteren, 2000; Eskin, 2000; Dickinson and Meurers, 2003a), only recently has there been some work in detecting errors in synt...

DisMo: A Morphosyntactic, Disfluency and Multi-Word Unit Annotator. An Evaluation on a Corpus of French Spontaneous and Read Speech

We present DisMo, a multi-level annotator for spoken language corpora that integrates part-of-speech tagging with basic disfluency detection and annotation, and multi-word unit recognition. DisMo is a hybrid system that uses a combination of lexical resources, rules, and statistical models based on Conditional Random Fields (CRF). In this paper, we present the first public version of DisMo for ...

Query Language for Research in Phonetics

With the growing availability of spoken language corpora more and more data driven research in phonetics is possible. The downside of having huge speech corpora is that they have to be segmented and labeled, before they can be exploited. As labeling and annotation are time-consuming and costly, there is an interest in standardization which would support the exchange and reuse of labeled data. T...

Publication date: 2000