Very Large Annotated Database of American English
Abstract
Objective: To construct a database (the "Penn Treebank") of written and transcribed spoken American English annotated with detailed grammatical structure. This database will serve as a national resource, providing training material for a wide variety of approaches to automatic language acquisition, a reference standard for the rigorous evaluation of some components of natural language understanding systems, and a research tool for the investigation of the grammar and prosodic structure of naturally spoken English.
Similar resources
First Steps Towards an Annotated Database of American English
This paper reports on one of the first steps in building a very large annotated database of American English. We present and discuss the results of an experiment comparing manual part-of-speech tagging with manual verification and correction of automatic stochastic tagging. The experiment shows that correcting is superior to tagging with respect to speed, consistency and accuracy. ...
Automatic Prediction of Intelligibility of Spoken Words in Japanese Accented English
This study examines automatic prediction of the words that will be unintelligible if they are spoken by Japanese speakers of English. In our previous study [1], 800 English utterances spoken by Japanese speakers, which contained 6,063 words, were presented to 173 American listeners and correct perception rate was obtained for each spoken word. By using the results, in this study, we define the ...
Slovene-English Datasets for MT
Advances in machine translation are becoming increasingly dependent on the availability of large scale language resources, in particular parallel corpora. The talk presents Slovene-English language resources that were developed as datasets for translation studies and machine learning programs. Three parallel datasets are introduced: the MULTEXT-East multilingual word-annotated corpus, the IJS-E...
A large scale annotated child language construction database
Large scale annotated corpora of child language can be of great value in assessing theoretical proposals regarding language acquisition models. For example, they can help determine whether the type and amount of data required by a proposed language acquisition model can actually be found in a naturalistic data sample. To this end, several recent efforts have augmented the CHILDES child language...
Designing and Labelling a Prosodic Database for American English
A corpus of read American English was designed as a research tool for speech synthesis and prosody research with an emphasis on concept-to-speech research. The total duration of the corpus is two hours. It was recorded with two native speakers who also provide the voices of the VERBMOBIL American English speech synthesis. The corpus was annotated linguistically on several levels (syntax, semant...