Taiwan Child Language Corpus: Data Collection and Annotation
Abstract
The Taiwan Child Language Corpus contains transcripts of about 330 hours of recordings of fourteen young children from Southern Min Chinese-speaking families in Taiwan. The corpus follows the format of the Child Language Data Exchange System (CHILDES) and comprises about 1.6 million words. In this paper, we describe the data collection, transcription, word segmentation, and part-of-speech annotation of the corpus. Applications of the corpus are also discussed.
Related resources
Construction and Automatization of a Minnan Child Speech Corpus with some Research Findings
Taiwanese Child Language Corpus (TAICORP) is a corpus based on spontaneous conversations between young children and their adult caretakers in Minnan (Taiwan Southern Min) speaking families in Chiayi County, Taiwan. This corpus is special in several ways: (1) It is a Minnan corpus; (2) It is a speech-based corpus; (3) It is a corpus of a language that does not yet have a conventionalized orthogr...
Using a Serious Game to Collect a Child Learner Speech Corpus
We present an English-L2 child learner speech corpus, produced by Swiss German-L1 students in their third year of learning English, which is currently in the process of being collected. The collection method uses a web-enabled multimodal language game implemented using the CALL-SLT platform, in which subjects hold prompted conversations with an animated agent. Prompts consist of a short animate...
The Creagest Project: a Digitized and Annotated Corpus for French Sign Language (LSF) and Natural Gestural Languages
In this paper, we discuss the theoretical, sociolinguistic, methodological and technical objectives and issues of the French Creagest Project (2007-2012) in setting up, documenting and annotating a large corpus of adult and child French Sign Language (LSF) and of natural gestural language. In section 2., we address theoretical and practical issues, emphasizing the outstanding features of the Cr...
RWTH-Phoenix: Analysis of the German Sign Language Corpus
In this work, the recent additions to the RWTH-Phoenix corpus, a data collection of interpreted news announcements, are analysed. The corpus features videos, gloss annotation of German Sign Language and transcriptions of spoken German. The annotation procedure is reported, and the corpus statistics are discussed. We present automatic machine translation results for both directions, and discuss s...
High-accuracy Annotation and Parsing of CHILDES Transcripts
Corpora of child language are essential for psycholinguistic research. Linguistic annotation of the corpora provides researchers with better means for exploring the development of grammatical constructions and their usage. We describe an ongoing project that aims to annotate the English section of the CHILDES database with grammatical relations in the form of labeled dependency structures. To d...