The C-ORAL-BRASIL I: Reference Corpus for Spoken Brazilian Portuguese
Abstract
C-ORAL-BRASIL I is a corpus of spontaneous Brazilian Portuguese speech compiled with the same architecture as the C-ORAL-ROM resource. Its main goal is to document diaphasic and diastratic variation in Brazilian Portuguese. The diatopic variety represented is that of the metropolitan area of Belo Horizonte, the capital city of Minas Gerais. Although it was not a primary goal, a good balance was achieved in speakers’ diastratic features (sex, age and level of schooling). The corpus is entirely dedicated to informal spontaneous speech and comprises 139 texts, 208,130 words and 21:08:52 (hh:mm:ss) of recordings, distributed between family/private (80%) and public (20%) contexts. The language resource (LR) includes audio files, transcripts in text format and text-to-speech alignment (accessible with the WinPitch Pro software). C-ORAL-BRASIL I also provides transcripts with part-of-speech annotation produced with the Palavras parser. Transcripts were validated for the correct application of the transcription criteria and for the annotation of prosodic boundaries. Some quantitative features of C-ORAL-BRASIL I are reported in comparison with the informal C-ORAL-ROM.
Similar resources
Prosody, syntax, and pragmatics: insubordination in spoken Brazilian Portuguese
In this paper, we approach the phenomenon of insubordination in spoken Brazilian Portuguese through data on adverbial clauses extracted from the C-ORAL-BRASIL corpus [Raso and Mello 2012]. Differently from the traditional conception of insubordination [Evans 2007], we propose a synchronic view, strongly based on the analysis of the prosody/pragmatics interface in spoken language. We show that i...
Providing On-line Access to Portuguese Language Resources: Corpora and Lexicons
Several Language Resources (LRs) for Portuguese, developed at the Center of Linguistics of the Lisbon University (CLUL), are available on-line at CLUL’s webpage: www.clul.ul.pt/english/sectores/projecto_rld.html. These LRs have been extracted from or developed based on the Reference Corpus of Contemporary Portuguese (CRPC), a monitor corpus containing, at the present, more than 300 million word...
What Corpus Linguistics can offer Contact Linguistics: the c-oral-brasil corpus experience / O que a Linguística de Corpus pode oferecer à Linguística de Contato: a experiência do corpus c-oral-brasil
Contact Linguistics, throughout its history, has been mostly a data-oriented subdiscipline. From the gathering of word lists in colonial settings by pioneer scholars to the current compilation of narratives, interviews and databanks, Contact Linguistics, differently from other Linguistics subdisciplines, has strived to base its findings on the analysis of actual language produced by speakers of...
The annotation of the C-ORAL-BRASIL spoken corpus using an adaptation of the Palavras Parser
This article describes the morphosyntactic annotation of the C-ORAL-BRASIL speech corpus, using an adapted version of the Palavras parser. In order to achieve compatibility with annotation rules designed for standard written Portuguese, transcribed words were orthographically normalized, and the parsing lexicon augmented with speech-specific material, phonetically spelled abbreviations etc. Usi...
‘Minor’ Languages, ‘Broken’ Translations: On Brazilian Reworkings of an Albanian Novel
This essay approaches the challenges of global translation in the 21st century from what might still be considered a somewhat uncommon example: a direct translation of Ismail Kadaré's 1978 novel Prill e thyër (Broken April) from the original Albanian into Brazilian Portuguese in 2001. Not only does it examine and compare lexical elements in the source and target texts and the usage of translato...