The Gulf of Guinea Creole Corpora
Abstract
We present the process of building linguistic corpora of the Portuguese-related Gulf of Guinea creoles, a cluster of four historically related languages: Santome, Angolar, Principense and Fa d’Ambô. We faced the difficulties typical of languages without official status: no standard spelling, language variation, a lack of basic language instruments, and small data sets, which span the late 19th century to the present. To tackle these problems, the compiled written data and the transcribed spoken data collected during fieldwork trips were adapted to a normalized spelling applied to all four languages. For the corpus compilation we followed corpus linguistics standards. We recorded metadata for each file and added morphosyntactic information based on a part-of-speech tag set designed to handle the specificities of these languages. The corpora of three of the four creoles are already available and searchable via a web interface.
Related resources
Wanpela deitabeis long Tok Pisin bilong baim tiket bilong balus. (An ATIS database in Tok Pisin.) Methodological observations with regard to the collection of human–human data
This paper describes the collection of authentic human–human air travel information data in Tok Pisin, the pidgin/creole language spoken in Papua New Guinea. Pros and cons of authentic data are discussed, as compared to data collected in more controlled settings like Wizard-of-Oz simulations. Some unexpected real-life phenomena that affect the data, and normally do not occur in corpora compiled...
Spell Checking Techniques for Replacement of Unknown Words and Data Cleaning for Haitian Creole SMS Translation
We report results on translation of SMS messages from Haitian Creole to English. We show improvements by applying spell checking techniques to unknown words and creating a lattice with the best known spelling equivalents. We also used a small cleaned corpus to train a cleaning model that we applied to the noisy corpora.
Anou Tradir: Experiences In Building Statistical Machine Translation Systems For Mauritian Languages - Creole, English, French
We present, in this paper, our experiences in developing Statistical Machine Translation (SMT) systems involving English, French and Mauritian Creole, the languages most spoken in Mauritius. We give a brief overview of the peculiarities of the language phenomena in Mauritian Creole and indicate the differences between it and English and French. We then give descriptions of the developed corpora...
HLA Genes in Afro-American Colombians (San Basilio de Palenque): The First Free Africans in America
An Afro-American semi-isolated Colombian population is studied for its HLA genes: the San Basilio de Palenque community in the northern mountains of Colombia. This community represents the first free Africans in America, earning recognition by the Spanish Crown in 1691 AD. Nowadays, they also speak the only extant Bantu-Spanish Creole language in the world; these people have been apart from their neighbo...
Combating Piracy in the Gulf of Guinea
The 5,000-nautical mile (nmi) coastline of the wider Gulf of Guinea offers seemingly idyllic conditions for shipping. It is host to numerous natural harbors and is largely devoid of chokepoints and extreme weather conditions. It is also rich in hydrocarbons, fish, and other resources. These attributes provide immense potential for maritime commerce, resource extraction, shipping, and developm...