Generating a bilingual lexical corpus using interlanguage normalized Levenshtein distances
Authors
Abstract
Finding large numbers of target items for phonetic and phonological experiments can be a time-consuming and error-prone task. Using freely available tools and data, we have generated a bilingual corpus with the specific aim of investigating the processing and perception of stress in second-language (L2) words. Normalized Levenshtein distances between orthographic and phonemic transcriptions of Brazilian Portuguese (BP) and American English (AmE) translation word pairs were used to automatically generate similar and dissimilar word pairs. Frequency data from corpora were used as a metric of familiarity. To test whether these generated metrics correspond to speakers' representations, BP L1 speakers of AmE L2 rated the word pairs on orthographic and phonological similarity and indicated their familiarity with the English words. Results showed a high correlation between the subjective ratings and the computed similarity and familiarity values of the bilingual corpus. We conclude that automatically constructed bilingual corpora such as ours, combined with simple string similarity metrics, are a valid and useful tool for experimental research into L2 (word stress).
Keywords: Phonetics, Corpus Linguistics, Psycholinguistics, normalized Levenshtein distances, L2 segmental categorization, L2 word stress
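The abstract does not say how the Levenshtein distances were normalized. As a minimal sketch, the example below assumes the common convention of dividing the edit distance by the length of the longer string, so that 0.0 means identical and 1.0 means maximally different. The function names and the word pair used in the usage note are illustrative assumptions, not items taken from the corpus.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    # Keep the shorter string in the inner loop to limit row size.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the longer string's length.

    Assumption: this normalization is one common choice; the source
    does not state which normalization the authors used.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(a, b) / longest
```

For a hypothetical orthographic pair such as "telefone" and "telephone", the edit distance is 2 (one substitution and one insertion), giving a normalized distance of 2/9. The same function could be applied to phonemic transcriptions, provided each phoneme is encoded as a single symbol.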
Similar resources
Bootstrapping a Multilingual Part-of-speech Tagger in One Person-day
This paper presents a method for bootstrapping a fine-grained, broad-coverage part-of-speech (POS) tagger in a new language using only one person-day of data acquisition effort. It requires only three resources, which are currently readily available in 60-100 world languages: (1) an online or hard-copy pocket-sized bilingual dictionary, (2) a basic library reference grammar, and (3) access to an...
Lexical semantic typologies from bilingual corpora - A framework
We present a framework, based on Sejane and Eger (2012), for inducing lexical semantic typologies for groups of languages. Our framework rests on lexical semantic association networks derived from encoding, via bilingual corpora, each language in a common reference language, the tertium comparationis, so that distances between languages can easily be determined.
Lexical evolution rates by automated stability measure
Phylogenetic trees can be reconstructed from the matrix which contains the distances between all pairs of languages in a family. Recently, we proposed a new method which uses normalized Levenshtein distances among words with same meaning and averages on all the items of a given list. Decisions about the number of items in the input lists for language comparison have been debated since the begin...
Poverty driven bilingual alignment
Bilingual corpora are essential for the construction of bilingual resources just as for any other work in translation studies, but the alignment itself needs bilingual resources or important interventions of bilingual speakers. This article describes work in progress on bilingual text alignment with a dynamic time warping algorithm (DTW). All other algorithms rely on bilingual resources or on t...
Interlanguage Talk: What Can Breadth of Knowledge Features Tell Us about Input and Output Differences?
The purpose of this study is to investigate the use of breadth of knowledge lexical features in non-native speakers' (NNS) input and output. Our primary interest is analyzing potential breadth of knowledge lexical differences in the output of NNSs when engaged in interlanguage talk (NNS-NNS) and when engaged in naturalistic speech with a native speaker (NS). We are also interested in input diff...