Bridge-Language Capitalization Inference in Western Iranian: Sorani, Kurmanji, Zazaki, and Tajik
نویسندگان
چکیده
In Sorani Kurdish, one of the most useful orthographic features in named-entity recognition – capitalization – is absent, as the language’s Perso-Arabic script does not make a distinction between uppercase and lowercase letters. We describe a system for deriving an inferred capitalization value from closely related languages by phonological similarity, and illustrate the system using several related Western Iranian languages.
منابع مشابه
Named Entity Recognition for Linguistic Rapid Response in Low-Resource Languages: Sorani Kurdish and Tajik
This paper describes our construction of named-entity recognition (NER) systems in two Western Iranian languages, Sorani Kurdish and Tajik, as a part of a pilot study of Linguistic Rapid Response to potential emergency humanitarian relief situations. In the absence of large annotated corpora, parallel corpora, treebanks, bilingual lexica, etc., we found the following to be effective: exploiting...
متن کاملSorani Kurdish versus Kurmanji Kurdish: An Empirical Comparison
Resource scarcity along with diversity– both in dialect and script–are the two primary challenges in Kurdish language processing. In this paper we aim at addressing these two problems by (i) building a text corpus for Sorani and Kurmanji, the two main dialects of Kurdish, and (ii) highlighting some of the orthographic, phonological, and morphological differences between these two dialects from ...
متن کاملKurdish Interdialect Machine Translation
This research suggests a method for machine translation among two Kurdish dialects. We chose the two widely spoken dialects, Kurmanji and Sorani, which are considered to be mutually unintelligible. Also, despite being spoken by about 30 million people in different countries, Kurdish is among less-resourced languages. The research used bi-dialectal dictionaries and showed that the lack of parall...
متن کاملStemming for Kurdish Information Retrieval
Resource scarcity along with diversity –in both dialect and script– are the two primary challenges in Kurdish language processing. In this paper we aim at addressing these two problems by building stemmers for the two main dialects of the Kurdish language (i.e. Sorani and Kurmanji) and investigate their effectiveness on Kurdish Information Retrieval. More specifically, we build Jedar, the first...
متن کاملMtDNA and Y-chromosome variation in Kurdish groups.
In order to investigate the origins and relationships of Kurdish-speaking groups, mtDNA HV1 sequences, eleven Y chromosome bi-allelic markers, and 9 Y-STR loci were analyzed among three Kurdish groups: Zazaki and Kurmanji speakers from Turkey, and Kurmanji speakers from Georgia. When compared with published data from other Kurdish groups and from European, Caucasian, and West and Central Asian ...
متن کامل