Stemming and Segmentation for Classical Tibetan
Abstract
Tibetan is a monosyllabic language for which computerized language tools are largely lacking. We describe the development of a syllable stemmer for Tibetan. The stemmer is based on a set of rules that strive to identify the vowel, the core letter of the syllable, and then the other parts. We demonstrate the value of the stemmer with two applications: determining the stem similarity of two syllables, and word segmentation. The stemmer is being made available as an open-source tool, and the word segmenter as a freely available online tool.
"It is worthy of remark that a tongue which in its nature was monosyllabic, when written in the characters of a polysyllabic language like the Sanskrit, had necessarily to undergo some modification."
Sarat Chandra Das, “Life of Sum-pa mkhan-po, also styled Ye-śes dpal-’byor, the author of Rehumig (Chronological Table)”, Journal of the Asiatic Society of Bengal (1889)
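The idea of locating the vowel first and then the parts around it can be illustrated with a minimal sketch. This is not the authors' rule set: it assumes syllables are given in Wylie transliteration (where every syllable carries exactly one of the vowel letters a, i, u, e, o), it makes no attempt to separate the core letter from prefixes, superscripts, or subscripts, and the names `split_syllable` and `part_overlap` are hypothetical. The similarity score is likewise only an illustrative stand-in for the stem similarity mentioned in the abstract.

```python
import re

# Assumption: input is one Wylie-transliterated syllable with a single vowel.
# Everything before the vowel is treated as the onset cluster, everything
# after it as the coda (suffix letters).
_SYLLABLE = re.compile(r"^([^aeiou]*)([aeiou])([^aeiou]*)$")


def split_syllable(syllable):
    """Return (onset, vowel, coda), or None if the syllable does not fit
    the single-vowel pattern (for example, a syllable with a merged particle)."""
    match = _SYLLABLE.match(syllable)
    if match is None:
        return None
    return match.groups()


def part_overlap(a, b):
    """Hypothetical similarity: fraction of the three parts that agree."""
    parts_a, parts_b = split_syllable(a), split_syllable(b)
    if parts_a is None or parts_b is None:
        return 0.0
    return sum(x == y for x, y in zip(parts_a, parts_b)) / 3


if __name__ == "__main__":
    print(split_syllable("bsgrubs"))
    print(part_overlap("sgrub", "sgrubs"))
```

A real stemmer would need further rules to pick out the core letter inside the onset cluster and to handle syllables that do not fit this simple pattern; the sketch only shows the vowel-first decomposition step.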
Similar resources
A Hackathon for Classical Tibetan
We describe the course of a hackathon dedicated to the development of linguistic tools for Tibetan Buddhist studies. Over a period of five days, a group of seventeen scholars, scientists, and students developed and compared algorithms for intertextual alignment and text classification, along with some basic language tools, including a stemmer and word segmenter. keywords Tibetan; hackathon; ste...
Research on Tibetan Automatic Word Segmentation
This paper investigates Tibetan automatic word segmentation. We focus on three key technologies: (1) a Tibetan automatic word segmentation approach is proposed that takes advantage of case-auxiliary words and continuous features; (2) a method for resolving overlapping ambiguity in Tibetan word segmentation is proposed, which is based on forward-b...
Perceptual evaluation of models for music segmentation
Background in music perception and cognition. Stemming from the seminal work of Lerdahl and Jackendoff (1983), a number of studies have examined the relevance of musicological rules and elements to the perceptual structure in music (Deliege, 1987; Clark and Krumhansl, 1990; Frankland and Cohen, 2004). While certain cues and rules have been shown to be related to perceptual segmentation, the foc...
Tibetan Unknown Word Identification from News Corpora for Supporting Lexicon-based Tibetan Word Segmentation
In Tibetan, words are written consecutively without delimiters, so finding the boundaries of unknown words is difficult. This paper presents a hybrid approach to Tibetan unknown word identification for offline corpus processing. First, Tibetan named entities are preprocessed based on natural annotation. Second, other Tibetan unknown words are extracted from word segmentation fragments using MTC, the co...
Tibetan Number Identification Based on Classification of Number Components in Tibetan Word Segmentation
Tibetan word segmentation is essential for Tibetan information processing. At present, Tibetan words are mainly segmented with a basic dictionary-based machine matching method, because no segmented Tibetan corpus is available for training a word segmenter. However, the dictionary-based method is not suited to Tibetan number identification. This paper studies...