Unknown Word Guessing and Part-of-Speech Tagging Using Support Vector Machines
Abstract
The accuracy of part-of-speech (POS) tagging is substantially lower for unknown words than for known words. Given the high accuracy of current statistical POS taggers, unknown words account for a non-negligible portion of the remaining errors. This paper describes POS prediction for unknown words using Support Vector Machines. We achieve high accuracy in POS tag prediction using substrings and surrounding context as features. Furthermore, we integrate this method with a practical English POS tagger and achieve an accuracy of 97.1%, higher than conventional approaches.
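The abstract names two feature families for unknown words: substrings of the word and its surrounding context. The following is a minimal sketch of what such a feature extractor might look like. It is not the authors' implementation: the function name `unknown_word_features`, the affix length limit, the context window size, and the `<PAD>` boundary token are all assumptions made for illustration, since the abstract does not specify them.

```python
def unknown_word_features(tokens, index, max_affix_len=4, window=2):
    """Return a set of binary feature names for the token at `index`.

    Hypothetical feature template: `max_affix_len` and `window` are
    illustrative defaults, not values taken from the paper.
    """
    word = tokens[index]
    features = set()

    # Substring features: every prefix and suffix up to max_affix_len characters.
    for n in range(1, min(max_affix_len, len(word)) + 1):
        features.add(f"prefix={word[:n]}")
        features.add(f"suffix={word[-n:]}")

    # Context features: lowercased neighbouring words inside the window,
    # with a padding symbol (an assumption) beyond the sentence boundaries.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        pos = index + offset
        neighbor = tokens[pos].lower() if 0 <= pos < len(tokens) else "<PAD>"
        features.add(f"word[{offset:+d}]={neighbor}")

    return features


# Example: treat "reauthorized" as the unknown word.
tokens = ["The", "committee", "reauthorized", "the", "program"]
features = unknown_word_features(tokens, 2)
```

Each feature name would then be mapped to one dimension of a sparse binary vector before training an SVM classifier per POS tag. That step is omitted here because the abstract gives no details about the kernel or the multi-class scheme.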
Similar resources
Multilingual Word Segmentation and Part-of-Speech Tagging: A Machine Learning Approach Incorporating Diverse Features
The aim of this dissertation is to study statistical methods for multilingual word segmentation and POS tagging with high accuracy. Word segmentation and part-of-speech (POS) tagging are fundamental language analysis tasks in natural language processing and are used in many applications. The existence of unknown words is a major problem in these tasks, and such words need to be handled properly. We attempt t...
Prediction of part of speech tags for Punjabi using support vector machines
Part-of-speech (POS) tagging is the task of assigning the appropriate POS or lexical category to each word in a natural language sentence. In this paper, we work on automated annotation of POS tags for Punjabi. We collected a corpus of around 27,000 words, which included text from various stories, essays, day-to-day conversations, poems, etc., and divided these words into different...
High Speed Unknown Word Prediction Using Support Vector Machine for Chinese Text-to-Speech Systems
One of the most significant problems in part-of-speech (POS) tagging of Chinese text is identifying the words in a sentence, since no blank delimits the words. Because it is impossible to pre-register every word in a dictionary, unknown words inevitably occur during this process. Therefore, the unknown word problem has remarkable effects on the accuracy of th...
Automatic Rule Induction for Unknown-Word Guessing
Words unknown to the lexicon present a substantial problem to NLP modules that rely on morphosyntactic information, such as part-of-speech taggers or syntactic parsers. In this paper we present a technique for fully automatic acquisition of rules that guess possible part-of-speech tags for unknown words using their starting and ending segments. The learning is performed from a general-purpose l...
Unsupervised Learning of Word-Category Guessing Rules
Words unknown to the lexicon present a substantial problem to part-of-speech tagging. In this paper we present a technique for fully unsupervised statistical acquisition of rules which guess possible parts-of-speech for unknown words. Three complementary sets of word-guessing rules are induced from the lexicon and a raw corpus: prefix morphological rules, suffix morphological rules and ending-gu...