Ending-based Strategies for Part-of-Speech Tagging
Abstract
Probabilistic approaches to part-of-speech tagging rely primarily on whole-word statistics about word/tag combinations as well as contextual information. But experience shows that about 4 percent of tokens encountered in test sets are unknown even when the training set is as large as a million words. Unseen words are tagged using secondary strategies that exploit word features such as endings, capitalization and punctuation marks. In this work, word-ending statistics are primary and whole-word statistics are secondary. First, a tagger was trained and tested on word endings only. Subsequent experiments added back whole-word statistics for the N words occurring most frequently in the training set. As N grew larger, performance was expected to improve, in the limit performing the same as word-based taggers. Surprisingly, the ending-based tagger initially performed nearly as well as the word-based tagger; in the best case, its performance significantly exceeded that of the word-based tagger. Lastly, and unexpectedly, an effect of negative returns was observed: as N grew larger, performance generally improved and then declined. By varying factors such as ending length and tag-list strategy, we achieved a success rate of 97.5 percent.
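The core idea of the abstract, looking words up by their ending and falling back to whole-word statistics only for the N most frequent training words, can be illustrated with a small sketch. This is not the authors' tagger: it swaps the probabilistic, context-aware model for a plain most-frequent-tag lookup, and the names `train_tagger`, `n_words`, `ending_len` and the toy corpus are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict

def train_tagger(train, n_words=0, ending_len=3):
    """Train a most-frequent-tag lookup keyed on word endings.

    Assumption (not from the paper): a context-free baseline stands in
    for the probabilistic tagger. `train` is a list of (word, tag) pairs.
    """
    # Whole-word statistics are kept only for the N most frequent words.
    freq = Counter(word.lower() for word, _ in train)
    frequent = {word for word, _ in freq.most_common(n_words)}

    def key(word):
        w = word.lower()
        # Tuples keep whole-word keys and ending keys from colliding.
        if w in frequent:
            return ("word", w)
        return ("ending", w[-ending_len:])

    # Count tags per key, then keep the most frequent tag for each key.
    counts = defaultdict(Counter)
    for word, tag in train:
        counts[key(word)][tag] += 1
    table = {k: c.most_common(1)[0][0] for k, c in counts.items()}

    # Fallback for keys never seen in training: the overall commonest tag.
    default = Counter(tag for _, tag in train).most_common(1)[0][0]

    def tag(word):
        return table.get(key(word), default)

    return tag

# Toy corpus, invented for this example only.
corpus = [
    ("the", "DT"), ("dog", "NN"), ("running", "VBG"), ("jumping", "VBG"),
    ("the", "DT"), ("king", "NN"), ("singing", "VBG"), ("the", "DT"),
    ("king", "NN"), ("the", "DT"),
]

tag_endings = train_tagger(corpus, n_words=0, ending_len=3)  # endings only
tag_mixed = train_tagger(corpus, n_words=2, ending_len=3)    # plus top-2 words

print(tag_endings("walking"), tag_endings("king"))
print(tag_mixed("walking"), tag_mixed("king"))
```

With endings only, the unseen word "walking" is tagged through its "-ing" ending, but "king" is mis-tagged for the same reason; adding whole-word statistics for the two most frequent words corrects "king" while leaving the ending-based behaviour for everything else. Varying `n_words` and `ending_len` corresponds loosely to the factors the abstract reports varying.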
Similar resources
An improved joint model: POS tagging and dependency parsing
Dependency parsing is a form of syntactic parsing that automatically analyzes the dependency structure of natural-language sentences, producing a dependency graph for each input sentence. Part-of-speech (POS) tagging is a prerequisite for dependency parsing. Generally, dependency parsers perform POS tagging and dependency parsing in a pipeline. Unfortunately, in pipel...
Morphological Ending-based Strategies of Unknown Word Estimation for Statistical POS Urdu Tagger
Natural language processing has made wide use of statistical language models to solve disambiguation problems. Over the past decades, different POS tagging techniques have been proposed for English, European and East Asian languages. In this paper our focus is POS tagging for Urdu, because tagging systems for the Urdu language are still in their infancy. We have combined two approaches (Statisti...
A part-of-speech tagging system for the Persian language
Abstract: Part-of-speech (POS) tagging is an essential step for many models and methods in other areas of natural language processing, such as machine translation, spell checking, text-to-speech and automatic speech recognition. So far, highly accurate POS taggers have been created for many languages. In this paper, we focus on POS tagging in the Persian language. Because of problems in Persian POS t...
Persian part-of-speech tagging using a fuzzy network model
Part-of-speech tagging (POS tagging) is an ongoing research topic in natural language processing (NLP) applications. The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The purpose of POS tagging is determining the grammatical ...
A Practical Part-of-Speech Tagger
We present an implementation of a part-of-speech tagger based on a hidden Markov model. The methodology enables robust and accurate tagging with few resource requirements. Only a lexicon and some unlabeled training text are required. Accuracy exceeds 96%. We describe implementation strategies and optimizations which result in high-speed operation. Three applications for tagging are described: p...