Splitting compounds with ngrams
Abstract
Compound words with unmarked word boundaries are problematic for many tasks in NLP and computational linguistics, including information extraction, machine translation, and syllabification. This paper introduces a simple, proof-of-concept language modeling approach to automatic compound segmentation, demonstrated with Finnish. The approach utilizes an off-the-shelf morphological analyzer to split training words into their constituent morphemes. A language model is subsequently trained on ngrams composed of morphemes, morpheme boundaries, and word boundaries. Finally, linguistic constraints are used to weed out phonotactically ill-formed segmentations, thereby allowing the language model to select the best grammatical segmentation. This approach achieves an accuracy of ∼97%.
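The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the hand-segmented training words stand in for the output of a morphological analyzer, the model is a bigram model with add-k smoothing (the value of `K` is an arbitrary choice), and `well_formed` is a hypothetical, heavily simplified stand-in for real Finnish phonotactic constraints. All function names are illustrative, and candidates are limited to at most one split point.

```python
import math
from collections import Counter

WORD, BOUNDARY = "#", "="      # word-boundary and constituent-boundary tokens
VOWELS = set("aeiouyäö")
K = 0.1                        # add-k smoothing constant (assumed value)

def to_tokens(parts):
    """Wrap constituents in word boundaries, separated by boundary tokens."""
    tokens = [WORD]
    for i, part in enumerate(parts):
        if i:
            tokens.append(BOUNDARY)
        tokens.append(part)
    tokens.append(WORD)
    return tokens

def train(segmented_words):
    """Count bigrams and their left contexts over the token sequences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for parts in segmented_words:
        tokens = to_tokens(parts)
        vocab.update(tokens)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab) + 1   # +1 reserves mass for unseen tokens

def log_prob(tokens, model):
    """Add-k smoothed bigram log-probability of a token sequence."""
    contexts, bigrams, v = model
    return sum(
        math.log((bigrams[(a, b)] + K) / (contexts[a] + K * v))
        for a, b in zip(tokens, tokens[1:])
    )

def candidates(word):
    """The unsplit word plus every two-way split."""
    yield [word]
    for i in range(1, len(word)):
        yield [word[:i], word[i:]]

def well_formed(parts):
    """Toy constraint: each part has at least two letters and a vowel."""
    return all(len(p) >= 2 and any(ch in VOWELS for ch in p) for p in parts)

def split_compound(word, model):
    """Filter ill-formed candidates, then let the language model choose."""
    legal = [c for c in candidates(word) if well_formed(c)]
    return max(legal, key=lambda c: log_prob(to_tokens(c), model))

# Hand-segmented toy training data (stand-in for analyzer output).
TRAINING = [
    ["kirja", "kauppa"], ["kirja", "hylly"], ["ruoka", "kauppa"],
    ["kauppa"], ["kirja"], ["talo"],
]
model = train(TRAINING)
print(split_compound("ruokahylly", model))
```

On this toy data the unseen compound "ruokahylly" is split into "ruoka" and "hylly", while the simplex word "talo" is left whole. A very small smoothing constant matters here: with add-one smoothing and so little training data, an unknown unsplit word can outscore a split into known constituents.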
Similar resources
Accounting ngrams and multi-word terms can improve topic models
The paper presents an empirical study of integrating ngrams and multi-word terms into topic models, while maintaining similarities between them and words based on their component structure. First, we adapt the PLSA-SIM algorithm to the more widespread LDA model and ngrams. Then we propose a novel algorithm LDA-ITER that allows the incorporation of the most suitable ngrams into topic models. The...
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information
We developed a search tool for ngrams extracted from a very large corpus (the current system uses the entire Wikipedia, which has 1.7 billion tokens). The tool supports queries with an arbitrary number of wildcards and/or specification by a combination of token, POS, chunk (such as NP, VP, PP) and Named Entity (NE). It outputs the matched ngrams with their frequencies as well as all the context...
Occurrence Based Statistics in Machine Translation
As MT approaches demand longer context for better translation quality, the limitations of current language modeling techniques become explicit. The computational inability to model, in a probabilistic manner, the likelihood of longer ngrams and of their usage has prevented us from exploring long ngrams in MT. In this paper, we propose and investigate a new set of features called oc...
Finding the Correct Interpretation of Swedish Compounds, a Statistical Approach
This paper treats compound splitting for Swedish, where compounding is productive and very common. A method for splitting compounds and several methods for choosing the correct interpretation of ambiguous compounds are presented. 99% of all compounds are split, and 97% of these are correctly interpreted.
Effects of Location in the Tree Canopy on Some Quality Characteristics of Fresh Pistachio Fruit
Fresh pistachio fruit cv. Kalleghochi was harvested from the exterior and interior parts of the tree canopy in four geographical directions. The fruit position in exterior and interior parts of the tree canopy has a significant influence on the number of nuts per ounce, pistachio splitting, hull weight, shell weight, kernel weight, colour indices and total anthocyanin content. Results indicated...