Extracting Semantic Classes and Morphosyntactic Features for English-Polish Machine Translation
Abstract
This paper describes a procedure for the automatic extraction of certain noun and verb categories from Polish texts. The general goal is to construct a lexical database to be incorporated into a system for machine translation and multilingual summary generation. High-quality processing of inflectional languages such as Polish requires quite elaborate lexical entries; it is therefore highly desirable to automate lexicon construction, at least partially. However, purely statistical methods developed for languages with less elaborate inflectional systems do not perform especially well on Slavic languages. As primary cues for automatic subcategorization we used the inflectional morphemes that express the greatest number of semantico-syntactic functions. The crucial semantic category for noun classification was the degree of animacy. Morphosyntactically, this category is expressed by nominal suffixes and subject-verb agreement markers. The procedure for lexical extraction and classification was implemented in Delphi, and the system was trained to extract so-called superanimate nouns, i.e. nouns denoting male human beings or groups including both male and female humans. The usability of lexical extraction based on the co-occurrence of morphological features, rather than on the co-occurrence of whole word forms, is evaluated and discussed.
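The idea of using agreement markers as a cue for animacy can be illustrated with a minimal sketch. It is not the authors' Delphi implementation; it is a toy Python illustration built on one well-known fact of Polish grammar: past-tense third-person plural verbs end in "-li" with virile (masculine personal) subjects and in "-ły" otherwise. The function name `classify_nouns`, the adjacency heuristic (the token directly before the verb is taken as its subject), and the sample sentences are all assumptions made for this example.

```python
from collections import Counter

# Assumed cue (not taken from the paper): Polish past-tense third-person
# plural verbs end in "-li" for virile subjects and in "-ły" otherwise.
VIRILE_VERB_SUFFIX = "li"
NONVIRILE_VERB_SUFFIX = "ły"


def classify_nouns(sentences, min_count=1):
    """Label word forms as 'superanimate' or 'other' from verb agreement.

    Hypothetical heuristic: the token immediately before a past-tense
    plural verb is treated as that verb's subject.
    """
    virile = Counter()
    nonvirile = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Walk over adjacent (candidate subject, candidate verb) pairs.
        for subject, verb in zip(tokens, tokens[1:]):
            if verb.endswith(VIRILE_VERB_SUFFIX):
                virile[subject] += 1
            elif verb.endswith(NONVIRILE_VERB_SUFFIX):
                nonvirile[subject] += 1

    labels = {}
    for noun in set(virile) | set(nonvirile):
        total = virile[noun] + nonvirile[noun]
        if total < min_count:
            continue  # too little evidence for this word form
        # Majority vote over the agreement markers seen with this form.
        if virile[noun] > nonvirile[noun]:
            labels[noun] = "superanimate"
        else:
            labels[noun] = "other"
    return labels


# Toy sentences invented for the example.
sentences = [
    "profesorowie czytali książki",
    "kobiety czytały książki",
    "studenci pisali listy",
]
labels = classify_nouns(sentences)
```

The sketch counts a morphological feature (the verb ending) rather than whole verb forms, so evidence from "czytali" and "pisali" pools into the same class. A realistic system would also need part-of-speech information and the nominal suffix cues mentioned in the abstract, which this example leaves out.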
Related resources
Arabic-English Semantic Word Class Alignment to Improve Statistical Machine Translation
Clustering words is a widely used technique in statistical natural language processing. It requires syntactic, semantic, and contextual features. Especially, semantic clustering is gaining a lot of interest. It consists in grouping a set of words expressing the same idea or sharing the same semantic properties. In this paper, we present a new method to integrate semantic classes in a Statistica...
A Comparative Study of English-Persian Translation of Neural Google Translation
Many studies abroad have focused on neural machine translation and almost all concluded that this method was much closer to humanistic translation than machine translation. Therefore, this paper aimed at investigating whether neural machine translation was more acceptable in English-Persian translation in comparison with machine translation. Hence, two types of text were chosen to be translated...
Machine Learning of Syntactic Attachment from Morphosyntactic and Semantic Co-occurrence Statistics
The paper presents a novel approach to extracting dependency information in morphologically rich languages using co-occurrence statistics based not only on lexical forms (as in previously described collocation-based methods), but also on morphosyntactic and wordnet-derived semantic properties of words. Statistics generated from a corpus annotated only at the morphosyntactic level are used as fe...
Utilizing Semantic Equivalence Classes of Japanese Functional Expressions in Translation Rule Acquisition from Parallel Patent Sentences
In the “Sandglass” MT architecture, we identify the class of monosemous Japanese functional expressions and utilize it in the task of translating Japanese functional expressions into English. We employ the semantic equivalence classes of a recently compiled large scale hierarchical lexicon of Japanese functional expressions. We then study whether functional expressions within a class can be tra...
Boosting Statistical Machine Translation by Lemmatization and Linear Interpolation
Data sparseness is one of the factors that degrade statistical machine translation (SMT). Existing work has shown that using morphosyntactic information is an effective solution to data sparseness. However, fewer efforts have been made for Chinese-to-English SMT with using English morpho-syntactic analysis. We found that while English is a language with less inflection, using English lemmas in ...