Deriving Paraphrases for Highly Inflected Languages from Comparable Documents
نویسندگان
چکیده
We describe an automatic paraphrase-inference procedure for a highly inflected language like Arabic. Paraphrases are derived from comparable documents, that is, distinct documents dealing with the same topic. A co-training approach is taken, with two classifiers, one designed to model the contexts surrounding occurrences of paraphrases, and the other trained to identify significant features of the words within paraphrases. In particular, we use morpho-syntactic features calculated for both classifiers, as is to be expected when working with highly inflected languages. We provide some experimental results for Arabic, and for the simpler English, which we find to be encouraging. Our immediate interest is to incorporate such paraphrases within an Arabic-toEnglish translation system.
منابع مشابه
Inferring Paraphrases for a Highly Inflected Language from a Monolingual Corpus
We suggest a new technique for deriving paraphrases from a monolingual corpus, supported by a relatively small set of comparable documents. Two somewhat similar phrases that each occur in one of a pair of documents dealing with the same incident are taken as potential paraphrases, which are evaluated based on the contexts in which they appear in the larger monolingual corpus. We apply this tech...
متن کاملExternal Plagiarism Detection based on Human Behaviors in Producing Paraphrases of Sentences in English and Persian Languages
With the advent of the internet and easy access to digital libraries, plagiarism has become a major issue. Applying search engines is one of the plagiarism detection techniques that converts plagiarism patterns to search queries. Generating suitable queries is the heart of this technique and existing methods suffer from lack of producing accurate queries, Precision and Speed of retrieved result...
متن کاملAutomatic Semantic Subject Indexing of Web Documents in Highly Inflected Languages
Structured semantic metadata about unstructured web documents can be created using automatic subject indexing methods, avoiding laborious manual indexing. A succesful automatic subject indexing tool for the web should work with texts in multiple languages and be independent of the domain of discourse of the documents and controlled vocabularies. However, analyzing text written in a highly infle...
متن کاملImproved Statistical Machine Translation Using Monolingually-Derived Paraphrases
Untranslated words still constitute a major problem for Statistical Machine Translation (SMT), and current SMT systems are limited by the quantity of parallel training texts. Augmenting the training data with paraphrases generated by pivoting through other languages alleviates this problem, especially for the so-called “low density” languages. But pivoting requires additional parallel texts. We...
متن کاملStatistical Machine Translation of Australian Aboriginal Languages: Morphological Analysis with Languages of Differing Morphological Richness
Morphological analysis is often used during preprocessing in Statistical Machine Translation. Existing work suggests that the benefit would be greater for more highly inflected languages, although to our knowledge this has not been systematically tested on languages with comparable morphology. In this paper, two comparable languages with different amounts of inflection are tested, to see if the...
متن کامل