Coarse-grained Cross-lingual Alignment of Comparable Texts with Topic Models and Encyclopedic Knowledge
Authors
Abstract
We present a method for coarse-grained cross-lingual alignment of comparable texts: segments consisting of contiguous paragraphs that discuss the same theme (e.g. history, economy) are aligned based on induced multilingual topics. The method combines three ideas: a two-level LDA model that filters out words that do not convey themes, an HMM that models the ordering of themes in the collection of documents, and language-independent concept annotations that serve as a cross-language bridge and strengthen the connection between paragraphs in the same segment through concept relations. The method is evaluated on English and French data previously used for monolingual alignment. The results show state-of-the-art performance in both monolingual and cross-lingual settings.
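The core alignment step, pairing segments across languages by the similarity of their induced topic distributions, can be sketched as follows. This is a minimal illustration, not the paper's actual model: it assumes segment-level topic vectors have already been inferred by some multilingual topic model, and uses a simple greedy cosine-similarity match in place of the paper's HMM over theme orderings.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two topic-distribution vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def align_segments(src_topics, tgt_topics, threshold=0.5):
    """Greedily pair each source segment with its most similar target
    segment, keeping only pairs whose similarity clears the threshold.

    src_topics, tgt_topics: lists of topic-distribution vectors, one per
    segment (assumed to come from a shared multilingual topic space).
    Returns a list of (src_index, tgt_index) pairs.
    """
    pairs = []
    for i, s in enumerate(src_topics):
        sims = [cosine(s, t) for t in tgt_topics]
        j = int(np.argmax(sims))
        if sims[j] >= threshold:
            pairs.append((i, j))
    return pairs

# Toy example: three English and three French segments over four topics
# (the topic weights are invented for illustration)
en = [np.array([0.7, 0.1, 0.1, 0.1]),
      np.array([0.1, 0.7, 0.1, 0.1]),
      np.array([0.1, 0.1, 0.7, 0.1])]
fr = [np.array([0.1, 0.65, 0.15, 0.1]),
      np.array([0.65, 0.15, 0.1, 0.1]),
      np.array([0.1, 0.1, 0.1, 0.7])]
print(align_segments(en, fr))  # [(0, 1), (1, 0)]
```

The third English segment finds no French counterpart above the threshold and is left unaligned, which is the desired behavior for comparable (rather than parallel) corpora, where not every theme appears in both documents.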
Similar references
Probabilistic topic modeling in multilingual settings: An overview of its methodology and applications
Probabilistic topic models are unsupervised generative models which model document content as a two-step generation process, that is, documents are observed as mixtures of latent concepts or topics, while topics are probability distributions over vocabulary words. Recently, a significant research effort has been invested into transferring the probabilistic topic modeling concept from monolingua...
Cross-lingual Semantic Relatedness Using Encyclopedic Knowledge
In this paper, we address the task of cross-lingual semantic relatedness. We introduce a method that relies on information extracted from Wikipedia, exploiting the interlanguage links available between Wikipedia versions in multiple languages. Through experiments performed on several language pairs, we show that the method performs well, with a performance comparable to monolingual measur...
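The interlanguage-link bridge this abstract describes can be sketched as follows. Everything here is illustrative, not the paper's actual representation: texts are assumed to be already mapped to sparse Wikipedia-concept vectors (in the spirit of explicit semantic analysis), and the link table and weights are invented toy values.

```python
import math

def translate_concept_vector(vec, interlang):
    """Map a concept vector keyed by source-language Wikipedia article
    titles into target-language titles via interlanguage links;
    concepts with no link are dropped."""
    return {interlang[c]: w for c, w in vec.items() if c in interlang}

def relatedness(vec_a, vec_b):
    """Cosine similarity over sparse concept vectors stored as dicts."""
    dot = sum(w * vec_b.get(c, 0.0) for c, w in vec_a.items())
    na = math.sqrt(sum(w * w for w in vec_a.values()))
    nb = math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical interlanguage links and concept vectors
links = {"Economy": "Économie", "History": "Histoire"}
en_vec = {"Economy": 0.8, "History": 0.6}
fr_vec = {"Économie": 0.8, "Histoire": 0.6}
score = relatedness(translate_concept_vector(en_vec, links), fr_vec)
print(score)  # 1.0 for these identical toy vectors
```

Translating one side's concept vector through the link table puts both texts in the same concept space, after which any monolingual similarity measure applies unchanged.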
Monolingual and Cross-Lingual Probabilistic Topic Models and Their Applications in Information Retrieval
Probabilistic topic models are a group of unsupervised generative machine learning models that can be effectively trained on large text collections. They model document content as a two-step generation process, i.e., documents are observed as mixtures of latent topics, while topics are probability distributions over vocabulary words. Recently, a significant research effort has been invested int...
Cross-Lingual Text Fragment Alignment Using Divergence from Randomness
This paper describes an approach to automatically aligning fragments of text across two documents in different languages. A text fragment is a sequence of contiguous sentences, and an aligned pair of fragments consists of two fragments, one in each document, that are content-wise related. Cross-lingual similarity between text fragments is estimated based on models of divergence from randomness. A set of a...
Probabilistic Models of Cross-Lingual Semantic Similarity in Context Based on Latent Cross-Lingual Concepts Induced from Comparable Data
We propose the first probabilistic approach to modeling cross-lingual semantic similarity (CLSS) in context that requires only comparable data. The approach relies on the idea of projecting words and sets of words into a shared latent semantic space spanned by language-pair-independent latent semantic concepts (e.g., cross-lingual topics obtained by a multilingual topic model). These latent cros...
Journal: CoRR
Volume: abs/1411.7820
Issue: -
Pages: -
Publication date: 2014