A survey of cross-lingual embedding models
Abstract
Cross-lingual embedding models allow us to project words from different languages into a shared embedding space. This allows us to apply models trained on languages with a lot of data, e.g. English, to low-resource languages. In the following, we will survey models that seek to learn cross-lingual embeddings. We will discuss them based on the type of approach and the nature of the parallel data that they employ. Finally, we will present challenges and summarize how to evaluate cross-lingual embedding models.

In recent years, driven by the success of word embeddings, many models that learn accurate representations of words have been proposed [Mikolov et al., 2013a, Pennington et al., 2014]. However, these models are generally restricted to capturing representations of words in the language they were trained on. The availability of resources, training data, and benchmarks in English leads to a disproportionate focus on the English language and a neglect of the plethora of other languages that are spoken around the world. In our globalised society, where national borders increasingly blur and where the Internet gives everyone equal access to information, it is thus imperative that we not only seek to eliminate bias pertaining to gender or race [Bolukbasi et al., 2016] inherent in our representations, but also aim to address our bias towards language. To remedy this and level the linguistic playing field, we would like to leverage our existing knowledge in English to equip our models with the capability to process other languages. Perfect machine translation (MT) would allow this. However, we do not need to actually translate examples, as long as we are able to project examples into a common subspace such as the one in Figure 1.

Figure 1: A shared embedding space between two languages [Luong et al., 2015]

∗ This article originally appeared as a blog post at http://sebastianruder.com/cross-lingual-embeddings/index.html on 28 November 2016.

Ultimately, our goal is to learn a shared embedding space between words in all languages. Equipped with such a vector space, we are able to train our models on data in any language. By projecting examples available in one language into this space, our model simultaneously obtains the capability to perform predictions in all other languages (we are glossing over some considerations here; for these, refer to Section 7). This is the promise of cross-lingual embeddings.

Over the course of this survey, we will give an overview of models and algorithms that have been used to come closer to the elusive goal of capturing the relations between words in multiple languages in a common embedding space. Note that while neural MT approaches implicitly learn a shared cross-lingual embedding space by optimizing for the MT objective, we will focus on models that explicitly learn cross-lingual word representations throughout this blog post. These methods generally do so at a much lower cost than MT and can be considered to be to MT what word embedding models [Mikolov et al., 2013a, Pennington et al., 2014] are to language modelling.

1 Types of cross-lingual embedding models

In recent years, various models for learning cross-lingual representations have been proposed. In the following, we will order them by the type of approach that they employ.
Note that while the nature of the parallel data used is equally discriminative and has been shown to account for inter-model performance differences [Levy et al., 2017], we consider the type of approach more conducive to understanding the assumptions a model makes and, consequently, its advantages and deficiencies.

Cross-lingual embedding models generally use one of four approaches:

1. Monolingual mapping: These models initially train monolingual word embeddings on large monolingual corpora. They then learn a linear mapping between monolingual representations in different languages that enables them to map unknown words from the source language to the target language (a minimal sketch of this approach follows after these lists).

2. Pseudo-cross-lingual: These approaches create a pseudo-cross-lingual corpus by mixing contexts of different languages. They then train an off-the-shelf word embedding model on the created corpus. The intuition is that the cross-lingual contexts allow the learned representations to capture cross-lingual relations.

3. Cross-lingual training: These models train their embeddings on a parallel corpus and optimize a cross-lingual constraint between embeddings of different languages that encourages embeddings of similar words to be close to each other in a shared vector space.

4. Joint optimization: These approaches train their models on parallel (and optionally monolingual) data. They jointly optimize a combination of monolingual and cross-lingual losses.

In terms of parallel data, methods may use different supervision signals that depend on the type of data used. These are, from most to least expensive:

1. Word-aligned data: A parallel corpus with word alignments that is commonly used for machine translation; this is the most expensive type of parallel data to use.

2. Sentence-aligned data: A parallel corpus without word alignments. If not otherwise specified, the model uses the Europarl corpus (http://www.statmt.org/europarl/), consisting of sentence-aligned text from the proceedings of the European Parliament, which is commonly used for training statistical machine translation models.

3. Document-aligned data: A corpus containing documents in different languages. The documents can be topic-aligned (e.g. Wikipedia) or label/class-aligned (e.g. sentiment analysis and multi-class classification datasets).

4. Lexicon: A bilingual or cross-lingual dictionary with pairs of translations between words in different languages.

5. No parallel data: No parallel data whatsoever. Learning cross-lingual representations from only monolingual resources would enable zero-shot learning across languages.
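To make the first approach (monolingual mapping) concrete, the sketch below learns a least-squares linear map W between two pre-trained monolingual embedding spaces from a small seed lexicon and then retrieves translations by nearest-neighbour search in the target space. This is a minimal illustration in the spirit of the translation-matrix idea, not code from the survey; the toy vocabularies, the 50-dimensional random vectors, and the helper names learn_mapping and translate are illustrative assumptions.

# Minimal sketch of approach 1 (monolingual mapping): learn a linear map W
# that projects source-language vectors into the target-language space,
# using a small seed lexicon of translation pairs.
import numpy as np

def learn_mapping(src_vecs, tgt_vecs):
    """Least-squares solution of min_W ||src_vecs @ W - tgt_vecs||_F^2."""
    W, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return W

def translate(word, src_emb, tgt_emb, W, k=5):
    """Project a source word into the target space and return its k nearest
    target-language neighbours by cosine similarity."""
    query = src_emb[word] @ W
    tgt_words = list(tgt_emb.keys())
    tgt_matrix = np.stack([tgt_emb[w] for w in tgt_words])
    sims = (tgt_matrix @ query) / (
        np.linalg.norm(tgt_matrix, axis=1) * np.linalg.norm(query) + 1e-9)
    return [tgt_words[i] for i in np.argsort(-sims)[:k]]

# Toy random embeddings and a hypothetical seed lexicon, for illustration only.
rng = np.random.default_rng(0)
src_emb = {w: rng.normal(size=50) for w in ["dog", "cat", "house"]}
tgt_emb = {w: rng.normal(size=50) for w in ["hund", "katze", "haus"]}
lexicon = [("dog", "hund"), ("cat", "katze"), ("house", "haus")]

X = np.stack([src_emb[s] for s, _ in lexicon])  # source-side seed vectors
Z = np.stack([tgt_emb[t] for _, t in lexicon])  # target-side seed vectors
W = learn_mapping(X, Z)
print(translate("dog", src_emb, tgt_emb, W))

With real pre-trained embeddings and a few thousand seed pairs, the same recipe yields a simple bilingual lexicon induction baseline; refinements of this idea constrain W, for example to be orthogonal, to better preserve the structure of the monolingual spaces.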
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
Journal: CoRR
Volume: abs/1706.04902
Publication date: 2017