JEIDA's Bilingual Corpus and Other Corpora for NLP Research in Japan
Abstract
The committee on text processing technology of JEIDA (Japan Electronics Industry Development Association) has been developing a bilingual corpus for research on machine translation systems since the 1996 Japanese fiscal year. This paper presents an overview of that corpus, along with other linguistic data recently developed in Japan, including the RWC text database and the simple sentence data by the CRL and IPA.
Similar resources
JEIDA's English-Japanese Bilingual Corpus Project
JEIDA (Japan Electronics Industry Development Association) has been developing a large bilingual aligned corpus for research in NLP since the 1996 Japanese fiscal year. In fiscal year 1996, JEIDA did a feasibility study and received permission from the Japanese Ministries to create such a resource. JEIDA then made a "small" sentence-aligned corpus in fiscal year 1997. JEIDA's new project sta...
Multilingual Document Alignment - A Study with Chinese and Japanese
The natural language processing (NLP) community is increasingly using parallel and comparable corpora for cross-linguistic research. The knowledge extracted from such corpora helps in cross-language information retrieval, topic detection and tracking, machine translation, and many other NLP tasks. Parallel or comparable corpora for the Japanese-Chinese language pair are rare. We investigate an automatic...
Bilingual Parallel Active Learning Between Chinese and English
Active learning is an effective machine learning paradigm that can significantly reduce the labor of manually annotating NLP corpora while achieving competitive performance. Previous studies on active learning have focused on corpora in a single language or in two languages translated from each other. This paper proposes a Bilingual Parallel Active Learning paradigm (BPAL), where an ...
Collocation Translation Acquisition Using Monolingual Corpora
Collocation translation is important for machine translation and many other NLP tasks. Unlike previous methods using bilingual parallel corpora, this paper presents a new method for acquiring collocation translations by making use of monolingual corpora and linguistic knowledge. First, dependency triples are extracted from Chinese and English corpora with dependency parsers. Then, a dependency ...
Bilingual Active Learning for Relation Classification via Pseudo Parallel Corpora
Active learning (AL) has proven effective in reducing human annotation effort in NLP. However, previous studies on AL are limited to applications in a single language. This paper proposes a bilingual active learning paradigm for relation classification, where the unlabeled instances are first jointly chosen in terms of their prediction uncertainty scores in two languages and then manually l...