ASPEC: Asian Scientific Paper Excerpt Corpus

Authors

  • Toshiaki Nakazawa
  • Manabu Yaguchi
  • Kiyotaka Uchimoto
  • Masao Utiyama
  • Eiichiro Sumita
  • Sadao Kurohashi
  • Hitoshi Isahara
Abstract

In this paper, we describe the details of ASPEC (Asian Scientific Paper Excerpt Corpus), the first large-scale parallel corpus in the scientific paper domain. ASPEC was constructed as part of the Japanese-Chinese machine translation project, conducted between 2006 and 2010 and funded by the Special Coordination Funds for Promoting Science and Technology. It consists of a Japanese-English scientific paper abstract corpus of approximately 3 million parallel sentences (ASPEC-JE) and a Chinese-Japanese scientific paper excerpt corpus of approximately 0.68 million parallel sentences (ASPEC-JC). ASPEC is used as the official dataset for WAT (Workshop on Asian Translation), a machine translation evaluation workshop.


Similar Resources

CKY-based Convolutional Attention for Neural Machine Translation

This paper proposes a new attention mechanism for neural machine translation (NMT) based on convolutional neural networks (CNNs), which is inspired by the CKY algorithm. The proposed attention represents every possible combination of source words (e.g., phrases and structures) through CNNs, which imitates the CKY table in the algorithm. NMT, incorporating the proposed attention, decodes a targe...


Translation Using JAPIO Patent Corpora: JAPIO at WAT2016

The Japan Patent Information Organization (JAPIO) participates in the scientific paper subtask (ASPEC-EJ/CJ) and the patent subtask (JPC-EJ/CJ/KJ) with phrase-based SMT systems that are trained on its own patent corpora. Using larger corpora than those prepared by the workshop organizer, we achieved higher BLEU scores than most participants in EJ and CJ translations of the patent subtask, but in crowdsourci...


Toshiba MT System Description for the WAT2014 Workshop

This paper provides a system description of Toshiba Machine Translation System for WAT2014. We participated in two tasks, namely Japanese-English translation and Japanese-Chinese translation. In each task, we submitted two results; one is a result of a rule-based translation system, and the other is a result which is an output of statistical post editing trained with the ASPEC training corpora....


Developing Asian language corpora: standards and practice

This paper first discusses standards for developing Asian language corpora so as to facilitate international data exchange. Following this, we present two corpora of Asian languages developed at Lancaster University – the EMILLE Corpus, which contains 14 South Asian languages, and the Lancaster Corpus of Mandarin Chinese. Finally, we will demonstrate how to explore these corpora using Xara and ...


Narrowing the Readability Gap Between Scientific Papers and the World Wide Web

As of today, publications are treated as self-contained entities, with usually a few tens of references to relevant papers in the field. References have a restricted semantic: they can only point to papers as a whole, rather than to a specific portion of the document (as anchor hyperlinks can do with HTML pages). The restriction is due in part to LaTeX–i.e., papers indeed are not hypertexts– al...



Publication date: 2016