ZT Corpus: Annotation and Tools for Basque Corpora

Authors

  • Nerea Areta
  • Antton Gurrutxaga
  • Igor Leturia
  • Iñaki Alegria
  • Xabier Artola
  • Arantza Díaz de Ilarraza
  • Nerea Ezeiza
  • Aitor Sologaistoa
Abstract

The ZT Corpus (Basque Corpus of Science and Technology) is a tagged collection of specialised texts in Basque, which aims to be a major resource for research and development on written technical Basque: terminology, syntax and style. It was released in December 2006 and can be queried at http://www.ztcorpusa.net. The ZT Corpus stands out among other Basque corpora for many reasons: it is the first specialised corpus in Basque; it has been designed to be a methodological and functional reference for new projects in the future (i.e. a national corpus for Basque); it is the first corpus in Basque annotated using a TEI-P4 compliant XML format; it is the first written corpus in Basque to be distributed by ELDA; and it has a friendly and sophisticated query interface. The corpus has two kinds of annotation: a structural annotation and a stand-off linguistic annotation. It is composed of two parts: a balanced 1.6-million-word part, whose annotation has been revised by hand, and an automatically tagged 6-million-word part. The project is not closed, and we intend to gradually enlarge and improve the corpus. We also present the technology and the tools used to build it. These tools, Corpusgile and Eulia, provide a flexible and extensible infrastructure for creating, visualising and managing corpora, and for consulting, visualising and modifying annotations generated by linguistic tools. Finally, we introduce the web interface for querying the ZT Corpus, which offers some advanced features that are new to Basque corpora.

Similar resources

Structure, Annotation and Tools in the Basque ZT Corpus

The ZT corpus (Basque Corpus of Science and Technology) is a tagged collection of specialized texts in Basque, which aims to be a major resource for research and development on written technical Basque: terminology, syntax and style. It will be the first written corpus in Basque to be distributed by ELDA (at the end of 2006) and it aims to be a methodological and functional reference...

QTLeap WSD/NED Corpora: Semantic Annotation of Parallel Corpora in Six Languages

This work presents parallel corpora automatically annotated with several NLP tools, including lemma and part-of-speech tagging, named-entity recognition and classification, named-entity disambiguation, word-sense disambiguation, and coreference. The corpora comprise both the well-known Europarl corpus and a domain-specific question-answer troubleshooting corpus in the IT domain. English is comm...

Improving the Basque WordNet by corpus annotation

This paper describes the methodology adopted to jointly develop the Basque WordNet and a hand-annotated corpus (the Basque Semcor). This joint development allows for better-motivated sense distinctions and a tighter coupling between both resources. The methodology involves editing, tagging and refereeing tasks. We are currently halfway through the nominal part of the 300,000-word corpus (roug...

Gross-grained RST through XML Metadata for Multilingual Document Generation

We present an RST-based discourse annotation proposal used in the construction of a trial multilingual XML-tagged corpus of teaching material in Basque, English and Spanish. The corpus feeds an experimental multilingual document generation system for the web. The main contributions of this paper are an implementation of RST through XML metadata and the adoption of gross-grained RST to avoid non...

Named Entities Translation Based On Comparable Corpora

In this paper we present a system for translating named entities from Basque to Spanish based on comparable corpora. For that purpose we have tried two approaches: one based on Basque linguistic features, and the other a language-independent tool. For both we have used Basque-Spanish comparable corpora, a bilingual dictionary and the web as resources.


Publication date: 2007