Chinese Web Scale Linguistic Datasets and Toolkit

Authors

  • Chi-Hsin Yu
  • Hsin-Hsi Chen
Abstract

The web provides a huge collection of pages for researchers studying natural languages. However, processing web-scale text is not easy and requires substantial computational and linguistic resources. In this paper, we introduce two Chinese part-of-speech (POS) tagged web-scale datasets and describe tools that make them easy to use in NLP applications. The first is a Chinese segmented and POS-tagged dataset whose materials are selected from the ClueWeb09 dataset. The second is a Chinese POS n-gram corpus extracted from the POS-tagged dataset. Tools for accessing the POS-tagged dataset and the POS n-gram corpus are presented. The two datasets will be released to the public along with their tools.
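
As a rough illustration of how a POS n-gram corpus can be derived from a segmented, POS-tagged dataset, the sketch below counts tag n-grams over tagged sentences. It is not the authors' toolkit: the `word/TAG` line format, the tag names, and the function names `parse_tagged` and `count_pos_ngrams` are all assumptions made for this example.

```python
from collections import Counter


def parse_tagged(line, sep="/"):
    """Split a 'word/TAG word/TAG' line into (word, tag) pairs.

    The 'word/TAG' layout is an assumed format, not the paper's.
    A token without the separator yields an empty word and the
    whole token as its tag.
    """
    pairs = []
    for token in line.split():
        word, _, tag = token.rpartition(sep)
        pairs.append((word, tag))
    return pairs


def count_pos_ngrams(lines, n=2):
    """Count POS-tag n-grams within each tagged sentence."""
    counts = Counter()
    for line in lines:
        tags = [tag for _, tag in parse_tagged(line)]
        # Slide a window of n tags; n-grams never cross sentence boundaries.
        for i in range(len(tags) - n + 1):
            counts[tuple(tags[i:i + n])] += 1
    return counts


# Hypothetical segmented and tagged sentences (tags are illustrative).
sample = [
    "我/PN 喜歡/VV 語言學/NN",
    "他/PN 喜歡/VV 音樂/NN",
]

bigrams = count_pos_ngrams(sample, n=2)
trigrams = count_pos_ngrams(sample, n=3)
```

At web scale the counts would not fit in a single in-memory `Counter`; a real pipeline would more likely shard the input and merge partial counts, which this sketch leaves out.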

Similar resources

A toolkit for analysing large-scale plant small RNA datasets

Recent developments in high-throughput sequencing technologies have generated considerable demand for tools to analyse large datasets of small RNA sequences. Here, we describe a suite of web-based tools for processing plant small RNA datasets. Our tools can be used to identify micro RNAs and their targets, compare expression levels in sRNA loci, and find putative trans-acting siRNA l...

MSR SPLAT, a language analysis toolkit

We describe MSR SPLAT, a toolkit for language analysis that allows easy access to the linguistic analysis tools produced by the NLP group at Microsoft Research. The tools include both traditional linguistic analysis tools such as part-of-speech taggers, constituency and dependency parsers, and more recent developments such as sentiment detection and linguistically valid morphology. As we expand...

Towards a Semantic Web of Relational Databases: A Practical Semantic Toolkit and an In-Use Case from Traditional Chinese Medicine

Integrating relational databases is recently acknowledged as an important vision of the Semantic Web research, however there are not many well-implemented tools and not many applications that are in large-scale real use either. This paper introduces the Dartgrid which is an application development framework together with a set of semantic tools to facilitate the integration of heterogeneous relat...

Annotation Tool Development for Large-Scale Corpus Creation Projects at the Linguistic Data Consortium

The Linguistic Data Consortium (LDC) creates a variety of linguistic resources – data, annotations, tools, standards and best practices – for many sponsored projects. The programming staff at LDC has created the tools and technical infrastructures to support the data creation efforts for these projects, creating tools and technical infrastructures for all aspects of data creation projects: data...

From Legacy Relational Databases to the Semantic Web: an In-Use Application for Traditional Chinese Medicine

Integrating relational databases is recently acknowledged as an important vision of the Semantic Web research, however there are not many applications that are well-implemented and in large-scale real use. This paper introduces an in-use application deployed at China Academy of Traditional Chinese Medicine (CATCM). In this application, over 70 legacy relational databases are semantically interco...

Publication date: 2012