Sary: Reusable Components and Tools for Searching Large Corpora

Author

  • Satoru Takabayashi
Abstract

Because corpus-based natural language processing has to deal with large corpora, searching them efficiently is essential. For example, one might want to examine how a word or phrase is used in a corpus, or to collect the frequencies of all terms in it. Our system, Sary, addresses these problems by providing fast full-text search over a single large text, on the order of 100 MB, using a data structure called a suffix array [2]. Sary provides not only useful tools for searching large corpora but also well-implemented libraries that serve as reusable components.
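To make the suffix-array idea concrete, here is a minimal sketch in Python. It is not Sary's API or implementation: the function names `build_suffix_array` and `find_occurrences` are hypothetical, and the naive construction shown is only suitable for small texts, whereas a tool aimed at texts of around 100 MB would need a far more memory-efficient build. The sketch only illustrates why a sorted array of suffix positions allows a phrase to be located by binary search.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of text, sorted lexicographically.

    Naive construction for illustration only (it compares whole suffixes).
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return every position where pattern occurs in text, in ascending order.

    All suffixes that begin with pattern are adjacent in the suffix array,
    so two binary searches find the boundaries of that block.
    """
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


if __name__ == "__main__":
    corpus = "to be or not to be"
    sa = build_suffix_array(corpus)
    print(find_occurrences(corpus, sa, "to"))  # positions of the phrase "to"
    print(len(find_occurrences(corpus, sa, "be")))  # frequency of "be"
```

The length of the matching block gives the frequency of a term directly, which corresponds to the second use case mentioned in the abstract.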

Similar resources

Reusable Software Components

Abstract: An empirical study of methods for representing reusable software components is described. Thirty-five subjects searched for reusable components in a database of UNIX tools using four different representation methods: attribute-value, enumerated, faceted, and keyword. The study used Proteus, a reuse library system that supports multiple representation methods. Searching effectiveness was...

Intelligent Component Retrieval for Software Reuse

Our research centers around exploring methodologies for developing reusable software, and developing methods and tools for building with reusable software. Roughly speaking, developing with reusable components involves three steps: 1) searching and retrieving reusable components based on partial specifications, 2) assessing the reuse worth of the retrieved components, and, possibly, 3) tailorin...

Reusable Tagset Conversion Using Tagset Drivers

Part-of-speech or morphological tags are important means of annotation in a vast number of corpora. However, different sets of tags are used in different corpora, even for the same language. Tagset conversion is difficult, and solutions tend to be tailored to a particular pair of tagsets. We propose a universal approach that makes the conversion tools reusable. We also provide an indirect evalu...

A Neural Network based Method to Optimize the Software Component Searching Results in K-Model

Here we propose an approach to storing and retrieving reusable software components based on UML diagrams, a metadata repository, and a neural network. Searching the repository on the basis of attributes of MDL file descriptions gives better results, and thus higher precision, than keyword-based search; we then apply a neural network to the search results of reusable software ...

Multiple Annotations of Reusable Data Resources: Corpora for Topic Detection and Tracking

Responding to demands for very large, easily accessible, reusable news corpora to support research in the topic detection and tracking paradigm, the Linguistic Data Consortium created the TDT corpora. In addition to supporting research in the Topic Detection and Tracking program, the TDT corpora were collected and annotated with an eye toward reuse and re-annotation. Their value is confirmed in...

Publication date: 2001