Sary: Reusable Components and Tools for Searching Large Corpora
Author
Abstract
Corpus-based natural language processing deals with large corpora, so searching them efficiently is essential. For example, one might want to examine how a word or phrase is used in a large corpus, or to collect the frequencies of all terms in it. Our system, Sary, addresses these problems by providing fast full-text search facilities for a single large text on the order of 100 MB, using a data structure called a suffix array [2]. Sary provides not only useful tools for searching large corpora but also well-implemented libraries as reusable components.
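To illustrate the idea behind suffix-array search, here is a minimal Python sketch. It is not Sary's implementation or API: the function names (build_suffix_array, find_range, search, count) are hypothetical, and the construction is a naive sort that is only suitable for small texts, not for a 100 MB corpus. It shows how a sorted array of suffix positions lets a binary search locate every occurrence of a pattern and count its frequency.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes, sorted lexicographically.

    Naive construction (assumption for illustration): it sorts by the suffix
    strings themselves, which is far too slow and memory-hungry for large texts.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_range(text: str, sa: list[int], pattern: str) -> tuple[int, int]:
    """Return the half-open range [start, end) of suffix-array entries
    whose suffixes begin with `pattern`, using two binary searches."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def search(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return all positions in `text` where `pattern` occurs, in text order."""
    start, end = find_range(text, sa, pattern)
    return sorted(sa[start:end])


def count(text: str, sa: list[int], pattern: str) -> int:
    """Return the number of occurrences of `pattern` in `text`."""
    start, end = find_range(text, sa, pattern)
    return end - start


if __name__ == "__main__":
    corpus = "banana"
    sa = build_suffix_array(corpus)
    print(sa)                         # [5, 3, 1, 0, 4, 2]
    print(search(corpus, sa, "ana"))  # [1, 3]
    print(count(corpus, sa, "a"))     # 3
```

Once the array is built, each lookup needs only a logarithmic number of string comparisons in the length of the text, which is what makes this kind of index attractive for repeated queries against a single large text.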
Similar resources
Reusable Software Components
Abstract: An empirical study of methods for representing reusable software components is described. Thirty-five subjects searched for reusable components in a database of UNIX tools using four different representation methods: attribute-value, enumerated, faceted, and keyword. The study used Proteus, a reuse library system that supports multiple representation methods. Searching effectiveness was...
Intelligent Component Retrieval for Software Reuse
Our research centers around exploring methodologies for developing reusable software, and developing methods and tools for building with reusable software. Roughly speaking, developing with reusable components involves three steps: 1) searching and retrieving reusable components based on partial specifications, 2) assessing the reuse worth of the retrieved components, and, possibly, 3) tailorin...
Reusable Tagset Conversion Using Tagset Drivers
Part-of-speech or morphological tags are important means of annotation in a vast number of corpora. However, different sets of tags are used in different corpora, even for the same language. Tagset conversion is difficult, and solutions tend to be tailored to a particular pair of tagsets. We propose a universal approach that makes the conversion tools reusable. We also provide an indirect evalu...
A Neural Network based Method to Optimize the Software Component Searching Results in K-Model
Here we propose an approach to storing and retrieving reusable software components based on UML diagrams, a metadata repository, and a neural network. Searching the repository on the basis of attributes of MDL file descriptions gives better results, and thus higher precision, than keyword-based search; a neural network is then applied to the search results of reusable software ...
Multiple Annotations of Reusable Data Resources: Corpora for Topic Detection and Tracking
Responding to demands for very large, easily accessible, reusable news corpora to support research in the topic detection and tracking paradigm, the Linguistic Data Consortium created the TDT corpora. In addition to supporting research in the Topic Detection and Tracking program, the TDT corpora were collected and annotated with an eye toward reuse and re-annotation. Their value is confirmed in...