Full-text Search for Thai Information Retrieval Systems
Abstract
Many efficient full-text search algorithms have been developed for English documents, and these algorithms can be applied directly to other languages such as Chinese, Japanese, and Thai. However, because of the idiosyncrasies of each language, applying them directly may not suit the language in question. This paper proposes a simplification of the Boyer-Moore algorithm, called BMT, that reduces computation and makes the algorithm appropriate for Thai full-text search. To investigate its efficiency, BMT is compared with other search algorithms. Moreover, we apply a syllable-like segmentation, called Thai character clusters (TCCs), to improve search efficiency in Thai documents by grouping Thai characters into inseparable units. TCCs are based on the spelling features of the Thai language. Compared with traditional full-text search methods, this approach improves not only search time and memory consumption but also search accuracy. The experimental results show that search methods using TCCs outperform the traditional full-text search methods.
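The two ideas in the abstract can be illustrated with a minimal sketch. It is not the paper's BMT algorithm or its TCC rules, neither of which is specified here. As assumptions, it substitutes Boyer-Moore-Horspool (a well-known simplification of Boyer-Moore) for BMT, and approximates clustering by attaching each Thai combining mark (Unicode category `Mn`) to the preceding base character. The function names `cluster` and `horspool` are hypothetical.

```python
import unicodedata


def cluster(text):
    """Crude stand-in for TCC segmentation (an assumption, not the paper's rules):
    attach each combining mark (category Mn) to the preceding base character."""
    units = []
    for ch in text:
        if units and unicodedata.category(ch) == "Mn":
            units[-1] += ch
        else:
            units.append(ch)
    return units


def horspool(text, pattern):
    """Boyer-Moore-Horspool over a sequence of units (a str or a list of clusters).
    Returns the index of the first match, or -1 if there is none."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Shift for each unit of the pattern except the last: distance from its
    # rightmost occurrence to the end of the pattern.
    shift = {u: m - 1 - i for i, u in enumerate(pattern[:-1])}
    pos = 0
    while pos + m <= n:
        if list(text[pos:pos + m]) == list(pattern):
            return pos
        # Shift by the unit aligned with the last pattern position.
        pos += shift.get(text[pos + m - 1], m)
    return -1


text = "\u0e17\u0e35\u0e48\u0e19\u0e35\u0e48"  # ที่นี่
query = "\u0e19\u0e35"                          # นี (no tone mark)

print(horspool(text, query))                    # 3: character-level hit inside a cluster
print(horspool(cluster(text), cluster(query)))  # -1: no whole-cluster match
```

In this sketch the character-level search reports a hit that splits an inseparable unit, while the cluster-level search compares whole units and scans a shorter sequence (two clusters instead of six characters). This is consistent with the accuracy and efficiency gains the abstract attributes to TCCs, though the paper's actual segmentation rules may differ.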
Similar resources
Adopting the Information Retrieval Approach for Storing and Retrieving Thai-text Structured Data
This paper describes an approach to using a full-text search engine for storing and retrieving structured data in the Thai language. It discusses some limitations of database management systems (DBMS) in querying Thai full-text content. These limitations can degrade retrieval performance in terms of both result accuracy and system response time. Information Retrieval (IR) system or...
More Accurate Fuzzy Text Search for Languages Using Abugida Scripts
Text search is a key step in any kind of information access. For doing it effectively, we can use knowledge about the concerned writing systems. Methods based on such knowledge can give significantly better results for searching text, at least for some languages. This can improve information retrieval in particular and information access in general. In this paper, we present a method for fuzzy ...
Overview of the Full-Text Document Retrieval Benchmark
8.1 Introduction For most of recorded history, textual data have existed primarily in hardcopy format, and the related document retrieval process was essentially a manual task, possibly involving the assistance of cross-reference catalogs. By the mid-1960s, work was under way at the University of Pittsburgh to develop computer-assisted legal research systems [Harrington, 1984–85]. Also, during ...
Review of ranked-based and unranked-based metrics for determining the effectiveness of search engines
Purpose: Traditionally, there have been many metrics for evaluating search engines; nevertheless, various researchers have proposed new metrics in recent years. Awareness of these new metrics is essential for conducting research on search engine evaluation. So, the purpose of this study was to provide an analysis of important and new metrics for evaluating search engines. Methodology: This is ...
WWW Search Systems Using SQL*TextRetrieval and Parallel Server for Structured and Unstructured Data
We describe our experience in developing Web Search Systems using Oracle’s SQL*TextRetrieval. In the prototype system we store on-line books in the HTML and the HTML documents of a web site, SQL*TextRetrieval is used to index full text and other structured data in the ’web space’ and to provide an efficient search engine for free-text search. The Web enables global access to and maximum informa...