Keyword-driven Suffix Arrays for On-line Keyword Searching from Documents in Chinese
نویسنده
چکیده
On-line keyword searching from documents in Chinese tends to use inverted indexing as the main technique, which has its difficulties. Suffix Array is widely used for processing text in Western languages. However, it fails to get widely used in Chinese processing because of the speciality of Chinese. Suffix Array is a powerful tool. However it costs too much space. That is the major bottleneck of suffix Array. A data structure called Keyword-driven Suffix Array is proposed in this paper for on-line keyword searching from documents in Chinese, based on observation of on-line search pattern and traits of Chinese. Space efficiency is improved a lot using this data structure. When the document database is large enough, space efficiency is improved by about 5/6 using this data structure without sacrificing its time efficiency.
منابع مشابه
Acceleration of spoken term detection using a suffix array by assigning optimal threshold values to sub-keywords
We previously proposed a fast spoken term detection method that uses a suffix array data structure for searching large-scale speech documents. The method reduces search time via techniques such as keyword division and iterative lengthening search. In this paper, we propose a statistical method of assigning different threshold values to sub-keywords to further accelerate search. Specifically, th...
متن کاملCharacter confidence based on N-best list for keyword spotting in online Chinese handwritten documents
In keyword spotting from handwritten documents by text query, the word similarity is usually computed by combining character similarities, which are desired to approximate the logarithm of the character probabilities. In this paper, we propose to directly estimate the posterior probability (also called confidence) of candidate characters based on the N-best paths from the candidate segmentation...
متن کاملProximity Keyword Search in Xml Documents Using CTREE Index
Proximity Keyword Search is especially useful when searching on the web and in long unstructured documents such as XML. This system is designed to handle novel features of Proximity Keyword Search in XML documents. It concentrates mainly on producing ranked results efficiently for keyword search queries over XML documents. The proposed system is first of its kind in which the keyword string is ...
متن کاملA Survey on Various Word Spotting Techniques for Content Based Document Image Retrieval
Searching documents for information and retrieval of relevant documents is a basic activity. Various tools are readily available for searching and retrieval from digital documents, but not much robust methods are available for retrieval from historic documents and old manuscripts as they are not digitized but available in scanned formats. Conventional way of retrieval from scanned document imag...
متن کاملEvaluation of Fast Spoken Term Detection Using a Suffix Array
We previously proposed [1] fast spoken term detection that uses a suffix array as a data structure for searching a largescale speech documents. In this method, a keyword is divided into sub-keywords, and the phoneme sequences that contain two or more sub-keywords are output as results. Although the search is executed very quickly on a 10,000-h speech database, we only proposed a variety of matc...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2012