PII: S0306-4573(01)00044-9
Abstract
Most work in NLP requires texts that have already been segmented into sentences and words. Segmenting a text into sentences and words, however, is a complex task because many punctuation marks and spaces are ambiguous. Web texts such as HTML documents are even harder to turn into well-refined, segmented text because they are written in a freer style, with many sentence boundary and spacing errors. This paper introduces a multi-strategic, integrated text preprocessing method for the difficult problems of sentence boundary disambiguation and word boundary disambiguation in Web texts. We apply a hybrid method (regular expression rules, heuristic rules, and inductive learning of statistical decision trees with a C4.5 learner) synergistically to the task of raw corpus preprocessing. This work contributes to more accurate morphological analysis and guarantees more stable operation of application systems. We tackle easily definable problems with automatically acquired constraints, and we use inductively learned decision trees to solve ill-defined ambiguity problems by incorporating multiple features (n-grams, relative frequency, entropy, tri-dictionary index). The multistrategy approach was thoroughly tested on Korean news script Web documents: it achieved approximately 99.12% (with punctuation marks) and 98.04% (without any punctuation marks) accuracy in sentence boundary disambiguation, 95.39% accuracy in word spacing correction, and 94.61% accuracy on the whole intermixed text preprocessing problem. © 2002 Elsevier Science Ltd. All rights reserved.
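To make the hybrid idea in the abstract more concrete, here is a minimal sketch of two of its ingredients: a regular-expression rule that proposes candidate sentence boundaries, with a simple heuristic filter, and an entropy feature of the kind a decision tree learner could consume. Everything named here is an assumption for illustration: the `ABBREVIATIONS` list, the `CANDIDATE` pattern, and the functions `split_sentences` and `successor_entropy` are hypothetical and are not taken from the paper, which does not spell out its rules or feature definitions in the abstract. The C4.5 decision tree stage is not shown.

```python
import math
import re
from collections import Counter

# Hypothetical abbreviation list; the paper's actual rules are not given.
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "e.g", "i.e", "etc"}

# Candidate boundary: '.', '!' or '?' followed by whitespace or end of text.
CANDIDATE = re.compile(r"[.!?](?=\s|$)")


def split_sentences(text):
    """Split text at candidate boundaries, skipping periods after known abbreviations."""
    sentences, start = [], 0
    for match in CANDIDATE.finditer(text):
        # Token immediately before the punctuation mark.
        before = text[start:match.start()].split()
        prev = before[-1].lower() if before else ""
        # Heuristic rule: a period after an abbreviation is not a boundary.
        if match.group() == "." and prev in ABBREVIATIONS:
            continue
        sentence = text[start:match.end()].strip()
        if sentence:
            sentences.append(sentence)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def successor_entropy(corpus, context):
    """Shannon entropy (in bits) of the character that follows `context` in `corpus`.

    This is one assumed way to define an entropy feature; the paper's own
    definition may differ.
    """
    counts = Counter(
        corpus[i + len(context)]
        for i in range(len(corpus) - len(context))
        if corpus.startswith(context, i)
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

In a fuller pipeline, the cases the rules cannot settle would be turned into feature vectors (n-gram counts, relative frequencies, entropy values) and passed to a learned classifier; that split between rules and learning is the division of labour the abstract describes.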
Similar resources
A day in the life of Web searching: an exploratory study
Understanding Web searching behavior is important in developing more successful and cost-efficient Web search engines. We provide results from a comparative time-based Web study of US-based Excite and Norwegian-based Fast Web search logs, exploring variations in user searching related to changes in time of the day. Findings suggest: (1) fluctuations in Web user behavior over the day, (2) user i...
Further reflections on TREC
The paper reviews the TREC Programme up to TREC-6 (1997), considering the test results, the substantive findings for IR that follow and the lessons TREC offers for IR evaluation. The paper focuses on the ad hoc retrieval task, with discussion of other test tracks as appropriate. The paper summarises the structure of the TREC work and analyses the experimental data in some detail. The analysis of ...
Genetic algorithms in relevance feedback: a second test and new contributions
The present work is the continuation of an earlier study which reviewed the literature on relevance feedback genetic techniques that follow the vector space model (the model that is most commonly used in this type of application), and implemented them so that they could be compared with each other as well as with one of the best traditional methods of relevance feedback, the Ide dec-hi method. ...
Some thoughts on the reported results of TREC
The periodic TRECs (Text REtrieval Conferences) have reported the results of a variety of recall studies in large-scale document retrieval. While the efforts of TREC are noteworthy and laudable, there are reasons why its results, especially the recall values which are central to its conclusions, should be accepted with some caution.