Analysis of the Usage of Japanese Segmented Texts in NTCIR Workshop 2
Abstract
In this paper, we report on the usage of Japanese segmented texts and analyze the search results submitted to NTCIR Workshop 2 that used these texts. In these texts, each sentence is segmented into terms and term components (similar to phrases and words). However, term sizes are inconsistent across the texts; e.g., some terms that should have been decomposed into term components remain as whole terms. We analyze the effect of this inconsistency by comparing word-based indexing with phrasal indexing. Based on this analysis, we propose a desired specification for a morphological analyzer for Information Retrieval.
Similar resources
Preface of NTCIR-8
NTCIR-8 Meeting is where the groups who actively participated in one or more tasks set by NTCIR-8 report out their latest results obtained from the evaluation workshop. The NTCIR evaluation workshop series are designed to enhance research in information access technologies, including text retrieval, cross-language information access, question-answering, information extraction, text mining, etc....
The NTCIR Workshop: the First Evaluation Workshop on Japanese Text Retrieval and Cross-Lingual Information Retrieval
This paper introduces the outline of the first NTCIR Workshop, which is the first evaluation workshop designed to enhance research in Japanese text retrieval and cross-lingual information retrieval. The test collection used in the Workshop consists of more than 330,000 documents, more than half of which are English-Japanese paired. Twenty-three groups from four countries have conducted IR tasks and s...
Using Parallel Corpora to Automatically Generate Training Data for Chinese Segmenters in NTCIR PatentMT Tasks
Chinese texts do not contain spaces as word separators like English and many alphabetic languages. To use Moses to train translation models, we must segment Chinese texts into sequences of Chinese words. Increasingly more software tools for Chinese segmentation are populated on the Internet in recent years. However, some of these tools were trained with general texts, so might not handle domain...
Automatic Categorization of Japanese Patents based on Surrogate Texts
This paper describes our work at the fifth NTCIR workshop on the subtask of patent classification. We use KNN (K-Nearest Neighbors) as our classifier and the English PAJ (Patent Abstract Japan) as the patent surrogate for classification. Based on the knowledge and experience learned from our previous experiments with other document collections, we leverage on the parameters to achieve above-ave...
IRCE at the NTCIR-12 IMine-2 Task
The IRCE team participated in the IMine-2 task at the NTCIR-12 workshop. We submitted one Chinese language run and five Japanese language runs for the Query Understanding subtask. Our methods exploited online text corpora BaiduPedia for the Chinese language run and Japanese Wikipedia for the Japanese language runs. The approaches employed in the Chinese and Japanese language topics are differed...