Chinese Language IR based on Term Extraction
Abstract
In this paper, we describe the core technology and modules used in the Chinese Language Information Retrieval System of LIT (formerly KRDL). The system has three main parts: automatic term extraction from Chinese documents, query analysis based on the extracted terms, and measurement of the association between queries and documents. Unlike other methods, we use automatically acquired terms, together with their related terms, as the features for retrieving documents, so no word segmentation procedure is needed as a prerequisite. Terms are also more significant than words in representing the queries.
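The three stages named in the abstract can be illustrated with a minimal sketch. This is not the LIT system's method, which the abstract does not specify: here recurring character n-grams stand in for automatically extracted terms, and a plain dot product of term counts stands in for the association measure. All function names, parameters, and thresholds are hypothetical.

```python
from collections import Counter


def extract_terms(docs, min_len=2, max_len=4, min_freq=2):
    """Stand-in term extraction: keep character n-grams that occur at
    least min_freq times in the collection. No word segmentation is used."""
    counts = Counter()
    for doc in docs:
        for n in range(min_len, max_len + 1):
            for i in range(len(doc) - n + 1):
                gram = doc[i:i + n]
                # Skip n-grams containing punctuation or whitespace.
                if gram.isalnum():
                    counts[gram] += 1
    return {gram for gram, freq in counts.items() if freq >= min_freq}


def terms_in(text, terms):
    """Query/document analysis: count occurrences of known terms in text."""
    return Counter({t: text.count(t) for t in terms if t in text})


def association(query, doc, terms):
    """Assumed association measure: dot product of term counts."""
    q = terms_in(query, terms)
    d = terms_in(doc, terms)
    return sum(q[t] * d[t] for t in q)


def rank(query, docs, terms):
    """Return document indices ordered by decreasing association."""
    return sorted(range(len(docs)),
                  key=lambda i: association(query, docs[i], terms),
                  reverse=True)


docs = ["信息检索系统研究", "中文信息检索方法", "天气预报今天晴"]
terms = extract_terms(docs)
ranking = rank("信息检索", docs, terms)
```

In this toy collection the third document shares no extracted term with the query, so it is ranked last. A real system would need a stronger term-selection criterion than raw frequency, since frequent n-grams are not necessarily meaningful terms.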
Similar resources
A Bottom-up Term Extraction Approach for Web-based Translation in Chinese-English IR Systems
The extraction of Multiword Lexical Units (MLUs) in lexica is important to language related methods such as Natural Language Processing (NLP) and machine translation. As one word in one language may be translated into an MLU in another language, the extraction of MLUs plays an important role in Cross-Language Information Retrieval (CLIR), especially in finding the translation for words that are...
Exploring Semantic Constraints For Document Retrieval
In this paper, we explore the use of structured content as semantic constraints for enhancing the performance of traditional term-based document retrieval in special domains. First, we describe a method for automatic extraction of semantic content in the form of attribute-value (AV) pairs from natural language texts based on domain models constructed from a semistructured web resource. Then, we...
A Statistical Corpus-Based Term Extractor
Term extraction is an important problem in natural language processing. In this paper, we propose a language independent statistical corpus-based term extraction algorithm. In previous approaches, evaluation has been subjective, at best relying on a lexicographer’s judgement. We evaluate the quality of our term extractor by assessing its predictiveness on an unseen corpus using perplexity. Seco...
TREC-9 CLIR Experiments at MSRCN
In TREC-9, we participated in the English-Chinese Cross-Language Information Retrieval (CLIR) track. Our work involved two aspects: finding good methods for Chinese IR, and finding effective translation means between English and Chinese. On Chinese monolingual retrieval, we investigated the use of different entities as indexes, pseudorelevance feedback, and length normalization, and examined th...
Improving English and Chinese Ad-Hoc Retrieval: TIPSTER Text Phase 3 Final Report
We investigated both English and Chinese ad-hoc information retrieval (IR). Part of our objectives is to study the use of term, phrasal and topical concept level evidence, either individually or in combination, to improve retrieval accuracy. For short queries, we studied five term level techniques that together lead to improvements over standard ad-hoc 2-stage retrieval some 20% to 40% for TREC...