LoLo: A System Based On Terminology For Multilingual Extraction
Abstract
An unsupervised learning method, based on corpus linguistics and special-language terminology, is described that can extract time-varying information from text streams. The method is shown to be ‘language-independent’ in that its use leads to sets of regular expressions that can be used to extract the information in typologically distinct languages such as English and Arabic. The method uses information about the distribution of n-grams to automatically extract ‘meaning-bearing’ patterns of usage in a training corpus. The analysis of an English news wire corpus (1,720,142 tokens) and an Arabic news wire corpus (1,720,154 tokens) shows encouraging results.
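The idea of turning n-gram distributions into extraction patterns can be illustrated with a minimal sketch. This is not the authors' implementation: the seed term, the trigram window, the frequency threshold, and the function names `ngrams`, `frequent_patterns`, and `to_regex` are all assumptions made for illustration, and the three-sentence corpus is invented.

```python
import re
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token tuples, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def frequent_patterns(sentences, seed, n=3, min_count=2):
    """Count n-grams containing the seed term; keep those seen min_count+ times.

    Assumption: whitespace tokenization and a single seed term stand in for
    the terminology-driven selection described in the abstract.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        counts.update(g for g in ngrams(tokens, n) if seed in g)
    return [g for g, c in counts.most_common() if c >= min_count]

def to_regex(patterns):
    """Compile the frequent n-grams into one alternation of literal phrases."""
    alternatives = [r"\s+".join(re.escape(tok) for tok in g) for g in patterns]
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)

# Toy corpus (invented for illustration only).
corpus = [
    "the index rose sharply in early trading",
    "the index rose sharply after the announcement",
    "analysts said the index fell slightly on friday",
]
patterns = frequent_patterns(corpus, seed="index")
extractor = to_regex(patterns)
```

Because the sketch only counts token sequences and escapes them literally, nothing in it depends on the language of the text, which is the sense in which such a pipeline can be called language-independent. Note that `to_regex` expects at least one pattern; an empty list would yield a regex that matches the empty string.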
Similar resources
Bilingual terminology extraction: an approach based on a multilingual thesaurus applicable to comparable corpora
This paper presents several methods for exploiting multiple resources in bilingual lexicon extraction, either from parallel or comparable corpora. First, special attention is given to the use of multilingual thesauri, and different search strategies based on such thesauri are investigated. Then, a method to optimally combine the different resources for bilingual lexicon extraction is presente...
Exploiting a Multilingual Web-based Encyclopedia for Bilingual Terminology Extraction
Multilingual linguistic resources are usually constructed from parallel corpora, but since these corpora are available only for selected text domains and language pairs, the potential of other resources is being explored as well. This article seeks to explore and to exploit the idea of using multilingual web-based encyclopedias such as Wikipedia as comparable corpora for bilingual terminology e...
Language-Independent Bilingual Terminology Extraction from a Multilingual Parallel Corpus
We present a language-pair independent terminology extraction module that is based on a sub-sentential alignment system that links linguistically motivated phrases in parallel texts. Statistical filters are applied on the bilingual list of candidate terms that is extracted from the alignment output. We compare the performance of both the alignment and terminology extraction module for three dif...
TTC TermSuite - A UIMA Application for Multilingual Terminology Extraction from Comparable Corpora
This paper aims at presenting TTC TermSuite: a tool suite for multilingual terminology extraction from comparable corpora. This tool suite offers a user-friendly graphical interface for designing UIMA-based tool chains whose components (i) form a functional architecture, (ii) manage 7 languages of 5 different families, (iii) support standardized file formats, (iv) extract single- and multiword ter...
An Approach Based on Multilingual Thesauri and Model Combination for Bilingual Lexicon Extraction
This paper focuses on exploiting different models and methods in bilingual lexicon extraction, either from parallel or comparable corpora, in specialized domains. First, special attention is given to the use of multilingual thesauri, and different search strategies based on such thesauri are investigated. Then, a method to combine the different models for bilingual lexicon extraction is prese...