INTEX: A Corpus Processing System

Author

  • Max Silberztein
Abstract

INTEX is a text processor; it is usually used to parse corpora of several megabytes. It includes several built-in large-coverage dictionaries and grammars represented by graphs; the user may add his/her own dictionaries and grammars. These tools are applied to texts in order to locate lexical and syntactic patterns, remove ambiguities, and tag words. INTEX builds concordances and indexes of all types of patterns; it is used by linguists to analyse corpora, but can also be viewed as an information retrieval system.

INTRODUCTION

INTEX automatically identifies words and morpho-syntactic patterns in large texts. By using INTEX, one can:

- build the dictionary of the words of the texts; words may be simple words (sequences of letters, e.g. table), compounds (sequences of simple words which include a separator, e.g. word processor) or complete expressions (sequences of words which accept insertions, bucket);
- locate in texts all occurrences of a given word (even if inflected), a given category (e.g. all feminine plural adjectives) or a morpho-syntactic pattern (a regular expression);
- apply grammars represented by recursive graphs to texts; build indexes or concordances for all occurrences of the previous patterns;
- use local grammars to remove word and utterance ambiguities in texts, or to detect errors or deviant sequences.

While INTEX already includes several built-in dictionaries and grammars, it allows the user to create, edit and add his/her own tools, in order to increase coverage of texts and to remove additional ambiguities.

The user first loads a text and selects the working language. INTEX counts the number of tokens in the text, the number of different ones, and sorts them by frequency. Then the user selects linguistic tools to parse the text. Tools are either dictionaries or finite state transducers (FSTs).
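The token-counting step described above (total tokens, distinct tokens, sorting by frequency) can be sketched roughly as follows. This is only an illustration in Python, not INTEX's own implementation: the function name `token_statistics` is hypothetical, and the tokenizer assumes that a token is a maximal run of letters, lowercased, which may differ from how INTEX delimits tokens.

```python
import re
from collections import Counter


def token_statistics(text):
    """Return (total tokens, distinct tokens, tokens ranked by frequency).

    Assumption: a token is a maximal run of letters, compared
    case-insensitively. INTEX's actual tokenization may differ.
    """
    # [^\W\d_]+ matches runs of Unicode letters (no digits, no underscore).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(tokens)
    # Most frequent first; ties are broken alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return len(tokens), len(counts), ranked


total, distinct, ranked = token_statistics("The table is near the other table.")
print(total, distinct)   # 7 5
print(ranked[:2])        # [('table', 2), ('the', 2)]
```

A real corpus of several megabytes would be read from a file rather than a string literal, but the counting and sorting logic stays the same.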
INTEX is based on two large-coverage built-in dictionaries:
- the DELAF dictionary contains over 700,000 simple words, basically all the simple words of the language. Each entry in the DELAF is associated with explicit morphological information for each word: its canonical form (e.g. the infinitive for verbs), its part of speech (e.g. Noun), and some inflectional information (e.g. first person singular present). Here are three entries of the

Similar resources

NooJ: a Linguistic Annotation System for Corpus Processing

One characteristic of NooJ is that its corpus processing engine uses large-coverage linguistic lexical and syntactic resources. This allows NooJ users to perform sophisticated queries that include any of the available morphological, lexical or syntactic properties. In comparison with INTEX, NooJ uses a new technology (.NET), a new linguistic engine, and was designed with a new range of applicat...

Presenting a Religious Question Answering Corpus in Persian

Question answering is a field of natural language processing and information retrieval that has attracted researchers' attention in recent decades. Growing interest in this field has created a need for appropriate data sources. Most research on developing question answering corpora has so far been done in English, but in other languages such as Persian, the lack of these co...

The A'lam Corpus: A Standard Named Entity Corpus for the Persian Language

Named entity recognition (NER) is a natural language processing (NLP) problem that is mainly used for text summarization, data mining, data retrieval, question answering, machine translation, and document classification systems. A NER system is tasked with determining the boundaries of each named entity, recognizing its type and classifying it into predefined categories. The categories of named...

Corpus based coreference resolution for Farsi text

"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...

Electronic Dictionaries And Linguistic Analysis Of Italian Large Corpora

In this paper we will show how Italian electronic dictionaries have been built within the methodological framework of Lexicon-grammar. We will see the structure of electronic dictionaries of simple and compound words, and we will show how to analyse texts employing these linguistic tools within INTEX, a morphological analyser. Finally, we will show how electronic grammars (built with INTEX) interact with dic...

Publication date: 1994