Corpus based coreference resolution for Farsi text

Abstract:

Coreference resolution, that is, finding all expressions in a text that refer to the same entity, is an important requirement in natural language processing. Two expressions are coreferent when both refer to a single entity in the text or in the real world, so the main task of a coreference resolution system is to identify the terms that refer to the same entity. A coreference resolution tool can be used in many natural language processing tasks, such as machine translation, automatic text summarization, question answering, and information extraction, and adding coreference information can increase the power of such systems. Coreference resolution can be approached in different ways, including heuristic rule-based methods and supervised or unsupervised machine learning methods. Corpus-based and machine-learning-based methods have been widely used for this task in recent years and have led to good performance. These methods require a manually labeled corpus of sufficient size, and before this research no such corpus existed for Persian. One important goal of this work was therefore to produce a thorough corpus that can be used for coreference resolution and for related fields in linguistics and computational linguistics. In this research, a manually annotated corpus of coreference-tagged phrases of about one million words was produced. The corpus also carries named entity recognition (NER) tags, with 7 named entity labels, and for the coreference task all noun phrases, pronouns, and named entities have been tagged. Using this corpus, a coreference tool was built with a vector space machine, reaching a precision of about 60% on gold-standard test data. As mentioned before, this article presents the procedure for producing a coreference resolution tool.
The tool is produced with a machine learning method and is based on the tagged corpus of 900 thousand tokens. Several different features and tools were used in building the system, each of which affects the accuracy of the whole tool. Increasing the number of features, especially semantic features, can help improve the results. Given the resources currently available for Persian, suitable syntactic and semantic tools are lacking, and this research is limited in that respect. The coreference-tagged corpus produced in this study is more than 500 times larger than previous Persian corpora, and it is quite comparable to the prominent ACE and OntoNotes corpora. The resulting system has an F-measure of nearly 60 according to the CoNLL standard criterion. Other, more limited studies on Farsi have reported accuracies ranging from 40% to 90%, but these figures are not comparable to the present study, because they were not measured with the standard criteria of the coreference resolution field.
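For readers unfamiliar with the CoNLL criterion mentioned in the abstract: the CoNLL coreference score is commonly defined as the unweighted average of the F1 values of three metrics (MUC, B-cubed, and entity-based CEAF). The sketch below is only an illustration of that idea and is not the authors' evaluation code. It implements simplified MUC and B-cubed over toy mention identifiers; the CEAF value is assumed to come from elsewhere, and all function names (`f1`, `muc`, `b_cubed`, `conll_average`) are hypothetical.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def muc(key, response):
    """Simplified link-based MUC score; returns (precision, recall, F1)."""
    def side(gold, system):
        # Map each mention to the index of the system chain containing it.
        index = {m: i for i, chain in enumerate(system) for m in chain}
        num = den = 0
        for chain in gold:
            linked = {index[m] for m in chain if m in index}
            unlinked = sum(1 for m in chain if m not in index)
            partitions = len(linked) + unlinked  # pieces the chain is split into
            num += len(chain) - partitions
            den += len(chain) - 1
        return num / den if den else 0.0

    recall = side(key, response)
    precision = side(response, key)
    return precision, recall, f1(precision, recall)


def b_cubed(key, response):
    """Simplified mention-based B-cubed score; returns (precision, recall, F1)."""
    def side(gold, system):
        # Map each mention to the full system chain containing it.
        index = {m: set(chain) for chain in system for m in chain}
        total, count = 0.0, 0
        for chain in gold:
            chain = set(chain)
            for m in chain:
                total += len(chain & index.get(m, set())) / len(chain)
                count += 1
        return total / count if count else 0.0

    recall = side(key, response)
    precision = side(response, key)
    return precision, recall, f1(precision, recall)


def conll_average(muc_f1, b3_f1, ceafe_f1):
    """Unweighted mean of the three F1 values."""
    return (muc_f1 + b3_f1 + ceafe_f1) / 3


# Toy example: gold (key) chains versus system (response) chains.
key = [{"a", "b", "c"}, {"d", "e"}]
response = [{"a", "b"}, {"c", "d", "e"}]
muc_p, muc_r, muc_f = muc(key, response)
b3_p, b3_r, b3_f = b_cubed(key, response)
```

In this toy case one mention (`c`) is attached to the wrong chain, which costs one link under MUC and lowers the per-mention overlap under B-cubed. A full evaluation would also compute entity-based CEAF, which needs an optimal alignment between key and response chains and is omitted here.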

Similar resources

Corpus for Coreference Resolution on Scientific Papers

The ever-growing number of published scientific papers prompts the need for automatic knowledge extraction to help scientists keep up with the state-of-the-art in their respective fields. To construct a good knowledge extraction system, annotated corpora in the scientific domain are required to train machine learning models. As described in this paper, we have constructed an annotated corpus fo...

Corpus-Based Learning for Noun Phrase Coreference Resolution

In this paper, we present a learning approach for coreference resolution of noun phrases in unrestricted text. The approach learns from a small, annotated corpus and the task includes resolving not just pronouns but rather general noun phrases. In contrast to previous work, we attempt to evaluate our approach on a common data set, the MUC-6 coreference corpus. We obtained encouraging results, i...

A Search Based Method for Clinical Text Coreference Resolution

This paper describes a novel method and system developed to address the coreference task of the 2011 i2b2 NLP Challenge, which involved analyzing clinical texts of several types and identifying the coreferential mentions within them. The method uses an open source search library component to do the heavy lifting for string matching, allowing a lightweight rule-based processing engine to cluster...

Evaluation of Coreference Resolution for Biomedical Text

The accuracy of document processing activities such as retrieval or event extraction can be improved by resolution of lexical ambiguities. In this brief paper we investigate coreference resolution in biomedical texts, reporting on an experiment that shows the benefit of domain-specific knowledge. Comparison of a state-of-the-art general system with a purpose-built system shows that the latter i...

Marmara Turkish Coreference Corpus and Coreference Resolution Baseline

We describe the Marmara Turkish Coreference Corpus, which is an annotation of the whole METU-Sabanci Turkish Treebank with mentions and coreference chains. Collecting nine or more independent annotations for each document allowed for fully automatic adjudication. We provide a baseline system for Turkish mention detection and coreference resolution and evaluate it on the corpus.

Journal title

Volume 17, Issue 1

Pages 79-98

Publication date 2020-06

Keywords

No keywords have been provided for this article
