Measures for Corpus Similarity and Homogeneity

Authors

  • Adam Kilgarriff
  • Tony Rose
Abstract

How similar are two corpora? A measure of corpus similarity would be very useful for NLP for many purposes, such as estimating the work involved in porting a system from one domain to another. First, we discuss difficulties in identifying what we mean by 'corpus similarity': human similarity judgements are not fine-grained enough, corpus similarity is inherently multidimensional, and similarity can only be interpreted in the light of corpus homogeneity. We then present an operational definition of corpus similarity which addresses or circumvents the problems, using purpose-built sets of "known-similarity corpora" (KSC). These KSC sets can be used to evaluate the measures. We evaluate the measures described in the literature, including three variants of the information-theoretic measure 'perplexity'. A χ²-based measure, using word frequencies, is shown to be the best of those tested.
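To make the idea of a χ²-based measure over word frequencies concrete, here is a minimal sketch. The abstract does not give the formula, so everything below is an assumption: the function name `chi_square_similarity`, the `top_n` cutoff, and the choice of a plain Pearson χ² summed over the most frequent words of the two corpora combined. It is an illustration of the general technique, not necessarily the exact procedure evaluated in the paper.

```python
from collections import Counter


def chi_square_similarity(tokens_a, tokens_b, top_n=500):
    """Pearson chi-square over the top_n most frequent words of the joint corpus.

    Hypothetical helper: lower values suggest the two corpora are more similar.
    """
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    if size_a == 0 or size_b == 0:
        raise ValueError("both corpora must be non-empty")

    joint = freq_a + freq_b          # word counts over both corpora together
    total = size_a + size_b

    chi2 = 0.0
    for word, joint_count in joint.most_common(top_n):
        # Expected counts if both corpora were samples of the same population.
        expected_a = joint_count * size_a / total
        expected_b = joint_count * size_b / total
        chi2 += (freq_a[word] - expected_a) ** 2 / expected_a
        chi2 += (freq_b[word] - expected_b) ** 2 / expected_b
    return chi2


if __name__ == "__main__":
    same = "the cat sat on the mat".split()
    print(chi_square_similarity(same, same))
    print(chi_square_similarity(["x", "x"], ["y", "y"]))
```

Restricting the sum to the most frequent words keeps expected counts from becoming very small, which is the usual concern with χ² on sparse word data; the cutoff of 500 is an arbitrary default chosen for this sketch.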

Similar resources

Measuring the homogeneity and similarity of language corpora

Corpus-based methods are now dominant in Natural Language Processing (NLP). Creating big corpora is no longer difficult and the technology to analyze them is growing faster, more robust and more accurate. However, when an NLP application performs well on one corpus, it is unclear whether this level of performance would be maintained on others. To make progress on these questions, we need metho...

ITRI-98-07 Measures for corpus similarity and homogeneity

How similar are two corpora? A measure of corpus similarity would be very useful for NLP for many purposes, such as estimating the work involved in porting a system from one domain to another. First, we discuss difficulties in identifying what we mean by ‘corpus similarity’: human similarity judgements are not finegrained enough, corpus similarity is inherently multidimensional, and similarity ...

The influence of example-data homogeneity on EBMT quality

Homogeneity of large corpora is still a largely unclear notion. In this study we first make a link between the notions of similarity and homogeneity: a large corpus is made of sets of documents to which may be assigned a score in similarity defined by cross-entropic measures, such similarity being implicitly expressed in the data. The distribution of the similarity scores of such subcorpora ma...

A Method to Quantify Corpus Similarity and its Application to Quantifying the Degree of Literality in a Document

Comparing and quantifying corpora is a key issue in corpus-based translation and corpus linguistics, for which there is still a notable lack of measures. This makes it difficult for a user to isolate, transpose, or extend the interesting features of a corpus to other NLP systems. In this work we address the issue of measuring similarity between corpora. We suggest a scale between two user chose...

ITRI-97-07 Using Word Frequency Lists to Measure Corpus Homogeneity and Similarity between Corpora

How similar are two corpora? A measure of corpus similarity would be very useful for language engineering. Word frequency lists are cheap and easy to generate so a measure based on them would be of use as a quick guide in many circumstances; for example, to judge how a newly available corpus related to existing resources, or how easy it might be to port an NLP system designed to work with one t...

Publication date: 1998