An Empirical Investigation of Statistical Significance in NLP

Authors

  • Taylor Berg-Kirkpatrick
  • David Burkett
  • Dan Klein
Abstract

We investigate two aspects of the empirical behavior of paired significance tests for NLP systems. First, when one system appears to outperform another, how does significance level relate in practice to the magnitude of the gain, to the size of the test set, to the similarity of the systems, and so on? Is it true that for each task there is a gain which roughly implies significance? We explore these issues across a range of NLP tasks using both large collections of past systems’ outputs and variants of single systems. Next, once significance levels are computed, how well does the standard i.i.d. notion of significance hold up in practical settings where future distributions are neither independent nor identically distributed, such as across domains? We explore this question using a range of test set variations for constituency parsing.
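
The abstract does not name a particular paired test. As an illustration only, the sketch below uses the paired bootstrap, one widely used option, to estimate a one-sided p-value for the claim that system A outperforms system B on a shared test set. It assumes the evaluation metric is a simple mean of per-example scores (for example, 0/1 correctness); the function name `paired_bootstrap_p_value` and its parameters are hypothetical, not taken from the paper.

```python
import random


def paired_bootstrap_p_value(scores_a, scores_b, num_samples=10000, seed=0):
    """Estimate a one-sided p-value that system A's observed gain over B is chance.

    Assumption: the metric is the mean of per-example scores, so both
    systems must be scored on exactly the same test examples, in order.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test needs one score per example for both systems")
    n = len(scores_a)
    if n == 0:
        raise ValueError("test set is empty")

    # Pairing: work with per-example differences rather than two separate means.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed_gain = sum(diffs) / n

    # If A does not appear to outperform B at all, there is nothing to support.
    if observed_gain <= 0:
        return 1.0

    rng = random.Random(seed)
    exceed = 0
    for _ in range(num_samples):
        # Resample n test examples with replacement and recompute the gain.
        resampled_gain = sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        # Bootstrap gains are centered on observed_gain rather than on zero,
        # so shift the threshold: count gains exceeding twice the observed gain.
        if resampled_gain > 2 * observed_gain:
            exceed += 1
    return exceed / num_samples
```

Working on per-example differences is what makes the test paired: two similar systems tend to make correlated errors, so their difference can have much lower variance than either score alone. This is one way the similarity of the systems, the size of the test set, and the magnitude of the gain all enter the resulting significance level.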

Similar resources

Investigation and Statistical comparison of the soil empirical desalinization models for saline-sodic soils (Case study: Khuzestan province)

In arid areas, which resemble most regions of Iran, accumulation of soluble salts at the soil surface and in the soil profile is inevitable because of low precipitation and high evaporation. High concentrations of soluble salts in the soil profile cause severe problems for root water uptake, so plant growth stops. Reducing soil salinity to optimized content by leaching and avoiding soil ponding must be con...

Significant Lexical Relationships

Statistical NLP inevitably deals with a large number of rare events. As a consequence, NLP data often violates the assumptions implicit in traditional statistical procedures such as significance testing. We describe a significance test, an exact conditional test, that is appropriate for NLP data and can be performed using freely available software. We apply this test to the study of lexical rel...

An Empirical Comparison of Distance Measures for Multivariate Time Series Clustering

Multivariate time series (MTS) data are ubiquitous in science and daily life, and measuring their similarity is a core part of the MTS analysis process. Many research efforts in this context have focused on proposing novel similarity measures for the underlying data. However, with the countless techniques to estimate similarity between MTS, this field suffers from a lack of comparative...

An experimental investigation on the effect of acid treatment of MWCNTs on the viscosity of water based nanofluids and statistical analysis of viscosity in prepared nanofluids

The effect of temperature (25, 40, 55 and 70°C) and weight fraction of MWCNTs (0.125, 0.25 and 0.5 wt%) on the viscosity of nanofluids containing pristine and functionalized MWCNTs has been investigated. For this purpose, all measurements were carried out in triplicate and analyzed using a two-factor completely randomized design, and comparison of data means was carried out with Dunca...

Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets

With the ever growing amount of textual data from a large variety of languages, domains, and genres, it has become standard to evaluate NLP algorithms on multiple datasets in order to ensure a consistent performance across heterogeneous setups. However, such multiple comparisons pose significant challenges to traditional statistical analysis methods in NLP and can lead to erroneous conclusions....

Publication date: 2012