Measures of dispersion for corpus data: an overview, a suggestion, and a research program I

Abstract

The most frequently used statistics in corpus linguistics are the frequency of occurrence of some linguistic variable and the frequency of co-occurrence of two or more linguistic variables. However, as has been pointed out correctly and repeatedly, frequencies of (co-)occurrence in isolation may sometimes be severely misleading, given that they alone do not take into consideration the degree of dispersion of the linguistic variable in question. To handle such problems, several scholars have suggested a variety of dispersion measures and adjusted frequency measures. Unfortunately, few scholars appear to be aware of this issue, and measures of dispersion are apparently not widely known and even less widely used. Another unfortunate aspect of such measures is that many of them come with a variety of problems. I pursue three objectives with this article. First, in order to raise awareness of this thematic complex and make the available measures more widely known, I present an overview of a large number of dispersion measures and adjusted frequency measures (including some more recent measures that have not yet found their way into research articles and textbooks) and summarily discuss some of their advantages and disadvantages. Second, I propose for discussion a conceptually very simple alternative measure, DP, explain and exemplify its properties, and compare it to the more or less established measures on the basis of fictitious distributions from the literature, word frequencies from the BNC Sampler, and co-occurrence data from the ICE-GB. I will conclude that DP is at least as discriminatory as most if not all other existing measures, but conceptually simpler and even better in some respects.
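The abstract does not spell out how DP is computed. As a rough illustration only, the sketch below assumes the definition usually given for DP (deviation of proportions): half the sum, over all corpus parts, of the absolute difference between the share of the word's occurrences found in a part and that part's share of the whole corpus. The function name, the four-part toy corpus, and the counts are hypothetical and are not taken from the article.

```python
def deviation_of_proportions(part_sizes, word_counts):
    """Return DP for one word, assuming the usual definition:
    0.5 * sum(|observed share - expected share|) over corpus parts.

    part_sizes  -- number of tokens in each corpus part
    word_counts -- occurrences of the word in each corpus part
    """
    if len(part_sizes) != len(word_counts):
        raise ValueError("part_sizes and word_counts must have equal length")
    total_size = sum(part_sizes)
    total_count = sum(word_counts)
    if total_size == 0 or total_count == 0:
        raise ValueError("corpus and word frequency must both be non-zero")

    # Expected share: how much of the corpus each part makes up.
    expected = [size / total_size for size in part_sizes]
    # Observed share: how much of the word's occurrences each part holds.
    observed = [count / total_count for count in word_counts]

    return 0.5 * sum(abs(o - e) for o, e in zip(observed, expected))


# Hypothetical toy corpus: four equally sized parts, two words that both
# occur 20 times overall but are distributed very differently.
parts = [1000, 1000, 1000, 1000]
even_word = [5, 5, 5, 5]      # spread evenly across the parts
clumped_word = [20, 0, 0, 0]  # confined to a single part

dp_even = deviation_of_proportions(parts, even_word)
dp_clumped = deviation_of_proportions(parts, clumped_word)
print(dp_even, dp_clumped)
```

Under this assumed definition, values near 0 indicate a word distributed in proportion to the part sizes, and larger values indicate a clumpier distribution. The two toy words have identical overall frequencies, which is the situation in which raw frequency alone would be misleading.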

Similar resources

Measures of dispersion for corpus data: an overview, a suggestion, and a research program II

In order to adjust observed frequencies of occurrence, previous studies have suggested a variety of measures of dispersion and adjusted frequencies. In part I of this article, I first summarily reviewed many of these measures as well as a variety of their shortcomings and then suggested an alternative measure, DP, for deviation of proportions, which I argued to be conceptually simpler, but at t...

An Empirical Comparison of Distance Measures for Multivariate Time Series Clustering

Multivariate time series (MTS) data are ubiquitous in science and daily life, and measuring their similarity is a core part of the MTS analysis process. Many research efforts in this context have focused on proposing novel similarity measures for the underlying data. However, with the countless techniques to estimate similarity between MTS, this field suffers from a lack of comparative...

Developing a Corpus-Based Word List in Pharmacy Research Articles: A Focus on Academic Culture

The present corpus-based lexical study reports the development of a Pharmacy Academic Word List (PAWL), a list of the most frequent words from a corpus of 3,458,445 tokens made up of the 800 most recent pharmacy texts, including research articles, review articles, and short communications in four sub-disciplines of pharmacy. WordSmith (Scott, 2017) and AntWordProfiler (Anthony, 2014) were used to sc...

The Influence of Data-Driven Exercises Through Using a Computer Program on Vocabulary Improvement in an EFL Context

The present study was conducted to evaluate data-driven learning (DDL) combined with Computer Assisted Language Learning (CALL) as an approach to improving the vocabulary knowledge of Iranian postgraduates majoring in teaching English, English literature, and translation. The purpose was to help language learners get familiar with DDL as a student-centered method taking advantage of a computer progr...

Publication date: 2007