Measures of dispersion for corpus data: an overview, a suggestion, and a research program I
Abstract
The most frequently used statistics in corpus linguistics are the frequency of occurrence of some linguistic variable and the frequency of co-occurrence of two or more linguistic variables. However, as has been pointed out correctly and repeatedly, frequencies of (co-)occurrence in isolation may sometimes be severely misleading, given that they alone do not take into consideration the degree of dispersion of the linguistic variable in question. In order to handle such problems, several scholars have suggested a variety of dispersion measures and adjusted frequency measures. Unfortunately, however, few scholars appear to be aware of this issue, and measures of dispersion are apparently not widely known and even less widely used. Another unfortunate aspect of such measures is that many of them also come with a variety of problems. I pursue three objectives with this article. First, in order to raise awareness of this thematic complex and make the available measures more widely known, I present an overview of a large number of dispersion measures and adjusted frequency measures (including some more recent measures that have not yet found their way into research articles and textbooks) and summarily discuss some of their advantages and disadvantages. Second, I propose for discussion a conceptually very simple alternative measure, DP, explain and exemplify its properties, and compare it to the more or less established measures on the basis of fictitious distributions from the literature, word frequencies from the BNC Sampler, and co-occurrence data from the ICE-GB. I will conclude that DP is at least as discriminatory as most if not all other existing measures, but conceptually simpler and even better in some respects.
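The abstract does not spell out how DP is computed. As a minimal sketch, assuming the commonly cited definition of DP (half the sum of the absolute differences between the share of a word's occurrences found in each corpus part and that part's share of the whole corpus), the calculation might look as follows. The function name `deviation_of_proportions` and its two inputs are illustrative choices, not taken from the article.

```python
from typing import Sequence


def deviation_of_proportions(
    freqs: Sequence[float], part_sizes: Sequence[float]
) -> float:
    """Sketch of DP under the assumed definition described above.

    freqs      -- occurrences of the word in each corpus part
    part_sizes -- size (e.g. in tokens) of each corpus part
    Values near 0 indicate an even spread across parts; values near 1
    indicate that the word is concentrated in a small share of the corpus.
    """
    if len(freqs) != len(part_sizes):
        raise ValueError("freqs and part_sizes must have the same length")
    total_freq = sum(freqs)
    total_size = sum(part_sizes)
    if total_freq <= 0 or total_size <= 0:
        raise ValueError("totals must be positive")
    # Observed share of the word per part vs. expected share (part size).
    return 0.5 * sum(
        abs(f / total_freq - s / total_size)
        for f, s in zip(freqs, part_sizes)
    )


# Hypothetical counts: a word spread evenly over three equal parts,
# and a word that occurs in only one of them.
even = deviation_of_proportions([10, 10, 10], [1000, 1000, 1000])
clumped = deviation_of_proportions([30, 0, 0], [1000, 1000, 1000])
```

Both example words have the same overall frequency (30 occurrences), yet the first yields a DP of 0 and the second a much higher value, which is the kind of contrast raw frequencies alone cannot show.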
Similar resources
Measures of dispersion for corpus data: an overview, a suggestion, and a research program II
In order to adjust observed frequencies of occurrence, previous studies have suggested a variety of measures of dispersion and adjusted frequencies. In part I of this article, I first summarily reviewed many of these measures as well as a variety of their shortcomings and then suggested an alternative measure, DP, for deviation of proportions, which I argued to be conceptually simpler, but at t...
An Empirical Comparison of Distance Measures for Multivariate Time Series Clustering
Multivariate time series (MTS) data are ubiquitous in science and daily life, and how to measure their similarity is a core part of the MTS analysis process. Many of the research efforts in this context have focused on proposing novel similarity measures for the underlying data. However, with the countless techniques to estimate similarity between MTS, this field suffers from a lack of comparative...
Developing a Corpus-Based Word List in Pharmacy Research Articles: A Focus on Academic Culture
The present corpus-based lexical study reports the development of a Pharmacy Academic Word List (PAWL), a list of the most frequent words from a corpus of 3,458,445 tokens made up of the 800 most recent pharmacy texts, including research articles, review articles, and short communications in four sub-disciplines of pharmacy. WordSmith (Scott, 2017) and AntWordProfiler (Anthony, 2014) were used to sc...
The Influence of Data-Driven Exercises Through Using a Computer Program on Vocabulary Improvement in an EFL Context
The present study was conducted to evaluate data-driven learning (DDL) combined with Computer Assisted Language Learning (CALL) as an approach to improving the vocabulary knowledge of Iranian postgraduates majoring in teaching English, English literature, and translation. The purpose was to help language learners get familiar with DDL as a student-centered method taking advantage of a computer progr...