corpus linguistic

Languages under the influence: Building a database of Uralic languages

2017

Eszter Simon, Nikolett Mus

For most of the Uralic languages, there is a lack of systematically collected, consistently transcribed and morphologically annotated text corpora. This paper sums up the steps, the preliminary results and the future directions of building a linguistic corpus of some Uralic languages, namely Tundra Nenets, Udmurt, Synya Khanty, and Surgut Khanty. The experiences of building a corpus containing ...

Linguistic Profiling for Author Recognition and Verification

2017

Hans van Halteren

A new technique is introduced, linguistic profiling, in which large numbers of counts of linguistic features are used as a text profile, which can then be compared to average profiles for groups of texts. The technique proves to be quite effective for authorship verification and recognition. The best parameter settings yield a False Accept Rate of 8.1% at a False Reject Rate equal to zero f...

CHAPTER 10 Statistical Measures for Usage-Based Linguistics

2015

Stefan Th. Gries, Nick C. Ellis

The advent of usage-/exemplar-based approaches has resulted in a major change in the theoretical landscape of linguistics, but also in the range of methodologies that are brought to bear on the study of language acquisition/learning, structure, and use. In particular, methods from corpus linguistics are now frequently used to study distributional characteristics of linguistic units and what th...

Can Corpus Based Measures be Used for Comparative Study of Languages?

2007

Anil Kumar Singh, Harshit Surana

Quantitative measurement of inter-language distance is a useful technique for studying diachronic and synchronic relations between languages. Such measures have been used successfully for purposes like deriving language taxonomies and language reconstruction, but they have mostly been applied to handcrafted word lists. Can we instead use corpus based measures for comparative study of languages?...

How to Avoid Burning Ducks: Combining Linguistic Analysis and Corpus Statistics for German Compound Processing

2010

Fabienne Fritzinger, Alexander Fraser

Compound splitting is an important problem in many NLP applications which must be solved in order to address issues of data sparsity. Previous work has shown that linguistic approaches for German compound splitting produce a correct splitting more often, but corpus-driven approaches work best for phrase-based statistical machine translation from German to English, a worrisome contradiction. We ...

The Inner Circle vs. the Outer Circle or British English vs. American English

2016

Yong-Hun Lee, Ki-Suk Jun

In this paper, the use of two modals (can and may) in four varieties of English (British, Indian, Philippine, and American) was compared and the characteristics of each variety were statistically analyzed. After all the sample sentences were extracted from each component of the ICE corpus, a total of twenty linguistic factors were encoded. Then, the collected data were statistically analyzed with R....

Indexing and Querying Linguistic Metadata and Document Content

2005

Niraj Aswani, Valentin Tablan, Hamish Cunningham

The need for efficient corpus indexing and querying arises frequently both in machine learning-based and human-engineered natural language processing systems. This paper presents the ANNIC system, which can index documents not only by content, but also by their linguistic annotations and features. It also enables users to formulate versatile queries mixing keywords and linguistic information...

STC-TIMIT: Generation of a Single-channel Telephone Corpus

2008

Nicolás Morales, Javier Tejedor, Javier Garrido Salas, José Colás Pasamontes, Doroteo Torre Toledano

This paper describes a new speech corpus, STC-TIMIT, and discusses its design, development, and distribution through LDC. The STC-TIMIT corpus is derived from the widely used TIMIT corpus by sending it through a real and single telephone channel. TIMIT is phonetically balanced, covers the dialectal diversity of the continental USA and has been extensively used as a benchmark for speec...

Multiple Annotations of Reusable Data Resources: Corpora for Topic Detection and Tracking

2000

Christopher Cieri

Responding to demands for very large, easily accessible, reusable news corpora to support research in the topic detection and tracking paradigm, the Linguistic Data Consortium created the TDT corpora. In addition to supporting research in the Topic Detection and Tracking program, the TDT corpora were collected and annotated with an eye toward reuse and re-annotation. Their value is confirmed in...

LEA - Linguistic Exercises with Annotation Tools

2017

Fabian Barteld, Johanna Flick

In this paper we present LEA (Linguistic Exercises with Annotation tools). LEA is a new didactic concept helping students to become familiar with corpus linguistic methods and annotation tools. The main idea behind LEA is that classical linguistic exercises are being solved with annotation tools. We will present the advantages of this method (e.g. didactic benefits, automatic correction) and de...
