corpus linguistic

A Tangled Web: The Faint Signals of Deception in Text - Boulder Lies and Truth Corpus (BLT-C)

2016

Franco Salvetti, John B. Lowe, James H. Martin

We present an approach to creating corpora for use in detecting deception in text, including a discussion of the challenges peculiar to this task. Our approach is based on soliciting several types of reviews from writers and was implemented using Amazon Mechanical Turk. We describe the multi-dimensional corpus of reviews built using this approach, available free of charge from LDC as the Boulde...

SINICA CORPUS : Design Methodology for Balanced Corpora

1996

Keh-Jiann Chen, Chu-Ren Huang, Li-Ping Chang, Hui-Li Hsu

The Academia Sinica Balanced Corpus (Sinica Corpus) is the first balanced Chinese corpus with part-of-speech tagging. The corpus (Sinica 2.0) is open to the research community through the WWW (http://www.sinica.edu.tw/ftms-bin/kiwi.sh). The current size of the corpus is 3.5 million words, and the immediate expansion target is five million words. Each text in the corpus is classified and marked acco...

Finding Hedges by Chasing Weasels: Hedge Detection Using Wikipedia Tags and Shallow Linguistic Features

2009

Viola Ganter, Michael Strube

We investigate the automatic detection of sentences containing linguistic hedges using corpus statistics and syntactic patterns. We take Wikipedia as an already annotated corpus using its tagged weasel words which mark sentences and phrases as non-factual. We evaluate the quality of Wikipedia as training data for hedge detection, as well as shallow linguistic features.

Application of a Corpus to Identify Gaps between English Learners and Native Speakers

2015

Katsunori Kotani, Takehiko Yoshimi

In order to develop effective computer-assisted language teaching systems for learners of English as a foreign language, it is first necessary to identify gaps between learners and native speakers in the four basic linguistic skills (reading, writing, pronunciation, and listening). To identify these gaps, the accuracy and fluency in language use between learners and native speakers should be com...

Network of Data Centres (NetDC): BNSC - An Arabic Broadcast News Speech Corpus

2004

Khalid Choukri, Mahtab Nikkhou, Niklas Paulsson

Broadcast news is a very rich source of Language Resources that has been exploited to develop and assess a large set of Human Language Technologies. Some examples include systems to: automatically produce text transcriptions of spoken data; identify the language of a text; translate a text from one language to another; identify topics in the news and retrieve all stories discussing a target top...

Corpus, Lexicon, and Construction: A Quantitative Corpus Approach to Mandarin Possessive Construction

Journal: IJCLCLP 2009

Cheng-Hsien Chen

Taking Mandarin Possessive Construction (MPC) as an example, the present study investigates the relation between lexicon and constructional schemas in a quantitative corpus linguistic approach. We argue that the wide use of raw frequency distribution in traditional corpus linguistic studies may undermine the validity of the results and reduce the possibility for interdisciplinary communication....

ZAC.PB: An Annotated Corpus for Zero Anaphora Resolution in Portuguese

2009

Simone Pereira

This paper describes the methodology adopted in the construction of an annotated corpus for the study of zero anaphora in Portuguese, the ZAC corpus. To our knowledge, no such corpus exists at this time for the Portuguese language. The purpose of this linguistic resource is to promote the use of automatic discovery of linguistic parameters for anaphora resolution systems. Because of the complex...

A Corpus-Based Study on Mapping Principles of Metaphors in Politics

2003

Shu-Ping Gong

This study proposes a corpus-based method to generate Mapping Principle of metaphors. In particular, Ahrens's (2002) Mapping Principle in the Conceptual Mapping Model (CM model) is based solely on native speakers' intuition rather than on analysis of large-scale linguistic data. In order to provide more convincing evidence to support the CM model, we adopt the corpus method to extract the me...

The American National Corpus: A Standardized Resource for American English

2000

Catherine Macleod, Nancy Ide, Ralph Grishman

Linguistic research has become heavily reliant on text corpora over the past ten years. Such resources are becoming increasingly available through efforts such as the Linguistic Data Consortium (LDC) in the US and the European Language Resources Association (ELRA) in Europe. However, in the main the corpora that are gathered and distributed through these and other mechanisms consist of texts wh...

MACROPHONE: An American English Telephone Speech Corpus

1994

Kelsey Taussig, Jared Bernstein

Macrophone is a corpus of approximately 200,000 utterances, recorded over the telephone from a broad sample of about 5,000 American speakers. Sponsored by the Linguistic Data Consortium (LDC), it is the first of a series of similar data sets that will be collected for major languages of the world in a cooperative project called Polyphone. It is designed to provide telephone speech suitable for t...