text linguistic

The RATS Collection: Supporting HLT Research with Degraded Audio Data

2014

David Graff, Kevin Walker, Stephanie Strassel, Xiaoyi Ma, Karen Jones, Ann Sawyer

The DARPA RATS program was established to foster development of language technology systems that can perform well on speaker-to-speaker communications over radio channels that evince a wide range in the type and extent of signal variability and acoustic degradation. Creating suitable corpora to address this need poses an equally wide range of challenges for the collection, annotation and qualit...


Linguistic Resources for Entity Linking Evaluation: from Monolingual to Cross-lingual

2012

Xuansong Li, Stephanie Strassel, Heng Ji, Kira Griffitt, Joe Ellis

To advance information extraction and question answering technologies toward a more realistic path, the U.S. NIST (National Institute of Standards and Technology) initiated the KBP (Knowledge Base Population) task as one of the TAC (Text Analysis Conference) evaluation tracks. It aims to encourage research in automatic information extraction of named entities from unstructured texts with the ul...


Exploring Lexical Patterns in Text: Lexical Cohesion Analysis with WordNet

2005

Elke Teich, Peter Fankhauser

We present a system for the linguistic exploration and analysis of lexical cohesion in English texts. Using an electronic thesaurus-like resource, Princeton WordNet, and the Brown Corpus of English, we have implemented a process of annotating text with lexical chains and a graphical user interface for inspection of the annotated text. We describe the system and report on some sample linguistic ...


Comparing Translated Texts and Original Texts: Testing the Simplification Hypothesis in the Translation of Comparable Technical Texts

Thesis: Ministry of Science, Research and Technology - Sheikh Bahaei University - Faculty of Foreign Languages, 1391 (Solar Hijri calendar)

Narjes Tabibi, Mohammad Reza Talebinejad

Simplification, as a universal feature of translation, means that translated texts tend to use simpler language than original texts in the same language. It can be critically investigated through three common measures: type/token ratio, lexical density, and mean sentence length. Although steps have been taken to test this hypothesis in various text types in different linguistic communities, in ...
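The three measures named in this abstract can be illustrated with a minimal sketch. It is not the thesis authors' procedure: the function name `simplification_measures`, the regex tokenizer, and the tiny `FUNCTION_WORDS` list are all assumptions made for illustration. Lexical density is approximated here as the share of tokens that are not in that function-word list.

```python
import re

# Hypothetical, deliberately small function-word list (an assumption for
# illustration); a real study would use a POS tagger or a fuller list.
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "of", "in", "on", "at", "to",
    "for", "with", "by", "is", "are", "was", "were", "be", "it", "this",
    "that", "as", "than",
}


def simplification_measures(text):
    """Return type/token ratio, lexical density, and mean sentence length."""
    # Naive sentence split on terminal punctuation; empty pieces are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased word tokens, allowing one internal apostrophe (e.g. "don't").
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens or not sentences:
        raise ValueError("text must contain at least one word")
    content_tokens = [t for t in tokens if t not in FUNCTION_WORDS]
    return {
        # Distinct word forms divided by total word tokens.
        "type_token_ratio": len(set(tokens)) / len(tokens),
        # Content-word tokens divided by total word tokens.
        "lexical_density": len(content_tokens) / len(tokens),
        # Word tokens per sentence.
        "mean_sentence_length": len(tokens) / len(sentences),
    }


if __name__ == "__main__":
    print(simplification_measures("The cat sat. The cat ran."))
```

Under the simplification hypothesis, a translated text would be expected to score lower on all three measures than a comparable original text; note that type/token ratio is sensitive to text length, so the compared samples should be of similar size.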


Modeling Comma Placement in Chinese Text for Better Readability using Linguistic Features and Gaze Information

2013

Tadayoshi Hara, Chen Chen, Yoshinobu Kano, Akiko Aizawa

Comma placements in Chinese text are relatively arbitrary although there are some syntactic guidelines for them. In this research, we attempt to improve the readability of text by optimizing comma placements through integration of linguistic features of text and gaze features of readers. We design a comma predictor for general Chinese text based on conditional random field models with linguisti...


MACROPHONE: An American English Telephone Speech Corpus

1994

Kelsey Taussig, Jared Bernstein

Macrophone is a corpus of approximately 200,000 utterances, recorded over the telephone from a broad sample of about 5,000 American speakers. Sponsored by the Linguistic Data Consortium (LDC), it is the first of a series of similar data sets that will be collected for major languages of the world in a cooperative project called Polyphone. It is designed to provide telephone speech suitable for t...


Call My Net Corpus: A Multilingual Corpus for Evaluation of Speaker Recognition Technology

2017

Karen Jones, Stephanie Strassel, Kevin Walker, David Graff, Jonathan Wright

The Call My Net 2015 (CMN15) corpus presents a new resource for Speaker Recognition Evaluation and related technologies. The corpus includes conversational telephone speech recordings for a total of 220 speakers spanning 4 languages: Tagalog, Cantonese, Mandarin and Cebuano. The corpus includes 10 calls per speaker made under a variety of noise conditions. Calls were manually audited for langua...


Developing a Deep Linguistic Databank Supporting a Collection of Treebanks: the CINTIL DeepGramBank

2010

António Branco, Francisco Costa, João Ricardo Silva, Sara Silveira, Sérgio Castro, Mariana Avelãs, Clara Pinto, João Graça

Corpora of sentences annotated with grammatical information have been deployed by extending the basic lexical and morphological data with increasingly complex information, such as phrase constituency, syntactic functions, semantic roles, etc. As these corpora grow in size and the linguistic information to be encoded reaches higher levels of sophistication, the utilization of annotation tools an...


Prague Czech-English Dependency Treebank. Syntactically Annotated Resources for Machine Translation

2004

Martin Čmejrek, Jan Cuřín, Jiří Havelka, Jan Hajič, Vladislav Kuboň

This paper introduces the Prague Czech-English Dependency Treebank (PCEDT), a new Czech-English parallel resource suitable for experiments in structural machine translation. We describe the process of building the core parts of the resources – a bilingual syntactically annotated corpus and translation dictionaries. A part of the Penn Treebank has been translated into Czech, the dependency annot...


The Open Linguistics Working Group: Developing the Linguistic Linked Open Data Cloud

2016

John P. McCrae, Christian Chiarcos, Francis Bond, Philipp Cimiano, Thierry Declerck, Gerard de Melo, Jorge Gracia, Sebastian Hellmann, Bettina Klimek, Steven Moran, Petya Osenova, Antonio Pareja-Lora, Jonathan Pool

The Open Linguistics Working Group (OWLG) brings together researchers from various fields of linguistics, natural language processing, and information technology to present and discuss principles, case studies, and best practices for representing, publishing and linking linguistic data collections. A major outcome of our work is the Linguistic Linked Open Data (LLOD) cloud, an LOD (sub-)cloud o...
