linguistic corpus

Linguistic Knowledge Acquisition from Parsing Failures

1993

Masaki Kiyono Jun'ichi Tsujii

A semi-automatic procedure of linguistic knowledge acquisition is proposed, which combines corpus-based techniques with the conventional rule-based approach. The rule-based component generates all the possible hypotheses of defects which the existing linguistic knowledge might contain, when it fails to parse a sentence. The rule-based component does not try to identify the defects, but generate...


From Dependencies to Constituents in the Reference Corpus for the Processing of Basque (EPEC)

Journal: Procesamiento del Lenguaje Natural 2008

Arantza Díaz de Ilarraza Enrique Fernández-Terrones Izaskun Aldezabal María Jesús Aranzabe

In this paper the process for turning a dependency-based corpus into a constituent-based one is explained. For this purpose, first both the Dependency and the Constituent formalism are analyzed and then the corresponding equivalences of linguistic phenomena are treated. This process has had different phases in which the linguistic equivalences have been improved. Finally, the evaluation process is...


Useful statistics for corpus linguistics

2009

Stefan Th. Gries

• frequencies of occurrence of linguistic elements, which can be studied from two different perspectives: o how frequent are morphemes or words or patterns/constructions in (parts of) a corpus? This information can be provided in various different forms of frequency lists; o how evenly are morphemes or words or patterns/constructions distributed across (parts of) a corpus? This information can ...
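The two perspectives named in this abstract, how frequent an element is and how evenly it is spread across corpus parts, can be illustrated with a minimal sketch. The toy corpus, the function names, and the choice of dispersion measure are assumptions made for illustration and are not taken from the abstract: dispersion is computed here as a deviation of proportions, meaning half the summed absolute difference between each part's share of the word's occurrences and that part's share of the corpus, where 0 means a perfectly even spread.

```python
from collections import Counter


def frequency_list(parts):
    """Count how often each token occurs across all corpus parts."""
    counts = Counter()
    for tokens in parts:
        counts.update(tokens)
    return counts


def dispersion_dp(word, parts):
    """Deviation of proportions for `word` across corpus parts.

    Returns a value near 0 for an even spread and closer to 1 for a
    word concentrated in a small part of the corpus. Returns None if
    the corpus is empty or the word never occurs.
    """
    total_tokens = sum(len(tokens) for tokens in parts)
    word_freq = sum(tokens.count(word) for tokens in parts)
    if total_tokens == 0 or word_freq == 0:
        return None
    return 0.5 * sum(
        abs(tokens.count(word) / word_freq - len(tokens) / total_tokens)
        for tokens in parts
    )


# Hypothetical toy corpus split into three parts (illustrative only).
parts = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "a cat and a dog".split(),
]

freqs = frequency_list(parts)
print(freqs.most_common(3))
print(dispersion_dp("the", parts))
```

A frequency list answers the first question; the dispersion value answers the second, since two words with the same overall frequency can differ sharply in how evenly they occur across parts.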


The development of a web corpus of Hindi language and corpus-based comparative studies to Japanese

2016

Miki Nishioka Shiro Akasegawa

In this paper, we discuss our creation of a web corpus of spoken Hindi (COSH), one of the Indo-Aryan languages spoken mainly in the Indian subcontinent. We also point out notable problems we’ve encountered in the web corpus and the special concordancer. After observing the kind of technical problems we encountered, especially regarding annotation tagged by Shiva Reddy’s tagger, we argue how the...


Collocational Translation Memory Extraction Based on Statistical and Linguistic Information

Journal: IJCLCLP 2004

Jia-Yan Jian Yu-Chia Chang Jason S. Chang

In this paper, we propose a new method for extracting bilingual collocations from a parallel corpus to provide phrasal translation memories. The method integrates statistical and linguistic information to achieve effective extraction of bilingual collocations. The linguistic information includes parts of speech, chunks, and clauses. The method involves first obtaining an extended list of Englis...


Why and how to control the authentic emotional speech corpora

2003

Véronique Aubergé Nicolas Audibert Albert Rilliard

Affects are expressed at different levels of speech: metalinguistic (expressiveness) and linguistic (attitudes), both anchored in the “linguistic time”, and para-linguistic (expressions of emotion), which is anchored in the timing of the emotional causes. In an experimental approach, corpora are the basis of analysis. Most emotional corpora have been produced by acting/eliciting speakers on one sid...


Corpus-based pronunciation variation rule analysis for Singapore English

2015

Wenda Chen Nancy F. Chen Boon Pang Lim Bin Ma

In this paper, we evaluate a set of linguistic rules for pronunciation variations in Singapore English. We collect and annotate a speech corpus for Singapore English and label it with IPA narrow transcriptions. Data driven pronunciation rules are derived using American English (Buckeye corpus) as a reference. We compare the data driven rules with linguistic rules proposed by phoneticians, and f...


Inference in the Resolution of Ellipsis.

1995

I Lewin S G Pulman

We discuss the treatment of ellipsis in a spoken language route planning enquiry service which uses the Core Language Engine (CLE) as its linguistic processor. We show how use of the CLE allows us to separate the interpretation of ellipsis in a dialogue context from the more general issue of dialogue management in a dialogue context and, especially, to factor out the linguistic influences on suc...


Web as Huge Information Source for Noun Phrases Integration in the Information Retrieval Process

2002

Mathias Géry Dominique Vaufreydaz

Web is a rich and diversified source of information. In this article, we propose to benefit from this richness to collect and analyze documents, with the aim of a relational indexation based on noun phrases. Proposed data processing chain includes a spider collecting data to build textual corpora, and a linguistic module analyzing text to extract information. Comparison of obtained corpus with ...


CINTIL DependencyBank PREMIUM - A Corpus of Grammatical Dependencies for Portuguese

2016

Rita de Carvalho Andreia Querido Marisa Campos Rita Valadas Pereira João Ricardo Silva António Branco

This paper presents a new linguistic resource for the study and computational processing of Portuguese. CINTIL DependencyBank PREMIUM is a corpus of Portuguese news text, accurately manually annotated with a wide range of linguistic information (morpho-syntax, named-entities, syntactic function and semantic roles), making it an invaluable resource specially for the development and evaluation of...
