text coverage

The importance of precise tokenizing for deep grammars

2006

Martin Forst, Ronald M. Kaplan

We present a non-deterministic finite-state transducer that acts as a tokenizer and normalizer for free text that is input to a broad-coverage LFG of German. We compare the basic tokenizer used in an earlier version of the grammar and the more sophisticated tokenizer that we now use. The revised tokenizer increases the coverage of the grammar in terms of full parses from 68.3% to 73.4% on sente...

Coarse Lexical Semantic Annotation with Supersenses: An Arabic Case Study

2012

Nathan Schneider, Behrang Mohit, Kemal Oflazer, Noah A. Smith

“Lightweight” semantic annotation of text calls for a simple representation, ideally without requiring a semantic lexicon to achieve good coverage in the language and domain. In this paper, we repurpose WordNet’s supersense tags for annotation, developing specific guidelines for nominal expressions and applying them to Arabic Wikipedia articles in four topical domains. The resulting corpus has ...

Using Directed Graph Based BDMM Algorithm for Chinese Word Segmentation

2005

Yaodong Chen, Ting Wang, Huowang Chen

Word segmentation is a key problem for Chinese text analysis. In this paper, with the consideration of both word-coverage rate and sentence-coverage rate, based on the classic Bi-Directed Maximum Match (BDMM) segmentation method, a character Directed Graph with ambiguity mark is designed for searching multiple possible segmentation sequences. This method is compared with the classic Maximum Matc...

Names and Similarities on the Web: Fact Extraction in the Fast Lane

2006

Marius Pasca, Dekang Lin, Jeffrey Bigham, Andrei Lifchits, Alpa Jain

In a new approach to large-scale extraction of facts from unstructured text, distributional similarities become an integral part of both the iterative acquisition of high-coverage contextual extraction patterns, and the validation and ranking of candidate facts. The evaluation measures the quality and coverage of facts extracted from one hundred million Web documents, starting from ten seed fac...

Printer Modeling for Document Imaging

2004

Margaret Norris, Elisa H. Barney Smith

The microscopic details of printing often are unnoticed by humans, but can make differences that affect machine recognition of printed text. Models of the defects introduced into images by printing can be used to improve machine recognition. A probabilistic model used to generate images showing toner placement bears similarities to actual printed images. An equation derived for the average cove...

LOTUS: Linked Open Text UnleaShed

2015

Filip Ilievski, Wouter Beek, Marieke van Erp, Laurens Rietveld, Stefan Schlobach

It is difficult to find resources on the Semantic Web today, in particular if one wants to search for resources based on natural language keywords and across multiple datasets. In this paper, we present LOTUS: Linked Open Text UnleaShed, a full-text lookup index over a huge Linked Open Data collection. We detail LOTUS’ approach, its implementation, its coverage, and demonstrate the ease with wh...

Text-based Legal Ontology Enrichment

2009

Wim Peters

The acquisition of knowledge from text is an incomplete and incremental process. When anchored to a particular knowledge model it provides potentially useful information to the legal expert in the form of new concepts and relations, in order to improve the domain coverage. This paper explores the feasibility of various legal text-based ontology enrichment techniques, and discusses the transform...

Bootstrapping Statistical Processing Into A Rule-Based Natural Language Parser

1994

Stephen D. Richardson

This paper describes a "bootstrapping" method which uses a broad-coverage, rule-based parser to compute probabilities while parsing an untagged corpus of NL text, and which then incorporates those probabilities into the processing of the same parser as it analyzes new text. Results are reported which show that this method can significantly improve the speed and accuracy of the parser without re...

Towards Generating Text from Discourse Representation Structures

2011

Valerio Basile, Johan Bos

We argue that Discourse Representation Structures form a suitable level of language-neutral meaning representation for micro planning and surface realisation. DRSs can be viewed as the output of macro planning, and form the rough plan and structure for generating a text. We present the first ideas of building a large DRS corpus that enables the development of broad-coverage, robust text generato...

An Intermediate Representation for the Interpretation of Temporal Expressions

2006

Pawel P. Mazur, Robert Dale

The interpretation of temporal expressions in text is an important constituent task for many practical natural language processing tasks, including question-answering, information extraction and text summarisation. Although temporal expressions have long been studied in the research literature, it is only more recently, with the impetus provided by exercises like the ACE Program, that attention...
