paper based texts

Identifying and Tagging Titles in Web Texts

2008

Clémentine Adam Estelle Delpech Patrick Saint-Dizier

In this paper, we present an analysis based on linguistic and typographic features that allows for the identification of titles in web documents. We focus in particular on procedural texts. Identifying texts is a difficult task because ways of encoding them are very diverse. A number of titles are also incomplete because of context; we also propose a way to retrieve the missing elements, in par...


Towards an Error Correction Memory to Enhance Technical Texts Authoring in LELIE

2014

Juyeon Kang Patrick Saint-Dizier

In this paper, we investigate and experiment with the notion of error correction memory applied to error correction in technical texts. The main purpose is to induce relatively generic correction patterns associated with more contextual correction recommendations, based on previously memorized and analyzed corrections. The notion of error correction memory is developed within the framework of the LE...


Graph-Structures Matching for Review Relevance Identification

2013

Lakshmi Ramachandran Edward F. Gehringer

Review quality is determined by identifying the relevance of a review to a submission (the article or paper the review was written for). We identify relevance in terms of the semantic and syntactic similarities between two texts. We use a word order graph, whose vertices, edges and double edges help determine structure-based match across texts. We use WordNet to determine semantic relatedness. ...


Building a Discourse-Annotated Dutch Text Corpus

2011

Nynke van der Vliet Ildikó Berzlánovich Gosse Bouma Markus Egg Gisela Redeker

We are compiling a corpus of Dutch texts annotated with discourse structure and lexical cohesion, containing initially 80 texts from expository and persuasive genres. We are using this resource for corpus-based studies of discourse relations, discourse markers, cohesion, and genre differences. We are also exploring the possibilities of automatic text segmentation and semi-automatic discourse an...


The Horizontal Segmentation of Lines in Chinese Handwritten Texts Based on the Intervals (Distances) in Fuzzy Triangles

2013

Hossein Kardan Moghaddam

The horizontal segmentation of handwritten text lines is a key step in detecting handwritten texts that are slanted. In this paper, a novel method based on fuzzy triangles is proposed to bring together and connect the text lines. The proposed method has been tested on data banks in Chinese languages. In the experiments on the Chinese handwritten texts, a performance of 94.53% was obtained. Abbrev...


Reading Multimodal Texts in the 21st Century

2012

Frank Serafini

Discussions concerning which literacy skills will be required of students in the 21st century have appeared in numerous educational publications recently and have been greeted with mixed reactions (Bellanca & Brandt, 2010; Trilling & Fadel, 2009). It has been proposed that the skills necessary to be a literate citizen in the new millennium have expanded from simply being able to read and write ...


The Role of Semantic Relations in Improving the Results of a Citation Recommendation System - Selected Paper of the 17th National Conference of the Computer Society of Iran

Journal: Soft Computing (محاسبات نرم) 2013

Fattane Zarrinkalam, Mohsen Kahani

With the increasing growth of scientific documents on the Web, it is difficult to select a relevant document. A citation recommendation system receives a text and recommends documents to be cited by the text. Such recommendations help a researcher find relevant texts. Based on semantic relations, this paper presents a new indicator to measure the similarity between documents an...


Automatic Acquisition of Parallel Corpora from Websites with Dynamic Content

2010

Yulia Tsvetkov Shuly Wintner

Parallel corpora are indispensable resources for a variety of multilingual natural language processing tasks. This paper presents a technique for fully automatic construction of constantly growing parallel corpora. We propose a simple and effective dictionary-based algorithm to extract parallel document pairs from a large collection of articles retrieved from the Internet, potentially containin...


Finding Answers in Large Collections of Texts: Paragraph Indexing + Abductive Inference

2002

Sanda M. Harabagiu Steven J. Maiorano

This paper describes a methodology for answering questions by using information retrieved from very large collections of texts. We argue that combinations of information retrieval and extraction techniques cannot be used, due to the open-domain nature of the task. We propose a solution based on indexing techniques that identify paragraphs from texts where the answers can be found. The validity ...


(German) Language Processing for Lucene

2015

Bastian Entrup

This paper introduces an open-source Java package called German Language Processing for Lucene (glp4lucene). Although it was originally developed to work with German texts, it is to a large degree language independent. It aims at facilitating four language processing steps for working with non-English texts and Apache Lucene/Solr: lemmatizing words, weighting terms based on their part-of-speech...
