N-grams

Tweets Classification using Corpus Dependent Tags, Character and POS N-grams

2015

Carlos E. González-Gallardo, Azucena Montes Rendón, Gerardo Sierra, J. Antonio Nuñez-Juárez, Adolfo Jonathan Salinas-López, Juan Ek

This paper is part of the Author Profiling task at the PAN 2015 contest, in which participants had to predict the gender, age and personality traits of Twitter users in four different languages (Spanish, English, Italian and Dutch). Our approach takes into account stylistic features represented by character N-grams and POS N-grams to classify tweets. The main idea of using character N-grams is to ext...
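
As a rough illustration of the character N-gram features mentioned above, the following minimal Python sketch counts overlapping character N-grams in a short text. The function name `char_ngrams` and the sample tweet are hypothetical, and the sketch does not reproduce the authors' corpus-dependent tags or POS features.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count every overlapping character n-gram in `text`.

    Hypothetical helper for illustration only; a text shorter than n
    characters yields an empty Counter.
    """
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# Made-up sample tweet; character trigrams keep stylistic cues such as
# repeated letters and emoticons that word-level tokens tend to lose.
trigram_counts = char_ngrams("so goood :)", 3)
```

Counts of this kind would typically be turned into a fixed-length feature vector before being passed to a classifier; which classifier the authors used is not stated in the excerpt.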

Handwriting Recognition Using Position Sensitive Letter N-Gram Matching

2003

Adnan El-Nasan, Sriharsha Veeramachaneni, George Nagy

We propose a further improvement of a handwriting recognition method that avoids segmentation while still being able to recognize words that were never seen before in handwritten form. This method is based on the fact that few pairs of English words share exactly the same set of letter bigrams and even fewer share longer n-grams. The lexical n-gram matches between every word in a lexicon and a set of referen...
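
The observation that few word pairs share exactly the same set of letter bigrams can be checked with a short Python sketch. The helper names and the toy lexicon below are assumptions made for illustration; the sketch ignores letter positions, so it is not the position-sensitive matching named in the title.

```python
def letter_bigrams(word: str) -> frozenset:
    """Return the set of adjacent letter pairs in `word` (case-insensitive)."""
    w = word.lower()
    return frozenset(w[i:i + 2] for i in range(len(w) - 1))


def words_sharing_bigram_sets(lexicon):
    """Group words whose bigram sets are identical; keep groups of 2 or more."""
    groups = {}
    for word in lexicon:
        groups.setdefault(letter_bigrams(word), []).append(word)
    return [group for group in groups.values() if len(group) > 1]


# Toy lexicon (hypothetical): "banana" and "bana" both reduce to {ba, an, na}.
collisions = words_sharing_bigram_sets(["banana", "bana", "cat", "act"])
```

Running the same grouping over a real word list would show how rare such collisions are, which is the property the method relies on.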

A tool to build a treebank for conversational Chinese

2000

Yves Lepage, Nicolas Auclerc, Satoshi Shirai

N-grams have been extensively used with phonemes or words as basic units in speech recognition. Recently, it has been proposed to use n-grams with phrase tree structures as units to increase speech recognition quality. In order to test this idea on Chinese, a treebank of Chinese hotel reservation conversation utterances is needed. Because no such treebank is yet available, we have to build it. ...

Distribution-Based Pruning of Backoff Language Models

2000

Jianfeng Gao, Kai-Fu Lee

We propose a distribution-based pruning of n-gram backoff language models. Instead of the conventional approach of pruning n-grams that are infrequent in training data, we prune n-grams that are likely to be infrequent in a new document. Our method is based on the n-gram distribution, i.e., the probability that an n-gram occurs in a new document. Experimental results show that our method performe...

Using N-gram based Features for Machine Translation System Combination

2009

Yong Zhao, Xiaodong He

Conventional confusion network based system combination for machine translation (MT) heavily relies on features that are based on the measure of agreement of words in different translation hypotheses. This paper presents two new features that consider agreement of n-grams in different hypotheses to improve the performance of system combination. The first one is based on a sentence specific onli...

Enhancing Authorship Attribution By Utilizing Syntax Tree Profiles

2014

Michael Tschuggnall, Günther Specht

The aim of modern authorship attribution approaches is to analyze known authors and to assign authorships to previously unseen and unlabeled text documents based on various features. In this paper we present a novel feature to enhance current attribution methods by analyzing the grammar of authors. To extract the feature, a syntax tree of each sentence of a document is calculated, which is then...

Automatically Assessing Whether a Text Is Cliched, with Applications to Literary Analysis

2013

Paul Cook, Graeme Hirst

Clichés, as trite expressions, are predominantly multiword expressions, but not all MWEs are clichés. We conduct a preliminary examination of the problem of determining how clichéd a text is, taken as a whole, by comparing it to a reference text with respect to the proportion of more-frequent n-grams, as measured in an external corpus. We find that more-frequent n-grams are over-represented in ...

Enhanced Twitter Sentiment Classification Using Contextual Information

2015

Soroush Vosoughi, Helen Zhou, Deb Roy

The rise in popularity and ubiquity of Twitter has made sentiment analysis of tweets an important and well-covered area of research. However, the 140-character limit imposed on tweets makes it hard to use standard linguistic methods for sentiment classification. On the other hand, what tweets lack in structure they make up for with sheer volume and rich metadata. This metadata includes geolocation,...

Author Verification Using Common N-Gram Profiles of Text Documents

2014

Magdalena Jankowska, Evangelos E. Milios, Vlado Keselj

Authorship verification is the problem of answering the question whether or not a sample text document was written by a specific person, given a few other documents known to be authored by them. We propose a proximity based method for one-class classification that applies the Common N-Gram (CNG) dissimilarity measure. The CNG dissimilarity (Kešelj et al., 2003) is based on the differences in th...
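
The excerpt is cut off before it defines the measure. As a hedged sketch, the Python below implements the form of the CNG dissimilarity usually attributed to Kešelj et al. (2003): a sum over the n-grams in either profile of the squared difference of normalized frequencies, each divided by their mean. The profile size, the n-gram length, and both function names are assumptions, and the one-class verification step the authors build on top is not shown.

```python
from collections import Counter


def profile(text: str, n: int = 3, size: int = 500) -> dict:
    """Map the `size` most frequent character n-grams to normalized frequencies."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.most_common(size)}


def cng_dissimilarity(p1: dict, p2: dict) -> float:
    """Sum of ((f1 - f2) / ((f1 + f2) / 2))**2 over n-grams in either profile.

    An n-gram missing from a profile counts as frequency 0, so it
    contributes the maximum per-term value of 4.
    """
    total = 0.0
    for gram in set(p1) | set(p2):
        f1, f2 = p1.get(gram, 0.0), p2.get(gram, 0.0)
        total += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return total
```

Identical profiles give a dissimilarity of 0, and fully disjoint profiles give 4 times the number of distinct n-grams across both.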

Detecting Text Reuse with Modified and Weighted N-grams

2012

Rao Muhammad Adeel Nawab, Mark Stevenson, Paul D. Clough

Text reuse is common in many scenarios and documents are often based, at least in part, on existing documents. This paper reports an approach to detecting text reuse which identifies not only documents which have been reused verbatim but is also designed to identify cases of reuse when the original has been rewritten. The approach identifies reuse by comparing word n-grams in documents and modi...
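
Comparing word n-grams between two documents can be sketched as follows. The excerpt does not say which overlap score the authors use, so the containment measure here is an assumption, as are the function names and example sentences; the modified and weighted n-grams of the title are not implemented.

```python
def word_ngrams(text: str, n: int) -> set:
    """Return the set of word n-grams (as tuples) in lower-cased `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(suspect: str, source: str, n: int = 3) -> float:
    """Fraction of the suspect's word n-grams that also occur in the source."""
    suspect_grams = word_ngrams(suspect, n)
    if not suspect_grams:
        return 0.0  # suspect is shorter than n words
    return len(suspect_grams & word_ngrams(source, n)) / len(suspect_grams)


# Two of the four trigrams in the first sentence also occur in the second.
score = containment("the cat sat on the mat", "the cat sat on a mat", n=3)
```

Exact matching of this kind penalizes any rewording inside an n-gram, which is presumably why the paper goes on to modify and weight the n-grams.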
