n-grams

Idiom Savant at Semeval-2017 Task 7: Detection and Interpretation of English Puns

2017

Samuel Doogan Aniruddha Ghosh Hanyang Chen Tony Veale

This paper describes our system, entitled Idiom Savant, for the 7th Task of the Semeval 2017 workshop, “Detection and interpretation of English Puns”. Our system consists of two probabilistic models for each type of pun using Google n-grams and Word2Vec. Our system achieved F-scores of 0.84, 0.663, and 0.07 on homographic puns and 0.8439, 0.6631, and 0.0806 on heterographic puns in task 1, task ...

Mirex 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Representations

2010

Julián Urbano Juan Lloréns Jorge Morato Sonia Sánchez-Cuadrado

This short paper describes four submissions to the Symbolic Melodic Similarity task of the MIREX 2010 edition. All four submissions rely on a local-alignment approach between sequences of n-grams, and they differ mainly in the substitution score between two n-grams. This score is based on a geometric representation that shapes musical pieces as curves in the pitch-time plane. One of the systems...
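
As an illustration of local alignment between n-gram sequences, the following minimal sketch applies the standard Smith-Waterman recurrence to two melodies given as lists of MIDI pitch numbers. The geometric substitution score from the abstract is not reproduced here; `substitution_score` is a hypothetical placeholder (+2 for identical n-grams, -1 otherwise), and the gap penalty and example melodies are likewise assumptions.

```python
def ngrams(seq, n):
    """Return the overlapping n-grams of a sequence as a list of tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def substitution_score(a, b):
    # Placeholder (assumption): the submissions use a geometric score instead.
    return 2 if a == b else -1


def local_alignment_score(xs, ys, gap=-1):
    """Smith-Waterman: best score over all local alignments of xs and ys."""
    best = 0
    prev = [0] * (len(ys) + 1)  # previous row of the DP table
    for x in xs:
        curr = [0]
        for j, y in enumerate(ys, start=1):
            score = max(
                0,                                       # restart the alignment
                prev[j - 1] + substitution_score(x, y),  # align x with y
                prev[j] + gap,                           # gap in ys
                curr[j - 1] + gap,                       # gap in xs
            )
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# Hypothetical melodies as MIDI pitch numbers.
melody_a = [60, 62, 64, 65, 67]
melody_b = [59, 60, 62, 64, 65, 72]
print(local_alignment_score(ngrams(melody_a, 3), ngrams(melody_b, 3)))
```

Because every cell is clamped at zero, a shared passage can score highly even when the surrounding material differs, which is the property that makes local alignment suitable for partial melodic matches.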

Language Variety and Gender Classification for Author Profiling in PAN 2017

2017

Alexander Ogaltsov Alexey Romanov

We describe a method for the Author Profiling task. The task deals with the study of profile aspects such as gender and language variety. We explore an approach that uses high-order character n-grams as features and logistic regression as a classifier for all subtasks. This approach appears to be simple and effective for the task. We also investigated feature importances and low-dimensional embeddings of the d...

Reduced n-gram Models for English and Chinese Corpora

2006

Le Quan Ha Philip Hanna Darryl Stewart Francis Jack Smith

Statistical language models should improve as the size of the n-grams increases from 3 to 5 or higher. However, the number of parameters and calculations, and the storage requirement increase very rapidly if we attempt to store all possible combinations of n-grams. To avoid these problems, the reduced n-grams’ approach previously developed by O’Boyle (1993) can be applied. A reduced n-gram lang...

Comparison of Character n-grams and Lexical Features on Author, Gender, and Language Variety Identification on the Same Spanish News Corpus

2017

Miguel A. Sánchez-Pérez Ilia Markov Helena Gómez-Adorno Grigori Sidorov

We compare the performance of character n-gram features (n = 3–8) and lexical features (unigrams and bigrams of words), as well as their combinations, on the tasks of authorship attribution, author profiling, and discriminating between similar languages. We developed a single multi-labeled corpus for the three aforementioned tasks, composed of news articles in different varieties of Spanish. We...

Dissimilarities Detections in Texts Using Symbol n-grams and Word Histograms

Journal: Open Computer Science, 2016

Using N-grams to Process Hindi Queries with Transliteration Variations

1997

Anand Natrajan Allison L. Powell James C. French

Retrieval systems based on N-grams have been used as alternatives to word-based systems. N-grams offer a language-independent technique that allows retrieval based on portions of words. A query that contains misspellings or differences in transliteration can defeat word-based systems. N-gram systems are more resistant to these problems. We present a retrieval system based on N-grams that uses a...
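
To see why matching on portions of words tolerates spelling and transliteration variants, here is a minimal sketch that ranks vocabulary entries by the Dice coefficient over character bigrams. It is not the retrieval system from the abstract: the function names, the `_` padding character, the threshold, and the romanized example words are all assumptions made for illustration.

```python
def char_ngrams(word, n=2):
    """Set of character n-grams of a word, padded with '_' at both ends."""
    padded = f"_{word.lower()}_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice(a, b, n=2):
    """Dice coefficient between the n-gram sets of two words (0.0 to 1.0)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def retrieve(query, vocabulary, n=2, threshold=0.3):
    """Vocabulary entries scoring at least `threshold`, best match first."""
    scored = [(dice(query, word, n), word) for word in vocabulary]
    return [word for score, word in sorted(scored, reverse=True)
            if score >= threshold]


# Hypothetical romanized index terms; an exact word match for "delhi" finds nothing.
vocabulary = ["dehli", "dilli", "mumbai"]
print("delhi" in vocabulary)          # False
print(retrieve("delhi", vocabulary))  # ['dehli', 'dilli']
```

An exact-match lookup fails on every variant, while the bigram overlap still ranks the two plausible variants above the unrelated term.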

Native Language Identification using Phonetic Algorithms

2017

Charese Smiley Sandra Kübler

In this paper, we discuss the results of the IUCL system in the NLI Shared Task 2017. For our system, we explore a variety of phonetic algorithms to generate features for Native Language Identification. These features are contrasted with one of the most successful types of features in NLI, character n-grams. We find that although phonetic features do not perform as well as character n-grams alon...

Language modeling using x-grams

1996

Antonio Bonafonte José B. Mariño

In this paper, an extension of n-grams is proposed. In this extension, the memory of the model (n) is not fixed a priori. Instead, first, large memories are accepted and afterwards, merging criteria are applied to reduce complexity and to ensure reliable estimations. The results show how the perplexity obtained with x-grams is smaller than that of n-grams. Furthermore, the complexity is smaller...
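
For context on the perplexity comparison, the sketch below computes the perplexity of a conventional fixed-order n-gram model, the baseline that x-grams are measured against. The x-gram merging criteria are not reproduced. Add-one smoothing, the `<s>` padding token, and the function names are assumptions chosen to keep the example short, and the token list passed to `perplexity` is assumed to be non-empty.

```python
import math
from collections import Counter


def train_ngram_counts(tokens, n):
    """Count n-grams and their (n-1)-token contexts in a token list."""
    grams, contexts = Counter(), Counter()
    padded = ["<s>"] * (n - 1) + tokens
    for i in range(n - 1, len(padded)):
        ctx = tuple(padded[i - n + 1:i])
        grams[ctx + (padded[i],)] += 1
        contexts[ctx] += 1
    return grams, contexts


def perplexity(tokens, grams, contexts, n, vocab_size):
    """Perplexity of `tokens` under an add-one-smoothed n-gram model."""
    padded = ["<s>"] * (n - 1) + tokens
    log_prob = 0.0
    for i in range(n - 1, len(padded)):
        ctx = tuple(padded[i - n + 1:i])
        # Add-one (Laplace) smoothing keeps unseen n-grams at non-zero probability.
        p = (grams[ctx + (padded[i],)] + 1) / (contexts[ctx] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(tokens))


text = "a b a b a b".split()
grams, contexts = train_ngram_counts(text, n=2)
print(perplexity(text, grams, contexts, n=2, vocab_size=2))
```

A model with no training counts assigns every token probability 1/`vocab_size`, so its perplexity equals the vocabulary size; lower values indicate that the model predicts the text better.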

Multiresolution Document Analysis with Wavelets

1996

Amen Zwa David S. Ebert Ethan L. Miller

The n-gram analysis technique breaks up a text document into several n-character long unique grams, and produces a vector whose components are the counts of these grams. A typical corpus contains hundreds of thousands of such grams. Wavelet compression reduces the dimension of the n-gram vectors, and speeds up document query operations. Document vectors with their dimensions reduced to four com...
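
The two steps described above can be sketched roughly as follows: build a count vector over a fixed list of character n-grams, then shrink it with repeated pairwise averaging. The averaging step is an unnormalized Haar approximation, used here as an assumption because the abstract does not say which wavelet the authors apply; the function names and the tiny vocabulary are also hypothetical.

```python
from collections import Counter


def ngram_count_vector(text, n, vocabulary):
    """Counts of each vocabulary n-gram in `text`, in vocabulary order."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [counts[gram] for gram in vocabulary]


def haar_approximation(vector, levels=1):
    """Halve the vector length `levels` times by averaging adjacent pairs."""
    v = list(vector)
    for _ in range(levels):
        if len(v) % 2:
            v.append(0)  # zero-pad odd-length vectors
        v = [(v[i] + v[i + 1]) / 2 for i in range(0, len(v), 2)]
    return v


vocabulary = ["ab", "ba", "bb"]  # hypothetical gram list
vector = ngram_count_vector("abab", 2, vocabulary)
print(vector)                      # [2, 1, 0]
print(haar_approximation(vector))  # [1.5, 0.0]
```

Each level halves the number of components, so queries compare much shorter vectors at the cost of discarding the detail coefficients.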
