grams

A Online Appendix to: Analysis and Optimization for Boolean Expression Indexing

2011

Mohammad Sadoghi Hans-Arno Jacobsen

String tokenization using q-grams maps the string into a high-dimensional vector space model, in which the domain of each dimension is binary. The size of this space is exponential in the length of q-grams. For instance, q-grams of size three result in a space with 26^3 dimensions. The vector space model representation of a tokenized string (e.g., {‘str’, ‘tri’, ‘rin’, ‘ing’}) can be expressed b...
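
The q-gram mapping described in this abstract can be illustrated with a minimal sketch. It assumes a 26-letter lowercase alphabet, under which trigrams span 26^3 = 17,576 binary dimensions. The function names (`qgrams`, `qgram_index`, `binary_vector`) and the base-26 numbering of dimensions are illustrative choices, not taken from the paper.

```python
def qgrams(s, q=3):
    """Return the set of overlapping q-grams of the string s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def qgram_index(gram):
    """Map a lowercase a-z q-gram to a dimension index (base-26 number).

    Assumption: dimensions are numbered by reading the q-gram as a
    base-26 integer; any fixed one-to-one numbering would work equally well.
    """
    idx = 0
    for ch in gram:
        idx = idx * 26 + (ord(ch) - ord("a"))
    return idx


def binary_vector(s, q=3):
    """Sparse binary vector: the set of dimension indices whose value is 1."""
    return {qgram_index(g) for g in qgrams(s, q)}


tokens = qgrams("string")          # the four trigrams of "string"
vector = binary_vector("string")   # four of the 26**3 dimensions are set
dimensions = 26 ** 3
```

Storing only the indices of the non-zero dimensions keeps the representation small even though the full space grows exponentially with q.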

Developing High-resolution Universal Multi-type n-gram Text Similarity Detector

2014

Yurii Palkovskii Alexei Belov

This paper describes the approaches used for the Plagiarism Detection task during the PAN 2014 International Competition on Uncovering Plagiarism, Authorship, and Social Software Misuse, which scored 1st place with a plagdet score of 0.907 on test corpus no. 3 and 3rd place with a score of 0.868 on test corpus no. 2. In this work we aggregated all the previously researched experience from PAN 12 and PAN 13 resea...

Vincent Etter - Master Thesis - Semantic Vector Machines

2011

Vincent Etter

We first present our work in machine translation, during which we used aligned sentences to train a neural network to embed n-grams of different languages into a d-dimensional space, such that n-grams that are translations of each other are close with respect to some metric. Good n-gram-to-n-gram translation results were achieved, but full-sentence translation is still problematic. We re...

Using N-grams to Process Hindi Queries with Transliteration Variations

1997

Anand Natrajan Allison L. Powell James C. French

Retrieval systems based on N-grams have been used as alternatives to word-based systems. N-grams offer a language-independent technique that allows retrieval based on portions of words. A query that contains misspellings or differences in transliteration can defeat word-based systems. N-gram systems are more resistant to these problems. We present a retrieval system based on N-grams that uses a...
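
The robustness to spelling and transliteration variation mentioned in this abstract can be seen in a small sketch. It compares two words by the overlap of their character bigrams using the Dice coefficient. This is a generic illustration, not the retrieval system the authors describe; the function names and the two example spellings are assumptions chosen for the demonstration.

```python
def char_ngrams(word, n=2):
    """Return the set of overlapping character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice(a, b, n=2):
    """Dice coefficient between the n-gram sets of two words (0.0 to 1.0)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga or not gb:
        # Words shorter than n have no n-grams; treat them as dissimilar.
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


# Two hypothetical transliterations of the same name.
query, indexed = "chaudhary", "choudhury"

exact_match = query == indexed       # a word-based comparison fails
similarity = dice(query, indexed)    # the bigram sets still share 4 entries
```

A word-based index treats the two spellings as unrelated terms, whereas the shared bigrams give a non-zero score that a ranking function can use.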

StringNet as a Computational Resource for Discovering and Investigating Linguistic Constructions

2010

David Wible Nai-Lung Tsao

We describe and motivate the design of a lexico-grammatical knowledgebase called StringNet and illustrate its significance for research into constructional phenomena in English. StringNet consists of a massive archive of what we call hybrid n-grams. Unlike traditional n-grams, hybrid n-grams can consist of any co-occurring combination of POS tags, lexemes, and specific word forms. Further, we d...

Not All Character N-grams Are Created Equal: A Study in Authorship Attribution

2015

Upendra Sapkota Steven Bethard Manuel Montes-y-Gómez Thamar Solorio

Character n-grams have been identified as the most successful feature in both single-domain and cross-domain Authorship Attribution (AA), but the reasons for their discriminative value were not fully understood. We identify subgroups of character n-grams that correspond to linguistic aspects commonly claimed to be covered by these features: morphosyntax, thematic content and style. We evaluate t...

Dialogue act classification using a Bayesian approach

2004

Sergio Grau

In this work, we make a contribution to natural speech dialogue act detection. We focus our attention on dialogue act classification using a Bayesian approach. Our classifier is tested on two corpora, the Switchboard and the Basurde tasks. A combination of a naive Bayes classifier and n-grams is used. The impact of different smoothing methods (Laplace and Witten-Bell) and n-grams in classif...
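
The combination of a naive Bayes classifier, n-gram features, and Laplace smoothing named in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' classifier: the class `NGramNaiveBayes`, the four training utterances, and their labels are hypothetical, and only Laplace (add-alpha) smoothing is shown, not Witten-Bell.

```python
import math
from collections import Counter, defaultdict


def word_ngrams(tokens, n):
    """Return the list of word n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


class NGramNaiveBayes:
    """Multinomial naive Bayes over word n-grams with add-alpha smoothing."""

    def __init__(self, n=1, alpha=1.0):
        self.n = n
        self.alpha = alpha                      # alpha=1.0 is Laplace smoothing
        self.class_counts = Counter()           # utterances per class
        self.gram_counts = defaultdict(Counter) # n-gram counts per class
        self.vocab = set()                      # all n-grams seen in training

    def fit(self, utterances, labels):
        for text, label in zip(utterances, labels):
            grams = word_ngrams(text.split(), self.n)
            self.class_counts[label] += 1
            self.gram_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def log_posterior(self, text, label):
        """Unnormalised log P(label) + sum of log P(n-gram | label)."""
        total = sum(self.gram_counts[label].values())
        denom = total + self.alpha * len(self.vocab)
        prior = self.class_counts[label] / sum(self.class_counts.values())
        score = math.log(prior)
        for gram in word_ngrams(text.split(), self.n):
            if gram in self.vocab:  # n-grams never seen in training are skipped
                count = self.gram_counts[label][gram]
                score += math.log((count + self.alpha) / denom)
        return score

    def predict(self, text):
        return max(self.class_counts, key=lambda c: self.log_posterior(text, c))


# Hypothetical toy data, for illustration only.
train_utterances = [
    "can you help me",
    "what time is it",
    "i need a ticket",
    "the train leaves at noon",
]
train_labels = ["question", "question", "statement", "statement"]

model = NGramNaiveBayes(n=1, alpha=1.0).fit(train_utterances, train_labels)
predicted = model.predict("can you tell me the time")
```

Smoothing matters here because the word "the" never occurs in a training question; without the added alpha, its zero count would rule out the "question" class entirely.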

Revisiting the Case for Explicit Syntactic Information in Language Models

2012

Ariya Rastrow Sanjeev Khudanpur Mark Dredze

Statistical language models used in deployed systems for speech recognition, machine translation and other human language technologies are almost exclusively n-gram models. They are regarded as linguistically naïve, but estimating them from any amount of text, large or small, is straightforward. Furthermore, they have doggedly matched or outperformed numerous competing proposals for syntactical...

Multiword expressions in spoken language: An exploratory study on pronunciation variation

Journal: Computer Speech & Language 2005

Diana Binnenpoorte Catia Cucchiarini Lou Boves Helmer Strik

The study presented in this paper was aimed at exploring the possibilities of modelling specific pronunciation characteristics of multiword expressions (MWEs) for both automatic speech recognition (ASR) and automatic phonetic transcription (APT). For this purpose, we first drew up an inventory of frequently found N-grams extracted from orthographic transcriptions of spontaneous speech contained...

Comparaison des stratégies d'indexation pour les langues asiatiques

2006

Samir Abdou

In information retrieval, Chinese and Japanese present many challenging problems. Unlike in most European languages, the lack of explicit word boundaries is one of the most important issues for indexing. For this reason, many works have proposed different approaches to index documents or requests written in these languages. This article presents a comparison of the common indexing strategies. Mo...
