n grams

Generalized N-gram Measures for Melodic Similarity

2006

Klaus Frieler

In this paper we propose three generalizations of well-known N-gram approaches for measuring the similarity of single-line melodies. In an earlier paper we compared around 50 similarity measures for melodies with empirical data from music-psychological experiments. Similarity measures based on edit distances and N-grams consistently showed the best results across different contexts. This paper aims at a gener...
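As a rough illustration of what an N-gram similarity measure for single-line melodies can look like, the sketch below compares two melodies by the overlap of their pitch-interval N-grams. The abstract does not specify the paper's measures, so everything here is an assumption made for illustration: the MIDI pitch representation, the use of semitone intervals, the multiset Dice coefficient, and the function names `intervals`, `ngrams`, and `ngram_overlap`.

```python
from collections import Counter


def intervals(pitches):
    """Convert a list of MIDI pitch numbers to successive semitone intervals."""
    return [b - a for a, b in zip(pitches, pitches[1:])]


def ngrams(seq, n):
    """Return all contiguous length-n windows of seq as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def ngram_overlap(melody_a, melody_b, n=3):
    """Multiset Dice coefficient over interval n-grams, in [0, 1].

    Illustrative measure only; not one of the measures from the paper.
    """
    counts_a = Counter(ngrams(intervals(melody_a), n))
    counts_b = Counter(ngrams(intervals(melody_b), n))
    total = sum(counts_a.values()) + sum(counts_b.values())
    if total == 0:
        # Both melodies are too short to contain any n-gram.
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    shared = sum((counts_a & counts_b).values())
    return 2 * shared / total


theme = [60, 62, 64, 65, 67, 65, 64, 62]
transposed = [p + 5 for p in theme]   # same contour, a fourth higher
monotone = [60, 60, 60, 60, 60]

print(ngram_overlap(theme, transposed))
print(ngram_overlap(theme, monotone))
```

Working on intervals rather than absolute pitches makes the measure invariant under transposition, which is why the transposed copy of the theme scores as identical in this sketch.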


Ensembles of Proximity-Based One-Class Classifiers for Author Verification Notebook for PAN at CLEF 2014

2014

Magdalena Jankowska, Vlado Keselj, Evangelos E. Milios

We use ensembles of proximity-based one-class classifiers for the authorship verification task. For each document of known authorship, the one-class classifiers compare the dissimilarity between this document and the most dissimilar other document by the same author to the dissimilarity between this document and the questioned document. As the dissimilarity measure between documents we use Com...


Twitter Trends Detection by Identifying Grammatical Relations

2013

Mikhail Aleksandrovich Dykov, Pavel Nikolaevich Vorobkalov

This paper considers the problem of identifying trends in a given area by analyzing Twitter messages. The approaches currently used for Twitter trend detection are based on n-grams. We propose another approach to trend detection that treats a trend as a grammatical relation and identifies trending relations on the basis of their frequency chang...


Using a machine learning model to assess the complexity of stress systems

2014

Liviu P. Dinu, Alina Maria Ciobanu, Ioana Chitoran, Vlad Niculae

We address the task of stress prediction as a sequence tagging problem. We present sequential models with averaged perceptron training for learning primary stress in Romanian words. We use character n-grams and syllable n-grams as features and we account for the consonant-vowel structure of the words. We show in this paper that Romanian stress is predictable, though not deterministic, by using ...


Language Recognition using Random Indexing

Journal: CoRR 2014

Aditya Joshi, Johan T. Halseth, Pentti Kanerva

Random Indexing is a simple implementation of Random Projections with a wide range of applications. It can solve a variety of problems with good accuracy without introducing much complexity. Here we demonstrate its use for identifying the language of text samples, based on a novel method of encoding letter n-grams into high-dimensional Language Vectors. Further, we show that the method is easil...
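The truncated abstract does not describe how letter n-grams are encoded, so the following is only a generic random-indexing sketch under stated assumptions: each letter gets a fixed random vector of +1/-1 entries, an n-gram is formed by rotating each letter vector according to its position and multiplying elementwise, and a text vector is the sum of its n-gram vectors. The dimensionality `DIM`, the seeding scheme, and all function names are hypothetical choices for this example and are not taken from the paper.

```python
import random
from functools import lru_cache
from math import sqrt

DIM = 2000  # vector dimensionality; an arbitrary choice for this sketch


@lru_cache(maxsize=None)
def letter_vector(ch):
    """Fixed pseudo-random +1/-1 vector for a character (seeded by its code point)."""
    rng = random.Random(ord(ch))
    return tuple(rng.choice((-1, 1)) for _ in range(DIM))


def rotate(vec, k):
    """Cyclically shift vec to the right by k positions."""
    k %= len(vec)
    return vec[-k:] + vec[:-k] if k else vec


def ngram_vector(gram):
    """Encode an n-gram: rotate each letter by its position, then multiply elementwise."""
    out = [1] * DIM
    n = len(gram)
    for pos, ch in enumerate(gram):
        shifted = rotate(letter_vector(ch), n - 1 - pos)
        out = [a * b for a, b in zip(out, shifted)]
    return out


def text_vector(text, n=3):
    """Sum the vectors of all letter n-grams in text."""
    total = [0] * DIM
    for i in range(len(text) - n + 1):
        gram = ngram_vector(text[i:i + n])
        total = [a + b for a, b in zip(total, gram)]
    return total


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


a = text_vector("the quick brown fox jumps over the lazy dog")
b = text_vector("the lazy dog jumps over the quick brown fox")
c = text_vector("zzzz qqqq xxxx wwww vvvv kkkk")

print(cosine(a, b))  # texts sharing most trigrams
print(cosine(a, c))  # texts sharing no trigrams
```

In a language-identification setting, one such vector would be built per language from training text, and an unknown sample would be assigned to the language whose vector has the highest cosine similarity with the sample's vector.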


The role of personality, age, and gender in tweeting about mental illness

2015

Daniel Preotiuc-Pietro, Johannes C. Eichstaedt, Gregory J. Park, Maarten Sap, Laura Smith, Victoria Tobolsky, H. Andrew Schwartz, Lyle H. Ungar

Mental illnesses, such as depression and post-traumatic stress disorder (PTSD), are highly underdiagnosed globally. Populations sharing similar demographics and personality traits are known to be more at risk than others. In this study, we characterise the language use of users disclosing their mental illness on Twitter. Language-derived personality and demographic estimates show surprisingly s...


Syntactic N-gram Collection from a Large-Scale Corpus of Internet Finnish

2014

Jenna Kanerva, Juhani Luotolahti, Veronika Laippala, Filip Ginter

In this paper, we report on the development of a large-scale Finnish Internet parsebank, currently consisting of 1.5 billion tokens in 116 million sentences. The data is fully morphologically and syntactically analyzed, and it has been used to extract flat and syntactic n-gram collections, as well as verb-argument and noun-argument n-grams. Additionally, distributional vector space representation...


Using Character n-grams and Style Features for Gender and Language Variety Classification

2017

Rodrigo Ribeiro Oliveira, Rosalvo Ferreira Oliveira Neto

Author profiling is the problem of determining the characteristics of an author of an anonymous text. In this paper, we detail a method to determine the language variety and the gender of the authors of tweets, as a submission for the Author Profiling Task at PAN 2017. This method seeks to select the most significant character n-grams for each class considered, combining them with style feature...


The latent words language model

Journal: Computer Speech & Language 2012

Koen Deschacht, Jan De Belder, Marie-Francine Moens

Statistical language models have found many applications in information retrieval since their introduction almost three decades ago. Currently the most popular models are n-gram models, which are known to suffer from serious sparseness issues as a result of the large vocabulary size |V| of any given corpus and of the exponential nature of n-grams, where potentially |V|^n n-grams can occu...
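The sparseness argument can be made concrete with a small count: with a vocabulary of size |V|, up to |V|^n distinct n-grams are possible, while a corpus only ever exhibits a small fraction of them. The sketch below is an illustration on a toy sentence, not part of the paper; the helper name `ngram_coverage` and the example text are assumptions.

```python
def ngram_coverage(tokens, n):
    """Return (distinct n-grams observed, |V|**n possible) for a token list."""
    vocab = set(tokens)
    observed = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return len(observed), len(vocab) ** n


tokens = "the cat sat on the mat and the dog sat on the log".split()

for n in (1, 2, 3):
    observed, possible = ngram_coverage(tokens, n)
    print(f"n={n}: {observed} observed of {possible} possible")
```

Even in this 13-token example the number of possible n-grams grows by a factor of |V| = 8 with each increase in n, while the number of observed n-grams stays bounded by the corpus length.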


Correlations between dialogue acts and learning in spoken tutoring dialogues

Journal: Natural Language Engineering 2006

Diane J. Litman, Katherine Forbes-Riley

We examine correlations between dialogue behaviors and learning in tutoring, using two corpora of spoken tutoring dialogues: a human-human corpus and a human-computer corpus. To formalize the notion of dialogue behavior, we manually annotate our data using a tagset of student and tutor dialogue acts relative to the tutoring domain. A unigram analysis of our annotated data shows that student lea...
