n-grams

A Variant of N-Gram Based Language Classification

2007

Andrija Tomovic, Predrag Janicic

Rapid classification of documents is of high importance in many multilingual settings (such as international institutions or Internet search engines). This has been, for years, a well-known problem, addressed by different techniques, with excellent results. We address this problem with a simple n-gram-based technique, a variation of techniques of this family. Our n-gram-based classification is ...


Serbian Text Categorization Using Byte Level n-Grams

2012

Jelena Graovac

This paper presents the results of classifying Serbian text documents using the byte-level n-gram-based frequency statistics technique, employing four different dissimilarity measures. Results show that byte-level n-gram text categorization, although very simple and language independent, achieves very good accuracy.
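
The byte-level n-gram frequency technique named in this abstract can be illustrated with a minimal sketch. The abstract does not say which four dissimilarity measures the paper uses, so the sum of absolute frequency differences below is only an illustrative stand-in, and the function names, the n-gram length `n=3`, and the profile size `top=1000` are all assumptions made for the example.

```python
from collections import Counter


def byte_ngram_profile(text, n=3, top=1000):
    """Map the `top` most frequent byte n-grams of `text` to relative frequencies."""
    data = text.encode("utf-8")  # work on raw bytes, so no tokenization is needed
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: c / total for gram, c in counts.most_common(top)}


def dissimilarity(p, q):
    """Illustrative measure (an assumption): sum of absolute frequency differences."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in p.keys() | q.keys())


def classify(document, category_profiles, n=3, top=1000):
    """Return the category whose profile is least dissimilar to the document's."""
    doc_profile = byte_ngram_profile(document, n, top)
    return min(
        category_profiles,
        key=lambda cat: dissimilarity(doc_profile, category_profiles[cat]),
    )
```

In use, one profile would be built per category from training text and passed to `classify` as a dictionary keyed by category name. Because the profile is computed over UTF-8 bytes rather than words, the same code applies to any language or script.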


Valence, arousal and dominance estimation for English, German, Greek, Portuguese and Spanish lexica using semantic models

2015

Elisavet Palogiannidi, Elias Iosif, Polychronis Koutsakis, Alexandros Potamianos

We propose and evaluate the use of an affective-semantic model to expand the affective lexica of German, Greek, English, Spanish and Portuguese. Motivated by the assumption that semantic similarity implies affective similarity, we use word-level semantic similarity scores as semantic features to estimate the corresponding affective scores of words. Various context-based semantic similarity metrics are...


Experiments in the Retrieval of Unsegmented Japanese Text at the NTCIR-2 Workshop

2001

Paul McNamee

Our work with the Hopkins Automated Information Retriever for Combing Unstructured Text (HAIRCUT) system has made use of overlapping character n-grams in the indexing and retrieval of text. In previous experiments with Western European languages we have shown that longer n-grams (e.g., n=6) are capable of providing an effective form of alinguistic term normalization. We have wanted to in...


Extension of Zipf's Law to Word and Character N-grams for English and Chinese

Journal: IJCLCLP 2003

Le Quan Ha, Elvira I. Sicilia-Garcia, Ji Ming, Francis Jack Smith

It is shown that for a large corpus, Zipf's law for both words in English and characters in Chinese does not hold for all ranks. The frequency falls below the frequency predicted by Zipf's law for English words for ranks greater than about 5,000 and for Chinese characters for ranks greater than about 1,000. However, when single words or characters are combined together with n-gram words or chara...


To Memorize or to Predict: Prominence labeling in Conversational Speech

2007

Ani Nenkova, Jason M. Brenier, Anubha Kothari, Sasha Calhoun, Laura Whitton, David Beaver, Daniel Jurafsky

The immense prosodic variation of natural conversational speech makes it challenging to predict which words are prosodically prominent in this genre. In this paper, we examine a new feature, accent ratio, which captures how likely it is that a word will be realized as prominent or not. We compare this feature with traditional accent-prediction features (based on part of speech and N-grams) as w...


Predicting Translation Performance with Referential Translation Machines

2017

Ergun Biçici

Referential translation machines (RTMs) achieve top performance in both bilingual and monolingual settings without accessing any task- or domain-specific information or resource. RTMs achieve the 3rd system results for German to English sentence-level prediction of translation quality and the 2nd system results according to root mean squared error. In addition to the new features about substring distan...


Transliterated Arabic Name Search

2004

David Holmes, Samsum Kashfi, Syed Uzair Aqeel

We address name search for transliterated Arabic given names. In previous work, we addressed similar problems with English and Arabic surnames. In each previous case, we used a variant of Soundex and n-grams to improve precision and recall of name matching compared against well-known approaches such as the Russell Soundex algorithm. Unlike prior work, the proposed approach does not rely upon So...


Using a Partially Annotated Corpus to Build a Dependency Parser for Japanese

2005

Manabu Sassano

We explore the use of a partially annotated corpus to build a dependency parser for Japanese. We examine two types of partially annotated corpora. It is found that a parser trained with a corpus that does not have any grammatical tags for words can demonstrate an accuracy of 87.38%, which is comparable to the current state-of-the-art accuracy on the Kyoto University Corpus. In contrast, a parse...


CNG Text Classification for Authorship Profiling Task: Notebook for PAN at CLEF 2013

2013

Magdalena Jankowska, Vlado Keselj, Evangelos E. Milios

We describe our participation in the Author Profiling task of the PAN 2013 competition. The task objective is to determine the age and the gender of an author of a document. We applied the Common N-Gram (CNG) classifier (Kešelj et al., 2003) to this task. The CNG classifier uses a dissimilarity measure based on the differences in the frequencies of the character n-grams that are most common in ...

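
The CNG dissimilarity mentioned in the last abstract can be sketched as follows. This follows the commonly cited formulation from Kešelj et al. (2003), in which each n-gram contributes the squared difference of its two frequencies divided by their mean. The abstract is truncated, so the exact setup used for PAN 2013 is not confirmed here; the function names, the n-gram length `n=4`, and the profile size `size=500` are assumptions chosen for illustration.

```python
from collections import Counter


def char_ngram_profile(text, n=4, size=500):
    """Map the `size` most common character n-grams to normalized frequencies."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: c / total for gram, c in counts.most_common(size)}


def cng_dissimilarity(p1, p2):
    """Sum over the union of both profiles of ((f1 - f2) / ((f1 + f2) / 2)) ** 2.

    An n-gram missing from one profile is treated as having frequency 0 there.
    """
    total = 0.0
    for gram in p1.keys() | p2.keys():
        f1 = p1.get(gram, 0.0)
        f2 = p2.get(gram, 0.0)
        # f1 + f2 > 0 because the n-gram occurs in at least one profile
        total += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return total
```

A document would then be assigned to the class (for example, an age group or gender) whose profile gives the smallest dissimilarity. Each n-gram present in only one profile contributes exactly 4 to the sum, so profile size directly affects the scale of the scores.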