n grams

The University of Amsterdam at the CLEF Cross Language Speech Retrieval Track 2007

2007

Bouke Huurnink

In this paper we present the University of Amsterdam's submission to the English task of the CLEF Cross Language Speech Retrieval Track 2007. We describe the effects of using character n-grams and field combinations on both monolingual English retrieval and cross-lingual Dutch-to-English retrieval.
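As a rough illustration of what "character n-grams" means as an indexing unit, the following Python sketch splits a string into overlapping character sequences. The function name `char_ngrams` and the choice of n = 4 are assumptions made for this example; the abstract does not say how the submission tokenized text or which n it used.

```python
def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`, in order.

    Illustrative helper only; not taken from the paper.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # A string shorter than n yields an empty list.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    # n = 4 is an arbitrary choice for the demonstration.
    print(char_ngrams("speech", 4))
```

Indexing such substrings instead of whole words is one common way to make matching more tolerant of spelling variation, which may matter for noisy speech transcripts.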

Using Web-scale N-grams to Improve Base NP Parsing Performance

2010

Emily Pitler, Shane Bergsma, Dekang Lin, Kenneth Ward Church

We use web-scale N-grams in a base NP parser that correctly analyzes 95.4% of the base NPs in natural text. Web-scale data improves performance. That is, there is no data like more data. Performance scales log-linearly with the number of parameters in the model (the number of unique N-grams). The web-scale N-grams are particularly helpful in harder cases, such as NPs that contain conjunctions.

Not All Character N-grams Are Created Equal: A Study in Authorship Attribution

2015

Upendra Sapkota, Steven Bethard, Manuel Montes-y-Gómez, Thamar Solorio

Character n-grams have been identified as the most successful feature in both single-domain and cross-domain Authorship Attribution (AA), but the reasons for their discriminative value were not fully understood. We identify subgroups of character n-grams that correspond to linguistic aspects commonly claimed to be covered by these features: morphosyntax, thematic content and style. We evaluate t...

LingPars, a Linguistically Inspired, Language-Independent Machine Learner for Dependency Treebanks

2006

Eckhard Bick

This paper presents a Constraint Grammar-inspired machine learner and parser, LingPars, that assigns dependencies to morphologically annotated treebanks in a function-centred way. The system not only bases attachment probabilities for PoS, case, mood, lemma on those features' function probabilities, but also uses topological features like function/PoS n-grams, barrier tags and daughter-se...

Research on N-grams feature selection methods for text classification

Journal: IOP Conference Series: Materials Science and Engineering 2021

Multi-class composite n-gram language model using multiple word clusters and word successions

2001

Shuntaro Isogai, Katsuhiko Shirai, Hirofumi Yamamoto, Yoshinori Sagisaka

In this paper, a new language model, the Multi-Class Composite N-gram, is proposed to avoid the data sparseness problem that arises with a small amount of training data. The Multi-Class Composite N-gram maintains an accurate word prediction capability and reliability for sparse data with a compact model size based on multiple word clusters, so-called Multi-Classes. In the Multi-Class, the statistical connectivity...

Web Catchphrase Improve System Employing Onomatopoeia and Large-Scale N-gram Corpus

Journal: Int. J. Fuzzy Logic and Intelligent Systems 2012

Hiroaki Yamane, Masafumi Hagiwara

In this paper, we propose a system which improves text catchphrases on the web using onomatopoeia and the Japanese Google N-grams. Onomatopoeia is regarded as a fundamental tool in daily communication for people. The proposed system inserts an onomatopoetic word into plain text catchphrases. Being based on a large catchphrase encyclopedia, the proposed system evaluates each catchphrase’s candid...

Properties of phoneme N-grams across the world's language families

Journal: CoRR 2014

Taraka Rama, Lars Borin

In this article, we investigate the properties of phoneme N-grams across half of the world's languages. We investigate if the sizes of three different N-gram distributions of the world's language families obey a power law. Further, the N-gram distributions of language families parallel the sizes of the families, which seem to obey a power law distribution. The correlation between N-gram distrib...

Automatic Generation of Context-Based Fill-in-the-Blank Exercises Using Co-occurrence Likelihoods and Google n-grams

2016

Jennifer Hill, Rahul Simha

In this paper, we propose a method of automatically generating multiple-choice fill-in-the-blank exercises from existing text passages that challenge a reader’s comprehension skills and contextual awareness. We use a unique application of word co-occurrence likelihoods and the Google n-grams corpus to select words with strong contextual links to their surrounding text, and to generate distractor...

Skip N-grams and Ranking Functions for Predicting Script Events

2012

Bram Jans, Steven Bethard, Ivan Vulic, Marie-Francine Moens

In this paper, we extend current state-of-the-art research on unsupervised acquisition of scripts, that is, stereotypical and frequently observed sequences of events. We design, evaluate and compare different methods for constructing models for script event prediction: given a partial chain of events in a script, predict other events that are likely to belong to the script. Our work aims to answ...
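The "skip N-grams" of the title can be illustrated with a minimal Python sketch of skip bigrams over an event chain: ordered pairs of events separated by at most a fixed number of intervening events. This is the general idea only, not the authors' implementation; the function name `skip_bigrams`, the parameter `max_skip`, and the restaurant-style event chain are all assumptions made for the example.

```python
def skip_bigrams(events, max_skip):
    """Return ordered pairs (a, b) where b follows a in `events`
    with at most `max_skip` events in between.

    With max_skip = 0 this reduces to ordinary bigrams.
    Illustrative helper only; not taken from the paper.
    """
    if max_skip < 0:
        raise ValueError("max_skip must be non-negative")
    pairs = []
    for i, first in enumerate(events):
        # Furthest index that is still within the allowed skip distance.
        last = min(i + max_skip + 1, len(events) - 1)
        for j in range(i + 1, last + 1):
            pairs.append((first, events[j]))
    return pairs


if __name__ == "__main__":
    # Hypothetical event chain, invented for the demonstration.
    chain = ["enter", "order", "eat", "pay"]
    print(skip_bigrams(chain, 0))
    print(skip_bigrams(chain, 1))
```

Allowing skips yields more pairs from the same chain, which is one way to collect co-occurrence statistics when event chains are short or sparse.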
