Search results for: n grams
Number of results: 982486
Character n-gram F-score (CHRF) is shown to correlate very well with human relative rankings of different machine translation outputs, especially for morphologically rich target languages. However, its relation with direct human assessments is not yet clear. In this work, Pearson’s correlation coefficients for direct assessments are investigated for two currently available target languages, Eng...
This paper presents a content-based approach to spam detection based on low-level information. Instead of the traditional 'bag of words' representation, we use a 'bag of character n-grams' representation which avoids the sparse data problem that arises in n-grams on the word-level. Moreover, it is language-independent and does not require any lemmatizer or 'deep' text preprocessing. Based on ex...
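The 'bag of character n-grams' representation mentioned in this abstract can be illustrated with a minimal sketch. The function name `char_ngrams`, the choice of n = 3, and the sample string are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count the overlapping character n-grams in `text`.

    Hypothetical helper for illustration; the paper's actual feature
    extraction and choice of n are not given in the abstract.
    """
    # Slide a window of width n over the string, one character at a time.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# A 10-character string yields 8 overlapping trigrams.
bag = char_ngrams("free offer", n=3)
```

Because the features are raw character windows, no tokenizer, lemmatizer, or language-specific resource is needed, which is the language independence the abstract refers to.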
OBJECTIVE: We aim to identify duplicate pairs of Medline citations, particularly when the documents are not identical but contain similar information. MATERIALS AND METHODS: Duplicate pairs of citations are identified by comparing word n-grams in pairs of documents. N-grams are modified using two approaches which take account of the fact that the document may have been altered. These are: (1) d...
In this paper, a novel method for detecting plagiarized passages in document collections is presented. In contrast to previous work in this field that uses content terms to represent documents, the proposed method is based on a small list of stopwords (i.e., very frequent words). We show that stopword n-grams reveal important information for plagiarism detection since they are able to capture s...
Acquaintance is the name of a technique for information processing that combines the robustness of an n-gram-based algorithm with a novel vector-space model. Acquaintance gauges similarity among documents on the basis of common features, permitting document categorization based on a common language, a common topic, or common subtopics. The algorithm is completely language- and topic-independent, a...
We introduce a new set of tools for working with web-scale N-gram data. These tools lower the barrier for working with web-scale text, and create a new platform for acquiring large-scale linguistic knowledge. They will allow novel sources of information to be applied to long-standing natural language challenges.
Goals: The Johns Hopkins University Applied Physics Laboratory (JHU/APL) is a first-time entrant in the TREC Category A evaluation. The focus of our information retrieval research is on the relative value of and interaction among multiple term types. In particular, we are interested in examining both words and n-grams as indexing terms. The relative values of words and n-grams have been disputed...
In this work we investigate the usefulness of n-grams for document indexing in text categorization (TC). We call an n-gram a set t_k of n word stems, and we say that t_k occurs in a document d_j when a sequence of words appears in d_j that, after stop word removal and stemming, consists exactly of the n stems in t_k, in some order. Previous research has investigated the use of n-grams (or some varia...
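The order-insensitive occurrence test defined in this abstract can be sketched as follows. The stop list, the `toy_stem` function, and the name `occurs` are hypothetical stand-ins chosen for illustration; the abstract does not say which stop list or stemmer the authors used.

```python
STOP_WORDS = {"the", "of", "a", "in", "for"}  # toy stop list (assumption)


def toy_stem(word: str) -> str:
    """Stand-in stemmer that only strips a trailing 's' (not a real stemmer)."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def occurs(ngram: frozenset, document: str) -> bool:
    """Return True if some run of n consecutive stems in the document,
    taken after stop word removal and stemming, equals `ngram` as a set."""
    stems = [toy_stem(w) for w in document.lower().split() if w not in STOP_WORDS]
    n = len(ngram)
    # Compare each window of n consecutive stems to the n-gram, ignoring order.
    return any(
        frozenset(stems[i:i + n]) == ngram
        for i in range(len(stems) - n + 1)
    )
```

Under these assumptions, `occurs(frozenset({"document", "retrieval"}), "retrieval of the documents")` matches: stop word removal leaves two adjacent stems, and the set comparison ignores their order.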
In languages with high word inflection such as Arabic, stemming improves text retrieval performance by reducing word variants. We propose a change in the corpus-based stemming approach proposed by Xu and Croft for English and Spanish in order to stem Arabic words. We generate the conflation classes by clustering 3-gram representations of the words found in only 10% of the data in the ...
Theoretically, an improvement in a language model occurs as the size of the n-grams increases from 3 to 5 or higher. As the n-gram size increases, the number of parameters and calculations, and the storage requirement increase very rapidly if we attempt to store all possible combinations of n-grams. To avoid these problems, the reduced n-grams’ approach previously developed by O’Boyle and Smit...