From digital library to n-grams: NB N-gram
Abstract
At the National Library of Norway, we are currently developing NB N-gram, a service comparable to the Google Ngram Viewer (Michel et al., 2010; Lin et al., 2012; Aiden and Michel, 2013). It is based on all books and newspapers digitized up to and including 2013 as part of the library's large-scale digitization project. Uni-, bi- and trigrams have been generated from this text corpus, which contains some 34 billion words. In this paper, we sketch the background of NB N-gram and illustrate some of its applications.
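As a rough illustration of what generating uni-, bi- and trigrams from a text involves, here is a minimal Python sketch. It is not the NB N-gram pipeline: the function name `count_ngrams`, the lowercasing, the whitespace tokenization, and the toy sentence are all assumptions made for this example.

```python
from collections import Counter


def count_ngrams(tokens, max_n=3):
    """Return {n: Counter of n-gram tuples} for n = 1..max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token sequence.
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts


# Hypothetical toy input; a real corpus would need proper tokenization
# (assumption: lowercasing plus whitespace splitting is good enough here).
tokens = "det var en gang en konge".lower().split()
ngram_counts = count_ngrams(tokens)

print(ngram_counts[1].most_common(1))
print(len(ngram_counts[2]), "distinct bigrams")
print(len(ngram_counts[3]), "distinct trigrams")
```

For a corpus on the scale described above, counts like these would typically be aggregated per publication year so that relative frequencies can be plotted over time; that step is omitted here.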
Similar resources
Weighted Neural Bag-of-n-grams Model: New Baselines for Text Classification
NBSVM is one of the most popular methods for text classification and has been widely used as a baseline for various text representation approaches. It uses Naive Bayes (NB) features to weight a sparse bag-of-n-grams representation. N-grams capture word order in short contexts, and NB features assign more weight to important words. However, NBSVM suffers from a sparsity problem and is reported to...
Citations in the Digital Library of Classics: Extracting Canonical References by Using Conditional Random Fields
Scholars of Classics cite ancient texts by using abridged citations called canonical references. In the scholarly digital library, canonical references create a complex textile of links between ancient and modern sources reflecting the deep hypertextual nature of texts in this field. This paper aims to demonstrate the suitability of Conditional Random Fields (CRF) for extracting this particular...
A New Domain Independent Keyphrase Extraction System
In this paper we present a keyphrase extraction system that can extract potential phrases from a single document in an unsupervised, domain-independent way. We extract word n-grams from the input document. We incorporate linguistic knowledge (i.e., part-of-speech tags) and statistical information (i.e., frequency, position, lifespan) of each n-gram in defining candidate phrases and their respectiv...
First Person Singular: A Digital Library Collection that Helps Second Language Learners Express Themselves
We are using digital library technology to help language learners express themselves by capitalizing on all the human-generated text available on the Web. From a massive collection of n-grams and their occurrence frequencies we extract sequences that begin with the word “I”, sequences that begin a question, and sequences containing statistically significant collocations. These are preprocessed,...
Languages of Mathematics
An essay about mathematics being a sublanguage of other natural languages: how it may be represented, stored, searched and handled in several projects of (European) Digital Mathematics Libraries as DML-CZ or EuDML. A framework for solving problem of computing of similar papers in a digital library is proposed, allowing several types of similarity type definitions: plagiarity counting on common ...