From digital library to n-grams: NB N-gram

Authors

Magnus Breder Birkenes

Lars G. Johnsen

Arne Martinus Lindstad

Johanne Ostad

Abstract

At the National Library of Norway, we are currently developing NB N-gram, a service comparable to the Google Ngram Viewer (Michel et al., 2010; Lin et al., 2012; Aiden and Michel, 2013). It is based on all books and newspapers digitized up to and including 2013 as part of the large-scale digitization project at the National Library of Norway. Uni-, bi- and trigrams have been generated from this text corpus, which contains some 34 billion words. In this paper, we sketch the background of NB N-gram and illustrate some of its applications.
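As a rough illustration of what generating uni-, bi- and trigrams from a tokenized text involves, here is a minimal Python sketch. It is not the National Library's pipeline: the function name `count_ngrams`, the whitespace tokenization, and the sample sentence are all assumptions made for this example.

```python
from collections import Counter


def count_ngrams(tokens, max_n=3):
    """Count every n-gram of order 1..max_n in a list of tokens.

    Hypothetical helper for illustration only; a production system
    would also need proper tokenization and per-year bookkeeping.
    """
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        # Slide a window of width n over the token list.
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts


# Assumed toy input, split on whitespace.
tokens = "the library digitizes the books".split()
counts = count_ngrams(tokens)
```

Here `counts[1]`, `counts[2]` and `counts[3]` hold the unigram, bigram and trigram frequencies. A viewer in the style described above would typically aggregate such counts per publication year before plotting them.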

Related resources

Weighted Neural Bag-of-n-grams Model: New Baselines for Text Classification

NBSVM is one of the most popular methods for text classification and has been widely used as a baseline for various text representation approaches. It uses Naive Bayes (NB) features to weight a sparse bag-of-n-grams representation. N-grams capture word order in a short context, and the NB features assign more weight to important words. However, NBSVM suffers from a sparsity problem and is reported to...

Citations in the Digital Library of Classics: Extracting Canonical References by Using Conditional Random Fields

Scholars of Classics cite ancient texts by using abridged citations called canonical references. In the scholarly digital library, canonical references create a complex textile of links between ancient and modern sources reflecting the deep hypertextual nature of texts in this field. This paper aims to demonstrate the suitability of Conditional Random Fields (CRF) for extracting this particular...

A New Domain Independent Keyphrase Extraction System

In this paper we present a keyphrase extraction system that can extract potential phrases from a single document in an unsupervised, domain-independent way. We extract word n-grams from the input document. We incorporate linguistic knowledge (i.e., part-of-speech tags) and statistical information (i.e., frequency, position, lifespan) about each n-gram in defining candidate phrases and their respectiv...

First Person Singular: A Digital Library Collection that Helps Second Language Learners Express Themselves

We are using digital library technology to help language learners express themselves by capitalizing on all the human-generated text available on the Web. From a massive collection of n-grams and their occurrence frequencies we extract sequences that begin with the word “I”, sequences that begin a question, and sequences containing statistically significant collocations. These are preprocessed,...

Languages of Mathematics

An essay about mathematics as a sublanguage of other natural languages: how it may be represented, stored, searched and handled in several (European) Digital Mathematics Library projects such as DML-CZ or EuDML. A framework for solving the problem of computing similar papers in a digital library is proposed, allowing several types of similarity definitions: plagiarity counting on common ...

Publication date: 2015
