Distributed Distributional Similarities of Google Books over the Centuries

Authors

Martin Riedl

Richard Steuer

Christian Biemann

Abstract

This paper introduces a distributional thesaurus and sense clusters computed on the complete Google Syntactic N-grams, which is extracted from Google Books, a very large corpus of digitized books published between 1520 and 2008. We show that a thesaurus computed on such a large text basis leads to much better results than using smaller corpora like Wikipedia. We also provide distributional thesauri for equal-sized time slices of the corpus. While distributional thesauri can be used as lexical resources in NLP tasks, comparing word similarities over time can unveil sense change of terms across different decades or centuries, and can serve as a resource for diachronic lexicography. Thesauri and clusters are available for download.

Similar resources

Enhanced Search with Wildcards and Morphological Inflections in the Google Books Ngram Viewer

We present a new version of the Google Books Ngram Viewer, which plots the frequency of words and phrases over the last five centuries; its data encompasses 6% of the world’s published books. The new Viewer adds three features for more powerful search: wildcards, morphological inflections, and capitalization. These additions allow the discovery of patterns that were previously difficult to find...

Recognition of Lustered Pottery of the 3rd and 4th Centuries AH/AD 9th and 10th Centuries from the Imitated Ones

Luster is an innovative decorative technique applied in the Islamic era. The imitative technique of luster glaze is the result of efforts of some potters to imitate the visual features of the earlier luster. The two techniques differ in their production methods, but have many similarities in terms of physical characteristics of the works, including color, pattern, and in some cases, form. Such ...

Scaling to Large³ Data: An Efficient and Effective Method to Compute Distributional Thesauri

We introduce a new highly scalable approach for computing Distributional Thesauri (DTs). By employing pruning techniques and a distributed framework, we make the computation for very large corpora feasible on comparably small computational resources. We demonstrate this by releasing a DT for the whole vocabulary of Google Books syntactic n-grams. Evaluating against lexical resources using two m...

Syntactic Annotations for the Google Books NGram Corpus

We present a new edition of the Google Books Ngram Corpus, which describes how often words and phrases were used over a period of five centuries, in eight languages; it reflects 6% of all books ever published. This new edition introduces syntactic annotations: words are tagged with their part-of-speech, and head-modifier relationships are recorded. The annotations are produced automatically with...

Mapping and Review of the Anopheles Vectors of Malaria in Iran

Introduction: Mapping the distribution of endemic diseases and their relation to geographical factors has become important for public health experts, especially in the study of vector-borne protozoan diseases with emphasis on spatial or geographical epidemiology. This study was carried out to provide distribution maps of the vectors of malaria in Iran. Methods: A systemat...

Publication date: 2014
