Distributed Distributional Similarities of Google Books over the Centuries
Abstract
This paper introduces a distributional thesaurus and sense clusters computed on the complete Google Syntactic N-grams, which are extracted from Google Books, a very large corpus of digitized books published between 1520 and 2008. We show that a thesaurus computed on such a large text basis yields much better results than one computed on smaller corpora such as Wikipedia. We also provide distributional thesauri for equal-sized time slices of the corpus. While distributional thesauri can be used as lexical resources in NLP tasks, comparing word similarities over time can reveal sense changes of terms across decades or centuries and can serve as a resource for diachronic lexicography. Thesauri and clusters are available for download.
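As a rough illustration of how thesauri from different time slices could be compared, the sketch below scores how much a word's list of similar terms changes between two slices. The neighbor lists are invented toy data, the function names are hypothetical, and Jaccard overlap is an assumed measure chosen for simplicity; the abstract does not state which comparison the authors use.

```python
def neighbor_overlap(neighbors_a, neighbors_b):
    """Jaccard overlap between two lists of similar terms (1.0 = identical sets)."""
    a, b = set(neighbors_a), set(neighbors_b)
    if not a and not b:
        return 1.0  # two empty neighbor lists are treated as identical
    return len(a & b) / len(a | b)


def change_score(word, thesaurus_early, thesaurus_late):
    """1 - overlap: higher values suggest a larger shift in the word's neighbors."""
    early = thesaurus_early.get(word, [])
    late = thesaurus_late.get(word, [])
    return 1.0 - neighbor_overlap(early, late)


# Hypothetical toy thesauri: word -> most similar terms in one time slice.
thesaurus_early = {"mouse": ["rat", "rabbit", "squirrel", "cat"]}
thesaurus_late = {"mouse": ["keyboard", "rat", "cursor", "joystick"]}

score = change_score("mouse", thesaurus_early, thesaurus_late)
print(f"change score for 'mouse': {score:.3f}")
```

A set-based measure ignores the similarity weights and ranks that a real thesaurus entry would carry; a rank-aware comparison would be a natural refinement.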
Similar resources
Enhanced Search with Wildcards and Morphological Inflections in the Google Books Ngram Viewer
We present a new version of the Google Books Ngram Viewer, which plots the frequency of words and phrases over the last five centuries; its data encompasses 6% of the world’s published books. The new Viewer adds three features for more powerful search: wildcards, morphological inflections, and capitalization. These additions allow the discovery of patterns that were previously difficult to find...
Recognition of Lustered Pottery of the 3rd and 4th Centuries AH/AD 9th and 10th Centuries from the Imitated Ones
Luster is an innovative decorative technique applied in the Islamic era. The imitative technique of luster glaze is the result of efforts of some potters to imitate the visual features of the earlier luster. The two techniques differ in their production methods, but have many similarities in terms of physical characteristics of the works, including color, pattern, and in some cases, form. Such ...
Scaling to Large³ Data: An Efficient and Effective Method to Compute Distributional Thesauri
We introduce a new highly scalable approach for computing Distributional Thesauri (DTs). By employing pruning techniques and a distributed framework, we make the computation for very large corpora feasible on comparably small computational resources. We demonstrate this by releasing a DT for the whole vocabulary of Google Books syntactic n-grams. Evaluating against lexical resources using two m...
Syntactic Annotations for the Google Books NGram Corpus
We present a new edition of the Google Books Ngram Corpus, which describes how often words and phrases were used over a period of five centuries, in eight languages; it reflects 6% of all books ever published. This new edition introduces syntactic annotations: words are tagged with their part-of-speech, and head-modifier relationships are recorded. The annotations are produced automatically with...
Mapping and Review of the Anopheles Vectors of Malaria in Iran
Introduction: Mapping the distribution of endemic diseases and their relation to geographical factors has become important for public health experts, especially in the study of vector-borne protozoan diseases with an emphasis on spatial or geographical epidemiology. This study was carried out to provide distribution maps of the vectors of malaria in Iran. Methods: A systemat...