Extension of Zipf's Law to Words and Phrases
Authors
Abstract
Zipf’s law states that the frequency of a word token in a large corpus of natural language is inversely proportional to its rank. The law is investigated for two languages, English and Mandarin, and for n-gram word phrases as well as for single words. The law for single words is shown to be valid only for high-frequency words. However, when single words and n-gram phrases are combined in one list and ordered by frequency, the combined list follows Zipf’s law accurately for all words and phrases, down to the lowest frequencies, in both languages. The Zipf curves for the two languages are then almost identical.
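The combined-list procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`ngram_counts`, `rank_frequency`, `loglog_slope`) and the choice of a plain least-squares fit on the log-log rank-frequency data are assumptions made here, and the input is assumed to be already tokenized (for Mandarin that would require a word segmenter, which is not shown).

```python
from collections import Counter
import math


def ngram_counts(tokens, max_n):
    """Count every 1-gram up to max_n-gram in one combined Counter.

    Hypothetical helper: single words and phrases share one table,
    mirroring the idea of a combined word-and-phrase list.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def rank_frequency(counts):
    """Return (rank, frequency) pairs, most frequent item first."""
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


def loglog_slope(ranked):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near -1 is what Zipf's law in its simplest form predicts.
    """
    xs = [math.log(r) for r, _ in ranked]
    ys = [math.log(f) for _, f in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    tokens = "the cat sat on the mat and the cat ran".split()
    combined = ngram_counts(tokens, max_n=3)
    ranked = rank_frequency(combined)
    print(ranked[:5])
    print(loglog_slope(ranked))
```

On a toy sentence the slope is not meaningful; the sketch only becomes informative on a large corpus, where the slope of the combined list can be compared with the slope obtained from single words alone (`max_n=1`).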
Similar resources
Zipf’s law holds for phrases, not words
With Zipf's law being originally and most famously observed for word frequency, it is surprisingly limited in its applicability to human language, holding over no more than three to four orders of magnitude before hitting a clear break in scaling. Here, building on the simple observation that phrases of one or more words comprise the most coherent units of meaning in language, we show empirical...
Evolution of the most common English words and phrases over the centuries
By determining the most common English words and phrases since the beginning of the sixteenth century, we obtain a unique large-scale view of the evolution of written text. We find that the most common words and phrases in any given year had a much shorter popularity lifespan in the sixteenth century than they had in the twentieth century. By measuring how their usage propagated across the year...
Extension of Zipf's Law to Word and Character N-grams for English and Chinese
It is shown that, for a large corpus, Zipf's law for both words in English and characters in Chinese does not hold for all ranks. The frequency falls below the frequency predicted by Zipf's law for English words at ranks greater than about 5,000 and for Chinese characters at ranks greater than about 1,000. However, when single words or characters are combined together with n-gram words or chara...
On the Law of Zipf-Mandelbrot for Multi-Word Phrases
The paper studies the probabilities of occurrence of m-word phrases (m = 2, 3, ...) in relation to the probabilities of occurrence of the single words. It is well known that, in the latter case, the law of Zipf (i.e. a power law) is valid. We prove that in the case of m-word phrases (m ≥ 2) this is not the case. We present two independent proofs of this. We furthermore show that in case we wan...
Marshall-Olkin Extended Zipf Distribution
The Zipf distribution, also known as the scale-free distribution or discrete Pareto distribution, is the particular case of the power-law distribution whose support is the strictly positive integers. It is a one-parameter distribution with linear behaviour on the log-log scale. In this paper the Zipfian distribution is generalized by means of the Marshall-Olkin transformation. The new model has more flexi...
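For orientation, the two standard objects named in this abstract can be written out as below. The symbols are assumptions introduced here, not taken from the paper: $s$ for the Zipf exponent, $\zeta$ for the Riemann zeta function, $\bar F$ and $\bar G$ for the original and transformed survival functions, and $\beta$ for the Marshall-Olkin parameter. The formulas are the textbook definitions; how the paper parameterizes its extended model is not stated in the excerpt.

```latex
% Zipf distribution on the strictly positive integers (one parameter s > 1)
P(X = k) = \frac{k^{-s}}{\zeta(s)}, \qquad k = 1, 2, 3, \dots

% Linear behaviour on the log-log scale: slope -s, intercept -\log\zeta(s)
\log P(X = k) = -s \log k - \log \zeta(s)

% Marshall-Olkin transformation of a survival function \bar F, with \beta > 0;
% setting \beta = 1 recovers the original distribution
\bar G(x) = \frac{\beta\, \bar F(x)}{1 - (1 - \beta)\, \bar F(x)}
```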