Extension of Zipf's Law to Words and Phrases

Authors

  • Le Quan Ha
  • Elvira I. Sicilia-Garcia
  • Ji Ming
  • Francis Jack Smith
Abstract

Zipf’s law states that the frequency of a word token in a large corpus of natural language is inversely proportional to its rank. The law is investigated for two languages, English and Mandarin, and for n-gram word phrases as well as for single words. The law for single words is shown to be valid only for high-frequency words. However, when single words and n-gram phrases are combined in one list and ordered by frequency, the combined list follows Zipf’s law accurately for all words and phrases, down to the lowest frequencies, in both languages. The Zipf curves for the two languages are then almost identical.
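The combined-list idea in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: the function names (`combined_ngram_counts`, `rank_frequency`, `loglog_slope`), the choice of `max_n`, and the least-squares slope check are all assumptions made for the example. Under Zipf's law, frequency is proportional to 1/rank, so a log-log plot of frequency against rank should have a slope close to -1.

```python
from collections import Counter
import math


def combined_ngram_counts(tokens, max_n=3):
    """Count every n-gram for n = 1..max_n in a single Counter.

    Single words are stored as 1-tuples so that words and phrases
    share one list, as the abstract describes.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def rank_frequency(counts):
    """Return (rank, frequency) pairs, rank 1 being the most frequent item."""
    ranked = sorted(counts.values(), reverse=True)
    return list(enumerate(ranked, start=1))


def loglog_slope(rank_freq):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two distinct ranks; a value near -1 is what an
    ideal Zipf distribution would give.
    """
    xs = [math.log(r) for r, _ in rank_freq]
    ys = [math.log(f) for _, f in rank_freq]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Toy input only; a real test needs a large tokenized corpus.
    tokens = "the cat sat on the mat".split()
    counts = combined_ngram_counts(tokens, max_n=2)
    print(rank_frequency(counts)[:3])
```

On a real corpus, the slope would be computed separately for the single-word list (`max_n=1`) and for the combined list, and the two compared at high ranks, which is where the abstract reports that the single-word law breaks down.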


Similar resources

Zipf’s law holds for phrases, not words

With Zipf's law being originally and most famously observed for word frequency, it is surprisingly limited in its applicability to human language, holding over no more than three to four orders of magnitude before hitting a clear break in scaling. Here, building on the simple observation that phrases of one or more words comprise the most coherent units of meaning in language, we show empirical...


Evolution of the most common English words and phrases over the centuries

By determining the most common English words and phrases since the beginning of the sixteenth century, we obtain a unique large-scale view of the evolution of written text. We find that the most common words and phrases in any given year had a much shorter popularity lifespan in the sixteenth century than they had in the twentieth century. By measuring how their usage propagated across the year...


Extension of Zipf's Law to Word and Character N-grams for English and Chinese

It is shown that for a large corpus, Zipf's law for both words in English and characters in Chinese does not hold for all ranks. The frequency falls below the frequency predicted by Zipf's law for English words for rank greater than about 5,000 and for Chinese characters for rank greater than about 1,000. However, when single words or characters are combined together with n-gram words or chara...


On the Law of Zipf-Mandelbrot for Multi-Word Phrases

The paper studies the probabilities of the occurrence of m word phrases (m=2,3, ...) in relation with the probabilities of occurrence of the single words. It is well-known that, in the latter case, the law of Zipf is valid (i.e. a power law). We prove that in the case of m word phrases (m ≥ 2) this is not the case. We present two independent proofs of this. We furthermore show that in case we wan...


Marshall-Olkin Extended Zipf Distribution

The Zipf distribution, also known as the scale-free distribution or discrete Pareto distribution, is the particular case of the power-law distribution with support on the strictly positive integers. It is a one-parameter distribution with a linear behaviour in the log-log scale. In this paper the Zipfian distribution is generalized by means of the Marshall-Olkin transformation. The new model has more flexi...




Publication date: 2002