Do Wikipedia article titles obey Zipf's law?

George Zipf observed that in many forms of written text a few words occur often but most are rare.

Zipf's law states that the frequency of an item is inversely proportional to its rank in the frequency table. The most frequent word occurs twice as often as the second most frequent word, three times as often as the third most frequent word, and so on.
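
As a minimal sketch of that relationship in Go, the function below divides the top-ranked frequency by the rank. The name zipfExpected and the 1,200-occurrence corpus are made up for illustration, and the sketch assumes the classic form of the law with an exponent of 1.

```go
package main

import "fmt"

// zipfExpected returns the frequency Zipf's law predicts for the item at the
// given 1-based rank, using the frequency of the rank-1 item as the reference.
// Hypothetical helper for illustration; assumes the classic exponent of 1.
func zipfExpected(topFreq float64, rank int) float64 {
	return topFreq / float64(rank)
}

func main() {
	// Hypothetical corpus whose most frequent word occurs 1,200 times.
	for rank := 1; rank <= 4; rank++ {
		fmt.Printf("rank %d: expected %.0f occurrences\n", rank, zipfExpected(1200, rank))
	}
}
```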

I was curious how well this law holds up, so I decided to examine Wikipedia article titles. Out of 5,674,805 article titles, there were 16,387,602 total word occurrences and 1,481,156 unique words. Of those, the top 2,116 words (0.1% of unique words) account for half of all word occurrences.
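
Counts like these can be produced with a word tally followed by a cumulative sum over the sorted counts. The Go sketch below is an illustration, not the code from the repository: the function names are hypothetical, and lowercasing plus splitting on underscores and whitespace is an assumed tokenization that may differ from the one used to get the numbers above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// countWords tallies lowercase words across all titles.
// Assumption: words are separated by underscores or whitespace.
func countWords(titles []string) map[string]int {
	counts := make(map[string]int)
	for _, title := range titles {
		words := strings.FieldsFunc(strings.ToLower(title), func(r rune) bool {
			return r == '_' || unicode.IsSpace(r)
		})
		for _, w := range words {
			counts[w]++
		}
	}
	return counts
}

// wordsCoveringHalf returns how many of the most frequent words are needed
// to account for at least half of all word occurrences.
func wordsCoveringHalf(counts map[string]int) int {
	freqs := make([]int, 0, len(counts))
	total := 0
	for _, c := range counts {
		freqs = append(freqs, c)
		total += c
	}
	// Sort counts from most to least frequent.
	sort.Sort(sort.Reverse(sort.IntSlice(freqs)))
	running := 0
	for i, c := range freqs {
		running += c
		if 2*running >= total {
			return i + 1
		}
	}
	return len(freqs)
}

func main() {
	// Made-up titles, only to exercise the functions.
	titles := []string{"List_of_films", "History_of_the_world", "The_film"}
	counts := countWords(titles)
	fmt.Println("unique words:", len(counts))
	fmt.Println("words covering half of all occurrences:", wordsCoveringHalf(counts))
}
```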

While the general idea behind Zipf's law appears to hold true, I found a mean absolute percentage error of 56% between the expected and actual frequencies of word occurrences.
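
For reference, a mean absolute percentage error against Zipf's prediction could be computed roughly as in the Go sketch below. It assumes the prediction for rank r is the top count divided by r and that every count is positive; zipfMAPE is a hypothetical name. The post does not state which ranks the 56% figure averages over, so running this on only the ten counts listed in the table will not reproduce it.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// zipfMAPE returns the mean absolute percentage error between observed counts
// and the counts Zipf's law predicts from the top-ranked count.
// Assumes all counts are positive; returns 0 for empty input.
func zipfMAPE(counts []int) float64 {
	if len(counts) == 0 {
		return 0
	}
	// Copy before sorting so the caller's slice is left untouched.
	sorted := append([]int(nil), counts...)
	sort.Sort(sort.Reverse(sort.IntSlice(sorted)))

	top := float64(sorted[0])
	sum := 0.0
	for i, c := range sorted {
		expected := top / float64(i+1) // rank is i+1
		actual := float64(c)
		sum += math.Abs(expected-actual) / actual
	}
	return 100 * sum / float64(len(sorted))
}

func main() {
	// Counts of the ten most frequent words in the table.
	top10 := []int{409348, 268992, 130674, 103288, 85683, 56617, 51254, 50234, 49731, 49051}
	fmt.Printf("MAPE over the top 10 words only: %.1f%%\n", zipfMAPE(top10))
}
```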

Below are the top 10 words. The most popular was "of", accounting for ~2.5% of all word occurrences. The final column shows the percent error between the theoretical frequency, as predicted by Zipf's law, and the actual frequency.

Rank  Word     Absolute   Relative       Expected Relative  % Error
               Frequency  Frequency (%)  Frequency (%)
1     of       409,348    2.498
2     the      268,992    1.64           1.25               -23.9
3     in       130,674    0.80           0.83                 4.4
4     list     103,288    0.63           0.62                -0.9
5     and       85,683    0.52           0.50                -4.5
6     de        56,617    0.35           0.42                20.5
7     county    51,254    0.31           0.36                14.1
8     film      50,234    0.31           0.31                 1.9
9     john      49,731    0.30           0.28                -8.5
10    station   49,051    0.30           0.25               -16.5
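
The derived columns can be recomputed from the absolute frequencies. The Go sketch below is one way to do it, using the 16,387,602 total word occurrences reported earlier. The sign convention, (expected - actual) / actual, is inferred from the table values rather than stated anywhere, and tableRow is a hypothetical name.

```go
package main

import "fmt"

// Total word occurrences across all article titles, as reported in the post.
const totalWords = 16387602

// tableRow computes the derived columns for a word with the given rank and
// count, where topCount is the count of the rank-1 word. All results are in
// percent. The sign convention for pctErr is inferred from the table.
func tableRow(rank, count, topCount int) (rel, expected, pctErr float64) {
	rel = 100 * float64(count) / totalWords
	expected = 100 * float64(topCount) / totalWords / float64(rank)
	pctErr = 100 * (expected - rel) / rel
	return rel, expected, pctErr
}

func main() {
	words := []string{"the", "in", "list"}
	counts := []int{268992, 130674, 103288}
	for i, w := range words {
		rank := i + 2 // these are ranks 2 through 4
		rel, expected, pctErr := tableRow(rank, counts[i], 409348)
		fmt.Printf("%d %-5s %.2f %.2f %.1f\n", rank, w, rel, expected, pctErr)
	}
}
```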

You can download the full list of unique words, counts, and relative frequencies here.

Method

You can find my golang source code at github.com/escholtz/wikizipf.


Hi, I'm Eddie Scholtz. These are my notes. You can reach me at eascholtz@gmail.com.