Do Wikipedia article titles obey Zipf's law?
George Zipf observed that in many forms of written text a few words occur often but most are rare.
Zipf's law states that the frequency of an item is inversely proportional to its rank in the frequency table. The most frequent word occurs twice as often as the second most frequent word, three times as often as the third most frequent word, and so on.
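Written as a formula, this is a sketch of the simplest form of the law, assuming the exponent is exactly 1. The symbols are my own shorthand: f(r) is the frequency of the word at rank r, so f(1) is the frequency of the most common word.

```latex
f(r) = \frac{f(1)}{r}, \qquad r = 1, 2, 3, \dots
```

So a word at rank 10 would be expected to occur one tenth as often as the word at rank 1.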
I was curious how well this law holds up, so I decided to examine Wikipedia article titles. Out of 5,674,805 article titles, there were 16,387,602 total word occurrences and 1,481,156 unique words. Of those unique words, the top 2,116 (about 0.1%) account for half of all word occurrences.
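A figure like "the top N words account for half of all occurrences" can be found by sorting the counts and keeping a running total. Here is a minimal Go sketch of that idea. The function name wordsToCover and the toy counts are hypothetical and not taken from the project source.

```go
package main

import (
	"fmt"
	"sort"
)

// wordsToCover returns how many of the most frequent words are needed for
// their combined count to reach the given fraction of all occurrences.
func wordsToCover(counts map[string]int, fraction float64) int {
	sorted := make([]int, 0, len(counts))
	total := 0
	for _, c := range counts {
		sorted = append(sorted, c)
		total += c
	}
	// Largest counts first, so index 0 is the rank 1 word.
	sort.Sort(sort.Reverse(sort.IntSlice(sorted)))

	target := fraction * float64(total)
	running := 0
	for i, c := range sorted {
		running += c
		if float64(running) >= target {
			return i + 1
		}
	}
	return len(sorted)
}

func main() {
	// Toy counts for illustration only; not the Wikipedia data.
	counts := map[string]int{"of": 5, "the": 3, "in": 1, "zipf": 1}
	fmt.Println(wordsToCover(counts, 0.5)) // prints 1
}
```

In the toy data, "of" alone makes up 5 of the 10 occurrences, so a single word already covers half.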
While the general idea behind Zipf's law appears to hold true, I found a mean absolute percentage error of 56% between the expected and actual frequency of word occurrences.
Below are the top 10 words. The most popular was "of", accounting for ~2.5% of all word occurrences. The final column shows the percent error between the theoretical frequency, as predicted by Zipf's law, and the actual frequency.
You can download the full list of unique words, counts, and relative frequencies here.
- Downloaded the July 1, 2018 Wikipedia English article dump.
- Used Dustin Sallings' dump parser to read the bzip2-compressed XML files.
- Used Marty Schoch's unicode text segmentation package (one approach to tokenization - splitting text into words).
- Filtered out pages that were redirects or not in namespace 0. Apparently, it's a bit controversial that disambiguation pages are considered articles in the first place.
- Normalized the text. Converted to lowercase and removed accents. (So that Pokémon, Pokemon, and pokemon are all equivalent.)
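The normalization and counting steps above can be sketched in Go using only the standard library. This is a simplified illustration and not the code from the project. The small accent table stands in for proper Unicode handling, splitting on non-letter, non-digit characters stands in for full Unicode text segmentation, and all function names and sample titles are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// accentFolder is a tiny hand-written table used only for illustration.
// A real pipeline would more likely decompose the text (Unicode NFD) and
// drop combining marks instead of listing characters one by one.
var accentFolder = strings.NewReplacer(
	"á", "a", "à", "a", "â", "a", "ä", "a",
	"é", "e", "è", "e", "ê", "e", "ë", "e",
	"í", "i", "ï", "i", "ó", "o", "ö", "o",
	"ú", "u", "ü", "u", "ñ", "n", "ç", "c",
)

// normalize lowercases a title and folds the accents listed above.
func normalize(s string) string {
	return accentFolder.Replace(strings.ToLower(s))
}

// tokenize splits on any rune that is neither a letter nor a number.
// Assumption: this is much cruder than Unicode text segmentation.
func tokenize(s string) []string {
	return strings.FieldsFunc(s, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsNumber(r)
	})
}

// countWords tallies normalized words across all titles.
func countWords(titles []string) map[string]int {
	counts := make(map[string]int)
	for _, t := range titles {
		for _, w := range tokenize(normalize(t)) {
			counts[w]++
		}
	}
	return counts
}

func main() {
	// Sample titles invented for illustration.
	titles := []string{"Pokémon", "Pokemon (video game series)", "List of pokemon"}
	counts := countWords(titles)
	fmt.Println(counts["pokemon"]) // prints 3
}
```

All three spellings end up under the same key, which is the point of normalizing before counting.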
You can find my Go source code at github.com/escholtz/wikizipf.