The Long Tail of the English Language

In the English language, the most common words are incredibly common.  Though there are at least 1 million words in the English language, "you", "I", and "the" account for 10% of the words we actually use. By the time you reach "is", at number 10, you've covered 20%.

The top 100 most common English words account for over 50% of the words we use, and 100 words is about as many as a 2-year-old knows. A 3-year-old would probably know most of the top 1,000 words, which cover 75%.  And by the 10,000th most common word, "remorse", you've covered over 88% of the words we commonly use. That leaves a lot of words you don't hear very much.
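To make the idea of "coverage" concrete, here is a minimal sketch of how cumulative coverage by rank could be computed from any list of words. The function name `coverage_by_rank` and the tiny sample sentence are made up for illustration; this is not the code or the corpus behind the numbers above.

```python
from collections import Counter

def coverage_by_rank(words):
    """Return (rank, word, cumulative_share) tuples, most frequent word first.

    cumulative_share is the fraction of all word occurrences accounted for
    by the words at this rank or better.
    """
    counts = Counter(words)
    total = sum(counts.values())
    running = 0
    rows = []
    for rank, (word, count) in enumerate(counts.most_common(), start=1):
        running += count
        rows.append((rank, word, running / total))
    return rows

# Hypothetical toy "corpus"; a real one would have millions of words.
sample = "the cat and the dog and the bird".split()

for rank, word, share in coverage_by_rank(sample):
    print(f"{rank:>2} {word:<5} {share:.1%}")
```

Even in this eight-word sample, the single most common word covers more than a third of all occurrences, and the top two cover more than half.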

If you put word frequency on a graph, like the one below, you quickly see an interesting distribution called the Long Tail.  It happens when a small number of items account for a disproportionate share of occurrences, as with the books that Amazon sells.


You can search for words to see where they fall on the graph by typing the word into the Search box.

It's somewhat heartening to see that "good" is more common than "evil" (58th vs 978th), "love" more common than "hate" (110th vs 527th), and "happy" more common than "sad" (272nd vs 844th). But "war" is still more common than "peace" (492nd vs 797th).

Many words occur just once every million words, like "icing", or even less often.  "Tremulous", meaning "quivering as from weakness or fear", occurs 0.04 times per million words, meaning you'd hear it about once every 25 million words. Some words occur so infrequently that we didn't find any samples of them, even though we counted almost 70 million words in total.
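The arithmetic behind those figures is simple enough to sketch. The two helper functions below are hypothetical names chosen for this example, and the 70 million figure is the rounded sample size mentioned above, so the results are approximate.

```python
def words_between_occurrences(per_million):
    """Roughly how many words go by, on average, between two uses of a word."""
    if per_million <= 0:
        raise ValueError("frequency must be positive")
    return round(1_000_000 / per_million)

def expected_occurrences(per_million, corpus_size):
    """How many times a word would be expected to appear in a corpus of this size."""
    return per_million * corpus_size / 1_000_000

# "icing": about once per million words
print(words_between_occurrences(1))

# "tremulous": 0.04 times per million words
print(words_between_occurrences(0.04))

# Expected sightings of "tremulous" in a sample of roughly 70 million words
print(round(expected_occurrences(0.04, 70_000_000), 1))
```

At 0.04 per million, a word would be expected to show up only about three times in a sample of that size, which is why anything much rarer can easily be missing altogether.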

If you're interested in using Words API to find word frequency information, be sure to check out the documentation.  If you're interested in seeing how we got this data, be sure to read this earlier post about our open source tools.

Be sure to follow us on Twitter for more word facts! You can also discuss this on Hacker News.