Words API open sources tools to calculate English word frequency

Words API lets you search for words using many different parameters, one of which is how frequently the word is used. Today, we're open sourcing the tools we used to calculate those frequency numbers.

You can view the tools on GitHub:

The Idea

Computing word frequency isn't a new idea, but most of the data sources are old, not in the public domain, or computed from a fairly small corpus. In 2009, Marc Brysbaert and Boris New released a paper describing a new way to calculate word frequencies. They grabbed data from OpenSubtitles.org, which they showed in their paper reflected real-world usage better than other data sources.

You can grab a copy of the resulting data from here.  Unfortunately, they didn't release the tools they used. The tools we're releasing mimic the process described in their paper, so you can run your own calculations.

The Data

The tools we're releasing use XML files from Open Corpus, since the words are already tokenized and are available as one big .tar.gz file (11 GB, compressed). Open Corpus got the data from Open Subtitles.

If you use this data, please consider becoming a supporter of that site.

The Tools

The data from Open Corpus includes a lot of "duplicate" files. This is because each movie (or TV episode) may have been released in multiple formats or encodings. Deduper.js deletes all but one file per movie.
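To make the deduping idea concrete, here is a minimal sketch of the "keep one file per movie" rule. It is not the code in Deduper.js. It assumes each path ends in a movie ID directory followed by a file name, which is a hypothetical layout, and the function name pickOnePerMovie is made up for illustration. It only decides which paths to keep and which to remove, and leaves the actual deletion out.

```javascript
// Sketch only: assumes paths look like ".../<movieId>/<file>.xml".
// That layout is a hypothetical stand-in for the real corpus structure.
function pickOnePerMovie(paths) {
  const keep = new Map(); // movieId -> first path seen for that movie
  const remove = [];
  for (const path of paths) {
    const parts = path.split("/");
    const movieId = parts[parts.length - 2]; // name of the parent directory
    if (keep.has(movieId)) {
      remove.push(path); // a file for this movie was already kept
    } else {
      keep.set(movieId, path);
    }
  }
  return { keep: [...keep.values()], remove };
}

const dedupeResult = pickOnePerMovie([
  "en/2009/1234/a.xml",
  "en/2009/1234/b.xml",
  "en/2010/5678/c.xml",
]);
console.log(dedupeResult.remove);
```

Keeping the first file seen is an arbitrary choice here; any rule that leaves exactly one file per movie would do the same job.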

Brysbaert and New rejected any file with more than 2.5% spelling errors, as judged by aspell. The spelling filter mimics that, with a configurable threshold.
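The threshold check itself is simple, and a rough sketch may help. This is not the released filter: a plain Set of known words stands in for aspell, and the names spellingErrorRate and passesSpellingFilter are hypothetical.

```javascript
// Sketch only: a Set of known words stands in for aspell.
function spellingErrorRate(tokens, knownWords) {
  if (tokens.length === 0) return 0;
  let misses = 0;
  for (const token of tokens) {
    if (!knownWords.has(token.toLowerCase())) misses += 1;
  }
  return misses / tokens.length; // fraction of tokens not found
}

// A file passes when its error rate does not exceed the threshold
// (2.5% by default, matching the figure quoted from the paper).
function passesSpellingFilter(tokens, knownWords, threshold = 0.025) {
  return spellingErrorRate(tokens, knownWords) <= threshold;
}

const knownWords = new Set(["the", "cat", "sat"]);
const sampleTokens = ["The", "cat", "sat", "xzq"];
console.log(passesSpellingFilter(sampleTokens, knownWords));
console.log(passesSpellingFilter(sampleTokens, knownWords, 0.3));
```

In the sample, one token out of four is unknown, so the file fails at the default threshold and passes at a looser one.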

The final tool runs through the remaining XML files and counts the words.

The Output

The result is a big JSON file, where each word is a key. Each key holds a freqCount number, which is the total number of times the word was seen, and a cdCount number, which is the number of files it was seen in.
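To show how the two numbers differ, here is a small sketch that builds the same shape from tokens held in memory. It is not the released counter: it skips the XML parsing, takes each file as an array of words, and lowercases every word, which is an assumption about normalization. The name countWords is hypothetical.

```javascript
// Sketch only: each "file" is an array of already-tokenized words.
function countWords(documents) {
  const counts = Object.create(null); // no inherited keys such as "constructor"
  for (const tokens of documents) {
    const seenInThisFile = new Set();
    for (const token of tokens) {
      const word = token.toLowerCase(); // assumed normalization
      if (!counts[word]) counts[word] = { freqCount: 0, cdCount: 0 };
      counts[word].freqCount += 1; // every occurrence counts
      if (!seenInThisFile.has(word)) {
        seenInThisFile.add(word);
        counts[word].cdCount += 1; // at most once per file
      }
    }
  }
  return counts;
}

const sampleCounts = countWords([
  ["the", "cat", "saw", "the", "dog"],
  ["The", "dog", "barked"],
]);
console.log(JSON.stringify(sampleCounts.the)); // {"freqCount":3,"cdCount":2}
```

Here "the" appears three times across two files, so freqCount is 3 and cdCount is 2, while "cat" gets 1 for both.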

You can slice and dice it to your heart's content.

How Words API Uses Frequency Information

When you retrieve a word from Words API, it will have a frequency score. This is a log10 representation of the total frequency of the word. The scale runs roughly from 1 to 8, which makes it a bit easier to compare frequency between words. You can find out more in the documentation.
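As an illustration of why a log10 scale is easier to compare, here is a sketch that turns a raw count into a score. The exact formula Words API uses is not given here, so scaling the count to occurrences per billion words before taking the logarithm is purely an assumption, and frequencyScore is a hypothetical name.

```javascript
// Sketch only: the per-billion normalization is an assumption,
// not the documented Words API formula.
function frequencyScore(freqCount, totalWords) {
  const perBillion = (freqCount / totalWords) * 1e9;
  return Math.log10(perBillion);
}

console.log(frequencyScore(1000, 1e9).toFixed(2));
console.log(frequencyScore(1000000, 1e9).toFixed(2));
```

Under that assumption, a word seen 1,000 times in a billion-word corpus scores about 3 and one seen a million times scores about 6, so each step of 1 on the scale means roughly ten times as many occurrences.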
