You can view the tools on GitHub:
Computing word frequency isn't a new idea, but most of the available data sources are old, are not in the public domain, or were not computed from a very large corpus. In 2009, Marc Brysbaert and Boris New published a paper describing a new way to calculate word frequencies. They drew their data from OpenSubtitles.org and showed in the paper that it reflects real-world usage better than other data sources.
You can grab a copy of the resulting data from here. Unfortunately, they didn't release the tools they used. The tools we're releasing mimic the process described in their paper, so you can run your own calculations.
The tools we're releasing use XML files from Open Corpus, since the words are already tokenized and the files are available as one big .tar.gz file (11 GB, compressed). Open Corpus got the data from Open Subtitles.
If you use this data, please consider becoming a supporter of that site.
The data from Open Corpus includes a lot of "duplicate" files. This is because each movie (or TV episode) may have been released in multiple formats or encodings. Deduper.js will delete all but one file per movie.
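To make the idea concrete, here is a rough sketch of deduplication. It is not the actual Deduper.js code. It assumes that all subtitle files for one movie sit in the same directory, which is an assumption about the corpus layout, and the `pickDuplicates` helper and the example paths are hypothetical.

```javascript
const path = require("path");

// Hypothetical helper: given a list of file paths, return the ones that would
// be deleted so that only one file per directory (assumed: one movie) remains.
function pickDuplicates(filePaths) {
  const keptByMovie = new Map();
  const toDelete = [];
  // Sort a copy so the choice of which file to keep is deterministic.
  for (const file of [...filePaths].sort()) {
    const movieDir = path.posix.dirname(file);
    if (keptByMovie.has(movieDir)) {
      toDelete.push(file); // a file for this movie was already kept
    } else {
      keptByMovie.set(movieDir, file);
    }
  }
  return toDelete;
}

// Made-up paths, for illustration only.
const example = [
  "corpus/movie-1/b.xml",
  "corpus/movie-1/a.xml",
  "corpus/movie-2/c.xml",
];
console.log(pickDuplicates(example)); // [ 'corpus/movie-1/b.xml' ]
```

Returning the list of files instead of deleting them right away makes it easy to review what would be removed before touching the disk.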
Brysbaert and New rejected any file in which more than 2.5% of the words were spelling errors, based on aspell. This filter mimics that, with a configurable threshold.
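The threshold check itself is simple arithmetic. The sketch below is only an illustration: `passesSpellCheck` is a hypothetical name, and a small `Set` stands in for aspell, so it shows the shape of the filter rather than how the released tool talks to the spell checker.

```javascript
// Hypothetical helper: returns true when the share of unknown words in a file
// is at or below the threshold (default 2.5%, the figure from the paper).
function passesSpellCheck(words, isKnownWord, maxErrorRate = 0.025) {
  // Treating an empty file as rejected is an assumption of this sketch.
  if (words.length === 0) return false;
  const errors = words.filter((w) => !isKnownWord(w)).length;
  return errors / words.length <= maxErrorRate;
}

// Stand-in dictionary; a real run would ask aspell instead.
const dictionary = new Set(["the", "cat", "sat", "on", "mat"]);
const isKnown = (w) => dictionary.has(w.toLowerCase());

console.log(passesSpellCheck(["The", "cat", "sat"], isKnown));  // true
console.log(passesSpellCheck(["The", "catt", "sat"], isKnown)); // false
```

Passing the dictionary lookup in as a function keeps the threshold logic separate from whichever spell checker is used.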
The counting step runs through the XML files that remain and counts words.
The result is a big JSON file, where each word is a key. Each key maps to an object with a freqCount number, which is the total number of times the word was seen, and a cdCount number, which is the number of files it was seen in.
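A minimal sketch of how the two counts relate is shown here. It assumes each file has already been reduced to an array of tokens, and it lowercases every token, which is an assumption of the sketch and may differ from what the released tool does. The `countWords` name is hypothetical.

```javascript
// Hypothetical helper: `files` is an array of token arrays, one per subtitle file.
function countWords(files) {
  const counts = new Map();
  for (const tokens of files) {
    const seenInFile = new Set(); // words already credited to this file
    for (const token of tokens) {
      const word = token.toLowerCase(); // lowercasing is an assumption
      let entry = counts.get(word);
      if (!entry) {
        entry = { freqCount: 0, cdCount: 0 };
        counts.set(word, entry);
      }
      entry.freqCount += 1; // every occurrence counts
      if (!seenInFile.has(word)) {
        seenInFile.add(word);
        entry.cdCount += 1; // at most once per file
      }
    }
  }
  return Object.fromEntries(counts);
}

const result = countWords([
  ["the", "cat", "the"],
  ["The", "dog"],
]);
console.log(JSON.stringify(result));
// {"the":{"freqCount":3,"cdCount":2},"cat":{"freqCount":1,"cdCount":1},"dog":{"freqCount":1,"cdCount":1}}
```

Using a `Map` while counting avoids clashes with built-in object properties such as `constructor`, which can turn up as ordinary words in subtitles.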
You can slice and dice it to your heart's content.
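As one example of slicing, the sketch below picks the most frequent words that appear in at least a given number of files. The `topWords` helper and the sample numbers are made up for illustration; only the freqCount and cdCount field names come from the output format.

```javascript
// Hypothetical helper: return the `limit` most frequent words that appear in
// at least `minFiles` files.
function topWords(data, limit, minFiles = 1) {
  return Object.entries(data)
    .filter(([, entry]) => entry.cdCount >= minFiles)
    .sort(([, a], [, b]) => b.freqCount - a.freqCount)
    .slice(0, limit)
    .map(([word]) => word);
}

// Made-up numbers, for illustration only.
const sample = {
  the: { freqCount: 900, cdCount: 50 },
  cat: { freqCount: 40, cdCount: 12 },
  zyzzyva: { freqCount: 60, cdCount: 1 },
};
console.log(topWords(sample, 2, 2)); // [ 'the', 'cat' ]
```

Filtering on cdCount is a handy way to drop words that are frequent only because a single file repeats them.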
How Words API Uses Frequency Information
When you retrieve a word from Words API, it will have a frequency score. This is a log10 representation of the total frequency of the word. The scale runs roughly from 1 to 8, which makes it a bit easier to compare frequency between words. You can find out more in the documentation.
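To show why a log scale is easier to compare, here is a small sketch that turns a raw count into a log10 score. The exact formula Words API uses is not spelled out here, so treat `frequencyScore`, the rounding to two decimals, and the handling of non-positive counts as assumptions of this illustration.

```javascript
// Hypothetical helper: log10 of a raw count, rounded to two decimals.
function frequencyScore(freqCount) {
  if (!(freqCount > 0)) return null; // log10 is undefined for zero or negatives
  return Math.round(Math.log10(freqCount) * 100) / 100;
}

console.log(frequencyScore(10));      // 1
console.log(frequencyScore(1000000)); // 6
```

On this kind of scale, a difference of one point means a word is about ten times more frequent.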