Words API Blog

Major Speedup!

We've just finished making some changes to our backend, and most responses are now 70% faster.  Let us know if you run into any issues!

Searching for Words by Detail Type

We've expanded the Words API search functionality so that you can now search for words that have one or more detail types.  For instance, if you wanted to find words that have both "hasUsages" and "hasCategories" relationships, you would call the API like this:

https://wordsapiv1.p.mashape.com/words/?hasDetails=hasUsages,hasCategories

{
  "query": {
    "hasDetails": "hasUsages,hasCategories",
    "limit": 100,
    "page": 1
  },
  "results": {
    "total": 7,
    "data": [
      "humor",
      "humour",
      "trademark",
      "wit",
      "witticism",
      "wittiness",
      "yiddish"
    ]
  }
}

You can also use this to search for random words, like so:

https://wordsapiv1.p.mashape.com/words/?hasDetails=typeOf&frequencyMin=2&frequencyMax=3&random=true

{
  "word": "gloaming",
  "results": [
    {
      "definition": "the time of day immediately following sunset",
      "partOfSpeech": "noun",
      "synonyms": [
        "crepuscle",
        "crepuscule",
        "dusk",
        "evenfall",
        "fall",
        "gloam",
        "nightfall",
        "twilight"
      ],
      "typeOf": [
        "hour",
        "time of day"
      ],
      "hasTypes": [
        "night"
      ],
      "partOf": [
        "eve",
        "eventide",
        "even",
        "evening"
      ]
    }
  ],
  "syllables": {
    "count": 2,
    "list": [
      "gloam",
      "ing"
    ]
  },
  "pronunciation": {
    "all": "'gloʊmɪŋ"
  },
  "frequency": 1.73
}
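
If you'd rather build these search URLs in code than by hand, here's a minimal Python sketch. The build_search_url helper is a hypothetical name of ours, not part of Words API, and the sketch only assembles the URL; it doesn't send the request or handle authentication.

```python
from urllib.parse import urlencode

# Base URL copied from the examples above.
BASE_URL = "https://wordsapiv1.p.mashape.com/words/"


def build_search_url(has_details, **extra_params):
    """Build a search URL for words that have all of the given detail types.

    build_search_url is a hypothetical helper for illustration only.
    """
    params = {"hasDetails": ",".join(has_details)}
    params.update(extra_params)
    # safe="," keeps the comma-separated list unescaped, as in the examples above.
    return BASE_URL + "?" + urlencode(params, safe=",")


print(build_search_url(["hasUsages", "hasCategories"]))
print(build_search_url(["typeOf"], frequencyMin=2, frequencyMax=3, random="true"))
```

The two calls reproduce the two request URLs shown above.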


For other ways to search for words, be sure to check out the documentation.

The Long Tail of the English Language

In the English language, the most common words are incredibly common.  Though there are at least 1 million words in the English language, "you", "I", and "the" account for 10% of the words we actually use. By the time you reach "is", at number 10, you've covered 20%.

The top 100 most common English words account for over 50% of the words we use, which is about how many words a 2-year-old knows. A 3-year-old would probably know most of the top 1,000 words, which cover 75%.  And by the 10,000th most common word, "remorse", you've covered over 88% of the words we commonly use. That leaves a lot of words you don't hear very much.

If you put word frequency on a graph, like the one below, you quickly see an interesting distribution called the Long Tail.  It happens when a small number of items account for a disproportionate number of occurrences, such as the books that Amazon sells.


You can search for words to see where they fall on the graph by typing the word into the Search box.

It's somewhat heartening to see that "good" is more common than "evil" (58th vs 978th), "love" more common than "hate" (110th vs 527th), and "happy" more common than "sad" (272nd vs 844th). But "war" is still more common than "peace" (492nd vs 797th).

Many words occur just once every million words, like "icing", or even less often.  "Tremulous", meaning "quivering as from weakness or fear", occurs 0.04 times per million words, meaning you'd hear it about once every 25 million words. Some words occur so infrequently that we didn't find any samples of them, even though we counted almost 70 million words.
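
The arithmetic behind that 25 million figure is just one million divided by the per-million rate. Here's a small Python sketch; the function name is ours, chosen for illustration.

```python
def words_between_occurrences(per_million):
    """Average number of words you'd encounter per occurrence of a word."""
    if per_million <= 0:
        raise ValueError("per_million must be positive")
    return 1_000_000 / per_million


# A word seen 0.04 times per million words shows up about once
# every 25 million words.
print(f"{words_between_occurrences(0.04):,.0f}")  # 25,000,000
```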

If you're interested in using Words API to find word frequency information, be sure to check out the documentation.  If you're interested in seeing how we got this data, be sure to read this earlier post about our open source tools.

Be sure to follow us on Twitter for more word facts! You can also discuss this on Hacker News.

Expanded Frequency Information

Based on our recently open-sourced tools, we've updated the word frequency information that Words API returns.  The frequency scores for most words have changed, but the relative frequency of the words has not changed much.  The most common words are still "you", "I", "the", etc.

What has changed is that the /words/{word}/frequency endpoint now returns more information than just the one score.  For example, the frequency results for "apartment" now look like this:


{
  "word" : "apartment",
  "frequency": {
    "zipf": 4.82,
    "perMillion": 65.76,
    "diversity": 0.17
  }
}


Here's what each of those means:
  • zipf: This is the same number that is returned for frequency at the main words endpoint. It's a log10 scale representation of the number of times the word appeared in our corpus. It ranges from 1 to 7, where a higher number means a word that was seen more frequently. For more information, you may want to see this paper.
  • perMillion: In any given corpus of one million English words, this is the number of times you can expect to see the word. It's a common frequency measurement that academic papers use.
  • diversity: In a document that represents part of a corpus, this is the probability that the given word will appear at least once. It ranges from 0 to 1.
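
To see how zipf and perMillion relate, here's a short Python sketch. It assumes the standard Zipf-scale convention (log10 of occurrences per billion words, which is log10 of perMillion plus 3); we haven't confirmed that this is exactly how the API computes its score, but it agrees with the "apartment" numbers above. The function names are ours.

```python
import math


def zipf_from_per_million(per_million):
    """Zipf score, assuming zipf = log10(occurrences per billion words)."""
    return math.log10(per_million) + 3


def per_million_from_zipf(zipf):
    """Inverse of the conversion above, under the same assumption."""
    return 10 ** (zipf - 3)


# "apartment": perMillion 65.76 corresponds to a zipf score of 4.82.
print(round(zipf_from_per_million(65.76), 2))  # 4.82
```

Because the scale is logarithmic, each extra point of zipf means a word is ten times more common.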

We hope these new ways of looking at word frequency are helpful! If you have any questions, please let us know: support@wordsapi.com

Words API open sources tools to calculate English word frequency

Words API lets you search for words using a lot of different parameters, one of which is how frequently the word is used.  Today, we're open sourcing the tools we used to calculate those frequency numbers.

You can view the tools on GitHub:
https://github.com/WordsAPI/wordfrequencies

The Idea


Computing word frequency isn't a new idea, but most of the data sources are either old, not in the public domain, or weren't computed from a very big corpus of data.  In 2009, Marc Brysbaert and Boris New released a paper describing a new way to calculate word frequencies.  They grabbed data from OpenSubtitles.org, which they showed in their paper better reflected real world usage than other data sources.

You can grab a copy of the resulting data from here.  Unfortunately, they didn't release the tools they used. The tools we're releasing mimic the process described in their paper, so you can run your own calculations.

The Data


The tools we're releasing use XML files from Open Corpus, since the words are already tokenized and are available as one big .tar.gz file (11 GB, compressed). Open Corpus got the data from Open Subtitles.

If you use this data, please consider becoming a supporter of that site.

The Tools


deduper.js
The data from Open Corpus includes a lot of "duplicate" files. This is because each movie (or TV episode) may have been released in multiple formats/encodings.  deduper.js deletes all but one file per movie.

spellingFilter.js
Brysbaert and New rejected any file that had greater than 2.5% spelling errors, based on aspell. This filter will mimic that, with a configurable threshold.
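
The idea behind the spelling filter can be sketched in a few lines of Python. This is not the code from spellingFilter.js: it swaps aspell for a plain set of known words, and the function names and the tiny dictionary are made up for illustration.

```python
def spelling_error_rate(tokens, dictionary):
    """Fraction of alphabetic tokens that are not in the dictionary."""
    words = [t.lower() for t in tokens if t.isalpha()]
    if not words:
        return 0.0
    misspelled = sum(1 for w in words if w not in dictionary)
    return misspelled / len(words)


def passes_spelling_filter(tokens, dictionary, threshold=0.025):
    """Keep a file unless more than `threshold` of its words are misspelled."""
    return spelling_error_rate(tokens, dictionary) <= threshold


# Toy dictionary standing in for aspell (an assumption for this sketch).
known = {"the", "cat", "sat", "on", "mat"}
print(passes_spelling_filter(["The", "cat", "sat", "on", "the", "mat"], known))  # True
print(passes_spelling_filter(["teh", "cat", "sat", "on", "teh", "mat"], known))  # False
```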

frequencyCounter.js
Runs through the XML files that remain and counts words.

The Output


The result is a big JSON file, where each word is a key. Each key contains a freqCount number, which represents the total number of times the word was seen, and a cdCount number, which is the number of files it was seen in.
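
To make the output format concrete, here's a rough Python sketch that produces the same shape from a couple of tiny in-memory "files". It is not the code from frequencyCounter.js: the function name is ours, the input is already-tokenized lists rather than XML, and lowercasing every token is an assumption.

```python
import json
from collections import Counter


def count_frequencies(documents):
    """documents: an iterable of token lists, one list per file."""
    freq_count = Counter()  # total occurrences across all files
    cd_count = Counter()    # number of files each word appears in
    for tokens in documents:
        words = [t.lower() for t in tokens]  # lowercasing is an assumption
        freq_count.update(words)
        cd_count.update(set(words))          # count each file at most once
    return {
        word: {"freqCount": freq_count[word], "cdCount": cd_count[word]}
        for word in freq_count
    }


docs = [
    ["the", "cat", "saw", "the", "dog"],
    ["the", "dog", "barked"],
]
result = count_frequencies(docs)
print(json.dumps(result["the"]))  # {"freqCount": 3, "cdCount": 2}
```

Here "the" was seen three times in total (freqCount) across two files (cdCount).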

You can slice and dice it to your heart's content.

How Words API Uses Frequency Information


When you retrieve a word from Words API, it will have a frequency score. This is a log10 representation of the total frequency of the word. The scale is roughly from 1-8, making it a bit easier to compare frequency between words.  You can find out more in the documentation.

Today's Random Word: Arcanum

You can now use the API to retrieve a random word, optionally matching search criteria. The basic endpoint is the same as the Search endpoint; you'll just need to append "random=true" to the request, like so:

https://www.wordsapi.com/words/?random=true

The response will be a single JSON word document that should look something like this:

{
  "word": "arcanum",
  "frequency": 2.02,
  "results": [
    {
      "definition": "information known only to a special group",
      "partOfSpeech": "noun",
      "synonyms": [
        "secret"
      ],
      "typeOf": [
        "information",
        "info"
      ],
      "hasTypes": [
        "cabbalah",
        "countersign",
        "esoterica",
        "kabala",
        "kabbala",
        "kabbalah",
        "parole",
        "password",
        "qabala",
        "qabalah",
        "watchword",
        "cabala",
        "word",
        "cabbala"
      ]
    }
  ],
  "syllables": {
    "count": 3,
    "list": [
      "ar",
      "ca",
      "num"
    ]
  },
  "pronunciation": {
    "all": "ɑr'keɪnəm"
  }
}


You can refine your search by adding any of the existing search criteria, such as "letterPattern" or "pronunciationPattern". Check the docs for a full list of search parameters available.

What's a Verb that Rhymes with 'Wow'? Searching for words with Words API



We just introduced the ability to search for words with Words API. You can search for letter patterns, pronunciation patterns, number of syllables, parts of speech, and more.

You can read the docs for a list of all the parameters you can use, but here are a few examples of the results you can find.

Verbs that Rhyme with Wow


First, use the API to find the pronunciation of wow:

https://www.wordsapi.com/words/wow

The response JSON includes this:

 "pronunciation": {
    "all": "waʊ"
  }

To find a rhyme, you look for words that match everything from the last stressed vowel phoneme to the end of the word. So to search for verbs that rhyme with "wow", we'll use the pronunciationPattern parameter to specify a regular expression to find words that end with that pattern, and the partOfSpeech parameter to only look at verbs:

https://www.wordsapi.com/words/?pronunciationPattern=aʊ$&partOfSpeech=verb

Results:

allow, avow, bow, cow, disallow, disavow, endow, kotow, kowtow, meow, miaow, plough, plow, row, slough, vow, wow
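
Here's a small Python sketch of what that end-anchored pattern does, plus one way to percent-encode the query when calling from code. The pronunciations other than "wow" are hand-written samples for illustration, not API output, and we're assuming the API's patterns behave like ordinary regular expressions.

```python
import re
from urllib.parse import urlencode

# Hand-written sample pronunciations for illustration; not API output.
SAMPLE = {
    "wow": "waʊ",
    "cow": "kaʊ",
    "allow": "ə'laʊ",
    "window": "'wɪndoʊ",
}

# "$" anchors the pattern to the end of the pronunciation.
rhyme = re.compile("aʊ$")
matches = sorted(word for word, ipa in SAMPLE.items() if rhyme.search(ipa))
print(matches)  # ['allow', 'cow', 'wow']

# IPA characters and "$" are percent-encoded when sent in a URL.
query = urlencode({"pronunciationPattern": "aʊ$", "partOfSpeech": "verb"})
print("https://www.wordsapi.com/words/?" + query)
```

Browsers usually do this encoding for you, but most HTTP libraries expect it to be done explicitly.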


Really Long Words


Let's find the longest words that Words API has, excluding those that have a space in them. Since we don't know in advance how long the words are, we'll use the lettersMin parameter to find words with at least 20 characters. We'll use the letterPattern parameter to exclude words with spaces.

https://www.wordsapi.com/words/?lettersMin=20&letterPattern=^((?! ).)*$

Results:

abetalipoproteinemia, acrylonitrile-butadiene-styrene, adrenocorticotrophic, adrenocorticotrophin
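
If the letterPattern above looks cryptic, this Python sketch shows how it behaves locally. The negative lookahead (?! ) rejects any position followed by a space, so only entries with no spaces match. Using len() to mimic lettersMin is an assumption on our part, and "adrenocorticotropic hormone" is a made-up multi-word entry for contrast.

```python
import re

# Same pattern as in the request above: every character must not be a space.
NO_SPACES = re.compile(r"^((?! ).)*$")


def is_long_single_word(entry, letters_min=20):
    """True if the entry has at least letters_min characters and no spaces."""
    return len(entry) >= letters_min and NO_SPACES.match(entry) is not None


candidates = [
    "abetalipoproteinemia",
    "adrenocorticotrophic",
    "acrylonitrile-butadiene-styrene",
    "adrenocorticotropic hormone",  # hypothetical entry containing a space
    "drachm",
]
print([c for c in candidates if is_long_single_word(c)])
```

Only the first three candidates pass: the fourth contains a space and the last is too short.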


Words That Don't Sound Like You May Think


You can combine the pronunciationPattern and letterPattern parameters to find words that may not sound the way you'd think, based on their spelling. Here's a word that ends with the "am" sound (like in "clam"), but whose final letters are "chm".

https://www.wordsapi.com/words/?pronunciationPattern=.*æm$&letterPattern=.*chm$

Results:

drachm

If you're interested in seeing what else Words API can do, check out the demo on our homepage.