Words API Blog: January 2015

Searching for Words by Detail Type

We've expanded the Words API search functionality so that you can now search for words that have one or more detail types. For instance, if you wanted to find words that have both "hasUsages" and "hasCategories" relationships, you would call the API like this:

https://wordsapiv1.p.mashape.com/words/?hasDetails=hasUsages,hasCategories

{
  "query": {
    "hasDetails": "hasUsages,hasCategories",
    "limit": 100,
    "page": 1
  },
  "results": {
    "total": 7,
    "data": [
      "humor",
      "humour",
      "trademark",
      "wit",
      "witticism",
      "wittiness",
      "yiddish"
    ]
  }
}
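
As a rough sketch of how a client might assemble that request and read the result, the Python below builds the query string and pulls the word list out of a response shaped like the one above. The helper name build_search_url is made up for illustration, and the sketch stops short of sending the request, since any authentication the API host requires is not covered here.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://wordsapiv1.p.mashape.com/words/"  # host taken from the example above


def build_search_url(**params):
    """Build a search URL; list values become comma-separated (hypothetical helper)."""
    flat = {
        key: ",".join(value) if isinstance(value, (list, tuple)) else value
        for key, value in params.items()
    }
    # Keep commas readable instead of percent-encoding them as %2C.
    return BASE_URL + "?" + urlencode(flat, safe=",")


url = build_search_url(hasDetails=["hasUsages", "hasCategories"])
print(url)

# A copy of the response shown above, to demonstrate reading it.
sample_response = json.loads("""
{
  "query": {"hasDetails": "hasUsages,hasCategories", "limit": 100, "page": 1},
  "results": {
    "total": 7,
    "data": ["humor", "humour", "trademark", "wit", "witticism", "wittiness", "yiddish"]
  }
}
""")
words = sample_response["results"]["data"]
print(words)
```

Passing the detail types as a list keeps the calling code tidy when you want to require several of them at once.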

You can also use this to search for random words, like so:

https://wordsapiv1.p.mashape.com/words/?hasDetails=typeOf&frequencyMin=2&frequencyMax=3&random=true

{
  "word": "gloaming",
  "results": [
    {
      "definition": "the time of day immediately following sunset",
      "partOfSpeech": "noun",
      "synonyms": [
        "crepuscle",
        "crepuscule",
        "dusk",
        "evenfall",
        "fall",
        "gloam",
        "nightfall",
        "twilight"
      ],
      "typeOf": [
        "hour",
        "time of day"
      ],
      "hasTypes": [
        "night"
      ],
      "partOf": [
        "eve",
        "eventide",
        "even",
        "evening"
      ]
    }
  ],
  "syllables": {
    "count": 2,
    "list": [
      "gloam",
      "ing"
    ]
  },
  "pronunciation": {
    "all": "'gloʊmɪŋ"
  },
  "frequency": 1.73
}

For other ways to search for words, be sure to check out the documentation.

The Long Tail of the English Language

In the English language, the most common words are incredibly common. Though there are at least 1 million words in the English language, "you", "I", and "the" account for 10% of the words we actually use. By the time you reach "is", at number 10, you've covered 20%.

The 100 most common English words account for over 50% of the words we use; 100 is also about how many words a 2-year-old knows. A 3-year-old would probably know most of the top 1,000 words, which cover 75%. And by the 10,000th most common word, "remorse", you've covered over 88% of the words we commonly use. That leaves a lot of words you don't hear very much.

If you put word frequency on a graph, like the one below, you quickly see an interesting distribution called the Long Tail. It happens when a small number of items account for a disproportionate number of occurrences, such as the books that Amazon sells.

You can search for words to see where they fall on the graph by typing the word into the Search box.

It's somewhat heartening to see that "good" is more common than "evil" (58th vs 978th), "love" more common than "hate" (110th vs 527th), and "happy" more common than "sad" (272nd vs 844th). But "war" is still more common than "peace" (492nd vs 797th).

Many words occur just once every million words, like "icing", or even less often. "Tremulous", meaning "quivering as from weakness or fear", occurs 0.04 times per million words, meaning you'd hear it about once every 25 million words. Some words occur so infrequently that we didn't find any samples of them, even though we counted almost 70 million words.
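
The conversion from a per-million rate to "once every N words" is a single division. Here is a small sketch in Python; the function name is made up for illustration.

```python
def words_between_occurrences(per_million):
    """Average number of words per occurrence for a word seen `per_million` times per million words."""
    if per_million <= 0:
        raise ValueError("per_million must be positive")
    return 1_000_000 / per_million


print(round(words_between_occurrences(0.04)))  # "tremulous": about 25 million
print(round(words_between_occurrences(1)))     # a once-per-million word such as "icing"
```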

If you're interested in using Words API to find word frequency information, be sure to check out the documentation. If you're interested in seeing how we got this data, be sure to read this earlier post about our open source tools.

Be sure to follow us on Twitter for more word facts! You can also discuss this on Hacker News.

Expanded Frequency Information

Based on our recently open-sourced tools, we've updated the word frequency information that Words API returns. The frequency score for most words has changed, but the relative frequency of the words has not changed much. The most common words are still "you", "I", "the", etc.

What has changed is that the /words/{word}/frequency endpoint now returns more than just the one score. For example, the frequency results for "apartment" now look like this:


{
  "word" : "apartment",
  "frequency": {
    "zipf": 4.82,
    "perMillion": 65.76,
    "diversity": 0.17
  }
}

Here's what each of those means:

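zipf: a logarithmic frequency score (see this paper).

perMillion: the number of times the word occurs per million words.

diversity: the share of source documents in which the word appears, on a scale from 0 to 1.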
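
The zipf and perMillion values in the "apartment" example are consistent with the commonly used Zipf scale, which is the base-10 logarithm of the per-million frequency plus 3. That formula is an assumption here, checked only against this one example, so treat the sketch below as illustrative rather than as how the API computes the score.

```python
import math


def zipf_from_per_million(per_million):
    """Zipf-scale value under the assumed formula log10(per-million frequency) + 3."""
    return math.log10(per_million) + 3


# perMillion for "apartment" in the example above is 65.76.
print(round(zipf_from_per_million(65.76), 2))  # 4.82, matching the zipf value above
```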

We hope these new ways of looking at word frequency are helpful! If you have any questions, please let us know: support@wordsapi.com

Words API open sources tools to calculate English word frequency

Words API lets you search for words using a lot of different parameters, one of which is how frequently the word is used. Today, we're open sourcing the tools we used to calculate those frequency numbers.

You can view the tools on GitHub:
https://github.com/WordsAPI/wordfrequencies

The Idea

Computing word frequency isn't a new idea, but most of the data sources are either old, not in the public domain, or weren't computed from a very big corpus of data. In 2009, Marc Brysbaert and Boris New released a paper describing a new way to calculate word frequencies. They took data from OpenSubtitles.org, which their paper showed reflects real-world usage better than other data sources.

You can grab a copy of the resulting data from here. Unfortunately, they didn't release the tools they used. The tools we're releasing mimic the process described in their paper, so you can run your own calculations.

The Data

The tools we're releasing use XML files from Open Corpus, since the words are already tokenized and are available as one big .tar.gz file (11 GB, compressed). Open Corpus got the data from Open Subtitles.

If you use this data, please consider becoming a supporter of that site.

The Tools

deduper.js
The data from Open Corpus includes a lot of "duplicate" files. This is because each movie (or TV episode) may have been released in multiple formats/encodings. Deduper.js will delete all but one file per movie.

spellingFilter.js
Brysbaert and New rejected any file that had greater than 2.5% spelling errors, based on aspell. This filter will mimic that, with a configurable threshold.

frequencyCounter.js
Runs through the XML files that remain and counts words.

The Output

The result is a big JSON file, where each word is a key. Each key contains a freqCount number, which represents the total number of times the word was seen, and a cdCount number, which is the number of files it was seen in.

You can slice and dice it to your heart's content.
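
To make the two counts concrete, here is a minimal sketch in Python of the same kind of tally. It is not the code from the repository (those tools are JavaScript); it assumes the input is already tokenized, lowercases every token, and uses two made-up "files".

```python
from collections import Counter


def count_frequencies(documents):
    """Return {word: {"freqCount": n, "cdCount": m}} for an iterable of token lists."""
    freq = Counter()  # total occurrences across all documents
    cd = Counter()    # number of documents that contain the word
    for tokens in documents:
        lowered = [token.lower() for token in tokens]
        freq.update(lowered)
        cd.update(set(lowered))  # count each word once per document
    return {word: {"freqCount": freq[word], "cdCount": cd[word]} for word in freq}


# Two tiny made-up "subtitle files", already tokenized.
docs = [
    ["you", "know", "I", "love", "you"],
    ["I", "know"],
]
counts = count_frequencies(docs)
total_words = sum(entry["freqCount"] for entry in counts.values())

for word, entry in sorted(counts.items(), key=lambda item: -item[1]["freqCount"]):
    per_million = entry["freqCount"] / total_words * 1_000_000
    share_of_files = entry["cdCount"] / len(docs)
    print(word, entry, round(per_million), share_of_files)
```

From a structure like this, a per-million rate is freqCount divided by the total word count, and the share of files is cdCount divided by the number of files.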

How Words API Uses Frequency Information

When you retrieve a word from Words API, it will have a frequency score. This is a log10 representation of the total frequency of the word. The scale runs roughly from 1 to 8, which makes it a bit easier to compare frequency between words. You can find out more in the documentation.

Today's Random Word: Arcanum

You can now use the API to retrieve a random word, optionally matching search criteria. The basic endpoint is the same as the Search endpoint; you just need to append "random=true" to the request, like so:

https://www.wordsapi.com/words/?random=true

The response will be a single JSON word document that should look something like this:

{
  "word": "arcanum",
  "frequency": 2.02,
  "results": [
    {
      "definition": "information known only to a special group",
      "partOfSpeech": "noun",
      "synonyms": [
        "secret"
      ],
      "typeOf": [
        "information",
        "info"
      ],
      "hasTypes": [
        "cabbalah",
        "countersign",
        "esoterica",
        "kabala",
        "kabbala",
        "kabbalah",
        "parole",
        "password",
        "qabala",
        "qabalah",
        "watchword",
        "cabala",
        "word",
        "cabbala"
      ]
    }
  ],
  "syllables": {
    "count": 3,
    "list": [
      "ar",
      "ca",
      "num"
    ]
  },
  "pronunciation": {
    "all": "ɑr'keɪnəm"
  }
}

You can refine your search by adding any of the existing search criteria, such as "letterPattern" or "pronunciationPattern". Check the docs for a full list of search parameters available.

What's a Verb that Rhymes with 'Wow'? Searching for words with Words API

Discuss this on Hacker News

We just introduced the ability to search for words with Words API. You can search for letter patterns, pronunciation patterns, number of syllables, parts of speech, and more.

You can read the docs for a list of all the parameters you can use, but here are a few examples of the results you can find.

Verbs that Rhyme with Wow

First, use the API to find the pronunciation of wow:

https://www.wordsapi.com/words/wow

The response JSON includes this:

 "pronunciation": {
    "all": "waʊ"
  }

To find a rhyme, you look for words that match everything from the last stressed vowel phoneme to the end of the word. So to search for verbs that rhyme with "wow", we'll use the pronunciationPattern parameter to specify a regular expression that finds words ending with that pattern, and the partOfSpeech parameter to only look at verbs:

https://www.wordsapi.com/words/?pronunciationPattern=aʊ$&partOfSpeech=verb

Results:

allow, avow, bow, cow, disallow, disavow, endow, kotow, kowtow, meow, miaow, plough, plow, row, slough, vow, wow
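
Two practical details are worth a quick sketch in Python: the pattern contains a non-ASCII character and a regex metacharacter, so a client would normally percent-encode it, and the pattern itself is an ordinary "ends with" regular expression. Only the pronunciation of "wow" below comes from the API response above; the other two entries are hand-written for illustration.

```python
import re
from urllib.parse import quote

pattern = "aʊ$"  # pronunciationPattern from the example above

# Percent-encode the pattern so "ʊ" and "$" survive in a query string.
encoded = quote(pattern, safe="")
print(encoded)

# Local check of what the pattern means, using sample pronunciations.
pronunciations = {
    "wow": "waʊ",  # from the API response above
    "vow": "vaʊ",  # illustrative
    "woe": "woʊ",  # illustrative
}
rhymes = [word for word, ipa in pronunciations.items() if re.search(pattern, ipa)]
print(rhymes)
```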

Really Long Words

Let's find the longest words that Words API has, excluding those that have a space in them. Since we don't know in advance how long the words are, we'll use the lettersMin parameter to find words with at least 20 characters. We'll use the letterPattern parameter to exclude words with spaces.

https://www.wordsapi.com/words/?lettersMin=20&letterPattern=^((?! ).)*$

Results:

abetalipoproteinemia, acrylonitrile-butadiene-styrene, adrenocorticotrophic, adrenocorticotrophin
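
The letterPattern above uses a negative lookahead: at every position it checks that the next character is not a space before consuming it. A short Python sketch of the same expression, run locally against two of the results and one made-up counterexample:

```python
import re

# letterPattern from the example above: no character may be a space.
NO_SPACES = re.compile(r"^((?! ).)*$")

# For single-line input, this simpler character class does the same job.
SIMPLER = re.compile(r"^[^ ]*$")

candidates = [
    "abetalipoproteinemia",
    "acrylonitrile-butadiene-styrene",
    "ice cream",  # made-up counterexample containing a space
]
for word in candidates:
    print(word, bool(NO_SPACES.match(word)), bool(SIMPLER.match(word)))
```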

Words That Don't Sound Like You May Think

You can combine the pronunciationPattern and letterPattern parameters to find words that may not sound the way you'd think, based on their spelling. Here's a word that ends with the "am" sound (like in "clam"), but whose final letters are "chm".

https://www.wordsapi.com/words/?pronunciationPattern=.*æm$&letterPattern=.*chm$

Results:

drachm

If you're interested in seeing what else Words API can do, check out the demo on our homepage.

A WordNet Word List

In case anyone wants a simple list of words available in WordNet, you can grab it here:

https://s3.amazonaws.com/wordsapi/wordnetWords.txt.gz

The file is gzipped, and is 595.4 kB. It's one word per line.

Rhyme Time!

WordsAPI can now be used to find words that rhyme. Rhymes are not returned as part of a normal GET request for a word, so you must call the Rhymes endpoint to retrieve them.

https://www.wordsapi.com/words/:word/rhymes

The rhymes result will include an object containing the word you requested, and one or more sub-objects that each contain a list of rhyming words. Each sub-object represents the part of speech that the rhymes are for.

For instance, "dog" can be both a noun and a verb. However, it is pronounced the same in both cases, so the call to the Rhymes endpoint returns a single sub-object, called "all".

Request

GET https://www.wordsapi.com/words/dog/rhymes

Response

{
  "word": "dog",
  "rhymes": {
    "all": [
      "decalogue",
      "demagogue",
      "antilog",
      "apologue",
      "backlog",
      "bulldog",
      "dialog",
      "dialogue",
      "duologue",
      "emmenagogue",
      "epilogue",
      ...
    ]
  }
}

However, "wind" sounds different if you're using it as a noun (that's a lot of wind) than it does as a verb (the river winds through the hills). Calling the Rhymes endpoint for "wind" returns two arrays, one for wind as a noun and another for wind as a verb.

Request

GET https://www.wordsapi.com/words/wind/rhymes

Response

{
  "word": "wind",
  "rhymes": {
    "noun": [
      "determined",
      "downwind",
      "headwind",
      "lind",
      "rescind",
      ...
    ],
    "verb": [
      "designed",
      "affined",
      "behind",
      "bind",
      "blind",
      "combined",
      "confined",
      "disinclined",
      "find",
      "grind",
      "hind",
      "humankind",
      ...
    ]
  }
}
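
Because the keys under "rhymes" vary from word to word, client code needs to handle both shapes. One possible approach, sketched in Python with a hypothetical helper and shortened copies of the two responses above:

```python
def rhymes_for(response, part_of_speech=None):
    """Pick the rhyme list for a part of speech, falling back to "all" (hypothetical helper)."""
    rhymes = response.get("rhymes", {})
    if part_of_speech in rhymes:
        return rhymes[part_of_speech]
    return rhymes.get("all", [])


# Shortened copies of the two responses shown above.
dog = {"word": "dog", "rhymes": {"all": ["decalogue", "demagogue", "antilog"]}}
wind = {
    "word": "wind",
    "rhymes": {
        "noun": ["determined", "downwind", "headwind"],
        "verb": ["designed", "affined", "behind"],
    },
}

print(rhymes_for(dog, "noun"))   # no "noun" key, so the "all" list is used
print(rhymes_for(wind, "verb"))  # the verb-specific list
print(rhymes_for(wind))          # no part of speech and no "all" key: empty list
```

Returning an empty list when neither key is present is a design choice for this sketch; a caller that needs to distinguish "no rhymes" from "ambiguous pronunciation" might prefer to raise an error instead.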

For more of what you can use WordsAPI for, be sure to check out the Documentation.

The Words Hacker News Visitors Searched For

I posted a Show HN article about WordsAPI on Monday, and to my surprise it made it to the first page. The Hacker News Effect has been known to take out websites before, but I was happy to see that WordsAPI stood up to it pretty well and the API stayed very responsive (most responses were < 50ms).

Quick stats (24 hours after posting the Show HN):

Unique Visitors: 17,919
Pageviews: 33,465
Demo word searches: 49,802
API Users Signed Up: 124

The first page of WordsAPI has a little demo that lets you query the API directly. I thought it would be a good idea to let potential API users see exactly what the results would look like, as that's what I look for when checking out new APIs.

Here are the top 20 words that the Hacker News crowd searched for, excluding the examples listed on the page.

test
dog
hand
water
hello
cat
apple
[ a part of a male's anatomy ]
[ a four letter swear word ]
house
sex
computer
good
banana
food
fish
poop
word
love
book

Discuss this on Hacker News

New Features: Pronunciation & Syllables

Just added pronunciation and syllables to the API results. They'll show up if you ask for "everything" about a word, but not if you ask for word details like "synonyms".

The source for both new features is the Moby Project. According to Wikipedia:

"As of 2007, it contains the largest free phonetic database, with 177,267 words and corresponding pronunciations"

The format for the files is a bit archaic, so it was interesting converting it into something that could be easily consumed via the Words API. The pronunciation file doesn't use the International Phonetic Alphabet, which is what I wanted the API to have, but luckily Wikipedia has a mapping.

Please note that these pronunciations are in American English. Even with that in mind, it was interesting to learn that there apparently isn't a definitive guide to how to pronounce a word. During testing I found that some words would vary depending on which source I checked.

The format for the result will look something like the example below, if you requested "lexicographer":

{
  "results": [
    ...
  ],
  "syllables": {
    "count": 5,
    "list": [
      "lex",
      "i",
      "cog",
      "ra",
      "pher"
    ]
  },
  "pronunciation": {
    "all": ",lɛksə'kɑgrəfər"
  }
}

Watch out for the stress indicators in the pronunciation. The single quote indicates primary stress in a word, but if you're using JavaScript and aren't careful you may end up with an escape character in front of it.
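
One way to sidestep quoting problems is to swap the ASCII markers for the dedicated IPA stress characters before displaying or storing the string. The Python sketch below does that. The post only confirms that the single quote marks primary stress; treating the comma as secondary stress is an assumption based on the "lexicographer" example, and the function name is made up.

```python
import json

# Assumed mapping: ' is primary stress (per the post); , as secondary
# stress is an assumption based on the "lexicographer" example.
STRESS_MARKS = str.maketrans({"'": "\u02c8", ",": "\u02cc"})


def to_ipa_stress(pronunciation):
    """Swap the ASCII stress markers for the IPA stress characters."""
    return pronunciation.translate(STRESS_MARKS)


pronunciation = ",lɛksə'kɑgrəfər"  # "lexicographer", from the example above
converted = to_ipa_stress(pronunciation)
print(converted)

# JSON itself never needs a backslash before a single quote.
serialized = json.dumps({"all": pronunciation}, ensure_ascii=False)
print(serialized)
```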
