Words API Blog

Why English is Weird



Approximately 1.5 billion people, roughly 20% of the world’s population, speak English. Only 350 million or so are native speakers, meaning the remaining 1.15 billion learned it as a foreign language. It’s the most widely studied second language and the default common language for many international businesses and travelers.

While these figures make it seem like it’s easy to pick up the language, many learners will probably beg to differ. English has many oddities that make it one of the harder languages to learn, despite how widely it is spoken.

Effective Ways to Learn English


There are many methods for learning languages. Classroom instruction tends to focus on learning grammar, spelling and pronunciation rules. As students learn one word, it’s only natural that they would apply the same rules to a similar-looking word or similarly structured sentence. This might work in some languages; English is not one of them.

This is often why many language learners report having better success picking up English from movies or TV. They get to hear how it is actually spoken and start to develop that native sense of when something “sounds right.”

No matter which method learners choose, they could benefit from keeping these quirks in mind.

Word Meanings


Consider these examples:

  • Pineapples do not contain any pine or any apples
  • Hamburgers are not made of ham
  • A pocketbook is not a book and it doesn’t fit into your pocket
  • Eggplants don’t have egg in them


These inconsistencies can make it harder to deduce the meaning of a word. While a new English speaker may understand part of the word, this does not necessarily mean that the speaker will grasp the meaning of the full word.

English also has a long list of homophones, homographs, homonyms, heterographs and multinyms. Most native English speakers probably don’t know these terms, but they are familiar with the words.

  • Homophones: same sound, different meaning, ex: to, two, too
  • Homographs: same spelling, different meaning, ex: live, wound, wind
  • Homonyms: same sound and spelling, different meaning, ex: bow (front of boat and to bend at waist)
  • Heterographs: same sound, different spelling and meaning, ex: know, no
  • Multinyms: same sound, two or more different spellings and meanings, ex: raise, rays, raze, etc. 


How do you explain these differences, and so many more, to students? ESL students probably hear the phrase, “that’s just how it is” more times than they care to.

Spelling and Pronunciation


While English has some spelling rules such as “I before E except after C”, these are quickly violated as soon as a student learns the words science and weird. One grammarian says that only 44 words actually follow that rule while 923 don’t.

Another rule is that plurals are created by adding an s or es to the end of a word. But alas, mouths contain teeth not tooths, more than one mouse becomes mice and more than one moose becomes…moose.

In theory, spelling also affects pronunciation. However, English also has a lot of silent letters. In knife and gnome, for example, the first and the last letters are silent. There are also letter combinations that can lead to wildly different sounds. Take trough, rough, bough and through for example. In many situations, English is just not phonetically consistent.

Again, with these exceptions, learners are often told that they just need to memorize the different spellings and pronunciations.


Verb Tenses


English verb conjugation is sometimes considered easy to learn because the verb form rarely changes with the subject. For example, the present tense is I go, you go, he goes, we go, they go, and the past tense is went for every subject, whereas the equivalent in French would have multiple spellings. In general, the past tense is formed by adding -ed or -d to the end of the word. For example, to face becomes faced and to live becomes lived. But of course, there are irregular verbs and exceptions.

  • The past tense of fight is fought
  • The past tense of light is lit
  • The past tense of read is read
  • The past tense of feel is felt
  • The past tense of drive is drove
  • The past tense of wake is woke
  • The past tense of have is had


English also gets complicated when it comes to the future. There are many ways to talk about the future, each with its own subtleties. For example, here are a few ways someone can say they are going to consume dinner at a later time:

  • I’ll get dinner
  • I’m going to have dinner
  • I was going to have dinner
  • I’m getting dinner


There may be general rules that learners can follow, but there are likely just as many, if not more, exceptions that they will have to remember as well.

Order of Words


English can also be a little peculiar when it comes to word order. In general, most sentences will follow a subject – verb – object order. However, depending on the word, adjectives and adverbials can fall in multiple locations in a sentence. For example, one can say I ate quickly or I quickly ate, but one can only say I stayed long, not I long stayed.

Possession


A common grammar mistake, even among native speakers, is the misuse of apostrophes. They can mark a contraction, where he is going becomes he’s going. They can also denote possession: instead of saying the toys belonging to Sarah, speakers simply say Sarah’s toys. The exception occurs with the word it, where the apostrophe marks only the contraction and the possessive form has none:

  • It is warm becomes it’s warm
  • To describe the leaves of a tree falling off, one would say the tree lost its leaves

Native speakers will likely be forgiving of this error as non-grammar-lovers make this mistake frequently themselves.

Don’t Get Discouraged


English can be hard to learn. The rules all seem to have exceptions and there’s a lot about the language, its spelling, pronunciation, grammar and more that doesn’t quite make sense. But people who want to become fluent in this global language shouldn’t get discouraged. Its popularity also means that there are ample resources for helping students learn. WordsAPI, for example, is an online source to find definitions, synonyms, antonyms, word hierarchies and much, much more. 

Major Speedup!

We've just finished making some changes to our backend, and most responses are now 70% faster.  Let us know if you run into any issues!

Searching for Words by Detail Type

We've expanded the Words API search functionality so that you can now search for words that have one or more detail types.  For instance, if you wanted to find words that have both "hasUsages" and "hasCategories" relationships, you would call the API like this:

https://wordsapiv1.p.mashape.com/words/?hasDetails=hasUsages,hasCategories

{
  "query": {
    "hasDetails": "hasUsages,hasCategories",
    "limit": 100,
    "page": 1
  },
  "results": {
    "total": 7,
    "data": [
      "humor",
      "humour",
      "trademark",
      "wit",
      "witticism",
      "wittiness",
      "yiddish"
    ]
  }
}

You can also use this to search for random words, like so:

https://wordsapiv1.p.mashape.com/words/?hasDetails=typeOf&frequencyMin=2&frequencyMax=3&random=true

{
  "word": "gloaming",
  "results": [
    {
      "definition": "the time of day immediately following sunset",
      "partOfSpeech": "noun",
      "synonyms": [
        "crepuscle",
        "crepuscule",
        "dusk",
        "evenfall",
        "fall",
        "gloam",
        "nightfall",
        "twilight"
      ],
      "typeOf": [
        "hour",
        "time of day"
      ],
      "hasTypes": [
        "night"
      ],
      "partOf": [
        "eve",
        "eventide",
        "even",
        "evening"
      ]
    }
  ],
  "syllables": {
    "count": 2,
    "list": [
      "gloam",
      "ing"
    ]
  },
  "pronunciation": {
    "all": "'gloʊmɪŋ"
  },
  "frequency": 1.73
}
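
Request URLs like the two above can also be assembled in code. The following is a minimal Python sketch: `build_search_url` is a hypothetical helper, not part of Words API, and it only builds the URL. Sending the request, including whatever authentication the API host requires, is left out.

```python
from urllib.parse import urlencode

# Base URL taken from the example requests above.
BASE_URL = "https://wordsapiv1.p.mashape.com/words/"


def build_search_url(details, **params):
    """Build a search URL for words that have all of the given detail types.

    `details` is a list such as ["hasUsages", "hasCategories"]; extra keyword
    arguments (for example random="true") are appended as query parameters.
    """
    query = {"hasDetails": ",".join(details)}
    query.update(params)
    # safe="," keeps the comma-separated list readable instead of "%2C".
    return BASE_URL + "?" + urlencode(query, safe=",")


print(build_search_url(["hasUsages", "hasCategories"]))
print(build_search_url(["typeOf"], frequencyMin=2, frequencyMax=3, random="true"))
```

Keeping the detail types in a list makes it easy to add or drop one without editing the query string by hand.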


For other ways to search for words, be sure to check out the documentation.

The Long Tail of the English Language

In the English language, the most common words are incredibly common.  Though there are at least 1 million words in the English language, "you", "I", and "the" account for 10% of the words we actually use. By the time you reach "is", at number 10, you've covered 20%.

The top 100 most common English words account for over 50% of the words we use, which is about how many words a 2-year-old knows. A 3-year-old would probably know most of the top 1,000 words, which covers 75%.  And by the 10,000th most common word, "remorse", you've covered over 88% of the words we commonly use. That leaves a lot of words you don't hear very much.
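
Coverage figures like these are cumulative shares: add up the counts of the N most common words and divide by the count of all words. Here is a small Python sketch of that calculation. The sample sentence is invented for illustration, and `coverage_of_top` is a hypothetical helper rather than anything Words API provides.

```python
from collections import Counter


def coverage_of_top(counts, n):
    """Return the share of all word occurrences covered by the n most common words."""
    total = sum(counts.values())
    top = sum(count for _, count in counts.most_common(n))
    return top / total


# Toy data, invented for illustration only.
sample = Counter("the cat and the dog and the bird saw the tremulous cat".split())
for n in (1, 2, 3):
    print(n, round(coverage_of_top(sample, n), 2))
```

Even in this tiny sample, the single most common word covers a third of the text, while most of the other words appear only once.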

If you put word frequency on a graph, like the one below, you quickly see an interesting distribution called the Long Tail.  It happens when a small number of items account for a disproportionate number of occurrences, such as the books that Amazon sells.


You can search for words to see where they fall on the graph by typing the word into the Search box.

It's somewhat heartening to see that "good" is more common than "evil" (58th vs 978th), "love" more common than "hate" (110th vs 527th), and "happy" more common than "sad" (272nd vs 844th). But "war" is still more common than "peace" (492nd vs 797th).

Many words occur just once every million words, like "icing", or even less.  "Tremulous", meaning "quivering as from weakness or fear", occurs 0.04 times per million words, meaning you'd hear it about once every 25 million words. Some words occur so infrequently we didn't even find samples of them, even though we counted almost 70 million words.
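
The "once every 25 million words" figure is just the per-million rate turned upside down. A quick Python sketch of the arithmetic, where the function name is made up for this example:

```python
def words_between_occurrences(per_million):
    """Average number of words of text you read or hear per occurrence of a word."""
    return 1_000_000 / per_million


# 0.04 occurrences per million words, the figure quoted for "tremulous".
print(round(words_between_occurrences(0.04)))
```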

If you're interested in using Words API to find word frequency information, be sure to check out the documentation.  If you're interested in seeing how we got this data, be sure to read this earlier post about our open source tools.

Be sure to follow us on Twitter for more word facts! You can also discuss this on Hacker News.

Expanded Frequency Information

Based on our recently open-sourced tools, we've updated the word frequency information that Words API returns.  The frequency score for most words has changed, but the relative frequency of the words has not changed much.  The most common words are still "you", "I", "the", etc.

What has changed is that the /words/{word}/frequency endpoint now returns more information than just the one score.  For example, the frequency results for "apartment" now look like this:


{
  "word" : "apartment",
  "frequency": {
    "zipf": 4.82,
    "perMillion": 65.76,
    "diversity": 0.17
  }
}


Here's what each of those means:
  • zipf: This is the same number that gets returned for frequency at the main words endpoint. It's a log10 scale representation of the number of times the word appeared in our corpus. It ranges from 1 to 7, where a higher number means a word that was seen more frequently. For more information, you may want to see this paper.
  • perMillion: In any given corpus of one million English words, this is the number of times you can expect to see the word. It's a common frequency measurement that academic papers use.
  • diversity: In a document that represents part of a corpus, this is the probability that the given word will appear at least once. It ranges from 0 to 1.
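
The sample numbers for "apartment" are consistent with the Zipf scale as it is commonly defined: log10 of the frequency per billion words, which works out to log10(perMillion) + 3. That formula is an assumption here, not something stated above, so treat the following Python sketch as a sanity check against the sample rather than a description of how Words API computes the score.

```python
import math


def zipf_from_per_million(per_million):
    """Convert a per-million frequency to the Zipf scale.

    Assumes Zipf = log10(frequency per billion words)
                 = log10(per_million) + 3.
    """
    return math.log10(per_million) + 3


# perMillion for "apartment" in the sample response is 65.76.
print(round(zipf_from_per_million(65.76), 2))  # 4.82, the zipf value in the sample
```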

We hope these new ways of looking at word frequency are helpful! If you have any questions, please let us know: support@wordsapi.com

Words API open sources tools to calculate English word frequency

Words API lets you search for words using a lot of different parameters, one of which is how frequently the word is used.  Today, we're open sourcing the tools we used to calculate those frequency numbers.

You can view the tools on GitHub:
https://github.com/WordsAPI/wordfrequencies

The Idea


Computing word frequency isn't a new idea, but most of the data sources are either old, not in the public domain, or weren't computed from a very big corpus of data.  In 2009, Marc Brysbaert and Boris New released a paper describing a new way to calculate word frequencies.  They grabbed data from OpenSubtitles.org, which they showed in their paper better reflected real world usage than other data sources.

You can grab a copy of the resulting data from here.  Unfortunately, they didn't release the tools they used. The tools we're releasing mimic the process described in their paper, so you can run your own calculations.

The Data


The tools we're releasing use XML files from Open Corpus, since the words are already tokenized and are available as one big .tar.gz file (11 GB, compressed). Open Corpus got the data from Open Subtitles.

If you use this data, please consider becoming a supporter of that site.

The Tools


deduper.js
The data from Open Corpus includes a lot of "duplicate" files. This is because each movie (or TV episode) may have been released in multiple formats/encodings.  deduper.js will delete all but one file per movie.

spellingFilter.js
Brysbaert and New rejected any file that had greater than 2.5% spelling errors, based on aspell. This filter will mimic that, with a configurable threshold.

frequencyCounter.js
Runs through the XML files that remain and counts words.
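
Put together, the filtering and counting steps might look roughly like the Python sketch below. It is not the code from the repository: the function names are hypothetical, a plain set of known words stands in for aspell, and each "file" is simply a list of already-tokenized words rather than XML.

```python
from collections import Counter


def spelling_error_rate(tokens, known_words):
    """Fraction of tokens that are not in the known-word set."""
    if not tokens:
        return 0.0
    misses = sum(1 for token in tokens if token.lower() not in known_words)
    return misses / len(tokens)


def count_words(files, known_words, max_error_rate=0.025):
    """Count words across files, skipping files with too many misspellings.

    Returns (total_count, file_count): how often each word was seen overall,
    and in how many of the kept files it appeared at least once.
    """
    total_count = Counter()
    file_count = Counter()
    for tokens in files:
        if spelling_error_rate(tokens, known_words) > max_error_rate:
            continue  # reject the whole file, as the spelling filter does
        lowered = [token.lower() for token in tokens]
        total_count.update(lowered)
        file_count.update(set(lowered))
    return total_count, file_count


# Toy input, invented for illustration.
known = {"i", "am", "going", "home", "you", "are"}
files = [
    ["I", "am", "going", "home"],
    ["You", "are", "going", "home"],
    ["I", "am", "goign", "hoem"],  # half the words are misspelled, so it is rejected
]
totals, per_file = count_words(files, known)
print(totals["going"], per_file["going"], totals["i"])
```

Rejecting whole files rather than individual words keeps the word counts from being skewed by badly transcribed subtitles.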

The Output


The result is a big JSON file, where each word is a key. Each key contains a freqCount number, which represents the total number of times the word was seen, and a cdCount number, which is the number of files it was seen in.

You can slice and dice it to your heart's content.
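
As one example of slicing the output, the sketch below derives a per-million frequency and a share of files from freqCount and cdCount. The sample entries, the totals, and the formulas are assumptions made for illustration; the real file and the exact calculations Words API uses may differ.

```python
import json

# A tiny stand-in for the output file; the numbers are invented.
raw = """
{
  "the":   {"freqCount": 50000, "cdCount": 100},
  "icing": {"freqCount": 2,     "cdCount": 2}
}
"""
counts = json.loads(raw)

TOTAL_WORDS = 1_000_000  # assumed total number of words counted
TOTAL_FILES = 100        # assumed number of files that were counted


def per_million(entry):
    """Occurrences per million words of text."""
    return entry["freqCount"] * 1_000_000 / TOTAL_WORDS


def file_share(entry):
    """Share of files in which the word appeared at least once."""
    return entry["cdCount"] / TOTAL_FILES


for word, entry in counts.items():
    print(word, per_million(entry), file_share(entry))
```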

How Words API Uses Frequency Information


When you retrieve a word from Words API, it will have a frequency score. This is a log10 representation of the total frequency of the word. The scale runs roughly from 1 to 8, making it a bit easier to compare frequency between words.  You can find out more in the documentation.

Today's Random Word: Arcanum

You can now use the API to retrieve a random word, optionally matching search criteria. The basic endpoint is the same as the Search endpoint; you just need to append "random=true" to the request, like so:

https://www.wordsapi.com/words/?random=true

The response will be a single JSON word document that should look something like this:

{
  "word": "arcanum",
  "frequency": 2.02,
  "results": [
    {
      "definition": "information known only to a special group",
      "partOfSpeech": "noun",
      "synonyms": [
        "secret"
      ],
      "typeOf": [
        "information",
        "info"
      ],
      "hasTypes": [
        "cabbalah",
        "countersign",
        "esoterica",
        "kabala",
        "kabbala",
        "kabbalah",
        "parole",
        "password",
        "qabala",
        "qabalah",
        "watchword",
        "cabala",
        "word",
        "cabbala"
      ]
    }
  ],
  "syllables": {
    "count": 3,
    "list": [
      "ar",
      "ca",
      "num"
    ]
  },
  "pronunciation": {
    "all": "ɑr'keɪnəm"
  }
}
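
Once parsed, a word document like this is just nested dictionaries and lists. Here is a short Python sketch that pulls a few fields out of an abbreviated copy of the sample. The use of defaults for missing fields is a precaution we assume here, since not every word necessarily carries every field.

```python
import json

# Abbreviated copy of the sample response above.
response_text = """
{
  "word": "arcanum",
  "frequency": 2.02,
  "results": [
    {
      "definition": "information known only to a special group",
      "partOfSpeech": "noun",
      "synonyms": ["secret"]
    }
  ],
  "syllables": {"count": 3, "list": ["ar", "ca", "num"]}
}
"""

doc = json.loads(response_text)

# Fall back to defaults in case a field is missing for a given word.
first = doc.get("results", [{}])[0]
summary = "{} ({}): {}".format(
    doc["word"], first.get("partOfSpeech", "?"), first.get("definition", "")
)
hyphenated = "-".join(doc.get("syllables", {}).get("list", []))

print(summary)
print(hyphenated)
```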


You can refine your search by adding any of the existing search criteria, such as "letterPattern" or "pronunciationPattern". Check the docs for a full list of search parameters available.