Expanded Frequency Information

Based on our recently open-sourced tools, we've updated the word frequency information that Words API returns. The frequency score for most words has changed, but the relative frequency of the words has not changed much. The most common words are still "you", "I", "the", etc.

What has changed is that the /words/{word}/frequency endpoint now returns more than just the one score. For example, the frequency results for "apartment" now look like this:


{
  "word" : "apartment",
  "frequency": {
    "zipf": 4.82,
    "perMillion": 65.76,
    "diversity": 0.17
  }
}


Here's what each of those means:
  • zipf
      • This is the same number that gets returned for frequency at the main words endpoint. It's a log10-scale representation of how often the word appeared in our corpus. It ranges from 1 to 7, where a higher number means the word was seen more frequently. For more information, you may want to see this paper.
  • perMillion
      • In any given corpus of one million English words, this is the number of times you can expect to see the word. It's a common frequency measurement that academic papers use.
  • diversity
      • For any document that makes up part of a corpus, this is the probability that the given word appears at least once. It ranges from 0 to 1.
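
To make the three numbers more concrete, here is a minimal Python sketch that reads the sample response and works with each field. It rests on some assumptions: the helper names (zipf_from_per_million, expected_count, diversity) are hypothetical and not part of Words API; the zipf conversion uses the standard Zipf scale definition (log10 of occurrences per billion words, i.e. log10(perMillion) + 3), which the sample values happen to match; and the diversity helper is only a toy illustration of "share of documents containing the word", not necessarily how the score is computed for the API.

```python
import json
import math

# Sample response body copied from the example in this post.
SAMPLE = """
{
  "word": "apartment",
  "frequency": {
    "zipf": 4.82,
    "perMillion": 65.76,
    "diversity": 0.17
  }
}
"""


def zipf_from_per_million(per_million: float) -> float:
    """Standard Zipf scale: log10 of occurrences per billion words.

    Assumption: Words API follows this definition; the sample values agree.
    """
    return math.log10(per_million) + 3


def expected_count(per_million: float, n_words: int) -> float:
    """Expected occurrences of the word in a text of n_words words."""
    return per_million * n_words / 1_000_000


def diversity(word: str, documents: list[str]) -> float:
    """Toy diversity: fraction of documents containing the word at least once."""
    hits = sum(1 for doc in documents if word in doc.lower().split())
    return hits / len(documents)


freq = json.loads(SAMPLE)["frequency"]

# Derive zipf from perMillion and compare it with the reported value.
derived_zipf = round(zipf_from_per_million(freq["perMillion"]), 2)

# Expected number of times "apartment" appears in a 250,000-word text.
count_in_text = expected_count(freq["perMillion"], 250_000)

# Hypothetical four-document corpus, used only to illustrate the idea.
toy_docs = [
    "the apartment is small",
    "we went home early",
    "a nice apartment downtown",
    "nothing to see here",
]
toy_diversity = diversity("apartment", toy_docs)

print(derived_zipf, count_in_text, toy_diversity)
```

The conversion goes both ways: a zipf score of 4.82 corresponds to roughly 10 ** (4.82 - 3) occurrences per million words, so either number can be recovered from the other up to rounding.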

We hope these new ways of looking at word frequency are helpful! If you have any questions, please let us know: support@wordsapi.com
