Sunday, December 19, 2010

Why Google Ngram isn't ready for prime time

I read about Google Ngram Viewer on reddit a week or two ago, and immediately found it fascinating. Imagine that! A tool that lets you see when a word achieved currency and how its popularity changed over time compared to other words! I predict that this will become an essential tool for people who are interested in language, both privately and professionally.

Leaving aside the rather noisy section before around 1650 (small sample size, mis-dated books and poor print quality make this a rough patch), it still has some problems. Let's see, for example, when people started talking about singularities:


Wow, nothing before 1800, except a bit of early chatter? How strange. Here's the first quirk of Ngram: it's case-sensitive, and English used to capitalize most nouns, as German still does today. Let's add the graph of "Singularity" to the mix:


Fascinating! Now we get the early uses, but there's a really weird gap in the usage levels. Surely mathematics didn't go through a frightful drought in the late 1700s? Here's where a bit of detective work is needed. Until around 1800, English printers used the "long s" (ſ), so the word was printed as "ſingularity":


So now we can build the full picture:


So, to make Ngram work properly, Google needs to add:
  • A case-insensitive version.
  • Ideally, better recognition of variant letter forms, or at leaſt a warning for miſsing variants.
  • An option to sum different forms (for example, to show a graph of the sum of "singularity" and "ſingularity" versus "infinity").
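
Until something like that exists, the summing can be done by hand. Here is a minimal Python sketch of the idea, assuming you have already obtained per-year counts for each spelling yourself. The function names (`long_s_form`, `variants`, `combined_series`) are my own, the numbers are made-up placeholders rather than real Ngram data, and the long-s rule is a deliberate simplification of what printers actually did (they often wrote "ſs" rather than "ſſ", for instance).

```python
from collections import Counter

LONG_S = "\u017f"  # the "long s" character: ſ


def long_s_form(word):
    """Replace every non-final lowercase 's' with a long s (simplified rule)."""
    if len(word) < 2:
        return word
    return word[:-1].replace("s", LONG_S) + word[-1]


def variants(word):
    """List the spellings to look up: lower case and capitalized,
    each with and without the long s, skipping duplicates."""
    lower = word.lower()
    forms = []
    for base in (lower, long_s_form(lower)):
        for form in (base, base.capitalize()):
            if form not in forms:
                forms.append(form)
    return forms


def combined_series(word, counts_by_form):
    """Sum the per-year counts of every variant of `word`.

    `counts_by_form` maps a spelling to a {year: count} dict
    (a hypothetical input format, not anything Ngram provides).
    """
    total = Counter()
    for form in variants(word):
        total.update(counts_by_form.get(form, {}))
    return dict(sorted(total.items()))


# Placeholder numbers for illustration only, NOT real Ngram data.
toy_counts = {
    "singularity": {1750: 1, 1850: 9},
    "Singularity": {1700: 4, 1750: 2},
    "ſingularity": {1700: 3, 1750: 5},
}

print(variants("singularity"))
print(combined_series("singularity", toy_counts))
# prints {1700: 7, 1750: 8, 1850: 9}
```

Note that the long s has no capital form of its own, so capitalizing "ſingularity" just gives "Singularity" again, and the sketch ends up with three spellings to sum rather than four.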