Trochee Chart

Here’s something I made as I drew today’s comic.  It’s a chart of Google results for “X Y” (in quotes) where X and Y are words from the first panel of the strip.  The first word is on the top, the second down the side (the opposite of the intuitive way, of course).

"Doctor Doctor" and "Jesus Jesus" are highest. The highest non-repeating combo is "Pirate Captain", followed by "Robot Monkey" and "Penguin Zombie".

I generated this using a Google API variable search tool developed by Eviltwin on #xkcd (I’m not linking to the tool so as to avoid potentially getting his API key revoked). Edit: He now offers the source and says it can be run without a key, and is happy to let people use it until Google does something. Not only is the API helpful in making these kinds of charts (which I spend more time doing than I care to admit), it also gives a roughly accurate count of results, in contrast to the Google search page.
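For anyone who wants to build this kind of grid themselves, here is a minimal sketch in Python. It is not Eviltwin's tool and it does not call any real Google API: `count_hits` is a hypothetical stand-in that you would replace with whatever result-count source you have, and `build_grid`, `format_grid`, and `fake_count_hits` are names invented for this example. It follows the chart's layout, with the first word across the top and the second down the side.

```python
from typing import Callable, Dict, List, Tuple


def build_grid(words: List[str],
               count_hits: Callable[[str], int]) -> Dict[Tuple[str, str], int]:
    """Return {(first, second): hits} for every ordered pair of words.

    count_hits is a hypothetical callable that takes a quoted phrase
    and returns a result count; it is an assumption, not a real API.
    """
    grid = {}
    for first in words:
        for second in words:
            query = f'"{first} {second}"'  # exact-phrase search, in quotes
            grid[(first, second)] = count_hits(query)
    return grid


def format_grid(words: List[str], grid: Dict[Tuple[str, str], int]) -> str:
    """Render the grid as text: first word across the top, second down the side."""
    lines = [" " * 10 + "".join(f"{w:>10}" for w in words)]
    for second in words:
        cells = "".join(f"{grid[(first, second)]:>10}" for first in words)
        lines.append(f"{second:<10}" + cells)
    return "\n".join(lines)


def fake_count_hits(query: str) -> int:
    """Placeholder counter so the sketch runs offline; returns the query length."""
    return len(query)


if __name__ == "__main__":
    words = ["pirate", "ninja", "narwhal"]
    print(format_grid(words, build_grid(words, fake_count_hits)))
```

Keeping the counter as a plain function argument means the grid logic stays the same whether the numbers come from an API, a cached file, or a stub like the one above.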

The “number of results” count that Google gives when you search is clearly fabricated, and you can see this in a few ways.  When Google says this:

Excellent!  That's a lot!

You can tell that it’s wrong first by scrolling to the end of the results.  When you get to page 32, it suddenly becomes:

I learned in AP Calculus that 316 is WAY less than 190,000.

This doesn’t usually matter, since nobody looks much past the first few pages of results, but it’s annoying if you’re trying to use the number of results as a measure of something.  When I was making the Numbers comic, I didn’t use the API, and there were a few graphs I had to throw out, crop, or put on an unnecessary log scale; otherwise, Google’s clumsy number-fudging made the graphs look nonsensical.  I can’t find a good example now (perhaps they’ve smoothed it out a bit) but when searching for things like “I was born in <X>”, the results for successive years would look something like this:

… 150 : 200 : 250 : 300 : 350 : 117,000 : 450 : 251,000 : 500 : 550 : 312,000 : 320,000 : 390,000 : 425,000 …

If you scrolled to the last page for each, you’d find that the smaller counts were roughly accurate, but the counts in the hundreds of thousands had no more actual results than their neighbors.
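To make that pattern concrete, here is a small Python sketch that flags counts that are wildly larger than is typical for the series. The numbers are the illustrative sequence from above, not real measurements, and both the rule (more than 10 times the series median) and the name `flag_suspicious` are assumptions made for this example, not anything Google or the API does.

```python
from statistics import median


def flag_suspicious(counts, factor=10):
    """Return the indexes of counts more than `factor` times the series median.

    The threshold is an arbitrary assumption chosen for illustration.
    """
    if not counts:
        return []
    typical = median(counts)
    return [i for i, c in enumerate(counts) if c > factor * typical]


# Illustrative sequence from the post (successive years, made-up values).
counts = [150, 200, 250, 300, 350, 117_000, 450, 251_000,
          500, 550, 312_000, 320_000, 390_000, 425_000]

if __name__ == "__main__":
    for i in flag_suspicious(counts):
        print(f"position {i}: {counts[i]:,} looks inflated")
```

A median is used rather than a mean because the inflated values would drag a mean up far enough to hide themselves.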

I suppose it’s remotely possible that these numbers are correct, there are no years with an in-between number of hits, and for some reason they’re just not showing you most of the promised pages when you try to flip through them.  But making this even less likely is the fact that the search API (which is apparently being deprecated and replaced right now) doesn’t return these bad numbers—it gives reasonable-looking results which seem to be roughly consistent with the number you come up with by navigating to the last search page.

So it really looks like there’s a certain threshold of result volume beyond which Google apparently says “screw it” and throws out a gigantic number.  I imagine this is probably due to incompetence rather than intentional deception; I’m sure it’s hard to generate pages quickly from many sources, and maybe for searches with a lot of results they don’t have time to get it all synced up.  So they fudge the numbers.  The fact that this makes it look like they have way more results than they do is presumably just an unintended bonus.

All in all, this isn’t a big deal and I don’t think there’s anything particularly evil about it. It does make it hard to use Google hits as an accurate gauge of anything, but I suppose if you’re trying to study something by seriously analyzing Google result counts, you have bigger methodological problems to worry about.

Edit: As Mankoff observes, it looks like the API sometimes *underestimates* the number of results, too.  For example, it still reports 0 results for “narwhal zombie”, when a regular search shows quite a few. Now, I notice, scrolling through them, that most either have some minor character/text in between the two words, or are related to the comic I just posted.  But at least one seems to date back to last year.

144 replies on “Trochee Chart”

  1. If we all decided to raise our infant children by saying “mom,” “dad,” “child,” and “milk” instead of “mommy,” “daddy,” “baby,” and “bottle,” maybe we could prevent another generation of this and stop our steady slide towards everyone just saying “boobies boobies boobies boobies boobies” until civilization collapses.


  2. RE: “It does make it hard to use Google hits as an accurate gauge of anything, but I suppose if you’re trying to study something by seriously analyzing Google result counts, you have bigger methodological problems to worry about.”

    I use Google hits as the Truth Criterion that the Pyrrhic Skeptics philosophized about, as described here:


  3. Inspired by some old XKCD comics, I tried to write a script to return the number of Google hits for a list of search terms. For anyone trying the same, this is my story. Their SOAP API had been deprecated, and they suggested the Google Web Search API as a replacement. Things seemed to work perfectly.

    BUT the estimates didn’t match the estimate from browser-based Google searches. Here are two references:

    Someone from that last link suggested Bing or Yahoo. A quick fact-finding search for Bing and Python turned up the awesome bingapi module. Within five minutes I had the less-than-ten-lines script up and running.


  4. You know, Randall, it’s only a matter of time until someone at Google figures out how to troll your graph comics in return for inadvertently messing with their results so many times. 😉


  5. Just saying I was surprised there was only one mention of a Bacon Raptor on the web (now 2). Come on they must be yummy!


  6. Is it bad that I just read this and found almost every one of those combinations that aren’t used to be more awesome than the rest?….

    I have a Trochee problem, Doctor Narwhal… And your name really isn’t helping, so I think I need a transfer.

    Plus I would like to point out that your Raptor obsession is Trochee oriented.


  7. Maybe I’m just stupid, but I thought a lot of these were just meant as random pop culture references — not an attempt at total randomness.

    For example, the pirate craze happened after Pirates of the Caribbean, and I always interpreted it as a reference to that. Zombie was a reference to the spate of zombie movies a while back, and I don’t remember seeing it prior to that. Badger is, of course, a Weebl reference. I don’t think anyone will dispute this. I thought Narwhal was also a Weebl reference, though it’s possible that the word appeared earlier than the ‘toon and I just didn’t notice it until then. Penguin happened after March of the Penguins. Kitty has been a popular word and animal for more than a century, due to its hazardous level of cuteness, but got an obvious boost from the LOLcats. Bacon happened after a run of bacon cheeseburger commercials from several different fast food chains at the same time, I seem to recall. Raptor was Jurassic Park. Jesus has been around almost as long as kitty. Laser has been a cool word since they were invented, but seemed to achieve meme status right around the time of Austin Powers, originally always with quote fingers. Ninja comes from a lot of things, but got a recent boost from Naruto.

    So yeah, I don’t think it’s about randomness so much as just bringing up something from the past with humorous connotations. And I don’t see what the hell is wrong with it. Trochees are fun. You’re guilty, too. You’ve referenced most of these things before.


  8. Just wanted to let you know I’ve read ALL of your comics in less than a week… You’re awesome…


  9. @Xezlec: It seems most were specifically chosen as things that would come up in Google hits in combination: “badger badger badger” as you said, but also “ninja vs. pirate” (or Chris the Ninja Pirate) as well as “robot ninjas”. There’s also “Doctor McNinja”, which is a webcomic.

    Narwhal, by contrast, I think IS completely random.


  10. Doesn’t this make you want to create comics for the ones with zero hits? Hobo raptors could be really cool, too…


  11. I agree with Xezlec, although I think in most of these cases, there are multiple source factors. Ninjas for example also probably get a big boost from Teenage Mutant Ninja Turtles, which has childhood nostalgia for much of the current 18-30 demographic that is so prevalent on the internet. “Teenage Mutant Ninja Turtles” is also a great example of trochee soup.

    @Coda, I don’t think narwhals are completely random, either. Someone in relatively recent memory, I’m not sure who, was talking about narwhals – as in, how do you know they’re really real and unicorns aren’t? Have you seen one? They don’t usually live in captivity. There are surprisingly few photos.

    I think it comes up because of the cognitive dissonance between what we learned unquestioningly at a young age and what we now find unusual, and the long gap between the two during which narwhals were more or less forgotten.


  12. Xezlec and others: The comic isn’t stating that the fascination with trochees originates solely from their pronunciation, it’s only proposing that their popularity has persisted so long in part due to their rhythm. If the narwhal song had been about Monodon monoceros rather than narwhals, then it wouldn’t have been nearly as popular.

    The point is, a catchy name contributes greatly to a meme’s persistence in our culture – and a trochee, the comic posits, is an incredibly catchy name.


  13. “Monkey Raptor” and “Laser Raptor” should be green instead of blue, and “Zombie Zombie” red instead of orange, right?


  14. It may be something having to do with “omitted results”, given that when I navigated to the last page of results for “Narwhal zombie” I was prompted as to whether I wanted to display them (the results having gone down from 310 to 101 when I hit the last page) at which point the results sprang up to the initially quoted number of 310 when I renavigated to the last page.

    Basically, I think Google counts the omitted repeated results for the initial assessment but gives you the truth when you hit the last page. Enigmatic.


  15. I love that you managed to make a heat-map that includes a sense of humor. Such a rarity where I come from m/


  16. I’m fairly confident that the Internet’s pirate obsession predated the Pirates of the Caribbean movies, but it’s sort of hard to verify.

    A possible source for an interest in Narwhals is the recent Futurama movie “Bender’s Big Score,” which had a Narwhal as a prominent side character. It’s a pretty recent thing, which might explain why it hasn’t caught on yet. BUT IT SHOULD!


  17. @tyler Welllll.. if you can get some Julian dates you can actually search by date using the daterange:startdate-enddate syntax. I’m not sure how accurate this is but it does yield some interesting results:

    “Pirate” daterange:2452098-2452828 (from 2001 to the day before Pirates of the Caribbean was out) yields “About 182.000 results”.

    For the same daterange “Pirate Pirate” only yields: 273 results.

    Meanwhile “Ninja” yields: “About 164.000 results” and “Ninja Ninja” only yields: 615 results.

    Soooo based on the simple use of the trochee “Pirate” > “Ninja” prior to the movies, while the repetition of the trochee implies “Ninja” > “Pirate”. Bringing the old controversy to a new level of frustration


  18. Hey, have you ever stopped to consider that the human mind might have just generated Trochee words to name awesome things? I mean, consider Narwhals, Trochee name aside, they are still freaking awesome, they are like real life unicorns, only better. Same goes for Raptors, it conjures up the image of some huge, ruthless, and intelligent lizard. (It’s interesting to point out that Raptor is just an abbreviated form of Velociraptor used to refer to the entire family of Raptor dinosaurs, and the word Raptor actually refers to any predatory bird, but rather than getting the same reaction from both meanings, we get a far better reaction from the more awesome creature.)

    Then consider Grimm’s Law, words with similar meanings don’t change in syllable structure from the word that they came from, generally speaking. So thinking along that structure, consider the original language that all humanity shared, and how it was created: the Trochee words elicit a certain response inside our minds, so they generally become used to refer to revered positions and terrifying animals, simply because using a Trochee creates that reaction inside their mind, making things both easier to communicate and easier to teach. (Which might be why words like mommy, daddy, baby, and bottle are used around infants a lot: for ease of teaching the language.)

    Then if you consider that, maybe the Trochee structure goes back even further, back to when we were true primates without a language, but we still used our voices to communicate. Maybe the sound that we used for good things had a Trochee-esque structure and Trochees themselves are just watered down biological carryovers from back before we even had language.

    Of course for this to be true then it would have to be a pattern observed in all language (or at least all European language), rather than just in English. And I could be pulling it all out of my ass, but you know, that’s neither here nor there.


  19. is there a way to get the enum setting to read from a file and search each word in the file as a different value to search for?


  20. Re the google results. I believe that Google ‘games’ its speed (not evilly) by throwing up the first page or three of definite results while it goes through the rest of the results to determine if they meet the threshold applicability for inclusion.

    Making this assumption means I believe Google is using some sort of heuristics to hack the number down from an absolute number of non-trivial occurrences in text to ones that are semantically appropriate to the search term.

    In your example, ‘1931’ appears a lot of times in the context of ‘born,’ but the vast majority of those hits don’t actually connect born IN 1931, so they get tossed while you look at the first few pages. And as the results are trimmed down, so is Google’s assessment of the number of hits.


  21. Try searching for “Google”, and you still won’t get more than 100 pages of results; it looks to me like they’re just limiting the number of pages based on the number of results they have.


  22. It’s simple, really.

    Retrieving all the results of a search is a huge performance hit. Most people will only look at the first page, so Google just retrieves the first results. When you dig deeper, Google will actually retrieve all results; and then it can also give you the accurate number of results. Until then, what you see is just an estimate.

    Obviously, the algorithm for the estimate is wildly inaccurate, and errs to extreme exaggeration (which looks better than underestimating — though it’s actually less helpful). But to be fair, Google does say About [x] results and doesn’t claim it to be an exact number. (On the site, that is — if you’re using the API you’re supposed to understand these things :P)


  23. steady slide towards everyone just saying “boobies boobies boobies boobies boobies” until civilization collapses.

    The Booby Singularity?


  24. it’s probably something like “Data Result Tuples * Mean results per tuple” for the estimate.. they like doing breadth first searches at Google, fetch all the related tuples first, then refine away the unrelated results from the raw data while you browse results.


  25. You may find that outer product “matrix search” like that violates US patent 6,721,729.


  26. I had an economics professor who would make a point by asking a student in the class to google the term and say the number of results.

    It would always be 5 or 6 million and he considered this enough to prove to the class that what he was talking about was important.

    It would be funny to use this counting method in his class and pretty much make his point useless.


  27. I think you’ve also explained the success of baseball, football, hockey, dancing, living, breathing, dreaming, sleeping, loving, kissing, holding, hearing, wanting, trying, hoping, f-cking, underwater belly dancing, hula hooping, Elvis Presley, Jesus Lizard, Marlon Brando, Heavy Metal, Martha Stewart, Shirley Temple, Bridget Fonda, Tracy Kidder, Wesley Willis and, well, unfortunately, me.


    Google reported about 140,000 results. It also asked if she wanted to search for


  29. I’m working on analysis on term change over time.
    My interest lies in date (year) dependent numbers of hits. I’m not familiar with programming…. My example should look like this:

    search: “school quality”
    1984 – xxxx-(hits)
    1985 -xxxx- (hits)
    1986 …..

    Hope you can help me ! ?


  30. Forgot to say…. It should be for
    meaning : “school quality” = “Schulqualität”…


  31. Narwhals (that looks stupid, is that even the correct pluralization of narwhal?[“Pluralization” does {looks stupid } too]) when combined with badger return twenty-three results in Google, and three in the matrix thing. After some sifting, only three (in the Google search) omit an “-” or “,”, ∴ Google search is lazy and doesn’t include punctuation.





Comments are closed.
