Trochee Chart

Here’s something I made as I drew today’s comic.  It’s a chart of Google results for “X Y” (in quotes) where X and Y are words from the first panel of the strip.  The first word is on the top, the second down the side (the opposite of the intuitive way, of course).

"Doctor Doctor" and "Jesus Jesus" are highest. The highest non-repeating combo is "Pirate Captain", followed by "Robot Monkey" and "Penguin Zombie".

I generated this using a Google API variable search tool developed by Eviltwin on #xkcd (I’m not linking to the tool so as to avoid potentially getting his API key revoked). Edit: He now offers the source and says it can be run without a key, and is happy to let people use it until Google does something. Not only is the API helpful in making these kinds of charts (which I spend more time doing than I care to admit), it also gives a roughly accurate count of results, in contrast to the Google search page.
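For anyone curious how a grid like this could be put together, here is a minimal sketch. It is not Eviltwin's tool: `count_hits` is a hypothetical stand-in for whatever search API call you have access to, `fake_count_hits` returns made-up numbers so the example runs offline, and the word list is just a small illustrative subset.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple


def build_query(first: str, second: str) -> str:
    """Wrap the two-word phrase in quotes so it is searched as an exact phrase."""
    return f'"{first} {second}"'


def build_chart(
    words: List[str], count_hits: Callable[[str], int]
) -> Dict[Tuple[str, str], int]:
    """Return {(first, second): hits} for every ordered pair, repeats included."""
    return {
        (first, second): count_hits(build_query(first, second))
        for first, second in product(words, repeat=2)
    }


def fake_count_hits(query: str) -> int:
    """Hypothetical placeholder for a real API call; the numbers are made up."""
    return len(query)


words = ["pirate", "captain", "robot"]  # illustrative subset only
chart = build_chart(words, fake_count_hits)

# Columns are the first word and rows the second, matching the chart's layout.
for second in words:
    print(second.ljust(8), [chart[(first, second)] for first in words])
```

Swapping `fake_count_hits` for a function that actually queries a search API is the only change needed to get real numbers; keeping the counting function as a parameter also makes it easy to cache or rate-limit the calls.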

The “number of results” count that Google gives when you search is clearly fabricated.  This is clear for a few reasons.  When Google says this:

Excellent!  That's a lot!

You can tell that it’s wrong first by scrolling to the end of the results.  When you get to page 32, it suddenly becomes:

I learned in AP Calculus that 316 is WAY less than 190,000.

This doesn’t usually matter, since nobody looks much past the first few pages of results, but it’s annoying if you’re trying to use the number of results as a measure of something.  When I was making the Numbers comic, I didn’t use the API, and there were a few graphs I had to throw out, crop, or put on an unnecessary log scale; otherwise, Google’s clumsy number-fudging made the graphs look nonsensical.  I can’t find a good example now (perhaps they’ve smoothed it out a bit) but when searching for things like “I was born in <X>”, the results for successive years would look something like this:

… 150 : 200 : 250 : 300 : 350 : 117,000 : 450 : 251,000 : 500 : 550 : 312,000 : 320,000 : 390,000 : 425,000 …

If you scrolled to the last page for each, you’d find that the smaller counts were roughly accurate, but the counts in the hundreds of thousands had no more actual results than their neighbors.
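To make the "nothing in between" pattern concrete, here is a rough sketch that sorts the counts and looks for the single biggest multiplicative jump between neighbours. The numbers are the illustrative ones from the sequence above (not real measurements), and the function name is my own. It assumes every count is positive.

```python
from typing import List, Tuple

# Illustrative counts copied from the sequence above (successive years).
counts = [150, 200, 250, 300, 350, 117_000, 450, 251_000,
          500, 550, 312_000, 320_000, 390_000, 425_000]


def largest_gap(values: List[int]) -> Tuple[int, int, float]:
    """Return (low, high, ratio) for the biggest jump between sorted neighbours."""
    ordered = sorted(values)
    pairs = zip(ordered, ordered[1:])
    low, high = max(pairs, key=lambda pair: pair[1] / pair[0])
    return low, high, high / low


low, high, ratio = largest_gap(counts)
plausible = [c for c in counts if c <= low]   # counts below the jump
suspect = [c for c in counts if c >= high]    # counts above the jump

print(f"nothing between {low} and {high} (a {ratio:.0f}x jump)")
print("plausible:", plausible)
print("suspect:  ", suspect)
```

A jump of a couple hundred times between adjacent sorted values, with every other step well under 3x, is the kind of gap that suggests two different counting methods rather than one smooth distribution.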

I suppose it’s remotely possible that these numbers are correct, there are no years with an in-between number of hits, and for some reason they’re just not showing you most of the promised pages when you try to flip through them.  But making this even less likely is the fact that the search API (which is apparently being deprecated and replaced right now) doesn’t return these bad numbers—it gives reasonable-looking results which seem to be roughly consistent with the number you come up with by navigating to the last search page.

So it really looks like there’s a certain threshold of result volume beyond which Google apparently says “screw it” and throws out a gigantic number.  I imagine this is probably due to incompetence rather than intentional deception; I’m sure it’s hard to generate pages quickly from many sources, and maybe for searches with a lot of results they don’t have time to get it all synced up.  So they fudge the numbers.  The fact that this makes it look like they have way more results than they do is presumably just an unintended bonus.

All in all, this isn’t a big deal and I don’t think there’s anything particularly evil about it. It does make it hard to use Google hits as an accurate gauge of anything, but I suppose if you’re trying to study something by seriously analyzing Google result counts, you have bigger methodological problems to worry about.

Edit: As Mankoff observes, it looks like the API sometimes *underestimates* the number of results, too.  For example, it still reports 0 results for “narwhal zombie”, when a regular search shows quite a few. Now, I notice, scrolling through them, that most either have some minor character/text in between the two words, or are related to the comic I just posted.  But at least one seems to date back to last year.

144 replies on “Trochee Chart”

  1. Have to feel a bit bad for all the ‘journalists’ who have used the number of Google results for a search in an article. Suddenly all their work is worthless.


  2. For as long as I remember, before the “About XXX results”, when they were giving the “exact” number of results, they still didn’t show more than 30 pages.

    From my experience, the number is accurate, but they just don’t bother to show 5000 pages of results, because it would be so much harder to generate/store/etc.


  3. Christophe, because of the odd behavior where the number of results jumps by three or four orders of magnitude between very similar searches, and because searches with many actual results (e.g. “porn” or “mp3 download”) present you with (at least) hundreds of navigable pages, I don’t think your hypothesis is correct.


  4. I was always under the impression that numbers were fudged because it’s a lot more work for you to count every item that matches your search criteria than it is to just pick the first 50 and then return them. Of course then you can’t get the total count, so you just stick links to 10 pages or whatever.

    It’s only when you go to get results 151-200 for page #4 (the query has to be rerun for each page of course) and only get 20 results back that you know that there must be only 170 total results. Then you can show accurate numbers and page links.

    This is different since Google does try to give you an estimate (and they don’t use SQL, where you do the whole SELECT LIMIT thing for getting a pageful of results).


  5. There are many results for Narwhal zombie. Did they all show up after this comic? It does not look like it. How accurate is this matrix?


  6. Dan,

    This would make sense, except for one telling issue: you see this phenomenon of a huge mismatch sometimes between searches that claim, say, 14,000 results, but actually have only 60.

    In this situation, the number at the top of the first page is wrong … but the number of ‘o’s in “goooooooogle” at the bottom matches the actual number of pages. So that number *is* calculated as part of the page generation.

    Anyway, even if that *were* the explanation, none of this is an excuse for consistently fudging the numbers upward by a factor of 1000 🙂


  7. @Nfora

    If the journalists were basing their columns on such flimsily fact-checked research, then the columns are probably worthless anyway.


  8. Actually maybe not.

    Although my search originally returned the same results as your post indicated, I noticed there was a link asking whether to show results omitted because they were similar. Clicking on it then showed the other results.


  9. Whoa. I actually *just implemented this feature* in a search engine I’m writing. The problem was that my data is of a lot of different types, all of which are useful for finding relevant results, but only some of which are actually documents the user wants to look at. I do the search on all the data first, then sort through the results and only keep the actual documents when determining the result count. This filter step was starting to take many tens of seconds, so I cut it back to “just get through the current page of results, and then if it’s taking too long, estimate based on the document:junk ratio you’ve seen so far” … which generally overestimates since as you get deeper into the list, you get fewer and fewer documents and more adjunct types. An overestimate is better than nothing, though.

    I can’t imagine that this is the reason Google gives an overestimate, given their computing resources. Maybe, though.


  10. This is something I’ve always been interested in, but I’ve not been able to probe very much into because of the inaccuracies with the number of google search results.

    There’s a paper published (http://arxiv.org/PS_cache/cs/pdf/0412/0412098v3.pdf) extending the idea of “distance” to vocabulary and words by using a large mass of language. For those of you not so math-inclined, there’s a very nice summary in the Melbourne University Maths Society magazine, Paradox (Page 12 of http://www.ms.unimelb.edu.au/~mums/paradox/archive/p06-1.pdf). With this API, perhaps you could get some nifty results.


  11. Searching for these words gave me:

    iPhone: 91 pages
    Dog: 81 pages
    Porn: 80 pages
    robot: 77 pages
    a: 74 pages
    ninja: 71 pages
    mp3 download: 68 pages
    double rainbow: 67 pages
    xkcd: 67 pages
    google: 57 pages

    At the end google says:
    In order to show you the most relevant results, we have omitted some entries very similar to the xxx already displayed.
    If you like, you can repeat the search with the omitted results included.


  12. Google’s algorithms for estimating hit-counts are extrapolated from a very small sample of the intersection of stored indices for words or word sequences. The general approach, as of five or six years ago, was discussed for the case of boolean search in
    http://itre.cis.upenn.edu/~myl/languagelog/archives/001837.html
    http://itre.cis.upenn.edu/~myl/languagelog/archives/001840.html

    The case of word-sequence search is basically similar. The indices have gotten bigger, and the algorithms have no doubt changed somewhat, but the hit-count estimates are still very bad. However, the count of actual pages that are shown to you is clearly worse.


  13. Maybe Bing does a better job of counting google results? I haven’t checked, but it’d be funny if it was true.


  14. As an aside, I’m pleased to see that “Hobo Raptor” and “Raptor Hobo” both have zero results. That suggests to me that raptors are doing well, and not having to live on the streets.


  15. Inspired by this, I whipped up a similar chart for the same set of trochee phrases in the Archive-It service. This is a hosted web archiving service of the Internet Archive (where I work). It contains just over one billion archived web pages from 2006-current. A mini-wayback machine (if you will) of curated web archives.

    I tried to come close to the same color-coding, but I’m probably off a bit.


  16. So “Doctor Doctor” was the #1 result? There’s a Robert Palmer joke in there somewhere…


  17. I seem to recall that while working on an implementation of the Google Custom Search Engine for a client there were some subtleties in how Google returned and calculated results.

    From memory, the Results number is initially based on some probability analysis they do of their index. After a certain number of pages, or if you’ve hit the end of the results, it’s replaced by the accurate value.

    The number of pages is based on this number, but Google simply returns the last valid page should they find there are fewer results as they’re compiling your query response.

    This results in some ‘interesting’ bug reports, and lots of ‘fun’ discussions as to why it’s expected behaviour.

    “When I click on Page 20 for search ‘x’ it gives me page 12, and no longer shows page 20 as option. Going back to Page 10 it still shows Page 20 as an option.”


  18. I’ve found Google News to be especially notorious for this and often with much smaller result sets.

    Right now, looking at my Google-localised http://news.google.com/ front page, the top story’s linking to ‘all 203 related news articles’. As soon as I click through that link, the following page becomes ‘all 30 related news articles’, which is the actual total amount it’ll return. The third story down’s going from 33 to 8. Another story’s claiming 5 on the front page, but dropping to 2.


  19. Are you sure this isn’t related to the thing that Google does where they don’t show duplicate results? You know, the “In order to show you the most relevant results, we have omitted some entries very similar to the n already displayed.
    If you like, you can repeat the search with the omitted results included.”


  20. Strangely, a lot of results for “Raptor Jesus” but not many at all for “Jesus Raptor”

    Weird.


  21. For those who say that the “omitted similar results” is the culprit: it’s not:) At least I searched for “random stuff” (without the quotation marks of course), and it promised me approx. 83 000 000 hits, with an actual number of 812. Then I followed the link to include all the search results. This offered approx. 83 900 000 hits (which is a bit more), with an actual number of 722 (which is less). How about that?:)
    But the reason why I’m writing this comment: I edited the URL to include “start=7200” instead of “start=720” corresponding to the last page in the latter search. Now it tells me that my search didn’t match any documents. The interesting part is a light grey text above this, saying “Sorry, Google does not serve more than 1000 results for any query. (You asked for results starting from 7300.)”. Now either the inconsistent URL confused the site, or there is a real constraint on the number of pages shown to regular users. If so, then the whole plan of wanting to see tens of thousands of hits is incompatible with the service. Maybe it makes more sense to you:)


  22. The number of results is an estimate based on “we spidered x% of the web, and found y number of results, so we think there are probably this many results out there in reality.” In other words, if you spider 10% of the web and find 10 results, you report “About 100 results.”


  23. The way search engines count and order their results is interesting. To obtain faster search times, Google probably doesn’t assemble more than the first ten actual results when you search, so the “number of results” can be misleading. For those of you looking to study this and other bizarre search effects, a good source of some data is the RankManiac 2011 competition, hosted at rank-maniac-2011.appspot.com


  24. @Stu: asking “does this break TIOBE?” is like asking “does beheading a zombie count as murder?” Of course it doesn’t. The fact that the zombie was walking around moaning does not alter the fact that it was already dead — you can’t murder it, it’s already dead.

    And the fact that misguided people whose pet languages are currently favoured by TIOBE like to pretend that it’s an objective and meaningful rating system does not alter the fact that TIOBE is inherently flawed and meaningless — you can’t break it, it’s already broken.


  25. A person in Evanston, Illinois, once yelled “Puma!” out of his car at me for no apparent reason. I guessed at the time that this was a person trying to yell something “random” to serve as a complete non sequitur, which as we know is the perfect way to screw with people at snooty liberal arts universities like Northwestern.* But then it occurred to me that he had chosen the word not truly at random, but because he had probably been watching Red vs. Blue. It did not occur to me until today that he was shouting a trochee.

    * Of which I am not a student but given my “when in Rome” street clothes he could hardly have known that. Also, if the intention was in fact to make the target nerd generate fits of intellectual thought loops to explain his random behavior the fact that I am still thinking about this years later means he was probably more successful than he will ever know. In fact, it was probably the most successful practical joke for the least amount of effort he has ever or will ever perpetrate in his life …and HE’LL NEVER KNOW!!!


  26. Also keep in mind that Google operates huge clusters of machines which need not be in sync. So whenever you click on the next page of results you may land on a different server which has a different idea of what are results 120 to 129.


  27. You would probably get more accurate data from the Google Ngram Viewer. For example, here are statistics from scanned books on uses of the phrase “I was born in 1931”:


  28. You should re-run the script that generated this chart to see how much this blog influences google results.


  29. We did a study based on Google answers (“A study of the seldomness of strong human emotions using internet metrology”, Review of April Fool’s day Transactions).

    The results seemed quite coherent (a decreasing number of answers for less plausible search keywords).


  30. Maybe one thing to take into account is the fact that on the search results, Google will group results belonging to the same page, or to a closely related page. If you expand it, you get a lot more results, and it happens mainly on forums, which are known to repeat a lot of topics.

    Does the API count every result on each condensed page? Not sure if those results are enough to reach millions.


  31. Google results aside, what I want to know is: when did the shift from iambs to trochees happen? Billy Shakespeare was all about the iambs, and for purposes of this discussion I am pegging the rise of trochees at no later than 1964, when Mary Poppins was released (supercalifragilistic mutant ninja turtles), which leaves a ~350 year gap. If one takes Robert Frost as an indication that iambs were still popular in the 1900s (a quick Wikipedia search, returning to the reliability of research materials for a moment, shows _The Road Not Taken_ released in 1916, for example), that would suggest Disney was on the forefront of the trocheeing of American popular culture.


  32. If you google google, shouldn’t you get everything on the internet as a result, or does the Google team just give wrong directions to your house?


Comments are closed.
