Well, that was fun. Not.

Cataclysm.

  • A raft of exceptions started turning up with geolocation and weather. Turns out Yahoo’s weather service isn’t as smart as I thought it was, and outside of major city centers it can be difficult to tell exactly which level weather is mapped at, if anywhere. We now do weather feed pulls at levels 13, 11, 9, 6, and 3 in the hopes of finding something useful. If you live somewhere other than a major city, chances are your weather just got more reliable.
  • If we can’t find weather, we’ll return the weather conditions for ‘unknown’. I think that means more people will see volcanoes. I think I’m okay with that.
  • We were previously calling the memcache API directly, which sounds nice on the surface until you realize that it’s actually quite unreliable as a cache (within the wider definition of ‘reliable’) and that you really ought to wrap it in a bit of cache management. I’ve done so, so that when problematic keys come up, they get discarded, and we silently ignore failures to update caches (as it’s not fatal, or at least ought not to be).
  • We’re doing so many more searches that the size of the search results basically exceeds the storage limit on a memcache key. I’ve split that out into caching the results of each search individually, instead of the aggregate search set, to get better fan-out; it also means we’re doing one aggregate query for cached results across multiple keys simultaneously, which, due to magical Google infrastructure, should improve performance for search caching.
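
For the curious, here is roughly the shape of those changes as a minimal Python sketch. It is not the production code: the names `find_weather`, `SafeCache` and `cached_searches`, the `fetch_feed` and `run_search` callables, the `search:` key prefix and the 1,000,000-byte limit are all assumptions for illustration, and a plain dict stands in for memcache.

```python
import pickle

WEATHER_LEVELS = (13, 11, 9, 6, 3)  # levels tried in order, as listed above
MAX_VALUE_BYTES = 1_000_000         # assumed per-key size limit (illustrative)


def find_weather(fetch_feed, location):
    """Try each level in turn; fall back to 'unknown' conditions."""
    for level in WEATHER_LEVELS:
        feed = fetch_feed(location, level)  # hypothetical; returns None on a miss
        if feed is not None:
            return feed
    return {"conditions": "unknown"}


class SafeCache:
    """Best-effort wrapper: cache trouble is never fatal to the request."""

    def __init__(self, backend):
        self.backend = backend  # dict-like store standing in for memcache

    def get_multi(self, keys):
        # A real cache client would fetch all keys in one batched call;
        # with a dict backend this loop is only a stand-in for that.
        found = {}
        for key in keys:
            try:
                raw = self.backend.get(key)
                if raw is not None:
                    found[key] = pickle.loads(raw)
            except Exception:
                # Problematic key: discard it and treat it as a miss.
                self.backend.pop(key, None)
        return found

    def set(self, key, value):
        # Failures to update the cache are ignored; the caller only
        # gets a boolean it is free to disregard.
        try:
            raw = pickle.dumps(value)
            if len(raw) > MAX_VALUE_BYTES:
                return False
            self.backend[key] = raw
            return True
        except Exception:
            return False


def cached_searches(cache, queries, run_search):
    """Cache each search under its own key instead of one aggregate blob."""
    keys = {"search:" + query: query for query in queries}
    hits = cache.get_multi(list(keys))
    results = {}
    for key, query in keys.items():
        if key in hits:
            results[query] = hits[key]
        else:
            results[query] = run_search(query)
            cache.set(key, results[query])
    return results
```

The point of the per-search keys is that no single value has to hold every result set, and a bad or oversized entry only costs one search’s worth of cache rather than the whole lot.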

Anyways, that’s what I get for getting too eager; I could easily have canaried it and kept a more watchful eye, but didn’t, and pushed it along with the new client binary all at once. While most people might only have seen some sluggishness in updates, I saw… I think it peaked at about 10 QPS of errors by midnight, due to a sea of broken shit that I didn’t test properly before I stuck it in production.

We’re back down to our more normal 1 QPS of traffic, and our normal error rate (mostly due to Flickr taking too long to perform even these simpler searches - we’ve still got an upper bound on request fetch time of 10s), which client retries magic away.

More improvements, probably, soon.