When 99% Isn't Good Enough
29 Aug 2007
My company is at the beginning of what will end up being a fairly long, exhaustive migration process. Probably on the order of 12-16 months, migrating web sites from a set of servers on one side of the country to a set of servers on the other side. It's not your typical forklift migration (where you actually move the servers and plug them in at their new home); instead, it's literally moving files, mail, DNS, etc. to a new platform.
It's pretty daunting, pretty complicated, and can occasionally be pretty cool.
On the flip side, it's now 2:55AM Eastern in Boston (where I started my day), but I'm in Phoenix where it's actually only 11:55PM. That's a sign that maybe things didn't go quite as smoothly as one would have hoped.
The step we're on is a step where we take over DNS for folks. It's always somewhat difficult, because we'll get a big list of domains and have to figure out whose record (our nameserver's or the other nameserver's) is the "real" record. It's not generally too tough to figure it out (you can judge by the SOA of the records) and the number of domains is usually short of 100k, so as long as you're accurate to within 1-2%, it's not too bad. That's 1,000 to 2,000 guys who might break, which is pretty easy to handle with a good support team and some quick script fixes.
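For the curious, here's a rough sketch of what "judging by the SOA" could look like. This is an assumption on my part about one reasonable way to do it, not the actual script: it treats the zone with the higher SOA serial as the more recently edited (and so "real") one, and it ignores serial wraparound entirely. The names `SoaRecord` and `pick_real_record` are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SoaRecord:
    """Hypothetical holder for the two SOA fields this sketch cares about."""
    nameserver: str  # which nameserver this copy of the zone came from
    serial: int      # SOA serial, often written as YYYYMMDDnn


def pick_real_record(ours: SoaRecord, theirs: SoaRecord) -> Optional[SoaRecord]:
    """Guess which copy of a zone is the "real" one.

    Assumption: the higher serial is the more recently edited zone.
    Returns None on a tie, so a human can look at it.
    """
    if ours.serial == theirs.serial:
        return None
    return ours if ours.serial > theirs.serial else theirs
```

Anything that comes back as `None` goes on the "look at it by hand" pile, which is where the 1-2% tends to come from.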
Let me take a step back. The process is actually that our nameservers need to become the authoritative nameservers for the domains we're moving. This allows us to later change their DNS to point to their new home, and it all kinda works. We have to get the domains, merge them into our nameservers, become authoritative, and then fix what breaks.
We did that yesterday. Except it wasn't 100k domains. It was 1.2 million. And the domains weren't coming from a single, well-maintained nameserver. The domains came from three somewhat munged-together nameservers. There were internal conflicts, conflicts with our servers, missing zones. A host of issues. We thought we'd worked most of them out and gotten the problems down to, at most, 4-5k domains. That's a lot, but in reality, it's less than 0.5% of the total domains.
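To give a feel for the merge problem, here's a minimal sketch, assuming each nameserver's data has already been dumped into a plain dict of domain to record data. That's a big simplification of real zone files, and `merge_zones` and `missing_zones` are hypothetical names, not our tooling.

```python
from collections import defaultdict


def _normalize(domain):
    """Lowercase and drop the trailing dot so 'Example.com.' matches 'example.com'."""
    return domain.lower().rstrip(".")


def merge_zones(sources):
    """Merge per-nameserver zone data and flag disagreements.

    sources: {nameserver_name: {domain: record_data}} (assumed shape).
    Returns (merged, conflicts): domains every source agrees on, and
    domains where sources disagree, keyed by nameserver.
    """
    seen = defaultdict(dict)
    for nameserver, zones in sources.items():
        for domain, data in zones.items():
            seen[_normalize(domain)][nameserver] = data

    merged, conflicts = {}, {}
    for domain, by_nameserver in seen.items():
        distinct = set(by_nameserver.values())
        if len(distinct) == 1:
            merged[domain] = distinct.pop()
        else:
            conflicts[domain] = by_nameserver
    return merged, conflicts


def missing_zones(expected, merged, conflicts):
    """Domains we were told to move that showed up in no source at all."""
    known = set(merged) | set(conflicts)
    return sorted({_normalize(d) for d in expected} - known)
```

The catch is that this only finds conflicts among the sources you actually feed it. If a nameserver never makes it into `sources`, its domains just quietly land in the missing pile, or worse, you never knew to expect them.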
"Pretty good," you say.
"Not quite," I say.
For you see, there weren't just three nameservers. There were five. So there were a couple thousand domains we missed. And we also missed some of the conflicts (either by omission or by grabbing the wrong data). In the end, it was closer to 12k domains that were wonky.
That's still only 1%. Damn good, given all of the variables.
Except 12000 broken domains leads to a whole lot of phone calls and emails. And some angry customers. And some tired folks staying up to fix things that they weren't responsible for breaking. And one tired folk--me--staying up because he feels guilty for only being 99% good enough.
DNS is a fickle beast. Thankfully, it's pretty quickly fixable. Once we'd identified some global problems, we could fix them rapidly and put big chunks of the broken domains back in working order.
I often argue with people who think the "Chinese Market" is a valid business plan. You know, the folks who say "hey, if we can just get our product in front of 100 million people, and get 1% of those people to buy, we'll be rich!" Except, of course, it doesn't really work that way. It's hard to get a product in front of that many people who would be interested in buying, and it's hard to get 1% of any audience to buy anything.
Well, not in business plans, at least. It does work that way in technical issues. If you've got a huge enough base of users, the smallest mistakes can have a big impact on your company and team. In these cases, sometimes being 99% accurate isn't good enough.
Here's a graphical representation:
You see, with 100k domains, you never quite reach screwed. It's manageable.
With 1.2 million domains, you're pretty much totally screwed.
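If you'd rather see the numbers than a picture, the arithmetic is about as simple as it gets. Here's a tiny sketch; `broken_domains` is just an illustrative name, and the accuracy is passed as a string so that values like "99.5" stay exact.

```python
from fractions import Fraction


def broken_domains(total_domains, accuracy_pct):
    """How many domains break at a given accuracy.

    accuracy_pct is a string such as "99" or "99.5", parsed exactly
    (no floating-point rounding). The result is truncated to a whole domain.
    """
    error_rate = 1 - Fraction(accuracy_pct) / 100
    return int(total_domains * error_rate)


for total in (100_000, 1_200_000):
    print(total, "domains at 99% accuracy:", broken_domains(total, "99"), "broken")
```

Same 99%, but one of those is 1,000 broken domains and the other is 12,000. The percentage doesn't answer the phone.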