For my day job, I’m a software developer, and all of my previous jobs have been IT-related in some fashion. I still remember some advice that my supervisor at my first tech-related job gave me: “If we’re doing our job right, no one will know we’re here. When something does break, whether it’s your fault or not, you’re the face of the problem. Don’t take it personally.”
Last year, Guild Wars 2’s European game servers went down for about 20 hours. The “face” of that nightmarish problem was Platform Team Lead Robert Neckorcuk.
Neckorcuk’s article packs in a lot of technical detail, and while I think he did a great job of giving us readable insight into what happened, a TL;DR version might be instructive.
In short, the game runs off of two databases. Let’s call them dbA and dbB. They are hosted not at ArenaNet headquarters, as you might expect, but on Amazon Web Services servers. (Protip: whenever someone talks about “The Cloud,” it’s just a fancy, trendy way of saying they’re renting server resources from someone else, like Amazon or Microsoft.) Every time something happens in the game that needs to be recorded – say, an item goes into your inventory, you complete an achievement, or you switch up your build – it gets recorded first to the live database, dbA, and then, as time permits, to the backup, dbB. If something goes wrong with dbA, the system automatically fails over to dbB, making dbB the primary and keeping a log of what needs to happen to catch dbA back up to speed so it can take over as the backup once it’s up and running again. All the while, life in Tyria goes on, and players never know the difference.
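If you’re having trouble picturing that primary/backup dance, here’s a rough sketch of the pattern in Python. To be clear, every name here (Database, ReplicatedStore, record_event) is my own invention for illustration – Neckorcuk’s article doesn’t show any of ArenaNet’s actual code, and their real system is vastly more sophisticated than this.

```python
# Illustrative sketch of the primary/backup pattern described above.
# These classes are hypothetical; they are not ArenaNet's code.

class Database:
    def __init__(self, name):
        self.name = name
        self.records = []      # committed game events
        self.healthy = True

    def write(self, event):
        if not self.healthy:
            raise IOError(f"{self.name} is unavailable")
        self.records.append(event)


class ReplicatedStore:
    """Writes go to the primary (dbA) first, then to the backup (dbB)."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup
        self.catch_up_log = []  # what a lagging node still needs to replay

    def record_event(self, event):
        try:
            self.primary.write(event)        # live database first
        except IOError:
            self._fail_over()                # backup gets promoted automatically
            self.primary.write(event)        # retry against the new primary
        self._replicate(event)

    def _replicate(self, event):
        try:
            self.backup.write(event)         # "as time permits" in the real system
        except IOError:
            self.catch_up_log.append(event)  # replayed once the node is back

    def _fail_over(self):
        # Swap roles: the backup becomes the primary, and the old primary
        # will be caught up from the log once it's healthy again.
        self.primary, self.backup = self.backup, self.primary


store = ReplicatedStore(Database("dbA"), Database("dbB"))
store.record_event("legendary greatsword added to inventory")
```

The whole arrangement rests on one quiet assumption: whenever the backup gets promoted, it’s only ever a few moments behind the primary. Keep that in mind.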
Then the servers started to run out of space. That’s normally not a big deal, but due to a small issue with their cloud service provider, the extra space required a restart to take effect. Simple enough: the team could just restart one server, then the other. For some reason, the synchronization from dbA to dbB was running slowly, so they decided to uncouple the databases over the weekend, letting the game run exclusively off of dbA and letting dbB catch up so it was ready to restart and then take over on Monday morning. The problem was that, due to a bad driver and the lack of space, dbA stopped working sooner than expected, before the devs were able to complete the restart, and dbB, which was now missing several days’ worth of events, took over. Suddenly, players were seeing their inventories, characters, etc. revert to the state they were in the previous Friday. This is why it initially looked to many players like a rollback had happened right before the outage.
At least, that’s the official story. Some players still believe that an extremely powerful Chronomancer accidentally cast Continuum Split on the entire continent of Europe. That’s unconfirmed at this time, but the Chronomancer spec did receive a nerf shortly thereafter. Take that as you will.
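Joking aside, the painful part is that the automatic failover did exactly what it was designed to do; it just did it with a backup that was days stale. One common safeguard against this failure mode – and I want to stress that this is my own illustration, not something Neckorcuk says ArenaNet used – is to check replication lag before allowing an automatic promotion:

```python
# Hypothetical lag check before promoting a backup to primary.
# The threshold and dates are placeholders, not ArenaNet's values.

from datetime import datetime, timedelta

MAX_ACCEPTABLE_LAG = timedelta(minutes=5)

def safe_to_promote(primary_last_write: datetime, backup_last_applied: datetime) -> bool:
    """Allow automatic promotion only if the backup is close enough to the
    primary that players won't experience the switch as a rollback."""
    return (primary_last_write - backup_last_applied) <= MAX_ACCEPTABLE_LAG

# With replication uncoupled over the weekend, the backup was roughly
# "Friday evening to Monday morning" behind:
friday_evening = datetime(2024, 1, 5, 18, 0)
monday_morning = datetime(2024, 1, 8, 9, 0)
print(safe_to_promote(monday_morning, friday_evening))  # False -- wake up a human instead
```

A check like that turns a silent, days-long rollback into a loud page for an on-call engineer, which makes for a much better Monday morning.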
From here, it was a matter of piecing together what needed to happen to reconcile the two databases, which I can tell you from personal experience is an extremely delicate and frustrating process. Neckorcuk talks about a variety of new strategies that his group implemented to keep this kind of thing from happening again, including additional monitoring, redoubled testing efforts, and a standardized software package to deploy across all of their servers. All great things that, in my experience, often only get implemented after a situation like this one.
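If you’ve never had to do that kind of reconciliation, the general shape of the job looks something like the sketch below: walk both copies record by record, take the newer write wherever that’s unambiguous, and flag everything else for a human to untangle. Again, this is only my own rough illustration of the technique, not ArenaNet’s actual procedure.

```python
# Illustrative last-write-wins reconciliation with conflict flagging.
# Each store maps a record key (say, a character ID) to (timestamp, state).

def reconcile(db_a, db_b):
    """Merge two divergent copies, preferring the newer write for each key
    and flagging anything ambiguous for manual review."""
    merged, conflicts = {}, []
    for key in db_a.keys() | db_b.keys():
        a, b = db_a.get(key), db_b.get(key)
        if a is None or b is None:
            merged[key] = a or b          # only one side ever saw this record
        elif a[0] == b[0] and a[1] != b[1]:
            conflicts.append(key)         # same timestamp, different data: a human decides
        else:
            merged[key] = max(a, b)       # newer timestamp wins
    return merged, conflicts
```

In real life, the “flag it for a human” pile is where all the delicacy and frustration lives.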
There’s a little more to it than that, but that’s the short version. All in all, it was caused by a combination of the very technologies that normally keep the game running smoothly, a bad driver, and some absolutely understandable human error. I cringed when I started to see where it was all headed, because it’s one of those things that would have seemed perfectly logical at the time, but when you know what the end result is, you can see it coming a mile away.
The article was also a really interesting peek into all the work it takes to run an MMORPG. Neckorcuk even admits something that every IT person can relate to but isn’t always willing to say out loud: “I don’t fully comprehend how the [cloud storage] system works.” It made me chuckle, but it just goes to show what an incredible feat of IT engineering our favorite games really are.
Say what you will about Guild Wars 2 as a game, or ArenaNet as a company, but consider this: most MMOs go down for an hour or two weekly as a matter of course, many go down for a full day or more during a big content update, and no one bats an eye. Some games I could name have suffered months-long service health problems with little more communication than, “Our bad, we’re working on it.” Guild Wars 2 went down last year for the first time in four years, in only one region, for less than 24 hours, and we got a 3,000-word essay on what went wrong. And it hasn’t gone down again since. That’s a level of reliability I can only wish for from most of the online services I use, even ones I rely on in my professional life, and a level of transparency that’s practically unheard of.
I would like to thank ArenaNet’s Robert Neckorcuk for taking the time to pen this extensive after-action report for the players, and to thank all of the behind-the-scenes technical staffers at ArenaNet for keeping the game running as smoothly as they have for all these years. They have the kind of job that nobody thinks about until something breaks, and I would say they do it quite well.