Flameseeker Chronicles: Guild Wars 2’s outage analysis, simplified

For my day job, I’m a software developer, and all of my previous jobs have been IT related in some fashion. I still remember some advice that my supervisor at my first tech-related job gave me: “If we’re doing our job right, no one will know we’re here. When something does break, whether it’s your fault or not, you’re the face of the problem. Don’t take it personally.”

Last year, Guild Wars 2’s European game servers went down for about 20 hours. The “face” of that nightmarish problem was Platform Team Lead Robert Neckorcuk.

I sympathize with Neckorcuk. I’ve had to be the guy who called the department head and delivered the bad news that an update went bad or someone clicked a button they shouldn’t have and now we’re going to have to take a service offline. Usually, I’ve had no ETA to give at first; if you don’t know exactly what’s gone wrong, it’s hard to say how long it will take to fix it. It’s not fun. Sometimes it’s because we didn’t test something properly. Sometimes it’s because something happened when we went live that we couldn’t have foreseen in testing. Stuff happens, but it’s still never a fun call to make. And it’s even less fun for the frontline support people who are going to have to field calls all day from people, often angry people, who can’t access the service they’re paying for.

I know Neckorcuk’s article had a lot of technical detail, and while I think he did a great job giving us a readable insight into what happened, a TLDR version might be instructive.

In short, the game runs off of two databases. Let’s call them dbA and dbB. They are hosted not at ArenaNet headquarters, as you might expect, but on Amazon Web Services servers. (Protip: Whenever someone talks about “the cloud,” it’s just a fancy, trendy way of saying they’re renting server resources from someone else, like Amazon or Microsoft.) Every time something happens in the game that needs to be recorded – say, an item goes into your inventory, you complete an achievement, or you switch up your build – it gets recorded first to the live database, dbA, and then, as time permits, to the backup, dbB. If something goes wrong with dbA, the system automatically switches over to dbB, making dbB the primary and keeping a log of what needs to happen to bring dbA back up to speed so it can take over as the backup once it’s running again. All the while, life in Tyria goes on, and players never know the difference.
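The write-first-then-replicate flow can be sketched in a few lines of Python. To be clear, this is purely illustrative: ArenaNet hasn’t published any code, and the class and names here are my own invention. But it shows the shape of the thing: writes hit the primary, trickle to the backup, and a failover promotes the backup while queuing what the old primary will need to replay.

```python
class ReplicatedStore:
    """Toy model of the dbA/dbB setup described above (illustrative only)."""

    def __init__(self):
        self.primary = {}  # dbA: the live database
        self.backup = {}   # dbB: the warm standby
        self.pending = []  # writes not yet replicated to the backup

    def write(self, key, value):
        self.primary[key] = value          # recorded to the live db first...
        self.pending.append((key, value))  # ...then queued for the backup

    def replicate(self):
        # "As time permits": flush queued writes to the backup.
        while self.pending:
            key, value = self.pending.pop(0)
            self.backup[key] = value

    def failover(self):
        # Primary died: promote the backup. Writes queued from now on double
        # as the catch-up log the old primary replays when it returns.
        self.primary, self.backup = self.backup, {}
        self.pending.clear()


store = ReplicatedStore()
store.write("inventory", ["sword"])
store.replicate()  # dbB now matches dbA
store.failover()   # dbA dies; dbB takes over seamlessly
print(store.primary["inventory"])  # ['sword'] -- players never notice
```

When replication keeps pace, the swap is invisible, which is exactly the point of the design.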

Then the servers started to run out of space. That’s normally not a big deal, but due to a small issue with their cloud service provider, the fix required a restart to take effect. Simple enough: the team could just restart one server, then the other. For some reason, though, the synchronization from dbA to dbB was running slowly, so they decided to uncouple the databases over the weekend, letting the game run exclusively off of dbA and letting dbB catch up so it would be ready to restart and then take over on Monday morning. The problem is that, due to a bad driver and the lack of space, dbA stopped working sooner than expected, before the devs were able to complete the restart, and dbB, which was now missing several days’ worth of events, took over. Suddenly, players were seeing their inventories, characters, and so on revert to the state they were in the previous Friday. This is why it initially looked to many players as though a rollback had happened right before the outage.
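In other words, this was a replication-lag problem: an automatic failover is only safe when the backup is current. The stale-failover scenario boils down to a few lines (the account data here is hypothetical, obviously):

```python
# Hypothetical snapshot of one account, to illustrate the stale failover.
dbA = {"inventory": ["sword", "precursor"]}  # live db: Friday plus weekend loot
dbB = {"inventory": ["sword"]}               # backup: uncoupled since Friday

primary_alive = False  # bad driver + full disk: dbA dies before the restart

# The failover doesn't know dbB is days behind; it promotes it anyway.
live = dbA if primary_alive else dbB

print(live["inventory"])  # ['sword'] -- the weekend's precursor drop is gone
```

The failover machinery worked exactly as designed; the design just assumes the backup is fresh, and for one weekend it wasn’t.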

At least, that’s the official story. Some players still believe that an extremely powerful Chronomancer accidentally cast Continuum Split on the entire continent of Europe. That’s unconfirmed at this time, but the Chronomancer spec did receive a nerf shortly thereafter. Take that as you will.

From here, it was a matter of piecing together what needed to happen to reconcile the two servers, which I can tell you from personal experience is an extremely delicate and frustrating process. Neckorcuk talks about a variety of new strategies that his group implemented to keep this kind of thing from happening again, including additional monitoring, redoubled testing efforts, and a standardized software package to deploy across all of their servers. All great things that, in my experience, often only get implemented after a situation like this one.

There’s a little more to it than that, but that’s the short version. All in all, it was caused by a series of technologies that normally keep the game running smoothly, combined with a bad driver, and some absolutely understandable human error. I cringed when I started to see where it was all headed because it’s one of those things that would have seemed perfectly logical at the time, but when you know what the end result is, you can see it coming a mile away.

The article was also a really interesting peek into all the work it takes to run an MMORPG. Neckorcuk even admits something that every IT person can relate to but isn’t always willing to say out loud: “I don’t fully comprehend how the [cloud storage] system works.” It made me chuckle, but it just goes to show what an incredible feat of IT engineering our favorite games really are.

Say what you will about Guild Wars 2 as a game, or ArenaNet as a company, but most MMOs go down for an hour or two weekly as a matter of course, and many go down for a full day or more during a big content update, yet no one bats an eye. Some games I could name have suffered months-long service health problems with little more communication than, “Our bad, we’re working on it.” Guild Wars 2 went down last year for the first time in four years, in only one region, for less than 24 hours, and we got a 3,000-word essay on what went wrong. And it hasn’t gone down again since. That’s a level of uptime reliability I only wish I could get from most of the online services I use, even the ones I rely on in my professional life, and a level of transparency that’s practically unheard of.

I would like to thank ArenaNet’s Robert Neckorcuk for taking the time to pen this extensive after-action report for the players, and thank all of the behind-the-scenes technical staffers at ArenaNet for keeping the game running as smoothly as they have for all these years. They have the kind of job that nobody thinks about until something breaks, and I would say that they do it quite well.

Flameseeker Chronicles is one of Massively OP’s longest-running columns, covering the Guild Wars franchise since before there was a Guild Wars 2. Now penned by Tina Lauro and Colin Henry, it arrives on Tuesdays to report everything from GW2 guides and news to opinion pieces and dev diary breakdowns. If there’s a GW2 topic you’d love to see explored, drop ’em a comment!
Reader
Tremayne

Having worked in tech for a bank for over 20 years, and dealing with incidents like this from the perspective both of the technical expert trying to fix it and the management team, I can relate to this (and “my mortgage hasn’t been paid” is a wee bit more serious than “my Norn Warrior had a precursor drop and it’s been rolled back”).
There will be lessons learned. That same database problem will not be repeated. And the next big incident will be something else. That’s the nature of the job.

Reader
styopa

I’ve always thought it would be amusing (but probably not regarded so by most) if a game implemented a 4th-wall breaking metaboss whose special attacks actually added to your ping, caused packet loss, warping, or if you stood particularly long in the fire, disconnected you.

Hee hee. Yeah, sadistic.

Reader
EmberStar

Wouldn’t this just be the MMO version of Psycho Mantis from that one Metal Gear game? Where it turned out the key to defeating him was that you had to unplug from Port A, and plug your controller into Port B so he couldn’t “read your mind” and instantly dodge, deflect or counter everything you did?

Reader
Neurotic

I remember that! That was brilliant. :D

Reader
Sleepy

That was an interesting read, especially the comment about not fully understanding how the cloud works. One of the reasons I got out of IT was the sense of constantly upskilling, training, and reading up purely to remain competent at my job. It’s a weird profession when your expertise has a half-life of a few years.

Reader
Neurotic

I can sympathize with the cloud sentiment. As a technical documentation editor for a large business software company, I see documentation for all aspects of modern multi-platform commercial software, and the cloud stuff is far and away the most opaque and difficult to document.

Reader
Nate Woodard

Lol impossible! The network schemes don’t even make sense.