Kiddies and adults alike looking for a few Halloween treats in Roblox this past weekend experienced a nasty trick instead. Starting last Thursday, the game service went down for an astounding three days, more than 60 hours of unexpected downtime.
Players left in the dark over the situation didn’t receive much in the way of illumination from the developers in the first couple of days. The dev team tweeted that it was having some sort of server issue and denied that the outage was caused by “specific experiences or partnerships on the platform.” That last remark was apparently a reference to a Chipotle promotion giving away $1 million worth of burritos through the game service.
Roblox has since posted the following statement:
As most of the Roblox community is aware, we recently experienced an extended outage across our platform. We are sorry for the length of time it took us to restore service. A key value at Roblox is “Respect the Community”, and in this case we apologize for the inconvenience to our community.
On Thursday afternoon, October 28th, users began having trouble connecting with our platform. This immediately became our highest priority. Teams began working around the clock to identify the source of the problem and get things back to normal.
This was an especially difficult outage in that it involved a combination of several factors. A core system in our infrastructure became overwhelmed, prompted by a subtle bug in our backend service communications while under heavy load. This was not due to any peak in external traffic or any particular experience. Rather, the failure was caused by the growth in the number of servers in our datacenters. The result was that most services at Roblox were unable to effectively communicate and deploy.
Due to the difficulty in diagnosing the actual bug, recovery took longer than any of us would have liked. Upon successfully identifying this root cause, we were able to resolve the issue through performance tuning, re-configuration, and scaling back of some load. We were able to fully restore service as of this afternoon.
We will publish a post-mortem with more details once we’ve completed our analysis, along with the actions we’ll be taking to avoid such issues in the future. In addition, we will implement a policy to make our creator community economically whole as a result of this outage. There are more details on this to come. As part of our “Respect the Community” value we will continue to be transparent in our post-mortem.
To the best of our knowledge there has been no loss of player persistence data, and your Roblox experience should now be fully back to normal.