
Hello! My name is Robert Neckorcuk, Platform Team Lead at ArenaNet, and today I get to be the mouthpiece for a large, multidisciplinary effort that managed the Guild Wars 2®: Visions of Eternity™ elite specialization beta event through its release, subsequent issues, and eventual fix. Beta events have usually been affairs where the development teams seek feedback on new features, skills, and systems—but as we were all reminded recently, these betas are also the first live trial of new technologies that are used to produce new features, skills, and systems. This post will recap the rocky start to this beta event and provide a behind-the-curtain look at how our team approached and overcame those challenges with help from all of you!
I want to start with a special shout-out to all of you in the community—we recognize this was not an ideal situation, from the original game issues to our asking for your aid in gathering more data when we’d purposefully enabled a known crash-inducing event. Through it all, we saw many messages of kindness, understanding, and support. On behalf of our whole team: words cannot express our gratitude for your patience as we struggled to solve this issue. Thank you all.
The “Alpha” Beta
On Wednesday, August 20, at exactly 9:17 a.m. Pacific Time (UTC-7), the beta event weekend showcasing the new elite specializations for our sixth expansion, Guild Wars 2: Visions of Eternity, began. This was not a great start, as the beta was supposed to begin at 9:00 a.m.! We had simply messed up a time-zone conversion, so the configured start time was off by an hour. Fortunately, we were able to modify some configuration files and get the beta event started well before the misconfigured 10:00 a.m. time!
We never plan to ship bugs, but we do make plans and have systems in place that allow us to quickly react when issues do arise. Soon after the release, three issues were identified that we knew we wanted to correct as soon as possible. One of these we were able to simply disable with configuration files. A second issue, unrelated to the beta event, could be fixed by deploying a change to one of our backend servers. The third would require a same-day game build. By 2:00 p.m., all three issues had been handled, and we were looking forward to kicking back and joining in the beta fun. Shortly after this, our Community Manager, Rubi, messaged the group asking for a status update, as the game was on fire.
All around our remote offices, brows furrowed. We weren’t getting any crash reports and all of the graphs looked steady. Once we knew to start digging, however, we eventually hit paydirt. One of our client error reports was repeatedly showing a Code 1083. Cue the ominous music.
All Locked Up
Code 1083 is a very specific error code, referenced only once in our entire server codebase: “Waiting for Unlock.” In Guild Wars 2’s core server architecture, ‘locks’ are a component of how our map contexts are organized, as maps can be created on any server in our hosting fleet. The character-locking mechanism exists to ensure that there is only ever one map context with the authority to write data to your character record. It’s a useful tool for mitigating exploits, but its primary purpose is ensuring the consistency of character and account data. Normally, when you enter a map, a pending lock is placed in the database for your character, signaling the game server’s intent to take authority. A handoff occurs when there is either no existing lock or the existing lock is released by the previous game server, and the pending lock is converted to the authority lock.
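To make that handoff a little more concrete, here is a minimal sketch of the flow in C++. The names (CharacterLockTable, PlacePendingLock, TryPromote) are invented for illustration; this shows the general pattern, not our actual lock code.

```cpp
// Hypothetical sketch of the character-lock handoff described above.
#include <cstdint>
#include <unordered_map>

using CharacterId  = std::uint64_t;
using MapContextId = std::uint64_t;

enum class LockState { Pending, Authority };

struct CharacterLock {
    MapContextId owner;   // map context that holds, or wants, write authority
    LockState    state;
};

class CharacterLockTable {
public:
    // A joining map context registers its intent to take authority.
    void PlacePendingLock(CharacterId character, MapContextId newOwner) {
        pending_[character] = CharacterLock{newOwner, LockState::Pending};
    }

    // The handoff: the pending lock is promoted only when no authority lock
    // exists, or when the previous owner has already released its lock.
    bool TryPromote(CharacterId character) {
        auto pendingIt = pending_.find(character);
        if (pendingIt == pending_.end())
            return false;

        if (authority_.count(character) != 0)
            return false;  // previous map context still has write authority

        authority_[character] =
            CharacterLock{pendingIt->second.owner, LockState::Authority};
        pending_.erase(pendingIt);
        return true;
    }

    // Called by the previous map context when the character leaves cleanly.
    void Release(CharacterId character, MapContextId owner) {
        auto it = authority_.find(character);
        if (it != authority_.end() && it->second.owner == owner)
            authority_.erase(it);
    }

private:
    std::unordered_map<CharacterId, CharacterLock> pending_;
    std::unordered_map<CharacterId, CharacterLock> authority_;
};
```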
But we don’t always get to live in a perfect world—bugs happen and maps crash. As the developers, we want to gather as much debugging information about these events as we can. So when a map context does crash, we can actually take a quick pause during shutdown, gather relevant information, and email it back to the office to investigate and fix. During this flow, the game server knows that this map is crashing and can also tell the database to release any held locks.
At this point, we had our first real clue, one that explained both our lack of operational visibility and the type of error we would need to chase down. Since we were not receiving reports of these crashes, we knew the entire “clean-up” code path was never getting triggered, meaning locks were not being released cleanly. The map context holding a given lock no longer existed, so there was no entity to tell to release the lock. The fallback was to simply let the lock expire on a timer: after a while, with no update from the previous lock owner, we would release the lock, and the character could once again connect to a map context that was able to acquire a new lock.
The long timer exists to cover a number of different infrastructure interruptions, and while we recognize it is painful for the player to be unable to log in, it is the lesser of two evils, as it ensures the integrity and consistency of our players’ data.
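Sticking with the same hypothetical sketch, the timer-based fallback could be modeled something like this. The heartbeat field and the one-hour expiry value are purely illustrative assumptions; the only thing stated above is that the real timer is deliberately long.

```cpp
// Hypothetical sketch of the expiry fallback: if a lock owner stops updating
// its lock (for example, because the map context crashed without running the
// clean-up path), the lock eventually expires and authority can move on.
#include <chrono>

using Clock = std::chrono::steady_clock;

struct AuthorityLock {
    Clock::time_point lastHeartbeat;  // refreshed periodically by the owning map context
};

// Illustrative value only; the real timeout is simply "long".
constexpr auto kLockExpiry = std::chrono::minutes(60);

bool IsLockExpired(const AuthorityLock& lock, Clock::time_point now = Clock::now()) {
    // With no update from the previous owner for long enough, the lock is
    // considered abandoned and can be released for a new map context.
    return (now - lock.lastHeartbeat) > kLockExpiry;
}
```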
After discussion among the team and leadership, we believed this was a significant issue and that prematurely ending the beta event would be the correct course of action, providing players a better experience and allowing the team to investigate the source of the issue. With the knowledge that our application code was crashing and not triggering the collection and reporting flows, Occam’s razor suggested to us that we would be looking for a memory-corruption issue. This was confirmed later that evening using the operating-system logs.
Needle in a Stack of Needles
We started by breaking down the problem and trying to find the smallest scope where the issue could exist. While this would help keep us from staring at millions and millions of lines of code, with the knowledge we had, we couldn’t narrow the search space very far. We have nine new elite specializations, and we didn’t know which profession was the catalyst (pun intended) of the crash.
We ship code all the time, and although the beta event was in August, the very first code changes supporting these new specializations were submitted last year. On Thursday, several team members started working on the problem from this angle by searching for keywords in submission comments and perusing changes.
Other folks looked into our data, seeing if we could identify any patterns regarding certain maps, professions on those maps, player counts on maps, etc., that could help further narrow the scope of the issue.
We did have one other hint that helped narrow down where we should be looking, and it dealt with how our game servers manage and group different types of memory.
The Mind of a Game Server
Every game needs data to know how to operate: which creatures spawn where, when dynamic events trigger, the geometry of the ground for collision detection, and so much more. A game server is able to create a number of map contexts, and each map context will claim some memory to hold all of its local information for use and updates.
In an effort to minimize our memory footprint and reduce duplication, our game servers will additionally create a section of shared static memory. Here, we can place any unchanging data and allow it to be read-only, shared among all map contexts. Most things tied to identifier tags, like items, achievements, creature types, etc., exist here and can be read by any map context, as it will always be the same information, regardless of which map is running.
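As a rough illustration of that split (again with made-up type names such as SharedStaticData and MapContext, not our real ones), each map context might hold a read-only view of one shared definition block alongside its own mutable state:

```cpp
// Illustrative sketch of the memory layout described above: one read-only
// block of definition data shared by every map context on a game server,
// plus mutable per-map state owned by each individual map context.
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

struct SkillDef {
    std::uint32_t id;
    std::string   name;
    float         cooldownSeconds;
};

// Unchanging data keyed by identifier: safe to share because nothing writes to it.
struct SharedStaticData {
    std::unordered_map<std::uint32_t, SkillDef> skillsById;
    // ... items, achievements, creature types, and so on
};

// Each map context owns its own mutable state but only a const view of the
// shared block. Nothing here should ever modify or free that shared data.
class MapContext {
public:
    explicit MapContext(std::shared_ptr<const SharedStaticData> shared)
        : shared_(std::move(shared)) {}

    const SkillDef* FindSkill(std::uint32_t id) const {
        auto it = shared_->skillsById.find(id);
        return it != shared_->skillsById.end() ? &it->second : nullptr;
    }

private:
    std::shared_ptr<const SharedStaticData> shared_;  // read-only, shared
    std::vector<std::uint64_t> localPlayerState_;     // per-map, mutable
};
```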
Based on the memory corruption crashes we saw (or didn’t see), we could determine that the only way for this to occur was through an issue with shared memory. Our search space narrowed slightly more.
Try and Try Again
While we continued our investigations, we commandeered a planned team-wide internal playtest on Thursday to attempt to reproduce the crash on one of our development servers. The instructions were intentionally vague: make a beta character, mash buttons, and submit a report if you crashed.
We couldn’t cause a crash.
Our development servers run slightly different code and configuration than our live servers do. On Friday, we came up with a plan to run another internal playtest on our staging clients, which are the closest mimic we have of the live environment.
Once again, we couldn’t cause a crash.
With fewer options available, we made the decision to enable the beta event in the live game again, intending to have our players cause a server crash.
The “Beta” Beta
At exactly noon Pacific Time, the beta event was turned back on. Our chat grew quiet for a few minutes as we all pondered what might happen next. We had made some additional preparations prior to reenabling the beta event by identifying specific tools and monitoring to try and provide the best opportunity to spot crashes and capture debug data. What we hadn’t specifically prepared for was how lucky we were about to get.
Just before the noon relaunch, we deployed a new build, fixing a bug involving some of the new player summons. The first stroke of luck was that a new build forced all of our game servers to create new shared memory buffers, which shuffled around all the memory allocations. At 12:11 p.m., our monitors reported the first crash. Our second stroke of luck came a moment later, when we received a full crash report from our internal diagnostics. The second-and-a-half stroke of luck was the nature of the crash, as revealed by the crash-dump file: a piece of code was trying to delete memory that had already been deleted.
“Why are we double-deleting?” we asked ourselves. “Wait, why are we deleting from the read-only memory at all?”
First, we wanted to make sure this was the crash we were looking for and not a strangely timed coincidence. We passed the information from the crash-dump investigation to our QA team, and within fifteen minutes we were able to reproduce the bug. In the meantime, our team monitoring the live game had noted a few more crashes; some were sending crash reports, and others were triggering the memory-monitoring tools we had set up earlier. The reports all showed similar information about trying to read or delete a section of already-deleted memory. After one final discussion to confirm whether we’d gathered all we needed, we once again disabled the beta event.
Engineers Cause and Fix the Bug
The engineer profession’s newest elite specialization, the amalgam, introduced a flashy new profession mechanic called Morph and a new code path for how these skills are set and updated. This new behavior was what morphed our plans of a successful beta into a three-day tribulation.
The bug occurred due to a misunderstanding of whether a data structure used local map-context memory or the game server’s shared memory. When a single engineer player with the amalgam elite specialization equipped swapped out a Morph skill while its cooldown was active, the intent of the code was to store that skill change and the cooldown carryover relative to that particular player’s context. That change was made using a pointer that unknowingly referenced the shared memory. Thus, when we intended to delete a local reference to the previous skill, we were accidentally deleting the skill from the wider shared memory. The next time any player on that game server attempted to reference the now-deleted Morph skill, we would crash.
Fortunately for us, we wrote this data structure and its helper functions, so technically the fix was a one-liner.
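As a heavily simplified, hypothetical reconstruction of that class of bug (the names MorphSkillSlot and SwapMorphSkill are invented for illustration, and this is not our actual code), the shape of the problem and of the one-line fix looks roughly like this:

```cpp
// Hypothetical reconstruction, not the real game code. The skill slot stores
// a raw pointer; the swap code treated it as a per-player copy it owned, when
// it actually pointed into the shared, read-only definition block.
struct SkillDef;  // lives in shared memory, owned by the game server

struct MorphSkillSlot {
    const SkillDef* skill = nullptr;  // points into shared memory
    float cooldownRemaining = 0.0f;   // cooldown carryover lives in per-player memory
};

void SwapMorphSkill(MorphSkillSlot& slot, const SkillDef* newSkill) {
    // BUG: "deleting our local reference" actually freed the shared skill
    // definition that every other player on this game server still uses,
    // so the next lookup of that skill dereferenced deleted memory.
    // delete slot.skill;

    // Fix (conceptually a one-liner): the slot never owned that memory,
    // so stop deleting it and simply repoint the reference.
    slot.skill = newSkill;
}
```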
At 12:53 p.m., we submitted this fix to our development branch and began our processes for bug testing, regression testing, and promoting the change to our staging and eventually live builds.
Personally, I am still impressed with how quickly we are able to ship changes. It’s due in part to our wizardly Engineering team, and “E,” who actually fixed this particular bug. However, our build pipeline also has a few tricks up its sleeve to help us keep things moving along quickly. For most Tuesday releases, we need to run a full build that includes all of our code and content. However, for a change with only one piece of server code, we can perform a fast build. This is a much more minimal and selective build that takes very little time to complete, allowing us to get it out the door quickly.
The “Gamma” Beta
Just a few hours later, with the fix deployed to the live game, we enabled the beta event once again. Our Engineering team members and QA logged in to create amalgam characters and verify the fix. To our relief, there were no further crashes from the verification.
Our team continued to monitor, using all the same tools as we had earlier in the day, and it appeared that the fix had worked. We were pleased to finally deliver the experience of these new elite specializations to our players with all the new skills and mechanics working as expected.
In Closing
After the incident, and after the beta event closed a week later, the team came together once again to recap, iterate, and plan. Is it possible to catch issues such as this internally and earlier in the process? Is there a new tool or procedure that can help mitigate these types of problems? If we were to see a repeat of this issue, what would we handle differently? How could we recover to a nominal player state faster?
I’m going to morph the closing I wrote for a different post back in 2020:
Our continued efforts are always targeted at providing our players with the best experience and usability of our services. We love to celebrate the designs architected by those before us and the tools and processes we utilize to retain our world-class server standards and in-game mechanics and events. We recognize that we may not achieve perfection, but we will certainly strive for it with every future procedure and deployment. As we look forward to the exciting new features and projects the gameplay and design teams are bringing to you, we are constantly working behind the scenes to make sure that you can always experience and enjoy all that Guild Wars 2 has to offer.
Thanks again and see you in Tyria!