Infrastructure Post-Mortem: August 24/25th Issues

Cobol72 Edited 6 years ago

As many of you are aware, Stratus was plagued by infrastructure issues throughout the tournament weekend: no matches were played at all on Saturday, and Sunday started late and finished before all the matches had been played. We know this experience has been frustrating for everyone who played in the tournament and for the staff involved in running it, so I wanted to take the time to explain everything that happened and what we'll do going forward to make sure it doesn't happen again.

Our plan for Saturday was to have majjus bring our tournament infrastructure online. This is a fairly mundane process, and cs had written up a step-by-step guide for our less experienced sysadmins (like majjus) to follow, so we didn't believe it was problematic for him to perform this task. When Saturday rolled around, however, we ran into a number of issues:

majjus' access credentials for the server were no longer valid (10:36 AM). We had verified that his credentials were valid the previous night, but by Saturday morning a copying error had appeared in the server-side copy of his credentials, which invalidated them. We don't know how this happened; it's possible someone accidentally added a keystroke overnight. To resolve this, we had to wait for another sysadmin to provide majjus with their credentials so we could continue with setup (12:07 PM; cs was on holiday in Europe, so his response took a while).
Once we got into the cluster, we realised that the step-by-step guide wasn't entirely accurate: because of a recent infrastructure change, it was missing a command that needed to be run (12:10 PM). We figured out the correct command at 12:40 PM and added it to our guide.
Around 12:45 PM, we realised none of the configuration files needed to set up the tournament servers were in place. This was due to our recent infrastructure change. At this point cs was no longer available and there were no other developers around with sufficient knowledge of the infrastructure to help us with this issue, so around 1:30 PM we decided to cancel Saturday's games. Later that night, cs added some information to our guide which showed us how to recreate the configuration from scratch if need be.
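
To illustrate the kind of pre-flight check that could surface missing configuration before anyone starts following the guide, here is a minimal Python sketch. The file names are hypothetical and are not Stratus's actual configuration layout; the only idea shown is listing what is absent from a directory up front.

```python
from pathlib import Path

# Hypothetical file names, for illustration only.
REQUIRED_CONFIGS = ["tournament.yml", "proxy.yml", "maps.yml"]


def missing_configs(config_dir, required=REQUIRED_CONFIGS):
    """Return the required config file names that are absent from config_dir."""
    base = Path(config_dir)
    return [name for name in required if not (base / name).is_file()]
```

Run the night before, an empty result would mean every expected file is in place, and a non-empty one would leave time to recreate the configuration while experienced help is still around.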

Here's a timeline of Sunday's issues:

We normally upgrade the compute power of our backend during tournaments to cope with the extra load that running a tournament creates. This was a part of our step-by-step guide, but after resizing we noticed that none of our backend components were coming online (10:10 AM). After some debugging, we noticed that a flag that's normally attached to our backend server had disappeared, meaning none of our backend components knew to start themselves on that server. We re-added the flag around 10:25 AM and our backend components started coming online.

Once the backend came online, we noticed that some tournament servers were using maps from the last tournament instead of the current one (10:30 AM). We spent a bit of time debugging and noticed that a flag was misspelt on the tournament node, meaning servers were pulling maps from the wrong place. We fixed this fairly quickly and the maps began loading properly (10:40 AM).
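
Both flag problems on Sunday (one missing, one misspelt) are the sort of thing a small comparison script can flag before servers start. The sketch below is an assumption about how such a check might look, not a description of our tooling: flags are modelled as plain strings, the example flag names are made up, and `check_flags` is a hypothetical helper.

```python
import difflib


def check_flags(expected, actual):
    """Compare the flags a node should carry with the flags it actually has.

    Returns (missing, likely_typos), where likely_typos maps an unexpected
    flag to the missing expected flag it most closely resembles.
    """
    expected_set, actual_set = set(expected), set(actual)
    missing = sorted(expected_set - actual_set)
    likely_typos = {}
    for flag in sorted(actual_set - expected_set):
        # A near-match to a missing flag suggests a misspelling.
        close = difflib.get_close_matches(flag, missing, n=1, cutoff=0.8)
        if close:
            likely_typos[flag] = close[0]
    return missing, likely_typos
```

For example, `check_flags(["backend", "tournament-maps"], ["tournamnet-maps"])` would report both expected flags as missing and suggest that `tournamnet-maps` was meant to be `tournament-maps`.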

Throughout Sunday afternoon, we were struggling with various API and proxy crash loops, which started around 2:39 PM. There's no one cause for these issues: PGM is a complex piece of software with many intertwined parts, and one of them blipping can easily cause cascade failures. We tried various things to fix the crashes, but when nothing was working we decided to end the day early around 3:30 PM. We continued working on the issues and at around 4:00 PM we managed to bring the backend back into a stable state.

Now I'm going to go through the root issues and how we plan on fixing them:

Issue 1: We were relying on inexperienced sysadmins to bring our tournament infrastructure online. majjus was the primary person responsible for bringing the servers online, and while he has a good understanding of Linux fundamentals, he does not necessarily have enough background knowledge to fully understand how to debug our infrastructure when things go wrong. After the tournament got started on Sunday, I was given cluster access and became responsible for making sure things went smoothly. While I had background knowledge of the technologies we use to build our infrastructure, I did not at the time understand how it was configured within our network, which made the crash loops that occurred on Sunday afternoon more difficult for me to debug.

Solution: From now on, we'll be making sure that we have experienced sysadmins available for the duration of a tournament to solve infrastructure problems as they come up. Previously we only required sysadmins to be around to bring up the infrastructure, and we would just ping them if issues arose. Going forward, we'll be assigning one sysadmin to be on-call for the duration of the tournament and requiring them to be available in under five minutes if any issues arise. This way, if anything does go wrong, it'll take less time for action to be taken.

Issue 2: Our standard operating procedures (step-by-step guides) have not been adequately tested. For our less experienced sysadmins, cs produced a number of guides on how to perform various routine functions on our infrastructure. Unfortunately, these guides weren't tested, which led to a number of issues when majjus had to set up the servers based on those instructions alone.

Solution: Our experienced sysadmins will create a number of guides on how to perform routine tasks. Once these guides are written, they will then be tested by a less experienced sysadmin to ensure that they're clear and accurate. These guides will also be reviewed regularly from now on.

Tournament setup guide excerpt

In addition to the above, we'll also bring up our tournament servers at least two hours in advance so we have time to resolve any issues, and test our procedures before each tournament to make sure they're still relevant. We'll also be preparing multiple backup steps to take for any parts of our guides which have the potential to fail.

We know we have a bad reputation when it comes to delivering a stable tournament experience to you guys. PGM is a difficult piece of software to run, but we want to do more to deliver a stable experience to you both during tournaments and on the rest of the network as well. Please feel free to ask any questions you have about the issues this weekend down below and I'll do my best to answer them.

sqyid Edited 6 years ago

Thanks for the clear-cut response, hopefully the server won’t have as many issues anymore

finshu Edited 6 years ago

Thanks!

mrcookie_13 Edited 6 years ago

Why was nothing tested beforehand (Is it not possible or just didn’t happen)? And also, how does a flag go missing if it was present for the previous tournament? I’m confused about proxy restarts because that harmed past tms but Wool-E was mostly unaffected. Also, how do proxy restarts not affect the main (public) server?

sslumper Edited 6 years ago

Why do these issues always happen during tournaments?

Why were servers not tested before the day of the tourney? (notice the missing flags etc)?

Cobol72 Edited 6 years ago

We had tested parts of the procedure beforehand, but not all of it, as that would involve some network downtime. That being said, we could've been more careful about reviewing our plans and addressing issues proactively.
Flags can go missing for a number of reasons and I'm not entirely sure why the flag fell off on Sunday. That being said, we added some guidance on that issue to our tournament setup guide so we don't expect this to cause a delay to tournament start going forward.
Proxy crash loops can occur for a number of reasons, but they are often caused by a separate part of our infrastructure failing, which leads to a cascade failure across a number of other systems. PGM is made up of a bunch of tightly integrated parts, which makes the whole system really fragile and unstable. We may not have had issues during WOOL-E, but honestly it's just sheer luck. The TM proxy is also separate from the main proxy, so while one may crash the other could be fine.

Cobol72 Edited 6 years ago

These issues aren't necessarily limited to tournaments. Bringing up the tournament servers requires us to change many different things in our infrastructure, which could cause problems in and of itself. The server definitely has stability issues outside of tournaments, but we're working to reduce these across the board.
We tested our procedure the night before, but not fully as that would require network downtime. The missing flags were not issues we've encountered in past tournaments or when testing our procedures, and we aren't quite sure why it happened this time around. Nonetheless, we've updated our internal documentation with information about how to fix that issue if it happens again in the future. We've learnt a lot from this weekend and our internal documentation is now far more robust than it was in the past.

Diabolicx Edited 6 years ago

Thanks for clearing it up

9116d3fa-1400-4ada-aecd-2eae1d73e383 Edited 6 years ago

with all due respect and appreciation for the transparency, i as a player could care less for the reasons of the tournament wasting my time. from my personal standpoint, im a regular and involved in this server so that experience does not make me quit this server, however it's really not a good impression on new players if you're looking to expand the scale and size of your events. i mean that the new players would not give a fuck for the reasons of the delay and the waste of time. and almost every tournament there is a delay (actually every one apart from woole iirc), and while im happy that this time as well you said "we learnt a lot about our mistakes" im almost certain, like any past tourney, that the next event will also feature time delays and other problems. it is evident that it is not that insanely difficult to run an event like this, players like campeh managed to run an event approximately of this scale with like 2-3 organizers and one server. all that goes to say - get ur shit together and dont rush the next event, id rather play events less frequently but have them run as smooth as possible. and from a short read on this thread, i hope operation ares will not be as unstable and fragile as pgm, and next event will run better.

sslumper Edited 6 years ago

Wool e had server crashes and rescheduled matches