As many of you are aware, Stratus was plagued by infrastructure issues throughout the tournament weekend: no matches were played at all on Saturday, Sunday started late, and the event finished without all of the matches being played. We know this experience has been frustrating for everyone who played in the tournament and for the staff involved in running it, so I wanted to take the time to explain everything that happened and what we'll do going forward to make sure it doesn't happen again.
Our plan for Saturday was to have majjus bring our tournament infrastructure online. This is a fairly mundane process, so cs wrote up a step-by-step guide for our less experienced sysadmins (like majjus) to follow, and we didn't expect him to have any trouble performing the task. When Saturday rolled around, however, we ran into a number of issues:
Here's a timeline of Sunday's issues:
Now I'm going to go through the root issues and how we plan on fixing them:
Issue 1: We were relying on inexperienced sysadmins to bring our tournament infrastructure online. majjus was the primary person responsible for bringing the servers online, and while he has a good understanding of Linux fundamentals, he does not necessarily have the background knowledge to debug our infrastructure when things go wrong. After the tournament got started on Sunday, I was given cluster access and became responsible for making sure things went smoothly. While I had background knowledge of the technologies we use to build our infrastructure, I did not at the time understand how it was configured within our network, which made the crash loops that occurred Sunday afternoon more difficult for me to debug.
Solution: From now on, we'll make sure that experienced sysadmins are available for the duration of a tournament to solve infrastructure problems as they come up. Previously, we only required sysadmins to be around to bring up the infrastructure, and we would just ping them if issues arose. Going forward, we'll be assigning one sysadmin to be on-call for the duration of the tournament and requiring them to respond within five minutes if any issues arise. This way, if anything does go wrong, it'll take less time for action to be taken.
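To make that concrete, here's a minimal sketch of the kind of automated check that could page the on-call sysadmin the moment a server stops responding. The hostnames and webhook URL are hypothetical placeholders, and mcstatus is just one library that speaks the Minecraft server list ping protocol; this is not our actual monitoring setup.

```python
import json
import time
import urllib.request

from mcstatus import JavaServer  # third-party: pip install mcstatus

SERVERS = ["tm1.example.net", "tm2.example.net"]  # hypothetical hosts
WEBHOOK = "https://example.com/oncall-webhook"    # hypothetical alert hook


def page_oncall(message: str) -> None:
    """Post the alert to a webhook that pings the on-call sysadmin."""
    body = json.dumps({"content": message}).encode()
    req = urllib.request.Request(
        WEBHOOK, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req, timeout=10)


while True:
    for host in SERVERS:
        try:
            # Server list ping: succeeds only if the server is up and
            # accepting clients.
            JavaServer.lookup(host).status()
        except Exception as exc:
            page_oncall(f"{host} failed its status ping: {exc}")
    time.sleep(60)  # re-check every minute
```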
Issue 2: Our standard operating procedures (step-by-step guides) have not been adequately tested. For our less experienced sysadmins, cs produced a number of guides on how to perform various routine functions on our infrastructure. Unfortunately, these guides weren't tested, which led to a number of issues when majjus had to set up the servers based on those instructions alone.
Solution: Our experienced sysadmins will create a number of guides on how to perform routine tasks. Once these guides are written, they will then be tested by a less experienced sysadmin to ensure that they're clear and accurate. These guides will also be reviewed regularly from now on.
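Part of that testing can even be automated: each guide step can be paired with a command that verifies the step actually worked, so the person following it knows immediately when something is off. Here's a rough sketch of that idea; the step names and systemd units below are hypothetical placeholders, not our real procedure.

```python
import subprocess

# Each entry pairs a human instruction with a command that verifies the
# step succeeded. Units like "tm-proxy" are hypothetical examples.
RUNBOOK = [
    ("Start the tournament proxy", ["systemctl", "is-active", "tm-proxy"]),
    ("Start the match server", ["systemctl", "is-active", "tm-server"]),
]

for step, verify in RUNBOOK:
    input(f"Step: {step}\nPress Enter once you've done it... ")
    result = subprocess.run(verify, capture_output=True, text=True)
    if result.returncode != 0:
        raise SystemExit(
            f"Verification failed for '{step}': "
            f"{result.stdout.strip() or result.stderr.strip()}"
        )
    print(f"Verified: {step}")
print("All runbook steps verified.")
```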
In addition to the above, we'll bring up our tournament servers at least two hours in advance so we have time to resolve any issues, and we'll test our procedures before each tournament to make sure they're still relevant. We'll also prepare multiple fallback steps for any parts of our guides that have the potential to fail.
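As a rough illustration of what a preflight check run two hours ahead could look like, here's a minimal sketch that just confirms each server's port is accepting connections. The hostnames and ports are hypothetical, and a real check would also cover deeper issues like maps and flags loading correctly.

```python
import socket

# Hypothetical host/port pairs for the tournament lobby and match servers.
CHECKS = [("lobby.example.net", 25565), ("tm1.example.net", 25565)]

failures = 0
for host, port in CHECKS:
    try:
        # Confirm the port accepts TCP connections; deeper game-level
        # checks would go here too.
        with socket.create_connection((host, port), timeout=5):
            print(f"OK   {host}:{port}")
    except OSError as exc:
        failures += 1
        print(f"FAIL {host}:{port} ({exc})")

if failures:
    raise SystemExit(
        f"{failures} preflight check(s) failed; fix before the tournament starts"
    )
```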
We know we have a bad track record when it comes to delivering a stable tournament experience. PGM is a difficult piece of software to run, but we want to do more to deliver a stable experience, both during tournaments and on the rest of the network. Please feel free to ask any questions you have about this weekend's issues down below and I'll do my best to answer them.
Why do these issues always happen during tournaments?
Why were the servers not tested before the day of the tourney (note the missing flags, etc.)?