As a full-stack web developer, I attended SRECon to expand my thinking about the reliability and observability of the services I develop. Here are my top 5 takeaways:
Casey Rosenthal’s talk, “The success in SRE is silent,” reminded us that while nobody thanks you for the incident that didn’t happen, you can still evaluate how the people around you are learning. First, check their reaction to the changes: thumbs up or thumbs down. Eventually, they will be able to gauge that they’ve learned something. After that, you may notice shifts in behavior, such as someone asking on Slack for help setting up a monitor (where before, they might not have added a monitor at all). Finally, look for results: new things making it to production, such as that new monitor.
Alper Selcuk shared Microsoft’s response to the massive expansion in the use of Microsoft Teams within education at the beginning of the pandemic. One of their techniques for avoiding service blackouts was brownouts: deliberately degrading non-essential features, such as no longer displaying the cursor locations of other users on a shared document, preloading fewer events on the calendar, and decreasing the quality of videos on conference calls. This allowed Microsoft to keep the services online while increasing capacity and optimizing the service for the new load level. What brownouts could be applied to your service if it were to experience a sudden increase in demand?
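One way to start answering that question is to write the brownout levels down before they are needed. Here is a minimal Python sketch of that idea. The load thresholds, the `BrownoutLevel` fields, and the `select_level` helper are my own illustrative assumptions loosely based on the examples above, not anything Microsoft described.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BrownoutLevel:
    """Feature settings that apply once load reaches min_load (0.0 to 1.0)."""
    min_load: float
    show_remote_cursors: bool
    calendar_events_to_preload: int
    video_quality: str


# Hypothetical levels, ordered from healthy to most degraded.
# The thresholds are made up for illustration.
LEVELS = [
    BrownoutLevel(0.00, True, 50, "high"),
    BrownoutLevel(0.70, False, 50, "high"),
    BrownoutLevel(0.85, False, 10, "medium"),
    BrownoutLevel(0.95, False, 0, "low"),
]


def select_level(load: float) -> BrownoutLevel:
    """Return the most degraded level whose threshold the load has reached."""
    chosen = LEVELS[0]
    for level in LEVELS:
        if load >= level.min_load:
            chosen = level
    return chosen


if __name__ == "__main__":
    for load in (0.5, 0.9, 0.99):
        print(load, select_level(load))
```

Keeping the levels in one table makes the trade-offs easy to review with the team ahead of time, instead of deciding which features to drop in the middle of an incident.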
Victor Lei applied experience from skydiving to disaster recovery. In skydiving, there is a specific altitude at which you stop trying to fix your main parachute and decide what is next. Then, there is another altitude at which the skydiver automatically fails over to their backup parachute. Timeboxing is a technique for limiting the time spent testing a new idea or optimization, but it’s easy to lose track of time during a disaster. I’d like to see more guidelines for how long the on-call engineer should try to fix a problem before failing over to the backup or calling in additional support.
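A guideline like that could be as simple as two agreed-upon time limits, mirroring the two altitudes. The sketch below is only an illustration of that idea. The 15 and 30 minute defaults and the `next_action` helper are assumptions of mine, not numbers from the talk.

```python
from datetime import timedelta

FIX_PRIMARY = "keep fixing the primary"
DECIDE = "stop fixing and decide what is next"
FAIL_OVER = "fail over to the backup"


def next_action(
    elapsed: timedelta,
    decision_point: timedelta = timedelta(minutes=15),     # assumed value
    failover_deadline: timedelta = timedelta(minutes=30),  # assumed value
) -> str:
    """Map time spent on an incident to the action the runbook calls for."""
    if decision_point >= failover_deadline:
        raise ValueError("decision_point must come before failover_deadline")
    if elapsed >= failover_deadline:
        return FAIL_OVER
    if elapsed >= decision_point:
        return DECIDE
    return FIX_PRIMARY


if __name__ == "__main__":
    for minutes in (5, 20, 45):
        print(minutes, "min:", next_action(timedelta(minutes=minutes)))
```

The point is less the code than the agreement behind it: the limits are chosen calmly in advance, so the on-call engineer only has to check the clock.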
Mattie Toia discussed emergent organizational failure. One point was forgetting how hard prioritization is, which can be helped by collaborating on mental models and making sharing and communication easy. Another was using incentives as a substitute for dedication, when what the organization needs to do is demonstrate trust through its actions. At the center of all five points was trust: how to build it, and recognizing that each member of the organization is complex and has their own views of the world and the organization.
Christina Yakomin explained how to use the scientific method to test the resilience of systems.
- First, consider your system and all its parts. Then, research all the ways the system might be able to fail. (Newer engineers are especially helpful with this since they are less likely to dismiss failure paths that long-time engineers might ignore.)
- For each failure path, hypothesize about what will happen. (Make sure everyone can share their thoughts on what will happen rather than just agreeing with the first person to respond.)
- Then, test the failure path and see what happens. (Note: If you’re planning to try something extreme like taking the entire database offline, you might have to test in staging instead of production, but be sure to simulate real load during the test.)
- Analyze your findings. Even if the results matched what was expected, is that the behavior you want your system to have?
- Report the findings and document the test process since you will likely want to repeat this test in the future.
- Finally, repeat this process regularly (perhaps quarterly or yearly).
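To make the documenting and repeating steps concrete, here is a rough Python sketch of a record for one such test. The `ResilienceExperiment` class, its fields, and the 90-day default are hypothetical choices of mine, not something from the talk. Comparing predictions to the outcome by exact string match is a deliberate simplification that assumes everyone picks from a short list of outcome labels.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List, Optional


@dataclass
class ResilienceExperiment:
    """One failure-path test: who predicted what, and what actually happened."""
    failure_path: str
    environment: str                 # e.g. "staging" or "production"
    hypotheses: Dict[str, str]       # person -> predicted outcome label
    observed: str = ""               # filled in after the test
    last_run: Optional[date] = None
    repeat_every: timedelta = timedelta(days=90)  # assumed quarterly cadence

    def surprised(self) -> List[str]:
        """People whose prediction did not match the observed outcome."""
        return sorted(
            name for name, guess in self.hypotheses.items()
            if guess != self.observed
        )

    def is_due(self, today: date) -> bool:
        """True if the test has never run or the repeat interval has passed."""
        return self.last_run is None or today - self.last_run >= self.repeat_every


if __name__ == "__main__":
    exp = ResilienceExperiment(
        failure_path="primary database offline",
        environment="staging",
        hypotheses={"ana": "read-only mode", "raj": "full outage"},
    )
    exp.observed = "read-only mode"
    exp.last_run = date(2024, 1, 10)
    print(exp.surprised(), exp.is_due(date(2024, 5, 1)))
```

Recording each person's hypothesis separately supports the earlier point about letting everyone share their own prediction rather than agreeing with the first response.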
I look forward to helping each project I’m on continue to grow in features, reliability, and observability to weather the good times and the bad.
SRECon is an open-access conference. Videos of all the talks will be available for free from USENIX in the coming weeks.