Back in April, I attended Failover Conf, a virtual conference hosted by Gremlin. Overall I thought the conference was pretty good, though as with any conference, the usefulness of the talks varied. The influence of safety thinking was clear, especially that of Resilience Engineering, which was explicitly covered in two talks (by Amy Tobey and J Paul Reed).
The highlights for me were two talks on Site Reliability Engineering (SRE): one by Jennifer Petoff on SRE training at Google, and one by Danyel Fisher & Liz Fong-Jones on implementing SRE at honeycomb.io. SRE is essentially “how Google implemented operations at scale,” which made the conference an interesting blend of theory (Resilience Engineering) and practice (SRE).
The downside of the conference was the unusually high number of marketing emails participants received; I mean, I know it’s a free conference, but even Gremlin admitted there were too many. Thankfully, you can watch all the talks without registration here.
The conference also had a dedicated Slack for discussion during and after the talks, which was, for me, at least as interesting as the talks themselves. From the Slack discussion, I got recommendations from J Paul Reed for some additional academic reading on Resilience Engineering, which I am sharing here:
- Approaching Overload: Diagnosis and Response to Anomalies in Complex and Automated Production Software Systems (master's thesis)
- Managing the Hidden Costs of Coordination
- ACM Queue Vol. 17 No. 6, Human Factors - includes the article above
- Maps, Context, and Tribal Knowledge: On the Structure and Use of Post-Incident Analysis Artifacts in Software Development and Operations (I think this was actually recommended by John Allspaw, also available at Lund)