Information Safety

Improving technology through lessons from safety.

Failover Conf

Back in April, I attended Failover Conf, a virtual conference hosted by Gremlin. Overall I thought the conference was pretty good, but as with all conferences, the usefulness of the talks varied. The influence of safety thinking was clear, especially that of Resilience Engineering, which was explicitly covered in two talks (by Amy Tobey and J Paul Reed).

The highlights for me were two talks on Site Reliability Engineering (SRE): one by Jennifer Petoff on SRE training at Google, and one by Danyel Fisher & Liz Fong-Jones on implementing SRE at honeycomb.io. SRE is an interesting practice; it’s essentially “how Google implemented operations at scale,” which made the conference a nice blend of theory (Resilience Engineering) and practice (SRE).

The downside of the conference was the unusually high number of marketing emails participants received; I mean, I know it’s a free conference, but even Gremlin admitted there were too many. Thankfully, you can watch all the talks without registration here.

The conference also had a dedicated Slack for discussion during and after the talks, which, for me, was at least as interesting as the talks themselves. From the Slack discussion, I got recommendations from J Paul Reed for some additional academic reading on Resilience Engineering, which I am sharing here: