Information Safety

Improving technology through lessons from safety.

Failover Conf

Back in April, I attended Failover Conf, a virtual conference hosted by Gremlin. Overall I thought the conference was pretty good, but as with all conferences, the usefulness of the talks varied. The influence of safety thinking was clear, especially that of Resilience Engineering, which was explicitly covered in two talks (by Amy Tobey and J Paul Reed).

The highlights for me were two talks on Site Reliability Engineering (SRE): one by Jennifer Petoff on SRE training at Google, and one by Danyel Fisher & Liz Fong-Jones on implementing SRE. SRE is an interesting practice; it’s essentially “how Google implemented operations at scale,” making the conference an interesting blend of theory (Resilience Engineering) and practice (SRE).

The downside of the conference was the unusually high number of marketing emails participants received; I mean, I know it’s a free conference, but even Gremlin admitted there were too many. Thankfully, you can watch all the talks without registration here.

The conference also had a dedicated Slack for discussion during and after the talks, which was for me at least as interesting as the talks themselves. From the Slack discussion, I got recommendations on some additional academic reading on Resilience Engineering from J Paul Reed, which I am sharing here:


Secure360 Handouts

Secure360 Update: I’ve been asked by a couple of people to share a version of my slides that better shows how my talk presented the ideas in my references post.

To answer that request, I’ve posted a low-res version of the slides with some of my talk notes here.

These notes will probably make more sense if you’ve seen the talk, which was recorded for conference attendees (but is not currently publicly available).


Chaos & Resilience Engineering @ Secure360

I’m speaking at Secure360 on May 5, 2020, presenting an updated version of Chaos & Resilience Engineering. As I’ve done before, I won’t be posting copies of the slides. Instead, I’m posting an updated list of references from the talk here.

Note: this post includes some additional references (italicized) that are not in the final version of the talk.

My story is told in three acts and an ending: my journey to find chaos engineering (ACT I), chaos engineering and how resilience engineering complements it (ACT II), what I’ve learned so far (ACT III), and how to get started with chaos & resilience engineering (END).

ACT I: My Journey to Chaos Engineering

ACT II: Chaos & Resilience Engineering

ACT III: What I’ve learned so far

END: How to get started with chaos & resilience engineering