Information Safety

Improving technology through lessons from safety.

What is Resilience Engineering?

Last August, I took on a new role at my company and changed my title to Resilience Engineer. That leads to an obvious question: what is Resilience Engineering?

Resilience Engineering (RE) as a concept emerged from safety science in the early 2000s. While the oldest reference to “Resilience Engineering” appears to be a paper written by David Woods in 2003 [1], the most-cited work is the book Resilience Engineering: Concepts and Precepts, a collection of chapters from the first Resilience Engineering symposium in 2004 [2]. In that book and in subsequent publications, there have been many definitions of RE. This post is my attempt to succinctly define Resilience Engineering as I practice it, which is:

Resilience Engineering is the practice of working with people and technology to build software systems that fail less often and recover faster by improving system performance.

Let’s break that definition down further:

Resilience and Resilience Engineering

Resilience is a concept from ecology that describes a system’s ability to dynamically withstand and recover from unexpected disruptions, rather than maintain a predictable, static state [3]. Whereas resilience in ecological systems is the result of the interplay between variability and natural selection, Resilience Engineering seeks to achieve the same results through deliberate management of the variability of performance:

“Since both failures and successes are the outcome of normal performance variability, safety cannot be achieved by constraining or eliminating that. Instead, it is necessary to study both successes and failures, and to find ways to reinforce the variability that leads to successes as well as dampen the variability that leads to adverse outcomes.” [4]

As both definitions make clear, resilience isn’t achieved through stability; rather, it is achieved through variability.

Working with people and technology

Systems safety recognizes that people are an integral part of the system; one can’t talk about aviation safety without talking about the technology of the plane and air traffic control, the people (the pilots and controllers), and the interplay of the people and the technology. Similarly, the software systems I work with consist of the code, the machines running the code, and the people who write and maintain the code. The software engineers and the systems they build comprise a sociotechnical system, with both technological/process and social/psychological components.

Further, while technology can’t be ignored, beyond a baseline level of technology, people are the main contributor to resilience or lack thereof; most advances in aviation safety over the past 50+ years have come from human factors research, and it is not by accident that safety science is usually part of the psychology department. For this reason, I focus my efforts on people, and the relationship between people and technology.

Systems that fail less often and recover faster

‘Systems that fail less often and recover faster’ is an over-simplification of resilience, but that statement accurately describes the value proposition of Resilience Engineering in technology; organizations are increasingly reliant on software systems, to the point where software has become safety-critical. We have come to expect that our software systems just work, so that failures are infrequent and systems (the software and the people together) are able to recover from unexpected disruptions quickly.

This is a distinctly different goal from ecological resilience: it isn’t enough to build systems that simply survive; they also need to be productive. This is a challenge unique to Resilience Engineering, as it requires both limiting and encouraging variability.

Improving system performance

For me, the key to understanding Resilience Engineering is HOW to achieve resilience. Historically within technology, security and operations have sought to prevent failures (outages, breaches) through centralized control, which does work, but suffers from limitations that RE seeks to overcome [5]. The shift in approach starts with the premise that we can’t have a science of non-events, a science of accidents that don’t happen [6]. Safety-II (an alternative to traditional ‘Safety-I’) proposes that resilience is the result of factors that make things go right more often - working safely, something that can be studied. Under this model, there is no safety-productivity tradeoff, since improving outcomes leads to improvements in both productivity and resilience.

The work of the DevOps Research and Assessment group at Google demonstrates this concept within software: as organizations improve performance (deployment frequency and lead time for changes) they also improve resilience (time to restore service, change failure rate) [7]. I’ve found that this approach works more generally and, through RE, I seek to help teams improve their performance and help leaders improve performance between teams by managing organizational factors.
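To make those four measures concrete, here is a minimal sketch of how they might be computed from a team's deployment history. It is an illustration only, not how the DORA report collects its data (which is survey-based): the Deployment record, its field names, and the four_key_metrics function are all hypothetical, and the sketch assumes time to restore is measured from the failed deployment to the moment service was restored.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Dict, List, Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did the change degrade service?
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def four_key_metrics(deployments: List[Deployment], period_days: int) -> Dict[str, object]:
    """Summarize deployment records into the four measures named in the text."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    # Assumption: restore time runs from the failed deployment to restoration.
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at is not None
    ]
    return {
        # Performance measures
        "deploys_per_day": len(deployments) / period_days,
        "median_lead_time": median(lead_times),
        # Resilience measures
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


# Made-up sample data: four deployments over two days, one of which failed.
day = datetime(2020, 9, 1)
sample = [
    Deployment(day, day + timedelta(hours=2)),
    Deployment(day, day + timedelta(hours=4)),
    Deployment(day, day + timedelta(hours=6), failed=True,
               restored_at=day + timedelta(hours=7)),
    Deployment(day, day + timedelta(hours=8)),
]
metrics = four_key_metrics(sample, period_days=2)
for name, value in metrics.items():
    print(name, value)
```

Keeping the performance and resilience measures in one summary makes it easy to watch them move together over time, which is the relationship the report describes.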

Other Perspectives

Resilience Engineering is a diverse space and there is a small but growing group of practitioners and researchers that are applying it to software systems. Two notable groups are the Resilience Engineering Association and the Learning From Incidents community. I’ve also recently discovered the work of Dr Drew Rae and Dr David Provan through their Safety of Work podcast. Their paper on Resilience Engineering in practice is aimed at traditional safety professionals but I’ve found its ideas easily adapted to software systems.

As a practitioner-researcher myself, I’m hoping to adapt and apply the science to software systems, to improve the profession as well as contribute to the collective knowledge of Resilience Engineering.

Future Articles

Update: I’ve been asked to elaborate on the ideas behind Resilience Engineering, so I’ve added this section to cover a plan for future articles on the topic:

  • The origins and history of Resilience Engineering
  • Parallels between Cybersecurity, Operations, and Safety
  • Is DevOps culture High Reliability culture?
  • My research in software systems resilience

Updates and links will be posted here.

  1. Woods, D., & Wreathall, J. (2003). Managing Risk Proactively: The Emergence of Resilience Engineering. https://www.researchgate.net/publication/228711828_Managing_Risk_Proactively_The_Emergence_of_Resilience_Engineering 

  2. Hollnagel, E., Woods, D. D., & Leveson, N. (2006). Resilience Engineering: Concepts and Precepts. Ashgate.

  3. Holling, C. S. (1973). Resilience and stability of ecological systems. Annual Review of Ecology & Systematics, 4, 1-23. https://doi.org/10.1146/annurev.es.04.110173.000245

  4. Hollnagel, E. (2008). Preface: Resilience Engineering in a Nutshell. In E. Hollnagel, C. P. Nemeth, & S. Dekker (Eds.), Resilience Engineering Perspectives, Volume 1: Remaining Sensitive to the Possibility of Failure (pp. ix-xii). Ashgate.

  5. Provan, D. J., Woods, D. D., Dekker, S. W. A., & Rae, A. J. (2020). Safety II professionals: How resilience engineering can transform safety practice. Reliability Engineering & System Safety, 195, 106740. https://doi.org/10.1016/j.ress.2019.106740 

  6. Hollnagel, E. (2014). Is safety a subject for science? Safety Science, 67, 21-24. https://doi.org/10.1016/j.ssci.2013.07.025 

  7. Forsgren, N., Smith, D., Humble, J., & Frazelle, J. (2019). 2019 Accelerate State of DevOps Report. DORA & Google Cloud. https://research.google/pubs/pub48455/ 


Working with R

Around the time of SIRAcon 2020, I decided to start using R. I needed a data analysis tool that would allow me to conduct traditional statistical analysis, and I wanted a tool that would be valuable to learn and one that would allow me to do exploratory analysis as well. Originally I considered SPSS (free to students) and RStudio. The tradeoffs between the two were pretty clear: SPSS is very easy to use, but expensive, proprietary, and old. RStudio and R have a tougher learning curve, but are free and open source, under active development, and have a large online community. After reading a thread on the SIRA mailing list, I was leaning towards R, and re-watched Elliot Murphy’s 2019 SIRAcon presentation on using notebooks, which led me to consider both R Markdown and Python Jupyter Notebooks. I did more searching and reading, and finally settled on R Notebooks for a few reasons: they are more disciplined (no strange side effects from running code out of order), have fewer environment problems, are backed by the RStudio company, produce better visualizations, and, simply, R is the more data-sciency language.

The SIRA community was quite supportive of this idea when I asked for suggestions on getting started in the BOF session, and recommended Teacup Giraffes and Tidy Tuesday for learning R, and on my own I found RStudio recommendations. Of course, being a sysadmin at heart, I set out to figure out how exactly to best install R and RStudio, and manage the notebooks in git.

Installation on macOS was easy enough: just brew install r and brew cask install rstudio. GitHub published a tutorial in 2018 on getting RStudio integrated with GitHub, and I started working on that. Quickly I discovered that while the tutorial was helpful, it wasn’t quite the setup I wanted; it published R Markdown through GitHub Pages, but wouldn’t directly support the automatically generated HTML of R Notebooks. Side note: the consensus was to use html_notebook as a working document, and html_document to publish. After more searching, I was able to get Notebooks working on GitHub using the method described in rstudio/rmarkdown #1020: checking the .nb.html file into git, and using GitHub Pages so that you can view the rendered HTML instead of just the HTML code.

Working through this, I noted that RStudio is quite good at automatically downloading and installing packages as needed; it triggered installation of rmarkdown and supporting packages when creating a new R Notebook, and also readr when importing data from CSV. That got me thinking: what about package management? While it seems that R doesn’t have the level of challenge posed by Python or Ruby, managing packages on a per-project basis is a best practice I learned from using Bundler to manage the code of this site (the only gem I install outside a project is bundler). So I went looking for the R equivalent…

I first found Packrat and then its replacement, renv (Packrat is maintained, but all new development has shifted to renv). Setting it up is as simple as install.packages("renv") and renv::init(), and RStudio has published:

This left one final question: how exactly to install R? Homebrew itself offers two methods: install the official binaries using brew cask install r, or just brew install r. Poking around further, I found that the cask method was sub-optimal, as it installs in /usr/local, which causes issues with brew doctor. Interestingly, I also found that Homebrew’s R doesn’t include all R features, but the same author, Luis Puerto, offered a solution to install all the things. I haven’t tried it yet, but I may go with homebrew-r-srf as suggested by Luis (or a fork of it).

What’s next? At some point I plan to try to integrate GitHub Actions for testing, and use it to create a CI/CD pipeline of sorts for Pages. And, of course, actually using R for data analysis…

Update: I tested homebrew-r-srf, and am going with Homebrew’s r. There was some weirdness with the install/uninstall (/usr/local/lib/R left over), I don’t know if I’ll need the optional features, and Homebrew’s r now uses openblas. If I find I actually need any of the missing capabilities, I’ll likely write my own formula.


CONOPS (Concept of Operations)

I recently came across a posting on Design Docs at Google. I was struck by the similarities between the design document, as described in the article, and a Concept of Operations (CONOPS). Traditionally, CONOPS are used primarily by the military for very large and costly projects, such as the design of a new Coast Guard cutter. They are created prior to official design documentation and mainly serve to satisfy project requirements, so they are not something you’d expect a modern software organization like Google to use.

In my own work, I’ve come to believe that a shared mental model of the application or service the team is building is essential for reliability and resilience, and there is research that suggests an agile CONOPS can help develop a shared mental model amongst stakeholders, by using visualization, models, and system thinking. My own brief experiment with CONOPS found that creating a visual diagram was most valuable; the formal CONOPS outline, defined in IEEE Standard 1362, was less useful.

What’s interesting about the Google Design Doc is that it includes important elements of the CONOPS. The article identifies the following functions of the design document (emphasis mine):

  • Early identification of design issues when making changes is still cheap.
  • Achieving consensus around a design in the organization.
  • Ensuring consideration of cross-cutting concerns.
  • Scaling knowledge of senior engineers into the organization.
  • Form the basis of an organizational memory around design decisions.
  • Acts as a summary artifact in the technical portfolio of the software designer(s).

It’s notable that four of the six functions relate to development of a shared mental model of the system being built - across the engineering organization, with security & privacy, senior engineers, and for posterity. Additionally, I argue that many of the features described would also be found in a good CONOPS: Goals and Non-Goals, visual diagrams, and existing constraints. Unsurprisingly, the post also recommends making the document only as long as needed, avoiding an ‘implementation manual’, and iterating.

I’d agree with all of that, and would also suggest one additional lesson from well-written CONOPS: adding operational scenarios, as included in the CPC CONOPS mentioned earlier, can be an effective tool for helping people understand what’s being proposed, and how the designers envision it being used. Having specific narratives helps ‘make it real’, and makes implicit assumptions more explicit.

Bottom Line: whether you call it a CONOPS or a design document, creating a high-level description of what you’re planning to build, without getting into the weeds, is an underutilized but effective way to build better software systems. Focus on visualization and creating a common mental model for the organization (including our future selves), iterate, and consider using scenarios to help build understanding.
