Information Safety

Improving technology through lessons from safety.

Security Differently

For a while, I’ve been seeing evidence that cybersecurity, especially traditional security, has stagnated: adding security controls hasn’t appreciably improved outcomes, and we continue to struggle with basic problems like vulnerabilities. (As Cyentia discovered and reported in its Prioritization to Prediction series, organizations of all sizes fix only about 10% of their vulnerabilities in any given month.)

Many organizations have accumulated 20+ years of security policies, standards, and controls, without significantly removing rules that may no longer be needed, and organizations of all sizes continue to experience security breaches.

Safety faced a similar problem 10-15 years ago. Safety scientists and practitioners saw that safety outcomes were stagnant and looked for new approaches. One of these, Safety Differently, was created by the notable safety science academic Sidney Dekker in 2012. It was part of the emerging acknowledgement that the traditional method of avoiding accidents through policies, procedures, and controls was no longer driving improvements in safety.

Safety Differently argues that three main principles drive traditional thinking:

  • Workers are considered the cause of poor safety performance. Workers make mistakes, they violate rules, and they ultimately make safety numbers look bad. That is, workers represent a problem that an organization needs to solve.
  • Because of this, organizations intervene to try and influence workers’ behavior. Managers develop strict guidelines and tell workers what to do, because they cannot be trusted to operate safely alone.
  • Organizations measure their safety success through the absence of negative events.

Safety Differently advocates a switch from a top-down to a bottom-up approach, adopting new principles:

  • People are not the problem to control, they are the solution. Learn how your workers create success on a daily basis and harness their skills and competencies to build a safer workplace.
  • Rather than intervening in worker behavior, intervene in the conditions of their work. This involves collaborating with front-line staff and providing them with the right tools and environment to get the job done safely. The key here is intervening in workplace conditions rather than worker behavior.
  • Measure safety as the presence of positive capacities. If you want to stop things from going wrong, enhance the capacities that make things go right.

What does this have to do with cybersecurity? I believe that we’re seeing the same thing in security: historically, we’ve focused on constraining worker behavior to prevent cybersecurity breaches, and the limits of that approach are becoming increasingly clear. Adapting concepts from Safety Differently offers a solution, by supporting success and focusing on positive capacities: Security Differently.

Adopting Security Differently

In practical terms, what would adopting Security Differently look like? The Safety Differently Movie provides good insights into how this would apply to security and evidence of its effectiveness:

  1. Most importantly, the organization’s top leadership must take responsibility for security. Since security performance can’t be separated from organizational performance, security can’t be “the CISO’s problem” or even “the CIO’s problem.” A key part of this shift is acknowledging that it is our workers - not our security team - who create security.

  2. A clear shift in ownership of security performance to the Operations and Engineering teams. As I argued in a 2021 talk, many positive security outcomes are well within the capabilities of Technology organizations. One example is vulnerability management, which is solved through proactively updating software and refreshing technology - something all technology teams can do.

  3. Likely, a much smaller security team. The head of health and safety for Origin noted that he cut the size of his team from 20-30 people to 5 when his team gave up safety performance management. This doesn’t necessarily mean that security spending is significantly reduced, given the shift in ownership of security performance.

  4. A focus on positive measures of security performance (Security Metrics). A security team is still needed to measure security outcomes, like successfully defending against an attack, as well as measures that have been shown to contribute to success, like security updates and secure configuration.

  5. A significant reduction of security policies and procedures, along with training on Security Differently concepts. The Australian grocery chain Woolworths found that the combination of eliminating national safety procedures and training in Safety Differently led to the best outcomes: fewer accidents in the store, along with the highest levels of safety ownership and engagement.

  6. Asking, not telling. While ownership shifts to the Technology team, security expertise is still needed to coach and support security performance - advising developers on how to fix security bugs - and to develop new capacities to address novel threats (the SolarWinds attack is an example). Simply asking teams, “what do you need to be secure?” is a key part of improving their performance.
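One of the positive measures mentioned above, security updates, could be tracked as the share of systems updated recently rather than as a count of open vulnerabilities. A minimal sketch in Python; the inventory, hostnames, and 30-day currency window are all hypothetical illustrations, not from any specific program:

```python
from datetime import date

# Hypothetical inventory: the last time each system's software was updated.
last_updated = {
    "web-1": date(2022, 5, 20),
    "web-2": date(2022, 4, 1),
    "db-1":  date(2022, 1, 15),
}

def update_currency(inventory, today, window_days=30):
    """Positive measure: fraction of systems updated within the window."""
    current = sum((today - d).days <= window_days for d in inventory.values())
    return current / len(inventory)

print(f"{update_currency(last_updated, date(2022, 6, 1)):.0%} of systems are current")
```

A rising currency percentage signals a capacity that makes things go right (proactive updating), which is the kind of measure Security Differently favors over counting negative events.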

In an organization that has fully adopted Security Differently, top leadership (CEO/CIO) sets security goals, while the Security team keeps score through evidence-based metrics aligned to those goals, provides expertise and support to achieve them, and develops or acquires new defenses when new threats emerge (in practice, this happens infrequently).

Importantly, Security Differently is not a cost; rather, it is an investment in improved organizational performance. Changing the focus from preventing bad outcomes to creating positive outcomes and developing organizational capacities not only improves security, but also improves quality, engagement, and overall organizational performance. (And it reduces incident response costs!) Evidence of this effect can be found in the DORA research, which can be summarized as “performance begets performance”: the technical capabilities of DevOps, including shifting left on security, improve software delivery performance and ultimately organizational performance.

Adopting Security Differently can improve both efficiency and outcomes, much like the traffic experiment from Safety Differently: when traffic engineers removed traffic controls from a key mixed-use intersection in Drachten, they forced people to take greater responsibility for safety. What looked riskier on the surface was much safer, reducing annual accidents from 10 to 1 while also eliminating gridlock.

In a future article I will continue to explore the idea of reimagining the role of security through related work in safety.


The Definitive Introduction to the DORA Research

I’ve spent a good deal of time over the last three years studying software delivery performance, both learning from the work of Nicole Forsgren and the DevOps Research and Assessment (DORA) team at Google, as well as conducting my own research. I’ve often needed to explain the research to others, especially in the context of the “four metrics”, and set out to write this, the definitive introduction to the research (well, at least my definitive version).

DevOps Metrics

The research program that is now run by the DORA team originated in 2013, when Nicole Forsgren, a PhD researcher, joined two early DevOps champions, Jez Humble and Gene Kim, to work on the Puppet 2014 State of DevOps Report. The team combined practical experience with the rigor of academic research to create a report that established a causal relationship between specific DevOps practices and organizational performance, as measured by three key metrics, which were later expanded to four. The key metrics, along with their definitions taken from the DORA Quick Check, are listed below:

  • Deployment frequency: For the primary application or service you work on, how often does your organization deploy code to production or release it to end users?
  • Lead time for changes: For the primary application or service you work on, what is your lead time for changes (that is, how long does it take to go from code committed to code successfully running in production)?
  • Time to restore service: For the primary application or service you work on, how long does it generally take to restore service when a service incident or a defect that impacts users occurs (for example, unplanned outage, service impairment)?
  • Change failure rate: For the primary application or service you work on, what percentage of changes to production or releases to users result in degraded service (for example, lead to service impairment or service outage) and subsequently require remediation (for example, require a hotfix, rollback, fix forward, patch)?

DevOps and Performance

The findings of the DORA research can be summarized succinctly as: performance begets performance. A visual map of the program shows all of the predictive relationships the team has discovered: many technical, cultural, management, and leadership practices associated with the DevOps movement have been shown to improve Software Delivery Performance (as measured by the four metrics), and ultimately to improve organizational performance. This is the key finding of the body of work: organizations that improve their software delivery performance improve both their commercial performance (profitability, market share, and productivity) and non-commercial performance (quantity of goods and services, operating efficiency, customer satisfaction, quality of products or services, and achieving organization or mission goals).

Over time, the program has investigated and identified additional practices that predict improved performance, and added a “fifth metric”, Reliability: the degree to which a team can keep promises and assertions about the software they operate, which includes availability, latency, performance, and scalability. The 2021 Accelerate State of DevOps Report calls this metric “[the] ability to meet or exceed their reliability targets”; expressed another way, this could be measured as how well the organization meets their Service Level Objectives.

It is important to stress that the factors that improve performance extend beyond the technical practices typically thought of as “DevOps”, including CI/CD (Continuous Integration/Continuous Delivery). Many of the factors are cultural, including softer concepts like Trust, Voice, and Autonomy, and some factors are self-reinforcing: for example, Software Delivery Performance predicts improved Lean Product Management, and improved Lean Product Management predicts improved Software Delivery Performance. A central theme is a leadership focus on creating a supportive culture and environment, while allowing teams significant delegated authority in making decisions about the software they build and support. In my own research studying DevOps adoption and performance, I identified that the organizational system can have a significant impact on team performance: teams can be constrained by mandatory enterprise practices, such as change management.

Measuring Performance

Starting in 2015, the DORA researchers have reported on the profiles of “Low”, “Medium”, “High”, and sometimes “Elite” organizations. Using cluster analysis, the team identified data-driven categories of performance. These categories serve as useful benchmarks, and show how the metrics relate to each other:

Metric | Elite | High | Medium | Low
Deployment frequency | On-demand (multiple deploys per day) | Between once per week and once per month | Between once per month and once every six months | Fewer than once per six months
Lead time for changes | Less than one hour | Between one day and one week | Between one month and six months | More than six months
Time to restore service | Less than one hour | Less than one day | Between one day and one week | More than six months
Change failure rate | 0%-15% | 16%-30% | 16%-30% | 16%-30%

Categories of performance from 2021 Accelerate State of DevOps Report
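To illustrate how these bands could be applied as a benchmark, here is a hypothetical helper that maps a team’s deployment cadence onto the 2021 categories. Note that this is only a sketch: the report derives its categories from cluster analysis across all four metrics, not from simple thresholds on one, and the cut-off values below are my own reading of the table:

```python
def classify_deploy_frequency(deploys_per_year: float) -> str:
    """Rough mapping of deployment frequency onto the 2021 categories (sketch)."""
    if deploys_per_year >= 365:  # on-demand: multiple deploys per day
        return "Elite"
    if deploys_per_year >= 12:   # between once per week and once per month
        return "High"
    if deploys_per_year >= 2:    # between once per month and once every six months
        return "Medium"
    return "Low"                 # fewer than once per six months

print(classify_deploy_frequency(52))  # weekly deploys
```

A team deploying weekly (52 per year) would land in the “High” band by this reading.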

The clusters highlight a larger theme: higher-performing organizations perform better across all measures of performance, which extends beyond the four metrics; for example, organizations that do better at meeting reliability targets and shifting left on security have higher software delivery performance as measured by the four metrics.

It is also notable that there has been variability in these categories over time. In the prior 2019 Accelerate State of DevOps Report (no report was produced in 2020), the profiles were:

Metric | Elite | High | Medium | Low
Deployment frequency | On-demand (multiple deploys per day) | Between once per day and once per week | Between once per week and once per month | Between once per month and once every six months
Lead time for changes | Less than one day | Between one day and one week | Between one week and one month | Between one month and six months
Time to restore service | Less than one hour | Less than one day | Less than one day | Between one week and one month
Change failure rate | 0%-15% | 0%-15% | 0%-15% | 46%-60%

Categories of performance from 2019 Accelerate State of DevOps Report

I find the 2019 profiles to be more useful benchmarks, at least for my work comparing team performance within a larger organization, as the relationship between the metrics is clearer, and fits better with my own experience of team performance.

The DORA group uses survey research to measure software delivery performance, out of necessity: obtaining and comparing direct data across organizations is impractical. However, it is feasible to implement partially or fully automated collection of these metrics within an organization (as I have done). One way of doing so is by collecting data each time code is deployed, using the code deployment automation itself. Writing a log of when each deployment occurred along with the application or service, a calculation of lead time measuring the difference between deployment time and when code was committed, and the type of deployment (normal or hotfix) allows calculation of three of the four metrics over time: Deployment frequency, Lead time for changes, and Change failure rate. Time to restore service can be measured as part of the incident (outage) response process, ideally in an automated way, such as pulling data from the trouble ticket system.
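A minimal sketch of that deployment-log approach in Python; the log records, field names, and 30-day window are illustrative assumptions, not DORA’s implementation:

```python
from datetime import datetime, timedelta

# Hypothetical deployment log written by the deployment automation, as
# described above: deploy time, commit time, and deploy type per record.
deployments = [
    {"service": "web", "deployed": datetime(2022, 6, 1, 10), "committed": datetime(2022, 5, 31, 16), "type": "normal"},
    {"service": "web", "deployed": datetime(2022, 6, 2, 11), "committed": datetime(2022, 6, 2, 9), "type": "normal"},
    {"service": "web", "deployed": datetime(2022, 6, 2, 15), "committed": datetime(2022, 6, 2, 14), "type": "hotfix"},
]

window_days = 30  # measurement window for the log above

# Deployment frequency: deploys per day over the window.
frequency = len(deployments) / window_days

# Lead time for changes: median time from commit to running in production.
lead_times = sorted(d["deployed"] - d["committed"] for d in deployments)
median_lead_time = lead_times[len(lead_times) // 2]

# Change failure rate: share of deploys that required remediation (hotfixes).
failure_rate = sum(d["type"] == "hotfix" for d in deployments) / len(deployments)

print(f"{frequency:.2f} deploys/day, median lead time {median_lead_time}, CFR {failure_rate:.0%}")
```

Time to restore service would come from a separate incident data source, joined on the affected service, as noted above.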

How to Improve

So, how do you improve software delivery performance? The simple answer is “adopt all the practices that the research shows improve performance”, but how do you get started? In her 2017 talk The Key to High Performance: What the Data Says, Forsgren cites a specific example: “By focusing on trunk-based development and streamlining their change approval processes, Capital One saw stunning improvements in just two months,” with a 20x increase in releases, some applications deploying to production 30+ times a day, and no increase in incidents. In the same talk, she offers some general advice: “It depends,” and suggests looking at decoupling architecture, adopting a lightweight change approval process, and full continuous integration. My take is that organizations should adopt the DevOps ways of working first, to support cultural change.

Regardless of how, make sure to measure: measuring outcomes using the four metrics will help identify opportunities to improve and measure improvement over time.

References

DevOps Research & Assessment. Explore DORA’s research program. https://www.devops-research.com/research.html

Forsgren, N., Humble, J., & Kim, G. (2018). Accelerate: The science behind DevOps: Building and scaling high performing technology organizations (1st ed.). IT Revolution.

Forsgren, N., Kim, G., Kersten, N., & Humble, J. (2014). 2014 State of DevOps Report. Puppet Labs, IT Revolution Press, ThoughtWorks. https://nicolefv.com/resources

Google. (2020). DORA DevOps Quick Check. https://www.devops-research.com/quickcheck.html

Smith, D., Villalba, D., Irvine, M., Stanke, D., & Harvey, N. (2021). 2021 Accelerate State of DevOps Report. Google Cloud. https://cloud.google.com/devops/state-of-devops/

Beyer, B., Jones, C., Petoff, J., & Murphy, N. R. (2016). Site reliability engineering: How Google runs production systems (1st ed.). O’Reilly. https://landing.google.com/sre/sre-book/toc/index.html

Benninghoff, J. (2021). A cross-team study of factors contributing to software systems resilience at a large health care company [Master’s thesis, Trinity College Dublin]. Ireland.

2015 State of DevOps Report. (2015). Puppet Labs, IT Revolution. https://nicolefv.com/resources

Forsgren, N., Smith, D., Humble, J., & Frazelle, J. (2019). 2019 Accelerate State of DevOps Report. DORA & Google Cloud. https://research.google/pubs/pub48455/

Forsgren, N. (2017). The Key to High Performance: What the Data Says. https://www.youtube.com/watch?v=RBuPlMTXuFc&t=25s


Secure360 2022

A couple of weeks ago, I spoke at Secure360 2022! My talk, “What Safety Science taught me about Information Risk” was an updated version of my SIRAcon 2021 talk (available in the members area at https://www.societyinforisk.org).

Session Description

Two years of study and research has changed how I see risk. Safety science taught me that improving performance is the key to managing risk, and studying successes is the key to risk analysis. The ‘New School’ of safety argues that you can’t have a science of non-events; safety comes through being successful more often, not failing less. Research in DevOps, Software Security, and Security Programs shows a strong link between general and security performance. In many (but not all) cases, organizations most effectively reduce cybersecurity risk by improving general performance, not by improving one-dimensional security or reliability performance.

This talk presents a new model for security performance that informs how we can maximize the value of our security investments, by focusing on improving existing organizational capabilities, or creating new ones in response to new and emerging threats where general performance falls short. It reviews the theory that improving performance improves safety, how that relates to cybersecurity risk, evidence from my own and others’ research that supports this theory, and how it can be used to analyze and manage risk more effectively.

Talk

The talk is broken down into three sections, and covers both the theory as well as how to apply the theory to best improve security performance.

  • Assumptions backed by accepted theory
    • Assumption 1: organizations are sociotechnical systems
    • Assumption 2: all failures are systems failures
  • Arguments for a new theoretical model backed by evidence
    • Argument 1: resilience improves through performance
    • Argument 2: security performance is correlated with general performance
  • Implications of the model for information risk management: optimize risk management based on your performance mode
    • Mode 1: improve general performance
    • Mode 2: add security enhancements to general performance
    • Mode 3: create security-specific systems
    • Guided Adaptability
    • Work against the adversary

Overall, I think the talk went better than I expected. While the theory supports some potentially controversial conclusions, like “retire your vulnerability management program”, I had good engagement from the audience, ran out of time for questions, and spent some time afterwards talking with a few attendees in the hall.

I got the survey results back pretty quickly. Only 9 people responded, which was maybe 10-20% of the audience (I’m not a good judge of crowd size), but those responses were very positive, with ~90% of attendees saying they would attend my future talks. My weakest score was for “I am interested in hearing more of this topic,” which scored just below “agree”.

Slides

My slides with notes, including references, are here.

Slides from all presenters at Secure360 (who provided them) are available here, and most of my past talks and security blog posts are available at https://transvasive.com.
