Information Safety

Improving technology through lessons from safety.

SRE and Security Aren't Safety... Yet

I’m a long-time listener of the Safety of Work Podcast, hosted by David Provan and Drew Rae, both safety scientists who have worked in industry. Very early on, when listening to the first episode, it struck me how much the podcast could be applied to cybersecurity and reliability - taking the first part of the transcript from Episode 0 and replacing “safety” words with “security”, we get:

“There’s a lot of philosophical arguments about what [security] is and how we achieve [security]. Ultimately, no matter how we define it, [security] is something that comes from operational work. People are kept [secure] or get [breached] because of how work is done where it’s done, who it’s done by, what’s it done with, and what’s it done to.

That’s something that is easy to lose sight of when we are doing [security] work. Most [security] practice - the stuff that [security] people do - is at least one step removed from the operational work itself. Managers and [security] practitioners don’t do the operational work. They try to influence it using a wide variety of [security] tools and practices.

That’s really what we are here to talk about, is talk about the tools, talk about different practices, and talk about the evidence of what works and what doesn’t work. Where things sometimes get mixed up is people start thinking of the tools and practices themselves and [security]. They get confused between the goal - keeping people [secure] - and the means to that end, which is also called [security].”

Which still works for both security and Site Reliability Engineering (SRE)! I’ve found that most episodes have lessons that directly translate to information risk practices (confidentiality, integrity, and availability). However, every so often there is an episode that doesn’t fit. Episode 111 is an example that shows how SRE and Security aren’t Safety, at least not yet.

Episode 111, “Are management walkarounds effective?” examines a common safety management practice that has no analog in either security or SRE, the leadership safety visit. For those not familiar, the practice is fairly self-explanatory: an organizational executive (think CEO, COO, or other senior leader not in safety) visits a site with a focus on safety, typically including a site inspection and safety conversations with workers. As with most episodes, Drew and David review a paper, this time one titled “The Effectiveness of Management‐By‐Walking‐Around: A Randomized Field Study.” (Open Access PDF) As far as I know, there is no comparable practice in either security or reliability - I’ve never experienced or heard of a senior technology leader (outside security) taking time to review and discuss security with front-line staff (well, maybe, more on that later).

At the end of the episode, Drew summarizes the answer to the question, “Are management walkarounds effective?” as “sometimes yes, sometimes no.” While the study didn’t show that implementing a specific safety walkaround program had a significant impact on staff perception of safety performance, there were still some interesting findings from the paper, which I would summarize as:

  • Leaders who took action on a safety problem raised during the walkarounds had a positive effect, whereas leaders who spent time prioritizing issues, or who later decided they weren’t able to take action, had a negative effect
  • Other studies showed a strong positive effect on the staff directly involved in the walkaround (those who spent more time with the leader)

So, how could we adapt this to security or SRE? Building on the key takeaways from the podcast, I would suggest:

  • Having senior technology leaders perform a walkaround in support of security or SRE can have a positive effect, especially on those directly engaged
  • When creating a leadership visit program, be deliberate about what you’re trying to influence - is it for the leaders to understand the work better, to support continuous improvement, something else, or some combination of goals?
  • It is important for leaders to listen, own, and take action in response to challenges and solutions raised by staff, instead of delegating responsibility - acting personally is what shows leadership commitment
  • Prioritization is less important than action - picking an issue and fixing it is more helpful than spending time deciding what to work on

As I mentioned, while I haven’t seen a walkaround in technology, I have seen the impact of a senior leader taking an active role in security and availability. In my case, a software development executive decided to make both security and availability a priority, and actively supported both by listening and, most importantly, taking action to improve organizational effectiveness. Hearing from him was more impactful than hearing from the CISO or head of infrastructure, especially for the developers on his team.

While SRE and security aren’t safety, we should strive to close the gap by adapting lessons from safety to technology.


Running Security like Finance

Lately I’ve been thinking about the role of the CISO and Security and how it compares to the CFO and Finance. It started with two simple questions: “Who is responsible for security?” and “Who is responsible for meeting your budget?”

I suspect that many people would answer the first question with “Security” or “the CISO” while few would say that Finance or the CFO are responsible for meeting the budget. Put more eloquently by my colleague Chris Brown,

“We don’t ask the CFO to make the company profitable, but we do ask the CISO to make the company secure.”

Why the difference? I believe that organizations understand that while Finance can set strategy and keep track of income and expenses, financial success is driven by everyone in the organization. We too need to recognize that it is the organization, not the security team, that creates security, or more directly: the CISO can’t make the company secure.

The need for change

While there is evidence that attitudes towards the cybersecurity team are changing, I believe recent regulatory actions will accelerate this change.

With the new SEC cybersecurity incident disclosure rule and action against the CISO of SolarWinds, I believe cybersecurity is having its Enron moment. Especially when the SolarWinds action was announced, I saw comments along the lines of “if the SEC can take action against a CISO for a breach, no one is safe,” which I think misses the point. Both the disclosure rule and the action against the SolarWinds CISO are measures to ensure proper reporting of security posture, much the same as how the Sarbanes-Oxley Act established rules for financial reporting with clear legal responsibilities for the CFO.

From the SEC press release: “In its filings with the SEC during this period, SolarWinds allegedly misled investors by disclosing only generic and hypothetical risks at a time when the company and Brown knew of specific deficiencies in SolarWinds’ cybersecurity practices as well as the increasingly elevated risks the company faced at the same time.”

These actions send a clear message to publicly traded companies and their CISOs: you must accurately report on cyber risks and controls. I sincerely hope and believe that this will help transform cybersecurity into a team sport at the CEO level. As the SEC complaint reveals, the company and CISO were incentivized to provide public assurances that security controls were effective despite internal discussions to the contrary, to avoid bad press and maintain investor confidence.

By taking this action, the SEC is creating a new incentive, and providing cover to CISOs to accurately present the security posture of publicly traded companies, both to the executive leadership team and the public. It is notable that the complaint lists the SolarWinds Chief Executive Officer, Chief Financial Officer, Chief Technology Officer, and Chief Information Officer under “other relevant persons and entities”, and alleges that “Brown failed to ensure that other senior executives were sufficiently aware of, or understood, the severity of cybersecurity risks, failings, and issues that he and others knew about.”

How to change (with metrics)

As I argued in Security Differently, part of the change is to shift from a top-down to a bottom-up approach, and acknowledge that everyone in the organization has a role to play in creating security. Like the CFO and Finance, the CISO and the Security organization can set strategy, but must also provide the organization and its departments with the equivalent of a budget - key metrics that provide useful and timely feedback on how security performance at all levels aligns to company goals.

Security metrics are hard. Entire books have been written about them. There was an active community dedicated to just metrics. And, as we know from safety, lagging indicators like security incidents aren’t a good metric, and many leading indicators are just measuring the work of security - audits and such. Thankfully, research over the past 10 years gives us some useful candidates.

Many leading indicators can be found outside of security; the DORA research program originally started by Nicole Forsgren has found that security both influences and is influenced by the four measures of DevOps performance: deployment frequency, lead time for changes, time to restore service, and change failure rate. Work by Stephen Magill and Gene Kim found that “most [open source] projects stay secure by staying up to date [on dependencies].” And multiple reports published by Cyentia support the notion that measures of DevOps performance and proactive refreshes of technology improve cybersecurity. Traditional quality metrics, like version currency (N, N-1) and how quickly bugs are resolved, are good leading indicators of success, and easily understood by technology teams.
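
To make the four measures concrete, here is a minimal sketch of how a team might compute them from its own deployment records. The `Deployment` record shape and the `four_measures` function are hypothetical, not something defined by the DORA program, and time to restore is approximated here as the gap between a failed deployment and the moment service was restored.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def four_measures(deployments, period_days):
    """Summarize the four measures over a reporting period of period_days."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    # Assumption: the failure is treated as starting at deployment time.
    restore_times = [
        d.restored_at - d.deployed_at
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deployment_frequency_per_day": len(deployments) / period_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


# Made-up sample data: four deployments over a two-day period.
start = datetime(2024, 1, 1, 9, 0)
sample = [
    Deployment(start, start + timedelta(hours=1)),
    Deployment(start, start + timedelta(hours=2)),
    Deployment(start, start + timedelta(hours=3), failed=True,
               restored_at=start + timedelta(hours=3, minutes=30)),
    Deployment(start, start + timedelta(hours=4)),
]
measures = four_measures(sample, period_days=2)
```

Medians are used rather than means so that one unusually slow change or long outage doesn't dominate the summary; that is a design choice for this sketch, not a requirement.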

Lagging indicators - like security incidents - can be reframed in terms of security performance. We have limited control over how often we’re exposed to security threats, but we have much more control over how we detect, respond, and recover. A measure of the percentage of security incidents that were detected and contained before major impact is a potentially useful metric.
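
As a rough illustration, the containment metric can be computed from an incident log in a few lines. The `Incident` fields and the sample data below are hypothetical, and what counts as "major impact" is something each organization would need to define for itself.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """One security incident (hypothetical record shape)."""
    name: str
    detected: bool                       # did our own controls detect it?
    contained_before_major_impact: bool  # was it contained before major impact?


def containment_rate(incidents):
    """Percentage of incidents detected and contained before major impact."""
    if not incidents:
        return None  # no incidents in the period, so the rate is undefined
    contained = sum(
        1 for i in incidents
        if i.detected and i.contained_before_major_impact
    )
    return 100.0 * contained / len(incidents)


# Made-up incidents for one reporting period.
quarter = [
    Incident("phishing click", detected=True, contained_before_major_impact=True),
    Incident("exposed test bucket", detected=True, contained_before_major_impact=True),
    Incident("malware on laptop", detected=True, contained_before_major_impact=True),
    Incident("stolen credentials", detected=False, contained_before_major_impact=False),
]
rate = containment_rate(quarter)
```

Returning `None` for a period with no incidents avoids reporting a misleading 0% or 100%.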

Finally, Cyber Risk Quantification (CRQ), itself over 10 years old, can help organizations make better decisions about security investments. While it can be labor intensive, estimating the risk reduction of a particular security investment in monetary terms is really the only way to fairly compare security spending against other possible projects.
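
The arithmetic behind that comparison can be sketched very simply. This example assumes single point estimates for how often a loss event occurs and how much it costs, which is a simplification; real CRQ work typically uses ranges or distributions. All function names and figures are made up for illustration.

```python
def annualized_loss(events_per_year, loss_per_event):
    """Expected yearly loss: event frequency times loss magnitude."""
    return events_per_year * loss_per_event


def net_benefit(loss_before, loss_after, annual_cost):
    """Risk reduction in monetary terms, minus what the control costs per year."""
    return loss_before - loss_after - annual_cost


# Hypothetical scenario: a control halves the expected frequency of a loss event.
before = annualized_loss(events_per_year=0.5, loss_per_event=2_000_000)
after = annualized_loss(events_per_year=0.25, loss_per_event=2_000_000)
benefit = net_benefit(before, after, annual_cost=150_000)
```

With these made-up numbers, the expected annual loss drops from 1,000,000 to 500,000, so a control costing 150,000 a year has a net benefit of 350,000, a figure that can be set side by side with the expected return of any non-security project.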

Change is hard

I firmly believe that security should be run more like finance. I must also acknowledge that this change is hard. There are more unsolved than solved problems, and security as a professional discipline is much younger than finance; after all, double-entry bookkeeping has been in use for over 500 years. We have much work to do, but would be well served by following the principles of Finance in Security.


On Deadlines

Getting rid of deadlines improves safety and security.

I was recently discussing the safety of Artificial Intelligence (AI) with a colleague. He sent me a link to an episode of the Medicine & Machine Learning podcast featuring Munjal Shah, the co-founder and CEO of Hippocratic AI. In our conversation, my colleague mentioned a quote from the podcast on deadlines:

“You can’t say you have a deadline and say you’re safety-first. One of those things is not true.”

This resonated with me for a specific reason. One of the most effective software engineering teams I’ve worked with was Core Engineering. As the name suggests, Core Engineering created “core” libraries and tools for other software engineering teams to use, like a standardized logging facility and components for authentication and authorization. (As at many large companies, there was value in creating components tailored to our environment.) This required writing code, and also high-quality documentation and examples to make the components accessible to a broad audience of developers of different skill levels and experience at the company.

The Core Engineering team was notable for a couple of key reasons. First, they were high performing in nearly all aspects of software development: their software was high quality, and the few bugs (including security flaws) that were discovered were fixed quickly; they had high levels of automation, including automated tests; they wrote excellent documentation; and they had effective leadership. Second, and more importantly, they typically did not have hard deadlines for the work they did. Because the software they built wasn’t directly used by clients or internal teams, they had a high level of control over the timing of the release of their software.

Hard deadlines

Why was the lack of hard deadlines so important to their performance? This changed the incentives for the team, giving them the time and space to focus on building it right - put another way, being safety-first.

Because the components the team built were used by many applications, a failure would have broad impact. Additionally, the team wasn’t dependent on client or internal stakeholder funds. Taken together, this meant that the cost and risk of missing a date were low, but the impact of an outage or security flaw was much higher. Almost by accident, the work environment favored prioritizing work to promote confidentiality, integrity, and availability, and so the team did.

Practical Advice

So, to create a safety-first environment, you can’t have deadlines. Just get rid of them and all will be good, right? If it were that easy…

One of the things I’ve learned from safety is that there are always conflicts and trade-offs to manage. Getting rid of deadlines is just not practical; there will always be situations where there is a very real cost of not completing work on time, which in some cases can be quite large (compliance with a new government regulation comes to mind).

So what can be done? This is where safety and security professionals can help, by understanding and reducing the goal conflicts and supporting decisions to prioritize safety or security ahead of deadlines when the risks are too great. Simply being mindful of this conflict can help manage it - leaders can create time and space for teams to prioritize safety and security as best they can, which may be less when the team is under a deadline, but can be more once the deadline has passed. Put another way, they can make the choice to take on security tech debt in the short term to meet an important client obligation and pay it back by raising the priority of security when the production pressures are reduced.

While this won’t always work, acknowledging that the conflict exists and that it will never be fully resolved is a good start.
