Monday, May 30, 2016

Understanding risk

Throughout my career, I have tried to take a risk-based approach to actions and decision-making. This has become more important to me as a CIO. Understanding risk is an important part of driving change. If you understand the risks, you can decide which risks you can accept and which you cannot. In doing so, you avoid "analysis paralysis" where you continually evaluate options without actually choosing a direction. By taking a risk-based approach, some options become obvious.

I find I turn to a few simple models to understand and evaluate risk. My standard model is based on Likelihood and Impact. I often go to this model when considering system risk. Which are your riskiest systems? Consider these dimensions:

How likely will this system experience a problem?

As an example, consider a computer system that is supported by only one server, under someone's desk, running on building power, and without a backup. That system has a High Likelihood of failure. If you find a server like this in your organization, you should be very nervous, because its failure is a matter of when, not if.

In contrast, if you have a computer system that is supported by multiple servers in parallel, running in different data centers, using redundant power and cooling, with multiple levels of backups, then that system has a Low Likelihood of failure. I usually don't worry about these systems.

If the system does fail, what's the impact to the organization? This might depend on what the system does, taking into account exposed private data or the reputation of the organization.

Let's say you had a database that tracked when public benches were last repainted. Maybe your facilities department uses this information to know when to schedule a touch-up job or a complete repaint of the bench. If you lost this data, the organization wouldn't be impacted that much. The facilities folks would have to re-evaluate the benches. With many benches, this would take a while, but the organization would otherwise continue uninterrupted.

On the other hand, if your HR database of employees and salaries was irretrievably lost, your organization would be severely impacted. Someone in HR might be able to reconstitute the data from other sources, but it wouldn't be perfect. And depending on your industry, this is probably private data. Your organization could face damages and fines for the loss of salary information.

The Likelihood and Impact feed into a risk matrix. I prefer to color code the matrix with red as the highest risk, yellow as moderate risk, and green as the lowest risk:
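As a rough illustration, a matrix like this can be written as a small lookup. This is only a sketch: the function name `risk_color` and the exact boundaries between red, yellow, and green are assumptions, and your organization may draw them differently.

```python
# Hypothetical sketch of a Likelihood/Impact matrix.
# The color boundaries are an assumption, not a fixed standard.
LEVELS = ("Low", "Moderate", "High")


def risk_color(likelihood: str, impact: str) -> str:
    """Return 'green', 'yellow', or 'red' for a Likelihood/Impact pair."""
    # Each level counts 0 (Low), 1 (Moderate), or 2 (High); the sum is 0..4.
    score = LEVELS.index(likelihood) + LEVELS.index(impact)
    if score >= 3:
        return "red"      # High/High, or High paired with Moderate
    if score == 2:
        return "yellow"   # Moderate/Moderate, or High paired with Low
    return "green"        # Low/Low, or Low paired with Moderate


print(risk_color("High", "High"))
print(risk_color("Low", "Low"))
```

An unrecognized label such as "Medium" raises a `ValueError`, which is one way to keep everyone using the same agreed definitions.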


For other risk scenarios, you might also include a "critical" risk, such as when loss of life is possible:

I use a similar risk matrix when I need to understand the risk of making a change. Not all changes carry the same risk. I know from my early career as a systems administrator that you can make certain changes to a running system without worrying too much. But other changes require more attention. Again, consider the Likelihood that a system change will result in a problem, and the Impact that problem would have on the business.

[Figure: change risk matrix. Cell text from the image: "Have a backout plan", "Consider timing", "Remediation plan", "Consider carefully", "Have management support!", "Probably a standard change", "Just do it", "Lowest risk", "Talk to others", "Evaluate the timing", "Consider knockdown effects".]

But when considering the risk of business applications, you might need a different risk matrix. Should you continue to run that business application? Does it really provide value? Is it reliable? These questions feed into a different matrix that helps you decide what to do with business systems.

[Figure: business application matrix, labeled "Business value".]

At every level in an organization, you need to understand risk. With a method to evaluate risk and make risk-based decisions, you can quickly move to a decision. Some options are obvious; others will require careful consideration and coordination with others.

Consider leading an exercise to explore the risk in your organization. I find it is helpful to start with common definitions. Jot down some examples of what defines high/moderate/low Impact, and high/moderate/low Likelihood. Define what you mean by "high business value" or "low stability." Get buy-in on these definitions from the key stakeholders, then work with them to place applications or systems in each box. Don't worry about relative placement within the box. If you find yourself saying "This is a Moderate Likelihood, but it's on the high end of Moderate," then just put it in "Moderate" and move on. The most important factor isn't that the Likelihood is "Moderate" but what you do afterwards to reduce the risk.
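If you want to record the results of the exercise, a simple grouping is enough. The sketch below is a minimal example: the function name `place_in_boxes` and the system names are made up for illustration. It deliberately stores only the box, not any position within it.

```python
from collections import defaultdict


def place_in_boxes(ratings: dict) -> dict:
    """Group system names by their (likelihood, impact) box."""
    boxes = defaultdict(list)
    for name, (likelihood, impact) in ratings.items():
        # Only the box matters; there is no "high end of Moderate".
        boxes[(likelihood, impact)].append(name)
    return dict(boxes)


# Hypothetical ratings agreed with stakeholders.
ratings = {
    "hr-database": ("Moderate", "High"),
    "bench-tracker": ("Moderate", "Low"),
    "payroll": ("Moderate", "High"),
}
for box, systems in place_in_boxes(ratings).items():
    print(box, systems)
```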

What can you do to reduce the risk of your systems? Technologists usually start with Likelihood. What can you do at a system level to reduce the Likelihood of a failure? Moving systems into a data center is one way to do this. Also add redundancy where possible, such as redundant power supplies on different power feeds backed by different uninterruptible power supplies, or move important data off single disks to a RAID or a SAN.

At the same time that you address Likelihood, also consider the Impact. How can you reduce the impact of a failure? For example, do you really need to store all that data on that system? Is the extra data increasing your risk unnecessarily? If you can remove unneeded data from a system, you might reduce your Impact.

Over time, you should repeat this exercise, and you should be able to demonstrate your organization reducing its risks. In the Likelihood-Impact chart, your applications and systems should move from red to yellow, and from yellow to green.
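One way to demonstrate that movement is to compare the colors from two rounds of the exercise. This is a minimal sketch: the function name `risk_changes`, the system names, and the colors are all hypothetical illustration data.

```python
# Hypothetical sketch: compare two rounds of the risk exercise.
ORDER = {"green": 0, "yellow": 1, "red": 2}


def risk_changes(before: dict, after: dict) -> dict:
    """Label each system rated in both rounds as improved, worse, or unchanged."""
    changes = {}
    for system, old_color in before.items():
        if system not in after:
            continue  # retired, or not re-assessed this round
        delta = ORDER[after[system]] - ORDER[old_color]
        if delta < 0:
            changes[system] = "improved"
        elif delta > 0:
            changes[system] = "worse"
        else:
            changes[system] = "unchanged"
    return changes


last_round = {"hr-database": "red", "bench-tracker": "yellow", "file-server": "yellow"}
this_round = {"hr-database": "yellow", "bench-tracker": "green", "file-server": "yellow"}
print(risk_changes(last_round, this_round))
```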
image: "decisions" Impact Hub/Flickr (cc by-sa)
