Thursday, 20 June 2013

The catastrophic collapse of the risk vector

Risk-based testing of complex systems is fairly standard practice, and software systems are no exception. You identify the biggest risks associated with the system you’re testing, and you test for those risks first. Then you go on, in order of descending risk, until you run out of time or interest. As well as being common practice, it’s recommended by the ISTQB in its software testing guide.

But there’s something hopelessly naïve about the way that risk-based testing is done.

The usual approach is to work out the ‘size’ of each risk by considering its impact (if it happens) and the probability (or probable rate) of it happening. That’s fine, but I think that what happens next isn’t. Test managers then combine the impact and the probability by multiplying them together. So if I have a risk that would cost me £100,000 if it happens, but I think that it’s only 1% likely to happen in the lifetime of the thing that I’m testing, then the risk has a value of £1000.
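As a minimal sketch of that scalar approach in Python: the first risk below uses the £100,000 and 1% figures from the paragraph above, while the class name `Risk`, the other two risks and all of their numbers are hypothetical, made up only to show the ranking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    impact: float       # cost in pounds if the risk happens
    probability: float  # chance of it happening in the system's lifetime

    @property
    def scalar_value(self) -> float:
        # The usual collapse to a single number: impact x probability.
        return self.impact * self.probability


risks = [
    Risk("example from the text", 100_000, 0.01),  # about £1,000
    Risk("slow report", 2_000, 0.60),              # hypothetical, about £1,200
    Risk("login outage", 20_000, 0.10),            # hypothetical, about £2,000
]

# Rank in descending order of scalar value, as a test manager typically would.
ranked = sorted(risks, key=lambda r: r.scalar_value, reverse=True)

for r in ranked:
    print(f"{r.name}: £{r.scalar_value:,.0f}")
```

Note that the risk with by far the largest impact ends up at the bottom of this ranking.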

That’s certainly a very convenient way of combining the two numbers into a single (scalar) quantity that we can then rank in order of size. But it seems to me that it misses something that should be absolutely critical to the way that test managers manage risk.

I’d like to explain that with some illustrations.

First, an illustration from the Royal Navy.

‘Type 23’ frigates cost about £170m each. If the risk of losing one in a particular naval battle were 5%, that would give a scalar risk value of £8.5m. In contrast, ‘Queen Elizabeth’ class aircraft carriers cost about £3bn each. They are much better defended than Type 23 frigates, so the probability of losing one is much lower. If the risk of losing one in a particular naval battle were about 0.28%, that would give essentially the same scalar risk value of roughly £8.5m. On that basis, the risk of sending either an aircraft carrier or a frigate into a given battle would be considered to be the same.

Yet the two situations are very different. An admiral may be willing to take an £8.5m risk with a frigate, because the loss of a frigate is bearable: the Royal Navy has over a dozen of them. But taking an £8.5m risk with the Royal Navy’s only aircraft carrier is another matter, because the loss of an aircraft carrier would be unbearable: the impact would be more than the navy could survive.
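The arithmetic behind the frigate and carrier comparison can be checked with a few lines of Python. This is only a sketch: the costs and the 5% figure come from the text, but the `MAX_BEARABLE_LOSS` threshold of £1bn is a purely hypothetical number that I have picked to sit somewhere between the cost of a frigate and the cost of a carrier.

```python
FRIGATE_COST = 170e6   # pounds, from the text
CARRIER_COST = 3e9     # pounds, from the text

frigate_loss_probability = 0.05
frigate_risk = FRIGATE_COST * frigate_loss_probability

# The carrier loss probability that gives the same scalar value
# (it works out at roughly 0.28%).
carrier_loss_probability = frigate_risk / CARRIER_COST
carrier_risk = CARRIER_COST * carrier_loss_probability

# ASSUMPTION: a hypothetical ceiling on the loss the navy could absorb.
MAX_BEARABLE_LOSS = 1e9


def is_bearable(impact: float) -> bool:
    """True if the organization could survive an impact of this size."""
    return impact <= MAX_BEARABLE_LOSS


for name, cost, risk in [
    ("frigate", FRIGATE_COST, frigate_risk),
    ("carrier", CARRIER_COST, carrier_risk),
]:
    print(f"{name}: scalar risk £{risk / 1e6:.1f}m, "
          f"loss bearable: {is_bearable(cost)}")
```

The scalar values match, but the second piece of information, whether the loss is bearable, differs between the two ships, and that is exactly what the multiplication throws away.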

Another illustration: when I arrange my household insurance, if I were to work out the probabilities and impacts of all the risks that my family is exposed to, then the total value of risk would be less than what the insurance companies will charge me for insuring against them. It has to be so, or the insurance companies could not make money. And yet I do take insurance. I don’t insure for the low-impact risks like tumble-dryer breakdown, because I could buy a new one or manage well enough without. But I do insure for the few high-impact risks such as a house fire, because I don’t have the resources to rebuild my house.

The inadequacy of ‘probability x impact’ as a way to model risk is fairly well known. For example, in “The Failure of Risk Management: Why It’s Broken and How to Fix It” (2009), Douglas W. Hubbard shows that risk is a vector quantity, and that trying to collapse it to a scalar in this way loses critically important information.

Modelling risk in systems testing as a scalar quantity is simple, but it’s simply inadequate. An alternative approach, recognizing the vector nature of risk, might be something like:
  • There are some risks with impacts that the organization could not survive, but that are so unlikely that the organization deliberately chooses to ignore them. Don’t test for them.
  • Then there are risks with impacts that the organization could not survive, but that are probable enough to be worth testing for. Test for those in descending order of impact x probability.
  • Then there are risks with impacts that the organization could survive. Only start testing for those when you’ve finished testing the unsurvivable ones. Again, test them in descending order of impact x probability.
It seems so logical and not even very difficult: yet I’ve not found any evidence of anybody doing it that way.
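To show how little is involved, here is a rough Python sketch of the three-tier ordering in the list above. It is one possible reading, not an established method: the function name `plan_test_order`, the two thresholds (`survivable_impact` and `ignore_below`) and every risk in the sample data are hypothetical, and in practice choosing those thresholds is the hard part.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    impact: float       # cost in pounds if the risk happens
    probability: float  # chance of it happening in the system's lifetime


def plan_test_order(risks, survivable_impact, ignore_below):
    """Order risks for testing: unsurvivable ones first, then survivable.

    survivable_impact: largest impact the organization could survive
                       (hypothetical threshold).
    ignore_below:      probability under which an unsurvivable risk is
                       deliberately ignored (hypothetical threshold).
    """
    def scalar(r):
        return r.impact * r.probability

    unsurvivable = [r for r in risks if r.impact > survivable_impact]
    survivable = [r for r in risks if r.impact <= survivable_impact]

    # Tier 1: unsurvivable but too unlikely to bother with. Not tested.
    worth_testing = [r for r in unsurvivable if r.probability >= ignore_below]

    # Tiers 2 and 3: each sorted by descending impact x probability.
    return (sorted(worth_testing, key=scalar, reverse=True)
            + sorted(survivable, key=scalar, reverse=True))


# Hypothetical sample data.
sample = [
    Risk("asteroid strike on data centre", 50e6, 1e-9),
    Risk("payment ledger corruption", 20e6, 0.02),
    Risk("regulatory data breach", 10e6, 0.05),
    Risk("checkout outage", 2e6, 0.5),
    Risk("report formatting bug", 5e3, 0.9),
]

order = plan_test_order(sample, survivable_impact=5e6, ignore_below=1e-6)
for position, r in enumerate(order, start=1):
    print(position, r.name)
```

In this made-up data the checkout outage has the largest scalar value of all, so a purely scalar ranking would test it first. With the tiered ordering it waits until both unsurvivable risks have been covered.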
