21.5.06

Think, Thank, Thunk

This is about a thought experiment. It has nothing to do with global sourcing, except that I've got to figure out how to do the experiment successfully with both parts of my team, in the US and in Pune, India.

Here's the problem -- I've got about 25 people working on the QA team for our big flagship product. The product is a fairly complex client-server application: Windows and SQL Server on the back end, web-services command and control, Windows on the client side. About a million lines of code, give or take. 10 years old. Lots of crufty features no one remembers except that one customer in Idaho who we did the feature for back in '98.

Here's the rest of the problem -- I'm "re-inventing" the QA practice. I want a more methodical approach. I want deeper test coverage. I want the ability to make tradeoffs between risk and comprehensive test coverage. You can only make that tradeoff if you have a concept of what "comprehensive" means.

So I'm doing a lot of work with this team to get them on the same page as me. I'm doing it first with my team here in the US. We're planning on bringing the contract team along the same path, phase-offset by about a week or two.

Here's the thought experiment I did with the team:

Thought Experiment 1. Fill in the blank:

Our product is __________ deterministic.

I asked around the room, and got: non, mostly, not very, occasionally, purely, non, somewhat and usually.

Then I told them my answer. For starters, I pointed out that I didn't constrain the number of words they could put in the blank. I just asked them to fill it in. (There's an interesting point about conformity vs. "out of the box" thinking here, but we'll save that for another time.) My answer was "hyper-complex, but utterly".

The product is hyper-complex. It's a million lines of object-oriented, class-based code in at least 3 languages. It is so stateful as to appear random to human processing engines.

I used to say of this product that it is a "random crash generator". This is because, after analyzing empirical evidence from multiple product releases, I determined that as long as we keep testing we will keep finding defects at an asymptotically decaying discovery rate, forever. Which means that there are infinite defects in the product. Which, plotted against a finite number of code paths, means it's effectively a random crash generator. I don't think that anymore, and I told my team as much. In fact, I regret ever talking about that leap of flawed logic.

It's code. The code is not dynamic. It's very very very stateful, and it's operating against a matrix of highly complex and stateful background noise. But it's code. Code is, well, encoded. It does the same thing over and over, given the same inputs.

What we've failed to do is to understand the inputs. We don't grok all the salient points that can perturb the system. We haven't listed all the variables in this giant system, and we haven't determined all the states those variables can have. So we're treating the product as a black box, and we're treating our inputs as a black box. So it's no wonder the behavior of the product appears random to us. And it's no wonder the testing we're doing isn't having the effect I want, and isn't always giving us a clear insight into the quality of the compiled code.

Thought Experiment 1a.

It is QA's job to test _________________ product capabilities.

I got: major, some, all, all, all, 75% of, interesting, and new.

I said: "A statistically relevant sample set of".

Here's why I believe that: If we define all the variables that could impact this product, and we then map out all the combinations of all the variables, we'll easily conceive of millions and millions of test cases for this product. Because the product is so stateful, and because no one on my team can yet tell me that the combination (Variable1:Value3, Variable2:Value8, Variable3:Value1) doesn't take the product down a different code-path than (Variable1:Value3, Variable2:Value8, Variable3:Value2), all combinations of all inputs and environment variables are, unto themselves, valid test cases. Which means there are millions and millions of test cases. (Take this as a given, since I don't want to write a whole essay about the complexity of the product in question.)
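To make the combinatorics concrete, here's a quick Python sketch. Every variable and value in it is invented for illustration -- none of this comes from the actual product -- but it shows how fast the cross-product grows:

    from itertools import product

    # Hypothetical environment variables and the distinct values each can take.
    # All names and counts here are made up for illustration.
    variables = {
        "client_os": ["Win2000", "WinXP", "Win2003"],
        "sql_server": ["2000 SP3", "2000 SP4", "2005"],
        "locale": ["en-US", "de-DE", "ja-JP"],
        "db_size": ["empty", "small", "large", "huge"],
        "auth_mode": ["windows", "mixed"],
        "concurrent_users": [1, 10, 100],
    }

    # Every distinct combination is, in principle, its own test case.
    total = 1
    for values in variables.values():
        total *= len(values)
    print(total, "combinations from just", len(variables), "variables")
    # 3 * 3 * 3 * 4 * 2 * 3 = 648

    # itertools.product will enumerate them if you ever need the actual list:
    all_cases = list(product(*variables.values()))

Six made-up variables already give 648 distinct environments. Multiply that by the few thousand feature-level cases you'd want to exercise under each environment, and you're into the millions very quickly.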

If there are even one million test cases, and they take 5 seconds on average to run, then running them all takes roughly 58 days of round-the-clock compute time, ignoring the tens of thousands of man-days it would take to code the test cases in automation. Oh, and automating the tests is your only hope of having each test case take only 5 seconds. Otherwise, you're probably looking at an average of 30 minutes to an hour per test case.
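For anyone who wants to check the arithmetic, here's the back-of-the-envelope version. The one million figure and the per-case timings are assumptions, not measurements:

    test_cases = 1000000          # assumed count, not a measurement
    seconds_per_case = 5          # only plausible with automation

    compute_days = test_cases * seconds_per_case / 86400.0
    print(round(compute_days))    # ~58 days of round-the-clock serial compute

    # Manual execution at the optimistic end, 30 minutes per case:
    person_days = test_cases * 0.5 / 8    # 8-hour workdays
    print(round(person_days))             # 62,500 person-days
    print(round(person_days / 250))       # ~250 person-years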

Which means that testing all the features is a nice idea, but not commercially practical.

(I don't have the math to describe this with precision, but we should be able to hire people who do. Here's the lay version of the math.) So we've got to pick a statistically relevant sample size, and run tests against that sample, hopefully also hitting the common use-case scenarios we expect our customers to hit on a frequent basis. Based on the number of combinations we tested, plotted against the number we believe to be present in the product, we know how much of the product we think we sampled. Then, based on the number and severity of the defects we found, we can extrapolate what's left in the product. There are a lot of messy assumptions in this approach, but I think it's the only commercially viable approach given a hyper-complex product.
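For what it's worth, here's a deliberately naive sketch of the extrapolation I have in mind. The numbers are placeholders, and the central assumption -- that defects are spread evenly across the combination space -- is exactly the kind of messy assumption a real statistician would tighten up:

    # Placeholder figures -- not real product data.
    estimated_total_cases = 5000000   # combinations we believe exist
    sample_size = 20000               # combinations we actually ran
    defects_found = 150               # defects the sample turned up

    coverage = float(sample_size) / estimated_total_cases
    print("sampled %.2f%% of the estimated space" % (coverage * 100))

    # Naive extrapolation: assume defects are evenly distributed.
    # (They aren't -- defects cluster -- which is why this is the lay version.)
    estimated_total_defects = defects_found / coverage
    print("crude estimate of undiscovered defects:",
          int(estimated_total_defects - defects_found))

With those placeholder numbers you've sampled 0.40% of the space, and the crude estimate says roughly 37,000 defects are still out there. The point isn't the number; it's that once you know the size of the space and the size of your sample, you can start making statements about risk instead of hand-waving.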

Given this belief, I think the first step in re-inventing the QA practice for this product is to scope the product, write down QA's view of it, and begin cataloging all the variables that can impact each component, feature, or function. Then we can figure out how best to organize that info, and get our heads around how many test cases there would be, in theory, if we had forever to work on each of these releases.
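One way to start that catalog is nothing fancier than a structured list per feature. The feature and variable names below are hypothetical, but even a first cut like this lets you compute the theoretical test-case count as the catalog grows:

    # Hypothetical first cut: feature -> variables -> possible values.
    # Every name here is illustrative; the real catalog comes out of scoping.
    catalog = {
        "scheduled_reports": {
            "output_format": ["pdf", "csv", "html"],
            "schedule": ["hourly", "daily", "monthly"],
            "recipients": [1, 10, 500],
        },
        "user_import": {
            "source": ["csv", "ldap"],
            "record_count": [1, 1000, 100000],
            "has_duplicates": [True, False],
        },
    }

    def theoretical_cases(feature_vars):
        """Distinct value combinations for one feature."""
        n = 1
        for values in feature_vars.values():
            n *= len(values)
        return n

    for feature, variables in catalog.items():
        print(feature, theoretical_cases(variables))
    print("total:", sum(theoretical_cases(v) for v in catalog.values()))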

It was an interesting thought experiment.

I'm writing it up and asking my team in India to do it as well, without telling them our answers. It will be interesting to see what they come back with.