What would it take to put government in orbit?

Things go wrong.  Processes don’t quite work.  The requirements change before the software is finished.  The deadline is approaching so the scope is reduced.  The system isn’t quite as scalable as the vendors claimed.  The training was designed to support the original design, not what has actually been implemented.  There are seventy-six legacy systems which nobody dares – or can afford – to replace.

And, of course, all that has consequences.  Processes are slower and less accurate. Customers are less satisfied.  Costs are less controlled.  Bugs queue up for maintenance releases stretching far into the future.

There are efforts to improve.  Each starts off in a flurry of engagement and excitement.  Each loses momentum as energy drains away.  Each is replaced by a new idea, supported only by the hope that this time it is going to be different.  And so the wretched cycle continues:  things do actually get better, but they get better slowly and with so many diversions, interruptions and half steps backwards that it is easy not to see the progress being made.  The time and money available to make those improvements are in any case constrained by the need to divert attention to the next big thing, which starts the whole cycle all over again.

Then a miracle occurs

That’s not how it happens at NASA.  While it is mildly disconcerting to be sitting on a train when the driver announces that it has to be rebooted, it’s really not the sort of thing you want to happen when manoeuvring in space.  So it doesn’t.

But how much work the software does is not what makes it remarkable. What makes it remarkable is how well the software works. This software never crashes. It never needs to be re-booted. This software is bug-free. It is perfect, as perfect as human beings have achieved. Consider these stats: the last three versions of the program – each 420,000 lines long – had just one error each. The last 11 versions of this software had a total of 17 errors. Commercial programs of equivalent complexity would have 5,000 errors.

That comes from a long article in Fast Company a couple of years ago, They Write the Right Stuff (which I read in the last few days by following a link, though I can’t now find who provided the tip-off).  It’s essentially the story of how NASA (or rather Lockheed Martin) manages software production as an engineering problem:

The most important things the shuttle group does – carefully planning the software in advance, writing no code until the design is complete, making no changes without supporting blueprints, keeping a completely accurate record of the code – are not expensive. The process isn’t even rocket science. It’s standard practice in almost every engineering discipline except software engineering.

At a seminar early last year, I recall hearing people such as Ross Anderson excoriate government for not applying engineering principles (with the memorable response from Tom Steinberg that he disagreed with almost every word they had said, but that’s another story).  It is, as Tom’s response implies, a very different approach from the “beta first, then refine” school of thought, but then it’s a different set of tools attempting to solve a different kind of problem.

Building software slowly with meticulous attention to detail is more expensive than building it quickly with a more relaxed approach to errors:

And money is not the critical constraint: the group’s $35 million per year budget is a trivial slice of the NASA pie, but on a dollars-per-line basis, it makes the group among the nation’s most expensive software organizations.

But dollars-per-line is, of course, a remarkably inappropriate measure of software cost.  In the extreme example of space travel, the measure is of space shuttles not lost and astronauts not killed as a result of software faults.  In the rather more prosaic world of government services, it’s the costs – both administrative and customers’ time and energy – of software which doesn’t quite fit the process it is intended to support.  To be fair, that lack of fit may as often be because the process has changed since the software was specified as because it failed to meet the specification – and more often, I suspect, than either, because the level of shared understanding between all involved was never high enough.

What’s important in the wider context, though, is not the software itself, but the conditions under which it can be produced.

The error database stands as a kind of monument to the way the on-board shuttle group goes about its work. Here is recorded every single error ever made while writing or working on the software, going back almost 20 years. For every one of those errors, the database records when the error was discovered; what set of commands revealed the error; who discovered it; what activity was going on when it was discovered — testing, training, or flight. It tracks how the error was introduced into the program; how the error managed to slip past the filters set up at every stage to catch errors — why wasn’t it caught during design? during development inspections? during verification? Finally, the database records how the error was corrected, and whether similar errors might have slipped through the same holes.

“We never let anything go,” says Patti Thornton, a senior manager. “We do just the opposite: we let everything bother us.”

The process is so pervasive, it gets the blame for any error – if there is a flaw in the software, there must be something wrong with the way it’s being written, something that can be corrected. Any error not found at the planning stage has slipped through at least some checks. Why? Is there something wrong with the inspection process? Does a question need to be added to a checklist?
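To make that a little more concrete, here is a minimal sketch of what one record in an error database of that kind might hold, and of the question it lets you ask.  It is an illustration only, not the shuttle group's actual schema:  the names `ErrorRecord`, `Activity`, `Stage` and `escapes_by_stage` are all invented here, and the fields simply follow the list in the passage quoted above.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import Iterable, List


class Activity(Enum):
    """What was going on when the error was discovered (per the quoted passage)."""
    TESTING = "testing"
    TRAINING = "training"
    FLIGHT = "flight"


class Stage(Enum):
    """Filters an error could have slipped past (hypothetical naming)."""
    DESIGN = "design"
    DEVELOPMENT_INSPECTION = "development inspection"
    VERIFICATION = "verification"


@dataclass
class ErrorRecord:
    """One entry in an illustrative error database; all field names are assumptions."""
    discovered_on: date           # when the error was discovered
    revealing_commands: str       # what set of commands revealed it
    discovered_by: str            # who discovered it
    activity: Activity            # testing, training, or flight
    how_introduced: str           # how the error got into the program
    missed_by: List[Stage] = field(default_factory=list)  # filters it slipped past
    correction: str = ""          # how the error was corrected
    similar_errors_possible: bool = False  # could others have used the same hole?


def escapes_by_stage(records: Iterable[ErrorRecord]) -> Counter:
    """Count how many recorded errors slipped past each filter stage."""
    counts: Counter = Counter()
    for record in records:
        counts.update(record.missed_by)
    return counts
```

Run over the whole history, a tally like `escapes_by_stage` points at the filter that most needs another question on its checklist, which is the habit the passage describes:  every error that escapes becomes a question about the process rather than about the person.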

That’s getting to the heart of things:  designing out error and funnelling creativity into improving the process in a structured way.  It sounds tedious and stultifying.  But don’t underestimate what an achievement it would be to make transactional services, which are supposed to be boringly predictable, actually boringly predictable.  Doing that, of course, goes way beyond the need to get software right:  the need for rigour must be a system-level requirement, of which the software is just one component.  So it’s tempting to conclude that disciplines such as Lean and Six Sigma, which are all about using structure and process to deliver quality and consistency, provide all the answers needed.

Space Shuttle Challenger Maiden Voyage - picture by SensorPhoto


But it may not be quite as straightforward as that.  Anybody who has ever confronted a pile of project control documentation with anything less than a spring in their step may be forgiven for thinking that process may be at risk of getting in the way of purpose.  More importantly, the pattern of strengths and weaknesses of human behaviour is different from the pattern of strengths and weaknesses of computer behaviour.

Twenty years ago, when the social security system was being computerised, there was an interesting – and not wholly anticipated – reversal in what was easy and quick and what was hard and slow.  Up to that point, batch processing was slow, because it was done by humans, one file at a time.  The annual uprating of benefits, for example, took months to run:  every single case had to be recalculated clerically.  System change, on the other hand, was fast:  a rule change could be propagated by sending a circular with a revised instruction to local offices and implemented within days.  Computerisation removed the pain of uprating:  it became a set of changes to configuration files.  But it made system changes into a painful industry:  even trivial changes had to go into the queue for maintenance releases, from which they could easily take a year or two to emerge.  Twenty years on, we are only beginning to emerge from the pain of that trade-off.

So even within the parts of the system which need to be robust and predictable, we need to be careful not to lock ourselves into ways of building and maintaining them which make change slow and expensive.  The magic fairy dust answer to that one has been service-oriented architecture, but thinking on it has been largely stuck at the IT level – with the result that even Wikipedia can’t easily explain it in terms which make sense to ordinary mortals.  I am increasingly of the view that the real power of the concept is at the level of business services rather than IT services.  Done right, that should mean that increasingly we need to change only what needs to be changed (instead of building whole systems when the real need is for a smaller degree of change), as well as providing better structured means of connecting components from different providers.

But of course all of this collapses into complete irrelevance if it’s an attempt to solve the wrong problem – and many would argue that that is exactly what it is.  Tom Steinberg has recently written that governments must

Accept that any state institution that says “we control all the information about X” is going to look increasingly strange and frustrating to a public that’s used to being able to do whatever they want with information about themselves, or about anything they care about (both private and public). This means accepting that federated identity systems are coming and will probably be more successful than even official ID card systems: ditto citizen-held medical records.

More starkly still, in his contribution to the series of essays produced for Reboot Britain, Lee Bryant argues that:

In government, as in business, we suffer from organisational models that are too expensive and inefficient to succeed in the current climate.…Corporate IT has become a blocker not an enabler and we urgently need a new, more human-scale approach to internal communications and knowledge sharing within organisations in both the private and public sector. The boom times of recent years have hidden a great deal of inefficiency, and as revenues recede, we need flatter, more agile organisational structures instead of the stultifying middle management bureaucratic machines that exist because organisations fundamentally don’t trust their own people, let alone their customers and users.

I don’t see this as a choice, or at least not as an immediate choice.  As things now stand, two things need to be done, they both need to be done well, and they will be done well by doing different things.  Whether we like it or not, whether VRM (vendor relationship management) is the future or not, government is the custodian of many large databases and the services built on them, and millions of people depend on that custody being effective.  So it is in everybody’s interest that that is done as well as it can be, and the standards of the shuttle software are standards it is worth aspiring to.

But it is just as clear that government is not a single-purpose, single-application control system.  Its role is not to steer one machine but to be steered by many.  And in that context, the approach to service design implicit in the space shuttle software model may be precisely the wrong one.