Build Quality In - Stop the Line
Hi Folks
If you ever visit a lean manufacturing plant you will see, at every workstation, a cord or button or lever attached to a big, red light. It's also attached to the production line machinery. Press that button, pull that cord or move that lever and two things happen: the big, red light starts flashing and the production line stops. In Lean Manufacturing, this system is called Andon (a Japanese word for a paper lantern, used here to mean a signal or indicator).
Whenever there is a quality problem on the line, the operator at the station that discovers it activates the light and stops the line. If the worker discovers that a part they received to perform their manufacturing step is wrong, or out of tolerance, the whole line stops. If they discover that the output of their manufacturing step is wrong or out of tolerance or fails testing in some way, the whole line stops. Stopping that station seems obvious, since they need to fix the problem, but why stop the whole line?
The trick here is that it isn't that one particular station that needs to fix the problem. More than likely that station did not cause the problem. The out-of-tolerance part will have been made by a previous part of the process, whose parts will have been made by a previous part of the process, and so on. If just that station tried to fix the problem, by getting a new, working part out of stock for example, all that would do is work around the issue. Nothing would have been done to stop another faulty part from being made earlier in the process. The point of stopping the line is to force the whole production process to find and fix the root cause of the issue. That usually means fixing it somewhere other than where it was discovered. Stopping an individual station causes problems to be worked around. Stopping the line causes problems to be fixed.
When a new line starts up, it can take weeks for the first product to come through the whole process. At every step the lights go on, the line stops and the problem is fixed. That first unit stop/starts its way through the process really slowly. The next unit does the same, maybe a little faster. Each successive unit gets faster and faster as the line sorts out its problems. Eventually the line runs smoothly and the Andon system hardly ever needs to be used. Every station is working at full efficiency because they aren't continually working around problems introduced earlier in the process.
It's a similar situation in software development. When a defect is found, the line should stop. Any time a bug is reported, whether by internal testing or by a customer, development work should stop and the team should focus on the defect. The focus is not just on fixing it, but on identifying the root cause of the defect and correcting the process so that a bug like that can't happen again. Maybe the spec was unclear. Fix the spec, then fix the bug. Maybe there were some gaps in unit testing. Fix the unit tests, then fix the bug. Fix the root cause.
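One way to make "fix the root cause, then fix the bug" hard to skip is to refuse to close a defect until both are recorded. The following is only a minimal sketch in Python; the Defect class and its field names are hypothetical and not taken from any particular bug tracker.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Defect:
    """Hypothetical defect record, for illustration only."""
    summary: str
    root_cause: Optional[str] = None   # e.g. "spec was unclear"
    process_fix: Optional[str] = None  # e.g. "spec rewritten and reviewed"
    code_fixed: bool = False

    def can_close(self) -> bool:
        # Fixing the code alone is a workaround. The defect only closes
        # once the root cause and the process correction are written down.
        return bool(self.root_cause) and bool(self.process_fix) and self.code_fixed


defect = Defect("Export drops the last row")
defect.code_fixed = True
print(defect.can_close())  # still False: no root cause recorded yet
```

Most trackers can probably enforce something similar with required fields, so treat the code as a statement of the rule rather than something to build.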
Before release, your unit tests and build system are your Andon system. A failed build or failed unit test should stop the line. A lot of teams have a "build master" who is responsible for fixing the build. The idea here is that the rest of the team shouldn't be bothered by a broken build. This is the wrong approach. Unless the whole team stops and works out why the build broke, the build master will just fix the current problem and pretty soon the build will break again. I have worked with teams where build master was a full-time job because the build broke all the time. As far as the rest of the developers were concerned, though, the build was fine. There was a good build every night. All was well. The poor build master, though, would have to work back every night fixing that day's build issues, and the developers would keep making the same mistakes. When the whole team was made to stop and fix the problem instead, the build issues stopped and the build became stable. The team freed up a whole developer to work on new features.
It's the same with unit tests. A failed test should cause the team to stop, assess why the test is failing (that usually doesn't take much time) and work out how to stop it failing in future. It may be a simple developer error, in which case it's just a matter of that developer fixing their code, or it may be a deeper issue which will take more effort. The key thing is that it must be fixed straight away. It's very tempting to look at that one failed test and think "we'll fix that next sprint". That one failed test could be hiding a really big defect, and the longer it stays unfixed, the more work will be done around it, which will make it harder and more expensive to fix. There could be a fundamental misunderstanding of the user requirements that will affect everything you are doing. Until you analyze that failed test you won't know.
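The rule itself is simple enough to write down. Here is a rough sketch in Python of the "any failure stops the line" check; the CheckResult type and both function names are made up for illustration, and a real team would wire the same rule into whatever build system they already use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CheckResult:
    """Hypothetical result of one unit test."""
    name: str
    passed: bool


def line_should_stop(build_ok: bool, results: List[CheckResult]) -> bool:
    # One broken build or one failed test is enough to pull the cord.
    return (not build_ok) or any(not r.passed for r in results)


def failing_checks(results: List[CheckResult]) -> List[str]:
    # Names of the failed tests, so the team knows where to start looking.
    return [r.name for r in results if not r.passed]


results = [CheckResult("test_login", True), CheckResult("test_export", False)]
if line_should_stop(build_ok=True, results=results):
    print("Stop the line:", ", ".join(failing_checks(results)))
```

Note there is no threshold and no "known failures" list in there. That is deliberate: a single failed test stops the line.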
Analyzing the root cause of build failures and broken tests is often really quick. In most cases the team will sort out the issue within a few minutes. That small investment in time has a big payoff in quality. By stopping the line and fixing root causes rather than working around problems, you will find that problems become less and less frequent as your process becomes better and better. Everyone can focus on delivering software, not on fixing bugs.
This also underscores the importance of unit tests and a regular build. They are your Andon system. Without them you don't know when to stop the line.
Cheers
Dave