Paved with Good Intentions

Replacing Judgement with Process

by Kevin Morris

Our design and development team was lean and efficient.  We had five engineers, and we all knew what we were doing.  One of us was probably designated as the "project lead," but in practice that hardly mattered.  That engineer simply acted as the public face of the group to the rest of the company.  Within the team everybody intuitively knew their role, and the complex interactions were never choreographed from any centralized management.  The team was an organic, competent, flexible, almost single-minded machine.

The product evolved very quickly.  I'd find a bug during my development and ping one of the other engineers across the room - "Hey, could you take a look at this? I'm not getting the right data back from your module here..."  The other guy would immediately look at the problem, spot his mistake, and bang out a quick fix - or he'd realize the problem wasn't actually in his module but arose because a third engineer hadn't quite finished her work yet.  He'd show me the workaround he was using and we'd both be back to full-steam productivity again.

When the product first launched, it was awesome.  Our team had a clean vision for what the product should be, and we executed that vision extremely well.  However, we were in a fast-moving market.  We had left a lot out of the first version of the product in order to get to market faster.  We immediately started working on the second version - adding the new features we knew we needed, fixing problems found by our customers in the first version, and responding to things our competitors had done in the interim.

Since we had all these actions coming from all these directions, and since we were an all-grown-up engineering team with a real product on the market, and since our product contained a lot of software, we had a growing number of bug reports piling up.  Our management team decided it was time for a bug-tracking system.  The only thing we needed was to come up with a proper method of prioritizing bugs and fixes.

At this point in any project, big red warning signs, blinking lights, and danger buzzers should go off.  You see, as we try to make a set of rules to govern our engineering work, we enter a conceptual vortex where we are attempting to replace our hard-earned engineering judgment with a simple set of rules.  The truth is, this cannot be done.  Trying to do so will inevitably result in almost unbearable pain and agony for the development team, but don't concern yourself with that right now.  Let's get back to our story.

It's simple enough to make a priority system, right?  We'll assign rankings from one to five - with five being the most important and one being the least.

Problem solved.

Well, except one detail. Since our support people and our QA people and our marketing people (oh yeah, and even engineers) will be assigning these priorities, we probably need to give some definitions and guidelines for each level.  How about 5=critical, 4=high, 3=medium, 2=low, and 1=nice-to-have?

Now, problem solved.  

Marketing starts putting their "wish list" into the system, customer support starts filing bug reports on behalf of customers, and QA begins filling the system with problems they're turning up in their newly-created tests.  Game on.  Hundreds of tickets flood the system.  

After an initial week or two of euphoria, the novelty wears off.  Marketing had started off being nice, filing all their ideas for future improvements as "1 - nice-to-have."  Now, they see the error of their ways.  Their work is hopelessly lost at the bottom of the queue.  They start kicking up the priority.  Some features MUST be added or our competitors will have a field day.  Customer support gets in a bidding war with marketing - we have a few customers down with serious issues - those need attention now.  Not to be left out, QA starts bumping their priorities up - their test failures cannot be ignored.

Before long, almost every new bug is being filed as "5 - Critical" because the filer doesn't want it winding up at the end of the to-do list.  They particularly don't want it winding up behind some big pipe-dream of Marketing's where it will clearly never get attention.  

Furthermore, Management has now discovered that we have a bug-tracking system, and they decide they can measure the effectiveness of our engineering team and the quality of our product by the number and priority of open bugs.  Somebody in accounting helpfully suggests creating a metric "The Score" based on multiplying the priority by the number of open bugs, and summing that across all five priorities.  Our team now has a (Management Lameness) "Score" of approximately ten zillion.
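For the curious, "The Score" as described above is easy to sketch.  The ticket counts below are invented purely for illustration (the real numbers are not part of this story), and the function name is hypothetical.  The point is only that the same hundred tickets score about three times worse once everyone starts filing at 5.

```python
from collections import Counter

def the_score(open_bug_priorities):
    """Sum of priority * (number of open bugs at that priority), over priorities 1-5."""
    counts = Counter(open_bug_priorities)
    return sum(priority * counts[priority] for priority in range(1, 6))

# Hypothetical queue early on: most tickets filed at low priority.
early = [1] * 60 + [2] * 25 + [3] * 10 + [4] * 4 + [5] * 1

# The same 100 tickets after priority inflation: nearly everything is a 5.
inflated = [5] * 90 + [4] * 10

print(the_score(early))     # 161
print(the_score(inflated))  # 490
```

Nothing about the product changed between those two numbers.  Only the filing behavior did.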

By week 3, every engineer on the team has a file of critical bug reports that could account for about four years of work.  We have no idea which issues to tackle first, and Management thinks we're the worst engineering team ever assembled.  Our highly-successful product is even rumored to be on the chopping block for de-funding, and HR is busy re-calculating our bonus schedule to include a metric for open bug count.  Panic sets in, followed by depression. 

We have to do something.  

We create a small committee to review incoming bug reports, re-prioritize them, and assign them to the appropriate engineer.  By forming a team of engineering, marketing, support, and QA, and by defining a set of criteria for each priority level, we could bring sanity back to the process.  Now we just need to define those priority levels much more crisply. This is important business!

Level 5 will now be reserved for the most egregious, bonus-busting, customer-stopping, safety-critical, lawsuit-generating issues.  Do you think your issue is bad?  It's not a level 5.  Just try us.  Short of a problem that causes an unintended thermonuclear detonation in a highly-populated area with an indigenous population of nuns, children, and fuzzy-cuddly animals, we will not assign a 5.  The purpose of level 5 is to have a level that is more important than anything we've ever seen, so that if and when such a problem arises, we'll be able to move it to the top of the list.  (This is also a great strategy for immediately whacking our management bug metric down to a much more manageable number in record time.)

Level 4 is now the highest level we expect to see.  Level 4 is "High" right?  If our product has a 10% chance of killing the operator - that should be a 4.  If sales is unable to sell a single copy because the product won't start during a demo... maybe a 4 as well.  

Three, two, and one would be the new five.  We wouldn't be gamed by the system any more.  The review committee would be sure that the new definitions would be enforced.

Problem solved.

Well, except bug review committee meetings started to get long.  There was endless arguing over priorities.   Meetings were moved to Friday, and sometimes the review meeting took the entire day.  Since our team had only five engineers, losing 20% of one of them to bug review was expensive.  It was expensive for marketing and support too - and attendance at the meetings dropped.  QA started stacking the deck so that their more controversial issues were discussed when people likely to oppose them were not present.  The system took on all the political trappings of a legislative body.

Additionally, lines of conflict started to emerge based on definitions.  Which was more important - a bug that had a 0.01% chance of occurring with enormous consequences, or a bug that had a 99% chance of occurring with minor consequences?  The idea of probability of occurrence versus consequences of occurrence could not be resolved.  We needed two ratings for each bug - priority and... severity.  Now, we could have a severity of Critical = "very, very bad when it happens", and a priority of 2 = "not very likely to happen".  Everyone could understand what a 2C meant - not likely to happen, but bad when it does.  5L - happens a lot but not a big deal.

Problem solved.

Well, almost.  Now, engineers didn't know what to work on first.  With this 2D matrix of ratings, things had gotten quite complicated.  Plus, engineers argued, there was no way to differentiate between a quick, 5-minute fix and a problem that would require a complete overhaul of the system architecture.  Furthermore, there was no way to account for the potential risk associated with a fix.  Changing a font on a screen display was a whole different level of risk than re-doing a database schema.  Serious consideration was given to adding two more types of rating to the system.  On top of all that, we had to differentiate between problems in the version of the product that was shipping and problems in the pre-production version we were still developing.  The bug tracking system became a self-aware, dangerous, complex, evolving beast.  Everyone lived in fear and reverence.
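Here is a rough sketch of why the two-rating scheme could not tell anyone what to do first.  It assumes ratings are written like "2C" and that the severity letters run L < M < H < C.  Only "C" and "L" appear in this story; the middle letters, the function names, and the comparison rule are assumptions made for illustration.  One rating clearly outranks another only when it is at least as high on both axes, and many pairs simply do not compare.

```python
# Assumed encodings: priority 1-5 (likelihood), severity L < M < H < C.
SEVERITY_RANK = {"L": 1, "M": 2, "H": 3, "C": 4}

def parse_rating(rating):
    """Split a rating such as '2C' into (priority, severity_rank)."""
    return int(rating[0]), SEVERITY_RANK[rating[1]]

def outranks(a, b):
    """True only if rating a is at least as high as b on both axes and differs from it."""
    pa, sa = parse_rating(a)
    pb, sb = parse_rating(b)
    return pa >= pb and sa >= sb and (pa, sa) != (pb, sb)

print(outranks("5C", "2C"))  # True: more likely, equally severe
print(outranks("2C", "5L"))  # False
print(outranks("5L", "2C"))  # False: neither outranks the other
```

A 2C and a 5L sit side by side in the matrix with no rule to pick between them, which is exactly where the arguments started.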

Over time, our obsession with bug priorities and severities surpassed our desire to please our customers.  It was more important than building a good product.  It got more attention than making money.  We spent more time, energy, and stress playing our internal bug tracking and prioritizing game than we did working on the product.  

Something had to be done.  

During the course of the priority wars, our project lead had lost his job.  Management held him accountable for a massive uptick in the weighted quality metric (determined by a formula of bug priorities and severities averaged over time, and modified by closure and discovery rates).  As a result, I was now the unwitting heir to the throne.  I decided to stealthily instigate a subversive solution to the systematic problem.

Our engineering representative quietly stopped attending the bug priority meetings.  Nobody minded, of course, because that was one less person to fight against their agenda.  Marketing, support, and QA never complained.

Next, I quietly instructed the engineers to stop worrying about the bug rankings.  I asked them to look over their lists, and to fix the things that, based on their own professional judgment, needed to be fixed.  I told them I would not hold them accountable for differentially weighted fix rate metrics.  I asked them to make the product great.

Externally I explained that we were now working on a new priority formula that took into account the priority, severity, risk, and fix effort, and yielded an optimum sequence for engineering work.  I claimed that this would give us the most possible product improvement in the shortest time.  I played with Excel for a few hours until I found a metric that went up and to the right based on past data.  I made a graph.  I put it on a PowerPoint slide.

Morale improved immediately.  Arguments between engineering and other teams ceased.  Productivity rose.  We encouraged the Marketing, Support and QA teams to fight hard to get the priorities and severities right.  We said it would make the product better, and with our new formula, we needed them to be vigilant.  "Garbage in - garbage out" we quoted.  They were impressed.  They knew that was a revered engineering phrase.

Engineers forgot all about priorities and severities.  They used their judgment again.  The product improved faster than ever.  Customers were happy.  Revenues went up and to the right.

Problem solved.

Comments:

Great story! True?
Posted on 2010-02-17 at 03:19:32
Remy Really true!
The conclusion is that the project lead has a true role: not to be an inquisitor of the project team, but to provide communication to the team and to management, with a real technical background. Use the Force, Luke.
Posted on 2010-02-17 at 04:02:43
The problem is that when people design bug tracking systems, they abandon all their years of design experience in favor of throwing together the first few ideas that pop into their heads. If you started by producing a spec, you would realise that the 'bug priority' is the output we are looking for, not one of the process inputs. What are the inputs? Basically three variables: how much the bug is costing, how much it will cost to fix, and how long the fix will take. With that information, you have some hope of improving the only metric that really matters - the bottom line. Although optimizing the benefit relative to cost with the available resources probably turns out to be a variant of the Travelling Salesman Problem, it should be easy to make a sub-optimal prioritization with big improvements over a less systematic approach. The bug tracking system described in the article, which seems an accurate representation of all the ones I've ever seen or heard of, is bound to fail because it asks the person logging the bug to supply information that they don't and cannot know (how important is fixing bug X compared to all the bugs being uncovered over all departments in the company, given the comparative costs of fixing each bug). Of course with the scheme I outline, the person reporting can still pad the bug cost in an attempt to increase its priority, but sadly you can't design out office politics.
Posted on 2010-02-17 at 06:16:18
kevin Yes, the story is true. I have changed a few of the non-essential facts to protect the identity of former employers. I have also seen virtually identical scenarios play out in a number of companies I've worked with. We build elaborate internal processes, and get so wrapped up in the game that we forget what we're actually trying to accomplish.
Posted on 2010-02-18 at 20:06:10
kevin Hi Mondo23,

Yes, I am saying that people who design complex algorithms can't design simple business processes.

In most cases I've lived through, highly competent professionals went off on the quest for the perfect priority system. I agree with what you're saying about the person filing the bug not having sufficient information. None of our mad-mad-world processes depended on that. I didn't go into these details in the article, but we had a team that reviewed each bug report after it was filed. Marketing was there to give their opinion on how the bug would affect our ability to sell/demo the product. Customer support was there to tell us what the impact would be (or was already being) on existing customers. Engineering was there to tell us about how much work it would be to fix the bug, and what risk was associated with the fix. QA was there to tell whether the bug was (or was suspected to be) a duplicate of one already filed, why the original test suite didn't catch the bug and how they planned to modify it for the future.

If our company had employed actuaries, I'm sure we would have had them spend time stitching all that together in our own version of "The Formula" from Fight Club.

The problem was that this process took a tremendous amount of time and effort on each and every bug report filed. The end result was a multi-dimensional priority score which did almost everything EXCEPT the one thing we needed it to do - tell our engineers what order to do their work.

My average engineer had about 30-50 bugs in their queue. They had a pretty good idea what order to fix them in without any input from The Process.

Occasionally, after an investment of tens-to-hundreds of man-hours of gut-wrenching inter-group battling over priorities, number seven and eight for one of my engineers would get swapped. My engineer would then fix seven on Thursday morning and eight on Thursday afternoon - instead of the other way 'round. I'm not sure it was worth it.
Posted on 2010-02-18 at 20:33:16
ICarlson My company has implemented a bug tracker. Ours works well for what it does: documenting any anomaly so that it won't be forgotten or overlooked before releasing the product. The person entering the bug assigns a priority, but really the priority level only matters when the engineering team escalates the bug fixing because not all of the bugs can be addressed or fixed. Most of the FPGA bugs I deal with are in the core functionality of our products, so fixing them is really not optional.

I think Kevin's idea of engineering intuition and experience should dictate 90% of the bug fixing. That's because, in my experience, 90% of bug fixing has to do with how the product was defined from the get go, so it's obviously those issues that will be prioritized very highly. That doesn't mean there isn't a process; it's just within the group and not the company. The company should be included in the process, but only when the bug fixing process gets escalated, such as in cases where bugs can't be fixed or there are too many.
Posted on 2010-02-19 at 11:52:30
raysalemi I just finished the books Daemon and Freedom by Suarez, and so I'm on a sort of "Gaming brought to real life kick."

It's got me thinking that some of the mechanisms that are used to build social infrastructure in gaming systems such as World of Warcraft or Puzzle Pirates could also be used in a business environment.

For example, what if engineers could rank each other with a reputation score. You get a release from another engineer and it's got some bugs, you ding him a couple of points.

Or what if engineers could go up in levels à la D&D? You hit a deadline, you get some points. You get enough points, you go up a level. Everyone knows that a Level 5 coder is more valuable than a Level 2 coder.

Could a sort of distributed information system become a new corporate infrastructure? Or would it just become more priority hell as Kevin saw on his project?
Posted on 2010-02-20 at 18:21:01
raysalemi BTW. Great book on this topic: Debugging the Development Process by Maguire: http://bit.ly/9IE8lM
Posted on 2010-02-20 at 18:23:47
MichelT Great, nicely told story! I am not in the business of selling bug tracking systems, but we do deliver software for design quality monitoring and closure.
You can see our product as a means to turn all the Mbytes of logfiles and other EDA artifacts into easy-to-read and automated dashboards.
Our customers are generally aware of what they would lose by deploying quality checks and metrics that do not make sense or that irritate the engineering teams. But all the parties involved (engineering, project management, CAD & methodology, …) realize that productivity improvements come from formalizing the processes, documenting the practices, automating the reports, etc…

A key question for me is: how much is a company willing to invest in the definition of the right metrics, and how can they establish consensus?
The size factor is important here: while most large companies take the issue seriously, the smaller ones know that they will have to address the dilemma when they grow.

--Michel Tabusse, Satin IP Tech.
Posted on 2010-03-06 at 02:59:49
Cliff Engineers and process is a love-hate relationship. Sometimes it gets in the way, sometimes we hide behind it.

Good team leaders can use development processes (like bug tracking) to the advantage of the project or they can be wagged by it.

The key question when installing new or different processes and metrics is, "What problem are we trying to solve?" If you can't get a consensus answer to this, you will fail, but once you know what the problem is, then you can usually solve it in a non-invasive and effective way.


Cliff
Posted on 2010-04-12 at 10:27:59