Is Code Coverage Really All That Useful?

Test-driven development proponents often push code coverage as a useful metric for gauging how well tested an application is. 100% code coverage has long been the ultimate goal of testing fanatics. But is code coverage really all that useful? If I told you that my application has 100% code coverage, should that mean anything to you?

What does code coverage tell us?

Code coverage tells us which lines in our application are executed by our unit tests.  For example, the code below has 50% code coverage if the unit tests only call Foo with condition = true:

string Foo(bool condition)
{
    if (condition)
        return "true";
    else
        return "false";
}
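
To make that concrete, here is a minimal sketch of what such a test suite might look like (the NUnit-style fixture and test name below are hypothetical):

using NUnit.Framework;

[TestFixture]
public class FooTests
{
    // This is the only test, so Foo is never called with condition = false
    // and a coverage tool reports the else branch as uncovered.
    [Test]
    public void Foo_ReturnsTrue_WhenConditionIsTrue()
    {
        Assert.AreEqual("true", Foo(true));
    }

    static string Foo(bool condition)
    {
        if (condition)
            return "true";
        else
            return "false";
    }
}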

What does code coverage not tell us?

Code coverage does not tell us what code is working and what code is not.  Again, code coverage only tells us what was executed by our unit tests, not what executed correctly.  This is an important distinction.  Just because a line of code is executed by a unit test does not necessarily mean that it is working as intended.

For example, the following code could have 100% code coverage and pass all unit tests as long as it is never called with b = 0.  However, once this code is out in the wild it could very well crash with a DivideByZeroException:

int Foo(int a, int b)
{
    return a / b;
}
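
A single passing test like the hypothetical one below executes every line of Foo, so a coverage tool reports 100% while the zero-divisor case goes completely unexamined:

using NUnit.Framework;

[TestFixture]
public class FooDivisionTests
{
    // Every line of Foo executes and the assertion passes, so coverage
    // is 100% -- yet Foo(10, 0) would still throw DivideByZeroException.
    [Test]
    public void Foo_DividesTwoNumbers()
    {
        Assert.AreEqual(5, Foo(10, 2));
    }

    static int Foo(int a, int b)
    {
        return a / b;
    }
}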

So what is code coverage good for then?

To borrow an analogy from Scott Hanselman's interview with Quetzal Bradley, imagine you are a civil engineer responsible for testing a newly constructed series of roads.  To test the roads, your first thought might be to drive over them in your car, making sure that there are no potholes, missing bridges, etc.  After driving over all of the roads a few times, you might conclude that they have been tested and are ready for public use.  But once you open the roads to the public, you discover that the bridge overhangs are too low for big rigs, the turns are too sharp for sports cars, and that certain areas of the roads flood when it rains. 

In the above scenario, you had the equivalent of 100% code coverage since you had driven over all the roads, but you only superficially tested their behavior.  Specifically, you didn't test the roads in different vehicles and under different weather conditions.  So although you went through each possible execution path, you failed to account for different states while doing so.

In light of this, the only solid conclusion you can draw from code coverage is which lines of your code have definitely not been tested.  The lines that have been executed are still up for grabs unless you are willing to go through each and every possible state the application can be in when executing them.  This makes code coverage far less useful as a metric: it tells you what still needs testing, but offers no help in determining when you are done testing.

What *is* a good metric then?

Unfortunately, there doesn't seem to be a good metric for determining whether a line of code has been thoroughly tested or when a developer is done testing.  Perhaps this is a good thing, as it keeps us from falling into complacency.  It simply isn't feasible in even a moderately complex application to test each and every line of code under every possible circumstance.  The best you can do is test the most common scenarios and reasonable edge cases, then add additional tests as functionality inevitably breaks in the scenarios you didn't account for.  It's an admittedly clumsy system, but it's a realistic one compared to depending on 100% code coverage to weed out all possible bugs.  That's not to say there isn't value in achieving 100% code coverage.  Executing the code in one particular state still has value, just not as much as developers seem to give it.
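
In practice, that might look something like the following sketch: the common scenario is tested up front, and the divide-by-zero edge case from earlier gets a regression test once it surfaces (the test names and the decision to let the exception propagate are assumptions for illustration):

using System;
using NUnit.Framework;

[TestFixture]
public class FooRegressionTests
{
    // The common scenario, written before the code ships.
    [Test]
    public void Foo_DividesTwoNumbers()
    {
        Assert.AreEqual(5, Foo(10, 2));
    }

    // A reasonable edge case, added after the divide-by-zero bug was
    // reported in the wild; it pins down the intended behavior.
    [Test]
    public void Foo_ThrowsOnZeroDivisor()
    {
        Assert.Throws<DivideByZeroException>(() => Foo(10, 0));
    }

    static int Foo(int a, int b)
    {
        return a / b;
    }
}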

As always, I'm very interested to hear your thoughts and observations on this.  Please leave them in the comments below.


Kevin Pang

Kevin is a software engineer at Google whose programming interests revolve around web development, software architecture, and design. When he's not writing software, he enjoys watching Jeopardy!, playing Magic: The Gathering, and wandering around Disneyland.


36 thoughts on “Is Code Coverage Really All That Useful?”

  1. Code coverage is just one measure of the quality of your unit testing efforts; it certainly isn’t the only one. You can’t beat a code review of the tests to discover cases that are not being taken into account, like your example of the divide-by-zero problem.

  2. Code coverage is really just a statistic for management:
    Industry experience suggests that the design of metrics will encourage certain kinds of behaviour from the people being measured. The common phrase applied is "you get what you measure" (or "be careful what you wish for"). (http://tinyurl.com/6pclm7)

  3. The point of code coverage is not to verify that the system works; that’s just what the system tests themselves do.

    Coverage is there to tell you, most importantly, what code to test and what code to rip out. You look at the coverage and find the code that doesn’t run during tests, then you inspect that code and formulate the best tests for it.

    The analogy is way off; it’s more like "you fully test that the bridge can handle the required weight, but only test one side of it".

  4. @Dale

    I think you’re saying the same thing I’m saying: that code coverage is only really good for telling you what code isn’t tested.

    However, just because code is covered doesn’t mean that it’s actually *tested*. It has been executed by a test, true, but the actual correctness of the code is not a given just because of that.

    The problem is that many developers and managers assume that if a line of code has been covered, then it is tested and should function correctly, when in reality that may be far from the truth.

  5. yup, but anything that tells me what code to test and where dead code is, is very useful to me :) I guess it was just the wording of the headline.

    the iterative method of writing a few tests, then adding a test for every bug you find, works quite well for most practical purposes.

    but I think in the end the biggest problem with testing comes during the specification process, where things are typically hard to specify or vaguely specified. The closer your tests are to your specification, the easier they are to write; and the more complete your spec is, the more complete your tests are.

  6. @Dale

    Yes, I think the title may be a bit more inflammatory than what I intended. ;)

    You are correct. As I mentioned in the post, there is value in knowing what has not been tested. But not nearly as much value as knowing what is tested and working. The point I was attempting to make was that developers and managers often use code coverage for the latter instead of the former.

  7. [quote]This makes code coverage far less useful as a metric[/quote]

    Less useful than what? Maybe code coverage is more useful when you don’t consider it a metric per se, but rather a tool to glean information about what your tests are actually testing?

    During development, I find looking at my tests’ code coverage very useful in ensuring that each test I write is actually testing what I intend it to, and also for prompting me on where to test next. Seeing that a particular branch was not followed, for example, is very valuable while you are still in test development mode.

  8. @Nick Pellow

    Less useful than a metric for gauging code correctness.

    I agree that code coverage is useful in showing what has not been tested. What it fails to do, yet what so many developers seem to use it for, is show what code is working as intended. I think seeing that a line of code has been covered by a unit test tends to give a developer a false sense of confidence in the code’s correctness.

  9. If only there existed a single "metric for gauging code correctness" …

    I understand your point that a line of code marked as covered gets interpreted as ‘tested’ by many developers. Clover (a tool I work on) tries to make it easier to qualify each line of coverage by showing you which tests covered each line.

    Also, coverage resulting from code run outside a test (called "incidental coverage", e.g. during setUp() and tearDown()) is rendered differently.

    A blog post on per-test coverage is here: http://blogs.atlassian.com/developer/2007/10/my_tests_touched_what.html

  10. I just attended a conference talk by a Yahoo! manager who said that he didn’t use code coverage as a metric, but preferred the number of tests. Naturally, he has to check the content of the tests from time to time to detect "I develop for the metrics" behaviors ;o)

  11. I think the point is not that code coverage is useless, but rather that assessment of code quality needs to be done by a human.
    Code metrics, like cyclomatic complexity, number of tests, ratio of code to comments, lines of code, and code coverage, will point out that something is obviously wrong, but they can’t distinguish between good code and code written to fudge these metrics.
    At the end of the day, you still need to have code reviews done by humans.

  12. @David Kemp

    Yes, I think you hit the nail on the head there. Perhaps there isn’t a good way to automatically check whether the code we write is safe because it simply can’t be done. Another case of looking for a silver bullet when none exists, I’m afraid.

  13. I think there has been some general confusion about the point of code coverage and its usefulness in the real world. When explaining this concept to management or other non-dev types, it may be perceived as a "metric for gauging code correctness." The act of unit testing, under TDD, is more than anything a [b]design activity[/b]. Coupled with Continuous Integration, this activity supports Refactoring. Therefore, I believe code coverage is more useful when thought of as a metric for how well your system can respond to change.

  14. Coverage indicates an absence rather than a positive – while 100% coverage doesn’t guarantee anything, 0% coverage guarantees untested code.
    The issue is that quantity is not sufficient; the ideal metric would distinguish test quality as well.
    I have never found the time to try it on real code, but I found the idea below very interesting in that respect: a good test suite should be sensitive and tightly reflect your requirements/assumptions; therefore, if you mutate your code, your tests should fail. So you could measure your tests’ quality by mutating the code and looking at how much failure you create.
    http://msdn.microsoft.com/en-us/magazine/cc163619.aspx
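
    A minimal hand-rolled sketch of the idea (real tools automate this by mutating the compiled code; the class and test below are made up):

    using NUnit.Framework;

    public static class Pricing
    {
        // Original logic under test: orders of 100 or more get 10% off.
        public static decimal Discount(decimal total)
        {
            return total >= 100m ? total * 0.9m : total;
        }

        // A "mutant": >= changed to >. A sensitive test suite should kill it.
        public static decimal DiscountMutant(decimal total)
        {
            return total > 100m ? total * 0.9m : total;
        }
    }

    [TestFixture]
    public class PricingTests
    {
        // This boundary test passes against Discount but would fail against
        // DiscountMutant, so the mutation is detected ("killed"). A suite
        // that only tested Discount(200m) would cover every line yet let
        // the mutant survive.
        [Test]
        public void Discount_AppliesAtExactly100()
        {
            Assert.AreEqual(90m, Pricing.Discount(100m));
        }
    }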

  15. You’re absolutely right. I think anyone using a continuous integration tool or with code coverage standards in play has realized the uselessness of the metric. [b]However[/b], what I’ve observed is that when [b]teams[/b] adopt it themselves (rather than having "management" impose it because it was read about somewhere), it helps create an even greater sense of collaborative code ownership and accountability to the product. Yes, the metric is irrelevant; test/code quality is all that truly matters, and this metric does nothing to guarantee any level of quality. BUT it starts a discussion, and if self-policed by the team it can at least be one more pillar in the team agreement, helping convey (if not actually deliver) the message that we all need to make an effort to focus on quality, even if the metric is more symbolic than anything else.

    That being said, if you do have good team leads whose incentives or motivation are driven by a real sense of quality (as measured by the users), then you have a chance of having champions for the cause utilizing that tool for positive effect.

  16. I look at coverage as part of the process, not as an end result. Looking at the coverage stats tells me what code is not yet under test. I still do the occasional code review, as well as use analysis tools such as FxCop and NDepend.

  17. I agree, code coverage isn’t all that useful ( http://blog.schauderhaft.de/2008/10/20/code-coverage-what-is-it-good-for/ ). But I would go further: while it is a good idea to keep coverage high, aiming at 100% isn’t a good idea at all. One metric that would be useful is the probability of finding a bug with a specific test. Although impossible to calculate, it is easy to see that this value might be much higher for an additional test of a complex method that is already 100% covered than for a test of a simple getter with 0% coverage.

  18. Considering that a lot of projects have token amounts of code coverage (less than 20%, if they have any at all), it’s kind of irrelevant whether some of those tests are rubbish.

    JUnit testing as an approach has failed to reach enough developers for it to be considered essential. If you had 100% code coverage you would be so far ahead of everyone else that when these problems cropped up you would have a place to put the test. So yes, code coverage isn’t the end answer, but if you have less than 50% you have already failed to unit test the code well.

    Code coverage may be used as a stick by managers, but the truth is that a good in-IDE tool that shows you coverage is a great way to remind you of something you might have missed, and a useful input to reviews of other developers’ code. But equally, 100% isn’t essential either.

  19. Like any metric, code coverage has multiple uses. And each observer will find some of these uses more valuable than others.

    Clearly though, there is agreement that identifying code that isn’t covered is useful. Because that tells you with 100% certainty that something hasn’t been tested. And you can then use that information to make informed decisions. Maybe you decide that it’s not important to test that code. Maybe you realize there is a huge hole in your tests and you need to write more tests. The point is that without coverage tools, you would have to guess to make those decisions.

    But is the knowledge that a line of code is covered useful?

    Knowing that there is at least one successful execution path through the code is essentially a smoke test. Smoke tests have value, though they should never be considered sufficient.

    Using coverage metrics as a reporting tool to management can be useful if there is resistance to funding test development efforts. (Especially if you can correlate defect counts on projects with low coverage to defect counts on projects with higher coverage).

    Using coverage metrics as a motivational tool for developers can be useful.

    And remember that coverage tools have come a long way, but there is still a lot of potential for future enhancement. Tracking the number of times a line of code was executed, tracking whether boundary values passed through lines of code, tracking the quality of the tests that passed through lines of code, etc. These are all future avenues to improve the value of code coverage metrics.

    And all of the above ignores the entire "how much coverage is enough" debate. That debate tends to distract people from the fact that having a tool to actually measure your real code coverage, regardless of what % it ends up being or what % you strive for, is the biggest value. Because that moves you on the road toward making informed, rather than speculative, decisions about your testing practices.

  20. I wrote a parser generator some years back that generated code to check for code coverage, so that when I wrote the test suite to check all grammar rules, I could be sure I hadn’t left any out. It was easy to add and quite helpful in regression testing. This isn’t a be-all-and-end-all technique for debugging, but it’s a useful addition to your repertoire.

  21. Show me a metric that’s adopted by management to "improve" teams, and I’ll show you a team that games the metric, with little or no improvement.

    Code coverage, as many point out, is a wonderful way to find untested code, but does not prove that the code is tested appropriately. However, with TDD, you don’t have to measure coverage, because you’re not testing to meet a metric; you’re writing tests as a way to specify what code you’re going to write. The tests are the requirements. Ergo, your code is, by definition, sufficiently tested. If your code passes all the tests, it’s also correct. If its coverage is less than 100%, you’ve found waste, and you need to refactor.

    The above is a gross simplification, but the point is that test sufficiency isn’t measurable by code coverage, and that claim is seldom made anyway. My concern with the headline is precisely that it’s provocative. It’s questioning a claim that was not put forward, and is therefore, unfortunately, rhetorical.

  22. The most important thing the code coverage metric doesn’t measure is the code not written. It doesn’t test for missing features, missing corner cases, or (usually) complex interactions between different parts of the system.

    I don’t mean to say that you shouldn’t measure code coverage if you’re in a position to do so at a cost that represents value – just be aware of the Law of Diminishing Marginal Returns (http://en.wikipedia.org/wiki/Diminishing_returns), and realize the limitations inherent in the metric.

    If you find yourself striving for those extra percents, you’re probably focusing on the wrong thing – and your efforts would be better spent elsewhere.

    Or as the philosopher George Santayana put it: "Fanaticism consists in redoubling your effort when you have forgotten your aim."

  23. I’ve seen — and admittedly created — enough lame unit tests to realize that code coverage is a useless measure. Unit testing should be about quality, not quantity. Tests need to cover boundary cases and use random inputs when possible to be of any use in catching errors that a BDEU (brain-dead end user) can easily invoke.

  24. Agree with the main point here – all code coverage tells you for certain is how much of your code has *never* been run, not how much actually works. I’ve worked on projects that have bragged about 90% code coverage – I’d much rather people be more honest, and say there’s 10% out there we know nothing at all about.

  25. Will people ever stop asking meaningless questions to bait unsuspecting readers into reading whatever happened to enter the author’s mind one afternoon?

  26. Not to be rude in any way, and I haven’t read all the comments on your page, but you are beating a straw man here.

    Your opponents’ assumed argument:
    100% code coverage is the reliable measurement.

    Your argument:
    Proof that 100% line coverage does not fulfill the above-stated argument.

    Conclusion:
    Code coverage is not useful

    Of course tests are useful; it’s a matter of using them right. To stretch our analogy: the engineer should run a ten-ton truck with worn-down tires over the bridge being tested.

    Your post is, truth be told, a bit naive, and it seems that you are a bit irritated with the fanboys of code coverage.

    Testing is of course useful.

    No silver bullet.

  27. @Comment

    Not rude at all. In fact, I would agree with you if not for one thing: the strawman isn’t really a strawman — to me at least — since I’ve had to deal with customers and management who have said exactly that.

    If it seems like I’m arguing against a silly assumption, it’s because I am and it is. Unfortunately, it’s an argument that needs to be made, because a lot of developers and managers out there *do* see code coverage as a silver bullet. Maybe I’m preaching to the choir. :-P

  28. Hey, a new approach. But we need to use coverage tools to get rid of miss counts so we do not miss the optimal paths essential for covering certain code during testing. We need to get experience from rigorous manual testing first in order to use code coverage effectively.

  29. I remember from university that there are other metrics related to data paths/data flow that could be useful for measuring whether the corner cases are "covered" (boundary tests, data equivalence partitions, loops, etc).

  30. I don’t think any rational person would see code coverage as the be-all-end-all or a silver bullet. It’s a great metric, but only 1 metric of MANY for determining quality. CI, functional, and system testing need to happen in conjunction with unit testing, and those unit tests should be code reviewed by humans to make sure they are useful tests. 100% code coverage is useless if the tests suck.

  31. As Mathias said (10/29/2008): "a good test suite should be sensitive and tightly reflect your requirements/assumptions; therefore, if you mutate your code, your tests should fail. So you could measure your tests’ quality by mutating the code and looking at how much failure you create."

    I like to use mutation testing to measure the effectiveness/quality of my JUnit test suites.
    It is more demanding than code coverage…

    Jumble (http://jumble.sourceforge.net) is a fast and effective tool that gives you a 0..100% score for each of your Java source code classes, by applying one or two simple mutations to each command and expression in the class and seeing if your tests are strong enough to detect the mutation. It works by mutating the bytecode directly.

    The current release of Jumble (June 2007) works with JUnit 3.8.1, but the SVN repository version works with JUnit 4.4, as will the real-soon-now release.
