
    Making your code base better will make your code coverage worse

By Samuel Alejandro · January 1, 2026 · 10 min read

“There are three kinds of lies: lies, damned lies, and statistics.” – popularly attributed to Mark Twain

    Certain widely used numbers and calculations can create a misleading impression without additional context. Consider Body Mass Index (BMI), often presented as a measure of healthy weight. Calculated from height and weight, a BMI of 25 or more is generally considered overweight. By this definition, an individual with the following characteristics would be classified as overweight:

    • Height: 6'0" (183 cm)
    • Weight: 230 lb (104 kg)
    • BMI: 31.2
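The number above comes straight from the standard imperial BMI formula (703 × weight in pounds ÷ height in inches squared), which is easy to verify:

```python
# BMI from imperial units: 703 * weight_lb / height_in**2
height_in = 72   # 6'0"
weight_lb = 230

bmi = 703 * weight_lb / height_in ** 2
print(round(bmi, 1))  # 31.2 -> "obese" by the usual cutoffs
```

Two inputs, one output, and no way to tell muscle from fat — which is exactly the article's point.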

    However, if more information were available—for instance, that this person earned a Super Bowl ring in 2025 as a running back for the Philadelphia Eagles—one might question the accuracy of BMI as a health indicator for this specific individual. This additional context could even lead to questioning the overall value of BMI.

    That person is Saquon Barkley, and these are his reported height, weight, and calculated BMI. It is reasonable to assert that Mr. Barkley is very healthy, and BMI is an inadequate metric for determining his health. While BMI may have some utility, it is a very crude measure on its own, failing to differentiate between muscle and fat—critical information for a more accurate assessment.

    This discussion extends to another metric that has recently gained significant traction in software development: code coverage.

Experience shows that enforcing a minimum of 80% code coverage can influence coding decisions, and not always for the better. Why has test coverage become a substitute for code quality? Researching the topic and running an experiment led to several general observations:

    1. Not all files, features, or applications possess equal value, yet code coverage tools typically treat them uniformly (without customization).
    2. Automated testing is not always the most cost-effective method for application testing.
    3. The default minimum code coverage of 80% is arbitrary.
    4. Making code more DRY (Don’t Repeat Yourself) can lead to a decrease in code coverage.
    5. The structure of code can enhance the accuracy of code coverage metrics, a point supported by verifiable proof using real data and code.
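Point 4 follows from simple arithmetic (the line counts here are invented for illustration): deduplicating code shrinks the denominator of the coverage ratio, so if the duplicated lines were the covered ones, the percentage drops even though the codebase improved.

```python
# Hypothetical file: 100 lines, of which 80 (all the duplicated,
# easy-to-test code) are covered -> exactly at an 80% gate.
covered, total = 80, 100
print(covered / total)            # 0.8

# After DRYing 40 duplicated, covered lines into a shared helper,
# the same 20 hard-to-test lines remain uncovered -> the file now
# fails the gate, despite being better code.
covered, total = 80 - 40, 100 - 40
print(round(covered / total, 3))  # 0.667
```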

Why has code coverage become the go-to metric for code quality?

    Code coverage has existed for many years, but its usage has expanded, seemingly overshadowing other metrics. It has evolved from a useful tool for identifying areas needing more testing into the primary determinant of code quality.

    A common assumption is that experts possess a quantified, validated, and tested reason for the widespread pursuit of minimum test coverage. However, scientific evidence directly linking code coverage to software quality remains elusive.

    Skepticism regarding over-reliance on code coverage has always been present. Martin Fowler discussed the possibility of achieving 100% code coverage with assertion-free tests over a decade ago. Even with valid concerns about assertion-free tests, other scenarios, such as teams inadvertently writing less effective tests, are arguably more common than widespread malicious intent to game coverage.
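Fowler's assertion-free-test scenario is easy to sketch. The function and test below are invented for illustration: the test executes every line of `validate_discount`, so a coverage tool reports 100%, yet it verifies nothing about the results.

```python
def validate_discount(percent):
    """Return a discount as a fraction; reject out-of-range values."""
    if percent < 0 or percent > 100:
        raise ValueError("discount must be between 0 and 100")
    return percent / 100

def test_validate_discount_assertion_free():
    # Exercises both branches -> 100% coverage of validate_discount,
    # but contains no assertions, so any return value would "pass".
    validate_discount(50)
    try:
        validate_discount(150)
    except ValueError:
        pass
```

If `validate_discount` started returning `percent * 100` tomorrow, this test would still pass.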

    Code coverage metrics can be useful for identifying significant gaps when writing tests. While high coverage alone does not guarantee code quality, low coverage serves as a warning sign. When automated, this metric can prevent developers from merging pull requests where any file falls below a set minimum threshold, acting as a rough gatekeeper for code quality.

    The increased prominence of code coverage is largely due to numerous software vendors marketing and promoting tools that automate quality measurement using code coverage. A software team can integrate these tools into a pipeline, where they act as gatekeepers, preventing merges if files fall below the minimum threshold. The adoption of these tools is growing.

    In essence, determining code quality is complex, whereas purchasing a tool that automates and measures unit tests is straightforward.

Not all code is of equal value, but code coverage treats it all equally

    Discussions on code coverage range from critiques of its flaws to arguments for 100% coverage. While some advocate for complete coverage, indicating strong faith in the metric, a balanced view suggests it has some value despite being a crude measurement.

    Many articles on this topic adopt a developer-centric perspective, often overlooking practical considerations like time and cost, and assuming pristine, greenfield codebases. These articles rarely address the cost of writing tests, the risk management associated with specific bug types, or the return on investment.

    One article cited by Fowler, “How to Misuse Code Coverage,” dates back to 1997.

    When code coverage is the primary (or sole) metric for gauging test coverage, every file is assessed identically: does it meet the minimum threshold? These discussions often neglect the implications of adding code coverage to an existing codebase that has been in use for years, a common scenario today.

    There is no consideration for tasks beyond testing, nor any prioritization of features. The actual function of the code and its delivered value are ignored. Financial assessments of potential feature failures are absent. Every file is treated as having equal value, with each requiring 80% code coverage. This applies equally to a file encrypting user data and one allowing a user to upload a profile image.

    When implementing a code coverage tool on an existing codebase, it is advisable to analyze the results before indiscriminately applying a minimum threshold to every file. It is highly probable that in any application, some code merits nearly 100% coverage, while other parts warrant significantly less than 80%. Imposing a uniform percentage across all files can lead to developers writing tests for low-value features, potentially neglecting the remaining 20% of higher-value features.

    Some tools offer configuration options for different thresholds across various directories within an application. Many teams could achieve greater value by first manually identifying the most critical parts of their software and applying the minimum threshold only to those areas. However, in practice, the default coverage across the entire codebase is frequently observed.
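The idea of per-area thresholds can be sketched in a few lines. Everything below is hypothetical — the paths, thresholds, and measured percentages are invented — but it shows the shape of the approach: critical directories get a high bar, cosmetic ones a low bar, and everything else a modest default, instead of one flat 80% everywhere.

```python
# Invented per-directory minimums: critical code gets a higher bar.
THRESHOLDS = {
    "src/payments/": 95.0,  # handles money
    "src/auth/":     90.0,  # handles credentials
    "src/themes/":   40.0,  # cosmetic, low risk
}
DEFAULT_THRESHOLD = 60.0

def required_threshold(path: str) -> float:
    """Minimum coverage required for a file, by matching path prefix."""
    for prefix, threshold in THRESHOLDS.items():
        if path.startswith(prefix):
            return threshold
    return DEFAULT_THRESHOLD

def failing_files(measured: dict) -> list:
    """Files whose measured coverage falls below their required minimum."""
    return [p for p, pct in measured.items() if pct < required_threshold(p)]

measured = {
    "src/payments/charge.py": 92.0,   # below its 95% bar -> fails
    "src/themes/dark_mode.py": 35.0,  # below its 40% bar -> fails
    "src/util/strings.py": 61.0,      # above the 60% default -> passes
}
print(failing_files(measured))
```

Some real coverage tools (Jest's `coverageThreshold`, for instance) support per-path configuration along these lines; the point is that the capability usually exists but the flat default is what ships.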

    Many projects adopt these tools without considering the product itself. A distinction should be made when using code coverage for software operating a medical device versus a mobile dating app, or for a legacy application with millions of paying users versus a startup’s minimum viable product. In all these cases, a single value, 80%, is often applied.

    Applications contain features whose failure would have minimal consequences for users, liability, or the bottom line, alongside others that could be catastrophic. A more intelligent approach to code coverage tools is possible. Consider alternatives to blindly applying the default threshold to every project and file.

Automated testing is not always the most cost-effective way

“Never spend six minutes doing something by hand when you can spend six hours failing to automate it.” – Zhuowei Zhang

    In line with the theme of return on investment, it is important to remember that automated code-based tests are not the sole method for testing features or code. For web-based applications, tools like Selenium and Cypress are available. Mobile or desktop applications also offer alternative testing options. These tools typically do not measure code coverage.

    There is also the traditional approach, where a human user manually verifies the functionality of a feature or code. Consider the time required to write automated tests for a specific feature set versus the time it would take to manually test the same functionality.

    For example, if automated tests can be written in four hours (240 minutes), and manual testing of the same features takes 20 minutes, a very rough calculation suggests a potential return on investment after 12 releases or deployments of that software.

    Other costs, such as the expense of the person performing manual tests, the potential for human error versus code reliability, and cloud costs for running tests in a pipeline, are intentionally not factored in here. Even without these values, it is clear that automated tests offer benefits through economies of scale in this scenario. Depending on the feature and the critical nature of the application, it is common for testing to involve both automated and manual methods.

    Sometimes, a particular feature is exceptionally difficult to automate tests for, making manual testing less time-consuming, even when considering scale. Instances exist where writing the test was significantly more challenging than writing the code itself. The same logic applies: imagine an automated test requiring 16 hours (two days, or 960 minutes) of effort to write, compared to five minutes for manual testing of the same feature. In this case, 192 deployments would be needed to see a return on investment from the automated test.
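Both break-even figures fall out of the same one-line calculation (deliberately simplified, as above, to ignore tester cost, pipeline cost, and human error):

```python
import math

def break_even_releases(automation_minutes: int, manual_minutes: int) -> int:
    """Releases needed before time spent writing the automated test
    is repaid by skipped manual test runs."""
    return math.ceil(automation_minutes / manual_minutes)

print(break_even_releases(240, 20))  # 12  (the first example)
print(break_even_releases(960, 5))   # 192 (the hard-to-automate feature)
```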

    Much sub-optimal code exists. Automated tests are only as effective as the code they validate. The entire test strategy is only as valuable as the features it validates. Adding a code coverage tool to a sub-optimal codebase with the expectation that it will magically improve application quality is unlikely to succeed. It is also likely to make it considerably harder for development teams to improve the code.

    The way code is structured strongly correlates with the accuracy of code coverage data, a point for which verifiable proof is provided later in this article.

    Where does the threshold of 80% come from?

    An obligatory search for scientifically-proven reasoning behind the default 80% threshold yielded no results.

However, credible speculation ties the 80% to the Pareto principle. That would seem to make sense, until one considers what the 80% in the Pareto principle actually signifies.

    The Pareto principle states that “roughly 80% of consequences come from 20% of causes.” An example of the Pareto principle applied to testing is: “80% of complaints come from 20% of recurring issues.”

    There are numerous ways to apply the Pareto principle to testing. It could involve analyzing 80% of complaints, identifying the corresponding code, and increasing code coverage and quality in those areas. Alternatively, it might mean identifying the 20% of code that delivers the greatest value to customers and allocating 80% of resources and efforts there. Properly applied, this would mean dedicating more time to testing critical functions like “how a failed transaction is handled” and less time validating less critical aspects like “does dark mode work?”

    This is not what enforcing 80% code coverage achieves. It assigns equal value to each line of code in every file. This means if one file contains 200 lines of code validating a user’s credit card, and another 200 lines allow users to change their default appearance to dark mode, both files will require 80% coverage. While quantifying the effort for maintaining functionality differs greatly between the two, it is clear that one piece of code holds significantly more value.

    Using 80% from the Pareto principle as the default minimum threshold appears to be a misunderstanding, a gross misapplication, and frankly, quite absurd. The only thing more ridiculous than misunderstanding it and making it a universal default for code coverage is blindly trusting code coverage as the ultimate metric for measuring code quality.

Still, the most plausible explanation is that the widespread use of 80% stems from the Pareto principle: someone encountered it, misunderstood its meaning, found the number appealing, and made it the default.

Additionally, 80% is often the minimum threshold for a B grade in many educational systems. While no one aims for C-level code, requiring a B as the minimum everywhere can feel like an arbitrary school grading scale applied to engineering work.
