
The Testing Pyramid

The “Testing Pyramid” is an industry-standard guideline for functional test case development. Love it or hate it, the Pyramid has endured since the mid-2000s because it continues to be practical. So, what is it, and how can it help us write better tests?

Layers

The Testing Pyramid has three classic layers:

  • Unit tests are at the bottom. Unit tests directly interact with product code, meaning they are “white box.” Typically, they exercise functions, methods, and classes. Unit tests should be short, sweet, and focused on one thing/variation. They should not have any external dependencies – mocks/monkey-patching should be used instead.
  • Integration tests are in the middle. Integration tests cover the point where two different things meet. They should be “black box” in that they interact with live instances of the product under test, not code. Service call tests (REST, SOAP, etc.) are examples of integration tests.
  • End-to-end tests are at the top. End-to-end tests cover a path through a system. They could arguably be defined as multi-step integration tests, and they should also be “black box.” Typically, they interact with the product like a real user. Web UI tests are examples of end-to-end tests because they need the full stack beneath them.

All layers are functional tests because they verify that the product works correctly.
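To make the bottom layer concrete, here is a minimal sketch of a unit test that replaces an external dependency with a mock. The function `get_display_name` and the `fetch_user` service method are hypothetical names invented for this illustration; only `unittest.mock.Mock` comes from the Python standard library.

```python
from unittest.mock import Mock


def get_display_name(user_id, user_service):
    """Hypothetical product code: format a name fetched from a service."""
    record = user_service.fetch_user(user_id)
    return f"{record['first']} {record['last']}".strip()


def test_display_name_joins_first_and_last():
    # The mock stands in for the real service, so no network is needed.
    service = Mock()
    service.fetch_user.return_value = {"first": "Ada", "last": "Lovelace"}

    assert get_display_name(42, service) == "Ada Lovelace"
    # Verify the product code asked the dependency for the right user.
    service.fetch_user.assert_called_once_with(42)
```

The test calls product code directly (white box), checks one variation, and never touches a live system. A test runner such as pytest would discover and run the `test_` function.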

Proportions

The Testing Pyramid is triangular for a reason: there should be more tests at the bottom and fewer tests at the top. Why?

  1. Distance from code. Ideally, tests should catch bugs as close to the root cause as possible. Unit tests are the first line of defense. Simple issues like formatting errors, calculation blunders, and null pointers are easy to identify with unit tests but much harder to identify with integration and end-to-end tests.
  2. Execution time. Unit tests are very quick, but end-to-end tests are very slow. Consider the Rule of 1’s for Web apps: a unit test takes ~1 millisecond, a service test takes ~1 second, and a Web UI test takes ~1 minute. If test suites have hundreds to thousands of tests at the upper layers of the Testing Pyramid, then they could take hours to run. An hours-long turnaround time is unacceptable for continuous integration.
  3. Development cost. Tests near the top of the Testing Pyramid are more challenging to write than ones near the bottom because they cover more of the system. They’re longer. They need more tools and packages (like Selenium WebDriver). They have more dependencies.
  4. Reliability. Black box tests are susceptible to race conditions and environmental failures, making them inherently more fragile. Recovery mechanisms take extra engineering.

The total cost of ownership increases when climbing the Testing Pyramid. When deciding the level at which to automate a test (and whether to automate it at all), taking a risk-based strategy to push tests down the Pyramid is better than writing all tests at the top. Each proportionate layer mitigates risk at its optimal return-on-investment.
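The execution-time argument can be checked with quick arithmetic. The sketch below applies the Rule of 1’s figures to a hypothetical suite of 1,000 tests placed at each layer. The test counts are an assumption made for illustration, and the estimate assumes tests run serially.

```python
# Rule of 1's: approximate seconds per test at each layer.
SECONDS_PER_TEST = {"unit": 0.001, "service": 1.0, "web_ui": 60.0}


def estimate_runtime_seconds(test_counts):
    """Return the estimated serial runtime in seconds for each layer."""
    return {
        layer: count * SECONDS_PER_TEST[layer]
        for layer, count in test_counts.items()
    }


# Hypothetical suite: the same number of tests at every layer.
estimates = estimate_runtime_seconds(
    {"unit": 1000, "service": 1000, "web_ui": 1000}
)
for layer, seconds in estimates.items():
    print(f"{layer}: {seconds / 60:.2f} minutes")
```

Under these assumptions, 1,000 unit tests finish in about a second and 1,000 service tests in about 17 minutes, while 1,000 Web UI tests need roughly 1,000 minutes, which is nearly 17 hours.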

Practice

The Testing Pyramid should be a guideline, not a hard rule. Don’t require hard proportions for test counts at each layer. Why not? Arbitrary metrics cause bad practices: a team might skip valuable end-to-end tests or write needless unit tests just to hit numbers. W. Edwards Deming would shudder!

Instead, use loose proportions to foster better retrospectives. Are we covering too many input combos through the Web UI when they could be checked via service tests? Are there unit test coverage gaps? Do we have a pyramid, a diamond, a funnel, a cupcake, or some other wonky shape? Each layer’s test count should be roughly an order of magnitude smaller than the layer beneath it. Large Web apps often have 10K unit tests, 1K service tests, and a few hundred Web UI tests.
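For a retrospective, a team could compute its layer ratios with a few lines of code. This is only a conversation aid, not a gate. The shape labels and the sample count of 300 Web UI tests are assumptions chosen for illustration, and the ratio helper assumes every layer has at least one test.

```python
def layer_ratios(unit, service, web_ui):
    """Return how many times larger each layer is than the one above it."""
    return {
        "unit_to_service": unit / service,
        "service_to_web_ui": service / web_ui,
    }


def rough_shape(unit, service, web_ui):
    """Give a loose label for discussion; the labels are assumptions."""
    if unit > service > web_ui:
        return "pyramid"
    if service > unit and service > web_ui:
        return "diamond"
    if web_ui > service > unit:
        return "inverted pyramid"
    return "irregular"


print(layer_ratios(10_000, 1_000, 300))
print(rough_shape(10_000, 1_000, 300))
```

The output is a starting point for questions like the ones above, not a target to hit.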


BDD 101: Unit, Integration, and End-to-End Tests

There are many types of software tests. BDD practices can be incorporated into all aspects of testing, but BDD frameworks are not meant to handle all test types. Behavior scenarios are inherently functional tests – they verify that the product under test works correctly. While instrumentation for performance metrics could be added, BDD frameworks are not intended for performance testing. This post focuses on how BDD automation works into the Testing Pyramid. Please read BDD 101: Manual Testing for manual test considerations. (Check the Automation Panda BDD page for the full table of contents.)

The Testing Pyramid

The Testing Pyramid is a functional test development approach that divides tests into three layers: unit, integration, and end-to-end.

  • Unit tests are white-box tests that verify individual “units” of code, such as functions, methods, and classes. They should be written in the same language as the product under test, and they should be stored in the same repository. They often run as part of the build to indicate immediate success or failure.
  • Integration tests are black-box tests that verify integration points between system components work correctly. The product under test should be active and deployed to a test environment. Service tests are often integration-level tests.
  • End-to-end tests are black-box tests that test execution paths through a system. They could be seen as multi-step integration tests. Web UI tests are often end-to-end-level tests.

Below is a visual representation of the Testing Pyramid:

[Figure: The Testing Pyramid]

From bottom to top, the tests increase in complexity: unit tests are the simplest and run very fast, while end-to-end tests require lots of setup, logic, and execution time. Ideally, there should be more tests at the bottom and fewer tests at the top. Test coverage is easier to implement and isolate at lower levels, so fewer high-investment, more-fragile tests need to be written at the top. Pushing tests down the pyramid can also mean wider coverage with less execution time. Different layers of testing mitigate risk at their optimal returns-on-investment.

Behavior-Driven Unit Testing

BDD test frameworks are not meant for writing unit tests. Unit tests are meant to be low-level, program-y tests for individual functions and methods. Writing Gherkin for unit tests is doable, but it is overkill. It is much better to use established unit test frameworks like JUnit, NUnit, and pytest.

Nevertheless, behavior-driven practices still apply to unit tests. Each unit test should focus on one main thing: a single call, an individual variation, or a specific input combo. In other words, each test should cover one behavior. Furthermore, in the software process, feature-level behavior specs draw a clear dividing line between unit and above-unit tests. The developer of a feature is often responsible for its unit tests, while, for accountability, a separate engineer is responsible for integration and end-to-end tests. Behavior specs carry a gentleman’s agreement that unit tests will be completed separately.
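As a rough illustration of behavior-driven unit testing without Gherkin, the sketch below gives each behavior its own test with a descriptive name. The `shipping_cost` function and its pricing rules are hypothetical product code made up for this example. The tests use plain `assert` statements, so a runner such as pytest could collect them.

```python
def shipping_cost(weight_kg):
    """Hypothetical product code: flat rate up to 1 kg, then a surcharge."""
    if weight_kg <= 0:
        raise ValueError("weight must be positive")
    if weight_kg <= 1:
        return 5.0
    return 5.0 + 2.0 * (weight_kg - 1)


# Each test covers one equivalence class of inputs, i.e. one behavior.
def test_light_package_gets_flat_rate():
    assert shipping_cost(0.5) == 5.0


def test_heavy_package_pays_per_kg_surcharge():
    assert shipping_cost(3) == 9.0


def test_non_positive_weight_is_rejected():
    try:
        shipping_cost(0)
    except ValueError:
        return
    raise AssertionError("expected ValueError")
```

The test names read like short behavior statements, which keeps the one-test-one-behavior discipline visible even in a plain unit test framework.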

Integration and End-to-End Testing

BDD test frameworks shine at the integration and end-to-end testing levels. Behavior specs expressively and concisely capture test case intent. Steps can be written at either integration or end-to-end levels. Service tests can be written as behavior specs like in Karate. End-to-end tests are essentially multi-step integration tests. Note how a seemingly basic web interaction is truly a large end-to-end test:

Given a user is logged into the social media site
When the user writes a new post
Then the user's home feed displays the new post
And all friends' home feeds display the new post

Making a simple social media post involves web UI interaction, backend service calls, and database updates all in real time. That’s a full pathway through the system. The automated step definitions may choose to cover these layers implicitly or explicitly, but they are nevertheless covered.
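To show how step definitions might sit behind such a scenario, here is a self-contained sketch. It does not use a real BDD framework: the `step` decorator and `run_scenario` runner are hand-rolled stand-ins, and `FakeSocialSite` is an in-memory substitute for the web UI, services, and database. All names and the sample users are assumptions for illustration. A real framework would also match steps more flexibly than the exact-text lookup used here.

```python
STEPS = {}


def step(text):
    """Register a function as the definition for one Gherkin step line."""
    def register(func):
        STEPS[text] = func
        return func
    return register


class FakeSocialSite:
    """In-memory stand-in for the web UI, services, and database."""

    def __init__(self):
        self.friends = {"alice": ["bob", "carol"]}
        self.feeds = {"alice": [], "bob": [], "carol": []}
        self.current_user = None

    def log_in(self, user):
        self.current_user = user

    def publish(self, text):
        # A post lands in the author's feed and in every friend's feed.
        for viewer in [self.current_user] + self.friends[self.current_user]:
            self.feeds[viewer].append(text)


@step("Given a user is logged into the social media site")
def log_in(ctx):
    ctx["site"].log_in("alice")


@step("When the user writes a new post")
def write_post(ctx):
    ctx["post"] = "Hello, world!"
    ctx["site"].publish(ctx["post"])


@step("Then the user's home feed displays the new post")
def check_own_feed(ctx):
    assert ctx["post"] in ctx["site"].feeds["alice"]


@step("And all friends' home feeds display the new post")
def check_friend_feeds(ctx):
    site = ctx["site"]
    for friend in site.friends["alice"]:
        assert ctx["post"] in site.feeds[friend]


def run_scenario(lines):
    """Run each step line in order against a fresh context."""
    ctx = {"site": FakeSocialSite()}
    for line in lines:
        STEPS[line](ctx)
    return ctx
```

In a real suite, the bodies of these steps would drive a browser or call services instead of a fake object, but the mapping from step text to code would look similar.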

Lengthy End-to-End Tests

Terms often mean different things to different people. When many people say “end-to-end tests,” what they really mean are lengthy procedure-driven tests: tests that cover multiple behaviors in sequence. That makes BDD purists shudder because it goes against the cardinal rule of BDD: one scenario, one behavior. BDD frameworks can certainly handle lengthy end-to-end tests, but careful consideration should be given to whether and how it should be done.

There are five main ways to handle lengthy end-to-end scenarios in BDD:

  1. Don’t bother. If BDD is done right, then every individual behavior would already be comprehensively covered by scenarios. Each scenario should cover all equivalence classes of inputs and outputs. Thus, lengthy end-to-end scenarios would primarily be duplicate test coverage. Rather than waste the development effort, skip lengthy end-to-end scenario automation as a small test risk, and compensate with manual and exploratory testing.
  2. Combine existing scenarios into new ones. Each When-Then pair represents an individual behavior. Steps from existing scenarios could be smashed together with very little refactoring. This violates good Gherkin rules and could result in very lengthy scenarios, but it would be the most pragmatic way to reuse steps for large end-to-end scenarios. Most BDD frameworks don’t enforce step type order, and if they do, steps could be re-typed to work. (This approach is the most pragmatic but least pure.)
  3. Embed assertions in Given and When steps. This strategy avoids duplicate When-Then pairs and ensures validations are still performed. Each step along the way is validated for correctness with explicit Gherkin text. However, it may require a number of new steps.
  4. Treat the sequence of behaviors as a unique, separate behavior. This is the best way to think about lengthy end-to-end scenarios because it reinforces behavior-driven thinking. A lengthy scenario adds value only if it can be justified as a uniquely separate behavior. The scenario should then be written to highlight this uniqueness. Otherwise, it’s not a scenario worth having. These scenarios will often be very declarative and high-level.
  5. Ditch the BDD framework and write them purely in the automation programming language. Gherkin is meant for collaboration about behaviors, while lengthy end-to-end tests are meant exclusively for intense QA work. Biz roles will write behavior specs but will never write end-to-end tests. Forcing behavior specification on lengthy end-to-end scenarios can inhibit their development. A better practice could be coexistence: acceptance tests could be written with Gherkin, while lengthy end-to-end tests could be written in raw programming code. Automation for both test sets could nevertheless share the same automation code base: they could share the same support modules and even step definition methods.

Pick the approach that best meets the team’s needs.
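As one possible picture of the fifth approach, the sketch below writes a lengthy procedure-driven test in plain Python, checking several behaviors in sequence. The `Cart` class is a hypothetical in-memory stand-in for a support module that Gherkin step definitions could also call; in a real suite it would wrap UI or service interactions.

```python
class Cart:
    """Hypothetical support module shared by step definitions and raw tests."""

    def __init__(self):
        self.items = {}
        self.checked_out = False

    def add(self, name, price):
        self.items[name] = price

    def remove(self, name):
        del self.items[name]

    def total(self):
        return sum(self.items.values())

    def check_out(self):
        if not self.items:
            raise ValueError("cannot check out an empty cart")
        self.checked_out = True


def test_shopping_journey():
    """Lengthy procedure-driven test: several behaviors checked in sequence."""
    cart = Cart()

    cart.add("book", 12.0)
    cart.add("pen", 3.0)
    assert cart.total() == 15.0  # behavior 1: adding items

    cart.remove("pen")
    assert cart.total() == 12.0  # behavior 2: removing an item

    cart.check_out()
    assert cart.checked_out  # behavior 3: checking out
```

Because the intermediate assertions live in code rather than in Gherkin text, the test can be as long as the procedure requires without bending scenario-writing rules.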

Can Performance Tests be Unit Tests?

A friend recently asked me this question (albeit with some rephrasing):

Can a unit test be a performance test? For example, can a unit test wait for an action to complete and validate that the time it took is below a preset threshold?

I cringed when I heard this question, not only because it is poor practice, but also because it reflects common misunderstandings about types of testing.

QA Buzzword Bingo

The root of this misunderstanding is the lack of standard definitions for types of tests. Every company where I’ve worked has defined test types differently. Individuals often play fast and loose with buzzword bingo, especially when new hires from other places used different buzzwords. Here are examples of some of those buzzwords:

  • Unit testing
  • Integration testing
  • End-to-end testing
  • Functional testing
  • System testing
  • Performance testing
  • Regression testing
  • Test-’til-it-breaks
  • Measurements / benchmarks / metrics
  • Continuous integration testing

And here are some games of buzzword bingo gone wrong:

  • Trying to separate “systemic” tests from “system” tests.
  • Claiming that “unit” tests should interact with a live web page.
  • Separating “regression” tests from other test types.

Before any meaningful discussions about testing can happen, everyone must agree to a common and explicit set of testing type definitions. For example, this could be a glossary on a team wiki page. Whenever I have discussions with others on this topic, I always seek to establish definitions first.

What defines a unit test?

Here is my definition:

A unit test is a functional, white box test that verifies the correctness of a single unit of software code. It is functional in that it gives a deterministic pass-or-fail result. It is white box in that the test code directly calls the product source code under test. The unit is typically a function or method, and there should be separate unit tests for each equivalence class of inputs.

Unit tests should focus on one thing, and they are typically short – both in lines of code and in execution time. Unit tests become extremely useful when they are automated. Every major programming language has unit test frameworks. Some popular examples include JUnit, xUnit.net, and pytest. These frameworks often integrate with code coverage, too.

In continuous integration, automated unit tests can run every time a new build is made to indicate if the build is good or bad. That’s why unit tests must be deterministic: they must yield consistent results in order to trust build status and expedite failure triage. For example, if a build was green at 10am but turned red at 11am, then, so long as the tests were deterministic, it is reasonable to deduce that a defective change was committed to the code line between 10am and 11am. Good build status indicates that the build is okay to deploy to a test environment and then hopefully to production.

(As a side note, I’ve heard arguments that unit tests can be black box, but I disagree. Even if a black box test covers only one “unit”, it is still at least an integration test because it covers the connection between the actual product and some caller (script, web browser, etc.).)

What defines a performance test?

Again, here’s my definition:

A performance test is a test that measures aspects of a controlled system. It is white box if it tests code directly, such as profiling individual functions or methods. It is black box if it tests a real, live, deployed product. Typically, when people talk about testing software performance, they mean black box style testing. The aspects to measure must be pre-determined, and the system under test must be controlled in order to achieve consistent measurements.

Performance tests are not functional tests:

  • Functional tests answer if a thing works.
  • Performance tests answer how efficiently a thing works.

Rather than yield pass-or-fail results, performance tests yield measurements. These measurements could track things as general as CPU or memory usage, or they could track specific product features like response times. Once measurements are gathered, data analysis should evaluate the goodness of the measurements. This often means comparison to other measurements, which could be from older releases or with other environment controls.
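A small sketch can show what “yielding measurements” looks like in practice. The helper names `measure` and `compare_to_baseline` are assumptions for illustration, and the sorted-list workload is an arbitrary placeholder. Only `time.perf_counter` and the `statistics` module come from the Python standard library. Real performance tooling would control the environment far more carefully than this.

```python
import statistics
import time


def measure(action, repetitions=20):
    """Run the action repeatedly and return summary timings in seconds."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        action()
        samples.append(time.perf_counter() - start)
    return {
        "samples": len(samples),
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


def compare_to_baseline(current, baseline):
    """Return the relative change in median time versus a baseline run."""
    return (current["median"] - baseline["median"]) / baseline["median"]


# Placeholder workload; the numbers depend entirely on the machine.
report = measure(lambda: sorted(range(10_000), reverse=True))
print(report)
```

Note that the result is a set of numbers to analyze, not a verdict. Comparing the report against a stored baseline from an older release is one way to judge whether a measurement is good.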

Performance testing is challenging to set up and measure properly. While unit tests should run the same in any environment, performance tests are inherently sensitive to the environment. For example, an enterprise cloud server will likely have better response times than a 7-year-old MacBook.

Why should performance tests not be unit tests?

Returning to the original question, it is theoretically possible to frame a performance test as a functional test by validating a specific measurement against a preset threshold. However, there are three main reasons why a unit test should not be a performance test:

  1. Performance checks in unit tests make the build process more vulnerable to environmental issues. Bad measurements from environment issues could cause unit tests to fail for reasons unrelated to code correctness. Any unit test failure will block a build, trigger triage, and stall progress. This means time and money. The build process must not be interrupted by environment problems.
  2. Proper performance tests require lots of setup beyond basic unit test support. Unit tests should be short and sweet, and unit testing frameworks don’t have the tools needed to take good measurements. Unit tests are often not run in tightly controlled environments, either. It would take a lot of work to properly put performance checks into a unit test.
  3. Performance tests yield metrics that should not be shoehorned into a binary pass/fail status. Performance data is complex, rich with information, and can be volatile. Teams should analyze performance data, especially over time.
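The first reason can be demonstrated with a small simulation. In the sketch below, `finishes_within` is a hypothetical threshold check of the kind the original question describes, and `make_fake_clock` is an illustrative helper that simulates a fast or slow environment by controlling how much time appears to pass between clock readings. The product code and the threshold are identical in both runs; only the simulated environment changes.

```python
import time


def finishes_within(action, threshold_seconds, clock=time.perf_counter):
    """Return True if the action's elapsed time is below the threshold."""
    start = clock()
    action()
    return clock() - start < threshold_seconds


def make_fake_clock(seconds_per_call):
    """Build a clock that advances a fixed amount on every reading."""
    now = [0.0]

    def clock():
        now[0] += seconds_per_call
        return now[0]

    return clock


def action():
    sum(range(1000))  # the code under test is identical in both runs


# Same code, same threshold, different simulated environments.
print(finishes_within(action, 0.5, clock=make_fake_clock(0.1)))  # True
print(finishes_within(action, 0.5, clock=make_fake_clock(2.0)))  # False
```

The check flips from pass to fail without any code change, which is exactly the kind of result that would block a build for reasons unrelated to correctness.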

These points are based on the explicit definitions provided above. Note that I am not saying that performance testing should not be done, but rather that performance checks should not be part of unit testing. Unit testing and performance testing should be categorically separate types of testing.