BDD 101: Unit, Integration, and End-to-End Tests

There are many types of software tests. BDD practices can be incorporated into all aspects of testing, but BDD frameworks are not meant to handle all test types. Behavior scenarios are inherently functional tests – they verify that the product under test works correctly. While instrumentation for performance metrics could be added, BDD frameworks are not intended for performance testing. This post focuses on how BDD automation works into the Testing Pyramid. Please read BDD 101: Manual Testing for manual test considerations.

The Testing Pyramid

The Testing Pyramid is a functional test development approach that divides tests into three layers: unit, integration, and end-to-end.

  • Unit tests are white-box tests that verify individual “units” of code, such as functions, methods, and classes. They should be written in the same language as the product under test, and they should be stored in the same repository. They often run as part of the build to indicate immediate success or failure.
  • Integration tests are black-box tests that verify integration points between system components work correctly. The product under test should be active and deployed to a test environment. Service tests are often integration-level tests.
  • End-to-end tests are black-box tests that test execution paths through a system. They could be seen as multi-step integration tests. Web UI tests are often end-to-end-level tests.

Below is a visual representation of the Testing Pyramid:

The Testing Pyramid

The Testing Pyramid

From bottom to top, the tests increase in complexity: unit tests are the simplest and run very fast, while end-to-end require lots of setup, logic, and execution time. Ideally, there should be more tests at the bottom and fewer tests at the top. Test coverage is easier to implement and isolate at lower levels, so fewer high-investment, more-fragile tests need to be written at the top. Pushing tests down the pyramid can also mean wider coverage with less execution time.

Behavior-Driven Unit Testing

BDD test frameworks are not meant for writing unit tests. Unit tests are meant to be low-level, program-y tests for individual functions and methods. Writing Gherkin for unit tests is doable, but it is overkill. It is much better to use established unit test frameworks like JUnit, NUnit, and pytest.

Nevertheless, behavior-driven practices still apply to unit tests. Each unit test should focus on one main thing: a single call, an individual variation, a specific input combo; a behavior. Furthermore, in the software process, feature-level behavior specs draw a clear dividing line between unit and above-unit tests. The developer of a feature is often responsible for its unit tests, while a separate engineer is responsible for integration and end-to-end tests for accountability. Behavior specs carry a gentleman’s agreement that unit tests will be completed separately.

Integration and End-to-End Testing

BDD test frameworks shine at the integration and end-to-end testing levels. Behavior specs expressively and concisely capture test case intent. Steps can be written at either integration or end-to-end levels. Service tests can be written as behavior specs like in Karate. End-to-end tests are essentially multi-step integrations tests. Note how a seemingly basic web interaction is truly a large end-to-end test:

Given a user is logged into the social media site
When the user writes a new post
Then the user's home feed displays the new post
And the all friends' home feeds display the new post

Making a simple social media post involves web UI interaction, backend service calls, and database updates all in real time. That’s a full pathway through the system. The automated step definitions may choose to cover these layers implicitly or explicitly, but they are nevertheless covered.

Lengthy End-to-End Tests

Terms often mean different things to different people. When many people say “end-to-end tests,” what they really mean are lengthy procedure-driven tests: tests that cover multiple behaviors in sequence. That makes BDD purists shudder because it goes against the cardinal rule of BDD: one scenario, one behavior. BDD frameworks can certainly handle lengthy end-to-end tests, but careful considerations should be taken for if and how it should be done.

There are five main ways to handle lengthy end-to-end scenarios in BDD:

  1. Don’t bother. If BDD is done right, then every individual behavior would already be comprehensively covered by scenarios. Each scenario should cover all equivalence classes of inputs and outputs. Thus, lengthy end-to-end scenarios would primarily be duplicate test coverage. Rather than waste the development effort, skip lengthy end-to-end scenario automation as a small test risk, and compensate with manual and exploratory testing.
  2. Combine existing scenarios into new ones. Each When-Then pair represents an individual behavior. Steps from existing scenarios could be smashed together with very little refactoring. This violates good Gherkin rules and could result in very lengthy scenarios, but it would be the most pragmatic way to reuse steps for large end-to-end scenarios. Most BDD frameworks don’t enforce step type order, and if they do, steps could be re-typed to work. (This approach is the most pragmatic but least pure.)
  3. Embed assertions in Given and When steps. This strategy avoids duplicate When-Then pairs and ensures validations are still performed. Each step along the way is validated for correctness with explicit Gherkin text. However, it may require a number of new steps.
  4. Treat the sequence of behaviors as a unique, separate behavior. This is the best way to think about lengthy end-to-end scenarios because it reinforces behavior-driven thinking. A lengthy scenario adds value only if it can be justified as a uniquely separate behavior. The scenario should then be written to highlight this uniqueness. Otherwise, it’s not a scenario worth having. These scenarios will often be very declarative and high-level.
  5. Ditch the BDD framework and write them purely in the automation programming. Gherkin is meant for collaboration about behaviors, while lengthy end-to-end tests are meant exclusively for intense QA work. Biz roles will write behavior specs but will never write end-to-end tests. Forcing behavior specification on lengthy end-to-end scenarios can inhibit their development. A better practice could be coexistence: acceptance tests could be written with Gherkin, while lengthy end-to-end tests could be written in raw programming. Automation for both test sets could still nevertheless share the same automation code base – they could share the same support modules and even step definition methods.

Pick the approach that best meets the team’s needs.

Test Automation Myth-Busting

Test automation is a vital part of software quality assurance, now more than ever. However, it is a discipline that is often poorly understood. I’ve heard all sorts of crazy claims about automation throughout my career. This post debunks a number of commonly held but erroneous beliefs about automation.

Myth #1: Every test should be automated.

“100% automation” seems to be a new buzz-phrase. If automation is so great, why not automate every test? Not every test is worth automating in terms of return-on-investment. Automation requires significant expertise to design, implement, and maintain. There are limits to how many tests a team can reasonably produce and manage. Furthermore, not all tests are equal. Some require more effort to handle, or may not be run as frequently, or cover less important features. Just because a test could be automated does not mean that it should be automated. Using a risk-based test strategy, tests to automate should be prioritized by highest ROI.

Automated testing does not completely replace manual testing, either. Automated testing is defensive: it protects a code line by consistently running scripted tests for core functionality. However, manual testing is offensive: it uses human expertise to explore features off-script, test-to-break, and evaluate wholistic quality. Returns-on-investment for the same tests are often opposites between automated and manual approaches. Automated and manual testing together fulfill vital, complementary roles.

Myth #2: Automation means we can downsize QA.

Executives often see test automation as a way to automate QA out of a job. This is simply not true: Automation makes QA jobs more efficient and all the more necessary. Automation is software and thus requires strong software development skills. It also requires extra tools, processes, and labor to maintain. The benefit is that more tests can be run more quickly. QA jobs won’t vanish due to automation – they simply assume new responsibilities.

Myth #3: Automation will catch all bugs.

By their very nature, automated tests are “scripted” – each test always follows the same pre-programmed steps. This is a very good thing for catching regression bugs, but it inherently cannot handle new, unforeseen situations. That’s why manual, exploratory testing is needed. Automation, being software, may also have its own bugs. Automation is not a silver bullet.

Myth #4: Automation must be written in the same language as the product code.

Automation must be written in the same programming language as the product code for white-box unit tests. However, any programming language may be used for black-box functional tests. Black-box functional tests (like integration and end-to-end tests) test a live product. There’s no direct connection between the automation code and the product code. For example, a web app could have a REST service layer written in Java, a Web UI frontend written in .NET and JavaScript, and test automation written in Python using requests and Selenium WebDriver. It may be helpful to write automation in the same language as the product so that developers can more easily contribute, but it is not required. Choose the best language for test automation to meet the present needs.

Myth #5: All tests should be such-and-such-level tests.

This argument varies by product type and team. For web apps, it could be phrased as, “All tests should be web UI tests, since that’s how the user interacts with the product.” This is nonsense – different layers of testing mitigate risk at their optimal returns-on-investment. The Testing Pyramid applies this principle. Consider a web app with a service layer in terms of automation – service calls have faster response times and are more reliable than web UI interactions. It would likely be wise to test input combinations at the service layer while focusing on UI-specific functionality only at the web layer.

Myth #6: Unit tests aren’t necessary because the QA team does the testing.

The existence of a QA team or of a black-box automated test suite does not negate the need for unit tests. Unit tests are an insurance policy – they make sure the software programming is fundamentally working. In continuous integration, they make sure builds are good. They are essential for good software development. Many times, I caught bugs in my own code through writing unit tests – bugs that nobody else ever saw because I fixed them before committing code. Personally, I would never want to work on a product without strong unit tests.

Myth #7: We can complete a user story this sprint and automate its tests next sprint.

In Agile Scrum, teams face immense pressure to finish user stories within a sprint. Test automation is often the last part of a story to be done. If the automation isn’t completed by the end of the sprint, teams are tempted to mark the story as complete and finish the test automation in the future. This is a terrible mistake and a violation of Agile principles. Test automation should be included in the definition of done. A story isn’t complete without its prescribed tests! Punting tests into the next sprint merely builds technical debt and forces QA into constant catch-up. To mitigate the risk of incomplete stories, teams should size stories to include automation work, shift left to start QA sooner, or reduce the total sprint commitment size. Incomplete test automation often happens when product code is delivered late or a team’s capacity is overestimated.

Myth #8: Automation is just a bunch of “test scripts.”

It’s quite common to hear developers or managers refer to automated tests as “test scripts.” While this term itself is not inherently derogatory, it oversimplifies the complexity of test automation. “Scripts” sound like short, hacky sequences of commands to do system dirty-work. Test automation, however, is a full stack: in addition to the product under test, automation involves design patterns, dependency packages, development processes, version control, builds, deployments, reporting, and failure triage. Referring to test automation as “scripting” leads to chronic planning underestimations. Automation is a discipline, and the investment it requires should be honored.


Do you have any other automation myths to debunk? Share them in the comment section below!

BDD 101: Test Data

How should test data be handled in a behavior-driven test framework? This is a common question I hear from teams working on BDD test automation. A better question to ask first is, What is test data? This article will explain different types of test data and provide best practices for handling each. The strategies covered here can be applied to any BDD test framework.

Types of Test Data

Personally, I hate the phrase “test data” because its meaning is so ambiguous. For functional test automation, there are three primary types of test data:

  1. Test Case Values. These are the input and expected output values for test cases. For example, when testing calculator addition “1 + 2 = 3”, “1” and “2” would be input values, and “3” would be the expected output value. Input values are often parameterized for reusability, and output values are used in assertions.
  2. Configuration Data. Config data represents the system or environment in which the tests run. Changes in config data should allow the same test procedure to run in different environments without making any other changes to the automation code. For example, a calculator service with an addition endpoint may be available in three different environments: development, test, and production. Three sets of config data would be needed to specify URLs and authentication in each environment (the config data), but 1 + 2 should always equal 3 in any environment (the test case values).
  3. Ready State. Some tests require initial state to be ready within a system. “Ready” state could be user accounts, database tables, app settings, or even cluster data. If testing makes any changes, then the data must be reverted to the ready state.

Each type of test data has different techniques for handling it.

Test Case Values

There are 4 main ways to specify test case values in BDD frameworks, ranging from basic to complex.

In The Specs

The most basic way to specify test case values is directly within the behavior scenarios themselves! The Gherkin language makes it easy – test case values can be written into the plain language of a step, as step parameters, or in Examples tables. Consider the following example:

Scenario Outline: Simple Google searches
  Given a web browser is on the Google page
  When the search phrase "<phrase>" is entered
  Then results for "<phrase>" are shown
  Examples: Animals
    | phrase   |
    | panda    |
    | elephant |
    | rhino    |

The test case value used is the search phrase. The When and Then steps both have a parameter for this phrase, which will use three different values provided by the Examples table. It is perfectly suitable to put these test case values directly into the scenario because the values are small and descriptive.

Furthermore, notice how specific result values are not specified for the Then step. Values like “Panda Express” or “Elephant man” are not hard-coded. The step wording presumes that the step definition will have some sort of programmed mechanism for checking that result links relate to the search phrase (likely through regular expression matching).

Key-Value Lookup

Direct specification is great for small sets of simple values, but one size does not fit all needs. Key-value lookups are appropriate when test data is lengthier. For example, I’ve often seen steps like this:

Given the user navigates to ""

URLs, hexadecimal numbers, XML blocks, and comma-separated lists are all the usual suspects. While it is not incorrect to put these values directly into a step parameter, something like this would be more readable:

Given the user navigates to the "profile" page

Or even:

Given the user navigates to their profile page

The automation would store URLs in a lookup table so that these new steps could easily fetch the URL for the profile page by name. These steps are also more declarative than imperative and better resist changes in the underlying environment.

Another way to use key-value lookup is to refer to a set of values by one name. Consider the following scenario for entering an address:

Scenario Outline: Address entry
  Given the profile edit page is displayed
  When the user sets the street address to "<street>"
  And the user sets the second address line to "<second>"  
  And the user sets the city to "<city>"
  And the user sets the state to "<state>"
  And the user sets the zipcode to "<zipcode>"
  And the user sets the country to "<country>"
  And the user clicks the save button
  Then ...

  Examples: Addresses
    | street | second | city | state | zipcode | country |

An address has a lot of fields. Specifying each in the scenario makes it very imperative and long. Furthermore, if the scenario is an outline, the Examples table can easily extend far to the right, off the page. This, again, is not readable. This scenario would be better written like this:

Scenario Outline: Address entry
  Given the profile edit page is displayed
  When the user enters the "<address-type>" address
  And the user clicks the save button
  Then ...

  Examples: Addresses
    | address-type |
    | basic        |
    | two-line     |
    | foreign      |

Rather than specifying all the values for different addresses, this scenario names the classifications of addresses. The step definition can be written to link the name of the address class to the desired values.

Data Files

Sometimes, test case values should be stored in data files apart from the specs or the automation code. Reasons could be:

  • The data is simply too large to reasonably write into Gherkin or into code.
  • The data files may be generated by another tool or process.
  • The values are different between environments or other circumstances.
  • The values must be selected or switched at runtime (without re-compiling code).
  • The files themselves are used as payloads (ex: REST request bodies or file upload).

Scenario steps can refer to data files using the key-value lookup mechanisms described above. Lightweight, text-based, tabular file formats like CSV, XML, or JSON work the best. They can parsed easily and efficiently, and changes to them can easily be diff’ed. Microsoft Excel files are not recommended because they have extra bloat and cannot be easily diff’ed line-by-line. Custom text file formats are also not recommended because custom parsing is an extra automation asset requiring unnecessary development and maintenance. Personally, I like using JSON because its syntax is concise and its parsing tools seem to be the simplest in most programming languages.

External Sources

An external dependency exists when the data for test case values exists outside of the automation code base. For example, test case values could reside in a database instead of a CSV file, or they could be fetched from a REST service instead of a JSON file. This would be appropriate if the data is too large to manage as a set of files or if the data is constantly changing.

As a word of caution, external sources should be used only if absolutely necessary:

  1. External sources introduce an additional point-of-failure. If that database or service goes down, then the test automation cannot run.
  2. External sources degrade performance. It is slower to get data from a network connection than from a local machine.
  3. Test case values are harder to audit. When they are in the specs, the code, or data files, history is tracked by version control, and any changes are easy to identify in code reviews.
  4. Test case values may be unpredictable. The automation code base does not control the values. Bad values can fail tests.

External sources can be very useful, if not necessary, for performance / stress / load / limits testing, but it is not necessary for the vast majority of functional testing. It may be convenient to mock external sources with either a mocking framework like Mockito or with a dummy service.

Configuration Data

Config data pertain to the test environments, not the test cases. Test automation should never contain hard-coded values for config data like URLs, usernames, or passwords. Rather, test automation should read config data when it launches tests and make references to the required values. This should be done in Before hooks and not in Gherkin steps. In this way, automated tests can run on any configuration, such as different test environments before being released to production.

Config data can be stored in data files or accessed through some other dependency. (Read the previous section for pros and cons of those approaches.) The config to use should be somehow dynamically selectable when tests run. For example, the path to the config file to use could be provided as a command line argument to the test launch command.

Config data can be used to select test values to use at runtime. For example, different environments may need different test value data files. Conversely, scenario tagging can control what parts of config data should be used. For example, a tag could specify a username to use for the scenario, and a Before hook could use that username to fetch the right password from the config data.

For efficiency, only the necessary config data should be accessed or read into memory. In many cases, fetching the config data should also be done once globally, rather than before each test case.

Ready State

All scenarios have a starting point, and often, that starting point involves data. Setup operations must bring the system into the ready state, and cleanup operations must return the system to the ready state. Test data should leave no trace – temporary files should be deleted and records should be reverted. Otherwise, disk space may run out or duplicate records may fail tests. Maintaining the ready state between tests is necessary for true test independence.

During the Test Run

Simple setup and cleanup operations may be done directly within the automation. For example, when testing CRUD operations, records must be created before they can be retrieved, updated, or deleted. Setup would create a record, and cleanup would guarantee the record’s deletion. If the setup is appropriate to mention as part of the behavior, then it should be written as Given steps. This is true of CRUD operations: “Given a record has been created, When it is deleted, …”. If multiple scenarios share this same setup, then those Given steps should be put into a Background section.

However, sometimes setup details are not pertinent to the behavior at hand. For example, perhaps fresh authentication tokens must be generated for those CRUD calls. Those operations should be handled in Before hooks. The automation will take care of it, while the Gherkin steps can focus exclusively on the behavior.

No matter what, After hooks must do cleanup. It is incorrect to write final Then steps to do cleanup. Then steps should verify outcomes, not take more actions. Plus, the final Then steps will not be run if the test has a failure and aborts!

External Preparation

Some data simply takes too long to set up fresh for each test launch. Consider complicated user accounts or machine learning data: these are things that can be created outside of the test automation. The automation can simply presume that they exist as a precondition. These types of data require tool automation to prepare. Tool automation could involve a set of scripts to load a database, make a bunch of service calls, or navigate through a web portal to update settings. Automating this type of setup outside of the test automation enables engineers to more easily replicate it across different environments. Then, tests can run in much less time because the data is already there.

However, this external preparation must be carefully maintained. If any damage is done to the data, then test case independence is lost. For example, deleting a user account without replacing it means that subsequent test runs cannot log in! Along with setup tools, it is important to create maintenance tools to audit the data and make repairs or updates.

Advice for Any Approach

Use the minimal amount of test data necessary to test the functionality of the product under test. More test data requires more time to develop and manage. As a corollary, use the simplest approach that can pragmatically handle the test data. Avoid external dependencies as much as possible.

To minimize test data, remember that BDD is specification by example: scenarios should use descriptive values. Furthermore, variations should be reduced to input equivalence classes. For example, in the first scenario example on this page, it would probably be sufficient to test only one of those three animals, because the other two animals would not exhibit any different searching behavior.

Finally, be cautioned against randomization in test data. Functional tests are meant to be deterministic – they must always pass or fail consistently, or else test results will not be reliable. (Not only could this drive a tester crazy, but it would also break a continuous integration system.) Using equivalence classes is the better way to cover different types of inputs. Use a unique number counting mechanism whenever values must be unique.

BDD‑‑; Collaboration without Automation

In the previous post, I described the tradeoffs of using a BDD test automation framework without the full BDD process. But, what about the opposite? What if a team wants to adopt BDD practices without a test framework to support it? Again, behavior-driven practices are beneficial apart from automation, but not without shortcomings.

The Power of Process

BDD should be a refinement, not an overhaul, of Agile software development. All of the problems BDD solves are simply aspects of the development process that must be solved anyway. BDD simply provides formal practices for solving them uniformly. Consider how BDD addresses the following problems:

Problem Solution
Biz, dev, and test roles are siloed and do not talk together much. BDD brings these three roles together in Three Amigos meetings.
Acceptance criteria are missing or poorly defined, wasting in-sprint time. Acceptance criteria are formalized as specifications using Gherkin.
Product features are hard to explain. Scenarios describe individual behaviors in plain language.
Edge cases are overlooked during testing. Well-defined behavior scenarios capture specifications by example early in development.

All of these problems can be solved through better, behavior-driven practices, and none of them pertain to test automation.

Spec-Less Automation

BDD process improvements don’t necessarily need a BDD framework for test automation. Any test framework could still automate scenario steps. The major difference is that there would be no mechanism to translate Gherkin lines into method/function calls: The automation engineer would simply need to program test cases the “good old-fashioned way.” It would not be much different from translating any other procedure-driven test cases into code.

The weakness of this approach is that specifications are not strongly linked to the test automation. The end-to-end development process is less efficient because behavior scenarios must essentially be rewritten into automation code, rather than becoming part of the automation code. There is also a higher risk that automated test cases won’t cover the actual intention of the test steps. Review and maintenance are more difficult because engineers must always cross-examine the automation code with the Gherkin to make sure they align. All of these problems make it harder to shift left with QA work.

The lack of a behavior-driven test framework is also a double-edged sword for Gherkin steps. On one hand, steps do not need to be scrutinized as strongly in review, since automation code does not directly depend upon them. It is not critical to reuse steps word-for-word or to worry about parameterization. However, sloppy steps can lead to miscommunication and will make adopting a BDD test framework in the future very difficult.

Better Than Nothing

Just like for automation without collaboration, using BDD practices without using a BDD test framework does improve the development process. There aren’t really any disadvantages because the process problems must be solved anyway. A “BDD‑‑;” situation (that’s a postfix decrement, to denote that automation did not follow collaboration) isn’t ideal, but at least it’s better than nothing.

‑‑BDD; Automation without Collaboration

Does it make sense to use a BDD test automation framework on a team that does not follow a Behavior-Driven Development process? I’ve faced this questions a few times recently. Although some BDD benefits will be missing, the answer is still yes, BDD test automation frameworks are still useful apart from a full BDD process. This article covers strengths and weaknesses to explain why.


BDD test frameworks force tests to be behavior-driven, not procedure-driven. Behavior-driven tests focus on individual behaviors, making them concise and comprehensible. Impertinent factors are removed from test cases. Imperative details are specified only when necessary. Test reports are more descriptive, and test results are more meaningful. Tests written without a behavior-driven framework are more likely to become long, unnecessarily complicated, and fragile.

BDD test frameworks also provide inherent structure with steps. Steps are the basic building blocks of test cases, regardless of the type of test automation framework used. While almost all run-of-the-mill test frameworks (like JUnit,, or pytest) provide structure to write separate, independent test cases (usually as methods or functions), they lack structure to write separate test case steps. Typically, programmers end up writing test case logic directly into the test methods/functions, or they write ad hoc helper methods/functions/classes to get the job done. This approach often lacks consistency (especially when multiple engineers contribute to the automation code), and thus reusability suffers and duplication creeps in. Gherkin steps are like guide rails for test cases.

Gherkin steps provide easy reusability for rapid development. In a mature automation code base, new test cases can be written using a few short lines of pre-existing steps. And pre-existing steps can be trusted to work because they’ve been tested before. Parametrized steps enable even greater reuse.

Gherkin steps are self-documenting because they are written in plain English. This makes tests easier to do many things:

  • to write, because it provides an outline for the test in plain language
  • to review, because others less familiar with the feature can quickly understand concise scenarios
  • to maintain, because problems can be pinpointed
  • to explain, because non-technical people can’t read code

Much like any other test frameworks, BDD frameworks integrate with other testing packages and design patterns. For example, it is common to use a BDD framework with Selenium WebDriver and the Page Object Model to do Web UI testing. Other common packages for needs like logging, assertions, and REST API calls also work well with BDD frameworks.

Finally, BDD test frameworks open the door to shifting left. They can be the starting point for QA-led BDD. Demonstrating the value in behavior-driven automation can open interest in Three Amigos collaboration, which can then lead to more process improvements and better software quality.


BDD test frameworks require extra development overhead at first. They aren’t as simple to use as unit-like test frameworks. It also takes a lot of practice to write good Gherkin. I’ve talked with engineers (typically developers) who see the feature file layer as unnecessary “plaster” over test cases. Without full team collaboration and cooperation, the justification for BDD diminishes.

Strict behavior independence may also make execution time less efficient. While steps may be reused, common setup operations must be run for each test. CRUD operations illustrate this point well. In a BDD framework, each operation (create, retrieve, update, delete) would be covered by a separate test scenario. However, the operations are interdependent: a test must create a thing before it can delete the thing. Thus, the delete scenario will borrow some logic from the create scenario. A procedure-driven test could more efficiently stack steps into one test case like this: create, retrieve, update, retrieve, delete, retrieve. Assertions would be interleaved with operations. This one test case would cover multiple behaviors, but it would save execution time by avoiding repeated creations for setup and deletions for cleanups. Many times, people have even asked me if there is a way to sequence Gherkin scenarios together to achieve the same effect! (This is not possible, and it would violate test independence.)

If BDD frameworks are used without a BDD process, then BDD could become pigeonholed as a “QA thing,” forever banished to the realm of the far right (the opposite of shift left, not the political spectrum). This could raise barriers to collaboration if not handled properly.

Furthermore, the lack of the full BDD means that many BDD benefits will go missing. Miscommunications could still easily happen because biz and dev would not be involved in defining behavior scenarios. Delivery deadlines could still be missed because testing and automation cannot readily shift left. Out of the 12 major benefits of BDD, the first 4 would be lost.


Overall, I think the advantages of BDD test automation frameworks outweigh the disadvantages for most above-unit functional testing needs, regardless of whether or not a team uses a full BDD process. Ideally, a team would embrace full-BDD, but that’s not always reality. A “‑‑BDD;” situation (that’s a prefix decrement, to note that collaboration was missing before automation) can still be seen as a glass half-full.

Who Should Lead BDD?

Behavior-driven development offers great benefits: better communication, easier test automation, and higher code quality. There are many ways for a team to start doing BDD, and naturally, someone needs to stand up and lead the effort. In my experience, adopting BDD is its own process. An evangelist converts team leaders, training sessions are given, and Gherkinized acceptance criteria start being automated. However, not everyone will embrace the changes, especially those across different role types. And big changes take time. Rome wasn’t built in a day, and neither will be a mature, effective BDD process.

This post covers three possible ways to lead BDD adoption, each from one of the Three Amigos roles. Any role can lead the charge, but each will have its unique struggles. These possibilities are advisory but not necessarily prescriptive. If you want to move your team into BDD, use these three approaches as guidelines for crafting a plan that best meets your needs. And, of course, the advice in Winning Support for BDD pertains to all approaches. Furthermore, as you read these approaches, put yourselves in the shoes of roles other than your own, so you can better understand the struggles each role faces.

Note: The approaches below presume that the underlying software development process is Agile Scrum. Nevertheless, they may be tweaked and applied to other processes, like Waterfall or Kanban.

The Starting Point

The starting point for all three approaches below is a “traditional” Agile sprint – one that is not (yet) behavior-driven. Product owners write user stories, developers implement the solutions, and testers test the deliverables. The diagram below shows the the main flow of sprint work in this type of sprint, and it will serve as the basis for illustrating BDD adoption:

Traditional Sprint

The overall flow of a “traditional,” non-behavior-driven Agile sprint. Ceremonies like planning, review, and retrospective should still happen, but the are left out of this diagram to put emphasis on parts affected by BDD.



The most common approach I’ve seen is QA-led BDD adoption, because testers arguably have the most to gain. It is most applicable when the Three Amigos roles (biz, dev, and test) are well-defined and separate. The impetus for QA to lead BDD adoption could be that developers deliver code too late to adequately test and automate within a sprint, or it could be that the QA team is struggling to scale their test automation development. There may also be resistance to BDD from biz and dev roles.


The sensible path for QA is to start all the way to the right and progressively shift left. This means that the starting point would be test automation. Start by building a solid automation code base. Pick a well-supported BDD framework like Cucumber, SpecFlow, or behave, and start adding scenarios and step definitions. Select scenarios for core product features rather than the latest sprint stories, so that the code base will be populated with the most basic, useful steps. Once the automation code reaches a “critical mass” for step reusability, QA can then proactively classify new test scenarios as automated or manual. Automated tests become easier and easier to write, giving QA more time to be exploratory with manual testing. Ideally, all manual testing would become exploratory.

Then, it’s time to start shifting left. At this point, all Gherkin steps would be in the automation code only, so set up a tool like Pickles to expose the steps to all team members as living documentation. QA should then schedule Three Amigos meetings with biz and dev to proactively discuss user story expectations. In those meetings, QA should start demonstrating how to write acceptance criteria in Gherkin, which then expedites testing. A big win would be if a QA engineer could write a new scenario using only pre-existing, pre-automated steps and then run it successfully on the spot.

Once biz and dev folks are convinced of BDD’s benefits, encourage them to participate in writing Gherkin. When they get comfortable, encourage product owners to write acceptance criteria in Gherkin when they write user stories, and hold Three Amigos meetings before sprint planning as part of grooming. Convince them that for them to help write Gherkin scenarios is a process efficiency for the whole team.


This slideshow requires JavaScript.


Shifting left is never easy, especially when team members are hardened into their roles. That’s why QA must write both really good test automation and really good Gherkin scenarios. Success should speak for itself once QA delivers good automation fast. Furthermore, QA must be clear that BDD is not merely a test tool, it’s a process that requires a paradigm shift. Otherwise, BDD could be easily pigeonholed to be a “QA thing.”

Dev-Led BDD


There are a few reasons that could push developers to lead BDD. On some Agile teams, there’s no distinction between dev and QA roles: all team members are software engineers responsible for both developing and testing the software. Or, developers may not be satisfied with the testing effort. Maybe too many bugs are escaping the sprint, or maybe automation isn’t getting done in time. Or, perhaps the product owner is not happy with the deliverables and putting pressure on the team to do better. Whatever the circumstance, developers are more than capable of winning with BDD.


The best way for developers to start is to set up Three Amigos meetings, to stop the game of telephone between biz and test. In those meetings, start translating acceptance criteria into Gherkin. Then, start helping out with test automation – that may mean anything from offering advice to QA to building the framework from scratch. Then, start pushing left and right to get biz and test on board with BDD.


This slideshow requires JavaScript.


It may be difficult for developers to work on test automation because they may lack either the expertise or the time to devote to good test automation. Automation is a specialized discipline, and it takes time and diligence to build up expertise to do it right. I’ve seen very skilled developers haughtily build very shabby automation frameworks.

Developers must also be careful to not be too technical, or else biz and test roles may reject BDD for being too complicated or beyond their abilities. Furthermore, some teams may be resistant to developing test automation. For example, automation work may be “starved” for points because it is underestimated or similarly starved for time because it is deemed lower in priority to other work.

Biz-Led BDD


BDD is designed to bring technical and business roles together into healthier collaboration, and biz folks can certainly lead BDD adoption as successfully as more technical folks can. Major reasons for biz to take the lead could be if development is perpetually running behind schedule, if deliverables don’t meet the original requirements, or if software bugs are rampant.


For biz roles, “shift left” could be better called “pull left.” Start by writing solid user stories and Gherkin acceptance criteria. Focus on good Gherkin that is readable and reusable. Then, introduce BDD as a refinement to the Agile process, highlighting its benefits. Initiate Three Amigos meetings to make sure that you are communicating the right things to dev and test. Once collaboration is going well, suggest BDD automation as a way to expedite dev and test work. If acceptance criteria are all Gherkinized, then developing BDD automation would be a natural extension.


This slideshow requires JavaScript.


In my experience, biz roles (specifically product owners) tend to be the most hesitant about BDD. They often see writing Gherkin as a burdensome requirement rather than a way to help their team. Or, they may fear that BDD is “too technical” for them. It may also be difficult for them to pitch BDD automation to the team. To be successful, biz roles need to step outside their comfort zone to win supporters from dev and test.

Process paradigm shifts can be hard, especially on teams that are already overwhelmed with work. Some people just don’t like change. Process and automation change can also be a big challenge if QA is outsourced (which is common).

Side-By-Side Comparison

Here’s the TL;DR:

Role Circumstances Steps Struggles
  • Code is delivered too late to test and automate
  • Automation development is not scaling
  1. Build a solid BDD automation framework
  2. Demonstrate automation success
  3. Set up Three Amigos meetings during the sprint
  4. Start writing Gherkin scenarios with biz and dev as part of grooming
  • Showing that BDD is a whole development process, not just a QA thing
  • Getting the team to truly shift left
  • No separation of dev and QA roles
  • Too many bugs are escaping the sprint
  • Pressure from biz to do better
  1. Initiate better collaboration through Three Amigos and Gherkin
  2. Push right by helping QA with testing and automation
  3. Push left by helping biz write better acceptance criteria
  • Humbly learning good automation practices
  • Dedicating time for automation and more meetings
  • Missed deadlines
  • Deliverables not matching expectations
  • Too many bugs
  1. Write acceptance criteria in good Gherkin
  2. Set up Three Amigos meetings to review Gherkin
  3. Pitch BDD automation
  • Learning semi-technical things
  • Pushing all the way to automation


These are just three general approaches intended to show how BDD is for everyone. If you have other approaches, please describe them in the comment section below! Whatever the approach, make sure to demonstrate that BDD helps everyone, or else people may feel forced into corners and reject BDD for bad reasons. And remember, software quality is not just QA’s responsibility; it is everyone’s responsibility.

Can Performance Tests be Unit Test?

A friend recently asked me this question (albeit with some rephrasing):

Can a unit test be a performance test? For example, can a unit test wait for an action to complete and validate that the time it took is below a preset threshold?

I cringed when I heard this question, not only because it is poor practice, but also because it reflects common misunderstandings about types of testing.

QA Buzzword Bingo

The root of this misunderstanding is the lack of standard definitions for types of tests. Every company where I’ve worked has defined test types differently. Individuals often play fast and loose with buzzword bingo, especially when new hires from other places used different buzzwords. Here are examples of some of those buzzwords:

  • Unit testing
  • Integration testing
  • End-to-end testing
  • Functional testing
  • System testing
  • Performance testing
  • Regression testing
  • Test-’til-it-breaks
  • Measurements / benchmarks / metrics
  • Continuous integration testing

And here are some games of buzzword bingo gone wrong:

  • Trying to separate “systemic” tests from “system” tests.
  • Claiming that “unit” tests should interact with a live web page.
  • Separating “regression” tests from other test types.

Before any meaningful discussions about testing can happen, everyone must agree to a common and explicit set of testing type definitions. For example, this could be a glossary on a team wiki page. Whenever I have discussions with others on this topic, I always seek to establish definitions first.

What defines a unit test?

Here is my definition:

A unit test is a functional, white box test that verifies the correctness of a single unit of software code. It is functional in that it gives a deterministic pass-or-fail result. It is white box in that the test code directly calls the product source code under test. The unit is typically a function or method, and there should be separate unit tests for each equivalence class of inputs.

Unit tests should focus on one thing, and they are typically short – both in lines of code and in execution time. Unit tests become extremely useful when they are automated. Every major programming language has unit test frameworks. Some popular examples include JUnit,, and pytest. These frameworks often integrate with code coverage, too.

In continuous integration, automated unit tests can be run automatically every time a new build is made to indicate if the build is good or bad. That’s why unit tests must be deterministic – they must yield consistent results in order to trust build status and expedite failure triage. For example, if a build was green at 10am but turned red at 11am, then, so long as the tests were deterministic, it is reasonable to deduce that a defective change was committed to the code line between 10-11am. Good build status indicates that the build is okay to deploy to a test environment and then hopefully to production.

(As a side note, I’ve heard arguments that unit tests can be black box, but I disagree. Even if a black box test covers only one “unit”, it is still at least an integration test because it covers the connection between the actual product and some caller (script, web browser, etc.).)

What defines a performance test?

Again, here’s my definition:

performance test is a black box test that measures aspects of a controlled system. It is black box in that it should test a real, live, deployed product. (Even if the build under test has special instrumentation, the performance test itself should still be run in a black box fashion.) The aspects to measure must be pre-determined, and the system under test must be controlled in order to achieve consistent measurements.

Performance tests are not functional tests:

  • Functional tests answer if a thing works.
  • Performance tests answer how efficiently a thing works.

Rather than yield pass-or-fail results, performance tests yield measurements. These measurements could track things as general as CPU or memory usage, or they could track specific product features like response times. Once measurements are gathered, data analysis should evaluate the goodness of the measurements. This often means comparison to other measurements, which could be from older releases or with other environment controls.

Performance testing is challenging to set up and measure properly. While unit tests will run the same in any environment, performance tests are inherently sensitive to the environment. For example, an enterprise cloud server will likely have better response time than a 7-year-old Macbook.

Why should performance tests not be unit tests?

Returning to the original question, it is theoretically possible to frame a performance test as a functional test by validating a specific measurement against a preset threshold. However, there are 3 main reasons why a unit test should not be a performance test:

  1. Performance checks in unit tests make the build process more vulnerable to environmental issues. Bad measurements from environment issues could cause unit tests to fail for reasons unrelated to code correctness. Any unit test failure will block a build, trigger triage, and stall progress. This means time and money. The build process must not be interrupted by environment problems.
  2. Proper performance tests require lots of setup beyond basic unit test support. Unit tests should be short and sweet, and unit testing frameworks don’t have the tools needed to take good measurements. Unit test environments are often not set up in tightly controlled environments, either. It would take a lot of work to properly put performance checks into a unit test.
  3. Unit tests are white box, but performance tests are black box. Unit test results determine if a build is good or bad. Thus, they run before a build is labeled “ready.” Performance tests need to interact with an actual build. Thus, they run after a build is ready and deployed. This is a natural progression: run the most basic tests first, and then run the more advanced tests. It separates these two test types.

These points are based on the explicit definitions provided above. Note that I am not saying that performance testing should not be done, but rather that performances checks should not be part of unit testing. Unit testing and performance testing should be categorically separate types of testing.