Are Gherkin Scenarios with Multiple When-Then Pairs Okay?

Don’t know about Behavior-Driven Development or Gherkin? Start here!

Writing Gherkin is easy, but writing good Gherkin is hard. My post BDD 101: Writing Good Gherkin covers many aspects of good behavior specification, including titles, phrasing, and data. One of the major points I make anytime I discuss good Gherkin is what I call the “Cardinal Rule of BDD.”

The Cardinal Rule of BDDOne Scenario, One Behavior!

A behavior scenario specification should focus on one individual behavior. This is the essence of the BDD mindset – a product’s features can be specified in terms of its behaviors, and the specs should be written as examples of those behaviors in action. Identifying individual behaviors brings clarity to design, development, and testing. Combining behaviors into a single scenario causes ambiguity, miscommunication, and test gaps. Test failure triage also becomes more difficult and time consuming because the root causes for failures are less clear – the culprit could be one of multiple behaviors. There is also a high risk of duplication when scenarios repeat the same sequence of steps instead of isolating behaviors.

One of the dead giveaways to violations of the Cardinal Rule of BDD is when a Gherkin scenario has multiple When-Then pairs, like this:

Feature: Google Searching

  Scenario: Google Image search shows pictures
    Given the user opens a web browser
    And the user navigates to ""
    When the user enters "panda" into the search bar
    Then links related to "panda" are shown on the results page
    When the user clicks on the "Images" link at the top of the results page
    Then images related to "panda" are shown on the results page

A When-Then pair denotes a unique behavior. In this example, the behaviors of performing a search and changing the search to images could and should clearly be separated into two scenarios, like this:

Feature: Google Searching

  Scenario: Search from the search bar
    Given a web browser is at the Google home page
    When the user enters "panda" into the search bar
    Then links related to "panda" are shown on the results page

  Scenario: Image search
    Given Google search results for "panda" are shown
    When the user clicks on the "Images" link at the top of the results page
    Then images related to "panda" are shown on the results page

Despite being so central to BDD philosophy, the Cardinal Rule is the one thing people always try to sidestep. Nobody ever doubts the usefulness of step parameters or the need for good grammar, but people frequently show me scenarios with multiple When-Then pairs and basically ask for an exception from the rule. My gut reaction is always, “NO! Rules don’t change.”


I must first admit that the Cardinal Rule of BDD is “opinionated” – it is the way that I have found BDD to work best for collaboration and automation. Adherence forces people to adopt a behavior-driven mindset, and strictness keeps feature and test quality high. Other experts are more permissive of multiple When-Then pairs, though. Most examples I could find from leading sources such as The Cucumber Book exhibit strict Given-When-Then order for Gherkin scenarios, but other sources such as the online JBehave documentation show scenarios with multiple When-Then pairs boldly on the front page.

I must also begrudgingly admit that there are times when it is simply more convenient for a single scenario to have multiple behaviors (and thus multiple When-Then pairs). This is by no means a best practice but rather a pragmatic alternative for specification dilemmas. (See Purist vs. Pragmatist.) Below are situations in which multiple When-Then pairs may be acceptable.

Lengthy End-to-End Scenarios

End-to-end tests verify execution paths through a live system with all of its parts. Web UI tests frequently fall into this category: Selenium WebDriver interacts with a page in a browser, which then triggers calls to a backend service layer or database. Despite the name, end-to-end tests may still focus on one individual behavior. The example scenarios above, though short, technically count as end-to-end tests.

However, many people use the term “end-to-end” to refer to tests that cover sequences of behaviors. Such a scenario could violate the Cardinal Rule of BDD if it is not handled carefully. My article BDD 101: Unit, Integration, and End-to-End Tests gives strategies for handling lengthy end-to-end scenarios. One strategy is to simply turn a blind eye to multiple When-Then pairs. Ideally, each behavior would already have its own individual scenario, but then a new scenario would explicitly combine the behaviors together to get that full, end-to-end path. The new scenario would be easy to write because the steps could be reused. This isn’t the only strategy, so please be sure to consider the others before writing the tests.


Software system audits frequently require lengthy end-to-end scenarios. They are quite common in highly-regulated domains. For example, a bank may need to prove that a loan is prepared correctly or that a transaction puts money into the right accounts. Auditors typically require tests to run through entire system paths (e.g., multiple behaviors) using the same records, such as one loan application or one payment. Auditees must not only provide test results for past runs but must also repeat tests on demand. Separating each individual behavior into its own scenario makes each test independent, so during test execution, there will be no guaranteed order and no shared test data, and auditors would not have the end-to-end verification that they require. The simplest way to give the auditors what they need is to write one lengthy scenario with multiple When-Then pairs.

Service Calls

Service call testing is another case for which multiple When-Then pairs may be pragmatically justified. REST, SOAP, and WSDL are examples of service call types. Service layer development is more engineering-centric than business-centric, but many teams nevertheless choose to test service calls with Gherkin-based frameworks like Cucumber. Due to the programmatic nature of services, Gherkin scenarios for service calls tend to be quite imperative: specify a request, make the call, and verify parts of the response. This isn’t so bad for independent service calls, but it becomes a problematic when one request needs another call’s response.

One solution is the classic “pure” scenario split: put any necessary setup, including initial requests to get required response parts, into custom Given steps. This abides by the Cardinal Rule and avoids duplicate When-Then pairs. But, it introduces an unsavory form of code duplication. Many service calls end up being written twice: once as a Gherkin scenario for testing, and once in the underlying automation code to be called by Given steps. This violates the DRY principle.

The alternative “pragmatic” solution is to write scenarios that specify multiple service calls in the Gherkin steps. The Karate project advocates this approach, as shown in their “Hello World” example:

Take Caution!

There may be other cases when When-Then repetition is useful. Feel free to leave suggestions in the comments below. My examples are meant to be descriptive, not prescriptive. Another aspect to consider is that allowing multiple When-Then pairs per scenario indicates that a team sees more value in BDD’s test framework than in its collaborative spec process. (Refer to ‑‑BDD; Automation without Collaboration and BDD‑‑; Collaboration without Automation.)

Ultimately, you must decide what practices are best for your project. The main reason I uphold the Cardinal Rule of BDD so strongly is that it makes for good specs and good tests. I’ve seen engineers write extremely long, intensive test procedures (and I mean, dozens of duplicate behaviors per test) that are alright for manual testing but do not transition well into automation because they are too fragile and they don’t yield useful information upon failure. The Cardinal Rule is a way to break out of the procedure-driven mindset, and banning multiple When-Then pairs per Gherkin scenario is an effective rule for enforcing it.

Good Gherkin Scenario Titles

Don’t know about Behavior-Driven Development or Gherkin? Start here!

The Golden Gherkin Rule states:

Treat other readers as you would want to be treated. Write Gherkin so that people who don’t know the feature will understand it.

Part of writing good Gherkin (or any other specification-by-example language) includes writing good behavior scenario titles. The title is the face of the scenario: it summarizes what the behavior is all about. Good titles make collaboration and test triage a breeze, whereas bad titles make it tougher. But what makes a title “good”? Below are some helpful pointers.


Good titles should be short one-liners. One simple statement should be sufficient to concisely capture the intended behavior. Anything longer likely means that either the author doesn’t truly understand the behavior in focus, or that the scenario does not focus on one main behavior. Extra comments may be added to supplement the scenario’s description if necessary to avoid lengthy titles. Also, most BDD test automation frameworks will print scenario titles to logs for traceability.

Bad Example Good Example
The user can log into the app, navigate to the profile page, and see their full name, address, phone number, email, and username The profile page displays the user’s personal info

Conjunction Disjunction

Watch out for conjunction words like “and,” “or,” and “but.” Conjunctions typically imply that more than one thing will be done, which for scenario titles implies that more than one behavior will be covered. Or, it indicates that a Scenario Outline may be appropriate Don’t break the Cardinal Rule of BDD! Keep each scenario focused on one main behavior.

Avoid other conjunctions like “because,” “since,” and “so” as well. Phrases starting with those words often give an explanation for why the scenario exists. However, for conciseness, scenario titles should focus on what the behavior is. The why can either be deduced from the steps or made plain with comments.

Bad Example Good Example
The user can request an insurance quote from the big “Get-A-Quote” button on the home page or from the “Insurance Policies” page Two Scenarios: The user requests an insurance quote from the “Get-A-Quote” button on the home page / The user requests an insurance quote from the “Insurance Policies” page


Scenario Outline: The user requests an insurance quote

The last five search phrases are saved so that the user can rerun them from the history page The history page saves the last five search phrases

Avoid Assertion Language

Don’t use the words “verify,” “assert,” or “should” in scenario titles. They put the scenario’s emphasis on the assertion rather than the behavior. Assertions are merely a facet of behavior testing – they verify that something exists or that two values are equal. Behavior scenarios, however, are full software specifications. BDD is a development practice for making better software products – it’s not just a test tool. Don’t reduce the behavior-driven mindset to a test-only mindset.

Furthermore, leading every scenario title with “verify” or “assert” becomes very repetitive. The words just don’t enhance the meaningfulness of the title. They also thwart alphabetical order.

Bad Example Good Example
Verify the user can change their address on the profile page Profile page address change
Assert that a stock quote is displayed in green text when its value is higher than its previous closing value A stock quote has green text when its value is higher than its previous closing value
The goodbye page should be displayed after a successful logout Logout displays the goodbye page


Do you have any more suggestions? Put them in the comments below!

To Infinity and Beyond: A Guide to Parallel Testing

Are your automated tests running in parallel? If not, then they probably should be. Together with continuous integration, parallel testing the best way to fail fast during software development and ultimately enforce higher software quality. Switching tests from serial to parallel execution, however, is not a simple task. Tests themselves must be designed to run concurrently without colliding, and extra tools and systems are needed to handle the extra stress. This article is a high-level guide to good parallel testing practices.

What is Parallel Testing?

Parallel testing means running multiple automated tests simultaneously to shorten the overall start-to-end runtime of a test suite. For example, if 10 tests take a total of 10 minutes to run, then 2 parallel processes could execute 5 tests each and cut the total runtime down to 5 minutes. Even better, 10 processes could execute 1 test each to shrink runtime to 1 minute. Parallel testing is usually managed by either a test framework or a continuous integration tool. It also requires more compute resources than serial testing.

Why Go Parallel?

Running automated tests in parallel does require more effort (and potentially cost) than running tests serially. So, why go through the trouble?

The answer is simple: time. It is well documented that software bugs cost more when they are discovered later. That’s why current development practices like Agile and BDD strive to avoid problems from the start through small iterations and healthy collaboration (“shift left“), while CI/CD defensively catches regressions as soon as they happen (“fail fast“). Reducing the time to discover a problem after it has been introduced means higher quality and higher productivity.

Ideally, a developer should be told if a code change is good or bad immediately after committing it. The change should automatically trigger a new build that runs all tests. Unfortunately, tests are not instantaneous – they could take minutes, hours, or even days to complete. A test automation strategy based on the Testing Pyramid will certainly shorten start-to-end execution time but likely still require parallelization. Consider the layers of the Testing Pyramid and their tests’ average runtimes, the Testing Pyramid Rule of 1’s:

The Testing Pyramid with Times

Each layer is listed above with the rough runtime of a typical test. Though actual runtimes will vary, the Rule of 1’s focuses on orders of magnitude. Unit tests typically run in milliseconds because they often exercise product code in memory. Integration tests exercise live products but are limited in scope and often cover low-level areas (like REST service calls). End-to-end tests, however, cover full paths through a live system, which requires extra setup and waiting (like Selenium WebDriver interaction).

Now, consider how many tests from each layer could be run within given time limits, if the tests are run serially:

Test Layer 1 Minute
10 Minutes
Coffee Break
1 Hour
There Goes Today
Unit 60,000 600,000 3,600,000
Integration 60 600 3,600
End-to-End 1 10 60

Unit test numbers look pretty good, though keep in mind 1 millisecond is often the best-case runtime for a unit test. Integration and end-to-end runtimes, however, pose a more pressing problem. It is not uncommon for a project to have thousands of above-unit tests, yet not even a hundred end-to-end tests could complete within an hour, nor could a thousand integration tests complete within 10 minutes. Now, consider two more facts: (1) tests often run as different phases in a CI pipeline, to total runtimes are stacked, and (2) multiple commits would trigger multiple builds, which could cause a serious backup. Serial test execution would starve engineering feedback in any continuous integration system of scale. A team would need to drastically shrink test coverage or give up on being truly “continuous” in favor of running tests daily or weekly. Neither alternative is acceptable these days. CI needs parallel testing to be truly continuous.

The Danger of Collisions

The biggest danger for parallel testing is collision – when tests interfere with each other, causing invalid test failures. Collisions may happen in the product under test if product state is manipulated by more than one test at a time, or they may happen in the automation code itself if the code is not thread-safe. Collisions are also inherently intermittent, which makes them all the more difficult to diagnose. As a design principle, automated tests must avoid collisions for correct parallel execution.

Making tests run in parallel is not as simple as flipping a switch or adding a new config file. Automated tests must be specifically designed to run in parallel. A team may need to significantly redevelop their automation code to make parallel execution work right.


A train collision in Iran in November 2016. Don’t let this happen to your tests!

Handling Product-Level Collisions

Product-level collisions essentially reduce to how environments are set up and handled.

Separate Environments

The most basic way to avoid product-level collisions would be to run each test thread or process against its own instance of the product in an exclusive environment. (In the most extreme case, every single test could have its own product instance.) No collisions would happen in the product because each product instance would be touched by only one test instance at a time. Separate environments are possible to implement using various configuration and deployment tools. Docker containers are quick and easy to spin up. VMs with Vagrant, Puppet, Chef, and/or Ansible can also get it done.

However, it may not always be sensible to make separate environments for each test thread/process:

  • Creating a new environment is inefficient – it takes extra time to set up that may cancel out any time saved from parallel execution.
  • Many projects simply don’t have the money or the compute resources to handle a massive scale-out.
  • Some tests may not cause collisions and therefore may not need total isolation.
  • Some product environments are extremely large and complicated and would not be practical to replicate for each test individually.

Shared Environments

Environments with a shared product instance are quite common. One could be a common environment that everyone on a team shares, or one could be freshly created during a CI run and accessed by multiple test threads/processes. Either way, product-level collisions are possible, and tests must be designed to avoid clashing product states. Any test covering a persistent state is vulnerable; usually, this is the vast majority of tests. Consider web app testing as an example. Tests to load a page and do some basic interactions can probably run in parallel without extra protection, but tests that use a login to enter data or change settings could certainly collide. In this case, collisions could be avoided by using different logins for each simultaneous test instance – by using either a pool of logins, a unique login per test case, or a unique login per thread/process. Each product is different and will require different strategies for avoiding collisions.


We all share certain environments. Take care of them when you do. (Photo: The Blue Marble, taken by the Apollo 17 crew on Dec 7, 1972)

Handling Automation-Level Collisions

Automation-level collisions can happen when automation code is not thread-safe, which could mean more than simply locks and semaphores.

#1: Test Independence

Test cases must be completely independent of each other. One test must not require another test to run before it for the sake of setup. A test case should be able to run by itself without any others. A test suite should be able to run successfully in random order.

#2: Proper Variable Scope

If parallel tests will be run in the same memory address space, then it is imperative to properly scope all variables. Global or static mutable variables (e.g., “non-constants”) must not be allowed because they could be changed unexpectedly. The best pattern for handling scope is dependency injection. Thread-safe singletons would be a second choice. (Typically, global or static variables are used to subvert design patterns, so they may reveal further necessary automation rework when discovered.)

#3: External Resources

Automation may sometimes interact with external resources, such as test config files or test result databases/services. Make sure no external interactions collide. For example, make sure test run updates don’t overwrite each other.

#4: Logging

Logs are very difficult to trace when multiple tests are simultaneously printed to the same file. The best practice is to generate separate log files for each test case, thread, or process to make them readable.

#5: Result Aggregation

A test suite is a unified collection of tests, no matter how many threads/processes are used to run its tests in parallel. Make sure test results are aggregated together into one report. Some frameworks will do this automatically, while others will require custom post-processing.

#6: Test Filtering

One strategy to avoid collisions may be to run non-colliding partitions (subsets) of tests in parallel. Test tagging and filtering would make this possible. For example, tests that require a special login could be tagged as such and run together on one thread.

Test Scalability

The previous section on collisions discussed how to handle product environments. It is also important to consider how to handle the test automation environment. These are two different things: the product environment contains the live product under test, while the test environment contains the automation software and resources that run tests against the product. The test environment is where the parallel tests will be executed, and, as such, it must be scalable to handle the parallelization. A common example of a test environment could be a Jenkins master with a few agents for running build pipelines. There are two primary ways to scale the test environment: scale-up and scale-out.

Parallel Scale-Up

Scale-up is when one machine is configured to handle more tests in parallel. For example, scale-up would be when a machine switches from one (serial) thread to two, three, or even more in parallel. Many popular test runners support this type of scale-up by spawning and joining threads in a common memory address space or by forking processes. (For example, the SpecFlow+ Runner lets you choose.)

Scale-up is a simple way to squeeze as much utility out of an existing machine as possible. If tests are designed to handle collisions, and the test runner has out-of-the-box support, then it’s usually pretty easy to add more test threads/processes. However, parallel test scale-up is inherently limited by the machine’s capacity. Each additional test process succumbs to the law of diminishing returns as more memory and processor cycles are used. Eventually, adding more threads will actually slow down test execution because the processor(s) will waste time constantly switching between tests. (Anecdotally, I found the optimal test-thread-to-processor ratio to be 2-to-1 for running C#/SpecFlow/Selenium-WebDriver tests on Amazon EC2 M4 instances.) A machine itself could be upgraded with more threads and processors, but nevertheless, there are limits to a single machine’s maximum capacity. Weird problems like TCP/IP port exhaustion may also arise.

Scale Up

Scale-up adds more threads to one machine.

Parallel Scale-Out

Scale-out is when multiple machines are configured to run tests in parallel. Whereas scale-up had one machine running multiple tests, scale-out has multiple machines each running tests. Scale-out can be achieved in a number of ways. A few examples are:

  • One master test execution machine launches multiple Web UI tests that each use a remote Selenium WebDriver with a service like Selenium Grid, Sauce Labs, or BrowserStack.
  • A Jenkins pipeline launches tests across ten agents in parallel, in which each agent executes a tenth of the tests independently.

Scale-out is a better long-term solution than scale-up because scale-out can handle an unlimited number of machines for parallel testing. The limiting factor with scale-out is not the maximum capacity of the hardware but rather the cost of running more machines. However, scale-out is much harder to implement than scale-up. It requires tests to be evenly divided with some sort of balancer and filter. It also requires some sort of test result aggregation for joint reporting – people won’t want to piece together a bunch of separate reports to get an overall snapshot of quality. Plus, the test environment is more complicated to build and maintain (though tools like CloudBees Jenkins Enterprise or Amazon EC2 can make it easier.)

Scale Out

Scale-out distributes tests across multiple machines.

Upwards and Outwards

Of course, scale-up and scale-out are not mutually exclusive. Scaled-out nodes could individually be scaled-up. Consider a test environment with 10 powerful VMs that could each handle 10 tests in parallel – that means 100 tests could run simultaneously. Using the Rule of 1’s, it would take only about a minute to run 100 Web UI tests, which serially would have taken over an hour and a half! Use both strategies to shorten start-to-end runtime as much as possible.


Parallel testing is a worthwhile endeavor. When done properly, it will not only reduce development time but also improve the development experience. For readers who want to start doing parallel testing, I recommend researching the tools and frameworks you want to use. Many popular test frameworks support parallel execution, and even if the one you choose doesn’t, you can always invoke tests in parallel from the command line. Do well!

Missing Error Messages with Angular Testing

Logs are an essential part of test automation – they leave a trace of execution that is indispensable when backtracking through failures. Missing logs can make it much, much harder to figure out problems in the code. Recently, I hit this problem while writing unit tests for an Angular project: neither the console nor Google Chrome’s debugger showed any helpful error messages! Thankfully, there was a pretty easy solution. This article will explain the problem and the solution.

Update (January 18, 2018):

After further research, it appears that this problem was fixed in the @angular/cli 1.3.x release. I updated to 1.3.2, removed the “–sourcemaps=false” option, and verified that the error messages are printed. Furthermore, the source mapping is correct – the errors map to the correct line and column in the sources files!

If you are stuck using a version prior to 1.3.x, then use the workaround detailed below. Otherwise, upgrade the package and avoid the problem altogether!


Disable source maps when running Angular tests:

$ ng test --sourcemaps=false

Angular Project Setup

This article presumes the standard Angular 4 project setup, as automatically generated by the “ng new” command. Jasmine unit tests are written in “*.spec.ts” files and run with Karma using Google Chrome as the browser.

The Problem

The Angular testing utilities provide great support for isolating and exercising parts of Angular code for unit testing. However, programmers need to use them properly, or else they won’t work. When I tried writing some unit tests for ngrx, I quickly hit dependency problems. However, it took me hours to figure it out because the console output was not helpful – all it would print was “ERROR”:

Angular Test Errors 1

As a newbie, I had no idea what went wrong. I tried debugging with Chrome, but the error message I got there was cryptic and not much more helpful:

Angular Test Errors 2

The Solution

After googling for a while, I discovered that there is a bug with source maps in the Angular CLI (Issue #7296). The workaround is to add the “–sourcemaps=false” option to the “ng test” command. If the package.json file contains a “test” script that calls “ng test”, the option may be added there. Now, the console prints error messages:

Angular Test Errors 3

Errors also appear on the Karma page in the browser:

Angular Test Errors 4

One side effect of this workaround, however, is that the line and column numbers don’t correctly line up to the TypeScript files. I presume that they map to the compiled JavaScript files instead. Nevertheless, error messages with wrong line numbers are better than no error messages at all. There may be a way to fix the source mapping, but that’s a problem for another day. Hopefully, the Angular team will fix this “feature” for us.

Now, time to go fix those test errors!

Please Hang Up and Dial Again: Handling Test Interruptions in CI/CD

This post was originally published by Sealights on December 19, 2017 as part of their article Test Quality in CI/CD – Expert Roundup. I was honored to contribute my thoughts on automatic recovery in test automation, and I reblogged the text of my contribution here for Automation Panda readers. Please check out contributions from other experts in the full article!

Test automation is an essential part of CI/CD, but it must be extremely robust.
Unfortunately, tests running in live environments (integration and end-to-end)
often suffer rare but pesky “interruptions” that, if unhandled, will cause tests to fail.
These interruptions could be network blips, web pages not fully loaded, or
temporarily downed services – any environment issues unrelated to product bugs.
Interruptive failures are problematic because they (a) are intermittent and thus
difficult to pinpoint, (b) waste engineering time, (c) potentially hide real failures,
and (d) cast doubt over process/product quality. Furthermore, CI/CD magnifies
even rare issues. If an interruption has only a 1% chance of happening during a test,
then considering binomial probabilities, there is a 63% chance it will happen after
100 tests, and a 99% chance it will happen after 500 tests. Keep in mind that it is not
uncommon for thousands of tests to run daily in CI – Google Guava had over 286K
tests back in July 2012!

It is impossible to completely avoid interruptions – they will happen. Therefore, it is
imperative to handle interruptions at multiple layers:

  1. First, secure the platform upon which the tests run. Make sure system
    performance is healthy and that network connections are stable.
  2. Second, add failover logic to the automated tests. Any time an interruption
    happens, catch it as close to its source as possible, pause briefly, and retry the
    operation(s). Do not catch any type of error: pinpoint specific interruption
    signatures to avoid false positives. Build failover logic into the framework
    rather than implementing it for one-off cases. For example, wrappers around web element or service calls could automatically perform retries. Aspect-
    oriented programming can help here tremendously. Repeating failed tests in their entirety also works and may be easier to implement but takes much
    more time to run.
  3. Third, log any interruptions and recovery attempts as warnings. Do not
    neglect to report them because they could indicate legitimate problems,
    especially if patterns appear.

It may be difficult to differentiate interruptions from legitimate bugs. Or, certain
retry attempts might take too long to be practical. When in doubt, just fail the test –
that’s the safer approach.

Pipe Character Escape for Gherkin Tables

For the first time today, I had to write a Gherkin behavior scenario in which table text needed to use the pipe character “|”. I wanted a generic step that would find and click web page links by name, and one of the link names had the pipe in it! The first version of the step I wrote looked like this:

When the user follows the links:
  | link              |
  | Category          |
  | Sub-Category      |
  | Index|Description |

Naturally, this step didn’t parse – the “|” was parsed as a table delimiter instead of the intended link text. I could have rewritten the step to search for partial link text, or I could have done a key-value lookup, but I wanted to keep the step simple and direct.

The solution was simple: escape the pipe character “|” with a backslash character “\”. Easy! Thanks, StackOverflow! The updated table looks like this:

When the user follows the links:
 | link               |
 | Category           |
 | Sub-Category       |
 | Index\|Description |

“\|” works for both step tables and scenario outline example tables. It looks like it is fairly standard for test frameworks that use Gherkin. I verified that Cucumber-JVM and SpecFlow support it, and it looks like Cucumber for Ruby does as well. It looks like behave will support it in 1.2.6.

After learning this trick, I updated the BDD 101: The Gherkin Language page.

Note that backslash escape sequences won’t work for quotes in Gherkin steps. Quotes in steps are merely conventions and not part of the Gherkin language standard.

The Airing of Grievances: BDD

Behavior-Driven Development – one of my favorite blog topics. When done right, it’s a wonderful way to foster better collaboration and automation. When it’s not… well, let’s just say I got a lot of problems with bad BDD practices, and now you’re gonna hear about it!


Treating BDD as a Tool and Not as a Process

BDD is a process – it is a set of tools and practices designed to help teams deliver better software. BDD is not just a test automation framework; the framework is just one of the tools that support BDD. Heck, the word “development” is in the name!

Complaining that Gherkin is Too Technical

Really? Really!? Gherkin is basically just plain language with some buzzwords mixed in! It is specifically designed for non-technical people to handle it! It is not a full-fledged programming language – it is essentially a simple format for behavior specification that automation frameworks can easily parse. The steps are meant to be read like plain English (or any other spoken language) so that better collaboration can happen. If Gherkin is “too technical” for you, then I hate to know what isn’t.

No Buy-In from All Roles

The three major roles on an Agile team, a.k.a the “Three Amigos,” are biz, dev, and test (regardless of fancy names or assignments). For BDD to work well, all three role types must embrace it. Otherwise, collaboration will suffer. BDD is not just a QA thing, it’s for everyone. Biz gets better features in shorter time because requirements were communicated better. Dev wastes less time figuring out what biz wants and gets tests faster. Test can start automating right away since test scenarios are defined from the start in Gherkin. Everybody wins if everybody contributes.

No Three Amigos Meetings

Three Amigos meetings are like dietary fiber supplements: they help a team stay regular with collaboration, or else development gets constipated as engineers start building crap instead of the intended behaviors. Then the crap gets blocked up as the team must rework it, meaning it could be another sprint before there’s a healthy flush of new features. Open conversations in regularly scheduled Three Amigos meetings would have avoided the whole obstruction.

Forcing QA to Write All Behavior Scenarios

BDD is not just QA thing – it is for all roles. Pigeonholing the responsibility of writing behavior scenarios onto QA is not only unfair, it is anti-collaborative. The whole reason for writing scenarios in plain language with Gherkin is to let everyone contribute to feature behavior. Scenarios are primarily about capturing behavior, not writing tests. If tests were the main focus, then engineers could just write test cases using traditional automation frameworks directly in general purpose programming languages like Java or Python. BDD offers the benefits of process efficiency and shifting left when the whole team helps to write behavior scenarios.


Bad Gherkin

Only you can prevent bad Gherkin. Or I can – via rejected code reviews.

Typos, Poor Grammar, and Inconsistent Formatting

Gherkin needs to be readable. Steps with typos, poor grammar, and inconsistent formatting will still run fine for test automation, but they make it tough to understand the behaviors they describe. Sometimes, they can even make the meaning ambiguous.

No Double-Quotes Around Step Parameters

How do you know if something is a step parameter? “Double quotes” make it easy. However, Gherkin does not enforce double quotes around parameters. It is merely by programmer’s convention, but it’s a really helpful convention indeed.

No Tags

Tags make it super easy to filter scenarios at runtime. No tags? Good luck remembering long paths and names at runtime, or running related scenarios across different feature files together.

More Than 120 Characters per Line

Any longer is too much to comprehend. Either write the step more concisely, or split it apart. Plus, the line may go off the edge of the screen.

More Than 10 Steps per Scenario

Again, any longer is too much to comprehend. Scenarios should be short and sweet – they should concisely describe behavior. Too many steps means the scenario is too imperative or covers more than one behavior.

Multiple Behaviors per Scenario

Scenarios should not have multiple personality disorder: one scenario, one behavior. Don’t break the Cardinal Rule of BDD! So many people break this rule when they first start BDD because they are locked into procedure-driven thinking. Then, when tests fail, nobody knows exactly what behavior is the culprit. One scenario, one behavior.

Out-of-Order Step Types

Givens, Whens, and Thens each serve a specific, ordered purpose: Given some initial state, When actions are taken, Then verify an expected outcome. Jumbling them up ruins their meaning. Furthermore, duplicate When-Then pairs indicate multiple behaviors per scenario. And don’t just reassign step types to skirt the strict-ordering rule. Do it right – put integrity into the steps!

Gigantic Tables

Have you ever seen an Examples table with 13 columns? Or maybe 517 rows? I have. The horror, the horror! Tables that big make scenarios lose any semblance of specification-by-example. Make sure table rows and columns are actually needed. Use key-value lookups if the data is too gritty.

Being Imperative Rather Than Declarative

Given I’m logged into the app, when I click here, and I click there, and I type P, and I type L, and I type E, and I type A, and I type S, and I type E, and I type D, and I type O, and I type N, and I type T, and I type W, and I type R, and I type I, and I type T, and I type E, and I type S, and I type C, and I type E, and I type N, and I type A, and I type R, and I type I, and I type O, and I type S, and I type L, and I type I, and I type K, and I type E, and I type T, and I type H, and I type I, and I type S, then go directly to jail, and do not pass GO, and do not collect $200. Steps should focus more on what than how.

Prefixing Existing Test Procedure Steps with Gherkin Buzzwords

Let’s just take our existing test procedures from a tool like HP QualityCenter or ALM and put the words “Given,” “When,” and “Then” in front of every step. Ta-da! We’re now doing BDD! …WRONG!! I kid you not, I have see this happen. These people clearly never took BDD 101. It hurts to see.


Unorganized Step Definitions

Programmers like to throw their step definition methods anywhere. Add ’em to an unrelated existing class? Create a whole new class for only two new steps? Mix up the types? Who cares! Don’t bother to alphabetize them, either. Well, that’s how tech debt happens. That’s how duplicate steps get written, because originals can’t be found. Imagine a library without the Dewey Decimal System – that’s what an unorganized step def collection will be.

Putting Cleanup Code in Then Steps

Cleanup code belongs in After hooks, where it will be run no matter what fails during the scenario. Writing Then steps to do cleanup not only breaks step type integrity, the cleanup code will not run if a previous step aborts!

Catching and Burying All Exceptions

Here’s something I see all the time in automation code (and not just for BDD):

// FYI - This is Java, but the same thing can happen in any language
@When("^do something$")
public void doSomething() throws Throwable {
  try {
  catch (Exception e) {

The entirety of a step (or even a whole test) is surrounded by a try-catch that catches every exception. THIS STEP CAN NEVER REGISTER A FAILURE! Even if there was a failed assertion or, worse, an exception that ought to abort the test, it will get caught and buried with not much more than a slight whimper in the log. In this case, the test will carry forth to the next step, which will probably not work, either. I’ve seen projects with this sort of exception handling around every single step definition. In modern test frameworks, the framework will catch all exceptions at the highest level, register the test as failed, and move on safely to the next test. There is no need to catch any exception, unless the test can be recovered.

Changing Steps Without Testing Affected Scenarios

Sharing steps is a wonderful thing, but changing steps without testing all scenarios that use them is a terrible thing. I’ve seen people change step text or step def code and test only their new scenarios. Meanwhile, in the continuous integration environment, a dozen other tests using those steps started failing. (Hell, I’ve seen people push code that doesn’t even compile, but that’s another grievance.)

Multiple Names for the Same Step

Just because you can do something doesn’t mean you should. Different names for the same step may be useful for readability, but please keep the name variants limited.

No Dependency Injection

Dependency injection is the best way to share objects in an automation framework. (Singletons work well, too, but DI allows more careful control of scope.) Many frameworks like Cucumber-JVM even integrate with existing DI frameworks like PicoContainer and Spring. DON’T MAKE NON-CONSTANT VARIABLES GLOBAL! DON’T BLINDLY MAKE THINGS “STATIC” JUST TO SHARE THEM! Globals (or “statics” in Java/C# like languages) are dangerous: they can be easily misused, they are a nightmare to trace, and they can break multithreaded execution. Just use the appropriate design pattern: dependency injection.