SpecFlow

How Q2 uses BDD with SpecFlow for testing PrecisionLender

This case study was written by Andrew Knight, Lead Software Engineer in Test for Q2’s PrecisionLender product, in collaboration with Q2 and Tricentis. It explains the PrecisionLender team’s continuous testing journey and how SpecFlow served as a cornerstone for success.

What is PrecisionLender?

PrecisionLender is a web application that empowers commercial bankers with in-the-moment insights that help them structure and price commercial deals. Andi®, PrecisionLender’s intelligent virtual analyst, delivers these hyper-focused recommendations in real-time, allowing relationship managers to make data-driven decisions while pricing their commercial deals. PrecisionLender is owned and developed by Q2, a financial experience software company dedicated to providing digital banking and lending solutions to banks, credit unions, alternative finance, and fintech companies in the U.S. and internationally.

The PrecisionLender Opportunity Screen
(Picture taken from the PrecisionLender Support Center)

The starting point

The PrecisionLender team had a robust Continuous Integration (CI) delivery pipeline with strong unit test coverage, but they lacked end-to-end feature coverage. Developers would fill this gap by manually inspecting their changes in a shared development environment. However, as the PrecisionLender app grew, manual checks could not cover all possible integrations. The team knew they needed continuous automated testing to provide a safety net for development to remain lean and efficient. In April 2018, they hired Andrew Knight as their first Software Engineer in Test (SET) – a new role for the company – to lead the effort.

Automating tests with SpecFlow

The PrecisionLender team developed the Boa test solution – a project for automating end-to-end tests at scale. Boa would become PrecisionLender’s internal platform for test automation development. The name “Boa” is a loose acronym for “Behavior-Oriented Automation.”

The team chose SpecFlow to be the core framework for Boa tests. Since the PrecisionLender app’s backend is developed using .NET, SpecFlow was a natural fit. SpecFlow’s Gherkin syntax made tests readable and understandable, even to product owners and product support specialists who do not code.

The SpecFlow framework integrates with tools like Selenium WebDriver for testing Web UIs and RestSharp for testing REST APIs, so Boa tests can exercise the app’s vital pathways for thorough coverage. SpecFlow’s dependency injection mechanisms are solid yet simple, and the online docs are thorough. Plus, SpecFlow is an open-source project, so anyone can look at its code to learn how things work, open requests for new features, and even offer code contributions.
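
For illustration, here is a minimal sketch of how SpecFlow’s context injection can share a Selenium WebDriver session with step definition classes. The class names, step text, and URL below are hypothetical examples, not code from the Boa solution.

using BoDi;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using TechTalk.SpecFlow;

[Binding]
public class WebDriverHooks
{
    private readonly IObjectContainer _container;

    public WebDriverHooks(IObjectContainer container) => _container = container;

    [BeforeScenario]
    public void CreateWebDriver()
    {
        // Register one browser session per scenario; step definition
        // classes receive it through constructor injection.
        _container.RegisterInstanceAs<IWebDriver>(new ChromeDriver());
    }

    [AfterScenario]
    public void QuitWebDriver() => _container.Resolve<IWebDriver>().Quit();
}

[Binding]
public class OpportunitySteps
{
    private readonly IWebDriver _driver;

    public OpportunitySteps(IWebDriver driver) => _driver = driver;

    [When(@"the user opens the pricing page")]
    public void WhenTheUserOpensThePricingPage() =>
        _driver.Navigate().GoToUrl("https://app.example.com/pricing");
}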

An example Boa test, written in Gherkin using SpecFlow.

Executing tests with SpecFlow+ Runner

Writing good tests was only part of the challenge. The PrecisionLender team needed to execute Boa tests continuously to provide fast feedback on changes to the app. The team chose to run Boa tests using SpecFlow+ Runner, which is tailored for SpecFlow tests. The team uses SpecFlow+ Runner to launch tests in parallel in TeamCity any time a developer deploys a code change to internal pre-production environments. The entire test suite also runs every night against multiple product configurations. SpecFlow+ Runner produces a helpful test report with everything needed to triage test failures: pass-and-fail tallies overall and per feature, a visual execution timeline, and full system logs. If engineers need to investigate certain failures more closely, they can use SpecFlow tags and SpecFlow+ Runner profiles to selectively filter tests for reruns. SpecFlow+ Runner’s multiple features help the team expedite test execution and investigation.

The SpecFlow+ Runner report for a dozen smoke tests.

Sharing features with SpecFlow+ LivingDoc

Good test cases are more than just verification procedures – they are behavior specifications. They define how features should work. Instead of keeping testing work siloed by role, the PrecisionLender team wanted to share Boa tests as behavior specs with all stakeholders to foster greater collaboration and understanding around features. The team also wanted to share Boa tests with specific customers without sharing the entire automation code.

SpecFlow+ LivingDoc enabled the PrecisionLender team to turn Gherkin feature files into living documentation. Whereas the SpecFlow+ Runner report focuses on automation execution, the SpecFlow+ LivingDoc report focuses on behavior specification apart from coding and automation details. LivingDoc displays Gherkin scenarios in a readable, searchable way that both internal folks and customers can consume. It can also optionally include high-level pass-and-fail results for each scenario, providing just enough information to be helpful and not overwhelming. LivingDoc has also helped PrecisionLender’s engineers identify and eliminate unused step definitions within the automation code. PrecisionLender benefits greatly from complementary reports from SpecFlow+ Runner and SpecFlow+ LivingDoc.

The SpecFlow+ LivingDoc report for a dozen smoke tests with their pass-and-fail results.

Improving interactions with Boa Constrictor

The Boa test solution initially used the Page Object Model to model interactions with the PrecisionLender app. However, as the PrecisionLender team automated more and more Boa tests, it became apparent that page objects did not scale well. Many page object classes had duplicative methods, making automation code messy. Some methods also did not include appropriate waiting mechanisms, introducing flaky failures.

PrecisionLender’s SETs developed Boa Constrictor, a .NET implementation of the Screenplay Pattern, to make better interactions for better automation. In Screenplay, actors use abilities to perform interactions. For example, an ability could be using Selenium WebDriver, and an interaction could be clicking an element. The Screenplay Pattern can be seen as a refactoring of the Page Object Model that minimizes duplicate code through a better separation of concerns. Individual interactions can be hardened for robustness, eliminating flaky hotspots. The Boa test solution now exclusively uses Boa Constrictor for interactions.
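
To show the shape of the pattern, here is a simplified Screenplay sketch in C#. The interfaces and names are illustrative only; Boa Constrictor’s actual API differs in naming and detail.

using System;
using System.Collections.Generic;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

// Illustrative Screenplay types; not Boa Constrictor's real API.
public interface IAbility { }

public interface ITask
{
    void PerformAs(Actor actor);
}

public class Actor
{
    private readonly Dictionary<Type, IAbility> _abilities = new();

    public void Can(IAbility ability) => _abilities[ability.GetType()] = ability;

    public TAbility Using<TAbility>() where TAbility : IAbility =>
        (TAbility)_abilities[typeof(TAbility)];

    // Every interaction funnels through the actor, so waiting and logging
    // can be hardened in one place instead of in every page object.
    public void AttemptsTo(ITask task) => task.PerformAs(this);
}

// An ability wraps Selenium WebDriver; an interaction uses it.
public class BrowseTheWeb : IAbility
{
    public IWebDriver Driver { get; }
    public BrowseTheWeb(IWebDriver driver) => Driver = driver;
}

public class Click : ITask
{
    private readonly By _locator;
    public Click(By locator) => _locator = locator;

    public void PerformAs(Actor actor)
    {
        IWebDriver driver = actor.Using<BrowseTheWeb>().Driver;
        new WebDriverWait(driver, TimeSpan.FromSeconds(30))
            .Until(d => d.FindElement(_locator).Displayed);
        driver.FindElement(_locator).Click();
    }
}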

In October 2020, Q2 released Boa Constrictor as an open-source project so that anyone can use it. It is fully compatible with SpecFlow and other .NET test frameworks, and it provides rich interactions for Selenium WebDriver and RestSharp out of the box.

Boa Constrictor, the .NET Screenplay Pattern.

Scaling massively with Selenium Grid

When the PrecisionLender team first started automating Boa tests, they ran tests one at a time. That soon became too slow since the average Boa test took 20 to 50 seconds to complete. The team then started running up to 3 tests in parallel on one machine, but that also was not fast enough. They turned to Selenium Grid, a tool for running WebDriver sessions remotely across multiple machines.

PrecisionLender built a set of internal Selenium Grid instances using Microsoft Azure virtual machines to run Boa tests at high scale. As of July 2021, PrecisionLender has over 1800 unique Boa tests that run across four distinct product configurations. Whenever TeamCity detects a code change, it triggers a “continuous” Boa test suite with over 1000 tests, running 50 tests in parallel using Google Chrome on Selenium Grid. It completes execution in about 10 minutes. TeamCity launches the full test suite every night against all product configurations with 64-100 parallel tests on Selenium Grid. Continuous Integration currently runs up to 10K Boa tests daily against the PrecisionLender app with SpecFlow+ Runner and Selenium Grid.

The Boa test solution architecture, including Continuous Integration through TeamCity and parallel testing with SpecFlow+ Runner and Selenium Grid.

Shifting left with BDD

Better testing and automation practices eventually inspired better development practices. Product owners would create user stories, but developers would struggle to understand requirements and business purposes fully. PrecisionLender’s SETs started bringing together the Three Amigos – business, development, and testing roles – to discuss product behaviors proactively while creating user stories. They introduced Behavior-Driven Development (BDD) activities like Example Mapping to explore behaviors together. Then, well-defined stories could be easily connected to SpecFlow tests written in Gherkin following Specification by Example (SBE). Teams repeatedly saved time by thinking before coding and specifying before testing. They built higher quality into features from the beginning, and they stopped before working on half-baked stories with unjustified value propositions. Developers who participated in these behavior-driven practices were also more likely to automate Boa tests on their own. Furthermore, one of PrecisionLender’s developers loved BDD practices so much that he joined the team of SETs! Through Gherkin, SpecFlow provided a foundation that enabled quality work to shift left.

Challenges along the way

Achieving true continuous testing had its challenges along the way. Intermittent failure was the most significant issue PrecisionLender faced at scale. With so many tests, environments, and infrastructural pieces, arbitrary failures were statistically unavoidable. The PrecisionLender team took a two-pronged approach to handle intermittent failures: (1) eliminate race conditions in automation using good interactions with Boa Constrictor, and (2) use SpecFlow+ Runner to automatically retry failed tests to determine if failures were consistent or intermittent. These two approaches reduced the frequency of flaky failures and helped engineers quickly resolve any remaining issues. As a result, Boa tests enjoy well above a 99% success rate, and most failures are due to actual bugs.

PrecisionLender app performance at scale was a second big challenge. Running up to 100 tests in parallel turned functional tests into de facto load tests. Testing at scale repeatedly uncovered performance bottlenecks in the app. Performance issues caused widespread test failures that were difficult to diagnose because they appeared intermittently. Still, the visual timeline and timestamps in the SpecFlow+ Runner report helped the team identify periods of failure that could be cross-checked against backend logs, metrics, and database queries. Developers resolved many performance issues and significantly boosted the app’s response times and load capacity.

Training team members to develop solid test automation was the third challenge. At the start of the journey, test automation, Gherkin, and BDD were all new to PrecisionLender. The PrecisionLender SETs took active steps to train others on how to develop good tests and good automation through group workshops, Three Amigos meetings, and one-on-one mentoring sessions. They shared resources like the Automation Panda blog for how to write good tests and good Gherkin. The investment in education paid off: many developers have joined the SETs in writing readable, reliable Boa tests that run continuously.

Benefits to the business

Developing a continuous testing solution brought many incredible benefits to PrecisionLender. First, the quality of the PrecisionLender app improved because continuous testing provided fast feedback on failures that developers could quickly fix. Instead of relying on manual spot checks, the team could trust the comprehensive safety net of Boa tests to catch bugs. Many issues would be caught within an hour of a developer making a code commit, and the longest feedback cycle would be only one business day for the full nightly test suites to run. Boa tests catch failures before customers ever experience them. The continuous nature of testing enables PrecisionLender to publish new releases every two weeks.

Second, the high reliability of the Boa test solution means that the PrecisionLender team can trust test results. When a test passes, the behavior is working. When a test fails, there is a real bug. Reliability also means that engineers spend less time on automation maintenance and more time on more valuable activities, like developing new features and adding new tests. Quality is present in both the product code and the test code.

Third, continuous testing boosts customer confidence in PrecisionLender. Customers trust the software quality because they know that PrecisionLender thoroughly tests every release. The PrecisionLender team also shares SpecFlow+ LivingDoc reports with specific clients to prove quality.

A bright future

PrecisionLender’s continuous testing journey is not over. Since the PrecisionLender team hired its first SET, it has hired three more, in addition to a testing manager, to grow quality improvement efforts. Multiple development teams have written their own Boa tests, and they plan to write more tests independently. SpecFlow’s tools have been indispensable in helping the PrecisionLender team achieve successful quality assurance. As PrecisionLender welcomes more customers, the Boa solution will be ready to scale with more tests, more configurations, and more executions.

Are Automated Test Retries Good or Bad?

What happens when a test fails? If someone is manually running the test, then they will pause and poke around to learn more about the problem. However, when an automated test fails, the rest of the suite keeps running. Testers won’t get to view results until the suite is complete, and the automation won’t perform any extra exploration at the time of failure. Instead, testers must review logs and other artifacts gathered during testing, and they might even need to rerun the failed test to check if the failure is consistent.

Since testers typically rerun failed tests as part of their investigation, why not configure automated tests to automatically rerun failed tests? On the surface, this seems logical: automated retries can eliminate one more manual step. Unfortunately, automated retries can also enable poor practices, like ignoring legitimate issues.

So, are automated test retries good or bad? This is actually a rather controversial topic. I’ve heard many voices strongly condemn automated retries as an antipattern. While I agree that automated retries can be abused, I nevertheless still believe they can add value to test automation. Reaching a deeper understanding requires a nuanced approach.

So, how do automated retries work?

To avoid any confusion, let’s carefully define what we mean by “automated test retries.”

Let’s say I have a suite of 100 automated tests. When I run these tests, the framework will execute each test individually and yield a pass or fail result for the test. At the end of the suite, the framework will aggregate all the results together into one report. In the best case, all tests pass: 100/100.

However, suppose that one of the tests fails. Upon failure, the test framework would capture any exceptions, perform any cleanup routines, log a failure, and safely move on to the next test case. At the end of the suite, the report would show 99/100 passing tests with one test failure.

By default, most test frameworks will run each test one time. However, some test frameworks have features for automatically rerunning test cases that fail. The framework may even enable testers to specify how many retries to attempt. So, let’s say that we configure 2 retries for our suite of 100 tests. When that one test fails, the framework would queue that failing test to run twice more before moving on to the next test. It would also add more information to the test report. For example, if one retry passed but another one failed, the report would show 99/100 passing tests with a 1/3 pass rate for the failing test.
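
As a concrete example of a test-level retry mechanism, NUnit provides a [Retry] attribute that reruns a test when an assertion fails. The test and helper below are hypothetical. Note that plain NUnit simply reports the test as passed if any attempt passes, whereas a report like the one described above would show the full pass rate.

using NUnit.Framework;

[TestFixture]
public class UploadTests
{
    // NUnit reruns this test, up to 3 total attempts, if an assertion fails.
    [Test, Retry(3)]
    public void UploadSmallSpreadsheet_Succeeds()
    {
        bool uploaded = TryUpload("500kb-sheet.xlsx"); // hypothetical helper
        Assert.That(uploaded, Is.True);
    }

    private static bool TryUpload(string fileName)
    {
        // Placeholder for the real upload logic under test.
        return true;
    }
}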

In this article, we will focus on automated retries for test cases. Testers could also program other types of retries into automated tests, such as retrying browser page loads or REST requests. Interaction-level retries require sophisticated, context-specific logic, whereas test-level retry logic works the same for any kind of test case. (Interaction-level retries would also need their own article.)

Automated retries can be a terrible antipattern

Let’s see how automated test retries can be abused:

Jeremy is a member of a team that runs a suite of 300 automated tests for their web app every night. Unfortunately, the tests are notoriously flaky. About a dozen different tests fail every night, and Jeremy spends a lot of time each morning triaging the failures. Whenever he reruns failed tests individually on his laptop, they almost always pass.

To save himself time in the morning, Jeremy decides to add automatic retries to the test suite. Whenever a test fails, the framework will attempt one retry. Jeremy will only investigate tests whose retries failed. If a test had a passing retry, then he will presume that the original failure was just a flaky test.

Ouch! There are several problems here.

First, Jeremy is using retries to conceal information rather than reveal information. If a test fails but its retries pass, then the test still reveals a problem! In this case, the underlying problem is flaky behavior. Jeremy is using automated retries to overwrite intermittent failures with intermittent passes. Instead, he should investigate why the tests are flaky. Perhaps automated interactions have race conditions that need more careful waiting. Or, perhaps features in the web app itself are behaving unexpectedly. Test failures indicate a problem – either in test code, product code, or infrastructure.

Second, Jeremy is using automated retries to perpetuate poor practices. Before adding automated retries to the test suite, Jeremy was already manually retrying tests and disregarding flaky failures. Adding retries to the test suite merely speeds up the process, making it easier to sidestep failures.

Third, the way Jeremy uses automated retries indicates that the team does not value their automated test suite very much. Good test automation requires effort and investment. Persistent flakiness is a sign of neglect, and it fosters low trust in testing. Using retries is merely a “band-aid” on both the test failures and the team’s attitude about test automation.

In this example, automated test retries are indeed a terrible antipattern. They enable Jeremy and his team to ignore legitimate issues. In fact, they incentivize the team to ignore failures because they institutionalize the practice of replacing red X’s with green checkmarks. This team should scrap automated test retries and address the root causes of flakiness.

Testers should not conceal failures by overwriting them with passes.

Automated retries are not the main problem

Ignoring flaky failures is unfortunately all too common in the software industry. I must admit that in my days as a newbie engineer, I was guilty of rerunning tests to get them to pass. Why do people do this? The answer is simple: intermittent failures are difficult to resolve.

Testers love to find consistent, reproducible failures because those are easy to explain. Other developers can’t push back against hard evidence. However, intermittent failures take much more time to isolate. Root causes can become mind-bending puzzles. They might be triggered by environmental factors or awkward timings. Sometimes, teams never figure out what causes them. In my personal experience, bug tickets for intermittent failures get far less traction than bug tickets for consistent failures. All these factors incentivize folks to turn a blind eye to intermittent failures when convenient.

Automated retries are just a tool and a technique. They may enable bad practices, but they aren’t inherently bad. The main problem is willfully ignoring certain test results.

Automated retries can be incredibly helpful

So, what is the right way to use automated test retries? Use them to gather more information from the tests. Test results are simply artifacts of feedback. They reveal how a software product behaved under specific conditions and stimuli. The pass-or-fail nature of assertions simplifies test results at the top level of a report in order to draw attention to failures. However, reports can give more information than just binary pass-or-fail results. Automated test retries yield a series of results for a failing test that indicate a success rate.

For example, SpecFlow and the SpecFlow+ Runner make it easy to use automatic retries the right way. Testers simply need to add the retryFor setting to their SpecFlow+ Runner profile to set the number of retries to attempt. In the final report, SpecFlow records the success rate of each test with color-coded counts. Results are revealed, not concealed.
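
For reference, the retry settings live in the SpecFlow+ Runner profile (an .srprofile XML file). The fragment below is a rough sketch; treat the exact element and attribute names as assumptions and confirm them against the SpecFlow+ Runner documentation for your version.

<!-- Sketch of a SpecFlow+ Runner profile fragment enabling retries.
     Attribute names and values are assumptions; check the official docs. -->
<Execution retryFor="Failing" retryCount="2" />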

Here is a snippet of the SpecFlow+ Report showing both intermittent failures (in orange) and consistent failures (in red).

This information jumpstarts analysis. As a tester, one of the first questions I ask myself about a failing test is, “Is the failure reproducible?” Without automated retries, I need to manually rerun the test to find out – often at a much later time and potentially within a different context. With automated retries, that step happens automatically and in the same context. Analysis then takes two branches:

  1. If all retry attempts failed, then the failure is probably consistent and reproducible. I would expect it to be a clear functional failure that would be fast and easy to report. I jump on these first to get them out of the way.
  2. If some retry attempts passed, then the failure is intermittent, and it will probably take more time to investigate. I will look more closely at the logs and screenshots to determine what went wrong. I will try to exercise the product behavior manually to see if the product itself is inconsistent. I will also review the automation code to make sure there are no unhandled race conditions. I might even need to rerun the test multiple times to measure a more accurate failure rate.

I do not ignore any failures. Instead, I use automated retries to gather more information about the nature of the failures. In the moment, this extra info helps me expedite triage. Over time, the trends this info reveals help me identify weak spots in both the product under test and the test automation.

Automated retries are most helpful at high scale

When used appropriately, automated retries can be helpful for any size test automation project. However, they are arguably more helpful for large projects running tests at high scale than for small projects. Why? Two main reasons: complexities and priorities.

Large-scale test projects have many moving parts. For example, at PrecisionLender, we presently run 4K-10K end-to-end tests against our web app every business day. (We also run ~100K unit tests every business day.) Our tests launch from TeamCity as part of our Continuous Integration system, and they use in-house Selenium Grid instances to run 50-100 tests in parallel. The PrecisionLender application itself is enormous, too.

Intermittent failures are inevitable in large-scale projects for many different reasons. There could be problems in the test code, but those aren’t the only possible problems. At PrecisionLender, Boa Constrictor already protects us from race conditions, so our intermittent test failures are rarely due to problems in automation code. Other causes for flakiness include:

  • The app’s complexity makes certain features behave inconsistently or unexpectedly
  • Extra load on the app slows down response times
  • The cloud hosting platform has a service blip
  • Selenium Grid arbitrarily chokes on a browser session
  • The DevOps team recycles some resources
  • An engineer makes a system change while tests are running
  • The CI pipeline deploys a new change in the middle of testing

Many of these problems result from infrastructure and process. They can’t easily be fixed, especially when environments are shared. As one tester, I can’t rewrite my whole company’s CI pipeline to be “better.” I can’t rearchitect the app’s whole delivery model to avoid all collisions. I can’t perfectly guarantee 100% uptime for my cloud resources or my test tools like Selenium Grid. Some of these might be good initiatives to pursue, but one tester’s dictates do not immediately become reality. Many times, we need to work with what we have. Curt demands to “just fix the tests” come off as pedantic.

Automated test retries provide very useful information for discerning the nature of such intermittent failures. For example, at PrecisionLender, we hit Selenium Grid problems frequently. Roughly 1/10000 Selenium Grid browser sessions will inexplicably freeze during testing. We don’t know why this happens, and our investigations have been unfruitful. We chalk it up to minor instability at scale. Whenever the 1/10000 failure strikes, our suite’s automated retries kick in and pass. When we review the test report, we see the intermittent failure along with its exception message. Based on its signature, we immediately know that the test is fine. We don’t need to do extra investigation work or manual reruns. Automated retries gave us the info we needed.

Selenium Grid is a large cluster with many potential points of failure.
(Image source: LambdaTest.)

Another type of common failure is intermittently slow performance in the PrecisionLender application. Occasionally, the app will freeze for a minute or two and then recover. When that happens, we see a “brick wall” of failures in our report: all tests during that time frame fail. Then, automated retries kick in, and the tests pass once the app recovers. Automatic retries prove in the moment that the app momentarily froze but that the individual behaviors covered by the tests are okay. This indicates functional correctness for the behaviors amidst a performance failure in the app. Our team has used these kinds of results on multiple occasions to identify performance bugs in the app by cross-checking system logs and database queries during the time intervals for those brick walls of intermittent failures. Again, automated retries gave us extra information that helped us find deep issues.

Automated retries delineate failure priorities

That answers complexity, but what about priority? Unfortunately, in large projects, there is more work to do than any team can handle. Teams need to make tough decisions about what to do now, what to do later, and what to skip. That’s just business. Testing decisions become part of that prioritization.

In almost all cases, consistent failures are inherently a higher priority than intermittent failures because they have a greater impact on the end users. If a feature fails every single time it is attempted, then the user is blocked from using the feature, and they cannot receive any value from it. However, if a feature works some of the time, then the user can still get some value out of it. Furthermore, the rarer the intermittency, the lower the impact, and consequentially the lower the priority. Intermittent failures are still important to address, but they must be prioritized relative to other work at hand.

Automated test retries automate that initial prioritization. When I triage PrecisionLender tests, I look into consistent “red” failures first. Our SpecFlow reports make them very obvious. I know those failures will be straightforward to reproduce, explain, and hopefully resolve. Then, I look into intermittent “orange” failures second. Those take more time. I can quickly identify issues like Selenium Grid disconnections, but other issues may not be obvious (like system interruptions) or may need additional context (like the performance freezes). Sometimes, we may need to let tests run for a few days to get more data. If I get called away to another more urgent task while I’m triaging results, then at least I could finish the consistent failures. It’s a classic 80/20 rule: investigating consistent failures typically gives more return for less work, while investigating intermittent failures gives less return for more work. It is what it is.

The only time I would prioritize an intermittent failure over a consistent failure would be if the intermittent failure causes catastrophic or irreversible damage, like wiping out an entire system, corrupting data, or burning money. However, that type of disastrous failure is very rare. In my experience, almost all intermittent failures are due to poorly written test code, automation timeouts from poor app performance, or infrastructure blips.

Context matters

Automated test retries can be a blessing or a curse. It all depends on how testers use them. If testers use retries to reveal more information about failures, then retries greatly assist triage. Otherwise, if testers use retries to conceal intermittent failures, then they aren’t doing their jobs as testers. Folks should not be quick to presume that automated retries are always an antipattern. We couldn’t achieve our scale of testing at PrecisionLender without them. Context matters.

Solving: How to write good UI interaction tests? #GivenWhenThenWithStyle

Writing good Gherkin is a passion of mine. Good Gherkin means good behavior specification, which results in better features, better tests, and ultimately better software. To help folks improve their Gherkin skills, Gojko Adzic and SpecFlow are running a series of #GivenWhenThenWithStyle challenges. I love reading each new challenge, and in this article, I provide my answer to one of them.

The Challenge

Challenge 20 states:

This week, we’re looking into one of the most common pain points with Given-When-Then: writing automated tests that interact with a user interface. People new to behaviour driven development often misunderstand what kind of behaviour the specifications should describe, and they write detailed user interactions in Given-When-Then scenarios. This leads to feature files that are very easy to write, but almost impossible to understand and maintain.

Here’s a typical example:

Scenario: Signed-in users get larger capacity
 
Given a user opens https://www.example.com using Chrome
And the user clicks on "Upload Files"
And the page reloads
And the user clicks on "Spreadsheet Formats"
Then the buttons "XLS" and "XLSX" show
And the user clicks on "XLSX"
And the user selects "500kb-sheet.xlsx"
Then the upload completes
And the table "Uploaded Files" contains a cell with "500kb-sheet.xlsx" 
And the user clicks on "XLSX"
And the user selects "1mb-sheet.xlsx"
Then the upload fails
And the table "Uploaded Files" does not contain a cell with "1mb-sheet.xlsx" 
And the user clicks on "Login"
And the user enters "testuser123" into the "username" field
And the user enters "$Pass123" into the "password" field
And the user clicks on "Sign in"
And the page reloads
Then the table "Uploaded Files" contains a cell with "500kb-sheet.xlsx" 
And the table "Uploaded Files" does not contain a cell with "1mb-sheet.xlsx" 
And the user clicks on "spreadsheet formats"
Then the buttons "XLS" and "XLSX" show
And the user clicks on "XLSX"
And the user selects "1mb-sheet.xlsx"
Then the upload completes
And the table "Uploaded Files" contains a cell with "1mb-sheet.xlsx" 
And the table "Uploaded Files" contains a cell with "500kb-sheet.xlsx"

A common way to avoid such issues is to rewrite the specification to avoid the user interface completely. We’ve looked into that option several times in this article series. However, that solution only applies if the risk we’re testing is not in the user interface, but somewhere below. To make this challenge more interesting, let’s say that we actually want to include the user interface in the test, since the risk is in the UI interactions.

Indeed, most behavior-driven practitioners would generally recommend against phrasing steps using language specific to the user interface. However, there are times when testing a user interface itself is valid. For example, I work at PrecisionLender, a Q2 Company, and our main web app is very heavy on the front end. It has many, many interconnected fields for pricing commercial lending opportunities. My team has quite a few tests to cover UI-centric behaviors, such as verifying that entering a new interest rate triggers recalculation for summary amounts. If the target behavior is a piece of UI functionality, and the risk it bears warrants test coverage, then so be it.

Let’s break down the example scenario given above to see how to write Gherkin with style for user interface tests.

Understanding Behavior

Behavior is behavior. If you can describe it, then you can do it. Everything exhibits behavior, from the source code itself to the API, UIs, and full end-to-end workflows. Gherkin scenarios should use verbiage that reflects the context of the target behavior. Thus, the example above uses words like “click,” “select,” and “open.” Since the scenario explicitly covers a user interface, I think it is okay to use these words here. What bothers me, however, are two apparent code smells:

  1. The wall of text
  2. Out-of-order step types

The first issue is the wall of text this scenario presents. Walls of text are hard to read because they present too much information at once. The reader must take time to read through the whole chunk. Many readers simply read the first few lines and then skip the remainder. The example scenario has 27 Given-When-Then steps. Typically, I recommend that Gherkin scenarios have single-digit step counts. A scenario with fewer than 10 steps is easier to understand and less likely to include unnecessary information. Longer scenarios are not necessarily “wrong,” but their length suggests that they could probably be rewritten more concisely.

The second issue in the example scenario is that step types are out of order. Given-When-Then is a formula for success. Gherkin steps should follow strict Given → When → Then ordering because this ordering demarcates individual behaviors. Each Gherkin scenario should cover one individual behavior so that the target behavior is easier to understand, easier to communicate, and easier to investigate whenever the scenario fails during testing. When scenarios break the order of steps, such as Given → Then → Given → Then in the example scenario, it shows that either the scenario covers multiple behaviors or that the author did not bring a behavior-driven understanding to the scenario.

The rules of good behavior don’t disappear when the type of target behavior changes. We should still write Gherkin with best practices in mind, even if our scenarios cover user interfaces.

Breaking Down Scenarios

If I were to rewrite the example scenario, I would start by isolating individual behaviors. Let’s look at the first half of the original example:

Given a user opens https://www.example.com using Chrome
And the user clicks on "Upload Files"
And the page reloads
And the user clicks on "Spreadsheet Formats"
Then the buttons "XLS" and "XLSX" show
And the user clicks on "XLSX"
And the user selects "500kb-sheet.xlsx"
Then the upload completes
And the table "Uploaded Files" contains a cell with "500kb-sheet.xlsx" 
And the user clicks on "XLSX"
And the user selects "1mb-sheet.xlsx"
Then the upload fails
And the table "Uploaded Files" does not contain a cell with "1mb-sheet.xlsx"

Here, I see four distinct behaviors covered:

  1. Clicking “Upload Files” reloads the page.
  2. Clicking “Spreadsheet Formats” displays new buttons.
  3. Uploading a spreadsheet file makes the filename appear on the page.
  4. Attempting to upload a spreadsheet file that is 1MB or larger fails.

If I wanted to purely retain the same coverage, then I would rewrite these behavior specs using the following scenarios:

Feature: Example site
 
 
Scenario: Choose to upload files
 
Given the Example site is displayed
When the user clicks the "Upload Files" link
Then the page displays the "Spreadsheet Formats" link
 
 
Scenario: Choose to upload spreadsheets
 
Given the Example site is ready to upload files
When the user clicks the "Spreadsheet Formats" link
Then the page displays the "XLS" and "XLSX" buttons
 
 
Scenario: Upload a spreadsheet file that is smaller than 1MB
 
Given the Example site is ready to upload spreadsheet files
When the user clicks the "XLSX" button
And the user selects "500kb-sheet.xlsx" from the file upload dialog
Then the upload completes
And the table "Uploaded Files" contains a cell with "500kb-sheet.xlsx" 
 
 
Scenario: Upload a spreadsheet file that is larger than or equal to 1MB
 
Given the Example site is ready to upload spreadsheet files
When the user clicks the "XLSX" button
And the user selects "1mb-sheet.xlsx" from the file upload dialog
Then the upload fails
And the table "Uploaded Files" does not contain a cell with "1mb-sheet.xlsx"

Now, each scenario covers each individual behavior. The first scenario starts with the Example site in a “blank” state: “Given the Example site is displayed”. The second scenario inherently depends upon the outcome of the first scenario. Rather than repeat all the steps from the first scenario, I wrote a new starting step to establish the initial state more declaratively: “Given the Example site is ready to upload files”. This step’s definition method may need to rerun the same operations as the first scenario, but it guarantees independence between scenarios. (The step could also optimize the operations, but that should be a topic for another challenge.) Likewise, the third and fourth scenarios have a Given step to establish the state they need: “Given the Example site is ready to upload spreadsheet files.” Both scenarios can share the same Given step because they have the same starting point. All three of these new steps are descriptive more than prescriptive. They declaratively establish an initial state, and they leave the details to the automation code in the step definition methods to determine precisely how that state is established. This technique makes it easy for Gherkin scenarios to be individually clear and independently executable.
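
To make the idea concrete, here is a hypothetical SpecFlow step definition for one of those declarative Given steps. The ExampleSitePages class is an invented placeholder for whatever page objects or interactions the automation actually uses.

using TechTalk.SpecFlow;

[Binding]
public class ExampleSiteSetupSteps
{
    private readonly ExampleSitePages _pages; // hypothetical UI wrapper

    public ExampleSiteSetupSteps(ExampleSitePages pages) => _pages = pages;

    [Given(@"the Example site is ready to upload files")]
    public void GivenTheExampleSiteIsReadyToUploadFiles()
    {
        // The Gherkin stays declarative; the automation decides how to
        // reach this state. Here it replays the UI path, but it could
        // just as well call an API or seed test data.
        _pages.Open();
        _pages.ClickLink("Upload Files");
    }
}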

I also added my own writing style to these scenarios. First, I wrote concise, declarative titles for each scenario. The titles dictate interaction over mechanics. For example, the first scenario’s title uses the word “choose” rather than “click” because, from the user’s perspective, they are “choosing” an action to take. The user will just happen to mechanically “click” a link in the process of making their choice. The titles also provide a level of example. Note that the third and fourth scenarios spell out the target file sizes. For brevity, I typically write scenario titles using active voice: “Choose this,” “Upload that,” or “Do something.” I try to avoid including verification language in titles unless it is necessary to distinguish behaviors.

Another stylistic element of mine was to remove explicit details about the environment. Instead of hard coding the website URL, I gave the site a proper name: “Example site.” I also removed the mention of Chrome as the browser. These details are environment-specific, and they should not be specified in Gherkin. In theory, this site could have multiple instances (like an alpha or a beta), and it should probably run in any major browser (like Firefox and Edge). Environmental characteristics should be specified as inputs to the automation code instead.

I also refined some of the language used in the When and Then steps. When I must write steps for mechanical actions like clicks, I like to specify element types for target elements. For example, “When the user clicks the “Upload Files” link” specifies a link by a parameterized name. Saying the element is a link helps provide context to the reader about the user interface. I wrote other steps that specify a button, too. These steps also specified the element name as a parameter so that the step definition method could possibly perform the same interaction for different elements. Keep in mind, however, that these linguistic changes are neither “required” nor “perfect.” They make sense in the immediate context of this feature. While automating step definitions or writing more scenarios, I may revisit the verbiage and do some refactoring.
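
Here is a hypothetical SpecFlow binding for those parameterized click steps. Again, ExampleSitePages is an invented placeholder, and the regular expressions are just one way to capture the element name.

using TechTalk.SpecFlow;

[Binding]
public class ElementSteps
{
    private readonly ExampleSitePages _pages; // hypothetical UI wrapper

    public ElementSteps(ExampleSitePages pages) => _pages = pages;

    // One binding per element type; the element name is a parameter,
    // so the same step definition serves "Upload Files",
    // "Spreadsheet Formats", and any future link.
    [When(@"the user clicks the ""(.*)"" link")]
    public void WhenTheUserClicksTheLink(string linkName) =>
        _pages.ClickLink(linkName);

    [When(@"the user clicks the ""(.*)"" button")]
    public void WhenTheUserClicksTheButton(string buttonName) =>
        _pages.ClickButton(buttonName);
}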

Determining Value for Each Behavior

The four new scenarios I wrote each cover an independent, individual behavior of the fictitious Example site’s user interface. They are thorough in their level of coverage for these small behaviors. However, not all behaviors may be equally important to cover. Some behaviors are simply more important than others, and thus some tests are more valuable than others. I won’t go into deep detail about how to measure risk and determine value for different tests in this article, but I will offer some suggestions regarding these example scenarios.

First and foremost, you as the tester must determine what is worth testing. These scenarios aptly specify behavior, and they will likely be very useful for collaborating with the Three Amigos, but not every scenario needs to be automated for testing. You as the tester must decide. You may decide that all four of these example scenarios are valuable and should be added to the automated test suite. That’s a fine decision. However, you may instead decide that certain user interface mechanics are not worth explicitly testing. That’s also a fine decision.

In my opinion, the first two scenarios could be candidates for the chopping block:

  1. Choose to upload files
  2. Choose to upload spreadsheets

Even though these are existing behaviors in the Example site, they are tiny. The tests simply verify that clicking makes certain links or buttons appear. It would be nice to verify them, but test execution time is finite, and user interface tests are notoriously slow compared to other tests. Consider the Rule of 1’s: typically, by orders of magnitude, a unit test takes about 1 millisecond, a service API test takes about 1 second, and a web UI test takes about 1 minute. Furthermore, these behaviors are implicitly exercised by the other scenarios, even if they don’t have explicit assertions.

One way to condense the scenarios could be like this:

Feature: Example site
 
 
Background:
 
Given the Example site is displayed
When the user clicks the "Upload Files" link
And the user clicks the "Spreadsheet Formats" link
And the user clicks the "XLSX" button
 
 
Scenario: Upload a spreadsheet file that is smaller than 1MB
 
When the user selects "500kb-sheet.xlsx" from the file upload dialog
Then the upload completes
And the table "Uploaded Files" contains a cell with "500kb-sheet.xlsx" 
 
 
Scenario: Upload a spreadsheet file that is larger than or equal to 1MB
 
When the user selects "1mb-sheet.xlsx" from the file upload dialog
Then the upload fails
And the table "Uploaded Files" does not contain a cell with "1mb-sheet.xlsx" 

This new feature file eliminates the first two scenarios and uses a Background section to cover the setup steps. It also eliminates the need for special Given steps in each scenario to set unique starting points. Implicitly, if the “Upload Files” or “Spreadsheet Formats” links fail to display the expected elements, then those steps would fail.

Again, this modification is not necessarily the “best” way or the “right” way to cover the desired behaviors, but it is a reasonably good way to do so. However, I would assert that both the 4-scenario feature file and the 2-scenario feature file are much better approaches than the original example scenario.

More Gherkin

What I showed in my answer to this Gherkin challenge is how I would handle UI-centric behaviors. I try to keep my Gherkin scenarios concise and focused on individual, independent behaviors. Try using these style techniques to rewrite the second half of Gojko’s original scenario. Feel free to drop your Gherkin in the comments below. I look forward to seeing how y’all write #GivenWhenThenWithStyle!

SpecFlow’s Online Gherkin Editor

Finding a good Gherkin editor is difficult. Some editors like Visual Studio Code and similar IDEs work great for engineers but aren’t suitable for product owners and non-programmer Amigos who want to contribute. Other editors like Notepad++ and Atom are lighter in weight but still require extensions and a little expertise. Fancy BDD tools like CucumberStudio and Cucumber for Jira provide Gherkin editors together with a bunch of other nifty features, but they require paid licenses.

For years, I’ve wanted a lightweight Gherkin editor that’s easy to use and accessible to all. Now, one finally exists: the Online Gherkin Editor by SpecFlow!

SpecFlow is the most popular BDD test automation framework for .NET. It’s also my favorite BDD framework. Over the past few years, I’ve built two large-scale test automation solutions with SpecFlow.

The Online Gherkin Editor by SpecFlow is just an editor on a web page. When you first load the page, the editor has example scenarios for you to reference. You can type your own Gherkin into the text area, and the editor highlights it for you. The editor provides line numbers and visual scrolling, too. My language is English, but if you happen to speak German, French, Spanish, or Dutch, then you can change the language setting via a dropdown. Once you’re done writing your Gherkin, you can clear it, copy it to the clipboard, or download it as a feature file using icons in the top-right corner. Be warned, though, that this editor won’t save your Gherkin in the cloud.

If you want to give this new editor a try, here’s the link: https://specflow.org/gherkin-editor/

You can also read SpecFlow’s official announcement here: https://specflow.org/blog/introducing-the-specflow-online-gherkin-editor/

Thanks, SpecFlow! Happy “Gherk-ing”!

What is BDD, and How Do We Practice It? (Webinar + Q&A)

On March 18, 2019, I gave a webinar entitled, “What is Behavior-Driven Development, and How Do We Practice It?” in collaboration with Paul Merrill and his company, Beaufort Fairmont. It was both a pleasure and an honor to do this webinar with them. Paul is a top-notch test automation expert, and Beaufort Fairmont is doing really exciting things. Check out their two-day BDD training offering, as well as their blog and other webinars.

To see my webinar recording, register here.

During the webinar, attendees asked more questions than we could answer. I’m excited that so many people asked questions. My answers are below.

Questions about Process

How is BDD different from TDD (Test-Driven Development)?

BDD is an evolution of TDD. In TDD, developers (1) write unit tests and watch them fail, (2) develop the feature to make the tests pass, (3) refactor the code to make it stronger, and (4) repeat the cycle. In BDD, teams do this same loop with feature tests (a.k.a. “acceptance” or “black-box” tests) as well as unit tests. Furthermore, BDD adds shift-left practices like Example Mapping and Specification by Example so that teams know what they are doing and focus on developing the right things.

Check out Dan North’s article, Introducing BDD, for a more thorough answer.

Can BDD be used with manual testing?

Yes! BDD is not merely an automation tool – it is a set of pragmatic practices to help teams develop better software. Gherkin scenarios are first and foremost behavior specs that help a team’s collaboration and accountability. They function secondarily as test cases that can be executed either manually or with automation.

Can we use BDD with technical stories or backend features?

Yes! If you can describe it, then you can do it.

How many Gherkin scenarios should one story have?

There’s no hard rule, but I recommend no more than a handful of rules per story, and no more than a handful of examples per rule. If you do Example Mapping and feel overwhelmed by the number of cards for a story, then the story should probably be broken into smaller stories.

Should we do Example Mapping for every story? Spending 20-30 minutes for each story would take a long time.

Try doing Example Mapping on one or two stories to start. The first time is always rough, but as you iterate on it, you’ll get better as a team. Even though Example Mapping has an upfront time cost, it will save a lot of time later in the sprint because (a) acceptance criteria are clear, (b) tests are already written, and (c) everyone has a mutual understanding of the story. The team won’t suffer through the inefficiencies of miscommunication and poor planning. You may even want to replace planning meetings with Example Mapping meetings.

What metrics should we use with BDD?

All metrics are flawed, but some metrics are useful. All the standard testing and Agile metrics still apply: code coverage, story velocity, etc. Here are some additional metrics you may consider for BDD:

  • the percentage of stories that undergo Example Mapping before the sprint
  • the number of rules and examples that get “missed” during Example Mapping and need to be added later
  • the percentage of Gherkin scenarios that get automated in the sprint

If you choose to track metrics, make sure their feedback is used to improve team practices. For more info on metrics, please read my Quality Metrics 101 series.

What were the resources you recommended at the end of the webinar?

Questions about Tools

What test management tools should we use with BDD?

I’m sure there are BDD plugins for test management tools, but I don’t have any that I can personally recommend. To be honest, I try to stay away from large test management tools like HP ALM, qTest, VersionOne. When doing BDD, the Gherkin feature files themselves should be the single source of truth for feature-level tests, and they should be version-controlled in a repository. Don’t fall into the trap of slapping “Given-When-Then” keywords onto existing functional tests – that’s not BDD.

Does Jira support Example Mapping?

I have not personally used any Jira plugin for Example Mapping. It looks like there is an Easy Agile User Story Maps plugin that is similar to but slightly different from Example Mapping.

Are there other good tools for BDD and Example Mapping?

What’s the difference between Gherkin, Cucumber, and SpecFlow?

  • Gherkin is the Given-When-Then spec language.
  • Cucumber is a company and its eponymous test framework that uses Gherkin.
  • SpecFlow is Cucumber for .NET.

Questions about Testing

Can BDD test frameworks be used for unit testing?

Yes, but I don’t recommend it. BDD frameworks shine for black-box feature testing. They’re a bit too verbose for code-level unit tests. Read BDD 101: Unit, Integration, and End-to-End Tests for more info.

Can BDD test frameworks be used for integration testing?

Yes! See BDD 101: Unit, Integration, and End-to-End Tests.

How long should Gherkin scenarios be?

Scenarios should be bite-sized. Each scenario should focus on one individual behavior. There’s no hard rule, but I recommend single-digit step counts. Read BDD 101: Writing Good Gherkin for more info.

What are “step definitions” in Cucumber?

Step definitions are the methods in the automation code that execute the steps. When a BDD framework runs a Gherkin scenario as a test, it “glues” each step to a step definition based on some sort of string matching.
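
For example, in SpecFlow (Cucumber for .NET), a step definition is just an attributed method whose pattern the framework matches against the step text. The step and names below are made up for illustration.

using TechTalk.SpecFlow;

[Binding]
public class CalculatorSteps
{
    private int _result;

    // The pattern in the [When] attribute is the "glue": when a scenario
    // contains a matching step, the framework calls this method and
    // converts the captured groups into the method's parameters.
    [When(@"the user adds (\d+) and (\d+)")]
    public void WhenTheUserAdds(int first, int second)
    {
        _result = first + second;
    }
}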

How can we minimize duplicate code within a BDD test framework?

Know your steps. Always search for existing steps before writing new steps. Refactor existing steps whenever appropriate. Reuse steps when writing new scenarios. Do pair programming or mob programming when writing scenarios. Put scenarios through code reviews. Apply good coding practices – remember, test automation is software.

I write Gherkin scenarios, but I don’t write test automation code. What’s the best way to write Gherkin scenarios so that they can be automated?

Do pair programming with the automation engineers to write Gherkin scenarios together. Become familiar with existing steps by reading and searching feature files. Otherwise, the Gherkin steps you write in isolation might not be usable. Remember, BDD is a team effort!

The examples in the webinar were all fairly basic. Do you have any examples with more complex systems?

I have some example projects on GitHub in Python and Java with some basic unit, integration, and end-to-end tests, but I don’t have any large-scale examples that I can share publicly.

We wrote hundreds of SpecFlow tests without the other Amigos. Now, there are large test gaps, and many steps aren’t reusable. What should we do?

I’m sorry to hear that. It’s not an uncommon story. There are two paths: (1) refactoring or (2) starting over. Without really knowing the situation, I don’t think it’s my place to say which way is better. Here are some questions to help guide your decision:

  • What are your goals for testing and automation?
  • What’s your overall quality and testing strategy?
  • What parts of the code base are salvageable?
  • What parts of the code base should be removed?
  • If you started again from scratch, what would you do differently to make sure the same problems don’t reoccur?

I strongly recommend taking the Setting a Foundation for Successful Test Automation course from Test Automation University. (It’s free.) I also gave a talk about this very problem, Egad! How Do We Start Writing (Better) Tests?, at a few Python conferences.

We have a large BDD test suite with heavy coupling and slow execution times. The business amigos have also left the company. Should we try to fix what we have or just start over?

Sorry to hear that; same answer as before.

Final Questions

Why do you call yourself the “Automation Panda”?

Pandas are awesome. Everybody loves them. And nobody forgets my moniker.

Where can I get team training in BDD?

Beaufort Fairmont provides a one- or two-day course in BDD and writing Gherkin. Sign up for more information here.

Behavior-Driven Blasphemy

This is my 100th post on Automation Panda! I’m thrilled to see how much this blog has grown and how many people it has helped. For such a monumental occasion, I have chosen to voice a rather controversial opinion about test automation.

Behavior-driven development seems to be the software testing buzzword of the decade. What started as a refinement of test-driven development by developers in Europe and the UK quickly became the big process fad of the 2010s. The Cucumber project (now 10 years old) developed or inspired Gherkin-based test automation frameworks in all the major programming languages. Companies started requiring Given-When-Then format for acceptance criteria and test scenarios. Three Amigos meetings became standard calendar fixtures during sprints. Organizations that once undertook “Agile transformations” now have similar initiatives for BDD. For better or worse, BDD exists and cannot be ignored.

The dogmatic benefits of BDD are better collaboration and automation. However, leaders frequently insist that Gherkin-style test frameworks add value only when paired with practices like Example Mapping. “BDD is a process, not a tool,” is a common mantra. “Otherwise, the Gherkin just gets in the way.” Although I wholeheartedly agree that behavior-driven practices add significant value to the development process, I nevertheless espouse a rather blasphemous opinion:

BDD test automation frameworks are better than traditional frameworks for black box functional testing even when BDD processes are not followed.

What Exactly Are You Saying?

My claim is that behavior-driven test frameworks like Cucumber, SpecFlow, and behave are significantly better than traditional xUnit-style frameworks for testing live features. For example, I would rather use SpecFlow than NUnit for testing a Web app with Selenium WebDriver, whether or not the other two Amigos are with me. The resulting automation code has better structure, readability, and reusability.

I’m not saying that teams shouldn’t do BDD practices, and I’m not saying that the Three Amigos should be separated. Collaboration is key to success, and BDD really helps. Example Mapping is one of the most useful practices a development team can do. I’m also not saying that BDD frameworks should be used for all testing purposes – they are poorly suited for unit testing and for performance testing.

Objection!

I find myself very lonely in this opinion. BDD leaders repeatedly insist that BDD is not about testing and automation.

The most outspoken BDDers (mostly coalescing around the Cucumber community) have largely moved their focus to the collaboration benefits, almost forsaking the automation benefits. (This may not necessarily be true, but it appears that way based on the literature and materials floating on the Web.) That outlook is somewhat disingenuous because the main tools supporting BDD are, in fact, test frameworks.

BDD also has outspoken opponents – it’s love or hate. I’ve personally spoken with several engineers who despise Gherkin-based frameworks. “I can see how it would be valuable when a whole team embraces behavior-driven practices,” many have told me, “but otherwise, the Gherkin layer just gets in the way of automation.” I’ve heard it called “plaster” and “garbage.” Engineers just want to code their tests. And code should always be readable, right?

Testing is an inherently opinionated space. People can never seem to agree on things.

The Bigger Picture

Test automation must be developed regardless of any specific development practices, and its architecture must stand firmly in its own right. Unfortunately, both sides miss the bigger picture:

The best solution for test automation is a domain-specific language.

A domain-specific language (DSL) is a programming language with a purpose. It is designed to handle very specific needs, rather than general-purpose programming. For example:

  • SQL is a DSL for database queries.
  • XPath is a DSL for finding elements in an XML document.
  • YAML is a DSL for object serialization.

Gherkin is also a DSL – for behavior specification.

Domain-specific languages naturally suit test automation due to the clear difference between test cases and test code. Test cases are procedures that exercise product behavior. Anyone can write a test case. They are dictated or explained in plain language. Test code, however, is the software implementation of test cases. Test code handles function calls, logging, exceptions, and all those other little programming details that help run tests. A test automation DSL separates those concerns: test cases are written in a special language, and the interpreter handles repetitive, low-level details. Some type of extension mechanism handles product-specific interactions. The purpose of a language is to effectively express intention – and the intention is to test the product.

To truly achieve an optimal solution, however, the DSL and its interpreter must be treated as part of the automation software, just like the test cases and extensions. Remember, a language’s interpreter is just another piece of software. The interpreter is part of the separation of concerns and the single responsibility principle. Concerns that would typically be handled by classes and functions in traditional test code should be moved to the interpreter. For example, the interpreter should automatically log every test case step, rather than forcing the author to write explicit logging statements.

When I worked at NetApp years ago, I implemented a DSL to test platform-level features of our operating system. I called it DS – short for “Design Steps” (a term from HP ALM), though not without an affinity for the Nintendo DS. NetApp’s entire test automation code was developed in Perl at the time, so I implemented the DS interpreter in Perl to reuse existing libraries. DS test cases were typically only a dozen lines long each, and DS expressions could call specially-written Perl modules directly for complete extendability. During the first big release using DS, my team saved countless hours of automation development time compared to the previous release while delivering more tests. I also did this before I had ever heard of BDD.

Unfortunately, most teams have neither the time to develop their own testing DSL nor the understanding of compiler theory to build it right. And when they are given such a language, they typically limit themselves to the provided implementation instead of taking ownership and extending the language for their needs.

The original Nintendo DS. Fun times!

Who Truly Misunderstands Gherkin?

Enter Gherkin: the world’s first major general-purpose, off-the-shelf language for test automation. It is general enough to cover any case through its plain language steps, yet specific enough to standardize tests. Users don’t need to be compiler theory experts – they just make up their own step names and provide the definition code to execute them. Early BDD projects like JBehave and Cucumber packaged an interpreter as a test framework and delivered it to a testing world still stuck on JUnit. The need for a testing DSL was there, whether or not the BDD folks meant to serve it.

Cucumber-ites frequently bemoan that their framework is misunderstood by the masses. They shudder to see teams using their framework purely for test automation. However, Cucumber effectively lowered the entry barrier for teams to make their own testing DSLs. Kodak did the same thing for film: they made it cheap and standard so anyone could be a photographer. Not everyone who uses a BDD framework misunderstands its purpose: some (like me) just see an alternative value proposition than what is preached by orthodox BDD practitioners. Gherkin fills a need that nobody knew they had, and its popularity validates that.

Benefits Apart from Process

Using a BDD framework adds much value to testing and development even without BDD processes. Below are just a handful of benefits:

  1. Focus first on good scenarios. Gherkin forces authors to think before they code.
  2. Faster automation development. Gherkin steps are reusable and parametrizable.
  3. Stronger structure. Engineers know where to put things in the framework.
  4. Test understandability. Anyone can read scenarios because they are written in plain language. Business people can help. New people can pick it up fast.
  5. Test sharing. Feature files can be shared apart from test code, which can be helpful for business partners.
  6. Test similarity. Tests all look the same. Team members can more easily help each other.
  7. Clearer failures. When a scenario fails, reports show exactly what step failed.
  8. Simpler bug reports. Use scenario steps as instructions to reproduce the failure.
  9. 2-phase test reviews. Review the Gherkin first and the test code second, to catch bad test cases before the wrong things get implemented.
  10. BDD enablement. Using a BDD framework opens the door for a team to embrace better behavioral practices in the future.

I have written about these advantages before.

Case Studies

I’m also not the only one who finds value in BDD test frameworks outside of the full BDD process. Below are five case studies.

radish

radish is a Python test framework inspired by Cucumber. Its DSL syntax is a superset of Gherkin that adds preconditions, loops, variables, and expressions. These language additions indicate a bias towards automation because they enable engineers to write tests more programmatically, albeit in a Gherkin-ese way.

Karate

Karate is a test framework with a full DSL based on Gherkin with steps specifically tailored to Web service calls. Although it is implemented in Java, testers do not need to do any Java programming to write complete test cases from day one. Peter Thomas, the creator of Karate, unabashedly declares that Karate does not truly adhere to BDD but nevertheless uses Cucumber for its automation benefits. (Note: Karate is working to move completely off of Cucumber. See GitHub issue #444.)

REST Assured

REST Assured is a Java package for testing REST APIs. Unlike Karate, REST Assured provides a fluent syntax (and not a DSL) for writing service calls directly in Java code. The fluent syntax is based on Gherkin: given() a request spec is created, when() the call is made, then() verify the response. Although REST Assured is not a full testing framework, it nevertheless pulls inspiration from BDD frameworks for order and structure.

Cycle

Cycle is a BDD-focused test automation platform from Cycle Labs for testing Web, terminal, and desktop apps. Cycle is unique because it provides out-of-the-box steps for all types of supported testing so that no programming experience is required. Testers write feature files using Cycle 2.0’s slick new Electron app. Scenarios are written in CycleScript, a Gherkin-ese language with additions like variables and sub-scenario calls. Steps tend to be imperative, but that’s the tradeoff for not requiring lower-level programming.

Hexawise

Hexawise is a combinatorial testing tool designed to maximize coverage with minimal test counts by smartly joining feature variations. It helps testers write better tests with less redundancy and fewer gaps. Although Hexawise has historically assisted manual testers, it also can generate Gherkin feature files for test variations.

Not all cucumbers are the same. Above is a sea cucumber.

Good Enough?

Gherkin-based test frameworks are not perfect, but they do provide good structure. They gained popularity outside of the pure BDD movement because they genuinely added value to testing and automation. Like any other tool, teams will use them in both good and bad ways. (Trust me, I’ve seen scary Gherkin.)

It’s interesting to see how groups outside the Cucumber diaspora are attempting to solve the limitations of pure Gherkin. Each case study above showed a unique path. Clearly, the test automation problem has not yet been completely solved, but current BDD frameworks are the best off-the-shelf solutions we have until a new software testing movement comes along.

5 Things I Love About SpecFlow

SpecFlow, a.k.a. “Cucumber for .NET,” is a leading BDD test automation framework for .NET. Created by Gáspár Nagy and maintained as a free, open source project on GitHub by TechTalk, SpecFlow presently has almost 3 million total NuGet downloads. I’ve used it myself at a few companies, and, I must say as an automationeer, it’s awesome! SpecFlow shares a lot in common with other Cucumber frameworks like Cucumber-JVM, but it is not a knockoff – it excels in many ways. Below are five features I love about SpecFlow.

#1: Declarative Specification by Example

SpecFlow is a behavior-driven test framework. Test cases are written as Given-When-Then scenarios in Gherkin “.feature” files. For example, imagine testing a cucumber basket:

Feature: Cucumber Basket
  As a gardener,
  I want to carry many cucumbers in a basket,
  So that I don’t drop them all.
  
  @cucumber-basket
  Scenario: Fill an empty basket with cucumbers
    Given the basket is empty
    When "10" cucumbers are added to the basket
    Then the basket is full

Notice a few things:

  • It is declarative in that steps indicate what should be done at a high level.
  • It is concise in that a full test case is only a few lines long.
  • It is meaningful in that the coverage and purpose of the test are intuitively obvious.
  • It is focused in that the scenario covers only one main behavior.

Gherkin makes it easy to specify behaviors by example. That way, everybody can understand what is happening. C# code will implement each step in lower layers. Even if your team doesn’t do the full-blown BDD process, using a BDD framework like SpecFlow is still great for test automation. Test code naturally abstracts into separate layers, and steps are reusable, too!
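
To make the layering concrete, here is a minimal sketch of C# step definitions that could back the scenario above. The CucumberBasket class, its capacity of 10, and the NUnit assertions are assumptions for illustration only, not code from any real product:

using NUnit.Framework;
using TechTalk.SpecFlow;

// Hypothetical domain class used only for this sketch.
public class CucumberBasket
{
    public const int Capacity = 10;
    public int Count { get; private set; }
    public bool IsFull => Count >= Capacity;
    public void Add(int count) => Count += count;
}

[Binding]
public class CucumberBasketSteps
{
    private readonly CucumberBasket basket = new CucumberBasket();

    [Given(@"the basket is empty")]
    public void GivenTheBasketIsEmpty()
    {
        Assert.That(basket.Count, Is.EqualTo(0));
    }

    [When(@"""(\d+)"" cucumbers are added to the basket")]
    public void WhenCucumbersAreAddedToTheBasket(int count)
    {
        basket.Add(count);
    }

    [Then(@"the basket is full")]
    public void ThenTheBasketIsFull()
    {
        Assert.That(basket.IsFull, Is.True);
    }
}

Because each step is bound by a regular expression, the same When step could be reused by any other scenario that adds cucumbers.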

#2: Context is King

Safely sharing data (e.g., “context”) between steps is a big challenge in BDD test frameworks. Using static variables is a simple yet terrible solution – any class can access them, but they create collisions for parallel test runs. SpecFlow provides much better patterns for sharing context.

Context injection is SpecFlow’s simple yet powerful mechanism for inversion of control (using BoDi). Any POCOs can be injected into any step definition class, either using default values or using a specific initialization, by declaring the POCO as a step def constructor argument. Those instances will also be shared instances, meaning steps across different classes can share the same objects! For example, steps for Web tests will all need a reference to the scenario’s one WebDriver instance. The context-injected objects are also created fresh for each scenario to protect test case independence.
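
As a rough illustration of that pattern, the sketch below injects a hypothetical BrowserContext POCO that holds the scenario’s WebDriver; the class names, step wording, and URL are invented for the example:

using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using TechTalk.SpecFlow;

// Hypothetical POCO that holds the scenario's one WebDriver instance.
public class BrowserContext
{
    public IWebDriver Driver { get; } = new ChromeDriver();
}

[Binding]
public class LoginSteps
{
    private readonly BrowserContext browser;

    // SpecFlow creates one BrowserContext per scenario and passes the same
    // instance to every step definition class that declares it.
    public LoginSteps(BrowserContext browser)
    {
        this.browser = browser;
    }

    [Given(@"the user is on the login page")]
    public void GivenTheUserIsOnTheLoginPage()
    {
        browser.Driver.Navigate().GoToUrl("https://example.com/login");
    }
}

Any other step definition class that declares a BrowserContext constructor parameter would receive the same instance for that scenario.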

Another powerful context mechanism is ScenarioContext. Every scenario has a unique context: title, tags, feature, and errors. Arbitrary objects can also be stored in the context object like a Dictionary, which is a simple way to pass data between steps without constructor-level context injection. Step definition classes can access the current scenario context using the static ScenarioContext.Current variable, but a better, thread-safe pattern is to make all step def classes extend the Steps class and simply reference the ScenarioContext instance variable.
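
Here is a minimal sketch of that Steps-based pattern; the step wording and the dictionary key are invented for the example:

using TechTalk.SpecFlow;

[Binding]
public class SearchSteps : Steps   // the Steps base class exposes a ScenarioContext property
{
    [When(@"the user searches for ""(.*)""")]
    public void WhenTheUserSearchesFor(string phrase)
    {
        // Stash arbitrary data in the scenario's context, dictionary-style.
        ScenarioContext["search phrase"] = phrase;
    }

    [Then(@"the search phrase is remembered")]
    public void ThenTheSearchPhraseIsRemembered()
    {
        // Read the value back in a later step of the same scenario.
        var phrase = (string)ScenarioContext["search phrase"];
        // ... verification against the product would go here ...
    }
}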

#3: Hooks for Any Occasion

Hooks are special methods that insert extra logic at critical points of execution. For example, WebDriver cleanup should happen after a Web test scenario completes, no matter the result. If the cleanup routine were put into a Then step, then it would not be executed if the scenario had a failure in a When step. Hooks are reminiscent of Aspect-Oriented Programming.

Most BDD frameworks have some sort of hooks, but SpecFlow stands out for its hook richness. Hooks can be applied before and after steps, scenario blocks, scenarios, features, and even around the whole test run. (Cucumber-JVM, by contrast, does not support global hooks.) Hooks can be selectively applied using tags, and they can be assigned an order if a project has multiple hooks of the same type. Hook methods will also be picked up from any step definition class. SpecFlow hooks are just awesome!
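
A small sketch of what such hooks might look like is below. It reuses the hypothetical BrowserContext from the context injection sketch, and the @web tag and ordering value are assumptions for illustration:

using TechTalk.SpecFlow;

[Binding]
public class WebDriverHooks
{
    private readonly BrowserContext browser;   // hypothetical POCO from the earlier sketch

    public WebDriverHooks(BrowserContext browser)
    {
        this.browser = browser;
    }

    // Runs after every scenario tagged @web, whether it passed or failed;
    // Order controls sequencing among hooks of the same type.
    [AfterScenario("web", Order = 1)]
    public void QuitBrowser()
    {
        browser.Driver.Quit();
    }

    // Test-run-level hooks must be static.
    [BeforeTestRun]
    public static void BeforeAllTests()
    {
        // global setup (e.g., reading configuration) would go here
    }
}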

#4: Thorough Outline Templating

Scenario Outlines are a standard part of Gherkin syntax. They’re very useful for templating scenarios with multiple input combinations. Consider the cucumber basket again:

Feature: Cucumber Basket
  
  Scenario Outline: Add cucumbers to the basket
    Given the basket has "<initial>" cucumbers
    When "<some>" cucumbers are added to the basket
    Then the basket has "<total>" cucumbers

    Examples: Counts
      | initial | some | total |
      | 1       | 2    | 3     |
      | 5       | 3    | 8     |

All BDD frameworks can parametrize step inputs (shown in double quotes). However, SpecFlow can also parametrize the non-input parts of a step!

Feature: Cucumber Basket
  
  Scenario Outline: Use the cucumber basket
    Given the basket has "<initial>" cucumbers
    When "<some>" cucumbers are <handled-with> the basket
    Then the basket has "<total>" cucumbers

    Examples: Counts
      | initial | some | handled-with | total |
      | 1       | 2    | added to     | 3     |
      | 5       | 3    | removed from | 2     |

The step definitions for the add and remove steps are separate. The step text for the action is parametrized, even though it is not a step input:

[When(@"""(\d+)"" cucumbers are added to the basket")]
public void WhenCucumbersAreAddedToTheBasket(int count) { /* */ }

[When(@"""(\d+)"" cucumbers are removed from the basket")]
public void WhenCucumbersAreRemovedFromTheBasket(int count) { /* */ }

That’s cool!

#5: Test Thread Affinity

SpecFlow can use any unit test runner (like MsTest, NUnit, and xUnit.net), but TechTalk provides the official SpecFlow+ Runner for a licensed fee. I’m not associated with TechTalk in any way, but the SpecFlow+ Runner is worth the cost for enterprise-level projects. It has a friendly command line, a profile file to customize execution, parallel execution, and nice integrations.

The major differentiator, in my opinion, is its test thread affinity feature. When running tests in parallel, the major challenge is avoiding collisions. Test thread affinity is a simple yet powerful way to control which tests run on which threads. For example, consider testing a website with user accounts. No two tests should use the same user at the same time, for fear of collision. Scenarios can be tagged for different users, and each thread can have the affinity to run scenarios for a unique user. Some sort of parallel isolation management like test thread affinity is absolutely necessary for test automation at scale. Given that the SpecFlow+ Runner can handle up to 64 threads (according to TechTalk), massive scale-up is possible.

But Wait, There’s More!

SpecFlow is an all-around great test automation framework, whether or not your team is doing full BDD. Feel free to add comments below about other features you love (or *gasp* hate) about SpecFlow!

 

Pipe Character Escape for Gherkin Tables

For the first time today, I had to write a Gherkin behavior scenario in which table text needed to use the pipe character “|”. I wanted a generic step that would find and click web page links by name, and one of the link names had the pipe in it! The first version of the step I wrote looked like this:

When the user follows the links:
  | link              |
  | Category          |
  | Sub-Category      |
  | Index|Description |

Naturally, this step didn’t parse – the “|” was parsed as a table delimiter instead of the intended link text. I could have rewritten the step to search for partial link text, or I could have done a key-value lookup, but I wanted to keep the step simple and direct.

The solution was simple: escape the pipe character “|” with a backslash character “\”. Easy! Thanks, StackOverflow! The updated table looks like this:

When the user follows the links:
 | link               |
 | Category           |
 | Sub-Category       |
 | Index\|Description |

“\|” works for both step tables and scenario outline example tables, and it appears to be fairly standard for test frameworks that use Gherkin. I verified that Cucumber-JVM and SpecFlow support it, Cucumber for Ruby appears to support it as well, and behave will support it in 1.2.6.

After learning this trick, I updated the BDD 101: The Gherkin Language page.

Note that backslash escape sequences won’t work for quotes in Gherkin steps. Quotes in steps are merely conventions and not part of the Gherkin language standard.

BDD 101: Manual Testing

Behavior-driven development takes an automation-first philosophy: behavior specs should become automated tests. However, BDD can also accommodate manual testing. Manual testing has a place and a purpose, even in BDD. Remember, behavior scenarios are first and foremost behavior specifications, and they provide value beyond testing and automation. Any behavior scenario could be run as a manual test. The main questions, then, are (1) when is manual testing appropriate and (2) how should it be handled.

(Check the Automation Panda BDD page for the full BDD 101 table of contents.)

When is Manual Testing Appropriate?

Automation is not a silver bullet – it doesn’t satisfy all testing needs. Scenarios should be written for all behaviors, but they likely shouldn’t be automated under the following circumstances:

  • The return-on-investment to automate the scenarios is too low.
  • The scenarios won’t be included in regression or continuous integration.
  • The behaviors are temporary (ex: hotfixes).
  • The automation itself would be too complex or too fragile.
  • The nature of the feature is non-functional (ex: performance, UX, etc.).
  • The team is still learning BDD and is not yet ready to automate all scenarios.

Manual testing is also appropriate for exploratory testing, in which engineers rely upon experience rather than explicit test procedures to “explore” the product under test for bugs and quality concerns. It complements automation because both testing styles serve different purposes. However, behavior scenarios themselves are incompatible with exploratory testing. The point of exploring is for engineers to go “unscripted” – without formal test plans – to find problems only a user would catch. Rather than writing scenarios, the appropriate way to approach behavior-driven exploratory testing is more holistic: testers should assume the role of a user and exercise the product under test as a collection of interacting behaviors. If exploring uncovers any glaring behavior gaps, then new behavior scenarios should be added to the catalog.

How Should Manual Testing Be Handled?

Manual testing fits into BDD in much the same way as automated testing because both formats share the same process for behavior specification. Where the two diverge is in how the tests are run. There are a few special considerations to make when writing scenarios that won’t be automated.

Repository

Both manual and automated behavior scenarios should be stored in the same repository. The natural way to organize behaviors is by feature, regardless of how the tests will be run. All scenarios should also be managed by some form of version control.

Furthermore, all scenarios should be co-located for document-generation tools like Pickles. Doc tools make it easy to expose behavior specs and steps to everyone, and they make it easier for the Three Amigos to collaborate, since non-technical people are not likely to dig into programming projects.

Tags

Scenarios must be classified as manual or automated. When BDD frameworks run tests, they need a way to exclude tests that are not automated. Otherwise, test reports would be full of errors! In Gherkin, scenarios should be classified using tags. For example, scenarios could be tagged as either “@manual” or “@automated”. A third tag, “@automatable”, could be used to distinguish scenarios that are not yet automated but are targeted for automation.

Some BDD frameworks have nifty features for tags. In Cucumber-JVM, tags can be set as runner class options for convenience. This means that tag options could be set to “~@manual” to avoid manual tests. In SpecFlow, any scenario with the special “@ignore” tag will automatically be skipped. Nevertheless, I strongly recommend using custom tags to denote manual tests, since there are many reasons why a test may be ignored (such as known bugs).

Extra Comments

The conciseness of behavior scenarios is problematic for manual testing because steps don’t provide all the information a tester may need. For example, test data may not be written explicitly in the spec. The best way to add extra information to a scenario is to add comments. Gherkin allows any number of lines for comments and description. Comments provide extra information to the reader but are ignored by the automation.

It may be tempting to simply write new Gherkin steps to handle the extra information for manual testing. However, this is not a good approach. Principles of good Gherkin should be used for all scenarios, regardless of whether or not the scenarios will be automated. High-quality specification should be maintained for consistency, for documentation tools, and for potential future automation.

An Example

Below is a feature that shows how to write behavior scenarios for manual tests:

Feature: Google Searching

  @automated
  Scenario: Search from the search bar
    Given a web browser is at the Google home page
    When the user enters "panda" into the search bar
    Then links related to "panda" are shown on the results page

  @manual
  Scenario: Image search
    # The Google home page URL is: http://www.google.com/
    # Make sure the images shown include pandas eating bamboo
    Given Google search results for "panda" are shown
    When the user clicks on the "Images" link at the top of the results page
    Then images related to "panda" are shown on the results page

They’re not really different from any other behavior scenarios.

 

As stated in the beginning, BDD should be automation-first. Don’t use the content of this article to justify avoiding automation. Rather, use the techniques outlined here for manual testing only as needed.

 

BDD 101: Frameworks

Every major programming language has a BDD automation framework. Some even have multiple choices. Building upon the structural basics from the previous post, this post provides a survey of the major frameworks available today. Since I cannot possibly cover every BDD framework in depth in this 101 series, my goal is to empower you, the reader, to pick the best framework for your needs. Each framework has support documentation online justifying its unique goodness and detailing how to use it, and I would prefer not to duplicate documentation. Use this post primarily as a reference. (Check the Automation Panda BDD page for the full table of contents.)

Major Frameworks

Most BDD frameworks are Cucumber versions, JBehave derivatives inspired by Dan North, or non-Gherkin spec runners. Some put behavior scenarios into separate files, while others put them directly into the source code.

C# and Microsoft .NET

SpecFlow, created by Gáspár Nagy, is arguably the most popular BDD framework for Microsoft .NET languages. Its tagline is “Cucumber for .NET” – and it is fully compliant with Gherkin. SpecFlow also has polished, well-designed hooks, context injection, and parallel execution (especially with test thread affinity). The basic package is free and open source, but SpecFlow also sells licenses for SpecFlow+ extensions. The free version requires a unit test runner like MsTest, NUnit, or xUnit.net in order to run scenarios. This makes SpecFlow flexible, but it can also feel jury-rigged and inelegant. The licensed version provides a slick, BDD-friendly runner named SpecFlow+ Runner and a Microsoft Excel integration tool named SpecFlow+ Excel. Microsoft Visual Studio has extensions for SpecFlow to make development easier.

There are plenty of other BDD frameworks for C# and .NET, too. xBehave.net is an alternative that pairs nicely with xUnit.net. A major difference of xBehave.net is that scenario steps are written directly in the code, instead of in separate text (feature) files. LightBDD bills itself as being more lightweight than other frameworks and basically does some tricks with partial classes to make the code more readable. NSpec is similar to RSpec and Mocha and uses lambda expressions heavily. Concordion offers some interesting ways to write specs, too. NBehave is a JBehave descendant, but the project appears to be dead without any updates since 2014.

Java and JVM Languages

The main Java rivalry is between Cucumber-JVM and JBehave. Cucumber-JVM is the official Cucumber version for Java and other JVM languages (Groovy, Scala, Clojure, etc.). It is fully compliant with Gherkin and generates beautiful reports. The Cucumber-JVM driver can be customized, as well. JBehave is one of the first and foremost BDD frameworks available. It was originally developed by Dan North, the “father of BDD.” However, JBehave is missing key Gherkin features like backgrounds, doc strings, and tags. It was also a pure-Java implementation before Cucumber-JVM existed. Both frameworks are widely used, have plugins for major IDEs, and distribute Maven packages. This popular but older article compares the two in slight favor of JBehave, but I think Cucumber-JVM is better, given its features and support.

The Automation Panda article Cucumber-JVM for Java is a thorough guide to the Cucumber-JVM framework.

Java also has a number of other BDD frameworks. JGiven uses a fluent API to spell out scenarios, and pretty HTML reports print the scenarios with the results. It is fairly clean and concise. Spock and JDave are spec frameworks, but JDave has been inactive for years. Scalatest for Scala also has spec-oriented features. Concordion also provides a Java implementation.

JavaScript

Almost all JavaScript BDD frameworks run on Node.js. Jasmine and Mocha are two of the most popular general-purpose JS test frameworks. They differ in that Jasmine has many features included (like assertions and spies) that Mocha does not. This makes Jasmine easier to get started (good for beginners) but makes Mocha more customizable (good for power users). Both claim to be behavior-driven because they structure tests using “describe” and “it-should” phrases in the code, but they do not have the advantage of separate, reusable steps like Gherkin. Personally, I consider Jasmine and Mocha to be behavior-inspired but not fully behavior-driven.

Other BDD frameworks are more true to form. Cucumber provides Cucumber.js for Gherkin-compliant happiness. Yadda is Gherkin-like but with a more flexible syntax. Vows provides a different way to approach behavior using more formalized phrase partitions for a unique form of reusability. The Cucumber blog argues that Cucumber.js is best due to its focus on good communication through plain language steps, whereas other JavaScript BDD frameworks are more code-y. (Keep in mind, though, that Cucumber would naturally boast of its own framework.) Other comparisons are posted here, here, here, and here.

PHP

The two major BDD frameworks for PHP are Behat and Codeception. Behat is the official Cucumber version for PHP, and as such is seen as the more “pure” BDD framework. Codeception is more programmer-focused and can handle other styles of testing. There are plenty of articles comparing the two – here, here, and here (although the last one seems out of date). Both seem like good choices, but Codeception seems more flexible.

Python

Python has a plethora of test frameworks, and many are BDD. behave and lettuce are probably the two most popular players. Feature comparison is analogous to Cucumber-JVM versus JBehave, respectively: behave is practically Gherkin compliant, while lettuce lacks a few language elements. Both have plugins for major IDEs. pytest-bdd is on the rise because it integrates with all the wonderful features of pytest. radish is another framework that extends the Gherkin language to include scenario loops, scenario preconditions, and variables. All these frameworks put scenarios into separate feature files. They all also implement step definitions as functions instead of classes, which not only makes steps feel simpler and more independent, but also avoids unnecessary object construction.

Other Python frameworks exist as well. pyspecs is a spec-oriented framework. Freshen was a BDD plugin for Nose, but both Freshen and Nose are discontinued projects.

Ruby

Cucumber, the gold standard for BDD frameworks, was first implemented in Ruby. Cucumber maintains the official Gherkin language standard, and all Cucumber versions are inspired by the original Ruby version. Spinach bills itself as an enhancement to Cucumber by encapsulating steps better. RSpec is a spec-oriented framework that does not use Gherkin.

Which One is Best?

There is no right answer – the best BDD framework is the one that best fits your needs. However, there are a few points to consider when weighing your options:

  • What programming language should you use for test automation?
  • Is it a popular framework that many others use?
  • Is the framework actively supported?
  • Is the spec language compliant with Gherkin?
  • What type of testing will you do with the framework?
  • What are the limitations as compared to other frameworks?

Frameworks that separate scenario text from implementation code are best for shift-left testing. Frameworks that put scenario text directly into the source code are better for white box testing, but they may look confusing to less experienced programmers.

Personally, my favorites are SpecFlow and pytest-bdd. At LexisNexis, I used SpecFlow and Cucumber-JVM. For Python, I used behave at MaxPoint, but I have since fallen in love with pytest-bdd since it piggybacks on the wonderfulness of pytest. (I can’t wait for this open ticket to add pytest-bdd support in PyCharm.) For skill transferability, I recommend Gherkin compliance, as well.

Reference Table

The table below categorizes BDD frameworks by language and type for quick reference. It also includes frameworks in languages not described above. Recommended frameworks are denoted with an asterisk (*). Inactive projects are denoted with an X (x).

| Language            | Framework           | Type                   |
| ------------------- | ------------------- | ---------------------- |
| C                   | Catch               | In-line Spec           |
| C++                 | Igloo               | In-line Spec           |
| C# and .NET         | Concordion          | In-line Spec           |
| C# and .NET         | LightBDD            | In-line Gherkin        |
| C# and .NET         | NBehave x           | Separated semi-Gherkin |
| C# and .NET         | NSpec               | In-line Spec           |
| C# and .NET         | SpecFlow *          | Separated Gherkin      |
| C# and .NET         | xBehave.net         | In-line Gherkin        |
| Golang              | Ginkgo              | In-line Spec           |
| Java and JVM        | Cucumber-JVM *      | Separated Gherkin      |
| Java and JVM        | JBehave             | Separated semi-Gherkin |
| Java and JVM        | JDave x             | In-line Spec           |
| Java and JVM        | JGiven *            | In-line Gherkin        |
| Java and JVM        | Scalatest           | In-line Spec           |
| Java and JVM        | Spock               | In-line Spec           |
| JavaScript          | Cucumber.js *       | Separated Gherkin      |
| JavaScript          | Yadda               | Separated semi-Gherkin |
| JavaScript          | Jasmine             | In-line Spec           |
| JavaScript          | Mocha               | In-line Spec           |
| JavaScript          | Vows                | In-line Spec           |
| Perl                | Test::BDD::Cucumber | Separated Gherkin      |
| PHP                 | Behat               | Separated Gherkin      |
| PHP                 | Codeception *       | Separated or In-line   |
| Python              | behave *            | Separated Gherkin      |
| Python              | freshen x           | Separated Gherkin      |
| Python              | lettuce             | Separated semi-Gherkin |
| Python              | pyspecs             | In-line Spec           |
| Python              | pytest-bdd *        | Separated semi-Gherkin |
| Python              | radish              | Separated Gherkin-plus |
| Ruby                | Cucumber *          | Separated Gherkin      |
| Ruby                | RSpec               | In-line Spec           |
| Ruby                | Spinach             | Separated Gherkin      |
| Swift / Objective C | Quick               | In-line Spec           |

 

[4/22/2018] Update: I updated info for C# and Python frameworks.