
Behavior-Driven Blasphemy

This is my 100th post on Automation Panda! I’m thrilled to see how much this blog has grown and how many people it has helped. For such a monumental occasion, I have chosen to voice a rather controversial opinion about test automation.

Behavior-driven development seems to be the software testing buzzword of the decade. What started as a refinement of test-driven development by developers in Europe and the UK quickly became the big process fad of the 2010s. The Cucumber project (now 10 years old) developed or inspired Gherkin-based test automation frameworks in all the major programming languages. Companies started requiring Given-When-Then format for acceptance criteria and test scenarios. Three Amigos meetings became standard calendar fixtures during sprints. Organizations that once undertook “Agile transformations” now have similar initiatives for BDD. For better or worse, BDD exists and cannot be ignored.

The dogmatic benefits of BDD are better collaboration and automation. However, leaders frequently insist that Gherkin-style test frameworks add value only when paired with practices like Example Mapping. “BDD is a process, not a tool,” is a common mantra. “Otherwise, the Gherkin just gets in the way.” Although I wholeheartedly agree that behavior-driven practices add significant value to the development process, I nevertheless espouse a rather blasphemous opinion:

BDD test automation frameworks are better than traditional frameworks for black box functional testing even when BDD processes are not followed.

What Exactly Are You Saying?

My claim is that behavior-driven test frameworks like Cucumber, SpecFlow, and behave are significantly better than traditional xUnit-style frameworks for testing live features. For example, I would rather use SpecFlow than NUnit for testing a Web app with Selenium WebDriver, whether or not the other two Amigos are with me. The resulting automation code has better structure, readability, and reusability.

I’m not saying that teams shouldn’t do BDD practices, and I’m not saying that the Three Amigos should be separated. Collaboration is key to success, and BDD really helps. Example Mapping is one of the most useful practices a development team can do. I’m also not saying that BDD frameworks should be used for all testing purposes – they are poorly suited for unit testing and for performance testing.

Objection!

I find myself very lonely in this opinion. BDD leaders repeatedly insist that BDD is not about testing and automation:

BDD is not about Testing (slides by Dan North)
The world’s most misunderstood collaboration tool by Aslak Hellesøy
BDD is – BDD is not by Augusto Evangelisti
BDD Tool Cucumber is Not a Testing Tool by Jan Stenberg
3 misconceptions about BDD from ThoughtWorks

The most outspoken BDDers (mostly coalescing around the Cucumber community) have largely moved their focus to the collaboration benefits, almost forsaking the automation benefits. (This may not necessarily be true, but it appears that way based on the literature and materials floating on the Web.) That outlook is somewhat disingenuous because the main tools supporting BDD are, in fact, test frameworks.

BDD also has outspoken opponents – it’s love or hate. I’ve personally spoken with several engineers who despise Gherkin-based frameworks. “I can see how it would be valuable when a whole team embraces behavior-driven practices,” many have told me, “but otherwise, the Gherkin layer just gets in the way of automation.” I’ve heard it called “plaster” and “garbage.” Engineers just want to code their tests. And code should always be readable, right?

Testing is an inherently opinionated space. People can never seem to agree on things.

The Bigger Picture

Test automation must be developed regardless of any specific development practices, and its architecture must stand firmly in its own right. Unfortunately, both sides miss the bigger picture:

The best solution for test automation is a domain-specific language.

A domain-specific language (DSL) is a programming language with a purpose. It is designed to handle very specific needs, rather than general-purpose programming. For example:

SQL is a DSL for database queries.
XPath is a DSL for finding elements in an XML document.
YAML is a DSL for object serialization.

Gherkin is also a DSL – for behavior specification.

Domain-specific languages naturally suit test automation due to the clear difference between test cases and test code. Test cases are procedures that exercise product behavior. Anyone can write a test case. They are dictated or explained in plain language. Test code, however, is the software implementation of test cases. Test code handles function calls, logging, exceptions, and all those other little programming details that help run tests. A test automation DSL separates those concerns: test cases are written in a special language, and the interpreter handles repetitive, low-level details. Some type of extensions can handle product-specific interactions. The purpose of a language is to effectively express intention – and the intention is to test the product.

To truly achieve an optimal solution, however, the DSL and its interpreter must be treated as part of the automation software, just like the test cases and extensions. Remember, a language’s interpreter is just another piece of software. The interpreter is part of the separation of concerns and the single responsibility principle. Concerns that would typically be handled by classes and functions in traditional test code should be moved to the interpreter. For example, the interpreter should automatically log every test case step, rather than forcing the author to write explicit logging statements.
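To make that separation concrete, here is a minimal sketch of a line-based testing DSL in Python. It is only an illustration under stated assumptions: the step decorator, the run_case function, and the shopping-cart steps are all hypothetical names invented for this example, not part of any real framework. The point is that the interpreter logs every step by itself, while the step definitions contain only product-specific logic.

```python
import re

STEPS = []  # registry of (compiled pattern, function) pairs


def step(pattern):
    """Register a function as the definition for lines matching `pattern`."""
    def decorator(func):
        STEPS.append((re.compile(pattern), func))
        return func
    return decorator


def run_case(text, context, log=print):
    """Interpret one test case: match each non-blank line to a step and run it."""
    for line in filter(None, (raw.strip() for raw in text.splitlines())):
        for pattern, func in STEPS:
            match = pattern.fullmatch(line)
            if match:
                log(f"STEP: {line}")  # logged by the interpreter, not the author
                func(context, *match.groups())
                break
        else:
            raise LookupError(f"No step definition matches: {line!r}")


# Product-specific extensions (a hypothetical shopping cart).
@step(r"the cart is empty")
def cart_is_empty(context):
    context["cart"] = []


@step(r"the user adds (\d+) (\w+)")
def add_items(context, count, item):
    context["cart"].extend([item] * int(count))


@step(r"the cart holds (\d+) items?")
def cart_holds(context, count):
    assert len(context["cart"]) == int(count)


# The test case itself: plain language, no logging or plumbing.
CASE = """
    the cart is empty
    the user adds 2 apples
    the user adds 1 pear
    the cart holds 3 items
"""

if __name__ == "__main__":
    run_case(CASE, {})
```

Running the module prints one "STEP: ..." line per step without the test case author writing any logging call. A real interpreter would add reporting, error recovery, and hooks, but the division of labor would be the same.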

When I worked at NetApp years ago, I implemented a DSL to test platform-level features of our operating system. I called it DS – short for “Design Steps” (a term from HP ALM, though not without an affinity for the Nintendo DS). NetApp’s entire test automation code was developed in Perl at the time, so I implemented the DS interpreter in Perl to reuse existing libraries. DS test cases were typically only a dozen lines long each, and DS expressions could call specially-written Perl modules directly for complete extensibility. During the first big release using DS, my team saved countless hours of automation development time as compared to the previous release while delivering a higher number of tests. I also did this before I had ever heard of BDD.

Unfortunately, most teams have neither the time to develop their own testing DSL nor the understanding of compiler theory to build it right. And when they are given such a language, they typically limit themselves to the provided implementation instead of taking ownership and extending the language for their needs.

The original Nintendo DS. Fun times!

Who Truly Misunderstands Gherkin?

Enter Gherkin: the world’s first major general-purpose, off-the-shelf language for test automation. It is general enough to cover any case through its plain language steps, yet specific enough to standardize tests. Users don’t need to be compiler theory experts – they just make up their own step names and provide the definition code to execute them. Early BDD projects like JBehave and Cucumber packaged an interpreter as a test framework and delivered it to a testing world still stuck on JUnit. The need for a testing DSL was there, whether or not the BDD folks meant to serve it.

Cucumber-ites frequently bemoan that their framework is misunderstood by the masses. They shudder to see teams using their framework purely for test automation. However, Cucumber effectively lowered the entry barrier for teams to make their own testing DSLs. Kodak did the same thing for film: they made it cheap and standard so anyone could be a photographer. Not everyone who uses a BDD framework misunderstands its purpose: some (like me) just see a different value proposition from the one preached by orthodox BDD practitioners. Gherkin fills a need that nobody knew existed. Its popularity validates that claim.

Benefits Apart from Process

Using a BDD framework adds much value to testing and development even without BDD processes. Below are just a handful of benefits:

Focus first on good scenarios. Gherkin forces authors to think before they code.
Faster automation development. Gherkin steps are reusable and parametrizable.
Stronger structure. Engineers know where to put things in the framework.
Test understandability. Anyone can read scenarios because they are written in plain language. Business people can help. New people can pick it up fast.
Test sharing. Feature files can be shared apart from test code, which can be helpful for business partners.
Test similarity. Tests all look the same. Team members can more easily help each other.
Clearer failures. When a scenario fails, reports show exactly what step failed.
Simpler bug reports. Use scenario steps as instructions to reproduce the failure.
2-phase test reviews. Review Gherkin first and then test code second to make sure the test cases are good before implementing the wrong things.
BDD enablement. Using a BDD framework opens the door for a team to embrace better behavioral practices in the future.

I wrote about these advantages before:

Case Studies

I’m also not the only one who finds value in BDD test frameworks outside of the full BDD process. Below are five case studies.

radish

radish is a Python test framework inspired by Cucumber. Its DSL syntax is a superset of Gherkin that adds preconditions, loops, variables, and expressions. These language additions indicate a bias towards automation because they enable engineers to write tests more programmatically, albeit in a Gherkin-ese way.

Karate

Karate is a test framework with a full DSL based on Gherkin with steps specifically tailored to Web service calls. Although it is implemented in Java, testers do not need to do any Java programming to write complete test cases from day one. Peter Thomas, the creator of Karate, unabashedly declares that Karate does not truly adhere to BDD but nevertheless uses Cucumber for its automation benefits. (Note: Karate is working to move completely off of Cucumber. See GitHub issue #444.)

REST Assured

REST Assured is a Java package for testing REST APIs. Unlike Karate, REST Assured provides a fluent syntax (and not a DSL) for writing service calls directly in Java code. The fluent syntax is based on Gherkin: given() a request spec is created, when() the call is made, then() verify the response. Although REST Assured is not a full testing framework, it nevertheless pulls inspiration from BDD frameworks for order and structure.

Cycle

Cycle is a BDD-focused test automation platform from Cycle Labs for testing Web, terminal, and desktop apps. Cycle is unique because it provides out-of-the-box steps for all types of supported testing so that no programming experience is required. Testers write feature files using Cycle 2.0’s slick new Electron app. Scenarios are written in CycleScript, a Gherkin-ese language with additions like variables and sub-scenario calls. Steps tend to be imperative, but that’s the tradeoff for not requiring lower-level programming.

Hexawise

Hexawise is a combinatorial testing tool designed to maximize coverage with minimal test counts by smartly joining feature variations. It helps testers write better tests with less redundancy and fewer gaps. Although Hexawise has historically assisted manual testers, it also can generate Gherkin feature files for test variations.

Not all cucumbers are the same. Above is a sea cucumber.

Good Enough?

Gherkin-based test frameworks are not perfect, but they do provide good structure. They gained popularity outside of the pure BDD movement because they genuinely added value to testing and automation. Like any other tool, teams will use them in both good and bad ways. (Trust me, I’ve seen scary Gherkin.)

It’s interesting to see how groups outside the Cucumber diaspora are attempting to solve the limitations of pure Gherkin. Each case study above showed a unique path. Clearly, the test automation problem has not yet been completely solved, but current BDD frameworks are the best off-the-shelf solutions we have until a new software testing movement comes along.

How Do I Know My Tests Add Value?

Software testing is a huge effort, especially for automation. Teams can spend a lot of time, money, and resources on testing (or not). People literally make careers out of it. That investment ought to be worthwhile – we shouldn’t test for the sake of testing.

So, therein lies the million-dollar question: How do we know that our tests add meaningful value?

Or, more bluntly: How do we know that testing isn’t a waste of time?

That’s easy: bugs!

The stock answer goes something like this: We know tests add value when they find bugs! So, let’s track the number of bugs we find.

That answer is wrong, despite its good intentions. Bug count is a terrible metric for judging the value of tests.

What do you mean bug counts aren’t good?

I know that sounds blasphemous. Let’s unpack it. Finding bugs is a good thing, and tests certainly should find bugs in the features they cover. But, the premise that the value of testing lies exclusively in the bugs found is wrong. Here’s why:

The main value of testing is fast feedback. Testing serves two purposes: (1) validating goodness and (2) identifying badness. Passing tests are validated goodness. Failing tests, meaning uncovered bugs, are identified badness. Both types of feedback add value to the development process. Developers can proceed confidently with code changes when trustworthy tests are passing, and management can assess lower risk. Unfortunately, bug counts don’t measure that type of goodness.
Good testing might actually reduce bug count. Testing means accountability for development. Developers must think more carefully about design. They can also run tests locally before committing changes. They could even do Test-Driven Development. Better practices could prevent many bugs from ever happening.
Tracking bug count can drive bad behavior. When a high bug discovery rate looks good (or, worse, has quotas), testers will strive to post numbers. If they don’t find critical bugs, they will open bug reports for nitpicks and trivialities. The extra effort they spend to report inconsequential problems may not be of value to the business – wasting their time and the developers’ time all for the sake of metrics.
Bugs are usually rare. Unless a team is dysfunctional, the product usually works as expected. Hundreds of test runs may not yield a single bug. That’s a wonderful thing if the tests have good coverage. Those tests still add value. Saying they don’t belittles the whole testing effort.

Then what metrics should we use?

Bugs happen arbitrarily, and unlimited testing is not possible. Metrics should focus on the return-on-investment for testing efforts. Here are a few:

Time-to-bug-discovery. Rather than track bug counts, track the time until each bug is discovered. This metric genuinely measures the feedback loop for test results. Make sure to track the severity of each bug, too. For example, if high-severity bugs are not caught until production, then the tests don’t have enough coverage. Teams should strive for the shortest time possible – fast feedback means lower development costs. This metric also encourages teams to follow the Testing Pyramid.
Coverage. Coverage is the degree to which tests exercise product behavior. Higher coverage means more feedback and greater chances of identifying badness. Most unit test frameworks can use code coverage tools to verify paths through code. Feature coverage requires extra process or instrumentation. Tests should avoid duplicate coverage, too.
Test failure proportions. Tests fail for a variety of reasons. Ideally, tests should fail only when they discover bugs. However, tests may also fail for other reasons: unexpected feature changes, environment instability, or even test automation bugs. Non-bug failures disrupt the feedback loop: they force a team to fix testing problems rather than product problems, and they might cause engineers to devalue the whole testing effort. Tracking failure proportions will reveal what problems inhibit tests from delivering their top value.
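As a rough illustration of how two of these metrics could be computed, here is a minimal Python sketch. The function names, the failure categories, and the sample data are all hypothetical; a real team would pull this information from its own bug tracker and test reports.

```python
from collections import Counter
from datetime import datetime


def hours_to_discovery(introduced, discovered):
    """Time-to-bug-discovery: hours between a bug's introduction and its detection."""
    return (discovered - introduced).total_seconds() / 3600


def failure_proportions(failure_causes):
    """Test failure proportions: share of failures attributed to each cause."""
    counts = Counter(failure_causes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cause: count / total for cause, count in counts.items()}


# Hypothetical triage results from one week of test runs.
causes = ["product bug", "environment", "test bug", "environment"]

if __name__ == "__main__":
    print(failure_proportions(causes))
    print(hours_to_discovery(datetime(2020, 1, 1, 9), datetime(2020, 1, 1, 15)))
```

In this made-up sample, only a quarter of the failures come from real product bugs, which would suggest that environment stability deserves attention before more tests are written.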

More resources

EGAD! How Do We Start Writing (Better) Tests?

Some have never automated tests and can’t check themselves before they wreck themselves. Others have 1000s of tests that are flaky, duplicative, and slow. Wa-do-we-do? Well, I gave a talk about this problem at a few Python conferences. The language used for example code was Python, but the principles apply to any language.

Here’s the PyTexas 2019 talk:

And here’s the PyGotham 2018 talk:

And here’s the first time I gave this talk, at PyOhio 2018:

I also gave this talk at PyCaribbean 2019 and PyTennessee 2020 (as an impromptu talk), but those sessions were not recorded.

The Testing Pyramid

The “Testing Pyramid” is an industry-standard guideline for functional test case development. Love it or hate it, the Pyramid has endured since the mid-2000s because it continues to be practical. So, what is it, and how can it help us write better tests?

Layers

The Testing Pyramid has three classic layers:

Unit tests are at the bottom. Unit tests directly interact with product code, meaning they are “white box.” Typically, they exercise functions, methods, and classes. Unit tests should be short, sweet, and focused on one thing/variation. They should not have any external dependencies – mocks/monkey-patching should be used instead.
Integration tests are in the middle. Integration tests cover the point where two different things meet. They should be “black box” in that they interact with live instances of the product under test, not code. Service call tests (REST, SOAP, etc.) are examples of integration tests.
End-to-end tests are at the top. End-to-end tests cover a path through a system. They could arguably be defined as a multi-step integration test, and they should also be “black box.” Typically, they interact with the product like a real user. Web UI tests are examples of end-to-end tests because they need the full stack beneath them.

All layers are functional tests because they verify that the product works correctly.
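To illustrate the bottom layer, here is a minimal sketch of a unit test that replaces an external dependency with a mock. The fetch_price function, its client argument, and the "/prices/..." path are hypothetical names invented for this example.

```python
from unittest import mock


def fetch_price(client, sku):
    """Product code: look up a price through an external service client."""
    response = client.get(f"/prices/{sku}")
    return round(response["amount"] * (1 + response["tax_rate"]), 2)


def test_fetch_price_adds_tax():
    # The mock stands in for the real service, so no network call is made.
    client = mock.Mock()
    client.get.return_value = {"amount": 10.00, "tax_rate": 0.07}

    assert fetch_price(client, "abc-123") == 10.70
    client.get.assert_called_once_with("/prices/abc-123")
```

An integration test for the same behavior would instead pass a real client pointed at a live instance of the pricing service, which is what pushes it up a layer in the Pyramid.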

Proportions

The Testing Pyramid is triangular for a reason: there should be more tests at the bottom and fewer tests at the top. Why?

Distance from code. Ideally, tests should catch bugs as close to the root cause as possible. Unit tests are the first line of defense. Simple issues like formatting errors, calculation blunders, and null pointers are easy to identify with unit tests but much harder to identify with integration and end-to-end tests.
Execution time. Unit tests are very quick, but end-to-end tests are very slow. Consider the Rule of 1’s for Web apps: a unit test takes ~1 millisecond, a service test takes ~1 second, and a Web UI test takes ~1 minute. If test suites have hundreds to thousands of tests at the upper layers of the Testing Pyramid, then they could take hours to run. An hours-long turnaround time is unacceptable for continuous integration.
Development cost. Tests near the top of the Testing Pyramid are more challenging to write than ones near the bottom because they cover more stuff. They’re longer. They need more tools and packages (like Selenium WebDriver). They have more dependencies.
Reliability. Black box tests are susceptible to race conditions and environmental failures, making them inherently more fragile. Recovery mechanisms take extra engineering.

The total cost of ownership increases when climbing the Testing Pyramid. When deciding the level at which to automate a test (and whether to automate it at all), taking a risk-based strategy to push tests down the Pyramid is better than writing all tests at the top. Each proportionate layer mitigates risk at its optimal return-on-investment.

Practice

The Testing Pyramid should be a guideline, not a hard rule. Don’t require hard proportions for test counts at each layer. Why not? Arbitrary metrics cause bad practices: a team might skip valuable end-to-end tests or write needless unit tests just to hit numbers. W. Edwards Deming would shudder!

Instead, use loose proportions to foster better retrospectives. Are we covering too many input combos through the Web UI when they could be checked via service tests? Are there unit test coverage gaps? Do we have a pyramid, a diamond, a funnel, a cupcake, or some other wonky shape? Each layer’s test count should be roughly an order of magnitude smaller than the layer beneath it. Large Web apps often have 10K unit tests, 1K service tests, and a few hundred Web UI tests.
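The Rule of 1’s from the Proportions section turns these shapes into rough run times. The sketch below is a back-of-the-envelope calculation, not a benchmark: it assumes tests run serially, takes "a few hundred" Web UI tests to mean 300, and uses hypothetical names throughout.

```python
# Rule of 1's: approximate seconds per test at each layer.
SECONDS_PER_TEST = {"unit": 0.001, "service": 1.0, "web_ui": 60.0}


def suite_minutes(counts):
    """Estimate serial run time in minutes for a dict of layer -> test count."""
    total_seconds = sum(SECONDS_PER_TEST[layer] * n for layer, n in counts.items())
    return total_seconds / 60


# A pyramid-shaped suite versus the same counts flipped upside down.
pyramid = {"unit": 10_000, "service": 1_000, "web_ui": 300}
inverted = {"unit": 300, "service": 1_000, "web_ui": 10_000}

if __name__ == "__main__":
    print(f"pyramid:  {suite_minutes(pyramid) / 60:.1f} hours")
    print(f"inverted: {suite_minutes(inverted) / 60:.1f} hours")
```

Under these assumptions the pyramid-shaped suite takes a little over five hours to run serially, and the 300 Web UI tests account for about 95% of that time. The inverted suite takes well over a hundred hours, which is why pushing coverage down the Pyramid (and running the upper layers in parallel) matters so much.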

Resources

Check out these other great articles on the Testing Pyramid:

TestPyramid by Martin Fowler
The Practical Test Pyramid by Ham Vocke
Test Pyramid: the key to good automated test strategy by Tim Cochran
Just Say No to More End-to-End Tests by Mike Wacker

Why Python is Great for Test Automation

Python is an incredible programming language. As Dan Callahan said in his PyCon 2018 keynote, “Python is the second best language for anything, and that’s an amazing aspiration.” For test automation, however, I believe it is one of the best choices. Here are ten reasons why:

#1: The Zen of Python

The Zen of Python, as codified in PEP 20, is an ideal guideline for test automation. Test code should be a natural bridge between plain-language test steps and the programming calls to automate them. Tests should be readable and descriptive because they describe the features under test. They should be explicit in what they cover. Simple steps are better than complicated ones. Test code should add minimal extra verbiage to the tests themselves. Python, in its concise elegance, is a powerful bridge from test case to test code.

(Want a shortcut to the Zen of Python? Run “import this” at the Python interpreter.)

#2: pytest

pytest is one of the best test frameworks currently available in any language, not just for Python. It can handle any functional tests: unit, integration, and end-to-end. Test cases are written simply as functions (meaning no side effects as long as globals are avoided) and can take parametrized inputs. Fixtures are a generic, reusable way to handle setup and cleanup operations. Basic “assert” statements have automatic introspection so failure messages print meaningful values. Tests can be filtered when executed. Plugins extend pytest to do code coverage, run tests in parallel, use Gherkin scenarios, and integrate with other frameworks like Django and Flask. Other Python test frameworks are great, but pytest is by far the best-in-show. (Pythonic frameworks always win in Python.)
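For readers who have not seen pytest, here is a minimal sketch of the features just described: test functions, parametrized inputs, a fixture with cleanup, and plain assert statements. The add function and the cart fixture are hypothetical examples, and the sketch assumes pytest is installed.

```python
import pytest


def add(a, b):
    """Trivial function under test."""
    return a + b


@pytest.fixture
def cart():
    """Setup before the yield, cleanup after it."""
    items = []
    yield items
    items.clear()


@pytest.mark.parametrize("a, b, expected", [(1, 2, 3), (0, 0, 0), (-1, 1, 0)])
def test_add(a, b, expected):
    # One test function, three test cases.
    assert add(a, b) == expected


def test_cart_accepts_items(cart):
    # pytest injects the fixture by matching the argument name.
    cart.append("apple")
    assert cart == ["apple"]
```

Saved as a file such as test_example.py, this can be run with the "pytest" command, and "pytest -k cart" would filter the run down to the cart test.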

#3: Packages

For all the woes about the CheeseShop, Python has a rich library of useful packages for testing: pytest, unittest, doctest, tox, logging, paramiko, requests, Selenium WebDriver, Splinter, Hypothesis, and others are available as off-the-shelf ingredients for custom automation recipes. They’re just a “pip install” away. No reinventing wheels here!

#4: Multi-Paradigm

Python is object-oriented and functional. It lets programmers decide if functions or classes are better for the needs at hand. This is a major boon for test automation because (a) stateless functions avoid side effects and (b) simple syntax for those functions makes them readable. pytest itself uses functions for test cases instead of shoehorning them into classes (à la JUnit).

#5: Typing Your Way

Python’s out-of-the-box dynamic duck typing is great for test automation because most feature tests (“above unit”) don’t need to be picky about types. However, when static types are needed, projects like mypy, Pyre, and MonkeyType come to the rescue. Python provides typing both ways!
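A small sketch may help show what "typing both ways" looks like in practice. The Page protocol, FakePage class, and helper functions below are hypothetical names for illustration. The untyped helper relies purely on duck typing, while the typed one carries annotations that a static checker such as mypy could verify. Both behave identically at run time.

```python
from typing import Protocol


class Page(Protocol):
    """Structural type: anything with a get_title() method returning str."""
    def get_title(self) -> str: ...


class FakePage:
    """Never mentions Page, yet satisfies it structurally."""
    def get_title(self):
        return "Home"


def shout_title_untyped(page):
    # Duck typing: works with any object that has get_title().
    return page.get_title().upper()


def shout_title_typed(page: Page) -> str:
    # Same logic, but a static type checker can now validate callers.
    return page.get_title().upper()


if __name__ == "__main__":
    print(shout_title_untyped(FakePage()))
    print(shout_title_typed(FakePage()))
```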

#6: IDEs

Good IDE support goes a long way to make a language and its frameworks easy to use. For Python testing, JetBrains PyCharm supports visual testing with pytest, unittest, and doctest out of the box, and its Professional Edition includes support for BDD frameworks (like pytest-bdd, behave, and lettuce) and Web development. For a lighter offering, Visual Studio Code is taking the world by storm. Its Python extensions support all the good stuff: snippets, linting, environments, debugging, testing, and a command line terminal right in the window. Atom, Sublime, PyDev, and Notepad++ also get the job done.

#7: Command Line Workflow

Python and the command line are like peanut butter and jelly – a match made in heaven. The entire test automation workflow can be driven from the command line. Pipenv can manage packages and environments. Every test framework has a console runner to discover and launch tests. There’s no need to “build” test code first because Python is an interpreted language, further simplifying execution. Rich command line support makes testing easy to manage manually, with tools, or as part of build scripts / CI pipelines.

As a bonus, automation modules can be called from the Python REPL interpreter or, even better, a Jupyter notebook. What does this mean? Automation-assisted exploratory testing! Imagine using Python calls to automatically steer a Web app to a point that requires a manual check. Calls can be swapped out, rerun, skipped, or changed on the fly. Python makes it possible.

#8: Ease of Entry

Python has always been friendly to beginners thanks to its Zen, whether those beginners are programming newbies or expert engineers. This gives Python a big advantage as an automation language choice because tests need to be done quickly and easily. Nobody wants to waste time when the features are in hand and just need to be verified. Plus, many manual software testers (often without programming experience) are now starting to do automation work (by choice or by force) and benefit from Python’s low learning curve.

#9: Strength for Scalability

Even though Python is great for beginners, it’s also no toy language. Python has industrial-grade strength because its design always favors one right way to get a job done. Development can scale thanks to meaningful syntax, good structure, modularity, and a rich ecosystem of tools and packages. Command line versatility enables it to fit into any tool or workflow. The fact that Python may be slower than other languages is not an issue for feature tests because system delays (such as response times for Web pages and REST calls) are orders of magnitude slower than language-level performance hits.

#10: Popularity

Python is one of the most popular programming languages in the world today. It is consistently ranked near the top on TIOBE, Stack Overflow, and GitHub (as well as GitHut). It is a beloved choice for Web developers, infrastructure engineers, data scientists, and test automationeers alike. The Python community also powers it forward. There is no shortage of Python developers, nor is there any dearth of support online. Python is not going away anytime soon. (Python 3, that is.)

Other Languages?

The purpose of this article is to highlight what makes Python great for test automation based on its own merits. Although I strongly believe that Python is one of the best automation languages, other choices like Java, C#, and Ruby are also viable. Check out my article The Best Programming Language for Test Automation for a comparison.

This article was posted with the author’s permission on both Automation Panda and PyBites.

Cypress.io and the Future of Web Testing

What is Cypress.io?

Cypress.io is an up-and-coming Web test automation framework. It is open source and written entirely in JavaScript. Unlike Selenium WebDriver tests that work outside the browser, Cypress works directly inside the browser. It enables developers to write front-end tests entirely in JavaScript, directly accessing everything within the browser. As a result, tests run much more quickly and reliably than Selenium-based tests.

Some nifty features include:

A rich yet simple API for interactions with automatic waiting
Mocha, Chai, and Sinon bundled in
A sleek dashboard with automatic reloads for Test-Driven Development
Easy debugging
Network traffic control for validation and mocking
Automatic screenshots and videos

Cypress was clearly developed for developers. It enables rapid test development with rapid feedback. The Cypress Test Runner is free, while the Cypress Dashboard Service (for better reporting and CI) will require a paid license.

How Do I Start Using Cypress?

I won’t post examples or instructions for using Cypress here. Please refer to the Cypress documentation for getting started and the tutorial video below. Make sure your machine is set up for JavaScript development.

Will Cypress Replace WebDriver?

TL;DR: No.

Cypress has its niche. It is ideal for small teams whose stacks are exclusively JavaScript and whose developers are responsible for all testing. However, WebDriver still has key advantages.

While Selenium WebDriver supports nearly all major browsers, Cypress currently supports only one browser: Google Chrome. That’s a major limitation. Web apps do not work the same across browsers. Many industries (especially banking and finance) put strict controls on browser types and versions, too.
Cypress is JavaScript only. Its website proudly touts its JavaScript purity like a badge of honor. However, that has downsides. First, all testing must happen inside the bubble of the browser, which makes parallel testing and system interactions much more difficult. Second, testers must essentially be developers, which may not work well for all teams. Third, other programming languages that may offer advantages for testing (like Python) cannot be used. Selenium WebDriver, on the other hand, has multiple language bindings and lets tests live outside the browser.
Within the JavaScript ecosystem, Cypress is not the only all-in-one end-to-end framework. Protractor is more mature, more customizable, and easier to parallelize. It wraps Selenium WebDriver calls for simplification and safety, which makes its API easy to use in much the same way as Cypress’s.
The WebDriver standard is a W3C Recommendation. What does this mean? All major browsers have a vested interest in implementing the standard. Selenium is simply the most popular implementation of the standard. It’s not going away. Cypress, however, is just a cool project backed with commercial intent.

What Does Cypress Mean for the Future?

There are a few big takeaways.

JavaScript is taking over the world. It was the most popular language on GitHub in 2017. JavaScript-only stacks like MEAN and MERN are increasingly popular. The demand for a complete JavaScript-only test framework like Cypress is further evidence.
“Bundled” test frameworks are becoming popular. Historically, a test framework simply provided test structure, basic fixtures, and maybe an assertion library (like JUnit). Then, extra test packages became popular (like Selenium WebDriver, REST APIs, mocking, logging, etc.). Now, new frameworks like Cypress and Protractor aim to provide pre-canned recipes of all these pieces to simplify the setup.
Many new test frameworks will likely be developer-centric. There is a trend in the software industry (especially with Agile) of eliminating traditional tester roles and putting testing work onto developers. The role of the “Software Engineer in Test” – a developer who builds test systems – is also on the rise. Test automation tools and frameworks will need to provide good developer experience (DX) to survive. Cypress is poised to ride that wave.
WebDriver is not perfect. Cypress was developed in large part to address WebDriver’s shortcomings, namely the slowness, difficulty, and unreliability (though unreliability is often a result of poor implementation). Many developers don’t like to use Selenium WebDriver, and so there will be a constant itch to make something better. Cypress isn’t there yet, but it might get there one day.

Clicking Web Elements with Selenium WebDriver

Selenium WebDriver is the most popular open source package for Web UI test automation. It allows tests to interact directly with a web page in a live browser. However, using Selenium WebDriver can be very frustrating because basic interactions often lack robustness, causing intermittent errors for tests.

The Basics

One such vulnerable interaction is clicking elements on a page. Clicking is probably the most common interaction for tests. In C#, a basic click would look like this:

webDriver.FindElement(By.Id("my-id")).Click();

This is the easy and standard way to click elements using Selenium WebDriver. However, it will work only if the targeted element exists and is visible on the page. Otherwise, the WebDriver will throw exceptions. This is when programmers pull their hair out.

Waiting for Existence

To avoid race conditions, interactions should not happen until the target element exists on the page. Even split-second loading times can break automation. The best practice is to use explicit waits before interactions with a reasonable timeout value, like this:

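using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

// Poll for up to 15 seconds until at least one matching element exists.
const int timeoutSeconds = 15;
var wait = new WebDriverWait(webDriver, TimeSpan.FromSeconds(timeoutSeconds));

wait.Until(driver => driver.FindElements(By.Id("my-id")).Count > 0);
webDriver.FindElement(By.Id("my-id")).Click();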
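To make the idea of an explicit wait more concrete, here is a minimal sketch of a polling loop written in plain C# with no Selenium dependency. It is not how WebDriverWait is actually implemented, only an illustration of the same principle: check a condition repeatedly until it holds or a timeout expires. The names Polling and WaitUntil are hypothetical and exist only for this example.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper for illustration; not part of Selenium WebDriver.
public static class Polling
{
    // Repeatedly evaluates `condition` until it returns true or `timeout` elapses.
    // The condition is always checked at least once.
    // Returns true if the condition was met in time, false otherwise.
    public static bool WaitUntil(Func<bool> condition, TimeSpan timeout, TimeSpan pollInterval)
    {
        var stopwatch = Stopwatch.StartNew();
        while (true)
        {
            if (condition())
            {
                return true;
            }
            if (stopwatch.Elapsed >= timeout)
            {
                return false;
            }
            Thread.Sleep(pollInterval);
        }
    }
}
```

A caller could pass any check as the condition, such as a lambda that looks up an element count. Returning false on timeout (instead of throwing) is a design choice made for this sketch; a real wait utility may prefer an exception so that the test fails loudly.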

Other Preconditions

Sometimes, Web elements won’t appear without first triggering something else. Even if the element exists on the page, the WebDriver cannot click it until it is made visible. Always look for the proper way to make that element available for clicking. Click on any parent panels or expanders first. Scroll if necessary. Make sure the state of the system actually permits the element to be clicked.

If the element is scrolled out of view, move to the element before clicking it:

// Actions lives in the OpenQA.Selenium.Interactions namespace.
new Actions(webDriver)
    .MoveToElement(webDriver.FindElement(By.Id("my-id")))
    .Click()
    .Perform();

Last Ditch Efforts

Nevertheless, there are times when clickable elements just don’t cooperate. They just can’t seem to be made visible. When all else fails, drop directly into JavaScript:

((IJavaScriptExecutor)webDriver).ExecuteScript(
    "arguments[0].click();",
    webDriver.FindElement(By.Id("my-id")));

Do this only when absolutely necessary. It is a best practice to use Selenium WebDriver methods because they make automated interaction behave more like a real user than raw JavaScript calls. Make sure to give good reasons in code comments whenever doing this, too.

Final Advice

This article was written specifically for clicks, but its advice can be applied to other sorts of interactions, too. Just be smart about waits and preconditions.

Note: Code examples on this page are written in C#, but calls are similar for other languages supported by Selenium WebDriver.

5 Things I Love About SpecFlow

SpecFlow, a.k.a. “Cucumber for .NET,” is a leading BDD test automation framework for .NET. Created by Gáspár Nagy and maintained as a free, open source project on GitHub by TechTalk, SpecFlow presently has almost 3 million total NuGet downloads. I’ve used it myself at a few companies, and, I must say as an automationeer, it’s awesome! SpecFlow shares a lot in common with other Cucumber frameworks like Cucumber-JVM, but it is not a knockoff – it excels in many ways. Below are five features I love about SpecFlow.

#1: Declarative Specification by Example

SpecFlow is a behavior-driven test framework. Test cases are written as Given-When-Then scenarios in Gherkin “.feature” files. For example, imagine testing a cucumber basket:

Feature: Cucumber Basket
  As a gardener,
  I want to carry many cucumbers in a basket,
  So that I don’t drop them all.
  
  @cucumber-basket
  Scenario: Fill an empty basket with cucumbers
    Given the basket is empty
    When "10" cucumbers are added to the basket
    Then the basket is full

Notice a few things:

It is declarative in that steps indicate what should be done at a high level.
It is concise in that a full test case is only a few lines long.
It is meaningful in that the coverage and purpose of the test are intuitively obvious.
It is focused in that the scenario covers only one main behavior.

Gherkin makes it easy to specify behaviors by example. That way, everybody can understand what is happening. C# code will implement each step in lower layers. Even if your team doesn’t do the full-blown BDD process, using a BDD framework like SpecFlow is still great for test automation. Test code naturally abstracts into separate layers, and steps are reusable, too!

#2: Context is King

Safely sharing data (i.e., “context”) between steps is a big challenge in BDD test frameworks. Using static variables is a simple yet terrible solution – any class can access them, but they create collisions for parallel test runs. SpecFlow provides much better patterns for sharing context.

Context injection is SpecFlow’s simple yet powerful mechanism for inversion of control (using BoDi). Any POCOs can be injected into any step definition class, either using default values or using a specific initialization, by declaring the POCO as a step def constructor argument. Those instances will also be shared instances, meaning steps across different classes can share the same objects! For example, steps for Web tests will all need a reference to the scenario’s one WebDriver instance. The context-injected objects are also created fresh for each scenario to protect test case independence.
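The shape of this pattern can be sketched in plain C# without the SpecFlow package. In the sketch below, BasketContext, BasketSetupSteps, BasketActionSteps, and ScenarioSimulation are all hypothetical names made up for illustration. In a real SpecFlow project, the step classes would carry the [Binding] attribute, their methods would carry step attributes like [Given] and [When], and the framework's container would construct them; here, a small method stands in for the container so the example runs on its own.

```csharp
// Hypothetical POCO holding the state shared between steps in one scenario.
public class BasketContext
{
    public int CucumberCount { get; set; }
}

// Would be a [Binding] class in a real SpecFlow project.
public class BasketSetupSteps
{
    private readonly BasketContext _context;

    // The context arrives as a constructor argument.
    public BasketSetupSteps(BasketContext context) { _context = context; }

    public void GivenTheBasketIsEmpty() { _context.CucumberCount = 0; }
}

// A second step class that receives the very same context instance.
public class BasketActionSteps
{
    private readonly BasketContext _context;

    public BasketActionSteps(BasketContext context) { _context = context; }

    public void WhenCucumbersAreAdded(int count) { _context.CucumberCount += count; }
}

public static class ScenarioSimulation
{
    // Stands in for the container: one fresh context per scenario,
    // handed to every step class that asks for it.
    public static int RunFillBasketScenario(int added)
    {
        var context = new BasketContext();
        var setup = new BasketSetupSteps(context);
        var actions = new BasketActionSteps(context);

        setup.GivenTheBasketIsEmpty();
        actions.WhenCucumbersAreAdded(added);
        return context.CucumberCount;
    }
}
```

Because each call to RunFillBasketScenario builds a new BasketContext, no state leaks from one simulated scenario to the next, which mirrors the per-scenario freshness described above.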

Another powerful context mechanism is ScenarioContext. Every scenario has a unique context: title, tags, feature, and errors. Arbitrary objects can also be stored in the context object like a Dictionary, which is a simple way to pass data between steps without constructor-level context injection. Step definition classes can access the current scenario context using the static ScenarioContext.Current variable, but a better, thread-safe pattern is to make all step def classes extend the Steps class and simply reference the ScenarioContext instance variable.

#3: Hooks for Any Occasion

Hooks are special methods that insert extra logic at critical points of execution. For example, WebDriver cleanup should happen after a Web test scenario completes, no matter the result. If the cleanup routine were put into a Then step, then it would not be executed if the scenario had a failure in a When step. Hooks are reminiscent of Aspect-Oriented Programming.

Most BDD frameworks have some sort of hooks, but SpecFlow stands out for its hook richness. Hooks can be applied before and after steps, scenario blocks, scenarios, features, and even around the whole test run. (Cucumber-JVM, by contrast, did not support global hooks at the time of writing.) Hooks can be selectively applied using tags, and they can be assigned an order if a project has multiple hooks of the same type. Hook methods will also be picked up from any step definition class. SpecFlow hooks are just awesome!

#4: Thorough Outline Templating

Scenario Outlines are a standard part of Gherkin syntax. They’re very useful for templating scenarios with multiple input combinations. Consider the cucumber basket again:

Feature: Cucumber Basket
  
  Scenario Outline: Add cucumbers to the basket
    Given the basket has "<initial>" cucumbers
    When "<some>" cucumbers are added to the basket
    Then the basket has "<total>" cucumbers

    Examples: Counts
      | initial | some | total |
      | 1       | 2    | 3     |
      | 5       | 3    | 8     |

All BDD frameworks can parametrize step inputs (shown in double quotes). However, SpecFlow can also parametrize the non-input parts of a step!

Feature: Cucumber Basket
  
  Scenario Outline: Use the cucumber basket
    Given the basket has "<initial>" cucumbers
    When "<some>" cucumbers are <handled-with> the basket
    Then the basket has "<total>" cucumbers

    Examples: Counts
      | initial | some | handled-with | total |
      | 1       | 2    | added to     | 3     |
      | 5       | 3    | removed from | 2     |

The step definitions for the add and remove steps are separate. The step text for the action is parametrized, even though it is not a step input:

[When(@"""(\d+)"" cucumbers are added to the basket")]
public void WhenCucumbersAreAddedToTheBasket(int count) { /* */ }

[When(@"""(\d+)"" cucumbers are removed from the basket")]
public void WhenCucumbersAreRemovedFromTheBasket(int count) { /* */ }

That’s cool!

#5: Test Thread Affinity

SpecFlow can use any unit test runner (like MSTest, NUnit, and xUnit.net), but TechTalk provides the official SpecFlow+ Runner for a license fee. I’m not associated with TechTalk in any way, but the SpecFlow+ Runner is worth the cost for enterprise-level projects. It has a friendly command line, a profile file to customize execution, parallel execution, and nice integrations.

The major differentiator, in my opinion, is its test thread affinity feature. When running tests in parallel, the major challenge is avoiding collisions. Test thread affinity is a simple yet powerful way to control which tests run on which threads. For example, consider testing a website with user accounts. No two tests should use the same user at the same time, for fear of collision. Scenarios can be tagged for different users, and each thread can have the affinity to run scenarios for a unique user. Some sort of parallel isolation management like test thread affinity is absolutely necessary for test automation at scale. Given that the SpecFlow+ Runner can handle up to 64 threads (according to TechTalk), massive scale-up is possible.

But Wait, There’s More!

SpecFlow is an all-around great test automation framework, whether or not your team is doing full BDD. Feel free to add comments below about other features you love (or *gasp* hate) about SpecFlow!

Quality Metrics 101: Test Quality

New to the series? Start from the beginning!

Test quality metrics make sure that testing efforts are worthwhile. Though “testing” and “quality” may be synonymous as organizational titles, testing is only one method of enforcing quality. In software, it just happens to be the most effective one. Testing is expensive, though, because it slows down time-to-market. Some people even devalue testing work because it doesn’t add new features to a product. Below are aspects of test quality to consider measuring to prove and even increase the value of testing efforts.

Coverage

Quality Aspect

How much functionality is covered by tests?

Desired State

High – More coverage means less risk. Note that 100% complete coverage is impossible.

Metrics

Coverage may be measured for both manual and automated tests. However, automated test coverage is usually more important because automated tests are meant to be defensive without gaps.

Code Coverage – Code coverage tools check what paths of code are actually exercised by automated tests. While they cannot tell if tests are good or bad, they are great for exposing gaps in coverage. Unit test code coverage is easy because most frameworks have plugins, but above-unit code coverage requires instrumented builds. Look for tools that track more than just lines of code. Target 90%+ coverage. Add new tests to cover any major gaps.

Feature Coverage – Feature coverage is a manual way to score features on test coverage based on planning and review. For this metric to be successful, a team must consistently specify features well; otherwise, this metric will give useless data. Gherkin scenarios are a great way to do this – for example, each scenario can be marked as untested, manual, or automated. Feature coverage is unscientific, but it can give a better picture of functionalities actually covered (instead of just the raw lines of code covered).

Automation Debt – Technical debt increases when tests are not automated and thus lack coverage. Teams are often unable to automate all tests originally planned, and test automation is frequently jettisoned from the Definition of Done. Or, a project may not start automating tests until a large chunk of the project is already complete. The best way to track automation debt is to create a backlog for incomplete automation work. Backlog tasks can be sized, prioritized, and planned according to whatever development process is used (Scrum, Kanban, etc.). Appropriate process metrics can then be used to understand the magnitude of the work and, thus, the lack of automated test coverage.

Warning: Test case count, test length, and test code line count are terrible metrics for coverage because they encourage largeness rather than uniqueness. The goal of testing is to have the greatest coverage with the lowest risk for the least work. Anybody can blindly write tests or variations that add no meaningful value.

Reliability

Quality Aspect

Do automated tests consistently reach completion? And how trustworthy are the results?

Desired State

High – Reliability means less time for failure triage or (horrors) reruns.

Metrics

Failure Reasons – Track the failure reason for each test case run. Ideally, tests should fail only when they discover product bugs. However, tests may also fail when:

an acceptable product change caused an automation error because tests were not updated, indicating poor communication or careless updates
an environmental change or interruption caused an automation error, indicating deployment or sysadmin problems
the automation code itself has a bug

Remember, “successful” test runs either pass with appropriate coverage or fail due to product bugs. “Unsuccessful” test runs fail or crash for reasons other than product bugs. Aim to minimize unsuccessful test runs. Never hack a test just to get it passing – always work to fix the problems behind test failures.
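As a rough sketch of how this could be tracked, the code below tallies run outcomes and computes the share of unsuccessful runs. The RunOutcome values and the FailureReasons helper are hypothetical names chosen to mirror the failure reasons listed above; a real project would pull this data from its test reports or a results database.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical outcome categories mirroring the failure reasons above.
public enum RunOutcome { Passed, ProductBug, StaleTest, EnvironmentIssue, AutomationBug }

public static class FailureReasons
{
    // A run is "successful" if it passed or if it failed due to a product bug.
    public static bool IsSuccessful(RunOutcome outcome) =>
        outcome == RunOutcome.Passed || outcome == RunOutcome.ProductBug;

    // Fraction of runs that were unsuccessful (0.0 when there are no runs).
    public static double UnsuccessfulRate(IEnumerable<RunOutcome> outcomes)
    {
        var runs = outcomes.ToList();
        if (runs.Count == 0)
        {
            return 0.0;
        }
        int unsuccessful = runs.Count(o => !IsSuccessful(o));
        return unsuccessful / (double)runs.Count;
    }

    // Number of runs per outcome, useful for spotting the dominant failure reason.
    public static Dictionary<RunOutcome, int> CountByReason(IEnumerable<RunOutcome> outcomes) =>
        outcomes.GroupBy(o => o).ToDictionary(g => g.Key, g => g.Count());
}
```

For example, four runs with two passes, one product bug, and one environment issue give an unsuccessful rate of 1 in 4, since only the environment issue counts against the automation.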

Speed

Quality Aspect

How much time do test runs take?

Desired State

Fast – Tests should complete in the shortest time possible.

Metrics

Test Case Execution Time – Test case execution times indicate the efficiency of the automation code. Track the start-to-end execution time for every individual test case run. Then, analyze the data using common sense. For example, outliers may be inefficient tests that need tuning or should be removed altogether. It may be wise to separate test runs by result type or coverage area. Historical data can also be used as a baseline to determine performance impacts when making cross-cutting automation changes.
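One simple way to surface outliers is to compare each test's time against a multiple of the median. The sketch below does just that. The ExecutionTimes class, its method names, and the "median times a factor" rule are assumptions made for illustration; they are only one heuristic among many, and the right threshold depends on the suite.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration.
public static class ExecutionTimes
{
    // Median of a non-empty sequence of values.
    public static double Median(IEnumerable<double> values)
    {
        var sorted = values.OrderBy(v => v).ToList();
        if (sorted.Count == 0)
        {
            throw new ArgumentException("At least one value is required.", nameof(values));
        }
        int mid = sorted.Count / 2;
        return sorted.Count % 2 == 1
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Names of tests whose time exceeds `factor` times the median, sorted by name.
    public static List<string> SlowOutliers(IReadOnlyDictionary<string, double> secondsByTest, double factor)
    {
        double threshold = Median(secondsByTest.Values) * factor;
        return secondsByTest
            .Where(entry => entry.Value > threshold)
            .Select(entry => entry.Key)
            .OrderBy(name => name)
            .ToList();
    }
}
```

The median is used instead of the mean because a few very slow tests would drag the mean upward and hide themselves. Flagged tests still need a human look: some tests are slow for good reasons.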

Test Suite Execution Time – Test suites are sets of test cases, but their execution times are not merely the sum of their tests’ times. A test suite run may include environmental setup, deployment, parallel execution, reporting, and other things. The purpose of tracking test suite execution time is to determine the start-to-end time of the suite in total, because that indicates the speed of feedback and, in CI, delivery. Tracking test suite execution time will also reveal the effect of adding more test cases to the suite, which then factors into the risk-based decisions of including or excluding tests.

Test Pyramid Balance – The Test Pyramid separates tests between unit (bottom), integration (middle), and end-to-end (top) layers. Ideally, there should be more tests at the bottom than at the top. Why? Higher-level tests are more expensive – they take more time to develop, they are more time consuming to triage, and they have slower execution times. Consider the “Rule of 1’s”: a unit test takes ~1ms, an integration test takes ~1s, and an end-to-end test takes ~1m. When scaled to thousands of tests with continuous integration, end-to-end tests simply take too much time. Tracking the proportion of tests at each layer will give a rough picture of the balance. There’s no perfect ratio between layers, but make sure that the tests form a pyramid and not a cupcake, hourglass, or ice cream cone. Rebalance test efforts as appropriate.
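The arithmetic behind the “Rule of 1’s” is worth spelling out. Run serially, 1,000 unit tests take about 1 second, 1,000 integration tests take about 16.7 minutes, and 1,000 end-to-end tests take about 16.7 hours. The sketch below does the same calculation for any mix of tests. The RuleOfOnes class is a hypothetical name, and the durations are the rule-of-thumb figures from above, not measurements.

```csharp
using System;

// Hypothetical helper for illustration, using the rule-of-thumb durations.
public static class RuleOfOnes
{
    public static readonly TimeSpan UnitTest = TimeSpan.FromMilliseconds(1);
    public static readonly TimeSpan IntegrationTest = TimeSpan.FromSeconds(1);
    public static readonly TimeSpan EndToEndTest = TimeSpan.FromMinutes(1);

    // Estimated total time if every test runs one after another on a single thread.
    public static TimeSpan SerialRunTime(int unitCount, int integrationCount, int endToEndCount)
    {
        long ticks = UnitTest.Ticks * unitCount
                   + IntegrationTest.Ticks * integrationCount
                   + EndToEndTest.Ticks * endToEndCount;
        return TimeSpan.FromTicks(ticks);
    }
}
```

Parallel execution shrinks these totals, but the ratio between layers stays the same, which is why the balance of the pyramid matters.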

Return on Investment

Quality Aspect

Do the tests add greater value than their cost?

Desired State

High – Tests need to be worth the effort. Don’t test for the sake of testing!

Metrics

Measuring return on investment in terms of hard dollars is objectively impossible. The true cost of bugs can never be fully known: if a bug is caught early, the potential cost to fix it later can merely be estimated. The intangible value of protecting brand reputation may be more important than the tangible value of money saved by finding specific bugs. Better quality practices might prevent developers from causing bugs that would have otherwise happened – and there’s no good way to measure that.

Instead, return on investment is better measured by a collection of metrics that validate both code line protection and defect discovery. Use a weighted scorecard to get a more holistic view of ROI. Scorecards can be used with estimates for planning tests, as well as plugged in with actual values to measure the degree of success. Note that some aspects of ROI may be too difficult to measure accurately – in those cases, a LOW-MID-HIGH grading scale may be best. Others may seem like micromanagement.

Priority – Assign each test a priority for its coverage importance. Core functionalities should have the highest priority, while fringe functionalities should have the lowest priority. Focus on high-priority tests. Another way to look at importance is risk, or the chances that bugs will escape if explicit testing for a feature is not done.
Test Execution Frequency – Track how many times tests are actually run. Higher frequency is better. Tests that are rarely run should either be included in more regular runs or removed/archived. This could easily be tracked by a test management tool or database.
Coverage Uniqueness – Duplicate test coverage wastes resources. Unfortunately, this one is difficult to measure. Tools for code coverage or static analysis might help. Manual review, however, is typically a better approach.
Development Cost and Maintenance Cost – Track how much effort it takes to make and keep tests, including man-hours and resources. Lower costs are better, of course. Planning tools may help with this.
Bug Discovery – Track bugs discovered in terms of severity and when and how they were caught. Ideally, the number of bugs caught by customers after a release (meaning, not caught by tests during development) should be minimal, and their severity should be low. Bug tracking tools should easily provide this data. Be warned, though, that the raw bug count is a poor metric. Consider this question: Is a high bug count good or bad? Trick question – during a release, it indicates good test quality but poor product quality; after a release, it indicates all-around poor quality. What matters is that a minimal number of bugs happen at all, and that most of those bugs are caught and fixed before a release. Plus, keep in mind that bugs happen by accident. Finally, focusing exclusively on bug count to determine test value ignores the positive side of testing – that passing tests give confidence that features work correctly.
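A weighted scorecard built on a LOW-MID-HIGH scale could be sketched as follows. The Grade enum, the RoiScorecard class, and any weights passed in are hypothetical; each team would choose its own aspects and weights. The sketch assumes that every grade has been oriented so that High is favorable, meaning a cheap-to-maintain test would be graded High on cost.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical LOW-MID-HIGH scale, oriented so that High is always favorable.
public enum Grade { Low = 1, Mid = 2, High = 3 }

public static class RoiScorecard
{
    // Weighted average of the grades, on the same 1-to-3 scale as Grade.
    public static double Score(IEnumerable<(Grade grade, double weight)> entries)
    {
        var list = entries.ToList();
        double totalWeight = list.Sum(e => e.weight);
        if (totalWeight <= 0)
        {
            throw new ArgumentException("Weights must sum to a positive number.", nameof(entries));
        }
        double weightedSum = list.Sum(e => (int)e.grade * e.weight);
        return weightedSum / totalWeight;
    }
}
```

For instance, a test graded High on priority with weight 3 and Low on execution frequency with weight 1 scores (3 × 3 + 1 × 1) / 4 = 2.5. The same method works for planning with estimated grades and for review with actual ones.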

Quality Metrics 101: The Good, The Bad, and The Ugly

metric – [me-trik] – (noun) a standard for measuring or evaluating something

(Courtesy of dictionary.com)

When developing software, metrics can be a good way to track progress and evaluate quality. Managers typically love them because they provide insights that could otherwise be hard to see. Come on, who doesn’t love pretty charts with rainbow colors? However, gathering metrics is not easy, especially for quality. Some metrics are downright useless, and others encourage bad behavior when used improperly. It is far more important to focus on the most important aspects of quality than to blindly promulgate numbers. This article will cover quality metrics in depth, giving guidance on what quality aspects matter most and how they can be measured.

What are Quality Metrics?

Quality is the degree of a feature’s excellence. Quality metrics attempt to impartially measure a feature’s excellence. The word “attempt” is notable – quality is inherently relative, and metrics can sometimes be subjective. Take pizza as an example: How would the quality of a pizza be measured? One method could be to analyze the freshness and nutritional value of the ingredients, but Pizza Hut notoriously fought Papa John’s Pizza over the assertion that better ingredients make better pizza. Another method could be to analyze the cooking process, like bake time or the order of toppings, but that would be better for identifying carelessness than quality. The delivery process could also be considered, like Domino’s delivery robots, but that evaluates customer service and not the pizza itself. Ultimately, what matters are the taste and the visual appearance, which are totally subjective to the consumer. Surveys are unreliable. Taste tests have limited selection. Appearance is an art, not a science. Each of these metrics gives a glimpse into quality but does not fully reveal what actually makes a “good” pizza. Together, though, they provide a reasonable picture when the desired metrics are gathered well.

[Image: a pepperoni pizza]

Is that really high quality pizza? Well, what aspects of quality are we measuring? We won’t get a perfect picture of quality from metrics, but we can get a rough idea. Software quality metrics work the same way.

Software Quality

In software, there are three primary types of quality metrics:

Test Quality
- How effective are tests at enforcing high quality standards?
- Examples: code coverage, test failure reasons.
Process Quality
- How effective are processes at delivering good features?
- Examples: time to fix broken builds, time to discover bugs.
Product Quality
- How good is the software product?
- Examples: test failure rate, up-time, customer satisfaction.

The main purpose of software quality metrics is to validate successes and find areas for improvement in the development process. Metrics expose problems like gaps in coverage or slow feedback loops so that a team knows what to improve. They are meant to be informative but not punitive – they should simply report accurate data. Don’t shoot the messenger! For example, if the test failure rate is high, fix the bugs instead of blaming each other.

However, be warned by W. Edwards Deming’s red bead experiment: Quality cannot be inspected into a product – it must be built in from the beginning! Metrics alone cannot solve problems – they can merely expose them. It is up to the development team to effect the proper change based on what metrics reveal. Awareness is useless without action. And action should ultimately lead to better features, faster delivery, and higher profits.

Choosing Quality Metrics

Metrics are nothing but tools to improve aspects of quality. Not every job needs the full toolbox! Always pick the quality aspect first, and then find the right measuring stick. Don’t just pick some metrics that others say are good. For example, if build stability is the quality aspect that is deemed important (and it should be), then the metric to track it could be the average time to fix a build after it is broken.
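The build stability example could be measured with something like the sketch below, which averages the time between each break and its fix. The BuildStability class and the tuple field names are hypothetical; real timestamps would come from whatever CI system is in use.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration.
public static class BuildStability
{
    // Average time between a build breaking and being fixed.
    // Returns TimeSpan.Zero when there are no incidents.
    public static TimeSpan MeanTimeToFix(IEnumerable<(DateTime brokenAt, DateTime fixedAt)> incidents)
    {
        var list = incidents.ToList();
        if (list.Count == 0)
        {
            return TimeSpan.Zero;
        }
        double averageTicks = list.Average(i => (i.fixedAt - i.brokenAt).Ticks);
        return TimeSpan.FromTicks((long)averageTicks);
    }
}
```

Two incidents that took 30 and 90 minutes to fix average out to 60 minutes. A mean can hide one very long outage, so it may be worth looking at the longest fix time alongside it.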

The best process for choosing quality metrics is:

Identify a quality aspect that adds value.
Decide if the aspect is worth measuring.
Determine the desired state for that aspect.
Derive the best way to measure progress toward the desired state impartially.
Implement the metric gathering, storage, and analysis.
Revisit the metric periodically to assert its value.
Stop gathering the metric when it ceases to provide value.

Keep in mind that metrics have a cost: they must be gathered, stored, and analyzed. That’s why it’s important to pick the quality aspects that matter most.

This Series

The articles in this series will cover each of the quality metric types in detail. Each will list major quality aspects with meaningful metrics to track them and advice on how to use them. Remember, metrics should be constructive and not destructive.