Some have never automated tests and can’t check themselves before they wreck themselves. Others have 1000s of tests that are flaky, duplicative, and slow. Wa-do-we-do? Well, I gave a talk about this problem at a few Python conferences. The language used for example code was Python, but the principles apply to any language.
Are your automated tests running in parallel? If not, then they probably should be. Together with continuous integration, parallel testing the best way to fail fast during software development and ultimately enforce higher software quality. Switching tests from serial to parallel execution, however, is not a simple task. Tests themselves must be designed to run concurrently without colliding, and extra tools and systems are needed to handle the extra stress. This article is a high-level guide to good parallel testing practices.
What is Parallel Testing?
Parallel testing means running multiple automated tests simultaneously to shorten the overall start-to-end runtime of a test suite. For example, if 10 tests take a total of 10 minutes to run, then 2 parallel processes could execute 5 tests each and cut the total runtime down to 5 minutes. Even better, 10 processes could execute 1 test each to shrink runtime to 1 minute. Parallel testing is usually managed by either a test framework or a continuous integration tool. It also requires more compute resources than serial testing.
Why Go Parallel?
Running automated tests in parallel does require more effort (and potentially cost) than running tests serially. So, why go through the trouble?
The answer is simple: time. It is well documented that software bugs cost more when they are discovered later. That’s why current development practices like Agile and BDD strive to avoid problems from the start through small iterations and healthy collaboration (“shift left“), while CI/CD defensively catches regressions as soon as they happen (“fail fast“). Reducing the time to discover a problem after it has been introduced means higher quality and higher productivity.
Ideally, a developer should be told if a code change is good or bad immediately after committing it. The change should automatically trigger a new build that runs all tests. Unfortunately, tests are not instantaneous – they could take minutes, hours, or even days to complete. A test automation strategy based on the Testing Pyramid will certainly shorten start-to-end execution time but likely still require parallelization. Consider the layers of the Testing Pyramid and their tests’ average runtimes, the Testing Pyramid Rule of 1’s:
Now, consider how many tests from each layer could be run within given time limits, if the tests are run serially:
1 Minute Near-Instant
10 Minutes Coffee Break
1 Hour There Goes Today
Unit test numbers look pretty good, though keep in mind 1 millisecond is often the best-case runtime for a unit test. Integration and end-to-end runtimes, however, pose a more pressing problem. It is not uncommon for a project to have thousands of above-unit tests, yet not even a hundred end-to-end tests could complete within an hour, nor could a thousand integration tests complete within 10 minutes. Now, consider two more facts: (1) tests often run as different phases in a CI pipeline, to total runtimes are stacked, and (2) multiple commits would trigger multiple builds, which could cause a serious backup. Serial test execution would starve engineering feedback in any continuous integration system of scale. A team would need to drastically shrink test coverage or give up on being truly “continuous” in favor of running tests daily or weekly. Neither alternative is acceptable these days. CI needs parallel testing to be truly continuous.
The Danger of Collisions
The biggest danger for parallel testing is collision – when tests interfere with each other, causing invalid test failures. Collisions may happen in the product under test if product state is manipulated by more than one test at a time, or they may happen in the automation code itself if the code is not thread-safe. Collisions are also inherently intermittent, which makes them all the more difficult to diagnose. As a design principle, automated tests must avoid collisions for correct parallel execution.
Making tests run in parallel is not as simple as flipping a switch or adding a new config file. Automated tests must be specifically designed to run in parallel. A team may need to significantly redevelop their automation code to make parallel execution work right.
Handling Product-Level Collisions
Product-level collisions essentially reduce to how environments are set up and handled.
The most basic way to avoid product-level collisions would be to run each test thread or process against its own instance of the product in an exclusive environment. (In the most extreme case, every single test could have its own product instance.) No collisions would happen in the product because each product instance would be touched by only one test instance at a time. Separate environments are possible to implement using various configuration and deployment tools. Docker containers are quick and easy to spin up. VMs with Vagrant, Puppet, Chef, and/or Ansible can also get it done.
However, it may not always be sensible to make separate environments for each test thread/process:
Creating a new environment is inefficient – it takes extra time to set up that may cancel out any time saved from parallel execution.
Many projects simply don’t have the money or the compute resources to handle a massive scale-out.
Some tests may not cause collisions and therefore may not need total isolation.
Some product environments are extremely large and complicated and would not be practical to replicate for each test individually.
Environments with a shared product instance are quite common. One could be a common environment that everyone on a team shares, or one could be freshly created during a CI run and accessed by multiple test threads/processes. Either way, product-level collisions are possible, and tests must be designed to avoid clashing product states. Any test covering a persistent state is vulnerable; usually, this is the vast majority of tests. Consider web app testing as an example. Tests to load a page and do some basic interactions can probably run in parallel without extra protection, but tests that use a login to enter data or change settings could certainly collide. In this case, collisions could be avoided by using different logins for each simultaneous test instance – by using either a pool of logins, a unique login per test case, or a unique login per thread/process. Each product is different and will require different strategies for avoiding collisions.
Handling Automation-Level Collisions
Automation-level collisions can happen when automation code is not thread-safe, which could mean more than simply locks and semaphores.
#1: Test Independence
Test cases must be completely independent of each other. One test must not require another test to run before it for the sake of setup. A test case should be able to run by itself without any others. A test suite should be able to run successfully in random order.
#2: Proper Variable Scope
If parallel tests will be run in the same memory address space, then it is imperative to properly scope all variables. Global or static mutable variables (e.g., “non-constants”) must not be allowed because they could be changed unexpectedly. The best pattern for handling scope is dependency injection. Thread-safe singletons would be a second choice. (Typically, global or static variables are used to subvert design patterns, so they may reveal further necessary automation rework when discovered.)
#3: External Resources
Automation may sometimes interact with external resources, such as test config files or test result databases/services. Make sure no external interactions collide. For example, make sure test run updates don’t overwrite each other.
Logs are very difficult to trace when multiple tests are simultaneously printed to the same file. The best practice is to generate separate log files for each test case, thread, or process to make them readable.
#5: Result Aggregation
A test suite is a unified collection of tests, no matter how many threads/processes are used to run its tests in parallel. Make sure test results are aggregated together into one report. Some frameworks will do this automatically, while others will require custom post-processing.
#6: Test Filtering
One strategy to avoid collisions may be to run non-colliding partitions (subsets) of tests in parallel. Test tagging and filtering would make this possible. For example, tests that require a special login could be tagged as such and run together on one thread.
The previous section on collisions discussed how to handle product environments. It is also important to consider how to handle the test automation environment. These are two different things: the product environment contains the live product under test, while the test environment contains the automation software and resources that run tests against the product. The test environment is where the parallel tests will be executed, and, as such, it must be scalable to handle the parallelization. A common example of a test environment could be a Jenkins master with a few agents for running build pipelines. There are two primary ways to scale the test environment: scale-up and scale-out.
Scale-up is when one machine is configured to handle more tests in parallel. For example, scale-up would be when a machine switches from one (serial) thread to two, three, or even more in parallel. Many popular test runners support this type of scale-up by spawning and joining threads in a common memory address space or by forking processes. (For example, the SpecFlow+ Runner lets you choose.)
Scale-up is a simple way to squeeze as much utility out of an existing machine as possible. If tests are designed to handle collisions, and the test runner has out-of-the-box support, then it’s usually pretty easy to add more test threads/processes. However, parallel test scale-up is inherently limited by the machine’s capacity. Each additional test process succumbs to the law of diminishing returns as more memory and processor cycles are used. Eventually, adding more threads will actually slow down test execution because the processor(s) will waste time constantly switching between tests. (Anecdotally, I found the optimal test-thread-to-processor ratio to be 2-to-1 for running C#/SpecFlow/Selenium-WebDriver tests on Amazon EC2 M4 instances.) A machine itself could be upgraded with more threads and processors, but nevertheless, there are limits to a single machine’s maximum capacity. Weird problems like TCP/IP port exhaustion may also arise.
Scale-out is when multiple machines are configured to run tests in parallel. Whereas scale-up had one machine running multiple tests, scale-out has multiple machines each running tests. Scale-out can be achieved in a number of ways. A few examples are:
A Jenkins pipeline launches tests across ten agents in parallel, in which each agent executes a tenth of the tests independently.
Scale-out is a better long-term solution than scale-up because scale-out can handle an unlimited number of machines for parallel testing. The limiting factor with scale-out is not the maximum capacity of the hardware but rather the cost of running more machines. However, scale-out is much harder to implement than scale-up. It requires tests to be evenly divided with some sort of balancer and filter. It also requires some sort of test result aggregation for joint reporting – people won’t want to piece together a bunch of separate reports to get an overall snapshot of quality. Plus, the test environment is more complicated to build and maintain (though tools like CloudBees Jenkins Enterprise or Amazon EC2 can make it easier.)
Upwards and Outwards
Of course, scale-up and scale-out are not mutually exclusive. Scaled-out nodes could individually be scaled-up. Consider a test environment with 10 powerful VMs that could each handle 10 tests in parallel – that means 100 tests could run simultaneously. Using the Rule of 1’s, it would take only about a minute to run 100 Web UI tests, which serially would have taken over an hour and a half! Use both strategies to shorten start-to-end runtime as much as possible.
Parallel testing is a worthwhile endeavor. When done properly, it will not only reduce development time but also improve the development experience. For readers who want to start doing parallel testing, I recommend researching the tools and frameworks you want to use. Many popular test frameworks support parallel execution, and even if the one you choose doesn’t, you can always invoke tests in parallel from the command line. Do well!
This post was originally published by Sealights on December 19, 2017 as part of their article Test Quality in CI/CD – Expert Roundup. I was honored to contribute my thoughts on automatic recovery in test automation, and I reblogged the text of my contribution here for Automation Panda readers. Please check out contributions from other experts in the full article!
Test automation is an essential part of CI/CD, but it must be extremely robust.
Unfortunately, tests running in live environments (integration and end-to-end)
often suffer rare but pesky “interruptions” that, if unhandled, will cause tests to fail.
These interruptions could be network blips, web pages not fully loaded, or
temporarily downed services – any environment issues unrelated to product bugs.
Interruptive failures are problematic because they (a) are intermittent and thus
difficult to pinpoint, (b) waste engineering time, (c) potentially hide real failures,
and (d) cast doubt over process/product quality. Furthermore, CI/CD magnifies
even rare issues. If an interruption has only a 1% chance of happening during a test,
then considering binomial probabilities, there is a 63% chance it will happen after
100 tests, and a 99% chance it will happen after 500 tests. Keep in mind that it is not
uncommon for thousands of tests to run daily in CI – Google Guava had over 286K
tests back in July 2012!
It is impossible to completely avoid interruptions – they will happen. Therefore, it is
imperative to handle interruptions at multiple layers:
First, secure the platform upon which the tests run. Make sure system
performance is healthy and that network connections are stable.
Second, add failover logic to the automated tests. Any time an interruption
happens, catch it as close to its source as possible, pause briefly, and retry the
operation(s). Do not catch any type of error: pinpoint specific interruption
signatures to avoid false positives. Build failover logic into the framework
rather than implementing it for one-off cases. For example, wrappers around web element or service calls could automatically perform retries. Aspect-
oriented programming can help here tremendously. Repeating failed tests in their entirety also works and may be easier to implement but takes much
more time to run.
Third, log any interruptions and recovery attempts as warnings. Do not
neglect to report them because they could indicate legitimate problems,
especially if patterns appear.
It may be difficult to differentiate interruptions from legitimate bugs. Or, certain
retry attempts might take too long to be practical. When in doubt, just fail the test –
that’s the safer approach.
This post is intended to be a quick personal reference for Jenkins Pipelines so I don’t forget things I learned or lose links to valuable info. Feel free to recommend additional resources!
Today, a few of my LexisNexis coworkers and I went to the CloudBees office down the street (since we are both located at NCSU Centennial Campus) for a Jenkins Pipeline workshop. I’ve used Jenkins for a few years now, and I handle my team’s freestyle projects for running .NET/SpecFlow/Selenium automated tests, but the declarative pipeline style for Jenkins jobs is new to me. (I feel so behind the times.) I’m glad I attended the workshop because I learned a few cool things.
Below are links to helpful resources for learning about Jenkins Pipelines:
A friend recently asked me this question (albeit with some rephrasing):
Can a unit test be a performance test? For example, can a unit test wait for an action to complete and validate that the time it took is below a preset threshold?
I cringed when I heard this question, not only because it is poor practice, but also because it reflects common misunderstandings about types of testing.
QA Buzzword Bingo
The root of this misunderstanding is the lack of standard definitions for types of tests. Every company where I’ve worked has defined test types differently. Individuals often play fast and loose with buzzword bingo, especially when new hires from other places used different buzzwords. Here are examples of some of those buzzwords:
Measurements / benchmarks / metrics
Continuous integration testing
And here are some games of buzzword bingo gone wrong:
Trying to separate “systemic” tests from “system” tests.
Claiming that “unit” tests should interact with a live web page.
Separating “regression” tests from other test types.
Before any meaningful discussions about testing can happen, everyone must agree to a common and explicit set of testing type definitions. For example, this could be a glossary on a team wiki page. Whenever I have discussions with others on this topic, I always seek to establish definitions first.
What defines a unit test?
Here is my definition:
A unit test is a functional, white box test that verifies the correctness of a single unit of software code. It is functional in that it gives a deterministic pass-or-fail result. It is white box in that the test code directly calls the product source code under test. The unit is typically a function or method, and there should be separate unit tests for each equivalence class of inputs.
Unit tests should focus on one thing, and they are typically short – both in lines of code and in execution time. Unit tests become extremely useful when they are automated. Every major programming language has unit test frameworks. Some popular examples include JUnit, xUnit.net, and pytest. These frameworks often integrate with code coverage, too.
In continuous integration, automated unit tests can be run automatically every time a new build is made to indicate if the build is good or bad. That’s why unit tests must be deterministic – they must yield consistent results in order to trust build status and expedite failure triage. For example, if a build was green at 10am but turned red at 11am, then, so long as the tests were deterministic, it is reasonable to deduce that a defective change was committed to the code line between 10-11am. Good build status indicates that the build is okay to deploy to a test environment and then hopefully to production.
(As a side note, I’ve heard arguments that unit tests can be black box, but I disagree. Even if a black box test covers only one “unit”, it is still at least an integration test because it covers the connection between the actual product and some caller (script, web browser, etc.).)
What defines a performance test?
Again, here’s my definition:
A performance test is a test that measures aspects of a controlled system. It is white box if it tests code directly, such as profiling individual functions or methods. It is black box if it tests a real, live, deployed product. Typically, when people talk about testing software performance, they mean black box style testing. The aspects to measure must be pre-determined, and the system under test must be controlled in order to achieve consistent measurements.
Performance tests are not functional tests:
Functional tests answer if a thing works.
Performance tests answer how efficiently a thing works.
Rather than yield pass-or-fail results, performance tests yield measurements. These measurements could track things as general as CPU or memory usage, or they could track specific product features like response times. Once measurements are gathered, data analysis should evaluate the goodness of the measurements. This often means comparison to other measurements, which could be from older releases or with other environment controls.
Performance testing is challenging to set up and measure properly. While unit tests will run the same in any environment, performance tests are inherently sensitive to the environment. For example, an enterprise cloud server will likely have better response time than a 7-year-old Macbook.
Why should performance tests not be unit tests?
Returning to the original question, it is theoretically possible to frame a performance test as a functional test by validating a specific measurement against a preset threshold. However, there are 3 main reasons why a unit test should not be a performance test:
Performance checks in unit tests make the build process more vulnerable to environmental issues. Bad measurements from environment issues could cause unit tests to fail for reasons unrelated to code correctness. Any unit test failure will block a build, trigger triage, and stall progress. This means time and money. The build process must not be interrupted by environment problems.
Proper performance tests require lots of setup beyond basic unit test support. Unit tests should be short and sweet, and unit testing frameworks don’t have the tools needed to take good measurements. Unit test environments are often not set up in tightly controlled environments, either. It would take a lot of work to properly put performance checks into a unit test.
Performance tests yield metrics that should not be shoehorned into a binary pass/fail status. Performance data is complex and rich with information. Teams should analyze performance data, especially over time. It can also be volatile.
These points are based on the explicit definitions provided above. Note that I am not saying that performance testing should not be done, but rather that performances checks should not be part of unit testing. Unit testing and performance testing should be categorically separate types of testing.
Which programming languages are best for writing test automation? There are several choices – just look at this list on Wikipedia and this cool decision graphs for choosing languages. While this topic can quickly devolve into a spat over personal tastes, I do believe there are objective reasons for why some languages are better for automating test cases than others.
Dividing Test Layers
First of all, unit tests should always be written in the same language as the product under test. Otherwise, they would definitionally no longer be unit tests! Unit tests are white box and need direct access to the product source code. This allows them to cover functions, methods, and classes.
The question at hand pertains more to higher-layer functional tests. These tests fall into many (potentially overlapping) categories: integration, end-to-end, system, acceptance, regression, and even performance. Since they are all typically black box, higher-layer tests do not necessarily need to be written in the same language as the product under test.
My Opinionated Choices
Personally, I think Python is today’s best all-around language for test automation. Python is wonderful because its conciseness lets the programmer expressively capture the essence of the test case. It also has very rich test support packages. Check out this article: Why Python is Great for Test Automation. Java is a good choice as well – it has a rich platform of tools and packages, and continuous integration with Java is easy with Maven/Gradle/ANT and Jenkins. I’ve heard that Ruby is another good choice for reasons similar to Python, but I have not used it myself.
Other languages are poor choices for test automation. While they could be used for automation, they likely should not be used. C and C++ are inconvenient because they are very low-level and lack robust frameworks. Perl is dangerous because it simply does not provide the consistency and structure for scalable, self-documenting code. Functional languages like LISP and Haskell are difficult because they do not translate well from test case procedures. They may be useful, however, for some lower-level data testing.
8 Criteria for Evaluation
There are eight major points to consider when evaluating any language for automation. These criteria specifically assess the language from a perspective of purity and usability, not necessarily from a perspective of immediate project needs.
Usability. A good automation language is fairly high-level and should handle rote tasks like memory management. Lower learning curves are preferable. Development speed is also important for deadlines.
Elegance. The process of translating test case procedures into code must be easy and clear. Test code should also be concise and self-documenting for maintainability.
Available Test Frameworks. Frameworks provide basic needs such as fixtures, setup/cleanup, logging, and reporting. Examples include Cucumber and xUnit.
Available Packages. It is better to use off-the-shelf packages for common operations, such as web drivers (Selenium), HTTP requests, and SSH.
Powerful Command Line. A good CLI makes launching tests easy. This is critical for continuous integration, where tests cannot be launched manually.
Easy Build Integration. Build automation should launch tests and report results. Difficult integration is a DevOps nightmare.
IDE Support. Because Notepad and vim just don’t cut it for big projects.
Industry Adoption. Support is good. If the language remains popular, then frameworks and packages will be maintained well.
Below, I rated each point for a few popular languages:
Available Test Frameworks
Powerful Command Line
Easy Build Integration
I won’t shy away from my preference for Python, but I recognize that they may not be the right choice for all situations. For example, when I worked at LexisNexis, we used C# because management wanted developers, who wrote the app in C#, to contribute to test automation.
Now, a truly nifty idea would be to create a domain-specific language for test automation, but that must be a topic for another post.
UPDATE: I changed some recommendations on 4/18/2018.
Automation has a lot of potential to improve software development. Unfortunately, though, automation is often seen as a luxury. Deadlines in the real word are unforgiving, and since test code isn’t product code, automation tasks are given lower priority and dunked into the black hole of the backlog. Some might argue that this is okay because it is lean or because a new project is just getting started. Once, I even heard it quipped that the first ones cut during a layoff are the automation folks. And it is true that automation requires a nontrivial resource investment.
However, I want to turn the tables. Instead of thinking about automation in terms of the opportunity, think about automation in terms of the opportunity cost. What happens if you don’t automate your tests from the get-go? There are 10 major things you lose:
#1: Man Hours
Automated tests will automatically run. Manual tests must be manually run. That’s ontological. If you only run a test one time, then automation has no return-on-investment. But if you run a test more than once, automation saves a tester from repeating themselves. Plus, it’s easy: push the button and wait for results. Automated tests almost always run faster than manual tests, too. Considering that time is money and engineer salaries aren’t cheap, man hours are a clear opportunity cost.
Automated tests can achieve greater coverage than manual tests, particularly for regression testing. As product development progresses, the sheer number of test cases increases. For example, in Agile, new tests will be created every sprint. Older tests must be run periodically to verify that new features don’t break existing features. If regression tests are manual, then testers must burn hours grinding through the same tests repeatedly. Often, for expediency, this means that they skip some tests – not in the sense of being lazy, but rather as part of a risk-based approach. Weaker coverage plus risk of missing bugs are accepted for the sake of shorter testing time. If those regression tests were automated, then there would be no reason to shrink coverage, because they would be easy to run.
People make mistakes. It’s human nature – nobody’s perfect. And manual tests are prone to human error because humans run them. I remember how nervous I felt running manual on-call system checks at MaxPoint for the first time, afraid that I would miss a problem that could bring down a million-dollar bidding system. Automated scripts run the same way every time.
Continuous integration (CI) protects code against defects by building and testing every code change in real time. A CI system will automatically trigger tests all the time.Tests not running in CI (like manual tests) are effectively dead. At NetApp, failing code changes would immediately be kicked out of the code line, making automated tests act like a vaccine against bugs. On the other hand, I remember a project at MaxPoint that was riddled with bugs and perpetually delayed. When I asked the developers to see their unit tests, they said they never wrote unit tests because “it wasn’t a requirement.”
#5: Delivery Time
Continuous delivery (CD) is the natural extension of continuous integration, in which software products can automatically be delivered (and potentially even deployed) as the final step in a CI pipeline. This is how big companies like Google, Facebook, and Netflix can deliver so rapidly. No automation means no CD.
#6: Results and Metrics
Non-engineers (managers, product owners, scrum masters, oh my!) love to ask questions about tests. “Are we red or green?” “How many tests do we have for this feature?” “What’s our coverage?” “How often do we run the tests?” Automated tests simply yield more accurate and more comprehensive results. Automation can also generate test reports, so engineers don’t need to waste time drafting emails or updating wiki pages.
Numbers don’t lie. Scripts don’t lie. Engineers typically don’t lie, but… results from manual tests can have a fudge factor, or a mistake in reporting, or any other sort of inconsistency. Inaccurate results may lead to poor business decisions. Automated results tell it like it is.
Manual testing can devolve into repetitive, menial labor: just follow steps 1-10 again and again and again. It would be much more effective for manual testers to focus on exploratory testing rather than deterministic testing. While automated tests can cover the fixed, repetitive test scenarios, exploratory testing lets testers find creative ways to uncover defects and judge how well a product actually works. Lack of automation ties up human capital.
#9: Peace of Mind
Are you sure that your product is “good”? Can you run enough tests to make sure? I learned the value of peace of mind while I was still in college. In my compiler theory course, I had to develop a simple programming language and build a compiler for it. Every week, we had to add new language features: arithmetic, strings, arrays, functions, etc. And every week, I wrote a slew of mini-programs to test grammar updates to my new language. By the time the project was complete, I had 1000+ automated test cases running through JUnit with 100% coverage, and the entire suite took a mere few minutes to run. And there were many late nights when the tests caught bugs in my language right away before committing code. There was no way I could have passed that class without my automated tests.
The ultimate purpose of test automation is product quality. Having automation doesn’t necessarily mean product quality is good, but not having automation severely limits how quality can be pursued. Anecdotally, I’ve seen much better code quality come out of projects that have good test automation than ones without it. If I were a product owner, I know what I would want.