
Democratizing the Screenplay Pattern

I started Boa Constrictor back in 2018 because I loathed page objects. On a previous project, I saw page objects balloon to several thousand lines long with duplicative methods. Developing new tests became a nightmare, and about 10% of tests failed daily because they didn’t handle waiting properly.

So, while preparing a test strategy at a new company, I invested time in learning the Screenplay Pattern. To be honest, the pattern seemed a bit confusing at first, but I was willing to try anything other than page objects again. Eventually, it clicked for me: Actors use Abilities to perform Interactions. Boom! It was a clean separation of concerns.

Unfortunately, the only major implementations I could find for the Screenplay Pattern at the time were Serenity BDD in Java and JavaScript. My company was a .NET shop. I looked for C# implementations, but I didn’t find anything that I trusted. So, I took matters into my own hands and implemented the Screenplay Pattern myself in .NET. Initially, I implemented Selenium WebDriver interactions. Later, my team and I added RestSharp interactions. We eventually released Boa Constrictor as an open source project in October 2020 as part of Hacktoberfest.

With Boa Constrictor, I personally sought to reinvigorate interest in the Screenplay Pattern. By bringing the Screenplay Pattern to .NET, we enabled folks outside of the Java and JavaScript communities to give it a try. With our rich docs, examples, and videos, we made it easy to onboard new users. And through conference talks and webinars, we popularized the concepts behind Screenplay, even for non-C# programmers. It’s been awesome to see so many other folks in the testing community start talking about the Screenplay Pattern in the past few years.

I also wanted to provide a standalone implementation of the Screenplay Pattern. Since the Screenplay Pattern is a design for automating interactions, it could and should integrate with any .NET test framework: SpecFlow, MSTest, NUnit, xUnit.net, and any others. With Boa Constrictor, we focused singularly on making interactions as excellent as possible, and we let other projects handle separate concerns. I did not want Boa Constrictor to be locked into any particular tool or system. In this sense, Boa Constrictor diverged from Serenity BDD – it was not meant to be a .NET version of Serenity, despite taking much inspiration from Serenity.

Furthermore, in the design and all the messaging for Boa Constrictor, I strived to make the Screenplay Pattern easy to understand. So many folks I knew gave up on Screenplay in the past because they thought it was too complicated. I wanted to break things down so that any automation developer could pick it up quickly. Hence, I formed the soundbite, “Actors use Abilities to perform Interactions,” to describe the pattern in one line. I also coined the project’s slogan, “Better Interactions for Better Automation,” to clearly communicate why Screenplay should be used over alternatives like raw calls or page objects.

So far, Boa Constrictor has succeeded modestly well in these goals. Now, the project is pursuing one more goal: democratizing the Screenplay Pattern.

At its heart, the Screenplay Pattern is a generic pattern for any kind of interactions. The core pattern should not favor any particular tool or package. Anyone should be able to implement interaction libraries using the tools (or “Abilities”) they want, and each of those libraries should be treated equally without preference. Recently, in our plans for Boa Constrictor 3, we announced that we want to create separate packages for the “core” pattern and for each library of interactions. We also announced plans to add new libraries for Playwright and Applitools. The existing libraries – Selenium WebDriver and RestSharp – need not be the only libraries. Boa Constrictor was never meant to be merely a WebDriver wrapper or a superior page object. It was meant to provide better interactions for any kind of test automation.
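To make "Actors use Abilities to perform Interactions" concrete, here is a minimal sketch of a tool-agnostic core with one toy interaction library plugged in. It is written in Python purely for illustration and is not Boa Constrictor's API: every name in it (Actor, Task, Question, UseNotebook, WriteNote, NoteCount) is hypothetical.

```python
from abc import ABC, abstractmethod


# --- Core pattern: knows nothing about any specific tool ---

class Actor:
    """Holds Abilities and uses them to perform Interactions."""

    def __init__(self, name):
        self.name = name
        self._abilities = {}

    def can(self, ability):
        # Register an Ability, keyed by its type.
        self._abilities[type(ability)] = ability
        return self

    def using(self, ability_type):
        # Look up an Ability; fail loudly if the Actor lacks it.
        if ability_type not in self._abilities:
            raise RuntimeError(
                f"{self.name} lacks the ability {ability_type.__name__}")
        return self._abilities[ability_type]

    def attempts_to(self, task):
        task.perform_as(self)

    def asks_for(self, question):
        return question.answered_by(self)


class Task(ABC):
    """An Interaction that does something."""

    @abstractmethod
    def perform_as(self, actor): ...


class Question(ABC):
    """An Interaction that returns an answer."""

    @abstractmethod
    def answered_by(self, actor): ...


# --- A hypothetical interaction library built on the core ---

class UseNotebook:
    """Toy Ability: wraps the 'tool' (here, just a list of pages)."""

    def __init__(self):
        self.pages = []


class WriteNote(Task):
    def __init__(self, text):
        self.text = text

    def perform_as(self, actor):
        actor.using(UseNotebook).pages.append(self.text)


class NoteCount(Question):
    def answered_by(self, actor):
        return len(actor.using(UseNotebook).pages)


andy = Actor("Andy").can(UseNotebook())
andy.attempts_to(WriteNote("Actors use Abilities to perform Interactions"))
print(andy.asks_for(NoteCount()))  # prints 1
```

In this sketch the core classes never import the notebook library, so a library for a different tool could be added the same way without touching the core.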

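In version 3.0.0, we successfully separated the Boa.Constrictor project into three new .NET projects and released a NuGet package for each: one for the core Screenplay Pattern (Boa.Constrictor.Screenplay), one for Selenium WebDriver interactions (Boa.Constrictor.Selenium), and one for RestSharp interactions.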

This separation enables folks to pick the parts they need. If they only need Selenium WebDriver interactions, then they can use just the Boa.Constrictor.Selenium package. If they want to implement their own interactions and don’t need Selenium or RestSharp, then they can use the Boa.Constrictor.Screenplay package without being forced to take on those extra dependencies.

Furthermore, we continued to maintain the “classic” Boa.Constrictor package. Now, this package simply declares dependencies on the other three packages in order to preserve backwards compatibility for folks who used previous versions of Boa Constrictor. As part of the upgrade from 2.0.x to 3.0.x, we did change some namespaces (which are documented in the project changelog), but the rest of the code remained the same. We wanted the upgrade to be as straightforward as possible.

The core contributors and I will continue to implement our plans for Boa Constrictor 3 over the coming weeks. There’s a lot to do, and we will do our best to implement new code with thoughtfulness and quality. We will also strive to keep everything documented. Please be patient with us as development progresses. We also welcome your contributions, ideas, and feedback. Let’s make Boa Constrictor excellent together.

My Upside-Down QA Story for Global Testers Day

Happy Global Testers Day! For 2021, QA Touch is celebrating with webinars, games, competitions, blogs, and videos. I participated by sharing an “upside-down” story from years ago when I accidentally wiped out all of NetApp’s continuous integration testing. Please watch my story below. I hope you find it both insightful and entertaining!

Skin Rashes and Software Testing

Many of you know me as the “Automation Panda” through my blog, my Twitter handle, or my online courses. Maybe you’ve attended one of my conference talks. I love connecting with others, but many people don’t get to know me personally. Behind my black-and-white façade, I’m a regular guy. When I’m not programming, I enjoy cooking and video gaming. I’m also currently fixing up a vintage Volkswagen Beetle. However, for nearly two years, I’ve suffered a skin rash that will not go away. I haven’t talked about it much until recently, when it became unbearable.

For a while, things turned bad. Thankfully, things are a little better now. I’d like to share my journey publicly because it helps to humanize folks in tech like me. We’re all people, not machines. Vulnerability is healthy. My journey also reminded me of a few important tenets for testing software. I hope you’ll read my story, and I promise to avoid the gross parts.

Distant Precursors

I’ve been blessed with great health and healthcare my entire life. However, when I was a teenager, I had a weird skin issue: the skin around my right eye socket turned dry, itchy, and red. Imagine dandruff, but on your face – it was flaky like a bad test (ha!). Lotions and creams did nothing to help. After it persisted for several weeks, my parents scheduled an appointment with my pediatrician. They didn’t know what caused it, but they gave me a sample tube of topical steroids to try. The steroids worked great. The skin around my eye cleared up and stayed normal. This issue resurfaced again while I was in college, but it went away on its own after a month or two.

Rise of the Beast

Around October 2019, I noticed this same rash started appearing again around my right eye for the first time in a decade. The exact date is fuzzy in my mind, but I remember it started before TestBash San Francisco 2019. At that time, my response was to ignore it. I’d continue to keep up my regular hygiene (like washing my face), and eventually it would go away like last time.

Unfortunately, the rash got worse. It started spreading to my cheek and my forehead. I started using body lotion on it, but it would burn whenever I’d apply it, and the rash would persist. My wife started trying a bunch of fancy, high-dollar lotions (like Kiehl’s and other brands I didn’t know), but they didn’t help at all. By Spring 2020, my hands and forearms started breaking out with dry, itchy red spots, too. These were worse: if I scratched them, they would bleed. I also remember taking my morning shower one day and thinking to myself, “Gosh, my whole body is super itchy!”

I had enough, so I visited a local dermatologist. He took one look at my face and arms and prescribed topical steroids. He told me to use them for about four weeks and then stop for two weeks before a reevaluation. The steroids partially worked. The itching would subside, but the rash wouldn’t go away completely. When I stopped using the steroids, the rash returned to the same places. I also noticed the rash slowly spreading. Eventually, it crept up my upper arms to my neck and shoulders, down my back and torso, and all the way down to my legs.

On a second attempt, the dermatologist prescribed a much stronger topical steroid in addition to a round of oral steroids. My skin healed much better with the stronger medicine, but, inevitably, when I stopped using them, the rash returned. By now, patches of the rash could be found all over my body, and they itched like crazy. I couldn’t wear white shirts anymore because spots would break out and bleed, as if I nicked myself while shaving. I don’t remember the precise timings, but I also asked the dermatologist to remove a series of three moles on my body that became infected badly by the rash.

Fruitless Mitigations

As a good tester, I wanted to know the root cause for my rash. Was I allergic to something? Did I have a deeper medical issue? Steroids merely addressed the symptoms, and they did a mediocre job at best. So, I tried investigating what triggered my rash.

When the dry patch first appeared above my eye, I suspected cold, dry weather. “Maybe the crisp winter air is drying my skin too much.” That’s when I tried using an assortment of creams. When the rash started spreading, that’s when I knew the cause was more than winter weather.

Then, my mind turned to allergies. I knew I was allergic to pet dander, but I never had reactions to anything else before. At the same time the rash started spreading from my eye, I noticed I had a small problem with mangoes. Whenever I would bite into a fresh mango, my lips would become severely, painfully chapped for the next few days. I learned from reading articles online that some folks who are allergic to poison ivy (like me) are also sensitive to mangoes because both plants contain urushiol in their skin and sap. At that time, my family and I were consuming a boxful of mangoes. I immediately cut out mangoes and hoped for the best. No luck.

When the rash spread to my whole body, I became serious about finding the root cause. Every effect has a cause, and I wanted to investigate all potential causes in order of likelihood. I already crossed dry weather and mangoes off the list. Since the rash appeared in splotches across my whole body, I reasoned its trigger could either be external – something coming in contact with all parts of my skin – or internal – something not right from the inside pushing out.

What comes in contact with the whole body? Air, water, and cloth. Skin reactions to things in air and water seemed unlikely, so I focused on clothing and linens. That’s when I remembered a vague story from childhood. When my parents taught me how to do laundry, they told me they used a scent-free, hypoallergenic detergent because my dad had a severe skin reaction to regular detergent one time long before I was born. Once I confirmed the story with my parents, I immediately sprang into action. I switched over my detergent and fabric softener. I rewashed all my clothes and linens – all of them. I even thoroughly cleaned out my dryer duct to make sure no chemicals could leach back into the machine. (Boy, that was a heaping pile of dust.) Despite these changes, my rash persisted. I also changed my soaps and shampoos to no avail.

At the same time, I looked internally. I read in a few online articles that skin rashes could be caused by deficiencies. I started taking a daily multivitamin. I also tried supplements for calcium, Vitamin B6, Vitamin D, and collagen. Although I’m sure my body was healthier as a result, none of these supplements made a noticeable difference.

My dermatologist even did a skin punch test. He cut a piece of skin out of my back about 3mm wide through all layers of the skin. The result of the biopsy was “atopic dermatitis.” Not helpful.

For relief, I tried an assortment of creams from Eucerin, CeraVe, Aveeno, and O’Keeffe’s. None of them staved off the persistent itching or reduced the redness. They were practically useless. The only cream that had any impact (other than steroids) was an herbal Chinese medicine. With a cooling burn, it actually stopped the itch and visibly reduced the redness. By Spring 2021, I stopped going to the dermatologist and simply relied on the Chinese cream. I didn’t have a root cause, but I had an inexpensive mitigation that was 差不多 (chà bù duō; “almost even” or “good enough”).

Insufferability

Up until Summer 2021, my rash was mostly an uncomfortable inconvenience. Thankfully, since everything shut down for the COVID pandemic, I didn’t need to make public appearances with unsightly skin. The itchiness was the worst symptom, but nobody would see me scratch at home.

Then, around the end of June 2021, the rash got worse. My whole face turned red, and my splotches became itchier than ever. Worst of all, the Chinese cream no longer had much effect. Timing was lousy, too, since my wife and I were going to spend most of July in Seattle. I needed medical help ASAP. I needed to see either my primary care physician or an allergist. I called both offices to schedule appointments. The wait time for my primary doctor was 1.5 months, while the wait time for an allergist was 3 months! Even if I hadn’t been going to Seattle, I couldn’t have seen a doctor anyway.

My rash plateaued while in Seattle. It was not great, but it didn’t stop me from enjoying our visit. I was okay during a quick stop in Salt Lake City, too. However, as soon as I returned home to the Triangle, the rash erupted. It became utterly unbearable – to the point where I couldn’t sleep at night. I was exhausted. My skin was raw. I could not focus on any deep work. I hit the point of thorough debilitation.

When I visited my doctor on August 3, she performed a series of blood tests. Those confirmed what the problem was not:

My metabolic panel was okay.
My cholesterol was okay.
My thyroid was okay.
I did not have celiac disease (gluten intolerance).
I did not have hepatitis C.

Nevertheless, these results did not indicate any culprit. My doctor then referred me to an allergist for further testing on August 19.

The two weeks between appointments were hell. I was not allowed to take steroids or antihistamines for one week before the allergy test. My doctor also prescribed me hydroxyzine under the presumption that the root cause was an allergy. Unfortunately, I did not react well to hydroxyzine. It did nothing to relieve the rash or the itching. Even though I took it at night, I would feel off the next day, to the point where I literally could not think critically while trying to do my work. It affected me so badly that I accidentally ran a red light. During the two weeks between appointments, I averaged about 4 hours of sleep per night. I had to take sick days off work, and on days I could work, I had erratic hours. (Thankfully, my team, manager, and company graciously accommodated my needs.) I had no relief. The creams did nothing. I even put myself on an elimination diet in a desperate attempt to avoid potential allergens.

If you saw this tweet, now you know the full story behind it:

Friends, I’m not okay.

I have a crippling skin rash that is interrupting sleep and making everyday tasks difficult.

I have an allergy test on Thursday. Please pray we get helpful answers.

In the meantime, please don’t expect much from me. #PandaDown 🐼
— Pandy Knight (@AutomationPanda) August 16, 2021

A Possible Answer

On August 19, 2021, the allergist finally performed a skin prick allergy test. A skin prick test is one of the fastest, easiest ways to reveal common allergies. The nurse drew dots down both of my forearms. She then lightly scratched my skin next to each dot with a plastic nub that had been dunked in an allergen. After waiting 15 minutes, she measured the diameter of each spot to determine the severity of the allergic reaction, if any. She must have tested about 60 different allergens.

The results yielded immediate answers:

I did not have any of the “Big 8” food allergies.
I am allergic to cats and dogs, which I knew.
I am allergic to certain pollens, which I suspected.
I am allergic to certain fungi, which is no surprise.

Then, there was the major revelation: I am allergic to dust mites.

Once I did a little bit of research, this made lots of sense. Dust mites are microscopic bugs that live in plush environments (like mattresses and pillows) and eat dead skin cells. The allergy is not to the mite itself but rather to its waste. They can appear anywhere but are typically most prevalent in bedrooms. My itchiness always seemed strongest at night while in bed. The worst areas on my skin were my face and upper torso, which have the most contact with my bed pillows and covers. Since I sleep in bed every night, the allergic reaction would be recurring and ongoing. No wonder I couldn’t get any relief!

I don’t yet know if eliminating dust mites will completely cure my skin problems, but the skin prick test at least provides hard evidence that I have a demonstrable allergy to dust mites.

Lessons for Software Testing

After nearly two years of suffering, I’m grateful to have a probable root cause for my rash. Nevertheless, I can’t help but feel frustrated that it took so long to find a meaningful answer. As a software tester, I feel like I should have been able to figure it out much sooner. Reflecting on my journey reminds me of important lessons for software testing.

First and foremost, formal testing with wide coverage is better than random checking. When I first got my rash, I tried to address it based on intuition. Maybe I should stop eating mangoes? Maybe I should change my shower soap, or my laundry detergent? Maybe eating probiotics will help? These ideas, while not entirely bad, were based more on conjecture than evidence. Checking them took weeks at a time and yielded unclear results. Compare that to the skin prick test, which took half an hour in total and immediately yielded definite answers. So many software development teams do their testing more like tossing mangoes than like a skin prick test. They spot-check a few things on the new features they develop and ship them instead of thoroughly covering at-risk behaviors. Spot checks feel acceptable when everything is healthy, but they are not helpful when something systemic goes wrong. Hard evidence is better than wild guesses. Running an established set of tests, even if they seem small or basic, can deliver immense value in a short time.

When tests yield results, knowing what is “good” is just as important as knowing what is “bad.” Frequently, software engineers only look at failing tests. If a test passes, who cares? Failures are the things that need attention. Well, passing tests rule out potential root causes. One of the best results from my allergy test is that I’m not allergic to any of the “Big 8” food allergens: eggs, fish, milk, peanuts, shellfish, soy, tree nuts, and wheat. That’s a huge relief, because I like to eat good food. When we as software engineers get stuck trying to figure out why things are broken, it may be helpful to remember what isn’t broken.

Unfortunately, no testing has “complete” coverage, either. My skin prick test covered about 60 common allergens. Thankfully, it revealed my previously-unknown allergy to dust mites, but I recognize that I might be allergic to other things not covered by the test. Even if I mitigate dust mites in my house, I might still have this rash. That worries me a bit. As a software tester, I worry about product behaviors I am unable to cover with tests. That’s why I try to maximize coverage on the riskiest areas with the resources I have. I also try to learn as much about the behaviors under test to minimize unknowns.

Testing is expensive but worthwhile. My skin prick allergy test cost almost $600 out of pocket. To me, that cost is outrageously high, but it was grudgingly worthwhile to get a definitive answer. (I won’t digress into problems with American healthcare costs.) Many software teams shy away from regular, formal testing work because they don’t want to spend the time doing it or pay the dollars for someone else to do it. I would’ve gladly shelled out a grand a year ago if I could have known the root cause to my rash. My main regret is not visiting an allergist sooner.

Finally, test results are useless without corrective action. Now that I know I have a dust mite allergy, I need to mitigate dust mites in my house:

I need to encase my mattress and pillow with hypoallergenic barriers that keep dust mites out.
I need to wash all my bedding in hot water (at least 130 °F) or freeze it for over 24 hours.
I need to deeply clean my bedroom suite to eliminate existing dust.
I need to maintain a stricter cleaning schedule for my house.
I need to upgrade my HVAC air filters.
I need to run an air purifier in my bedroom to eliminate any other airborne allergens.

In the worst case, I can take allergy shots to abate my symptoms.

Simply knowing my allergies doesn’t fix them. The same goes for software testing – testing does not improve quality; it merely indicates problems with quality. We as engineers must improve software behaviors based on feedback from testing, whether that means fixing bugs, improving user experience, or shipping warnings for known issues.

Next Steps

Now that I know I have an allergy to dust mites, I will do everything I can to abate them. I already ordered covers and an air purifier from Amazon. I also installed new HVAC air filters that catch more allergens. For the past few nights, I slept in a different bed, and my skin has noticeably improved. Hopefully, this is the main root cause and I won’t need to do more testing!

Are Automated Test Retries Good or Bad?

What happens when a test fails? If someone is manually running the test, then they will pause and poke around to learn more about the problem. However, when an automated test fails, the rest of the suite keeps running. Testers won’t get to view results until the suite is complete, and the automation won’t perform any extra exploration at the time of failure. Instead, testers must review logs and other artifacts gathered during testing, and they might even need to rerun the failed test to check if the failure is consistent.

Since testers typically rerun failed tests as part of their investigation, why not configure automated tests to automatically rerun failed tests? On the surface, this seems logical: automated retries can eliminate one more manual step. Unfortunately, automated retries can also enable poor practices, like ignoring legitimate issues.

So, are automated test retries good or bad? This is actually a rather controversial topic. I’ve heard many voices strongly condemn automated retries as an antipattern (see here, here, and here). While I agree that automated retries can be abused, I nevertheless still believe they can add value to test automation. A deeper understanding needs a nuanced approach.

So, how do automated retries work?

To avoid any confusion, let’s carefully define what we mean by “automated test retries.”

Let’s say I have a suite of 100 automated tests. When I run these tests, the framework will execute each test individually and yield a pass or fail result for the test. At the end of the suite, the framework will aggregate all the results together into one report. In the best case, all tests pass: 100/100.

However, suppose that one of the tests fails. Upon failure, the test framework would capture any exceptions, perform any cleanup routines, log a failure, and safely move onto the next test case. At the end of the suite, the report would show 99/100 passing tests with one test failure.

By default, most test frameworks will run each test one time. However, some test frameworks have features for automatically rerunning test cases that fail. The framework may even enable testers to specify how many retries to attempt. So, let’s say that we configure 2 retries for our suite of 100 tests. When that one test fails, the framework would queue that failing test to run twice more before moving onto the next test. It would also add more information to the test report. For example, if one retry passed but another one failed, the report would show 99/100 passing tests with a 1/3 pass rate for the failing test.
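The retry behavior described above can be sketched in a few lines. This is a minimal illustration, not how any particular framework implements it; the helper names (run_with_retries, scripted) and the scripted outcomes are hypothetical, chosen to reproduce the 99/100 and 1/3 figures from the example.

```python
def run_with_retries(test, retries=2):
    """Run `test` once; if it fails, run it `retries` more times.

    Returns a list of booleans, one per attempt (True means pass).
    Every attempt is recorded so the report can show a pass rate
    instead of overwriting the original failure.
    """
    def passes():
        try:
            test()
            return True
        except Exception:
            return False

    attempts = [passes()]
    if not attempts[0]:
        # The first attempt failed: queue all configured retries.
        attempts.extend(passes() for _ in range(retries))
    return attempts


def scripted(outcomes):
    """Build a fake test that passes or fails according to `outcomes`."""
    remaining = iter(outcomes)

    def test():
        if not next(remaining):
            raise AssertionError("scripted failure")

    return test


# 99 stable tests plus one test that fails, passes on retry, then fails again.
suite = {f"test_{i}": scripted([True]) for i in range(99)}
suite["test_flaky"] = scripted([False, True, False])

results = {name: run_with_retries(test, retries=2)
           for name, test in suite.items()}

passed_first_try = sum(1 for attempts in results.values() if attempts[0])
print(f"{passed_first_try}/{len(results)} passing")        # 99/100 passing

flaky = results["test_flaky"]
print(f"test_flaky pass rate: {sum(flaky)}/{len(flaky)}")  # 1/3
```

Note that the test that failed on its first attempt still counts as a failure in the top-level total; the retries only add detail about how it failed.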

In this article, we will focus on automated retries for test cases. Testers could also program other types of retries into automated tests, such as retrying browser page loads or REST requests. Interaction-level retries require sophisticated, context-specific logic, whereas test-level retry logic works the same for any kind of test case. (Interaction-level retries would also need their own article.)

Automated retries can be a terrible antipattern

Let’s see how automated test retries can be abused:

Jeremy is a member of a team that runs a suite of 300 automated tests for their web app every night. Unfortunately, the tests are notoriously flaky. About a dozen different tests fail every night, and Jeremy spends a lot of time each morning triaging the failures. Whenever he reruns failed tests individually on his laptop, they almost always pass.

To save himself time in the morning, Jeremy decides to add automatic retries to the test suite. Whenever a test fails, the framework will attempt one retry. Jeremy will only investigate tests whose retries failed. If a test had a passing retry, then he will presume that the original failure was just a flaky test.

Ouch! There are several problems here.

First, Jeremy is using retries to conceal information rather than reveal information. If a test fails but its retries pass, then the test still reveals a problem! In this case, the underlying problem is flaky behavior. Jeremy is using automated retries to overwrite intermittent failures with intermittent passes. Instead, he should investigate why the tests are flaky. Perhaps automated interactions have race conditions that need more careful waiting. Or, perhaps features in the web app itself are behaving unexpectedly. Test failures indicate a problem – either in test code, product code, or infrastructure.

Second, Jeremy is using automated retries to perpetuate poor practices. Before adding automated retries to the test suite, Jeremy was already manually retrying tests and disregarding flaky failures. Adding retries to the test suite merely speeds up the process, making it easier to sidestep failures.

Third, the way Jeremy uses automated retries indicates that the team does not value their automated test suite very much. Good test automation requires effort and investment. Persistent flakiness is a sign of neglect, and it fosters low trust in testing. Using retries is merely a “band-aid” on both the test failures and the team’s attitude about test automation.

In this example, automated test retries are indeed a terrible antipattern. They enable Jeremy and his team to ignore legitimate issues. In fact, they incentivize the team to ignore failures because they institutionalize the practice of replacing red X’s with green checkmarks. This team should scrap automated test retries and address the root causes of flakiness.

[Image: a green check mark and a red X] Testers should not conceal failures by overwriting them with passes.
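One root cause mentioned in Jeremy's story is a race condition in automated interactions. As a rough sketch of what "more careful waiting" could look like, the helper below polls a condition until it holds or a timeout expires, instead of checking once and failing. The wait_until function and its parameters are hypothetical, and the "element" is simulated with a timestamp rather than a real browser.

```python
import time


def wait_until(condition, timeout=5.0, interval=0.1):
    """Poll `condition` until it returns a truthy value.

    Returns True as soon as the condition holds.
    Raises TimeoutError if it never holds within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Simulate something that becomes ready 0.2 seconds from now.
ready_at = time.monotonic() + 0.2

# A single immediate check would race with the app and fail here;
# polling gives the app time to catch up.
became_ready = wait_until(lambda: time.monotonic() >= ready_at,
                          timeout=2.0, interval=0.05)
print(became_ready)  # True
```

Fixing the wait addresses the flakiness itself, whereas a retry would only rerun the same race.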

Automated retries are not the main problem

Ignoring flaky failures is unfortunately all too common in the software industry. I must admit that in my days as a newbie engineer, I was guilty of rerunning tests to get them to pass. Why do people do this? The answer is simple: intermittent failures are difficult to resolve.

Testers love to find consistent, reproducible failures because those are easy to explain. Other developers can’t push back against hard evidence. However, intermittent failures take much more time to isolate. Root causes can become mind-bending puzzles. They might be triggered by environmental factors or awkward timings. Sometimes, teams never figure out what causes them. In my personal experience, bug tickets for intermittent failures get far less traction than bug tickets for consistent failures. All these factors incentivize folks to turn a blind eye to intermittent failures when convenient.

Automated retries are just a tool and a technique. They may enable bad practices, but they aren’t inherently bad. The main problem is willfully ignoring certain test results.

Automated retries can be incredibly helpful

So, what is the right way to use automated test retries? Use them to gather more information from the tests. Test results are simply artifacts of feedback. They reveal how a software product behaved under specific conditions and stimuli. The pass-or-fail nature of assertions simplifies test results at the top level of a report in order to draw attention to failures. However, reports can give more information than just binary pass-or-fail results. Automated test retries yield a series of results for a failing test that indicate a success rate.

For example, SpecFlow and the SpecFlow+ Runner make it easy to use automatic retries the right way. Testers simply need to add the retryFor setting to their SpecFlow+ Runner profile to set the number of retries to attempt. In the final report, SpecFlow records the success rate of each test with color-coded counts. Results are revealed, not concealed.

Here is a snippet of the SpecFlow+ Report showing both intermittent failures (in orange) and consistent failures (in red).

This information jumpstarts analysis. As a tester, one of the first questions I ask myself about a failing test is, “Is the failure reproducible?” Without automated retries, I need to manually rerun the test to find out – often at a much later time and potentially within a different context. With automated retries, that step happens automatically and in the same context. Analysis then takes two branches:

If all retry attempts failed, then the failure is probably consistent and reproducible. I would expect it to be a clear functional failure that would be fast and easy to report. I jump on these first to get them out of the way.
If some retry attempts passed, then the failure is intermittent, and it will probably take more time to investigate. I will look more closely at the logs and screenshots to determine what went wrong. I will try to exercise the product behavior manually to see if the product itself is inconsistent. I will also review the automation code to make sure there are no unhandled race conditions. I might even need to rerun the test multiple times to measure a more accurate failure rate.
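The two branches above amount to a simple classification over each test's attempt history. Here is a minimal sketch, assuming each test's results are recorded as a list of booleans (first attempt followed by any retries); the classify function, the labels, and the sample test names are hypothetical.

```python
def classify(attempts):
    """Classify a test from its attempt history (True means pass)."""
    if all(attempts):
        return "passed"
    if not any(attempts):
        return "consistent failure"    # every attempt failed: reproducible
    return "intermittent failure"      # mixed results: needs deeper digging


# Hypothetical results: first attempt, then retries if it failed.
results = {
    "test_login": [True],
    "test_checkout": [False, False, False],
    "test_search": [False, True, True],
}

# No failure is dropped; consistent failures are simply triaged first.
failures = [name for name, attempts in results.items()
            if classify(attempts) != "passed"]
triage_order = sorted(
    failures,
    key=lambda name: classify(results[name]) != "consistent failure")

for name in triage_order:
    attempts = results[name]
    print(f"{name}: {classify(attempts)} "
          f"({sum(attempts)}/{len(attempts)} attempts passed)")
```

Both kinds of failure stay in the triage list; the classification only decides which ones to look at first.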

I do not ignore any failures. Instead, I use automated retries to gather more information about the nature of the failures. In the moment, this extra info helps me expedite triage. Over time, the trends this info reveals help me identify weak spots in both the product under test and the test automation.

Automated retries are most helpful at high scale

When used appropriately, automated retries can be helpful for any size test automation project. However, they are arguably more helpful for large projects running tests at high scale than small projects. Why? Two main reasons: complexities and priorities.

Large-scale test projects have many moving parts. For example, at PrecisionLender, we presently run 4K-10K end-to-end tests against our web app every business day. (We also run ~100K unit tests every business day.) Our tests launch from TeamCity as part of our Continuous Integration system, and they use in-house Selenium Grid instances to run 50-100 tests in parallel. The PrecisionLender application itself is enormous, too.

Intermittent failures are inevitable in large-scale projects for many different reasons. There could be problems in the test code, but those aren’t the only possible problems. At PrecisionLender, Boa Constrictor already protects us from race conditions, so our intermittent test failures are rarely due to problems in automation code. Other causes for flakiness include:

The app’s complexity makes certain features behave inconsistently or unexpectedly
Extra load on the app slows down response times
The cloud hosting platform has a service blip
Selenium Grid arbitrarily chokes on a browser session
The DevOps team recycles some resources
An engineer makes a system change while tests are running
The CI pipeline deploys a new change in the middle of testing

Many of these problems result from infrastructure and process. They can’t easily be fixed, especially when environments are shared. As one tester, I can’t rewrite my whole company’s CI pipeline to be “better.” I can’t rearchitect the app’s whole delivery model to avoid all collisions. I can’t perfectly guarantee 100% uptime for my cloud resources or my test tools like Selenium Grid. Some of these might be good initiatives to pursue, but one tester’s dictates do not immediately become reality. Many times, we need to work with what we have. Curt demands to “just fix the tests” come off as pedantic.

Automated test retries provide very useful information for discerning the nature of such intermittent failures. For example, at PrecisionLender, we hit Selenium Grid problems frequently. Roughly 1/10000 Selenium Grid browser sessions will inexplicably freeze during testing. We don’t know why this happens, and our investigations have been unfruitful. We chalk it up to minor instability at scale. Whenever the 1/10000 failure strikes, our suite’s automated retries kick in and pass. When we review the test report, we see the intermittent failure along with its exception method. Based on its signature, we immediately know that test is fine. We don’t need to do extra investigation work or manual reruns. Automated retries gave us the info we needed.
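
To put the 1/10000 figure in perspective, here is a rough back-of-the-envelope calculation. It assumes one browser session per test and assumes that a freeze on a retry is independent of the freeze on the first attempt. Both are simplifying assumptions for illustration, not measurements.

```python
def expected_freezes_per_day(sessions_per_day, freeze_odds=10_000):
    """Expected number of frozen sessions per day at a 1-in-freeze_odds rate."""
    return sessions_per_day / freeze_odds


def chance_retry_also_freezes(freeze_odds=10_000):
    """Chance that a test freezes on both its first attempt and one retry,
    assuming the two freezes are independent (an assumption)."""
    return (1 / freeze_odds) ** 2


# At 4K-10K tests per day, that is roughly 0.4 to 1 frozen session daily.
low = expected_freezes_per_day(4_000)
high = expected_freezes_per_day(10_000)
```

Under those assumptions, a freeze shows up somewhere in the suite on most days, while the same test freezing twice in a row is about a 1-in-100-million event. That is consistent with retries passing whenever the freeze strikes.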

Selenium Grid is a large cluster with many potential points of failure.
(Image source: LambdaTest.)

Another type of common failure is intermittently slow performance in the PrecisionLender application. Occasionally, the app will freeze for a minute or two and then recover. When that happens, we see a “brick wall” of failures in our report: all tests during that time frame fail. Then, automated retries kick in, and the tests pass once the app recovers. Automatic retries prove in the moment that the app momentarily froze but that the individual behaviors covered by the tests are okay. This indicates functional correctness for the behaviors amidst a performance failure in the app. Our team has used these kinds of results on multiple occasions to identify performance bugs in the app by cross-checking system logs and database queries during the time intervals for those brick walls of intermittent failures. Again, automated retries gave us extra information that helped us find deep issues.

Automated retries delineate failure priorities

That answers complexity, but what about priority? Unfortunately, in large projects, there is more work to do than any team can handle. Teams need to make tough decisions about what to do now, what to do later, and what to skip. That’s just business. Testing decisions become part of that prioritization.

In almost all cases, consistent failures are inherently a higher priority than intermittent failures because they have a greater impact on the end users. If a feature fails every single time it is attempted, then the user is blocked from using the feature, and they cannot receive any value from it. However, if a feature works some of the time, then the user can still get some value out of it. Furthermore, the rarer the intermittency, the lower the impact, and consequently the lower the priority. Intermittent failures are still important to address, but they must be prioritized relative to other work at hand.

Automated test retries automate that initial prioritization. When I triage PrecisionLender tests, I look into consistent “red” failures first. Our SpecFlow reports make them very obvious. I know those failures will be straightforward to reproduce, explain, and hopefully resolve. Then, I look into intermittent “orange” failures second. Those take more time. I can quickly identify issues like Selenium Grid disconnections, but other issues may not be obvious (like system interruptions) or may need additional context (like the performance freezes). Sometimes, we may need to let tests run for a few days to get more data. If I get called away to another more urgent task while I’m triaging results, then at least I could finish the consistent failures. It’s a classic 80/20 rule: investigating consistent failures typically gives more return for less work, while investigating intermittent failures gives less return for more work. It is what it is.
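
The "red first, orange second" ordering can be sketched as a simple sort on failure rate. This is an illustration only: the function name, the input shape (test name mapped to a list of attempt outcomes), and the sample test names are all hypothetical.

```python
def triage_order(results):
    """Return failing test names in suggested triage order.

    `results` maps a test name to its attempt outcomes (True = pass).
    Consistent failures have a failure rate of 1.0, so they sort first,
    followed by intermittent failures from most to least frequent.
    """
    def failure_rate(attempts):
        return attempts.count(False) / len(attempts)

    failing = {name: a for name, a in results.items() if not all(a)}
    return sorted(failing, key=lambda name: (-failure_rate(failing[name]), name))


sample = {
    "login": [True],                   # passed, not triaged
    "upload": [False, False, False],   # consistent ("red")
    "search": [False, True],           # intermittent ("orange")
    "pricing": [False, False, True],   # intermittent, fails more often
}
ordered = triage_order(sample)
```

If triage gets interrupted partway through the list, the consistent failures at the front have already been handled.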

The only time I would prioritize an intermittent failure over a consistent failure would be if the intermittent failure causes catastrophic or irreversible damage, like wiping out an entire system, corrupting data, or burning money. However, that type of disastrous failure is very rare. In my experience, almost all intermittent failures are due to poorly written test code, automation timeouts from poor app performance, or infrastructure blips.

Context matters

Automated test retries can be a blessing or a curse. It all depends on how testers use them. If testers use retries to reveal more information about failures, then retries greatly assist triage. Otherwise, if testers use retries to conceal intermittent failures, then they aren’t doing their jobs as testers. Folks should not be quick to presume that automated retries are always an antipattern. We couldn’t achieve our scale of testing at PrecisionLender without them. Context matters.

Should Gherkin Steps use Past, Present, or Future Tense?

Gherkin’s Given-When-Then syntax is a great structure for specifying behaviors. However, while writing Gherkin may seem easy, writing good Gherkin can be a challenge. One aspect to consider is the tense used for Gherkin steps. Should Gherkin steps use past, present, or future tense?

One approach is to use present tense for all steps, like this:

Scenario: Simple Google search
    Given the Google home page is displayed
    When the user searches for "panda"
    Then the results page shows links related to "panda"

Notice the tense of each verb:

the home page is – present
the user searches – present
the results page shows – present

Present tense is the simplest verb tense to use. It is the least “wordy” tense, and it makes the scenario feel active.

An alternative approach is to use past-present-future tense for Given-When-Then steps respectively, like this:

Scenario: Simple Google search
    Given the Google home page was displayed
    When the user searches for "panda"
    Then the results page will show links related to "panda"

Notice the different verb tenses in this scenario:

the home page was – past
the user searches – present
the results page will show – future

Scenarios exercise behavior. Writing When steps using present tense centers the scenario’s main actions in the present. Since Given steps must happen before the main actions, they would be written using past tense. Likewise, since Then steps represent expected outcomes after the main actions, they would be written using future tense.

Both of these approaches – using all present tense or using past-present-future in order – are good. Personally, I prefer to write all steps using present tense. It’s easier to explain to others, and it frames the full scenario in the moment. However, I don’t think other approaches are good. For example, writing all steps using past tense or future tense would seem weird, and writing steps in order of future-present-past tense would be illogical. Scenarios should be centered in the present because they should timelessly represent the behaviors they cover.

Want to learn more? Check out my other BDD articles, especially Writing Good Gherkin.

Solving: How to write good UI interaction tests? #GivenWhenThenWithStyle

Writing good Gherkin is a passion of mine. Good Gherkin means good behavior specification, which results in better features, better tests, and ultimately better software. To help folks improve their Gherkin skills, Gojko Adzic and SpecFlow are running a series of #GivenWhenThenWithStyle challenges. I love reading each new challenge, and in this article, I provide my answer to one of them.

The Challenge

Challenge 20 states:

This week, we’re looking into one of the most common pain points with Given-When-Then: writing automated tests that interact with a user interface. People new to behaviour driven development often misunderstand what kind of behaviour the specifications should describe, and they write detailed user interactions in Given-When-Then scenarios. This leads to feature files that are very easy to write, but almost impossible to understand and maintain.

Here’s a typical example:

Scenario: Signed-in users get larger capacity
 
Given a user opens https://www.example.com using Chrome
And the user clicks on "Upload Files"
And the page reloads
And the user clicks on "Spreadsheet Formats"
Then the buttons "XLS" and "XLSX" show
And the user clicks on "XLSX"
And the user selects "500kb-sheet.xlsx"
Then the upload completes
And the table "Uploaded Files" contains a cell with "500kb-sheet.xlsx" 
And the user clicks on "XLSX"
And the user selects "1mb-sheet.xlsx"
Then the upload fails
And the table "Uploaded Files" does not contain a cell with "1mb-sheet.xlsx" 
And the user clicks on "Login"
And the user enters "testuser123" into the "username" field
And the user enters "$Pass123" into the "password" field
And the user clicks on "Sign in"
And the page reloads
Then the table "Uploaded Files" contains a cell with "500kb-sheet.xlsx" 
And the table "Uploaded Files" does not contain a cell with "1mb-sheet.xlsx" 
And the user clicks on "spreadsheet formats"
Then the buttons "XLS" and "XLSX" show
And the user clicks on "XLSX"
And the user selects "1mb-sheet.xlsx"
Then the upload completes
And the table "Uploaded Files" contains a cell with "1mb-sheet.xlsx" 
And the table "Uploaded Files" contains a cell with "500kb-sheet.xlsx"

A common way to avoid such issues is to rewrite the specification to avoid the user interface completely. We’ve looked into that option several times in this article series. However, that solution only applies if the risk we’re testing is not in the user interface, but somewhere below. To make this challenge more interesting, let’s say that we actually want to include the user interface in the test, since the risk is in the UI interactions.

Indeed, most behavior-driven practitioners would generally recommend against phrasing steps using language specific to the user interface. However, there are times when testing a user interface itself is valid. For example, I work at PrecisionLender, a Q2 Company, and our main web app is very heavy on the front end. It has many, many interconnected fields for pricing commercial lending opportunities. My team has quite a few tests to cover UI-centric behaviors, such as verifying that entering a new interest rate triggers recalculation for summary amounts. If the target behavior is a piece of UI functionality, and the risk it bears warrants test coverage, then so be it.

Let’s break down the example scenario given above to see how to write Gherkin with style for user interface tests.

Understanding Behavior

Behavior is behavior. If you can describe it, then you can do it. Everything exhibits behavior, from the source code itself to the API, UIs, and full end-to-end workflows. Gherkin scenarios should use verbiage that reflects the context of the target behavior. Thus, the example above uses words like “click,” “select,” and “open.” Since the scenario explicitly covers a user interface, I think it is okay to use these words here. What bothers me, however, are two apparent code smells:

The wall of text
Out-of-order step types

The first issue is the wall of text this scenario presents. Walls of text are hard to read because they present too much information at once. The reader must take time to read through the whole chunk. Many readers simply read the first few lines and then skip the remainder. The example scenario has 27 Given-When-Then steps. Typically, I recommend that Gherkin scenarios have a single-digit number of steps. A scenario with fewer than 10 steps is easier to understand and less likely to include unnecessary information. Longer scenarios are not necessarily “wrong,” but their longer lengths indicate that, perhaps, these scenarios could be rewritten more concisely.

The second issue in the example scenario is that step types are out of order. Given-When-Then is a formula for success. Gherkin steps should follow strict Given → When → Then ordering because this ordering demarcates individual behaviors. Each Gherkin scenario should cover one individual behavior so that the target behavior is easier to understand, easier to communicate, and easier to investigate whenever the scenario fails during testing. When scenarios break the order of steps, such as Given → Then → Given → Then in the example scenario, it shows that either the scenario covers multiple behaviors or that the author did not bring a behavior-driven understanding to the scenario.

The rules of good behavior don’t disappear when the type of target behavior changes. We should still write Gherkin with best practices in mind, even if our scenarios cover user interfaces.

Breaking Down Scenarios

If I were to rewrite the example scenario, I would start by isolating individual behaviors. Let’s look at the first half of the original example:

Given a user opens https://www.example.com using Chrome
And the user clicks on "Upload Files"
And the page reloads
And the user clicks on "Spreadsheet Formats"
Then the buttons "XLS" and "XLSX" show
And the user clicks on "XLSX"
And the user selects "500kb-sheet.xlsx"
Then the upload completes
And the table "Uploaded Files" contains a cell with "500kb-sheet.xlsx" 
And the user clicks on "XLSX"
And the user selects "1mb-sheet.xlsx"
Then the upload fails
And the table "Uploaded Files" does not contain a cell with "1mb-sheet.xlsx"

Here, I see four distinct behaviors covered:

Clicking “Upload Files” reloads the page.
Clicking “Spreadsheet Formats” displays new buttons.
Uploading a spreadsheet file makes the filename appear on the page.
Attempting to upload a spreadsheet file that is 1MB or larger fails.

If I wanted to purely retain the same coverage, then I would rewrite these behavior specs using the following scenarios:

Feature: Example site
 
 
Scenario: Choose to upload files
 
Given the Example site is displayed
When the user clicks the "Upload Files" link
Then the page displays the "Spreadsheet Formats" link
 
 
Scenario: Choose to upload spreadsheets
 
Given the Example site is ready to upload files
When the user clicks the "Spreadsheet Formats" link
Then the page displays the "XLS" and "XLSX" buttons
 
 
Scenario: Upload a spreadsheet file that is smaller than 1MB
 
Given the Example site is ready to upload spreadsheet files
When the user clicks the "XLSX" button
And the user selects "500kb-sheet.xlsx" from the file upload dialog
Then the upload completes
And the table "Uploaded Files" contains a cell with "500kb-sheet.xlsx" 
 
 
Scenario: Upload a spreadsheet file that is larger than or equal to 1MB
 
Given the Example site is ready to upload spreadsheet files
When the user clicks the "XLSX" button
And the user selects "1mb-sheet.xlsx" from the file upload dialog
Then the upload fails
And the table "Uploaded Files" does not contain a cell with "1mb-sheet.xlsx"

Now, each scenario covers one individual behavior. The first scenario starts with the Example site in a “blank” state: “Given the Example site is displayed”. The second scenario inherently depends upon the outcome of the first scenario. Rather than repeat all the steps from the first scenario, I wrote a new starting step to establish the initial state more declaratively: “Given the Example site is ready to upload files”. This step’s definition method may need to rerun the same operations as the first scenario, but it guarantees independence between scenarios. (The step could also optimize the operations, but that should be a topic for another challenge.) Likewise, the third and fourth scenarios have a Given step to establish the state they need: “Given the Example site is ready to upload spreadsheet files.” Both scenarios can share the same Given step because they have the same starting point. All three of these new steps are descriptive more than prescriptive. They declaratively establish an initial state, and they leave the details to the automation code in the step definition methods to determine precisely how that state is established. This technique makes it easy for Gherkin scenarios to be individually clear and independently executable.
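
To illustrate how a declarative Given step can rerun earlier operations inside its step definition, here is a minimal sketch. `FakeExampleSite` is a hypothetical in-memory stand-in for real browser automation, and the step functions are plain Python functions rather than bindings for any particular BDD framework.

```python
class FakeExampleSite:
    """Hypothetical stand-in for the Example site's user interface."""

    def __init__(self):
        self.visible = set()

    def open(self):
        self.visible = {"Upload Files"}

    def click(self, name):
        if name not in self.visible:
            raise LookupError(f"{name!r} is not visible")
        if name == "Upload Files":
            self.visible.add("Spreadsheet Formats")
        elif name == "Spreadsheet Formats":
            self.visible.update({"XLS", "XLSX"})


def given_site_is_displayed(site):
    site.open()


def given_site_is_ready_to_upload_files(site):
    # Reruns the first scenario's operations so this step stands alone.
    given_site_is_displayed(site)
    site.click("Upload Files")


def given_site_is_ready_to_upload_spreadsheet_files(site):
    # Builds on the previous state instead of depending on test order.
    given_site_is_ready_to_upload_files(site)
    site.click("Spreadsheet Formats")
```

Each Given function establishes its full starting state from scratch, so any scenario that uses it can run on its own.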

I also added my own writing style to these scenarios. First, I wrote concise, declarative titles for each scenario. The titles dictate interaction over mechanics. For example, the first scenario’s title uses the word “choose” rather than “click” because, from the user’s perspective, they are “choosing” an action to take. The user will just happen to mechanically “click” a link in the process of making their choice. The titles also provide a level of example. Note that the third and fourth scenarios spell out the target file sizes. For brevity, I typically write scenario titles using active voice: “Choose this,” “Upload that,” or “Do something.” I try to avoid including verification language in titles unless it is necessary to distinguish behaviors.

Another stylistic element of mine was to remove explicit details about the environment. Instead of hard-coding the website URL, I gave the site a proper name: “Example site.” I also removed the mention of Chrome as the browser. These details are environment-specific, and they should not be specified in Gherkin. In theory, this site could have multiple instances (like an alpha or a beta), and it should probably run in any major browser (like Firefox and Edge). Environmental characteristics should be specified as inputs to the automation code instead.

I also refined some of the language used in the When and Then steps. When I must write steps for mechanical actions like clicks, I like to specify element types for target elements. For example, “When the user clicks the "Upload Files" link” specifies a link by a parameterized name. Saying the element is a link helps provide context to the reader about the user interface. I wrote other steps that specify a button, too. These steps also specify the element name as a parameter so that the step definition method could possibly perform the same interaction for different elements. Keep in mind, however, that these linguistic changes are neither “required” nor “perfect.” They make sense in the immediate context of this feature. While automating step definitions or writing more scenarios, I may revisit the verbiage and do some refactoring.
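
One possible way to pass environment details like the site URL and the browser into automation code is through environment variables. The sketch below is an assumption for illustration: the variable names `EXAMPLE_SITE_URL` and `TEST_BROWSER`, the defaults, and the `EnvironmentInputs` type are all made up.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvironmentInputs:
    base_url: str
    browser: str


def load_inputs(env=None):
    """Read environment-specific inputs, falling back to illustrative defaults."""
    env = os.environ if env is None else env
    return EnvironmentInputs(
        base_url=env.get("EXAMPLE_SITE_URL", "https://www.example.com"),
        browser=env.get("TEST_BROWSER", "chrome").lower(),
    )
```

With inputs like these, the same “Given the Example site is displayed” step could target an alpha or beta instance, or a different browser, without any change to the Gherkin.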

Determining Value for Each Behavior

The four new scenarios I wrote each cover an independent, individual behavior of the fictitious Example site’s user interface. They are thorough in their level of coverage for these small behaviors. However, not all behaviors may be equally important to cover. Some behaviors are simply more important than others, and thus some tests are more valuable than others. I won’t go into deep detail about how to measure risk and determine value for different tests in this article, but I will offer some suggestions regarding these example scenarios.

First and foremost, you as the tester must determine what is worth testing. These scenarios aptly specify behavior, and they will likely be very useful for collaborating with the Three Amigos, but not every scenario needs to be automated for testing. You as the tester must decide. You may decide that all four of these example scenarios are valuable and should be added to the automated test suite. That’s a fine decision. However, you may instead decide that certain user interface mechanics are not worth explicitly testing. That’s also a fine decision.

In my opinion, the first two scenarios could be candidates for the chopping block:

Choose to upload files
Choose to upload spreadsheets

Even though these are existing behaviors in the Example site, they are tiny. The tests simply verify that a user’s click makes certain links or buttons appear. It would be nice to verify them, but test execution time is finite, and user interface tests are notoriously slow compared to other tests. Consider the Rule of 1’s: typically, by orders of magnitude, a unit test takes about 1 millisecond, a service API test takes about 1 second, and a web UI test takes about 1 minute. Furthermore, these behaviors are implicitly exercised by the other scenarios, even if they don’t have explicit assertions.
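
To make the Rule of 1’s concrete, here is a small worked calculation. It takes the order-of-magnitude durations above at face value and assumes a hypothetical suite of 1,000 tests run one after another with no parallelism.

```python
# Order-of-magnitude durations from the Rule of 1's, in seconds per test.
SECONDS_PER_TEST = {"unit": 0.001, "service API": 1, "web UI": 60}


def serial_runtime_seconds(level, test_count):
    """Total time to run `test_count` tests of one level back to back."""
    return SECONDS_PER_TEST[level] * test_count


unit_total = serial_runtime_seconds("unit", 1_000)        # about 1 second
api_total = serial_runtime_seconds("service API", 1_000)  # 1,000 s, about 17 minutes
ui_total = serial_runtime_seconds("web UI", 1_000)        # 60,000 s, about 16.7 hours
```

At that scale, every web UI test that can be dropped without losing meaningful coverage saves about a minute of serial execution time.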

One way to condense the scenarios could be like this:

Feature: Example site
 
 
Background:
 
Given the Example site is displayed
When the user clicks the "Upload Files" link
And the user clicks the "Spreadsheet Formats" link
And the user clicks the "XLSX" button
 
 
Scenario: Upload a spreadsheet file that is smaller than 1MB
 
When the user selects "500kb-sheet.xlsx" from the file upload dialog
Then the upload completes
And the table "Uploaded Files" contains a cell with "500kb-sheet.xlsx" 
 
 
Scenario: Upload a spreadsheet file that is larger than or equal to 1MB
 
When the user selects "1mb-sheet.xlsx" from the file upload dialog
Then the upload fails
And the table "Uploaded Files" does not contain a cell with "1mb-sheet.xlsx"

This new feature file eliminates the first two scenarios and uses a Background section to cover the setup steps. It also eliminates the need for special Given steps in each scenario to set unique starting points. Implicitly, if the “Upload Files” or “Spreadsheet Formats” links fail to display the expected elements, then those steps would fail.

Again, this modification is not necessarily the “best” way or the “right” way to cover the desired behaviors, but it is a reasonably good way to do so. However, I would assert that both the 4-scenario feature file and the 2-scenario feature file are much better approaches than the original example scenario.

More Gherkin

What I showed in my answer to this Gherkin challenge is how I would handle UI-centric behaviors. I try to keep my Gherkin scenarios concise and focused on individual, independent behaviors. Try using these style techniques to rewrite the second half of Gojko’s original scenario. Feel free to drop your Gherkin in the comments below. I look forward to seeing how y’all write #GivenWhenThenWithStyle!

12 Traits of Highly Effective Tests

Writing effective tests is hard. Tests that are flaky, confusing, or slow are effectively useless because they do more harm than good. The Arrange-Act-Assert pattern gives good structure, but what other characteristics should test cases have? Here are 12 traits for highly effective tests.

#1. Understandable

At its core, a test is just a step-by-step procedure. It exercises a behavior and verifies the outcome. In a sense, tests are living specifications – they detail exactly how a feature should function. Everyone should be able to intuitively understand how a test works. Follow conventions like Arrange-Act-Assert or Given-When-Then. Seek conciseness without vagueness. Avoid walls of text.

If you find yourself struggling to write a test in plain language, then you should review the design for the feature under test. If you can’t explain it, then how will others know how to use it?

#2. Unique

Each test case in a suite should cover a unique behavior. Don’t Repeat Yourself – repetitive tests with few differences bear a heavy cost to maintain and execute without delivering much additional value. If a test can cover multiple inputs, then focus on one variation per equivalence class.

For example, equivalence classes for the absolute value function could be a positive number, a negative number, and zero. There’s little need to cover multiple negative numbers because the absolute value function performs the same operation on all negatives.
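
A minimal sketch of that idea, using Python’s built-in `abs` and the standard `unittest` module, covers one representative value per equivalence class. The specific values 5, -5, and 0 are arbitrary picks.

```python
import unittest


class AbsoluteValueTest(unittest.TestCase):
    def test_one_value_per_equivalence_class(self):
        # One representative input per class: positive, negative, zero.
        cases = {"positive": (5, 5), "negative": (-5, 5), "zero": (0, 0)}
        for label, (value, expected) in cases.items():
            with self.subTest(equivalence_class=label):
                self.assertEqual(abs(value), expected)
```

Adding -6, -7, and -8 to the table would lengthen the run without exercising any new behavior.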

#3. Individual

Test one thing at a time. Tests that each focus on one main behavior are easier to formulate and automate. They naturally become understandable and maintainable. When a test covering only one behavior fails, then its failure reason is straightforward to deduce.

Any time you want to combine multiple behaviors into one test, consider separating them into different tests. Make a clear distinction between “arrange” and “act” steps. Write atomic tests as much as possible. Avoid writing “world tours,” too. I’ve seen repositories where tests are a hundred steps long and meander through an application like Mr. Toad’s Wild Ride.

#4. Independent

Each test should be independent of all other tests. That means testers should be able to run each test as a standalone unit. Each test should have appropriate setup and cleanup routines to do no harm and leave no trace. Set up new resources for each test. Automated tests should use patterns like dependency injection instead of global variables. If one test fails, others should still run successfully. Test case independence is the cornerstone for scalable, parallelizable tests.

Modern test automation frameworks strongly support test independence. However, folks who are new to automation frequently presume interdependence – they think the end of one test is the starting point for the next one in the source code file. Don’t write tests like that! Write your tests as if each one could run on its own, or as if the suite’s test order could be randomized.
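
Here is a small sketch of what independence can look like in practice, using Python’s standard `unittest` module. Each test sets up and cleans up its own temporary folder, and a helper runs the tests in shuffled order to show that order does not matter. The class name, the helper name, and the folder-based scenario are illustrative assumptions.

```python
import random
import shutil
import tempfile
import unittest
from pathlib import Path


class UploadFolderTest(unittest.TestCase):
    def setUp(self):
        # Every test gets its own fresh folder instead of a shared global one.
        self.folder = Path(tempfile.mkdtemp())

    def tearDown(self):
        # Leave no trace, whether the test passed or failed.
        shutil.rmtree(self.folder)

    def test_new_folder_is_empty(self):
        self.assertEqual(list(self.folder.iterdir()), [])

    def test_saved_file_is_listed(self):
        (self.folder / "500kb-sheet.xlsx").write_bytes(b"data")
        names = [path.name for path in self.folder.iterdir()]
        self.assertEqual(names, ["500kb-sheet.xlsx"])


def run_in_random_order(test_case_class, seed=None):
    """Run a test case's methods in shuffled order and return the result."""
    names = list(unittest.defaultTestLoader.getTestCaseNames(test_case_class))
    random.Random(seed).shuffle(names)
    suite = unittest.TestSuite(test_case_class(name) for name in names)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

If the second test relied on a file left behind by another test, or the first test relied on running before any file was saved, shuffling would eventually expose the hidden dependency.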

#5. Repeatable

Testing tends to be a repetitive activity. Test suites need to run continuously to provide fast feedback as development progresses. Every time they run, they must yield deterministic results because teams expect consistency.

Unfortunately, manual tests are not very repeatable. They require lots of time to run, and human testers may not run them exactly the same way each iteration. Test automation enables tests to be truly repeatable. Tests can be automated once and run repeatedly and continuously. Automated scripts always run the same way, too.

#6. Reliable

Tests must run successfully to completion, whether they return PASS or FAIL results. “Flaky” tests – tests that occasionally fail for arbitrary reasons – waste time and create doubt. If a test cannot run reliably, then how can its results be trusted? And why would a team invest so much time developing tests if they don’t run well?

You shouldn’t need to rerun tests to get good results. If tests fail intermittently, find out why. Correct any automation errors. Tune automation timeouts. Scale infrastructure to the appropriate sizes. Prioritize test stability over speed. And don’t overlook any wonky bugs that could be lurking in the product under test!

#7. Efficient

Providing fast feedback is testing’s main purpose. Fast feedback helps teams catch issues early and keep developing safely. Fast tests enable fast feedback. Slow tests cause slow feedback. They force teams to limit coverage. They waste time and money, and they increase the risk that bugs do more damage.

Optimize tests to be as efficient as possible without jeopardizing stability. Don’t include unnecessary steps. Use smart waits instead of hard sleeps. Write atomic tests that cover individual behaviors. For example, use APIs instead of UIs to prep data. Set up tests to run in parallel. Run tests as part of Continuous Integration pipelines so that they deliver results immediately.
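
As an illustration of a smart wait versus a hard sleep, the sketch below polls a condition and returns as soon as it becomes true, up to a timeout. The function name and default values are assumptions, and libraries like Selenium WebDriver ship their own waiting mechanisms that would normally be preferred.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.1):
    """Poll `condition` until it returns a truthy value or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

A hard `time.sleep(10)` always costs 10 seconds, even when the page is ready after 200 milliseconds. A polling wait costs only as long as the condition actually takes, and it still fails loudly if the condition never becomes true.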

#8. Organized

An effective test has a clear identity:

Purpose: Why run this test?
Coverage: What behavior or feature does this test cover?
Level: Should this test be a unit, integration, or end-to-end test?

Identity informs placement and type. Make sure tests belong to appropriate suites. For example, tests that interact with Web UIs via Selenium WebDriver do not belong in unit test suites. Group related tests together using subdirectories and/or tags.

#9. Reportable

Functional tests yield PASS or FAIL results with logs, screenshots, and other artifacts. Large suites yield lots of results. Reports should present results in a readable, searchable format. They should make failures stand out with colors and error messages. They should also include other helpful information like duration times and pass rates. Unit test reports should include code coverage, too.

Publish test reports to public dashboards so everyone can see them. Most Continuous Integration servers like Jenkins include some sort of test reporting mechanism. Furthermore, capture metrics like test result histories and duration times in data formats instead of textual reports so they can be analyzed for trends.
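
As a minimal sketch of capturing results as data rather than text, the example below stores result records as JSON and computes a pass rate per test across runs. The record fields (`test`, `run`, `outcome`, `seconds`) and the sample values are made up for illustration.

```python
import json
from collections import defaultdict


def pass_rates(records):
    """Return each test's pass rate across all of its recorded runs."""
    totals = defaultdict(lambda: [0, 0])  # test name -> [passes, runs]
    for record in records:
        counts = totals[record["test"]]
        counts[0] += record["outcome"] == "PASS"
        counts[1] += 1
    return {name: passes / runs for name, (passes, runs) in totals.items()}


# Illustrative records, serialized to JSON and loaded back as a history.
history_json = json.dumps([
    {"test": "upload", "run": 1, "outcome": "PASS", "seconds": 61.2},
    {"test": "upload", "run": 2, "outcome": "FAIL", "seconds": 90.0},
    {"test": "search", "run": 1, "outcome": "PASS", "seconds": 48.7},
    {"test": "search", "run": 2, "outcome": "PASS", "seconds": 50.1},
])
rates = pass_rates(json.loads(history_json))
```

Because the history is structured data, the same records could also feed duration trends or flakiness rankings without parsing any report text.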

#10. Maintainable

Tests are inherently fragile because they depend upon the features they cover. If features change, then tests probably break. Furthermore, automated tests are susceptible to code duplication because they frequently repeat similar steps. Code duplication is code cancer – it copies problems throughout a code base.

Fragility and duplication cause a nightmare for maintainability. To mitigate the maintenance burden, develop tests using the same practices as developing products. Don’t Repeat Yourself. Simple is better than complex. Do test reviews. For automation, follow good design principles like separating concerns and building solution layers. Make tests easy to update in the future!

#11. Trustworthy

A test is “successful” if it runs to completion and yields a correct PASS or FAIL result. The veracity of the outcome matters. Tests that report false failures make teams waste time doing unnecessary triage. Tests that report false passing results give a false sense of security and let bugs go undetected. Both ways ultimately cause teams to mistrust the tests.

Unfortunately, I’ve seen quite a few untrustworthy tests before. Sometimes, test assertions don’t check the right things, or they might be missing entirely! I’ve also seen tests for which the title does not match the behavior under test. These problems tend to go unnoticed in large test suites, too. Make sure every single test is trustworthy. Review new tests carefully, and take time to improve existing tests whenever problems are discovered.

#12. Valuable

Testing takes a lot of work. It takes time away from developing new things. Therefore, testing must be worth the effort. Since covering every single behavior is impossible, teams should apply a risk-based strategy to determine which behaviors pose the most risk if they fail and then prioritize testing for those behaviors.

If you are unsure if a test is genuinely valuable, ask this question: If the test fails, will the team take action to fix the defect? If the answer is yes, then the test is very valuable. If the answer is no, then look for other, more important behaviors to cover with tests.

Any more traits?

These dozen traits certainly make tests highly effective. However, this list is not necessarily complete. Do you have any more traits to add to the list? Do you agree or disagree with the traits I’ve given? Let me know by retweeting and commenting on my tweet below!

12 Traits of Highly Effective Tests:

0⃣1⃣ Understandable
0⃣2⃣ Unique
0⃣3⃣ Individual
0⃣4⃣ Independent
0⃣5⃣ Repeatable
0⃣6⃣ Reliable
0⃣7⃣ Efficient
0⃣8⃣ Leveled Appropriately
0⃣9⃣ Reportable
1⃣0⃣ Maintainable
1⃣1⃣ Trustworthy
1⃣2⃣ Valuable
— Pandy Knight (@AutomationPanda) July 6, 2020

(Note: I changed #8 from “Leveled Appropriately” to “Organized” to be more concise. The tweet is older than the article.)

How Do I Know My Tests Add Value?

Software testing is a huge effort, especially for automation. Teams can spend a lot of time, money, and resources on testing (or not). People literally make careers out of it. That investment ought to be worthwhile – we shouldn’t test for the sake of testing.

So, therein lies the million-dollar question: How do we know that our tests add meaningful value?

Or, more bluntly: How do we know that testing isn’t a waste of time?

That’s easy: bugs!

The stock answer goes something like this: We know tests add value when they find bugs! So, let’s track the number of bugs we find.

That answer is wrong, despite its good intentions. Bug count is a terrible metric for judging the value of tests.

What do you mean bug counts aren’t good?

I know that sounds blasphemous. Let’s unpack it. Finding bugs is a good thing, and tests certainly should find bugs in the features they cover. But, the premise that the value of testing lies exclusively in the bugs found is wrong. Here’s why:

The main value of testing is fast feedback. Testing serves two purposes: (1) validating goodness and (2) identifying badness. Passing tests are validated goodness. Failing tests, meaning uncovered bugs, are identified badness. Both types of feedback add value to the development process. Developers can proceed confidently with code changes when trustworthy tests are passing, and management can assess lower risk. Unfortunately, bug counts don’t measure that type of goodness.
Good testing might actually reduce bug count. Testing means accountability for development. Developers must think more carefully about design. They can also run tests locally before committing changes. They could even do Test-Driven Development. Better practices could prevent many bugs from ever happening.
Tracking bug count can drive bad behavior. If a high bug discovery rate looks good (or, worse, is enforced through quotas), testers will strive to post numbers. If they don’t find critical bugs, they will open bug reports for nitpicks and trivialities. The extra effort they spend reporting inconsequential problems may be of no value to the business, wasting their time and the developers’ time all for the sake of metrics.
Bugs are usually rare. Unless a team is dysfunctional, the product usually works as expected. Hundreds of test runs may not yield a single bug. That’s a wonderful thing if the tests have good coverage. Those tests still add value. Saying they don’t belittles the whole testing effort.

Then what metrics should we use?

Bugs happen arbitrarily, and unlimited testing is not possible. Metrics should focus on the return-on-investment for testing efforts. Here are a few:

Time-to-bug-discovery. Rather than track bug counts, track the time until each bug is discovered. This metric genuinely measures the feedback loop for test results. Make sure to track the severity of each bug, too. For example, if high-severity bugs are not caught until production, then the tests don’t have enough coverage. Teams should strive for the shortest time possible – fast feedback means lower development costs. This metric also encourages teams to follow the Testing Pyramid.
Coverage. Coverage is the degree to which tests exercise product behavior. Higher coverage means more feedback and greater chances of identifying badness. Most unit test frameworks can work with code coverage tools to measure which paths through the code the tests execute. Feature coverage requires extra process or instrumentation. Tests should avoid duplicate coverage, too.
Test failure proportions. Tests fail for a variety of reasons. Ideally, tests should fail only when they discover bugs. However, tests may also fail for other reasons: unexpected feature changes, environment instability, or even test automation bugs. Non-bug failures disrupt the feedback loop: they force a team to fix testing problems rather than product problems, and they might cause engineers to devalue the whole testing effort. Tracking failure proportions will reveal what problems inhibit tests from delivering their top value.
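As a rough sketch of how failure proportions might be tracked, the snippet below tallies triage labels for failed test runs. The function name `failure_proportions`, the label strings, and the sample data are all assumptions made up for illustration; a real team would pull the labels from its own triage records.

```python
from collections import Counter


def failure_proportions(failure_causes):
    """Return each cause's share of all failures as a fraction between 0 and 1."""
    counts = Counter(failure_causes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cause: count / total for cause, count in counts.items()}


# Hypothetical triage labels for ten failed test runs.
triaged = (
    ["product bug"] * 4
    + ["environment"] * 3
    + ["automation bug"] * 2
    + ["feature change"]
)

# Print the causes from most to least common.
for cause, share in sorted(failure_proportions(triaged).items(), key=lambda item: -item[1]):
    print(f"{cause}: {share:.0%}")
```

In this made-up sample, more than half of the failures are not product bugs, which would suggest that the team should stabilize the environment and the automation before adding more tests.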

EGAD! How Do We Start Writing (Better) Tests?

Some have never automated tests and can’t check themselves before they wreck themselves. Others have 1000s of tests that are flaky, duplicative, and slow. Wa-do-we-do? Well, I gave a talk about this problem at a few Python conferences. The language used for example code was Python, but the principles apply to any language.

Here’s the PyTexas 2019 talk:

And here’s the PyGotham 2018 talk:

And here’s the first time I gave this talk, at PyOhio 2018:

I also gave this talk at PyCaribbean 2019 and PyTennessee 2020 (as an impromptu talk), but it was not recorded.

The Testing Pyramid

The “Testing Pyramid” is an industry-standard guideline for functional test case development. Love it or hate it, the Pyramid has endured since the mid-2000s because it continues to be practical. So, what is it, and how can it help us write better tests?

Layers

The Testing Pyramid has three classic layers:

Unit tests are at the bottom. Unit tests directly interact with product code, meaning they are “white box.” Typically, they exercise functions, methods, and classes. Unit tests should be short, sweet, and focused on one thing/variation. They should not have any external dependencies – mocks/monkey-patching should be used instead.
Integration tests are in the middle. Integration tests cover the point where two different things meet. They should be “black box” in that they interact with live instances of the product under test, not code. Service call tests (REST, SOAP, etc.) are examples of integration tests.
End-to-end tests are at the top. End-to-end tests cover a path through a system. They could arguably be defined as multi-step integration tests, and they should also be “black box.” Typically, they interact with the product like a real user. Web UI tests are examples of end-to-end tests because they need the full stack beneath them.

All layers are functional tests because they verify that the product works correctly.
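For the bottom layer, here is a minimal sketch of a unit test that replaces an external dependency with a mock, using `unittest.mock` from Python's standard library. The function `greeting_for` and the `user_service` object are hypothetical stand-ins for product code.

```python
from unittest.mock import Mock


def greeting_for(user_id, user_service):
    """Hypothetical product code: build a greeting from a user record."""
    user = user_service.get_user(user_id)
    return f"Hello, {user['name']}!"


def test_greeting_uses_name_from_service():
    # The mock stands in for a live service, so the test makes no network call.
    user_service = Mock()
    user_service.get_user.return_value = {"name": "Pandy"}

    assert greeting_for(42, user_service) == "Hello, Pandy!"
    user_service.get_user.assert_called_once_with(42)
```

An integration test for the same behavior would instead call a live instance of the user service, which is why it belongs one layer up.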

Proportions

The Testing Pyramid is triangular for a reason: there should be more tests at the bottom and fewer tests at the top. Why?

Distance from code. Ideally, tests should catch bugs as close to the root cause as possible. Unit tests are the first line of defense. Simple issues like formatting errors, calculation blunders, and null pointers are easy to identify with unit tests but much harder to identify with integration and end-to-end tests.
Execution time. Unit tests are very quick, but end-to-end tests are very slow. Consider the Rule of 1’s for Web apps: a unit test takes ~1 millisecond, a service test takes ~1 second, and a Web UI test takes ~1 minute. If test suites have hundreds to thousands of tests at the upper layers of the Testing Pyramid, then they could take hours to run. An hours-long turnaround time is unacceptable for continuous integration.
Development cost. Tests near the top of the Testing Pyramid are more challenging to write than ones near the bottom because they cover more stuff. They’re longer. They need more tools and packages (like Selenium WebDriver). They have more dependencies.
Reliability. Black box tests are susceptible to race conditions and environmental failures, making them inherently more fragile. Recovery mechanisms take extra engineering.

The total cost of ownership increases when climbing the Testing Pyramid. When deciding the level at which to automate a test (and whether to automate it at all), taking a risk-based strategy to push tests down the Pyramid is better than writing all tests at the top. Each proportionate layer mitigates risk at its optimal return-on-investment.
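The Rule of 1’s can be turned into a back-of-the-envelope runtime estimate. The sketch below assumes tests run serially and uses made-up test counts; the names `SECONDS_PER_TEST` and `estimated_runtime_seconds` are mine, not part of any tool.

```python
# Rough per-test durations in seconds, taken from the Rule of 1's.
SECONDS_PER_TEST = {"unit": 0.001, "service": 1.0, "web_ui": 60.0}


def estimated_runtime_seconds(test_counts):
    """Estimate serial runtime for a suite, given test counts per layer."""
    return sum(SECONDS_PER_TEST[layer] * count for layer, count in test_counts.items())


# Hypothetical suites: one shaped like a pyramid, one turned upside down.
pyramid = {"unit": 10_000, "service": 1_000, "web_ui": 100}
inverted = {"unit": 100, "service": 1_000, "web_ui": 10_000}

print(f"pyramid:  {estimated_runtime_seconds(pyramid) / 3600:.1f} hours")
print(f"inverted: {estimated_runtime_seconds(inverted) / 3600:.1f} hours")
```

With these counts, the pyramid-shaped suite needs about two hours of serial execution, while the inverted suite needs roughly a week. Parallel execution would shrink both numbers, but the gap between them would remain.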

Practice

The Testing Pyramid should be a guideline, not a hard rule. Don’t require hard proportions for test counts at each layer. Why not? Arbitrary metrics cause bad practices: a team might skip valuable end-to-end tests or write needless unit tests just to hit numbers. W. Edwards Deming would shudder!

Instead, use loose proportions to foster better retrospectives. Are we covering too many input combos through the Web UI when they could be checked via service tests? Are there unit test coverage gaps? Do we have a pyramid, a diamond, a funnel, a cupcake, or some other wonky shape? Each layer’s test count should be roughly an order of magnitude smaller than the layer beneath it. Large Web apps often have 10K unit tests, 1K service tests, and a few hundred Web UI tests.
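For a retrospective, a quick script can show the rough shape of a suite. This is only a sketch: the function names, the shape labels, and the sample counts are assumptions for illustration, and the code expects nonzero counts at the service and Web UI layers.

```python
def layer_ratios(unit, service, web_ui):
    """Return how many times larger each layer is than the layer above it."""
    return {
        "unit_to_service": unit / service,
        "service_to_web_ui": service / web_ui,
    }


def describe_shape(unit, service, web_ui):
    """Give a loose label for the suite's shape, meant only as a conversation starter."""
    if unit > service > web_ui:
        return "pyramid"
    if unit < service < web_ui:
        return "inverted pyramid"
    if service > unit and service > web_ui:
        return "diamond"
    return "irregular"


# Hypothetical counts for a large Web app.
counts = {"unit": 10_000, "service": 1_000, "web_ui": 300}
print(describe_shape(**counts), layer_ratios(**counts))
```

The output is a prompt for discussion rather than a target to hit. A ratio far from ten is a reason to ask questions, not a failure.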

Resources

Check out these other great articles on the Testing Pyramid:

TestPyramid by Martin Fowler
The Practical Test Pyramid by Ham Vocke
Test Pyramid: the key to good automated test strategy by Tim Cochran
Just Say No to More End-to-End Tests by Mike Wacker