BDD

Making Great Waves: 8 Software Testing Convictions

The Great Wave Off Kanagawa.

Katsushika Hokusai, 1830.

It is one of the most recognizable works of art in the world. It is so famous, it has an emoji: 🌊.

The Great Wave Off Kanagawa is a Japanese woodblock print. It is not a painting or a drawing but a print. In Japanese, the term for this type of art is ukiyo-e, which means “pictures of the floating world.” Ukiyo-e prints first appeared around the 1660s and did not decline in popularity until the Meiji Restoration two centuries later. While most artists focused on subjects of people, late masters like Hokusai captured perspectives of landscapes and nature. Here, in The Great Wave, we see a giant wave, full of energy and ferocity, crashing down onto three fast boats attempting to transport live fish to market. Its vibrant blue water and stark white peaks contrast against a yellowish-gray sky. In the distance is Mount Fuji, the highest mountain in Japan, yet it is dwarfed in perspective by the waves. In fact, the water spray from the waves appears to fall over Mount Fuji like snow. If you didn’t look closely, you might presume that Mount Fuji is just the crest of another wave.

The Great Wave is absolutely stunning. It is arguably Hokusai’s finest work. The colors and the lines reflect boldness. The claws of the wave impart vitality. The men on the boat show submission and possibly fear. The spray from the wave reveals delicacy and attention to detail. Personally, I love ukiyo-e prints like this. I travel the world to see them in person. The quality, creativity, and craftsmanship they exhibit inspire me to instill the highest quality possible into my own work.

As software quality professionals, there are several lessons we can learn from ukiyo-e masters like Hokusai. Testing is an art as much as it is engineering. We can take cues from these prolific artists in how we approach quality in our own work. In this article, I will share how we can make our own “Great Waves” using 8 software testing convictions inspired by ukiyo-e prints like The Great Wave. Let’s begin!

Conviction #1: Focus on behavior

Although we hold these Japanese woodblock prints today in high regard, they were seen as anything but fancy centuries ago in Japan. Ukiyo-e was “low” art for the common people, whereas paintings on silk scrolls were considered “high” art for the upper classes.

Folks would buy these prints from local merchants for slightly more than the cost of a bowl of noodles – about $5 to $10 in today’s US dollars – and they would use these prints to decorate their homes. By comparison, a print of The Great Wave sold at auction for $1.11 million in September 2020.

These prints weren’t very large, either. The Great Wave measures 10 inches tall by 15 inches wide, and most prints were of similar size. That made them convenient to buy at the market, carry home, and display on the wall. To understand how the Japanese people treated these prints in their day, think about the decorations in your home that you bought at stores like Home Goods and Target. You probably have some screen prints or posters on your walls.

Since the target consumers for ukiyo-e prints were ordinary people with working-class budgets, the prints needed to be affordable, popular, and recognizable. When Hokusai published The Great Wave, it wasn’t a standalone piece. It was the first print in a series named Thirty-six Views of Mount Fuji. Below are three other prints from that series. The central feature in each print is Mount Fuji, which would be instantly recognizable to any Japanese person. The various views would also be relatable.

Fine Wind, Clear Morning shows nice weather against the slopes of the mountain with a powerful contrast of colors.
Thunderstorm Beneath the Summit depicts Mount Fuji from a nearly identical profile, but with lightning striking the lower slopes of the mountain amidst a far darker palette.
Kajikazawa in Kai Province depicts two fishermen with Mount Fuji in the background.

The features of these prints made them valuable. Anyone could find a favorite print or two out of a series of 36. They made art accessible to everyday people. They were inexpensive yet impressive, artsy yet approachable. Artists like Hokusai knew what people wanted, and they delivered the goods.

This isn’t any different from software development. Features add value for the users. For example, if you’re developing a banking app, folks better be able to log in securely and view their latest transactions. If those features are broken or unintuitive, folks might as well move their accounts to other banks! We, as the developers and testers, are like the ukiyo-e artists: we need to know what our customers need. We need to make products that they not only want, but they also enjoy.

Features add value. However, I would use a better word to describe this aspect of a product: behavior. Behavior is the way one acts or conducts oneself. In software, we define behaviors in terms of inputs and responses. For example, login is a behavior: you enter valid credentials, and you expect to gain access. You gave inputs, the app did something, and you got the result.

My conviction on software testing AND development is that if you focus on good software behaviors, then everything else falls into place. When you plan development work, you prioritize the most important behaviors. When you test the features, you cover the most important behaviors. When users get your new product, they gain value from those features, and hopefully you make that money, just like Hokusai did.

This is why I strongly believe in the value of Behavior-Driven Development, or BDD for short. As a set of pragmatic practices, BDD helps you and your team stay focused on the things that matter. BDD involves activities like Three Amigos collaboration, Example Mapping, and writing Gherkin. When you focus on behavior – not on shiny new tech, or story points, or some other distractions – you win big.
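To make this concrete, here is a minimal sketch of that login behavior written as a Gherkin scenario (the app and step wording are purely illustrative):

Scenario: Log into the banking app
    Given the banking app login page is displayed
    When the user submits valid credentials
    Then the user gains access to their accounts

Notice that the scenario describes inputs and expected responses, not implementation details like which buttons to click.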

Conviction #2: Prioritize on risk

Ukiyo-e artists depicted more than just views of Mount Fuji. In fact, landscape scenes became popular only during the late period of woodblock printing – the 1830s to the 1860s. Before then, artists focused primarily on people: geisha, courtesans, sumo wrestlers, kabuki actors, and legendary figures. These were all characters from the “floating world,” a world of pleasure and hedonism apart from the dreary everyday life of feudal Japan.

Here is a renowned print of a kabuki actor by Sharaku, printed in 1794:

Kabuki Actor Ōtani Oniji III as Yakko Edobei in the Play The Colored Reins of a Loving Wife
Tōshūsai Sharaku, 1794

Sharaku was active only for one year, but he produced some of the most expressive portraits seen during ukiyo-e’s peak period. A yakko was a samurai’s henchman. In this portrait, we see Edobei ready for dirty deeds, with a stark grimace on his face and hands pulsing with anger.

Why would artists like Sharaku print faces like these? Because they would sell. Remember, ukiyo-e was not high-class art. It was a business. Artists would make a series of prints and sell them on the streets of Edo (now Tokyo). They needed to make prints that people wanted to buy. If they picked lousy or boring subjects, their prints wouldn’t sell. No soba noodles for them! So, what subjects did they choose? Celebrities. Actors. “Female beauties.” And some content that was not safe for work, like Hokusai’s The Dream of the Fisherman’s Wife. (Seriously, that link is not safe for work. Click it at your own risk.)

Artists prioritized their work based on business risk. They chose subjects that would be easy to sell. They pursued value. As testers, we should also prioritize test coverage based on risk.

I know there’s a popular slogan saying, “Test all the things!”, but that’s just impossible. It’s like saying, “Print all the pictures!” Modern apps are too complex to attempt any sort of “complete” or “100%” coverage. Instead, we should focus our testing efforts on the most important behaviors, the ones that would cause the most problems if they broke. Testing is ultimately a risk-mitigating activity. We do testing to mitigate the risk of problems introduced during development.

So, what does a risk-based testing strategy look like? Well, start by covering the most valuable behaviors. You can call them the MVBs. These are behaviors that are core to your app. If they break, then it’s game over. No soba noodles. For example, if you can’t log in, you’re done-zo. The MVBs should be tested before every release. They are non-negotiable test coverage. If your team doesn’t have enough resources to run these tests, then get more resources.

In addition to the MVBs, cover areas that were changed since the previous release. For example, if your banking app just added mobile deposits, then you should test mobile deposits. Things break where developers make changes. Also, look at testing different layers and aspects of the product. Not every test should be a web UI test. Add unit tests to pinpoint failures in the code. Add API tests to catch problems at the service layer. Consider aspects like security, accessibility, and visuals.

When planning these tests, try to keep them fast and atomic, covering individual behaviors instead of long workflows. Shorter tests are more reliable and give space for more coverage. And if you do have the resources for more coverage beyond the MVBs and areas of change, expand your coverage as resources permit. Keep adding coverage for the next most valuable behaviors until you either run out of time or the coverage isn’t worth the time.

Overall, ask yourself this when weighing risks: How painful would it be if a particular behavior failed? Would it ruin a user’s experience, or would they barely notice?

Conviction #3: Automate

The copy of The Great Wave shown at the top of this article is located at the Metropolitan Museum of Art in New York City. However, that’s not the only version. When ukiyo-e artists produced their prints, they kept printing copies until the woodblocks wore out! Remember, these weren’t precious paintings for the rich, they were posters for the commoners. One set of woodblocks could print thousands of impressions of popular designs for the masses. It’s estimated that there were five to eight thousand original impressions of The Great Wave, but nobody knows for sure. To this day, only a few hundred have survived. And much to my own frustration, museums that have copies do not put them on public display because the pieces are so fragile.

Here are different copies of The Great Wave from different museums:

Print production had to be efficient and smooth. Remember, this was a business. Publishers would make more money if they could print more impressions from the same set of woodblocks. They’d gain more renown if their prints maintained high quality throughout the lifetime of the blocks. And the faster they could get their prints to market, the sooner they could get paid and enjoy all the soba noodles.

What can we learn from this? Automate! That’s our third conviction.

Automation is a force multiplier. If Hokusai spent all his time manually laboring over one copy of The Great Wave, then we probably wouldn’t be talking about it today. But because woodblock printing was a whole process, he produced thousands of copies for everyone to enjoy. I wouldn’t call the woodblock printing process fully “automated” because it had several tedious steps with manual labor, but in Edo period Japan, it was about as automated as you could get.

Compare this to testing. If we run a test manually, we cover the target behavior one time. That’s it: lots of labor for one instance. However, if we automate that test, we can run it thousands of times. It can deliver value again and again. That’s the difference between a painting and a print.

So, how should we go about test automation? First, you should define your goals. What do you hope to achieve with automation? Do you want to speed up your testing cycles? Are you looking to widen your test coverage? Perhaps you want to empower Continuous Delivery through Continuous Testing? Carefully defining your goals from the start will help you make good decisions in your test automation strategy.

When you start automating tests, treat it like full software development. You aren’t just writing a bunch of scripts, you are developing a software system. Follow recommended practices. Use design patterns. Do code reviews. Fix bugs quickly. These principles apply whether you are using coded or codeless tools.

Another trap to avoid is delaying test automation. So many times, I’ve heard teams struggle to automate their tests because they schedule automation work as their lowest priority. They wish they could develop automation, but they just never have the time. Instead, they grind through testing their MVBs manually just to get the job done. My advice is to flip that attitude right-side up. Automate first, not last. Instead of planning a few tests to automate if there’s time, plan to automate first and cover anything that couldn’t be automated with manual testing.

Furthermore, integrate automated tests into the team’s Continuous Integration system as soon as possible. Automated tests that aren’t running are dead to me. Get them running automatically in CI so they can deliver value. Running them nightly or even weekly can be a good start, as long as they run on a continuous cadence.

Finally, learn good practices. Test automation technologies are ever-evolving. It seems like new tools and frameworks hit the market all the time. If you’re new to automation or you want to catch up with the latest trends, then take time to learn. One of the best resources I can recommend is Test Automation University. TAU has about 70 courses on everything you can imagine, taught by the best instructors in the world, and it’s 100% FREE!

Now, you might be thinking, “Andy, come on, you know everything can’t be automated!” And that’s true. There are times when human intervention adds value. We see this in ukiyo-e prints, too. Here is Plum Garden at Kameido by Utagawa Hiroshige, Hokusai’s main rival. Notice the gradient colors of green and red in the background:

Plum Garden at Kameido
Utagawa Hiroshige, 1857

Printers added these gradients using a technique called bokashi, in which they would apply layers of ink to the woodblocks by hand. Sometimes, they would even paint layers directly on the prints. In these cases, the “automation” of the printing process was insufficient, and humans needed to manually intervene.

It’s always good to have humans test-drive software. Automation is great for functional verification, but it can’t validate user experience. Exploratory testing is an awesome complement to automated testing because it mitigates different risks.

Nevertheless, automation can now do things it never could before. I work at Applitools, where we specialize in automated visual testing. Take a look at these two prints of Matsumoto Hoji’s Frog from Meika Gafu. Notice anything different between the two?

Two different versions of Matsumoto Hoji’s Frog.

If we use Visual AI to compare these two prints, it will quickly identify the main difference:

Applitools Visual AI identifying visual differences (highlighted in magenta) between two prints.

The signature block is in a different location! Minor differences like small pixel offsets are ignored, while significant differences are highlighted. If you apply this style of visual testing to your web and mobile apps, you could catch a ton of visual bugs before they cause problems for your users. Modern test automation can do some really cool tricks!
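If you fold visual checks into behavior-driven tests, a scenario might read something like this sketch (the step wording is hypothetical, and the pixel comparison is left to the automation underneath):

Scenario: Home page has no unexpected visual changes
    Given the banking app home page is displayed
    When the page is captured as a visual checkpoint
    Then the capture matches the approved visual baseline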

Conviction #4: Shift left and right

Mokuhanga, or woodblock printing, was a huge process with multiple steps. Artists like Hokusai and Hiroshige did not print their artwork themselves. In fact, printing required multiple roles to be successful: a publisher, an artist, a carver, and a printer.

  1. The publisher essentially ran the process. They commissioned, financed, and distributed prints. They would even collaborate with artists on print design to keep them up with the latest trends.
  2. The artist designed the patterns for the prints. They would sketch the patterns on washi paper and give instructions to the carver and printer on how to properly produce the prints.
  3. The carver would chisel the artist’s pattern into a set of wooden printing blocks. Each layer of ink would have its own block. Carvers typically used a smooth, hard wood like cherry.
  4. The printer used the artist’s patterns and carver’s woodblocks to actually make the prints. They would coat the blocks in appropriately-colored water-based inks and then press paper onto the blocks.

Quality had to be considered at every step in the process, not just at the end. If the artist was not clear about colors, then the printer might make a mistake. If the carver cut a groove too deep, then ink might not adhere to the paper as intended. If the printer misaligned a page during printing, then they’d need to throw it away – wasting time, supplies, and woodblock life – or risk tarnishing everyone’s reputation with a misprint. Hokusai was noted for his stringent quality standards for carvers and printers.

The words of W. Edwards Deming ring true:

Inspection does not improve the quality, nor guarantee quality. Inspection is too late. The quality, good or bad, is already in the product. As Harold F. Dodge said, “You cannot inspect quality into a product.”

W. Edwards Deming

This is just like software development. We can substitute the word “testing” for “inspection” in Deming’s quote. Testers don’t exclusively “own” quality. Every role – business, development, and testing – has a responsibility for high-caliber work. If a product owner doesn’t understand what the customer needs, or a developer skips code reviews, or if a tester neglects an important feature, then software quality will suffer.

How do we engage the whole team in quality work? Shift left and right.

Most testers are probably familiar with the term shift left. It means, start doing testing work earlier in the development process. Don’t wait until developers are “done” and throw their code “over the fence” to be tested. Run tests continuously during development. Automate tests in-sprint. Adopt test-driven and behavior-driven practices. Require unit tests. Add test implementation to the “Definition of Done.”

But what about shift right? This is a newer phrase, but not necessarily a newer practice. Shift right means, continue to monitor software quality during and after releases. Build observability into apps. Monitor apps for bugs, failures, and poor performance. Do canary deployments to see how systems respond to updates. Perform chaos testing to see how resilient environments are to outages. Issue different UIs to user groups as part of A/B testing to find out what’s most effective. And feed everything you learn back into development a la “shift left.”

The DevOps Infinity Loop
(Source: https://www.atlassian.com/devops)

The famous DevOps infinity loop shows how “shift left” and “shift right” are really all part of the same flow. If you start in the middle where the paths cross, you can see arrows pointing leftward for feedback, planning, and building. Then, they push rightward with continuous integration, deployment, monitoring, and operations. We can (and should) take all the quality measures described above as we spin through this loop perpetually. When we plan, we should build quality in with good design and feedback from the field. When we develop, we should do testing together with coding. As we deploy, automated safety checks should give thumbs-up or thumbs-down. Post-deployment, we continue to watch, learn, and adjust.

Conviction #5: Give fast feedback

The acronym CI/CD is ubiquitous in our industry, but I feel like it’s missing something important: “CT”, or Continuous Testing. CI and CD are great for pushing code fast, but without testing, they could be pushing garbage. Testing does not improve quality directly, but continuous revelation of quality helps teams find and resolve issues fast. It demands response. Continuous Testing keeps the DevOps infinity loop safe.

Fast feedback is critical. The sooner and faster teams discover problems, the less pain those problems will cause. Think about it: if a developer is notified that their code change caused a failure within a minute, they can immediately flip back to their code, which is probably still open in an editor. If they find out within an hour, they’ll still have their code fresh in their mind. Within a day, it’ll still be familiar. A week or more later? Fuggedaboutit! Heaven forbid the problem goes undetected until a customer hits it.

Continuous testing enables fast feedback. Automation enables continuous testing. Test automation that isn’t running continuously is worthless because it provides no feedback.

Japanese woodblock printers also relied on fast feedback. If they noticed anything wrong with the prints as they pressed them, they could scrap the misprint and move on. However, since they were meticulous about quality, misprints were rare. Nevertheless, each print was unique because each impression was done manually. The amount, placement, and hue of ink could vary slightly from print to print. Over time, the woodblocks themselves wore down, too.

Here, you can see differences in the title cartouche between different prints of The Great Wave:

Differences in the title cartouche between two prints of The Great Wave.
(Source: https://blog.britishmuseum.org/the-great-wave-spot-the-difference/)

On the left, the outline around the title is solid, whereas on the right, the outline has breaks. This is because the keyblock had very fine ridges for printing outlines, which suffered the most from wear and tear during repeated impressions. Furthermore, if you look very closely, you can see that the Japanese characters appear bolder on the right than the left. The printer must have used more ink or pressed the title harder for the impression on the right.

Printers would need to spot these issues quickly so they could either correct their action for future prints or warn the publisher that the woodblocks were wearing down. If the print was popular, the publisher could commission a carver to carve new woodblocks to keep production going.

Conviction #6: Go lean

As I’ve said many times now, woodblock printing was a business. Ukiyo-e was commercial art, and competition was fierce. By the 1840s, production peaked with about 250 different publishers. Artists like Hokusai and Hiroshige were rivals. While today we recognize famous prints like The Great Wave, countless other prints were also made.

Publishers competed in a rat race for the best talent and the best prints. They had to be savvy. They had to build good reputations. They needed to respond to market demands for subject material. For example, Kitagawa Utamaro was famous for prints of “female beauties.”

Two Beauties with Bamboo
Kitagawa Utamaro, 1795

Ukiyo-e artists also took inspiration from each other. If one artist made a popular design, then other artists would copy their style. Here is a print from Hiroshige’s series, Thirty-Six Views of Mount Fuji. That’s right, Hokusai’s biggest rival made his own series of 36 prints about Mount Fuji, and he also made his own version of The Great Wave. If you can’t beat ‘em, join ‘em!

The Sea off Satta in Suruga Province
Utagawa Hiroshige, 1858

Publishers also had to innovate. Oftentimes, after a print had been in production for a while, they would instruct the printer to change the color scheme. Here are two versions of Hokusai’s Kajikazawa in Kai Province, from Thirty-six Views of Mount Fuji:

The print on the left is an early impression. The only colors used were shades of blue. This was Hokusai’s original artistic intention. However, later prints, like the one on the right, added different colors to the palette. The fishermen now wear red coats. The land has a bokashi green-yellow gradient. The sky incorporates orange tones to contrast the blue. Publishers changed up the colors to squeeze more money out of existing designs without needing to pay artists for new work or carvers for new woodblocks.

However, sometimes when doing this, artistic quality was lost. Compare the fine detail in the land between these two prints. In the early impression, you can see dark blue shading used to pronounce the shadows on the side of the rocks, giving them height and depth, and making the fisherman appear high above the water. However, in the later impression, the green strip of land has almost no shading, making it appear flat and less prominent.

Ukiyo-e publishers would have completely agreed with today’s lean business model. Seek first and foremost to deliver value to your customers. Learn what they want. Try some designs, and if they fail, pivot to something else. When you find what works, get a full end-to-end process in place, and then continuously improve as you go. Respond quickly to changes.

Going lean is very important for software testing, too. Testing is engineering, and it has serious business value. At the same time, testing activities never seem to have as many resources as they should. Testers must be scrappy to deliver valuable quality feedback using the resources they have.

When I think about software testing going lean, I’m not implying that testers should skip tests or skimp on coverage. Rather, I’m saying that world-class systems and processes cannot be built overnight. The most important thing a team can do is build basic end-to-end feedback loops from the start, especially for test automation.

The Quality Feedback Loop

So many times, I’ve seen teams skew their test automation strategy entirely towards implementation. They spend weeks and weeks developing suites of automated tests before they set up any form of Continuous Testing. Instead of triggering tests as part of Continuous Integration, folks must manually push buttons or run commands to make them start. Other folks on the team see results sporadically, if ever. When testers open bug reports, developers might feel surprised.

I recommend teams set up Continuous Testing with feedback loops from the start. As soon as you automate your first test, get it running from CI and sending notifications of results before automating your second test. Close the feedback loop. Start delivering results immediately. As you find hotspots, add more coverage. Talk with developers about the kinds of results they find most valuable. Then, grow your suite once you demonstrate its value. Increase the throughput. Turn those sidewalks into highways. Continue to iteratively improve upon the system as you go. Don’t waste time on tests that don’t matter or dashboards that nobody reads. Going lean means allocating your resources to the most valuable activities. What you’ll find is that success will snowball!

Conviction #7: Open up

Once you have a good thing going, whether it’s woodblock printing or software testing, how can you take it to the next level? Open up! Innovation stalls when you end up staring at your own belly button for too long. Outside influences inspire new creativity.

Ukiyo-e prints had a profound impact on Western art. After Japan opened up to the rest of the world in the mid-1800s, Europeans became fascinated by Japanese art, and European artists began incorporating Japanese styles and subjects into their work. This phenomenon became known as Japonisme. Here, Claude Monet, famous for his impressionist paintings, painted a picture of his wife wearing a kimono with fans adorning the wall behind her:

La Japonaise
Claude Monet, 1876

Vincent van Gogh in particular loved Japanese woodblock prints. He painted his own versions of different prints. Here, we see Hiroshige’s Plum Garden at Kameido side-by-side with Van Gogh’s Flowering Plum Orchard (after Hiroshige):

Van Gogh was drawn to the bold lines and vibrant colors of ukiyo-e prints. There is even speculation that The Great Wave inspired the design of The Starry Night, arguably Van Gogh’s most famous painting:

Notice how the shapes of the waves mirror the shapes of the swirls in the sky. Notice also how deep shades of blue contrast yellows in each. Ukiyo-e prints served as great inspiration for what became known as Modern art in the West.

Influence was also bidirectional. Not only did Japan influence the West, but the West influenced Japan! One thing common to all of the prints in Thirty-six Views of Mount Fuji is the extensive use of blue ink. Prussian blue pigment had recently come to Japan from Europe, and Hokusai’s publisher wanted to make extensive use of the new color to make the prints stand out. Indeed, they did. To this day, Hokusai is renowned for popularizing the deep shades of Prussian blue in ukiyo-e prints.

It’s important in any line of work to be open to new ideas. If Hokusai had not been willing to experiment with new pigments, then we wouldn’t have pieces like The Great Wave.

That’s why I’m a huge proponent of Open Testing. What if we open our tests like we open our source? There are so many great advantages to open source software: helping folks learn, helping folks develop better software, and helping folks become better maintainers. If we become more open in our testing, we can improve the quality of our testing work, and thus also the quality of the software products we are building. Open testing involves many things: building open source test frameworks, getting developers involved in testing, and even publicly sharing test cases and results.

Conviction #8: Show empathy

In this article, we’ve seen lots of great artwork, and we’ve learned lots of valuable lessons from it. I think ukiyo-e prints remain popular today because their subject matter focuses on the beauty of the world. Artists strived to make pieces of the “floating world” tangible for the common people.

Ukiyo-e prints revealed the supple humanity of the Japanese people, like in this print by Utagawa Kunisada:

Twilight Snowfall at Ueno
Utagawa Kunisada, 1850

They revealed the serene beauty of nature in harmony with civilization, like in these prints from Hiroshige’s One Hundred Famous Views of Edo:

Prints from One Hundred Famous Views of Edo
Utagawa Hiroshige, 1856-1858

Ukiyo-e prints also revealed ordinary people living out their lives, like this print from Hokusai’s Thirty-six Views of Mount Fuji:

Fuji View Field in Owari Province
Katsushika Hokusai, 1830

Art is compelling. And software, like art, is meant for people. Show empathy. Care about your customers. Remember, as a tester, you are advocating for your users. Try to help solve their problems. Do things that matter for them. Build things that actually bring them value. Be thoughtful, mindful, and humble. Don’t be a jerk.

The Golden Conviction

These eight convictions are things I’ve learned the hard way throughout my career:

  1. Focus on behavior
  2. Prioritize on risk
  3. Automate
  4. Shift left and right
  5. Give fast feedback
  6. Go lean
  7. Open up
  8. Show empathy

I live and breathe these convictions every day. Whether you are making woodblock prints or running test cases, these principles can help you do your best work.

If I could sum up these eight convictions in one line, it would be this: Be excellent in all things. If you test software, then you are both an artist and an engineer. You have a craft. Do it with excellence.

How Q2 uses BDD with SpecFlow for testing PrecisionLender

This case study was written by Andrew Knight, Lead Software Engineer in Test for Q2’s PrecisionLender product, in collaboration with Q2 and Tricentis. It explains the PrecisionLender team’s continuous testing journey and how SpecFlow served as a cornerstone for success.

What is PrecisionLender?

PrecisionLender is a web application that empowers commercial bankers with in-the-moment insights that help them structure and price commercial deals. Andi®, PrecisionLender’s intelligent virtual analyst, delivers these hyper-focused recommendations in real-time, allowing relationship managers to make data-driven decisions while pricing their commercial deals. PrecisionLender is owned and developed by Q2, a financial experience software company dedicated to providing digital banking and lending solutions to banks, credit unions, alternative finance, and fintech companies in the U.S. and internationally.

The PrecisionLender Opportunity Screen
(Picture taken from the PrecisionLender Support Center)

The starting point

The PrecisionLender team had a robust Continuous Integration (CI) delivery pipeline with strong unit test coverage, but they lacked end-to-end feature coverage. Developers would fill this gap by manually inspecting their changes in a shared development environment. However, as the PrecisionLender app grew, manual checks could not cover all possible integrations. The team knew they needed continuous automated testing to provide a safety net for development to remain lean and efficient. In April 2018, they hired Andrew Knight as their first Software Engineer in Test (SET) – a new role for the company – to lead the effort.

Automating tests with SpecFlow

The PrecisionLender team developed the Boa test solution – a project for automating end-to-end tests at scale. Boa would become PrecisionLender’s internal platform for test automation development. The name “Boa” is a loose acronym for “Behavior-Oriented Automation.”

The team chose SpecFlow to be the core framework for Boa tests. Since the PrecisionLender app’s backend is developed using .NET, SpecFlow was a natural fit. SpecFlow’s Gherkin syntax made tests readable and understandable, even to product owners and product support specialists who do not code.

The SpecFlow framework integrates with tools like Selenium WebDriver for testing Web UIs and RestSharp for testing REST APIs to exercise vital pathways for thorough app coverage. SpecFlow’s dependency injection mechanisms are solid yet simple, and the online docs are thorough. Plus, SpecFlow is an open-source project, so anyone can look at its code to learn how things work, open requests for new features, and even offer code contributions.

An example Boa test, written in Gherkin using SpecFlow.

Executing tests with SpecFlow+ Runner

Writing good tests was only part of the challenge. The PrecisionLender team needed to execute Boa tests continuously to provide fast feedback on changes to the app. The team chose to run Boa tests using SpecFlow+ Runner, which is tailored for SpecFlow tests. The team uses SpecFlow+ Runner to launch tests in parallel in TeamCity any time a developer deploys a code change to internal pre-production environments. The entire test suite also runs every night against multiple product configurations. SpecFlow+ Runner produces a helpful test report with everything needed to triage test failures: pass-and-fail tallies overall and per feature, a visual execution timeline, and full system logs. If engineers need to investigate certain failures more closely, they can use SpecFlow tags and SpecFlow+ Runner profiles to selectively filter tests for reruns. SpecFlow+ Runner’s multiple features help the team expedite test execution and investigation.

The SpecFlow+ Runner report for a dozen smoke tests.
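As a rough illustration of the tag-based filtering mentioned above (the tag names and scenario wording here are hypothetical, not taken from the actual Boa suite), scenarios can carry Gherkin tags that a runner profile filters on:

@smoke @pricing
Scenario: Recalculate summary amounts after changing the interest rate
    Given a pricing opportunity is open
    When the user enters a new interest rate
    Then the summary amounts are recalculated

A profile that selects only @smoke scenarios would rerun tests like this one while skipping the rest of the suite.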

Sharing features with SpecFlow+ LivingDoc

Good test cases are more than just verification procedures – they are behavior specifications. They define how features should work. Instead of keeping testing work siloed by role, the PrecisionLender team wanted to share Boa tests as behavior specs with all stakeholders to foster greater collaboration and understanding around features. The team also wanted to share Boa tests with specific customers without sharing the entire automation code.

SpecFlow+ LivingDoc enabled the PrecisionLender team to turn Gherkin feature files into living documentation. Whereas the SpecFlow+ Runner report focuses on automation execution, the SpecFlow+ LivingDoc report focuses on behavior specification apart from coding and automation details. LivingDoc displays Gherkin scenarios in a readable, searchable way that both internal folks and customers can consume. It can also optionally include high-level pass-and-fail results for each scenario, providing just enough information to be helpful and not overwhelming. LivingDoc has also helped PrecisionLender’s engineers identify and eliminate unused step definitions within the automation code. PrecisionLender benefits greatly from complementary reports from SpecFlow+ Runner and SpecFlow+ LivingDoc.

The SpecFlow+ LivingDoc report for a dozen smoke tests with their pass-and-fail results.

Improving interactions with Boa Constrictor

The Boa test solution initially used the Page Object Model to model interactions with the PrecisionLender app. However, as the PrecisionLender team automated more and more Boa tests, it became apparent that page objects did not scale well. Many page object classes had duplicative methods, making automation code messy. Some methods also did not include appropriate waiting mechanisms, introducing flaky failures.

PrecisionLender’s SETs developed Boa Constrictor, a .NET implementation of the Screenplay Pattern, to make better interactions for better automation. In Screenplay, actors use abilities to perform interactions. For example, an ability could be using Selenium WebDriver, and an interaction could be clicking an element. The Screenplay Pattern can be seen as a refactoring of the Page Object Model that minimizes duplicate code through a better separation of concerns. Individual interactions can be hardened for robustness, eliminating flaky hotspots. The Boa test solution now exclusively uses Boa Constrictor for interactions.

In October 2020, Q2 released Boa Constrictor as an open-source project so that anyone can use it. It is fully compatible with SpecFlow and other .NET test frameworks, and it provides rich interactions for Selenium WebDriver and RestSharp out of the box.

Boa Constrictor, the .NET Screenplay Pattern.

Scaling massively with Selenium Grid

When the PrecisionLender team first started automating Boa tests, they ran tests one at a time. That soon became too slow since the average Boa test took 20 to 50 seconds to complete. The team then started running up to 3 tests in parallel on one machine, but that also was not fast enough. They turned to Selenium Grid, a tool for running WebDriver sessions remotely across multiple machines.

PrecisionLender built a set of internal Selenium Grid instances using Microsoft Azure virtual machines to run Boa tests at high scale. As of July 2021, PrecisionLender has over 1800 unique Boa tests that run across four distinct product configurations. Whenever TeamCity detects a code change, it triggers a “continuous” Boa test suite of over 1000 tests, which runs 50 tests in parallel using Google Chrome on Selenium Grid. It completes execution in about 10 minutes. TeamCity launches the full test suite every night against all product configurations with 64-100 parallel tests on Selenium Grid. Continuous Integration currently runs up to 10K Boa tests daily against the PrecisionLender app with SpecFlow+ Runner and Selenium Grid.

The Boa test solution architecture, including Continuous Integration through TeamCity and parallel testing with SpecFlow+ Runner and Selenium Grid.

Shifting left with BDD

Better testing and automation practices eventually inspired better development practices. Product owners would create user stories, but developers would struggle to understand requirements and business purposes fully. PrecisionLender’s SETs started bringing together the Three Amigos – business, development, and testing roles – to discuss product behaviors proactively while creating user stories. They introduced Behavior-Driven Development (BDD) activities like Example Mapping to explore behaviors together. Then, well-defined stories could be easily connected to SpecFlow tests written in Gherkin following Specification by Example (SBE). Teams repeatedly saved time by thinking before coding and specifying before testing. They built higher quality into features from the beginning, and they stopped before working on half-baked stories with unjustified value propositions. Developers who participated in these behavior-driven practices were also more likely to automate Boa tests on their own. Furthermore, one of PrecisionLender’s developers loved BDD practices so much that he joined the team of SETs! Through Gherkin, SpecFlow provided a foundation that enabled quality work to shift left.

Challenges along the way

Achieving true continuous testing had its challenges along the way. Intermittent failure was the most significant issue PrecisionLender faced at scale. With so many tests, environments, and infrastructural pieces, arbitrary failures were statistically unavoidable. The PrecisionLender team took a two-pronged approach to handle intermittent failures: (1) eliminate race conditions in automation using good interactions with Boa Constrictor, and (2) use SpecFlow+ Runner to automatically retry failed tests to determine if failures were consistent or intermittent. These two approaches reduced the frequency of flaky failures and helped engineers quickly resolve any remaining issues. As a result, Boa tests enjoy well above a 99% success rate, and most failures are due to actual bugs.

PrecisionLender app performance at scale was a second big challenge. Running up to 100 tests in parallel turned functional tests into de facto load tests. Testing at scale repeatedly uncovered performance bottlenecks in the app. Performance issues caused widespread test failures that were difficult to diagnose because they appeared intermittently. Still, the visual timeline and timestamps in the SpecFlow+ Runner report helped the team identify periods of failure that could be crosschecked against backend logs, metrics, and database queries. Developers resolved many performance issues and significantly boosted the app’s response times and load capacity.

Training team members to develop solid test automation was the third challenge. At the start of the journey, test automation, Gherkin, and BDD were all new to PrecisionLender. The PrecisionLender SETs took active steps to train others on how to develop good tests and good automation through group workshops, Three Amigos meetings, and one-on-one mentoring sessions. They shared resources like the Automation Panda blog for how to write good tests and good Gherkin. The investment in education paid off: many developers have joined the SETs in writing readable, reliable Boa tests that run continuously.

Benefits to the business

Developing a continuous testing solution brought many incredible benefits to PrecisionLender. First, the quality of the PrecisionLender app improved because continuous testing provided fast feedback on failures that developers could quickly fix. Instead of relying on manual spot checks, the team could trust the comprehensive safety net of Boa tests to catch bugs. Many issues would be caught within an hour of a developer making a code commit, and the longest feedback cycle would be only one business day for the full nightly test suites to run. Boa tests catch failures before customers ever experience them. The continuous nature of testing enables PrecisionLender to publish new releases every two weeks.

Second, the high reliability of the Boa test solution means that the PrecisionLender team can trust test results. When a test passes, the behavior is working. When a test fails, there is a real bug. Reliability also means that engineers spend less time on automation maintenance and more time on more valuable activities, like developing new features and adding new tests. Quality is present in both the product code and the test code.

Third, continuous testing boosts customer confidence in PrecisionLender. Customers trust the software quality because they know that PrecisionLender thoroughly tests every release. The PrecisionLender team also shares SpecFlow+ LivingDoc reports with specific clients to prove quality.

A bright future

PrecisionLender’s continuous testing journey is not over. Since the PrecisionLender team hired its first SET, it has hired three more, in addition to a testing manager, to grow quality improvement efforts. Multiple development teams have written their own Boa tests, and they plan to write more tests independently. SpecFlow’s tools have been indispensable in helping the PrecisionLender team achieve successful quality assurance. As PrecisionLender welcomes more customers, the Boa solution will be ready to scale with more tests, more configurations, and more executions.

Should Gherkin Steps use Past, Present, or Future Tense?

Gherkin’s Given-When-Then syntax is a great structure for specifying behaviors. However, while writing Gherkin may seem easy, writing good Gherkin can be a challenge. One aspect to consider is the tense used for Gherkin steps. Should Gherkin steps use past, present, or future tense?

One approach is to use present tense for all steps, like this:

Scenario: Simple Google search
    Given the Google home page is displayed
    When the user searches for "panda"
    Then the results page shows links related to "panda"

Notice the tense of each verb:

  1. the home page is – present
  2. the user searches – present
  3. the results page shows – present

Present tense is the simplest verb tense to use. It is the least “wordy” tense, and it makes the scenario feel active.

An alternative approach is to use past-present-future tense for Given-When-Then steps respectively, like this:

Scenario: Simple Google search
    Given the Google home page was displayed
    When the user searches for "panda"
    Then the results page will show links related to "panda"

Notice the different verb tenses in this scenario:

  1. the home page was – past
  2. the user searches – present
  3. the results page will show – future

Scenarios exercise behavior. Writing When steps using present tense centers the scenario’s main actions in the present. Since Given steps must happen before the main actions, they would be written using past tense. Likewise, since Then steps represent expected outcomes after the main actions, they would be written using future tense.

Both of these approaches – using all present tense or using past-present-future in order – are good. Personally, I prefer to write all steps using present tense. It’s easier to explain to others, and it frames the full scenario in the moment. However, I don’t think other approaches are good. For example, writing all steps using past tense or future tense would seem weird, and writing steps in order of future-present-past tense would be illogical. Scenarios should be centered in the present because they should timelessly represent the behaviors they cover.

Want to learn more? Check out my other BDD articles, especially Writing Good Gherkin.

Solving: How to write good UI interaction tests? #GivenWhenThenWithStyle

Writing good Gherkin is a passion of mine. Good Gherkin means good behavior specification, which results in better features, better tests, and ultimately better software. To help folks improve their Gherkin skills, Gojko Adzic and SpecFlow are running a series of #GivenWhenThenWithStyle challenges. I love reading each new challenge, and in this article, I provide my answer to one of them.

The Challenge

Challenge 20 states:

This week, we’re looking into one of the most common pain points with Given-When-Then: writing automated tests that interact with a user interface. People new to behaviour driven development often misunderstand what kind of behaviour the specifications should describe, and they write detailed user interactions in Given-When-Then scenarios. This leads to feature files that are very easy to write, but almost impossible to understand and maintain.

Here’s a typical example:

Scenario: Signed-in users get larger capacity
 
Given a user opens https://www.example.com using Chrome
And the user clicks on "Upload Files"
And the page reloads
And the user clicks on "Spreadsheet Formats"
Then the buttons "XLS" and "XLSX" show
And the user clicks on "XLSX"
And the user selects "500kb-sheet.xlsx"
Then the upload completes
And the table "Uploaded Files" contains a cell with "500kb-sheet.xlsx" 
And the user clicks on "XLSX"
And the user selects "1mb-sheet.xlsx"
Then the upload fails
And the table "Uploaded Files" does not contain a cell with "1mb-sheet.xlsx" 
And the user clicks on "Login"
And the user enters "testuser123" into the "username" field
And the user enters "$Pass123" into the "password" field
And the user clicks on "Sign in"
And the page reloads
Then the table "Uploaded Files" contains a cell with "500kb-sheet.xlsx" 
And the table "Uploaded Files" does not contain a cell with "1mb-sheet.xlsx" 
And the user clicks on "spreadsheet formats"
Then the buttons "XLS" and "XLSX" show
And the user clicks on "XLSX"
And the user selects "1mb-sheet.xlsx"
Then the upload completes
And the table "Uploaded Files" contains a cell with "1mb-sheet.xlsx" 
And the table "Uploaded Files" contains a cell with "500kb-sheet.xlsx"

A common way to avoid such issues is to rewrite the specification to avoid the user interface completely. We’ve looked into that option several times in this article series. However, that solution only applies if the risk we’re testing is not in the user interface, but somewhere below. To make this challenge more interesting, let’s say that we actually want to include the user interface in the test, since the risk is in the UI interactions.

Indeed, most behavior-driven practitioners would generally recommend against phrasing steps using language specific to the user interface. However, there are times when testing a user interface itself is valid. For example, I work at PrecisionLender, a Q2 Company, and our main web app is very heavy on the front end. It has many, many interconnected fields for pricing commercial lending opportunities. My team has quite a few tests to cover UI-centric behaviors, such as verifying that entering a new interest rate triggers recalculation for summary amounts. If the target behavior is a piece of UI functionality, and the risk it bears warrants test coverage, then so be it.

Let’s break down the example scenario given above to see how to write Gherkin with style for user interface tests.

Understanding Behavior

Behavior is behavior. If you can describe it, then you can do it. Everything exhibits behavior, from the source code itself to the API, UIs, and full end-to-end workflows. Gherkin scenarios should use verbiage that reflects the context of the target behavior. Thus, the example above uses words like “click,” “select,” and “open.” Since the scenario explicitly covers a user interface, I think it is okay to use these words here. What bothers me, however, are two apparent code smells:

  1. The wall of text
  2. Out-of-order step types

The first issue is the wall of text this scenario presents. Walls of text are hard to read because they present too much information at once. The reader must take time to read through the whole chunk. Many readers simply read the first few lines and then skip the remainder. The example scenario has 27 Given-When-Then steps. Typically, I recommend that Gherkin scenarios have a single-digit number of steps. A scenario with fewer than 10 steps is easier to understand and less likely to include unnecessary information. Longer scenarios are not necessarily “wrong,” but their longer lengths indicate that, perhaps, these scenarios could be rewritten more concisely.

The second issue in the example scenario is that step types are out of order. Given-When-Then is a formula for success. Gherkin steps should follow strict Given → When → Then ordering because this ordering demarcates individual behaviors. Each Gherkin scenario should cover one individual behavior so that the target behavior is easier to understand, easier to communicate, and easier to investigate whenever the scenario fails during testing. When scenarios break the order of steps, such as Given → Then → Given → Then in the example scenario, it shows that either the scenario covers multiple behaviors or that the author did not bring a behavior-driven understanding to the scenario.

The rules of good behavior don’t disappear when the type of target behavior changes. We should still write Gherkin with best practices in mind, even if our scenarios cover user interfaces.

Breaking Down Scenarios

If I were to rewrite the example scenario, I would start by isolating individual behaviors. Let’s look at the first half of the original example:

Given a user opens https://www.example.com using Chrome
And the user clicks on "Upload Files"
And the page reloads
And the user clicks on "Spreadsheet Formats"
Then the buttons "XLS" and "XLSX" show
And the user clicks on "XLSX"
And the user selects "500kb-sheet.xlsx"
Then the upload completes
And the table "Uploaded Files" contains a cell with "500kb-sheet.xlsx" 
And the user clicks on "XLSX"
And the user selects "1mb-sheet.xlsx"
Then the upload fails
And the table "Uploaded Files" does not contain a cell with "1mb-sheet.xlsx"

Here, I see four distinct behaviors covered:

  1. Clicking “Upload Files” reloads the page.
  2. Clicking “Spreadsheet Formats” displays new buttons.
  3. Uploading a spreadsheet file makes the filename appear on the page.
  4. Attempting to upload a spreadsheet file that is 1MB or larger fails.

If I wanted to purely retain the same coverage, then I would rewrite these behavior specs using the following scenarios:

Feature: Example site
 
 
Scenario: Choose to upload files
 
Given the Example site is displayed
When the user clicks the "Upload Files" link
Then the page displays the "Spreadsheet Formats" link
 
 
Scenario: Choose to upload spreadsheets
 
Given the Example site is ready to upload files
When the user clicks the "Spreadsheet Formats" link
Then the page displays the "XLS" and "XLSX" buttons
 
 
Scenario: Upload a spreadsheet file that is smaller than 1MB
 
Given the Example site is ready to upload spreadsheet files
When the user clicks the "XLSX" button
And the user selects "500kb-sheet.xlsx" from the file upload dialog
Then the upload completes
And the table "Uploaded Files" contains a cell with "500kb-sheet.xlsx" 
 
 
Scenario: Upload a spreadsheet file that is larger than or equal to 1MB
 
Given the Example site is ready to upload spreadsheet files
When the user clicks the "XLSX" button
And the user selects "1mb-sheet.xlsx" from the file upload dialog
Then the upload fails
And the table "Uploaded Files" does not contain a cell with "1mb-sheet.xlsx"

Now, each scenario covers one individual behavior. The first scenario starts with the Example site in a “blank” state: “Given the Example site is displayed”. The second scenario inherently depends upon the outcome of the first scenario. Rather than repeat all the steps from the first scenario, I wrote a new starting step to establish the initial state more declaratively: “Given the Example site is ready to upload files”. This step’s definition method may need to rerun the same operations as the first scenario, but it guarantees independence between scenarios. (The step could also optimize the operations, but that should be a topic for another challenge.) Likewise, the third and fourth scenarios have a Given step to establish the state they need: “Given the Example site is ready to upload spreadsheet files.” Both scenarios can share the same Given step because they have the same starting point.

All three of these new steps are descriptive more than prescriptive. They declaratively establish an initial state, and they leave the details to the automation code in the step definition methods to determine precisely how that state is established. This technique makes it easy for Gherkin scenarios to be individually clear and independently executable.

I also added my own writing style to these scenarios. First, I wrote concise, declarative titles for each scenario. The titles dictate interaction over mechanics. For example, the first scenario’s title uses the word “choose” rather than “click” because, from the user’s perspective, they are “choosing” an action to take. The user will just happen to mechanically “click” a link in the process of making their choice. The titles also provide a level of example. Note that the third and fourth scenarios spell out the target file sizes. For brevity, I typically write scenario titles using active voice: “Choose this,” “Upload that,” or “Do something.” I try to avoid including verification language in titles unless it is necessary to distinguish behaviors.

Another stylistic element of mine was to remove explicit details about the environment. Instead of hard-coding the website URL, I gave the site a proper name: “Example site.” I also removed the mention of Chrome as the browser. These details are environment-specific and should not be specified in Gherkin. In theory, this site could have multiple instances (like an alpha or a beta), and it should probably run in any major browser (like Firefox and Edge). Environmental characteristics should be specified as inputs to the automation code instead.

I also refined some of the language used in the When and Then steps. When I must write steps for mechanical actions like clicks, I like to specify element types for target elements. For example, “When the user clicks the “Upload Files” link” specifies a link by a parameterized name. Saying the element is a link helps provide context to the reader about the user interface. I wrote other steps that specify a button, too. These steps also specify the element name as a parameter so that the step definition method could perform the same interaction for different elements. Keep in mind, however, that these linguistic changes are neither “required” nor “perfect.” They make sense in the immediate context of this feature. While automating step definitions or writing more scenarios, I may revisit the verbiage and do some refactoring.
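
Here is a similar sketch showing how those parameterized click steps could glue to automation code (again assuming behave and Selenium WebDriver; the button locator strategy is purely illustrative):

from behave import when
from selenium.webdriver.common.by import By

@when('the user clicks the "{link_name}" link')
def step_click_link(context, link_name):
    # One step definition handles any link because the element name is a parameter.
    context.browser.find_element(By.LINK_TEXT, link_name).click()

@when('the user clicks the "{button_name}" button')
def step_click_button(context, button_name):
    # A separate step keeps the element type ("button") explicit in the Gherkin.
    xpath = f'//button[text()="{button_name}"]'  # illustrative locator
    context.browser.find_element(By.XPATH, xpath).click()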

Determining Value for Each Behavior

The four new scenarios I wrote each cover an independent, individual behavior of the fictitious Example site’s user interface. They are thorough in their level of coverage for these small behaviors. However, not all behaviors may be equally important to cover. Some behaviors are simply more important than others, and thus some tests are more valuable than others. I won’t go into deep detail about how to measure risk and determine value for different tests in this article, but I will offer some suggestions regarding these example scenarios.

First and foremost, you as the tester must determine what is worth testing. These scenarios aptly specify behavior, and they will likely be very useful for collaborating with the Three Amigos, but not every scenario needs to be automated for testing. You as the tester must decide. You may decide that all four of these example scenarios are valuable and should be added to the automated test suite. That’s a fine decision. However, you may instead decide that certain user interface mechanics are not worth explicitly testing. That’s also a fine decision.

In my opinion, the first two scenarios could be candidates for the chopping block:

  1. Choose to upload files
  2. Choose to upload spreadsheets

Even though these are existing behaviors in the Example site, they are tiny. The tests simply verify that clicking certain links makes other links or buttons appear. It would be nice to verify them, but test execution time is finite, and user interface tests are notoriously slow compared to other tests. Consider the Rule of 1’s: typically, by orders of magnitude, a unit test takes about 1 millisecond, a service API test takes about 1 second, and a web UI test takes about 1 minute. At that rate, a suite of 100 tests would finish in roughly a tenth of a second as unit tests, in under two minutes as API tests, but in well over an hour and a half as web UI tests. Furthermore, these behaviors are implicitly exercised by the other scenarios, even if they don’t have explicit assertions.

One way to condense the scenarios could be like this:

Feature: Example site
 
 
Background:
 
Given the Example site is displayed
When the user clicks the "Upload Files" link
And the user clicks the "Spreadsheet Formats" link
And the user clicks the "XLSX" button
 
 
Scenario: Upload a spreadsheet file that is smaller than 1MB
 
When the user selects "500kb-sheet.xlsx" from the file upload dialog
Then the upload completes
And the table "Uploaded Files" contains a cell with "500kb-sheet.xlsx" 
 
 
Scenario: Upload a spreadsheet file that is larger than or equal to 1MB
 
When the user selects "1mb-sheet.xlsx" from the file upload dialog
Then the upload fails
And the table "Uploaded Files" does not contain a cell with "1mb-sheet.xlsx" 

This new feature file eliminates the first two scenarios and uses a Background section to cover the setup steps. It also eliminates the need for special Given steps in each scenario to set unique starting points. Implicitly, if the “Upload Files” or “Spreadsheet Formats” links fail to display the expected elements, then those steps would fail.

Again, this modification is not necessarily the “best” way or the “right” way to cover the desired behaviors, but it is a reasonably good way to do so. However, I would assert that both the 4-scenario feature file and the 2-scenario feature file are much better approaches than the original example scenario.

More Gherkin

What I showed in my answer to this Gherkin challenge is how I would handle UI-centric behaviors. I try to keep my Gherkin scenarios concise and focused on individual, independent behaviors. Try using these style techniques to rewrite the second half of Gojko’s original scenario. Feel free to drop your Gherkin in the comments below. I look forward to seeing how y’all write #GivenWhenThenWithStyle!

SpecFlow’s Online Gherkin Editor

Finding a good Gherkin editor is difficult. Some editors like Visual Studio Code and similar IDEs work great for engineers but aren’t suitable for product owners and non-programmer Amigos who want to contribute. Other editors like Notepad++ and Atom are lighter in weight but still require extensions and a little expertise. Fancy BDD tools like CucumberStudio and Cucumber for Jira provide Gherkin editors together with a bunch of other nifty features, but they require paid licenses.

For years, I’ve wanted a lightweight Gherkin editor that’s easy to use and accessible to all. Now, one finally exists: the Online Gherkin Editor by SpecFlow!

SpecFlow is the most popular BDD test automation framework for .NET. It’s also my favorite BDD framework. Over the past few years, I’ve built two large-scale test automation solutions with SpecFlow.

The Online Gherkin Editor by SpecFlow is just an editor on a web page. When you first load the page, the editor has example scenarios for you to reference. You can type your own Gherkin into the text area, and the editor highlights it for you. The editor provides line numbers and visual scrolling, too. My language is English, but if you happen to speak German, French, Spanish, or Dutch, then you can change the language setting via a dropdown. Once you’re done writing your Gherkin, you can clear it, copy it to the clipboard, or download it as a feature file using icons in the top-right corner. Be warned, though, that this editor won’t save your Gherkin in the cloud.

If you want to give this new editor a try, here’s the link: https://specflow.org/gherkin-editor/

You can also read SpecFlow’s official announcement here: https://specflow.org/blog/introducing-the-specflow-online-gherkin-editor/

Thanks, SpecFlow! Happy “Gherk-ing”!

Beyond Unit Tests: End-to-End Web UI Testing

On October 4, 2019, I gave a talk entitled Beyond Unit Tests: End-to-End Web UI Testing at PyGotham 2019. Check it out below! I show how to write a concise-yet-complete test solution for Web UI test cases using Python, pytest, and Selenium WebDriver.

This talk is a condensed version of my Hands-On Web UI Testing tutorials that I delivered at DjangoCon 2019 and PyOhio 2019. If you’d like to take the full tutorial, check out https://github.com/AndyLPK247/djangocon-2019-web-ui-testing. Full instructions are in the README.

Be sure to check out the other PyGotham 2019 talks, too. My favorite was Dungeons & Dragons & Python: Epic Adventures with Prompt-Toolkit and Friends by Mike Pirnat.

How Do We Write Good Gherkin as Part of BDD? (Webinar + Q&A)

On July 23, 2019, I gave a webinar entitled, “How Do We Write Good Gherkin as Part of BDD?” in collaboration with Paul Merrill and his company, Beaufort Fairmont. This webinar was the follow-up to a previous webinar, What Is BDD, and How Do We Practice It? It was an honor to partner with Paul again to go further into BDD practices. (If you want to learn more about BDD, check out Beaufort Fairmont’s two-day BDD training offering, as well as their blog and other webinars.)

To see my webinar recording, register here. Definitely watch the previous webinar first.

Just like last time, attendees asked several great questions that we simply could not answer live. I categorized all questions we received and answered them below. Please note that some questions might be rephrased or combined with others.

Questions about BDD

What is BDD?

Behavior-Driven Development! Read more here.

In a typical Agile development process, who should write feature files?

The Three Amigos! Product owners, developers, and testers should all come together to figure out behaviors. I recommend doing Example Mapping to formulate rules and examples before writing Gherkin scenarios. The green example cards should be turned into feature files. The specific person who writes the feature files is up to team preference. It could be a collaborative effort, or it could be divided-and-conquered. Any one of the Three Amigos can do it.

How can we apply BDD to SAFe (Scaled Agile Framework) teams?

BDD practices like Three Amigos meetings, Example Mapping, Behavior Specification with Gherkin, and Behavior Implementation can become part of any process. All of these practices happen at the level of the development teams. Teams could even share Gherkin steps and test frameworks wherever sharing makes sense. Check out BDD 101: Behavior-Driven Agile.

What advice can you give to teams that use BDD tests frameworks solely as an automation tool and not part of a greater BDD process?

Do the best with what you’ve got. Try to show how other BDD practices can pragmatically improve your team’s development and delivery work.

Questions about Gherkin Syntax

What is the difference between a scenario and a scenario outline?

A scenario is a procedure of Given-When-Then steps that covers one example for one behavior. If there are any parameters for steps, then a scenario has exactly one combination of possible inputs. A scenario outline is a Given-When-Then procedure that can have multiple examples of one behavior provided as a table of input combos. Each input row will run the same steps once, just with different parameter inputs. See BDD 101: Gherkin by Example for examples.

What do you think about long tables in scenarios?

Long tables in Gherkin usually look terrible. They’re hard to read, and they create a wall of text. They may also include unnecessary variations. Stick to the Unique Example rule.

Are Given steps mandatory, or can scenarios start directly with When steps?

None of the step types are mandatory. It is valid to write a scenario that skips the Given and has only When-Then steps. It is also valid to write scenarios that are Given-Then or Given-When. In fact, it is syntactically valid to put steps in any order. However, I strongly recommend keeping Given-When-Then step order to properly frame behaviors.

Are quotation marks required for parameters?

No, quotation marks are not required for parameters, but they are a popular convention, and one that I recommend. Quotes make parameters easy to identify.

Questions about Gherkin Scenarios

How do we make sure each scenario focuses on an individual, independent behavior?

Do Example Mapping first as a team. Write scenarios together, or review them with others. Ask, “What makes this behavior unique?” Make sure to use strict Given-When-Then step order when defining the behavior. Rethink the scenario if it is more than 10 lines long. Look out for unnecessary complication.

What does it mean for a scenario to be “chronological”?

Scenario steps should be written as if they were on a timeline. Each step will be executed after the previous one, so its description must start where the previous one ended. Remember, steps will be automated as if they were scripts.

How do we write a very low-level scenario without having a wall of text?

Don’t write low-level scenarios! Gherkin is best for feature testing, not unit testing. Steps should focus on intention and business value. Instead of writing “type, type, click, wait,” write “log into the app.” If you absolutely must write a low-level scenario, remember that the same principles apply. Be intuitively descriptive. Focus on individual behaviors. Keep scenarios concise.

If all scenarios in a feature file have only one user, is it okay to use first-person perspective instead of third-person?

In my opinion, no. I favor third-person perspective universally. Trying to limit usage to one feature file won’t work because any step can be used by any feature file within a test project. The entire solution must be either first-person or third-person. There’s no middle ground.

Can we write Gherkin scenarios with personas?

Yes! Personas can make scenarios more meaningful and understandable. Make sure to define the personas well – they could be described under the Feature section or in a separate text file.

How do we write Gherkin scenarios that need to validate lots of information on a page?

Pick the most important pieces of information to check. You could write separate Then steps for each assertion, or you could push small-but-similar validations down to the automation level to avoid Gherkin clutter.

How do we write Gherkin scenarios for validating Web UI fields?

Typically, I treat each field validation as an independent behavior, and thus I write separate scenarios to check each field. If the scenario steps simply enter a textual value and verify a specific message, then I might make a Scenario Outline with example rows for each equivalence class of inputs.

How do we write Gherkin scenarios that have multiple inputs and setup steps? (Example: an API with ten parameters)

Gherkin allows multiple steps of the same type to be written using “And” and “But” keywords. It’s not a problem to have “Given-And-And” or “When-And-And”. If you discover that different scenarios repeat the same setup steps, then I recommend either moving those common steps to a Background section or writing a new step that covers multiple calls (for conciseness).

One example from the webinar showed searching for shoes and adding them to a shopping cart as part of one scenario. Aren’t those two different behaviors?

Here’s the scenario in question:

Scenario: Add shoes to the shopping cart
  Given the ShoeStore home page is displayed
  When the shopper searches for “red pumps”
  And the shopper adds the first result to the cart
  Then the cart has one pair of “red pumps”

We could have split this scenario into two. I just chose to define the behavior this way. This scenario is a bit more end-to-end because it covers a basic but typical workflow. It may also have leveraged existing steps, which expedites automation development. Overall, the scenario is still concise, chronological, and intuitively understandable. Remember, there is an art as well as a science to writing good Gherkin.

Questions about Automation

Do scenarios need to be independent of each other?

Yes, unequivocally. Tests that are not independent could interfere with each other and cause unexpected failures. Independence also reinforces singular behavioral focus.

How do we start a scenario “in media res” without it depending on other tests?

At the Gherkin level, write Given steps that define a new starting point for the behavior. For example, many teams develop Web apps. It’s common to think that the starting point for all tests is login. However, the starting point can be a few pages after login.

At the automation level, it may be useful to implement the Given steps by calling other steps. For example, if a Given step should start at a user’s profile page, then perhaps it could internally call the login step and the click-the-profile-link step. Test steps may repetitively do the same operations for different tests, but test case independence will be preserved, and unique failures will be reported.
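
For instance, behave provides context.execute_steps for exactly this kind of composition. The step wordings below are hypothetical:

from behave import given

@given("the user is on their profile page")
def step_on_profile_page(context):
    # Compose existing steps instead of depending on another scenario having run first.
    context.execute_steps('''
        Given the user is logged in
        When the user clicks the "Profile" link
    ''')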

What is the best way to handle preconditions like logging into a Web app?

The simplest way to handle preconditions is to write Given steps. If those Given steps are shared by all scenarios in a feature file, then move them to a Background section. Automation hooks can also perform common setup and cleanup actions, depending upon the test framework. Personally, I prefer to use hooks to do automatic login rather than repeat Given steps for many scenarios.
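
Here is a sketch of the hook approach, assuming behave. The tag name, element locators, and config values are made up for illustration, and context.browser is assumed to have been created in before_all:

# environment.py – behave hooks
from selenium.webdriver.common.by import By

def before_scenario(context, scenario):
    # Log in automatically for tagged scenarios instead of repeating Given steps.
    if "requires-login" in scenario.tags:
        context.browser.get(context.config.userdata["site_url"])
        context.browser.find_element(By.ID, "username").send_keys(context.config.userdata["username"])
        context.browser.find_element(By.ID, "password").send_keys(context.config.userdata["password"])
        context.browser.find_element(By.ID, "login-button").click()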

Is it better to set up and tear down new test objects for each test case, or is it better to use shared, pre-created objects?

That depends upon the object. Most objects like WebDrivers and page objects should have scenario scope, meaning they are created fresh for each scenario and then torn down when the scenario ends. The only time an object should be shared across scenarios is if it is immutable or very expensive to create. For example, configuration data could be read in once before all tests and then injected immutably into each scenario. The safe position is always to use fresh objects; justify why sharing is needed before trying it.
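
For example, pytest fixtures (which pytest-bdd uses for dependency injection) express these lifetimes directly through scopes. This is only a sketch; the config file name and the choice of Chrome are assumptions:

import json

import pytest
from selenium import webdriver

@pytest.fixture(scope="session")
def config():
    # Immutable, read-only data can be loaded once and shared by all scenarios.
    with open("config.json") as f:
        return json.load(f)

@pytest.fixture  # default "function" scope: one fresh instance per scenario
def browser(config):
    driver = webdriver.Chrome()
    driver.implicitly_wait(config.get("timeout", 10))
    yield driver
    driver.quit()  # torn down when the scenario ends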

I want to use Serenity for BDD and testing. Should I use Cucumber-like Gherkin feature files, or should I use Serenity’s native methods?

That’s up to you and your team. Personally, I would still use Gherkin feature files with Serenity. I like to separate my test case from my test code. Everyone can read Gherkin feature files, but not everyone can read Java or JavaScript test methods.

If a company already has a large BDD test solution that is poorly implemented, would it be better to keep it going or try to change it?

This question can be applied to all software projects, not just BDD test solutions. The answer is situational. Personally, I favor doing things right, even if it means refactoring. Please read Our Test Automation Has Problems. Should We Start Over? for a thorough answer.

Final Questions

Why do you call yourself “Pandy” and the “Automation Panda”?

Pandas are awesome. Everybody loves them. And nobody forgets my moniker. The nickname “Pandy” came about in the Python community to distinguish me from other folks named “Andy.”

Where can I get team training in BDD?

Beaufort Fairmont provides a one- or two-day course in BDD and writing Gherkin. Sign up for more information here.

Python BDD Framework Comparison

Almost every major programming language has BDD test frameworks, and Python is no exception. In fact, Python has several! So, how do they compare, and which one is best? Let’s find out.

Head-to-Head Comparison

behave

behave is one of the most popular Python BDD frameworks. Although it is not officially part of the Cucumber project, it functions very similarly to Cucumber frameworks.
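
For a taste of how behave glues Gherkin steps to Python code, here is a minimal sketch (the scenario it supports is hypothetical):

# features/steps/cart_steps.py – a minimal behave sketch
from behave import given, when, then

@given("the cart is empty")
def step_cart_is_empty(context):
    # The behave "context" object carries state between steps.
    context.cart = []

@when('the shopper adds "{item}" to the cart')
def step_add_item(context, item):
    context.cart.append(item)

@then("the cart contains {count:d} item")
def step_cart_count(context, count):
    assert len(context.cart) == count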

Pros

  • It fully supports the Gherkin language.
  • Environmental functions and fixtures make setup and cleanup easy.
  • It has Django and Flask integrations.
  • It is popular with Python BDD practitioners.
  • Online docs and tutorials are great.
  • It has PyCharm Professional Edition support.

Cons

  • There’s no support for parallel execution.
  • It’s a standalone framework.
  • Sharing steps between feature files can be a bit of a hassle.

pytest-bdd

pytest-bdd is a plugin for pytest that lets users write tests as Gherkin feature files rather than test functions. Because it integrates with pytest, it can work with any other pytest plugins, such as pytest-html for pretty reports and pytest-xdist for parallel testing. It also uses pytest fixtures for dependency injection.
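
For comparison with the behave sketch above, here is the same kind of glue written for pytest-bdd. The feature file name and step wordings are assumptions:

# test_cart.py – a pytest-bdd sketch
import pytest
from pytest_bdd import scenarios, given, when, then, parsers

scenarios("cart.feature")  # explicitly binds the feature file's scenarios to this module

@pytest.fixture
def cart():
    # pytest fixtures carry state between steps instead of a "context" object.
    return []

@given("the cart is empty")
def cart_is_empty(cart):
    assert cart == []

@when(parsers.parse('the shopper adds "{item}" to the cart'))
def add_item(cart, item):
    cart.append(item)

@then(parsers.parse("the cart contains {count:d} item"))
def cart_has_count(cart, count):
    assert len(cart) == count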

Pros

  • It is fully compatible with pytest and major pytest plugins.
  • It benefits from pytest’s community, growth, and goodness.
  • Fixtures are a great way to manage context between steps.
  • Tests can be filtered and executed together with other pytest tests.
  • Step definitions and hooks are easily shared using conftest.py.
  • Tabular data can be handled better for data-driven testing.
  • Online docs and tutorials are great.
  • It has PyCharm Professional Edition support.

Cons

  • Step definition modules must have explicit declarations for feature files (via “@scenario” or the “scenarios” function).
  • Scenario outline steps must be parsed differently.

radish

radish is a BDD framework with a twist: it adds new syntax to the Gherkin language. Language features like scenario loops, scenario preconditions, and constants make radish’s Gherkin variant more programmatic for test cases.

Pros

  • Gherkin language extensions empower testers to write better tests.
  • The website, docs, and logo are on point.
  • Feature files and step definitions come out very clean.

Cons

  • It’s a standalone framework with limited extensions.
  • BDD purists may not like the additions to the Gherkin syntax.

lettuce

lettuce is another vegetable-themed Python BDD framework that’s been around for years. However, the website and the code haven’t been updated for a while.

Pros

  • Its code is simpler.
  • It’s tried and true.

Cons

  • It lacks the feature richness of the other frameworks.
  • It doesn’t appear to have much active, ongoing support.

freshen

freshen was one of the first BDD test frameworks for Python. It was a plugin for nose. However, both freshen and nose are no longer maintained, and their doc pages explicitly tell readers to use other frameworks.

My Recommendations

None of these frameworks are perfect, but some have clear advantages. Overall, my top recommendation is pytest-bdd because it benefits from the strengths of pytest. I believe pytest is one of the best test frameworks in any language because of its conciseness, fixtures, assertions, and plugins. The 2018 Python Developers Survey showed that pytest is, by far, the most popular Python test framework, too. Even though pytest-bdd doesn’t feel as polished as behave, I think some TLC from the open source community could fix that.

Here are other recommendations:

  • Use behave if you want a robust, clean experience with the largest community.
  • Use pytest-bdd if you need to integrate with other plugins, already have a bunch of pytest tests, or want to run tests in parallel.
  • Use radish if you want more programmatic control of testing at the Gherkin layer.
  • Don’t use lettuce or freshen.

What is BDD, and How Do We Practice It? (Webinar + Q&A)

On March 18, 2019, I gave a webinar entitled, “What is Behavior-Driven Development, and How Do We Practice It?” in collaboration with Paul Merrill and his company, Beaufort Fairmont. It was both a pleasure and an honor to do this webinar with them. Paul is a top-notch test automation expert, and Beaufort Fairmont is doing really exciting things. Check out their two-day BDD training offering, as well as their blog and other webinars.

To see my webinar recording, register here.

During the webinar, attendees asked more questions than we could answer. I’m excited that so many people asked questions. My answers are below.

Questions about Process

How is BDD different from TDD (Test-Driven Development)?

BDD is an evolution of TDD. In TDD, developers (1) write unit tests and watch them fail, (2) develop the feature to make the tests pass, (3) refactor the code to make it stronger, and (4) repeat the cycle. In BDD, teams do this same loop with feature tests (a.k.a “acceptance” or “black-box” tests) as well as unit tests. Furthermore, BDD adds shift left practices like Example Mapping and Specification by Example so that teams know what they are doing and focus on developing the right things.

Check out Dan North’s article, Introducing BDD, for a more thorough answer.

Can BDD be used with manual testing?

Yes! BDD is not merely an automation tool – it is a set of pragmatic practices to help teams develop better software. Gherkin scenarios are first and foremost behavior specs that help a team’s collaboration and accountability. They function secondarily as test cases that can be executed either manually or with automation.

Can we use BDD with technical stories or backend features?

Yes! If you can describe it, then you can do it.

How many Gherkin scenarios should one story have?

There’s no hard rule, but I recommend no more than a handful of rules per story, and no more than a handful of examples per rule. If you do Example Mapping and feel overwhelmed by the number of cards for a story, then the story should probably be broken into smaller stories.

Should we do Example Mapping for every story? Spending 20-30 minutes for each story would take a long time.

Try doing Example Mapping on one or two stories to start. The first time is always rough, but as you iterate on it, you’ll get better as a team. Even though Example Mapping has an upfront time cost, it will save a lot of time later in the sprint because (a) acceptance criteria are clear, (b) tests are already written, and (c) everyone has a mutual understanding of the story. The team won’t suffer through the inefficiencies of miscommunication and poor planning. You may even want to replace planning meetings with Example Mapping meetings.

What metrics should we use with BDD?

All metrics are flawed, but some metrics are useful. All the standard testing and Agile metrics still apply: code coverage, story velocity, etc. Here are some additional metrics you may consider for BDD:

  • the percentage of stories that undergo Example Mapping before the sprint
  • the number of rules and examples that get “missed” during Example Mapping and need to be added later
  • the percentage of Gherkin scenarios that get automated in the sprint

If you choose to track metrics, make sure their feedback is used to improve team practices. For more info on metrics, please read my Quality Metrics 101 series.

What were the resources you recommended at the end of the webinar?

Questions about Tools

What test management tools should we use with BDD?

I’m sure there are BDD plugins for test management tools, but I don’t have any that I can personally recommend. To be honest, I try to stay away from large test management tools like HP ALM, qTest, and VersionOne. When doing BDD, the Gherkin feature files themselves should be the single source of truth for feature-level tests, and they should be version-controlled in a repository. Don’t fall into the trap of slapping “Given-When-Then” keywords onto existing functional tests – that’s not BDD.

Does Jira support Example Mapping?

I have not personally used any Jira plugin for Example Mapping. It looks like there is an Easy Agile User Story Maps plugin that is similar to but slightly different from Example Mapping.

Are there other good tools for BDD and Example Mapping?

What’s the difference between Gherkin, Cucumber, and SpecFlow?

  • Gherkin is the Given-When-Then spec language.
  • Cucumber is a company and its eponymous test framework that uses Gherkin.
  • SpecFlow is Cucumber for .NET.

Questions about Testing

Can BDD test frameworks be used for unit testing?

Yes, but I don’t recommend it. BDD frameworks shine for black-box feature testing. They’re a bit too verbose for code-level unit tests. Read BDD 101: Unit, Integration, and End-to-End Tests for more info.

Can BDD test frameworks be used for integration testing?

Yes! See BDD 101: Unit, Integration, and End-to-End Tests.

How long should Gherkin scenarios be?

Scenarios should be bite-sized. Each scenario should focus on one individual behavior. There’s no hard rule, but I recommend single-digit step counts. Read BDD 101: Writing Good Gherkin for more info.

What are “step definitions” in Cucumber?

Step definitions are the methods in the automation code that execute the steps. When a BDD framework runs a Gherkin scenario as a test, it “glues” each step to a step definition based on some sort of string matching.

How can we minimize duplicate code within a BDD test framework?

Know your steps. Always search for existing steps before writing new steps. Refactor existing steps whenever appropriate. Reuse steps when writing new scenarios. Do pair programming or mob programming when writing scenarios. Put scenarios through code reviews. Apply good coding practices – remember, test automation is software.

I write Gherkin scenarios, but I don’t write test automation code. What’s the best way to write Gherkin scenarios so that they can be automated?

Do pair programming with the automation engineers to write Gherkin scenarios together. Become familiar with existing steps by reading and searching feature files. Otherwise, the Gherkin steps you write in isolation might not be usable. Remember, BDD is a team effort!

The examples in the webinar were all fairly basic. Do you have any examples with more complex systems?

I have some example projects on GitHub in Python and Java with some basic unit, integration, and end-to-end tests, but I don’t have any large-scale examples that I can share publicly.

We wrote hundreds of SpecFlow tests without the other Amigos. Now, there are large test gaps, and many steps aren’t reusable. What should we do?

I’m sorry to hear that. It’s not an uncommon story. There are two paths: (1) refactoring or (2) starting over. Without really knowing the situation, I don’t think it’s my place to say which way is better. Here are some questions to help guide your decision:

  • What are your goals for testing and automation?
  • What’s your overall quality and testing strategy?
  • What parts of the code base are salvageable?
  • What parts of the code base should be removed?
  • If you started again from scratch, what would you do differently to make sure the same problems don’t reoccur?

I strongly recommend taking the Setting a Foundation for Successful Test Automation course from Test Automation University. (It’s free.) I also gave a talk about this very problem, Egad! How Do We Start Writing (Better) Tests?, at a few Python conferences.

We have a large BDD test suite with heavy coupling and slow execution times. The business amigos have also left the company. Should we try to fix what we have or just start over?

Sorry to hear that; same answer as before.

Final Questions

Why do you call yourself the “Automation Panda”?

Pandas are awesome. Everybody loves them. And nobody forgets my moniker.

Where can I get team training in BDD?

Beaufort Fairmont provides a one- or two-day course in BDD and writing Gherkin. Sign up for more information here.

Sprint Planning Sucks. Can It Be Fixed?

Warning: This article contains strong opinions that might not be suitable for all audiences. Reader discretion is advised.

It’s Monday morning. After an all-too-short weekend and rush hour traffic, you finally arrive at the office. You throw your bag down at your desk, run to the break room, and queue up for coffee. As the next pot is brewing, you check your phone. It’s 8:44am… now 8:45am, and DING! A meeting reminder appears:

Sprint Planning – 9am to 3pm.

.

What’s your visceral reaction?

.

I can’t tell you mine, because I won’t put profanity on my blog.

Real Talk

In the capital-A Agile Scrum process, sprint planning is the kick-off meeting for the next iteration. The whole team comes together to talk about features, size work items with points, and commit to deliverables for the next “sprint” (typically 2 weeks long). Idealistically, team members collaborate freely as they learn about product needs and give valued input.

Let’s have some real talk, though: sprint planning sucks. Maybe that’s a harsh word, but, if you’re reading this article, then it caught your attention. Personally, my sprint planning experiences have been lousy. Why? Am I just bellyaching, or are there some serious underlying problems?

Sprint planning is a huge time commitment. 9am to 3pm is not an exaggeration. Sprint planning meetings are typically half-day to full-day affairs. Most people can’t stay focused on one thing for that long. Plus, when a sprint is only two weeks long, one hour is a big chunk of time, let alone 3, or 6, or a whole day. The longer the meeting, the higher the opportunity cost, and the deeper the boredom.

Collaboration is a farce. Planning meetings typically devolve into one “leader” (like a scrum master, product owner, or manager) pulling teeth to get info for a pre-determined list of stories. Only two people, the leader and the story-owner, end up talking, while everyone else just stares at their laptops until it’s their turn. Discussions typically don’t follow any routine beyond, “What’s the acceptance criteria?” and, “Does this look right?” with an interloper occasionally chiming in. Each team member typically gets only a few minutes of value out of an hours-long ordeal. That’s an inefficient use of everyone’s time.

No real planning actually happens. These meetings ought to be called “guessing” meetings, instead. Story point sizes are literally made up. Do they measure time or complexity? No, they really just measure groupthink. Teams even play a game called planning poker that subliminally encourages bluffing. Then, point totals are used to guess how much work can be done during the sprint. When the guess turns out to be wrong at the end of the sprint (and it always does), the team berates itself in retro for letting points slip. Every. Time.

Does It Spark Joy?

I’ve long wondered to myself if sprint planning is a good concept just implemented poorly, or if it’s conceptually flawed at its root. I’m pretty sure it’s just flawed. The meetings don’t facilitate efficient collaboration relative to their time commitments, and estimates are based on poor models. Retros can’t fix that. And gut reactions don’t lie.

So, what should we do? Should we Konmari our planning meetings to see if they spark joy? Should we get rid of our ceremonies and start over? Is this an indictment of the whole Agile Scrum process? But then, how will we know what to do, and when things can get done?

I think we can evolve our Agile process with more effective practices than sprint planning. And I don’t think that evolution would be terribly drastic.

Behavior-Driven Planning

What we really want out of a planning meeting is planning, not pulling and not predicting. Planning is the time to figure out what will be done and how it will be done. The size of the work should be based on the size of the blueprint. Enter Example Mapping.

Example Mapping is a Behavior-Driven Development practice for clarifying and confirming stories. The process is straightforward:

  1. Write the story on a yellow card.
  2. Write each rule that the story must satisfy on a blue card.
  3. Illustrate each rule with examples written on green cards.
  4. Got stuck on a question? Write it on a red card and move on.

One story should take about 20-30 minutes to map. The whole team can participate, or the team can split up into small groups to divide-and-conquer. Rules become acceptance criteria, examples become test cases, and questions become spikes.

Here’s a good walkthrough of Example Mapping.

What about story size? That’s easy – count the cards. How many cards does a story have? That’s a rough size for the work to be done based on the blueprint, not bluffing. More cards = more complexity. It’s objective. No games. Frankly, it can’t be any worse than made-up point values.

This is real planning: a blueprint with a course of action.

So, rather than doing traditional sprint planning meetings, try doing Example Mapping sessions. Actually plan the stories, and use card counts for point sizes. Decisions about priority and commitments can happen between rounds of story mapping, too. The Scrum process can otherwise remain the same.

If you want to evolve further, you could eliminate the time boxes of sprints in favor of Kanban. Two-week work item boundaries can arbitrarily fall in the middle of progress, which is not only disruptive to workflow but can also encourage bad responses (like cramming to get things done or shaming for not being complete). Kanban treats work items as a continuous flow of prioritized work fed to a team in bite-sized pieces. When a new story comes up, it can have its own Example Mapping “planning” meeting. Now, Kanban is not for everyone, but it is popular among post-Agile practitioners. What’s important is to find what works for your team.

Rant Over

I know I expressed strong, controversial opinions in this article. And I also recognize that I’m arguing against bad examples of Agile Scrum. Nevertheless, I believe my points are fair: planning itself is not a waste of time, but the way many teams plan their sprints uses time inefficiently and sets poor expectations. There are better ways to do planning – let’s give them a try!