
Test coverage and trusting your instincts

Picture this: It’s 2010, and I’m fresh out of college, eager to dive into the software industry. Little did I know, a simple interview question would challenge not only my knowledge about testing but also my instincts.

Job openings were hard to find in the wake of the Great Recession. Thankfully, I landed a few interviews with IBM, where I completed a series of internships over the summers of 2007-2009. I was willing to take any kind of job – as long as it involved coding. One of those interviews was for an entry-level position on a data warehouse team in Boston. I honestly don’t remember much from this interview, but there was one question I will never forget:

How do you know when you’ve done enough testing?

Now, remember, back in 2010, I wasn’t the Automation Panda yet. Nevertheless, since I had experience with testing during my internships, I felt prepared to give a reasonable answer. If I recall correctly, I said something about covering all paths through the code and being mindful to consider edge cases that could be overlooked. (My answer today would likely frame “completeness” around acceptable risk, but that’s not the point of the story.) I’ll never forget what the interviewer said in reply:

Well, that’s not the answer I was looking for.

Oh? What’s the “right” answer?

If you write roughly the same number of lines of test code as you write for product code, then you have enough coverage.

That answer stunned me. Despite my limited real-world experience as a recent college graduate, I knew that answer was blatantly wrong. During my internships, I wrote plenty of code with plenty of tests, and I knew from experience that there was no correlation between lines of test code and actual coverage. Even a short snippet could require multiple tests to thoroughly cover all of its variations.

For example, here’s a small Python class that keeps track of a counter:

class Counter:

  def __init__(self):
    self.count = 0

  def add(self, more=1):
    self.count += more

And here’s a set of pytest tests to cover it:

import pytest

@pytest.fixture
def counter():
  return Counter()

def test_counter_init(counter):
  assert counter.count == 0

def test_counter_add_one(counter):
  counter.add()
  assert counter.count == 1

def test_counter_add_three(counter):
  counter.add(3)
  assert counter.count == 3

def test_counter_add_twice(counter):
  counter.add()
  counter.add()
  assert counter.count == 2

There are three times as many lines of test code as product code, and I could still come up with a few more test cases.
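As a rough sketch of what those extra test cases might look like, here are a few more. They are hypothetical additions, not part of the suite above, and they assume that zero and negative amounts are acceptable inputs (the class does nothing to reject them). The Counter class is repeated so the snippet runs on its own, and the tests build their own counters instead of using the fixture so that no imports are needed.

```python
class Counter:

  def __init__(self):
    self.count = 0

  def add(self, more=1):
    self.count += more


# Hypothetical extra cases; each one pins down current behavior.

def test_counter_add_zero():
  counter = Counter()
  counter.add(0)
  assert counter.count == 0

def test_counter_add_negative():
  counter = Counter()
  counter.add(-2)
  assert counter.count == -2

def test_counter_add_mixed_amounts():
  counter = Counter()
  counter.add(5)
  counter.add(-3)
  counter.add()
  assert counter.count == 3

def test_counters_are_independent():
  first = Counter()
  second = Counter()
  first.add()
  assert second.count == 0
```

That is four more tests for the same five lines of product code, which pushes the ratio even further from one-to-one.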

In the moment, I didn’t know how to reply to the interviewer. He sounded very confident in his answer. All I could say was, “I don’t think I agree with that.” I didn’t have any examples or evidence to share; I just had my gut feeling.

I sensed my interviewer’s disappointment with my response. Who was I, a lowly intern, to challenge a senior engineer? Needless to say, I did not receive a job offer. I ended up taking a different job with IBM in Raleigh-Durham instead.

Nevertheless, this exchange taught me a very valuable lesson: trust your instincts. While I didn’t land the job that day, the encounter left an indelible mark on my approach to problem-solving. It instilled in me the confidence to question assumptions and trust my instincts, qualities that would shape my career trajectory in unforeseen ways. Never dismiss your instincts because you are less senior than others. You just might be right!

BDD 101: Writing Good Gherkin

So, you and your team have decided to make test automation a priority. You plan to use behavior-driven development to shift left with testing. You read the BDD 101 Series up through the previous post. You picked a good language for test automation. You even peeked at Cucumber-JVM or another BDD framework on your own. That’s great! Big steps! And now, you are ready to write your first Gherkin feature file. You fire up Atom with a Gherkin plugin or Notepad++ with a Gherkin UDL, you type “Given” on the first line, and…

Writer’s block. How am I supposed to write my Gherkin steps?

Good Gherkin feature files are not easy to write at first. Writing is definitely an art. With some basic pointers, and a bit of practice, Gherkin becomes easier. This post will cover how to write top-notch feature files. (Check the Automation Panda BDD page for the full table of contents.)

The Golden Gherkin Rule: Treat other readers as you would want to be treated. Write Gherkin so that people who don’t know the feature will understand it.

Proper Behavior

The biggest mistake BDD beginners make is writing Gherkin without a behavior-driven mindset. They often write feature files as if they are writing “traditional” procedure-driven functional tests: step-by-step instructions with actions and expected results. HP ALM, qTest, AccelaTest, and many other test repository tools store tests in this format. These procedure-driven tests are often imperative and trace a path through the system that covers multiple behaviors. As a result, they may be unnecessarily long, which can delay failure investigation, increase maintenance costs, and create confusion.

For example, let’s consider a test that searches for images of pandas on Google. Below would be a reasonable test procedure:

  1. Open a web browser.
    1. Web browser opens successfully.
  2. Navigate to https://www.google.com/.
    1. The web page loads successfully and the Google image is visible.
  3. Enter “panda” in the search bar.
    1. Links related to “panda” are shown on the results page.
  4. Click on the “Images” link at the top of the results page.
    1. Images related to “panda” are shown on the results page.

I’ve seen many newbies translate a test like this into Gherkin like the following:

# BAD EXAMPLE! Do not copy.
Feature: Google Searching

  Scenario: Google Image search shows pictures
    Given the user opens a web browser
    And the user navigates to "https://www.google.com/"
    When the user enters "panda" into the search bar
    Then links related to "panda" are shown on the results page
    When the user clicks on the "Images" link at the top of the results page
    Then images related to "panda" are shown on the results page

This scenario is terribly wrong. All that happened was that the author put BDD buzzwords in front of each step of the traditional test. This is not behavior-driven; it is still procedure-driven.

The first two steps are purely setup: they just go to Google, and they are strongly imperative. Since they don’t focus on the desired behavior, they can be reduced to one declarative step: “Given a web browser is at the Google home page.” This new step is friendlier to read.

After the Given step, there are two When-Then pairs. This is syntactically incorrect: Given-When-Then steps must appear in order and cannot repeat. A Given may not follow a When or Then, and a When may not follow a Then. The reason is simple: any single When-Then pair denotes an individual behavior. This makes it easy to see how, in the test above, there are actually two behaviors covered: (1) searching from the search bar, and (2) performing an image search. In Gherkin, one scenario covers one behavior. Thus, there should be two scenarios instead of one. Any time you want to write more than one When-Then pair, write separate scenarios instead. (Note: Some BDD frameworks may allow disordered steps, but it would nevertheless be anti-behavioral.)

This splitting technique also reveals unnecessary behavior coverage. For instance, the first behavior to search from the search bar may be covered in another feature file. I once saw a scenario with about 30 When-Then pairs, and many were duplicate behaviors.

Do not be tempted to arbitrarily reassign step types to make scenarios follow strict Given-When-Then ordering. Respect the integrity of the step types: Givens set up initial state, Whens perform an action, and Thens verify outcomes. In the example above, the first Then step could have been turned into a When step, but that would be incorrect because it makes an assertion. Step types are meant to be guide rails for writing good behavior scenarios.

The correct feature file would look something like this:

Feature: Google Searching

  Scenario: Search from the search bar
    Given a web browser is at the Google home page
    When the user enters "panda" into the search bar
    Then links related to "panda" are shown on the results page

  Scenario: Image search
    Given Google search results for "panda" are shown
    When the user clicks on the "Images" link at the top of the results page
    Then images related to "panda" are shown on the results page

The second behavior arguably needs the first behavior to run first because the second needs to start at the search result page. However, since that is merely setup for the behavior of image searching and is not part of it, the Given step in the second scenario can simply declare that the “panda” search has already been done. Of course, this means that the “panda” search would be run redundantly at test time, but the separation of scenarios guarantees behavior-level independence.

The Cardinal Rule of BDD: One Scenario, One Behavior!

Remember, behavior scenarios are more than tests – they also represent requirements and acceptance criteria. Good Gherkin comes from good behavior.

(For deeper information about the Cardinal Rule of BDD and multiple When-Then pairs per scenario, please refer to my article, Are Gherkin Scenarios with Multiple When-Then Pairs Okay?)
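As a rough aid for spotting scenarios that break the Cardinal Rule, the Given-When-Then ordering check described in this section can be sketched in a few lines of Python. This is a minimal illustration, not part of any BDD framework: the function name find_order_violations is hypothetical, and it assumes each step is a single line that starts with its keyword.

```python
def find_order_violations(steps):
  """Return (index, step) pairs where a step type moves backward,
  such as a When after a Then or a Given after a When."""
  rank = {"Given": 0, "When": 1, "Then": 2}
  violations = []
  current = None  # rank of the most recent Given/When/Then
  for index, step in enumerate(steps):
    words = step.split()
    if not words:
      continue  # ignore blank lines
    keyword = words[0]
    if keyword in ("And", "But"):
      continue  # And/But take the type of the step before them
    if keyword not in rank:
      raise ValueError(f"not a step line: {step!r}")
    if current is not None and rank[keyword] < current:
      violations.append((index, step))
    current = rank[keyword]
  return violations


# The bad example from earlier in this section.
bad_scenario = [
  'Given the user opens a web browser',
  'And the user navigates to "https://www.google.com/"',
  'When the user enters "panda" into the search bar',
  'Then links related to "panda" are shown on the results page',
  'When the user clicks on the "Images" link at the top of the results page',
  'Then images related to "panda" are shown on the results page',
]

for index, step in find_order_violations(bad_scenario):
  print(f"step {index + 1} starts a second behavior: {step}")
```

Each flagged step marks a point where the scenario should probably be split in two. The check only looks at keywords, so it cannot tell whether a step's type was reassigned arbitrarily; that still takes a human reader.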

Phrasing Steps

How you write a step matters. If you write a step poorly, it cannot easily be reused. Thankfully, some basic rules maintain consistent phrasing and maximum reusability.

Write all steps in third-person point of view. If first-person and third-person steps mix, scenarios become confusing. I even dedicated a whole blog post entirely to this point: Should Gherkin Steps Use First-Person or Third-Person? TL;DR: just use third-person at all times.

Write steps as a subject-predicate action phrase. It may be tempting to leave parts of speech out of a step line for brevity, especially when using Ands and Buts, but partial phrases make steps ambiguous and more likely to be reused improperly. For example, consider the following scenario:

# BAD EXAMPLE! Do not copy.
Feature: Google Searching

  Scenario: Google search result page elements
    Given the user navigates to the Google home page
    When the user entered "panda" at the search bar
    Then the results page shows links related to "panda"
    And image links for "panda"
    And video links for "panda"

The final two And steps lack the subject-predicate phrase format. Are the links meant to be subjects, meaning that they perform some action? Or, are they meant to be direct objects, meaning that they receive some action? Are they meant to be on the results page or not? What if someone else wrote a scenario for a different page that also had image and video links – could they reuse these steps? Writing steps without a clear subject and predicate is not only poor English but poor communication.

Also, use appropriate tense and phrasing for each type of step. For simplicity, use present tense for all step types. Rather than take a time warp back to middle school English class, let’s illustrate tense with a bad example:

# BAD EXAMPLE! Do not copy.
Feature: Google Searching

  Scenario: Simple Google search
    Given the user navigates to the Google home page
    When the user entered "panda" at the search bar
    Then links related to "panda" will be shown on the results page

The Given step above uses present tense, but its subject is misleading. It indicates an action when it says, “Given the user navigates.” Actions imply the exercise of behavior. However, Given steps are meant to establish an initial state, not exercise a behavior. This may seem like a trivial nuance, but it can confuse feature file authors who may not be able to tell if a step is a Given or When. A better phrasing would be, “Given the Google home page is displayed.” It establishes a starting point for the scenario. Use present tense with an appropriate subject to indicate a state rather than an action.

The When step above uses past tense when it says, “The user entered.” This indicates that an action has already happened. However, When steps should indicate that an action is presently happening. Plus, past tense here conflicts with the tenses used in the other steps.

The Then step above uses future tense when it says, “Links related to ‘panda’ will be shown.” Future tense seems practical for Then steps because it indicates what the result should be after the current action is taken. However, future tense reinforces a procedure-driven approach because it treats the scenario as a time sequence. A behavior, on the other hand, is a present-tense aspect of the product or feature. Thus, it is better to write Then steps in the present tense.

The corrected example looks like this:

Feature: Google Searching

  Scenario: Simple Google search
    Given the Google home page is displayed
    When the user enters "panda" into the search bar
    Then links related to "panda" are shown on the results page

And note, all steps are written in third-person. Read Should Gherkin Steps use Past, Present, or Future Tense? to learn more.

Good Titles

Good titles are just as important as good steps. The title is like the face of a scenario – it’s the first thing people read. It must communicate in one concise line what the behavior is. Titles are often logged by the automation framework as well. Specific pointers for writing good scenario titles are given in my article, Good Gherkin Scenario Titles.

Choices, Choices

Another common misconception for beginners is thinking that Gherkin has an “Or” step for conditional or combinatorial logic. People may presume that Gherkin has “Or” because it has “And”, or perhaps programmers want to treat Gherkin like a structured language. However, Gherkin does not have an “Or” step. When automated, every step is executed sequentially.

Below is a bad example based on a classic Super Mario video game, showing how people might want to use “Or”:

# BAD EXAMPLE! Do not copy.
Feature: SNES Mario Controls

  Scenario: Mario jumps
    Given a level is started
    When the player pushes the "A" button
    Or the player pushes the "B" button
    Then Mario jumps straight up

Clearly, the author’s intent is to say that Mario should jump when the player pushes either of two buttons. The author wants to cover multiple variations of the same behavior. In order to do this the right way, use Scenario Outline sections to cover multiple variations of the same behavior, as shown below:

Feature: SNES Mario Controls

  Scenario Outline: Mario jumps
    Given a level is started
    When the player pushes the "<letter>" button
    Then Mario jumps straight up
    
    Examples: Buttons
      | letter |
      | A      |
      | B      |

The Known Unknowns

Test data can be difficult to handle. Sometimes, it may be possible to seed data in the system and write tests to reference it, but other times, it may not. Google search is the prime example: the result list will change over time as both Google and the Internet change. To handle the known unknowns, write scenarios defensively so that changes in the underlying data do not cause test runs to fail. Furthermore, to be truly behavior-driven, think about data not as test data but as examples of behavior.

Consider the following example from the previous post:

Feature: Google Searching
  
  Scenario: Simple Google search
    Given a web browser is on the Google page
    When the search phrase "panda" is entered
    Then results for "panda" are shown
    And the following related results are shown
      | related       |
      | Panda Express |
      | giant panda   |
      | panda videos  |

This scenario uses a step table to explicitly name results that should appear for a search. The step with the table would be implemented to iterate over the table entries and verify each appeared in the result list. However, what if Panda Express were to go out of business and thus no longer be ranked as high in the results? (Let’s hope not.) The test run would then fail, not because the search feature is broken, but because a hard-coded variation became invalid. It would be better to write a step that more intelligently verified that each returned result somehow related to the search phrase, like this: “And links related to ‘panda’ are shown on the results page.” The step definition implementation could use regular expression parsing to verify the presence of “panda” in each result link.
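Here is a minimal sketch of that kind of check in Python. The function name unrelated_results and the sample titles are hypothetical, and a real step definition would pull the titles from the results page through whatever browser automation the project uses.

```python
import re


def unrelated_results(phrase, result_titles):
  """Return the result titles that do not mention the search phrase."""
  # re.escape keeps characters like "+" or "." in the phrase literal.
  pattern = re.compile(re.escape(phrase), re.IGNORECASE)
  return [title for title in result_titles if not pattern.search(title)]


# Made-up titles standing in for whatever the search returns today.
titles = ["Giant panda - Wikipedia", "Panda Express", "Cute PANDA videos"]
assert unrelated_results("panda", titles) == []
```

The Then step would then assert that the returned list is empty. A plain substring match like this is only one possible definition of “related”; a stricter or looser rule may suit a given product better.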

Another nice feature of Gherkin is that step definitions can hide data in the automation when it doesn’t need to be exposed. Step definitions may also pass data to future steps in the automation. For example, consider another Google search scenario:

Feature: Google Searching

  Scenario: Search result linking
    Given Google search results for "panda" are shown
    When the user clicks the first result link
    Then the page for the chosen result link is displayed

Notice how the When step does not explicitly name the value of the result link – it simply says to click the first one. The value of the first link may change over time, but there will always be a first link. The Then step must know something about the chosen link in order to successfully verify the outcome, but it can simply reference it as “the chosen result link”. Behind the scenes, in the step definitions, the When step can store the value of the chosen link in a variable and pass the variable forward to the Then step.
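The hand-off between the When and Then steps might look something like the sketch below. Everything here is hypothetical: ScenarioContext stands in for whatever per-scenario storage a BDD framework provides, and FakeBrowser stands in for a real browser driver so that the example can run on its own.

```python
class ScenarioContext:
  """Hypothetical storage shared by the steps of one scenario."""

  def __init__(self):
    self.chosen_link = None


class FakeBrowser:
  """Stand-in for a real browser driver."""

  def __init__(self, result_links):
    self.result_links = result_links
    self.current_url = None

  def open(self, url):
    self.current_url = url


def when_user_clicks_first_result_link(context, browser):
  # Remember which link was chosen so that a later step can use it.
  context.chosen_link = browser.result_links[0]
  browser.open(context.chosen_link)


def then_page_for_chosen_link_is_displayed(context, browser):
  # The Gherkin never names the link; the value comes from the context.
  assert browser.current_url == context.chosen_link


context = ScenarioContext()
browser = FakeBrowser(["https://example.com/first", "https://example.com/second"])
when_user_clicks_first_result_link(context, browser)
then_page_for_chosen_link_is_displayed(context, browser)
```

The feature file stays stable even when the first result changes, because the concrete value lives only in the automation.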

Handling Test Data

Some types of test data should be handled directly within the Gherkin, but other types should not. Remember that BDD is specification by example – scenarios should be descriptive of the behaviors they cover, and any data written into the Gherkin should support that descriptive nature. Read Handling Test Data in BDD for comprehensive information on handling test data.

Less is More

Scenarios should be short and sweet. I typically recommend that scenarios should have a single-digit step count (<10). Long scenarios are hard to understand, and they are often indicative of poor practices. One such problem is writing imperative steps instead of declarative steps. I have touched on this topic before, but I want to thoroughly explain it here.

Imperative steps state the mechanics of how an action should happen. They are very procedure-driven. For example, consider the following When steps for entering a Google search:

  1. When the user scrolls the mouse to the search bar
  2. And the user clicks the search bar
  3. And the user types the letter “p”
  4. And the user types the letter “a”
  5. And the user types the letter “n”
  6. And the user types the letter “d”
  7. And the user types the letter “a”
  8. And the user types the ENTER key

Now, the granularity of actions may seem like overkill, but it illustrates the point that imperative steps focus very much on how actions are taken. Thus, they often need many steps to fully accomplish the intended behavior. Furthermore, the intended behavior is not always as self-documented as with declarative steps.

Declarative steps state what action should happen without providing all of the information for how it will happen. They are behavior-driven because they express action at a higher level. All of the imperative steps in the example above could be written in one line: “When the user enters ‘panda’ at the search bar.” The scrolling and keystroking are implied, and they will ultimately be handled by the automation in the step definition. When trying to reduce step count, ask yourself if your steps can be written more declaratively.
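To illustrate where the mechanics go, here is a rough sketch of a step function behind that one declarative line. The FakeSearchPage class and its method names are made up for this example; it simply records actions so the sketch can run without a browser.

```python
class FakeSearchPage:
  """Hypothetical page object that records actions instead of driving a browser."""

  def __init__(self):
    self.actions = []

  def click_search_bar(self):
    self.actions.append("click search bar")

  def type_text(self, text):
    self.actions.append(f"type {text!r}")

  def press_enter(self):
    self.actions.append("press ENTER")


def when_user_enters_phrase_at_search_bar(page, phrase):
  # One declarative step in Gherkin; the imperative detail lives here.
  page.click_search_bar()
  page.type_text(phrase)
  page.press_enter()


page = FakeSearchPage()
when_user_enters_phrase_at_search_bar(page, "panda")
print(page.actions)
```

The feature file reader sees one step, while the click, the typing, and the ENTER key stay in the automation code where they can change without touching the Gherkin.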

Another reason for lengthy scenarios is scenario outline abuse. Scenario outlines make it all too easy to add unnecessary rows and columns to their Examples tables. Unnecessary rows waste test execution time. Extra columns indicate complexity. Both should be avoided. Below are questions to ask yourself when facing an oversized scenario outline:

  • Does each row represent an equivalence class of variations?
    • For example, searching for “elephant” in addition to “panda” does not add much test value.
  • Does every combination of inputs need to be covered?
    • N columns with M inputs each generate M^N (M to the power of N) possible combinations.
    • Consider making each input appear only once, regardless of combination.
  • Do any columns represent separate behaviors?
    • This may be true if columns are never referenced together in the same step.
    • If so, consider splitting apart the scenario outline by column.
  • Does the feature file reader need to explicitly know all of the data?
    • Consider hiding some of the data in step definitions.
    • Some data may be derivable from other data.

These questions are meant to be sanity checks, not hard-and-fast rules. The main point is that scenario outlines should focus on one behavior and use only the necessary variations.
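To put numbers on the combination question, here is a small sketch comparing full combinatorial coverage against making each input appear only once. The column values (browsers, languages, devices) are invented for illustration, and each_value_once is a hypothetical helper, not a feature of any BDD framework.

```python
from itertools import product

# Hypothetical Examples table columns: N = 3 columns, M = 3 inputs each.
browsers = ["Chrome", "Firefox", "Edge"]
languages = ["en", "es", "fr"]
devices = ["desktop", "tablet", "phone"]
columns = [browsers, languages, devices]

# Every combination of inputs: M^N = 3^3 = 27 rows.
all_rows = list(product(*columns))
assert len(all_rows) == 27


def each_value_once(columns):
  """Build rows so that every value appears at least once.
  Shorter columns wrap around to fill the remaining rows."""
  longest = max(len(column) for column in columns)
  return [
    tuple(column[i % len(column)] for column in columns)
    for i in range(longest)
  ]


# Each input appears once: only 3 rows.
lean_rows = each_value_once(columns)
assert lean_rows == [
  ("Chrome", "en", "desktop"),
  ("Firefox", "es", "tablet"),
  ("Edge", "fr", "phone"),
]
```

Whether 3 rows are enough depends on whether the inputs interact. If a bug only shows up for one specific pairing, the lean table can miss it, which is why these are sanity checks and not hard-and-fast rules.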

Style and Structure

While style often takes a backseat during code review, it is a factor that differentiates good feature files from great feature files. In a truly behavior-driven team, non-technical stakeholders will rely upon feature files just as much as the engineers. Good writing style improves communication, and good communication skills are more than just resume fluff.

Below are a number of tidbits for good style and structure:

  1. Focus a feature on customer needs.
  2. Limit one feature per feature file. This makes it easy to find features.
  3. Limit the number of scenarios per feature. Nobody wants a thousand-line feature file. A good measure is a dozen scenarios per feature.
  4. Limit the number of steps per scenario to fewer than ten.
  5. Limit the character length of each step. Common limits are 80-120 characters.
  6. Use proper spelling.
  7. Use proper grammar.
  8. Capitalize Gherkin keywords.
  9. Capitalize the first word in titles.
  10. Do not capitalize words in the step phrases unless they are proper nouns.
  11. Do not use punctuation (specifically periods and commas) at the end of step phrases.
  12. Use single spaces between words.
  13. Indent the content beneath every section header.
  14. Separate features and scenarios by two blank lines.
  15. Separate examples tables by one blank line.
  16. Do not separate steps within a scenario by blank lines.
  17. Space table delimiter pipes (“|”) evenly.
  18. Adopt a standard set of tag names. Avoid duplicates.
  19. Write all tag names in lowercase, and use hyphens (“-”) to separate words.
  20. Limit the length of tag names.

Without these rules, you might end up with something like this:

# BAD EXAMPLE! Do not copy.

 Feature: Google Searching
     @AUTOMATE @Automated @automation @Sprint32GoogleSearchFeature
 Scenario outline: GOOGLE STUFF
Given a Web Browser is on the Google page,
 when The seach phrase "<phrase>" Enter,

 Then  "<phrase>" shown.
and The relatedd   results include "<related>".
Examples: animals
 | phrase | related |
| panda | Panda Express        |
| elephant    | elephant Man  |

Don’t do this. It looks horrible. Please, take pride in your profession. While the automation code may look hairy in parts, Gherkin files should look elegant.
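A few of the style rules above are mechanical enough to check with a script. The sketch below covers only four of them (keyword capitalization, trailing punctuation, single spaces, and step length) and only looks at step lines. The function name lint_step and the 120-character limit are assumptions for the example, not an existing tool.

```python
import re

STEP_KEYWORDS = ("Given", "When", "Then", "And", "But")
MAX_STEP_LENGTH = 120  # assumed limit; pick whatever your team agrees on


def lint_step(line):
  """Return a list of style problems found in one Gherkin step line.
  Lines that do not start with a step keyword are ignored."""
  text = line.strip()
  words = text.split()
  if not words or words[0].capitalize() not in STEP_KEYWORDS:
    return []
  problems = []
  if words[0] not in STEP_KEYWORDS:
    problems.append("keyword is not capitalized")
  if text.endswith((".", ",")):
    problems.append("step ends with punctuation")
  if re.search(r"\S {2,}\S", text):
    problems.append("multiple spaces between words")
  if len(text) > MAX_STEP_LENGTH:
    problems.append("step is too long")
  return problems


# Two step lines from the bad example above, plus one clean step.
print(lint_step(' when The seach phrase "<phrase>" Enter,'))
print(lint_step('and The relatedd   results include "<related>".'))
print(lint_step('Given the Google home page is displayed'))
```

Spelling, grammar, and capitalization inside the step phrase still need a human reviewer or a proper spell checker; a script like this only catches the easy cases.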

Gherkinize Those Behaviors!

With these best practices, you can write Gherkin feature files like a pro. Don’t be afraid to try: nobody does things perfectly the first time. As a beginner, I broke many of the guidelines I put in this post, but I learned as I went. Don’t give up if you get stuck. Always remember the Golden Gherkin Rule and the Cardinal Rule of BDD!

This is the last of three posts in the series focused exclusively on Gherkin. The next post will address how to adopt behavior-driven practices into the Agile software development process.

10 Things You Lose Without Automation

Automation has a lot of potential to improve software development. Unfortunately, though, automation is often seen as a luxury. Deadlines in the real world are unforgiving, and since test code isn’t product code, automation tasks are given lower priority and dunked into the black hole of the backlog. Some might argue that this is okay because it is lean or because a new project is just getting started. Once, I even heard it quipped that the first ones cut during a layoff are the automation folks. And it is true that automation requires a nontrivial resource investment.

However, I want to turn the tables. Instead of thinking about automation in terms of the opportunity, think about automation in terms of the opportunity cost. What happens if you don’t automate your tests from the get-go? There are 10 major things you lose:

#1: Man Hours

Automated tests will automatically run. Manual tests must be manually run. That’s tautological. If you only run a test one time, then automation has no return-on-investment. But if you run a test more than once, automation saves a tester from repeating themselves. Plus, it’s easy: push the button and wait for results. Automated tests almost always run faster than manual tests, too. Considering that time is money and engineer salaries aren’t cheap, man hours are a clear opportunity cost.

#2: Coverage

Automated tests can achieve greater coverage than manual tests, particularly for regression testing. As product development progresses, the sheer number of test cases increases. For example, in Agile, new tests will be created every sprint. Older tests must be run periodically to verify that new features don’t break existing features. If regression tests are manual, then testers must burn hours grinding through the same tests repeatedly. Often, for expediency, this means that they skip some tests – not in the sense of being lazy, but rather as part of a risk-based approach. Weaker coverage plus risk of missing bugs are accepted for the sake of shorter testing time. If those regression tests were automated, then there would be no reason to shrink coverage, because they would be easy to run.

#3: Consistency

People make mistakes. It’s human nature – nobody’s perfect. And manual tests are prone to human error because humans run them. I remember how nervous I felt running manual on-call system checks at MaxPoint for the first time, afraid that I would miss a problem that could bring down a million-dollar bidding system. Automated scripts run the same way every time.

#4: Protection

Continuous integration (CI) protects code against defects by building and testing every code change in real time. A CI system will automatically trigger tests all the time. Tests not running in CI (like manual tests) are effectively dead. At NetApp, failing code changes would immediately be kicked out of the code line, making automated tests act like a vaccine against bugs. On the other hand, I remember a project at MaxPoint that was riddled with bugs and perpetually delayed. When I asked the developers to see their unit tests, they said they never wrote unit tests because “it wasn’t a requirement.”

#5: Delivery Time

Continuous delivery (CD) is the natural extension of continuous integration, in which software products can automatically be delivered (and potentially even deployed) as the final step in a CI pipeline. This is how big companies like Google, Facebook, and Netflix can deliver so rapidly. No automation means no CD.

#6: Results and Metrics

Non-engineers (managers, product owners, scrum masters, oh my!) love to ask questions about tests. “Are we red or green?” “How many tests do we have for this feature?” “What’s our coverage?” “How often do we run the tests?” Automated tests simply yield more accurate and more comprehensive results. Automation can also generate test reports, so engineers don’t need to waste time drafting emails or updating wiki pages.

#7: Accountability

Numbers don’t lie. Scripts don’t lie. Engineers typically don’t lie, but… results from manual tests can have a fudge factor, or a mistake in reporting, or any other sort of inconsistency. Inaccurate results may lead to poor business decisions. Automated results tell it like it is.

#8: Creativity

Manual testing can devolve into repetitive, menial labor: just follow steps 1-10 again and again and again. It would be much more effective for manual testers to focus on exploratory testing rather than deterministic testing. While automated tests can cover the fixed, repetitive test scenarios, exploratory testing lets testers find creative ways to uncover defects and judge how well a product actually works. Lack of automation ties up human capital.

#9: Peace of Mind

Are you sure that your product is “good”? Can you run enough tests to make sure? I learned the value of peace of mind while I was still in college. In my compiler theory course, I had to develop a simple programming language and build a compiler for it. Every week, we had to add new language features: arithmetic, strings, arrays, functions, etc. And every week, I wrote a slew of mini-programs to test grammar updates to my new language. By the time the project was complete, I had 1000+ automated test cases running through JUnit with 100% coverage, and the entire suite took a mere few minutes to run. And there were many late nights when the tests caught bugs in my language right away before committing code. There was no way I could have passed that class without my automated tests.

#10: Quality

The ultimate purpose of test automation is product quality. Having automation doesn’t necessarily mean product quality is good, but not having automation severely limits how quality can be pursued. Anecdotally, I’ve seen much better code quality come out of projects that have good test automation than ones without it. If I were a product owner, I know what I would want.