Please Hang Up and Dial Again: Handling Test Interruptions in CI/CD

This post was originally published by Sealights on December 19, 2017 as part of their article Test Quality in CI/CD – Expert Roundup. I was honored to contribute my thoughts on automatic recovery in test automation, and I reblogged the text of my contribution here for Automation Panda readers. Please check out contributions from other experts in the full article!

Test automation is an essential part of CI/CD, but it must be extremely robust.
Unfortunately, tests running in live environments (integration and end-to-end)
often suffer rare but pesky “interruptions” that, if unhandled, will cause tests to fail.
These interruptions could be network blips, web pages not fully loaded, or
temporarily downed services – any environment issues unrelated to product bugs.
Interruptive failures are problematic because they (a) are intermittent and thus
difficult to pinpoint, (b) waste engineering time, (c) potentially hide real failures,
and (d) cast doubt over process/product quality. Furthermore, CI/CD magnifies
even rare issues. If an interruption has only a 1% chance of happening during any one test,
then, treating tests as independent, there is a 63% chance it will happen at least once in
100 tests, and a 99% chance it will happen at least once in 500 tests. Keep in mind that it is not
uncommon for thousands of tests to run daily in CI – Google Guava had over 286K
tests back in July 2012!
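
The percentages above can be checked with a few lines of Python. This is a minimal sketch that assumes every test is interrupted independently with the same probability; the function name `chance_of_interruption` is made up for illustration.

```python
def chance_of_interruption(p: float, n: int) -> float:
    """Probability that at least one of n independent tests is interrupted.

    Assumes each test has the same interruption probability p.
    """
    # The chance that no test is interrupted is (1 - p) ** n,
    # so the chance of at least one interruption is its complement.
    return 1.0 - (1.0 - p) ** n


for n in (100, 500):
    print(f"{n} tests: {chance_of_interruption(0.01, n):.1%}")
# 100 tests: 63.4%
# 500 tests: 99.3%
```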

It is impossible to completely avoid interruptions – they will happen. Therefore, it is
imperative to handle interruptions at multiple layers:

  1. First, secure the platform upon which the tests run. Make sure system
    performance is healthy and that network connections are stable.
  2. Second, add failover logic to the automated tests. Any time an interruption
    happens, catch it as close to its source as possible, pause briefly, and retry the
    operation(s). Do not catch just any type of error: pinpoint specific interruption
    signatures to avoid false positives. Build failover logic into the framework
    rather than implementing it for one-off cases. For example, wrappers around
    web element or service calls could automatically perform retries.
    Aspect-oriented programming can help here tremendously. Repeating failed
    tests in their entirety also works and may be easier to implement, but it takes
    much more time to run.
  3. Third, log any interruptions and recovery attempts as warnings. Do not
    neglect to report them because they could indicate legitimate problems,
    especially if patterns appear.
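
As a rough illustration of the second and third layers, here is one way a framework-level retry wrapper might look in Python. It is only a sketch under stated assumptions: the decorator `retry_on_interruption`, its parameters, and the simulated `flaky_service_call` are all hypothetical names, and the built-in `ConnectionError` and `TimeoutError` stand in for whatever interruption signatures a real framework would pinpoint.

```python
import functools
import logging
import time

logger = logging.getLogger("interruptions")


def retry_on_interruption(signatures, attempts=3, pause=0.5):
    """Retry the wrapped call when it raises one of the given exception types.

    signatures: tuple of exception classes treated as interruptions.
    attempts:   total number of tries before giving up.
    pause:      seconds to wait between tries.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except signatures as error:
                    if attempt == attempts:
                        raise  # out of retries: let the test fail
                    # Report the interruption as a warning so patterns stay visible.
                    logger.warning(
                        "Interruption in %s (attempt %d of %d): %r; retrying",
                        func.__name__, attempt, attempts, error,
                    )
                    time.sleep(pause)
        return wrapper
    return decorator


# Hypothetical service call that fails twice with a simulated network blip.
calls = {"count": 0}


@retry_on_interruption((ConnectionError, TimeoutError), attempts=3, pause=0.01)
def flaky_service_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network blip")
    return "ok"


result = flaky_service_call()  # succeeds on the third try, after two warnings
```

Only the listed exception types trigger a retry. Anything else, such as an assertion failure, propagates immediately, which keeps the wrapper from masking legitimate bugs.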

It may be difficult to differentiate interruptions from legitimate bugs. Or, certain
retry attempts might take too long to be practical. When in doubt, just fail the test –
that’s the safer approach.
