This article is part of a series of posts about medical devices software testing:
- Automated tests that run in the build system.
- Automated tests supervised by a tester.
- Manual tests.
In my experience, writing and make pass the unit tests that act as guardians of a piece of code will take about twice the time required to write that piece of code alone. But that doubled time will include most of the debugging. As a developer, would you rather like to write unit tests or to debug? Would you rather fix things when they are fresh in your memory or two months later? Do you like to have your work really finished and move on to the next challenge, or be constantly caught up by your past and shameful mistakes? When somebody else breaks something in that functional area you’ve been working on some time ago, would you rather spend one day investigating their blunder to determine if it’s their responsibility to fix it, or instantly see in the CI what commit likely broke your unit test? Would you rather refactor constantly stuff you don’t like in your code or be paralyzed by the fear or breaking things in lesser-known areas of the code without any safety net? When you have to modify some code written by someone gone long ago, do you want to find it equipped with unit tests to help you get acquainted with the beast? I believe that automated tests are profitable, but on the long run. There is a quite high entry cost. Introducing automated tests late in a project is not quite so profitable. It must by a constant effort of everyone in the team with the goals of catching bugs, of inciting to decoupled design, of allowing design evolution by providing a safe environment for refactoring. But isn’t that exactly what we’re pursuing when developing medical devices? Good design, few bugs?
Anyway, the most economically interesting part with automated tests is that running them is cheap. Sure, some time will be required to take care of the production of all the servers and build agents, and to maintain the build scripts and build tools configuration in good shape. But this cannot be compared with the incredible cost of manual testing campains. Automated testing means putting effort in designing stuff, not repeating an effort which outcome is of very temporary value (until the next commit, or luckily the next version); it means climbing higher in the value chain. The very fact that it is so cheap allows the sheer number of tests run for every release to grow dramatically (compared to manual testing), which actually decreases the number of defects found in an end product – if you take advantage of the discipline.
Running tests almost for free is what makes automated testing so paramount to agile development. You can’t deliver often if you can’t test cheap and fast. And you can’t refactor if you have no safety net. A development that’s not iterative and evolutionary can’t be agile at all.
Going back to the schema of the introduction, here are the tests we will focus on in this section (green boxes).
Automated tests have the freedom to exercise small software components. To do so, they replace the environment of the System Under Test (SUT) with mocks and stubs, at any level. They are meant to run in virtual machines (or, someday, containers), which means they don’t run with the target embedded computer and they don’t run with the production OS (as they necessarily imply installing at least one testing dependency to run the tests and gather the results, and no dependency should be shipped in a production OS that is not required to operate the medical device, as this might increase risks regarding safety and cybersecurity). This freedom to mock collaborators of the SUT means it is possible to thoroughly test edge cases that are difficult or impossible to test by other means. This is especially important for programs that are meant to operate sensors and actuators and give life to an automaton, something that inherently has state: thanks to automated testing, you can easily test many configuration of the medical device, and simulate every possible error code of every possible sensor and actuator (at least at the level of the layers closest to the hardware, before combinatorics become overwhelming). You can also test what’s happing with a large variety of biological measurements or behaviors – even simulating phenomena that seldom happen in reality). Automated tests rock to achieve reliability.
Here are some guidelines on the use of the 3 main categories of tests that can easily be executed in the build:
- Unit tests. Unit tests are written to exercise a class or a couple of classes. The idea here is to test your basic algorithms decoupled from their environment. They prove worth themselves when the algorithms are not trivial – unit-testing boilerplate code is not profitable, I’d rather leave that to module and integration testing. The point of unit tests is, since you work at so small a scale, you can test all execution path and conditions, even those that are not meant to happen in production – but that eventually will. For example, suppose your write an algorithm for counting cells in an image. You should test that algorithms with a nice quantity of images taken from production condition (say one hundred) with a carefully chosen variety of concentrations for the different cell types. And then error cases: inappropriate formats, corrupted images, pictures of the sea… Once these tests are written, you must be confident enough to say: results will be good and errors will be gracefully detected.
- Module testing. I’ve found many definitions of a module on the web, but here’s the one I use in this blog: a module is a complete binary entity that can be started and shut down– typically an executable. Module testing hence means testing the module as it will be in production (compiled in Release mode, all relevant compiler optimizations on, signed…). You mock the collaborators of the module (typically making them have a known behavior) and test the latter through its production API. At that scale, you won’t test all execution paths, but ensure that most features of the module that will be delivered to production work.
- Integration testing. In these tests, you exercise several modules but you still mock some of their collaborators (e.g. network peers, fake machine hardware). Here you can test for interesting things such as complete app startup and shutdown, global error handling (are errors detected, and a nice error report generated, hopefully sent to the dev team?), and maybe run them on the target production OS (along with all hardening and branding features). For example, one of my teams created integration tests where the C++ and Straton automaton code (complete except for the stubbed actuator and sensor level) was running inside a virtual machine with Linux-preemptRT OS and interacted with the business code running in a Windows 7 of the build agent (which was quite different from the hardened production OS). These tests, by running complete biological tests execution, complete startup and shutdown sequences, as well as key maintenance behaviors, guaranteed many things about the execution of the automation code in its target OS and its interaction with the upper layer.
Probable source: http://www.quickmeme.com/meme/3rvaum
Working in a regulated industry, to fully take advantage of automated testing, you have to legalize it. To be suitable for use in the traceability system, all kinds of automated tests must be legitimated. This means that management plans and procedures need to introduce them properly, and that every tool in the chain must be documented, put in configuration control and formally validated. Tests have to reference requirement numbers in such a way that they appear in the tests results, which will feed the traceability matrix (for example, write the ids of the requirements covered by each test in NUnit’s Description attribute).
Give time for automated tests. Automated tests take time. If management doesn’t provide this time, they won’t happen by magic. A technique I find useful is to constantly repeat in planning poker sessions that automated tests are a mandatory part of the job, and that they should be part of the estimate. Difficult tests (such as automated load and stamina tests) must have a user story of their own, and with high estimates (these things are always a lot harder than they appear). As with all things related to long-running quality efforts, a constant care must be taken that will affect the sort-term velocity of the team. But everyone (up and down) has to get used to that velocity if it is required to write good, reliable, safe medical device software. One more reason to start it from the beginning: automated testing must be part of the natural rhythm of the team.
Time is the soil in which automated tests can grow. The ideal of quality is the seed. But what are the fertilizers you can use to make them flourish? Apart from technical factors that we’ll delve in soon, what is required is an automated testing culture.
Writing good automated tests must be pervasive in the development culture. Every developer should expect to create them and fix them when they’re broken. In fact, should developers be deprived from this constraint, they should find it unacceptable and fight hard to have automated tests introduced. In addition to this distributed responsibility, a dedicated team is useful to take care of the more difficult work regarding test: maintaining and polishing base test classes and shared stubs, updating libraries and introducing new ones when appropriate, spotting and annihilating sources of randomness in tests results, improving test performance, deleting obsolete or not-that-profitable old tests. Having tests that pass fast, deterministically and successfully (tests that are never left broken for long) makes a long way towards creating and expanding that automated testing culture.
Good KPI visible everywhere (including sprint reviews) can help. I’ve used the raw number of tests (motivating since always increasing, but not very meaningful), the number of tests created by iteration (a good indicator of the persistence of the practice in your team), the test coverage, the status of a few key and difficult tests.
Management should constantly show its care. By ranting about broken or random tests. By praising accomplishments (such as lowering the number of broken tests, fixing randomness, achieving victory with difficult load or stamina tests). By watching the state of the build. By investing the energy in building the culture and overcoming the necessary obstacles. By allocating time and resources to the endeavor.
Here are my favorite psychological tools to deal with broken tests.
- Box. I built a box with funny pictures to be put on the desk of every developed convinced to have broken the build so that they feel peer pressure to fix things fast. But be careful with this one, it has to be fun and light – otherwise it might create a weird atmosphere in the open space or even turn to moral harassment.
- Build guardian. When there are many tests, many developers, no gated commit and when feedback time is long, someone is required to watch for the build: the build guardian. His or her job is to analyze failed tests and assign investigations (either by determining the likely culprit by looking at the commits since the last time the test passed, or by assigning them randomly). This person must have enough authority that his or her decisions are respected.
- Instant feedback. Having a nice continuous integration tool with appropriate alerts (by mail, system tray, Slack…) when things need attention will help maintain quality in tests. I’ve appreciated using TeamCity the last few years. We also put a LED indicator in the open space to shout for broken tests (and stand-up meeting time J). If you happen to have a certain percentage of broken tests on a regular basis, the XFD can also be used to display the number of tests passing and failing. Or display silly jokes.
A necessary condition for automated tests to exist is to have an architecture that makes writing them easy enough – let’s call it test-friendly architecture. This means a highly decoupled architecture: massive use of dependency injection is a key technique, mocks and stubs are required, onion architecture can also help. The interesting thing here is the reverse effect: once you seriously start your automated testing journey (hopefully at the beginning of the project), you find yourself adding interfaces to write test doubles (to mock the hardware, a remote server, the database…), injecting dependencies, using in isolation just a few classes (to keep the test lights and fast)… and the architecture gets better. Testing the software is much more demanding to the design that all these extension points you consider for the future. As Bob Martin coined it, “The act of writing a unit test is more of an act of design than of verification”.
You need a good continuous integration tool to run the builds. You need enough build agents to handle the load, ideally a cloud with dynamic provisioning. Build agents should be identical, hopefully frequently restored to a reference state to avoid drifting apart. But sometimes differences in configuration between build agents will be necessary to minimize license costs (for example, install expensive tool X used only for C++ static checking on agent A).
The configuration of the build system (build scripts, build config…) should be kept in source control. Have a policy on this repository so that it is easy to restore the version of the build system to a former release branch.
If you want developers to pay attention to broken tests and fix them, they should never break for no reason. A couple of hints to achieve better test predictability :
- No dependency with time. Since the performance of virtual machine build agents and developers’ machines are likely different, timing will vary in both environments. It’s very frustrating to fix a bug that breaks only in the CI but works on all dev machines. The cure here is to use events to synchronize the parts of your code that need synchronization, but never time (no sleeps). If you really need time, encapsulate it (through some kind of ITimeProvider) so that you can fake it deterministically in your tests (to return always the same known value, for example).
- Test isolation. Tests MUST be independent from one another. This probably means that the database will have to be re-generated or restored for every module of integration tests. This will take time. But tests failures provoked by other tests are a worse evil.
- Test predictability. Sometimes an unpredictable behavior enters the code – several consecutive executions of the tests for the same revision of the code don’t yield the same results: test X once passes, once fails. Those unbearable sources of randomness must be mercilessly hunted and annihilated: they will very fast kill the faith of developers in tests – hence the intensity of their effort in maintaining them.
- Sometimes it’s not easy to spot these sources of randomness, especially if they don’t happen on developer machines. But there’s one thing you can do: put these tests in quarantine – a special build for ill-mannered tests marked by an infamous attribute, a kind of jail where they can be reeducated before they are reintegrated in proper society again. This way, they won’t harm the production of well-behaving tests. And fixing them will be easier if they can be run in isolation, repeatedly, in the build agent environment.
- Crucial tests. You can identity a couple of crucial tests without which you know your app is broken (such as starting and executing the most typical feature of the app). These tests should be executed with the build and break it when they fail (assuming you have so many automated tests otherwise that they have been distributed over several build projects).
The best of all tools to protect tests is the gated commit. Developers submit their branches to the CI, which passes the required build scripts (including static checks and tests). If the build is good, the branch is merged. If not, the developer is notified but nobody else is disturbed (which in itself tames the natural mayhem of large teams with many commits and helps build a more relaxed and focused environment). The amount of tests to put in your gated commit criteria should be as high as your pool of build agents allows you to process in a short enough time – I think 12 hours would be my maximum here, ten minutes would be ideal but difficult to achieve with many tests. If you can put them all and your tests are truly deterministic, the build of the main branch will always be green! But be careful: sources of randomness in tests might block your entire delivery pipeline, make sure they are taken care of.
To set up a gated commit, you need a DVCS enough at merging that most automatic merges succeed most of the time (git is pretty good at this) and support from your CI (we used TeamCity and GitHub together to achieve gated commit with limited effort).