
The testing approach


Collecting and testing contributed open-source code is a current focus of the Taskforce. At present, we are developing unit tests for each type of functionality, e.g. T1 measurement, pharmacokinetic models, etc.

This page describes the general approach of Taskforce 2.3 to code testing. The procedure for developing and implementing tests is described in detail here.

What is a unit test?

A unit test executes a function with a specified set of input parameters and compares the actual vs. expected (reference) outputs. If these are equal, within a specified tolerance, the test passes. If not, it fails.
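As a minimal sketch (using a trivial stand-in function rather than real contributed code), such a test could be written with pytest as follows:

```python
import pytest

# Stand-in function under test: converts a relaxation rate R1 (1/s) to T1 (s).
def r1_to_t1(r1):
    return 1.0 / r1

def test_r1_to_t1():
    r1 = 0.714          # input parameter (illustrative value)
    t1_expected = 1.4   # reference (expected) T1 in seconds

    t1_measured = r1_to_t1(r1)

    # the test passes if actual and expected agree within a 1% relative tolerance
    assert t1_measured == pytest.approx(t1_expected, rel=0.01)
```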

What is OSIPI testing?

The present focus of the Taskforce is to develop unit tests that verify the scientific performance of code, i.e. does it return the "correct" answer? At present we are not concerned with testing non-scientific aspects such as speed, input validation, etc.

Viewing automated testing reports

GitHub Actions is used for automated testing of the community-contributed source code. Tests are run every time the repository is modified. The testing status is shown at the beginning of this document by the GitHub status badge labelled "ci passing". Clicking on it takes you to GitHub Actions, where detailed reports for past builds are available.

What is the standard for pass vs. fail?

For most perfusion functionality, there is no clear "gold standard" that can be used to assess the scientific performance of code. For example:

  • DRO reference values depend on the specific code used to simulate the data
  • In-vivo results depend on the code used to fit the data
  • Reference values from the literature (or specification values for a phantom) depend on the acquisition and processing methods used to obtain them.

Our strategy is therefore to verify "reasonable" agreement between the output of contributed code and reference data from trusted, open sources. In general, we expect code contributions to "pass" most of the tests, unless there is a significant bug or scientific error in the code.

We also report and compare the quantitative outputs from the tests using a graphical approach.

How much testing is done?

We suggest that each type of functionality should be tested using one or more sources of test data. For example, the present tests of T1 measurement use data from (i) in-vivo prostate scans, (ii) in-vivo brain scans and (iii) a QIBA digital reference object. Within each data set, we suggest that multiple different cases should be tested, e.g. multiple voxels/ROIs in different tissues. For unit testing, it is not necessary to process entire images, since this would make the tests slow.
More tests can be added to cover additional scenarios if needed.

What data is used for testing?

Test data can be simulated or taken from real scans. Data used for testing (i.e. input parameters, signals and reference values) should ideally be citeable as a publication, DOI or similar. If the data itself is not citeable then a reference describing the protocol, patient cohort and method used to obtain reference values can be cited.

What are reference values?

The exact nature of tests and reference values will depend on the functionality being tested. We suggest the following guidelines:

  • For in-vivo data, the reference value (e.g. T1) should be obtained using software independent of that being tested.
  • For synthetic data (e.g. DROs), the reference value will most likely be the value used to simulate the data.
  • Ideally, the values, software and protocols used to obtain the reference values will be citeable as a publication, DOI or similar.

Tolerance

For various reasons (noise, different analysis methods etc.) we do not expect measured and reference values to be identical. Therefore, a tolerance should be specified for each group of tests. The test writer (and the Taskforce) will judge what is a "reasonable" tolerance depending on the functionality being tested and the data used. For example, a function claiming to reproduce the Parker population-average AIF model should be highly accurate, whereas a larger tolerance will be needed when measuring KTrans in a noisy dataset.

It may be necessary to specify an absolute as well as a relative tolerance, e.g. where some reference values are close to zero.
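For example, numpy's assert_allclose accepts both a relative and an absolute tolerance. The sketch below uses purely illustrative values, where one reference value is close to zero and a relative tolerance alone would be too strict:

```python
import numpy as np

def test_tolerance_example():
    measured = np.array([0.0021, 1.38])   # illustrative outputs (e.g. ve, T1)
    reference = np.array([0.0, 1.40])     # reference values; the first is ~zero

    # passes if |measured - reference| <= atol + rtol * |reference| element-wise;
    # the absolute tolerance handles the near-zero reference value
    np.testing.assert_allclose(measured, reference, rtol=0.05, atol=0.005)
```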

Expected fails

Specific test cases may fail without there being an issue with the source code. These cases will be documented within the test function and marked as expected failures.
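A minimal sketch of how this might be done with pytest's xfail marker (the values asserted here are purely illustrative):

```python
import pytest

# Documented expected failure: the discrepancy is a known property of this
# reference data set, not a bug in the code under test.
@pytest.mark.xfail(reason="known limitation of the reference data, not a code bug")
def test_known_problem_case():
    assert 1.52 == pytest.approx(1.40, rel=0.05)
```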

How are tests implemented in the repository?

Unit tests are implemented using the pytest package. The pytest parametrize decorator makes it straightforward to test multiple cases. The procedure for developing and implementing tests is described here.
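As an illustration, a parametrized test might look like the following sketch; r1_to_t1() and the voxel values are stand-ins rather than actual OSIPI code:

```python
import pytest

# Stand-in function under test: converts a relaxation rate R1 (1/s) to T1 (s).
def r1_to_t1(r1):
    return 1.0 / r1

# Each tuple is one test case: (input, reference value, relative tolerance).
@pytest.mark.parametrize("r1, t1_ref, rtol", [
    (1.25, 0.8, 0.05),   # illustrative white-matter voxel
    (0.70, 1.4, 0.05),   # illustrative grey-matter voxel
    (0.25, 4.0, 0.05),   # illustrative CSF voxel
])
def test_r1_to_t1(r1, t1_ref, rtol):
    assert r1_to_t1(r1) == pytest.approx(t1_ref, rel=rtol)
```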