In July 2021, the UK standards organisation BSI announced that it was developing guidelines for the application of ISO 14971 to artificial intelligence (AI) and machine learning (ML). This was followed by the recent publication in April 2022 of the consensus report AAMI CR34971:2022, Guidance on the Application of ISO 14971 to Artificial Intelligence and Machine Learning.
Designed to be used in conjunction with ISO 14971:2019, Medical devices — Application of risk management to medical devices, CR34971:2022 shares its structure with ISO/TR 24971:2020, Medical devices — Guidance on the application of ISO 14971. Standards are written to support manufacturers in designing products; some become harmonised standards and are given a role within the regulatory framework, whereas Technical Reports (TRs) and consensus reports (CRs) describe best practice, and a non-conformance would never be raised against them. “Consensus” usually indicates that there may be divergent opinions or approaches, but that this is the core the majority can agree on; in a new and evolving science, that makes it a valuable benchmark of good practice.
Risk management is the cornerstone of the medical device product development lifecycle. This consensus report (CR) aims to provide a framework for identifying and addressing the unique AI/ML-related hazards, hazardous situations, and potential harms that can arise across all stages of the product lifecycle.
Now, with the formalities aside, what does this new consensus report offer to those using AI and ML in the development of medical devices? Perhaps most useful are Annex B and its subsections, which contain risk management examples (from hazards to risk control measures) on the identification of characteristics related to safety, covering the following areas in more detail:
1. Data Management
2. Bias
3. Data storage/security/privacy
4. Overtrust
5. Adaptive systems
While some of these areas are specific to AI/ML, others are familiar bugbears for medical device developers; each, however, brings novel complications when working with AI/ML. Let’s look at these in more detail…
Data management is a broad field, but CR34971 calls out the need to consider specific issues such as data completeness, consistency, and correctness. What are the implications of data quality and model complexity as they pertain to performance hazards, applicability, and generalization? This will depend on the specifics of your device, but here you are provided with examples of each and prompted to consider these issues as part of your risk management activities. For example, using incorrect, incomplete, subjective, inconsistent, and/or atypical data can lead to deterioration of AI/ML model performance. The hazards associated with these data quality issues, and any assumptions about the data properties, must therefore be included in the risk management process, together with the control measures used to mitigate their impact on performance and safety. This section also raises the issue of the “bias/variance trade-off”, a fundamental issue when developing AI/ML models, and the need to consider controlling complexity in model development.
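To make this concrete, here is a minimal sketch (not taken from the CR) of the kind of automated data-quality gate that could feed into a risk management file: screening training records for completeness and plausibility before they reach the model. The field names, plausible ranges, and record structure below are illustrative assumptions.

```python
def data_quality_report(records, required_fields, valid_ranges):
    """Summarise completeness and correctness issues in a list of dict records."""
    issues = {"missing": 0, "out_of_range": 0}
    for rec in records:
        # Completeness: every required field must be present and non-null.
        if any(rec.get(f) is None for f in required_fields):
            issues["missing"] += 1
            continue
        # Correctness: values must fall inside clinically plausible ranges.
        for field, (lo, hi) in valid_ranges.items():
            if not (lo <= rec[field] <= hi):
                issues["out_of_range"] += 1
                break
    issues["usable"] = len(records) - issues["missing"] - issues["out_of_range"]
    return issues

records = [
    {"age": 54, "sbp": 128},    # complete and plausible
    {"age": None, "sbp": 140},  # incomplete -> completeness hazard
    {"age": 47, "sbp": 400},    # implausible -> correctness hazard
]
report = data_quality_report(records, ["age", "sbp"],
                             {"age": (0, 120), "sbp": (50, 250)})
print(report)  # -> {'missing': 1, 'out_of_range': 1, 'usable': 1}
```

Counts like these give a quantifiable input to the risk analysis: the proportion of unusable records can be tracked as evidence that a data-quality control measure is effective.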
Consideration of bias, one of the most fundamental pillars of statistical rigour necessary for medical and scientific research and the medical device product development lifecycle, receives an appropriate spotlight in CR34971. While noting that bias can have both positive and negative performance effects, a handful of types of bias that can specifically impact AI/ML models are discussed in detail – selection bias, implicit bias, group attribution bias, and experimenter’s bias. This section highlights how missing data, sample bias (data not collected randomly), and coverage bias (data does not match the target population) can result in selection bias and potential risks to product safety and efficacy. As mitigation, it is recommended that verification is performed at the end of data collection, to ensure the data set is appropriately distributed. Of course, device manufacturers must consider, and evaluate, how bias can introduce hazards and hazardous situations beyond the development phases. For example, the design of the human user interface of a decision-making device that determines risk levels should be evaluated to ensure that the means of reporting calculated risk does not introduce bias and unduly influence the user.
Data storage/security/privacy is already at the forefront of attention for all organisations, due to the business and regulatory risks associated with ignoring or neglecting it. What is special about AI/ML medical devices? When it comes to cyber security, the CR calls out the example of an adversarial attack against a medical image classifier, where subtle image changes can result in completely different classifications with high confidence. Whilst the misclassification of a cat as guacamole is highly amusing, the potential harms in a medical context are far greater; yet the tools to mount such an attack are the same, and freely available online.
From https://github.com/anishathalye/obfuscated-gradients
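As a toy illustration of the intuition behind such attacks (this is not the method from the repository linked above): for a simple linear scorer, nudging each input feature by a small amount against the gradient of the score can flip the predicted class while leaving the input almost unchanged. Gradient-sign attacks such as FGSM apply the same idea to deep networks. All weights and inputs below are made up for illustration.

```python
def predict(w, x):
    """Linear classifier: class 1 if the weighted sum is positive."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def sign(v):
    return 1 if v > 0 else (-1 if v < 0 else 0)

def perturb(w, x, eps):
    """Shift each feature by eps against the score gradient (FGSM-style)."""
    return [xi - eps * sign(wi) for wi, xi in zip(w, x)]

w = [0.4, -0.2, 0.3]   # illustrative model weights
x = [0.2, 0.3, 0.1]    # score = 0.08 - 0.06 + 0.03 = 0.05 -> class 1
x_adv = perturb(w, x, eps=0.1)
print(predict(w, x), predict(w, x_adv))  # -> 1 0
```

A perturbation of only 0.1 per feature flips the decision, because the attack concentrates its budget exactly where the model is most sensitive; the same principle scales up to imperceptible pixel changes in image classifiers.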
Overtrust occurs when people rely on technology beyond its capabilities and become so dependent on it that it introduces risk to the patient. CR34971 presents different scenarios where this can manifest, as well as the need for post-production monitoring to identify when it may be happening so that appropriate actions can be taken. An interesting addition to this section is the suggestion that disclosing the confidence in the AI/ML performance could be used as a risk control measure for overtrust, in that it would set and anchor the user’s expectation of the performance of the device. The effectiveness of this measure could be assessed and quantified as part of usability evaluations during development and monitored post-production.
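One possible shape for that confidence-disclosure control is sketched below: every output carries its confidence, and low-confidence results are explicitly flagged for human review rather than presented as definitive. The 90% review threshold and the wording are illustrative assumptions, not requirements from the CR.

```python
REVIEW_THRESHOLD = 0.9  # illustrative cut-off for flagging human review

def present_result(label, confidence):
    """Format a device output so the user's trust is anchored to confidence."""
    if confidence < REVIEW_THRESHOLD:
        return f"{label} (confidence {confidence:.0%} - clinician review advised)"
    return f"{label} (confidence {confidence:.0%})"

print(present_result("No abnormality detected", 0.97))
# -> No abnormality detected (confidence 97%)
print(present_result("Possible lesion", 0.62))
# -> Possible lesion (confidence 62% - clinician review advised)
```

Whether such wording actually recalibrates user trust is exactly what the usability evaluations mentioned above would need to measure.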
Adaptive systems, which, unlike traditional software systems that do not change over time, have the capacity to continue learning from new observations post-installation, rightly receive specific attention, as they represent a unique capability of AI/ML systems. While not all AI/ML systems are designed to do this, CR34971 recommends self-validation, roll-back, and post-market validation processes, amongst other options, to protect against the risks incurred by adaptive medical device systems with this capability.
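A self-validation gate with roll-back might look something like the sketch below: a candidate model produced by post-installation learning is only promoted if it passes a held-out validation check, otherwise the device keeps the last validated model. The class, the 0.9 accuracy threshold, and the stubbed validation are all illustrative assumptions.

```python
class AdaptiveDeployment:
    """Gate model updates behind a validation check, with implicit roll-back."""

    def __init__(self, model, min_accuracy=0.9):
        self.active = model            # last validated model stays active
        self.min_accuracy = min_accuracy

    def propose_update(self, candidate, validate):
        """Promote the candidate only if it passes validation."""
        if validate(candidate) >= self.min_accuracy:
            self.active = candidate
            return "promoted"
        return "rolled back"

# Illustrative: models are just version labels, accuracies are stubbed.
accuracies = {"v1": 0.93, "v2": 0.81}
deploy = AdaptiveDeployment("v1")
outcome = deploy.propose_update("v2", lambda m: accuracies[m])
print(outcome, deploy.active)  # -> rolled back v1
```

The key design choice is that the adaptive loop can never silently replace a validated model: every promotion leaves an auditable record, which also supports the post-market validation the CR recommends.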
A key part of the risk management process is the evaluation of overall residual risk. This can be a difficult process for any device, but the use of AI/ML adds further challenges. Fortunately, the CR provides some suggestions on how to perform this evaluation. Algorithmic outcomes, impacts of decision/prediction thresholds, and choice of fairness tests and performance metrics should be documented and justified with an attempt made to quantify the resulting residual risks of these choices.
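Documenting the impact of a decision threshold, as suggested above, can be as simple as tabulating sensitivity and specificity at each candidate threshold so the residual risk of the chosen operating point can be quantified and justified. The scores and labels below are illustrative, not real device data.

```python
def confusion_at(threshold, scores, labels):
    """Return (sensitivity, specificity) of thresholded scores vs. true labels."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return tp / (tp + fn), tn / (tn + fp)

scores = [0.95, 0.80, 0.60, 0.40, 0.20]
labels = [1,    1,    0,    1,    0]
for t in (0.5, 0.7):
    sens, spec = confusion_at(t, scores, labels)
    print(f"threshold {t}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

A table like this, generated across thresholds and subgroups, gives the documented, quantified justification of threshold choice that the evaluation of overall residual risk calls for.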
Explainability of outcomes and, where possible, information regarding feature weightings should be documented appropriately for all stakeholders, from users and auditors to the general public.
While we have only touched on a few areas covered by CR34971, there is clearly much to digest when it comes to the risk management of AI/ML medical devices. This consensus report provides a lens for exposing, and tools for mitigating, AI/ML-related risks, and the framework, similar to that of ISO/TR 24971:2020, to formalise this within the risk management process. With the number of AI/ML-based medical devices gaining market access globally, and many more under development, this consensus report should be welcomed with open arms by the MD/IVD industry.
If you wish to track the development of this interesting subject, further information can be found here: