Introduction
Since ChatGPT’s debut in November 2022, large language models (LLMs) have demonstrated enormous transformative power across industries, including healthcare and medical devices. Many device manufacturers are exploring ways to integrate LLMs into their products for clinical assistance, clinical decision support, patient support, and other functions. While LLMs, as a form of AI, bring unprecedented technical opportunities to improve clinical care, they also pose previously unseen challenges for quality assurance -- a foundational requirement for any medical device.
Medical device software (MDSW) incorporating LLMs differs from conventional MDSW in several fundamental ways.
1. The inputs and outputs of the MDSW are no longer parameter-bound. For example, instead of an input image with a defined number of grey values in the form of pixels and a well-defined output (e.g. a binary verdict of malignant or benign), you have inputs and outputs in the form of natural language, which is open-ended. A clear mapping of a parameterized input and output space is not possible for LLM inference, rendering traditional software testing based on well-defined input-output relationships infeasible.
2. Stochasticity inherent in LLMs’ algorithmic design and supporting hardware makes LLM output non-deterministic. Contrary to conventional MDSW, the same input is not guaranteed to produce the same output, although the variation is expected to follow a distribution that is locally stable around a center (most likely the mode) but with outliers that are hard to predict.
3. LLMs are trained as general purpose foundation models using a vast amount of digital data. The quality of such training data is poorly controlled for regulated medical use. The inclusion of LLMs as SOUP (Software of Unknown Provenance) in MDSW brings in a variety of risks related to verification such as factual inaccuracy, bias, and security backdoors, among others, that can compromise the safety of the MDSW. The risk control verifications of such SOUP components are challenging due to the vast size of the model’s knowledge space and the opacity of its training.
In the following sections of this article, we discuss each of the challenges identified above in detail and provide readers with suggestions for developing verification strategies that comply with IEC 62304 and, based on the state of the art (SOTA), are defensible before Notified Bodies during MDR conformity assessment.
Challenge 1: Unbounded I/O
Unlike conventional MDSW, where the input and output spaces are formally specifiable (e.g. a chest radiograph is a matrix of defined dimensions and bit depth; a classification output is a label drawn from a closed set), LLM-based MDSW operates on natural language, which is semantically open-ended in both input and output. This fundamental property has far-reaching consequences for verification.
Classical software testing relies on the ability to partition the input space into equivalence classes, define boundary conditions, and establish pass/fail criteria against a deterministic output. This framework presupposes a structured, enumerable input-output relationship. In LLM inference, no such structure exists. Small perturbations in prompt language can yield substantially divergent outputs, making stable equivalence-class construction impractical.
On the output side, failure is no longer a discrete, detectable event: it may manifest as omission of clinically relevant information, unsafe framing, overconfident reasoning, or hallucinated content. This breaks the requirement-to-test traceability chain expected under IEC 62304, because a requirement for open-ended natural language output cannot be straightforwardly mapped to a binary test verdict.
Several mitigation strategies have emerged from the literature. On the input side, a tightly scoped intended purpose statement, i.e., specifying the clinical context, user population, and permissible query types, serves as the foundation for all downstream verification design. Programmatic input guardrails can complement this by screening user inputs prior to LLM inference: classifying queries by topic relevance, rejecting out-of-scope inputs, and flagging anomalous prompt structures before they reach the model. This effectively narrows the active input space to a defined operational domain, making coverage-based testing more tractable. On the output side, architectural constraints, such as grammar-constrained decoding, schema-enforced structured outputs, or rules-based post-processing guardrails, can re-parameterize the output space into a structured, testable form before it reaches the clinical user, effectively reinstating a bounded verification target by design.
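As a minimal illustration of such a programmatic input guardrail, the sketch below screens user queries before they reach the model, rejecting anomalously long prompts and queries outside a hypothetical in-scope topic list. The patterns and length limit are illustrative placeholders; a production guardrail would typically use a validated topic classifier rather than keyword matching.

```python
import re
from dataclasses import dataclass

# Illustrative in-scope topics for a narrowly scoped intended purpose;
# real deployments would use a validated classifier, not keyword rules.
IN_SCOPE_PATTERNS = [
    r"\bdosage\b",
    r"\bside effects?\b",
    r"\bcontraindication",
]
MAX_PROMPT_CHARS = 2000  # illustrative anomaly threshold

@dataclass
class GuardrailVerdict:
    allowed: bool
    reason: str

def screen_input(prompt: str) -> GuardrailVerdict:
    """Screen a user query before LLM inference."""
    if len(prompt) > MAX_PROMPT_CHARS:
        return GuardrailVerdict(False, "anomalous prompt length")
    if not any(re.search(p, prompt, re.IGNORECASE) for p in IN_SCOPE_PATTERNS):
        return GuardrailVerdict(False, "out of scope for intended purpose")
    return GuardrailVerdict(True, "in scope")
```

A rejected query never reaches the model, which narrows the active input space to the operational domain the verification plan actually covers.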
Where free-text output is unavoidable, decompose-then-verify pipelines break the generated response into discrete evaluable claims, each assessed against a defined rubric for factual accuracy, clinical appropriateness, and safety. Natural language processing (NLP) evaluation metrics including BLEU, ROUGE, and BERTScore can provide automated, scalable assessment of free-text outputs against curated ground-truth references, though the correlation of these automated metrics with clinical utility must itself be empirically validated for the specific intended use. Complementarily, perturbation testing probes input-output consistency without requiring full space enumeration, while semantic entropy analysis can flag high-uncertainty responses statistically more susceptible to error. The most defensible posture under IEC 62304 and ISO 14971 would be a combination of these strategies, with an explicit acknowledgment in the risk management file that exhaustive input-output coverage is not achievable, and a benefit-risk analysis justifying the accepted residual risk.
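A perturbation-testing sketch along these lines is shown below: outputs produced from paraphrased variants of the same clinical query are scored for mutual agreement. Token-set overlap is used here purely as a stand-in for a semantic similarity metric such as BERTScore, whose correlation with clinical utility would itself need validation.

```python
from itertools import combinations

def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of token sets -- a crude stand-in for
    learned semantic similarity metrics such as BERTScore."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def perturbation_consistency(outputs: list[str]) -> float:
    """Mean pairwise similarity of model outputs generated from
    paraphrased variants of the same query; low values flag
    input-output instability for review."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    return sum(token_overlap(a, b) for a, b in pairs) / len(pairs)
```

Consistency scores aggregated over a curated query set can then be compared against a pre-specified acceptance threshold in the verification protocol.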
Challenge 2: Non-deterministic output
Non-determinism in LLMs arises from two distinct and partially independent sources, both with direct implications for verification under IEC 62304. The first is algorithmic: stochastic sampling strategies, reflected in hyperparameters such as temperature, top-p, and top-k, introduce probabilistic token selection by design. Setting temperature to zero, i.e. greedy decoding, is commonly assumed to eliminate this variability. However, a growing body of empirical evidence demonstrates that this assumption is unreliable in practice.
Khatchadourian and Franco (2025, arXiv:2511.07585) quantified output drift across five model architectures on regulated financial tasks, revealing a stark inverse relationship between model scale and output consistency: models of 7–8 billion parameters achieved 100% output consistency at temperature zero, while a 120-billion parameter model produced consistent outputs in only 12.5% of runs (95% CI: 3.5–36.0%), regardless of configuration (p < 0.0001). Atil et al. (2024, arXiv:2408.04667) similarly documented accuracy variations of up to 15% across equivalent runs at zero temperature, with best-to-worst performance gaps reaching 70% across tasks — findings that collectively cast doubt on the assumption that frontier models are preferable for regulated deployment.
The second source of non-determinism is computational. The non-associativity of floating-point arithmetic in parallel GPU execution means that the order of arithmetic operations, which is determined by batch composition and hardware scheduling, affects intermediate activation values and, consequently, output token probabilities. This effect is particularly pronounced in reasoning-focused models, where early rounding differences cascade through extended chains of thought. Recent work has identified batch-size variation as the primary driver of this phenomenon and has proposed batch-invariant kernels as a hardware-level mitigation (Yuan et al., 2025, arXiv:2506.09501).
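The floating-point effect described above is easy to demonstrate at small scale: re-associating the very same addition changes the result, which is exactly why batch composition and hardware scheduling can perturb intermediate activations.

```python
# Floating-point addition is not associative: at magnitude 1e16 the
# spacing between adjacent doubles is 2.0, so adding 1.0 to 1e16
# (or -1e16) is absorbed by rounding, while adding it to 0.0 is not.
a, b, c = 1e16, -1e16, 1.0
left = (a + b) + c   # (0.0) + 1.0  -> 1.0
right = a + (b + c)  # 1e16 + (-1e16) -> 0.0; c was lost to rounding
print(left, right, left == right)  # 1.0 0.0 False
```

In a GPU kernel the operand ordering is fixed not by the source code but by the reduction schedule, so the same logical sum can land on either side of such a rounding boundary from one run to the next.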
For IEC 62304-compliant verification, non-determinism cannot be assumed away by fixing temperature or a random seed; it must be empirically characterized for the specific model, deployment infrastructure, and hardware configuration. Verification protocols should incorporate repeated-run consistency testing with statistically grounded sample sizes across representative inputs. Application-level controls can meaningfully reduce drift and should themselves be verified as effective mitigation measures. Schema-constrained outputs restrict the LLM's response to a predefined structure, for example a JSON object with enumerated field names and value types. Because a constrained output space reduces the number of tokens over which stochastic variation can propagate, this architectural measure substantially improves run-to-run consistency while simultaneously re-bounding the output space for conventional verification. In Retrieval-Augmented Generation (RAG) architectures, where the LLM's response is conditioned on information retrieved from a controlled knowledge base, the ordering of retrieved passages passed to the model affects output generation: identical queries can retrieve the same documents but present them in different orders across runs, introducing an additional source of variability. Deterministic retrieval ordering, achieved by imposing a fixed, content-independent sort key on retrieved results before they are assembled into the prompt, eliminates this source of drift and reduces the RAG pipeline's contribution to non-determinism.
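Two of the controls above can be sketched in a few lines: a repeated-run consistency summary over N identical calls, and a canonical sort for retrieved RAG passages. The `doc_id`/`chunk_id` field names are assumptions for illustration; any stable, retrieval-score-independent key would serve.

```python
from collections import Counter

def consistency_report(outputs: list[str]) -> dict:
    """Summarize run-to-run consistency over repeated calls with an
    identical prompt and fixed decoding settings."""
    counts = Counter(outputs)
    _, modal_count = counts.most_common(1)[0]
    return {
        "n_runs": len(outputs),
        "distinct_outputs": len(counts),
        "modal_consistency": modal_count / len(outputs),
    }

def canonical_order(passages: list[dict]) -> list[dict]:
    """Impose a fixed, content-independent ordering on retrieved
    passages before prompt assembly, so the same retrieval set always
    yields the same prompt. Field names are illustrative."""
    return sorted(passages, key=lambda p: (p["doc_id"], p["chunk_id"]))
```

The modal-consistency rate, estimated at a statistically grounded sample size, is the kind of quantitative evidence a repeated-run protocol can record in the verification report.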
Challenge 3: LLMs as SOUP
IEC 62304 contains a structured framework for managing Software of Unknown Provenance (SOUP): manufacturers must identify all SOUP items (§8.1.2), document the functional and performance requirements placed upon them (§5.3.3), evaluate published anomaly lists (§7.1.3) and establish risk controls for anomalies with safety implications (§7.2), and treat changes to SOUP as change events requiring documented risk analysis (§7.4). Underlying this framework is a foundational assumption: that the manufacturer can meaningfully characterize what the SOUP component does, what its known failure modes are, and what functional boundaries contain it. Foundation models challenge each of these assumptions, and in particular give rise to new categories of risk, including factual inaccuracy, bias, and security backdoors, for which the existing SOUP verification toolkit is inadequately prepared.
LLMs generate outputs by predicting statistically likely token sequences, not by retrieving verified facts from a curated knowledge store. This architecture makes hallucination an intrinsic failure mode, not a detectable exception. Unlike a conventional SOUP component with a published anomaly list, the factual errors of a foundation model are spread across an effectively unbounded knowledge space and cannot be enumerated by the model provider. Verifying the absence of clinically harmful factual errors is therefore not reducible to conventional black-box testing: the test space is too large, and failure is not signaled by a well-defined error. Systematic evaluation requires curated clinical test sets with verified ground truth, multi-run sampling to characterize the distribution of erroneous outputs, and factual consistency scoring.
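A minimal sketch of multi-run factual scoring against such a curated test set is shown below. Substring matching of required facts is a deliberate simplification; real pipelines would use entailment or claim-verification models, but the structure of the measurement (failure rate across repeated samples) is the same.

```python
def contains_required_facts(output: str, required_facts: list[str]) -> bool:
    """Crude factual-consistency check: every curated ground-truth fact
    (as a normalized phrase) must appear in the output. A production
    pipeline would use an entailment model instead of substring match."""
    text = output.lower()
    return all(fact.lower() in text for fact in required_facts)

def error_rate_over_runs(outputs: list[str], required_facts: list[str]) -> float:
    """Fraction of repeated-sample runs that omit a required fact,
    characterizing the distribution of erroneous outputs."""
    failures = sum(not contains_required_facts(o, required_facts) for o in outputs)
    return failures / len(outputs)
```

Aggregating this rate per test item, over a statistically justified number of runs, yields the error distribution the risk management file must reckon with.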
In addition, foundation models trained on large-scale digital corpora encode the statistical regularities of those corpora, including demographic, cultural, and socioeconomic biases present in health-related content. Bias in an LLM-based MDSW may manifest as differential performance across patient subgroups in ways that are difficult to detect without targeted, stratified evaluation. Unlike conventional algorithmic bias in narrow ML classifiers, where the feature space and decision boundary can be inspected, LLM bias is distributed across billions of parameters and is not attributable to any identifiable architectural component. Verification requires prospective bias evaluation using demographically stratified test datasets, ideally co-designed with clinical experts, with pre-specified acceptance thresholds for inter-group performance disparity.
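The stratified evaluation described above can be reduced to a simple computation once pass/fail verdicts exist per test case: per-subgroup accuracy and the maximum inter-group gap, compared against a pre-specified threshold. The 0.05 threshold below is purely illustrative; the acceptable disparity must be justified clinically.

```python
def subgroup_disparity(results: list[tuple[str, bool]], threshold: float = 0.05):
    """Per-subgroup accuracy and max inter-group gap.

    `results` pairs a demographic subgroup label with a pass/fail
    verdict for each test case; `threshold` (illustrative) is the
    pre-specified acceptance limit for inter-group disparity."""
    by_group: dict[str, list[bool]] = {}
    for group, passed in results:
        by_group.setdefault(group, []).append(passed)
    accuracy = {g: sum(v) / len(v) for g, v in by_group.items()}
    gap = max(accuracy.values()) - min(accuracy.values())
    return accuracy, gap, gap <= threshold
```

A failed disparity check should feed back into the risk file as a detected hazard, not merely a test anomaly.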
On the security front, training data poisoning represents a unique supply chain threat. An adversary with the ability to inject content into a foundation model's training corpus can embed backdoor behaviors: for example, outputs that are normal under typical inputs but shift in a targeted, harmful direction when a specific trigger pattern is present in the prompt. Because the trigger is unknown to the manufacturer and the poisoned behavior is latent under normal testing, standard functional verification will not detect it. The verification challenge is compounded by the opacity of training data provenance. Therefore, a multifaceted risk mitigation strategy including contractual requirements for training data transparency from the model supplier, adversarial testing, and output anomaly monitoring in production is warranted to address this challenge. As with the two preceding challenges, risk control measures cannot fully eliminate the safety risks introduced by LLM SOUP components, a fact that must be explicitly addressed in the risk management file, with residual risks evaluated against the device's clinical benefits and documented in accordance with ISO 14971.
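One adversarial-testing primitive consistent with the above is trigger-sensitivity probing: append candidate trigger strings to a baseline prompt and flag any pair whose output diverges anomalously. The sketch below is an assumption-laden harness, not a detection guarantee: `model_fn`, `similarity_fn`, and the trigger corpus are all caller-supplied, and a latent backdoor with an unknown trigger can still evade it.

```python
def trigger_sensitivity(model_fn, prompt: str, candidate_triggers: list[str],
                        similarity_fn) -> dict[str, float]:
    """Probe for latent backdoor behavior: append each candidate
    trigger to the prompt and score output similarity to baseline.
    Low-similarity pairs are flagged for manual security review."""
    baseline = model_fn(prompt)
    return {
        t: similarity_fn(baseline, model_fn(f"{prompt} {t}"))
        for t in candidate_triggers
    }
```

Run over a fuzzed trigger corpus and combined with production anomaly monitoring, this gives at least a documented, repeatable probe rather than untested reliance on supplier assurances.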
It is worth noting that the classification of a foundation LLM as SOUP under IEC 62304 is a practically defensible and increasingly adopted position. However, it has not yet been formally settled in state-of-the-art standards or MDCG guidance. The SOUP framework in IEC 62304 is designed for discrete off-the-shelf software components with defined interfaces. Its application to multi-billion parameter generative models with emergent behaviors and opaque training pipelines requires careful adaptation and justification.
Conclusion
In this article, we examined three fundamental verification challenges posed by LLMs in MDSW: the unbounded nature of natural language input and output spaces, the intrinsic and computational sources of output non-determinism, and the novel safety risks introduced by incorporating foundation models as SOUP under IEC 62304. Each challenge undermines assumptions that have underpinned software verification practice for decades, and none yet has a fully standardized regulatory solution. What the current SOTA does offer is a set of emerging strategies -- input guardrails, output constraints, repeated-run consistency testing, stratified bias evaluation, and adversarial testing -- that, when combined and properly documented, can form a verification approach that is both technically defensible and coherent under the current regulatory framework.
Need help for verification strategy development or other aspects of conformity assessment for LLM-based MDSW? Reach out to us today! Qserve has extensive hands-on experience supporting medical device manufacturers through the regulatory complexities of AI/ML-enabled MDSW. From verification strategy design and test protocol development to risk management file preparation and clinical evaluation support, Qserve provides targeted, technically grounded advisory services tailored to each customer’s unique needs. We look forward to hearing from you!
