Machine Learning in Medical Devices: Promise, Pitfalls, and the Road Ahead

Kazimierz Krol

Consultant

‘AI’ has become the defining technology buzzword of recent years. It dominates headlines, investment portfolios, and boardroom conversations. Yet what most people call ‘artificial intelligence’ is not the sentient, self-aware entity that science fiction has long promised us.

What we actually have is machine learning (ML): a family of mathematical techniques that allow software to improve at a task by learning from data rather than following hand-written rules. The theoretical foundations of ML have existed for decades; neural networks were first described in the 1940s, and backpropagation, the method by which a neural network learns from its mistakes, dates to the 1980s. It is only in recent years, however, that computing hardware has become powerful enough to train and run advanced ML models at useful speeds, unlocking the wave of practical applications we see today.

Pattern Recognition: Where ML Truly Excels

The real strength of machine learning lies in pattern recognition. ML models can process images, video, motion data, and text faster and, often, more accurately than humans. In healthcare, this capability is already making a tangible difference. The UK produces roughly 133,000 diagnostic scans every day, yet the Royal College of Radiologists reports a 29% shortfall in the clinical radiologists needed to analyse them, with the gap expected to widen to 39% by 2029.¹,² Machine learning is increasingly being called upon to fill that void.

The examples from NHS deployments are striking. A British Heart Foundation-funded AI tool developed at UCL and Barts Heart Centre can analyse a heart MRI scan in just 20 seconds while the patient is still in the scanner, compared with the 13 minutes or more a doctor would need to interpret the images manually. It detects changes in heart structure and function with 40% greater precision than human analysis, and is estimated to save around 3,000 clinician days per year across the 120,000 heart MRI scans performed annually in the UK.³ The HeartFlow FFRCT Analysis, recommended by NICE under its MedTech Funding Mandate, uses deep learning to build a personalised 3D model of coronary arteries from a standard CT scan. It saves the NHS an estimated £391 per patient by reducing the need for invasive diagnostic procedures and, as of 2021, was available in over 65 NHS hospitals.⁴
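The clinician-days estimate can be sanity-checked with back-of-envelope arithmetic. The sketch below uses only the figures quoted above. The source does not say how long a 'clinician day' is, so rather than assuming one, the code works out what working day the 3,000-day figure implies.

```python
# Back-of-envelope check of the heart MRI time saving quoted in the text.
SCANS_PER_YEAR = 120_000       # UK heart MRI scans per year (from the text)
MANUAL_MINUTES = 13.0          # manual read time; the text says "13 minutes or more"
AI_SECONDS = 20.0              # AI analysis time (from the text)
CLINICIAN_DAYS_SAVED = 3_000   # estimate quoted in the text


def hours_saved_per_year(scans, manual_minutes, ai_seconds):
    """Total clinician hours saved if every scan is analysed by the tool."""
    saved_minutes_per_scan = manual_minutes - ai_seconds / 60.0
    return scans * saved_minutes_per_scan / 60.0


total_hours = hours_saved_per_year(SCANS_PER_YEAR, MANUAL_MINUTES, AI_SECONDS)
# Length of working day implied by the quoted "3,000 clinician days".
implied_day_hours = total_hours / CLINICIAN_DAYS_SAVED

print(f"Hours saved per year: {total_hours:,.0f}")
print(f"Implied clinician day: {implied_day_hours:.1f} hours")
```

With these inputs the saving comes to roughly 25,000 hours a year, which matches the quoted 3,000 days if a clinician day is a little under eight and a half hours. The published estimate is therefore internally consistent with the per-scan timings.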

In cancer diagnostics, the NHS has launched the EDITH trial. This is the world’s largest AI-led breast cancer screening study, involving nearly 700,000 women across 30 screening sites. The trial is testing whether AI can safely replace one of the two specialist readers currently required for every mammogram.⁵ Early pilots using Kheiron Medical’s Mia® system found it detected up to 13% more cancers than the standard double-reading process.⁶ Furthermore, Annalise.ai’s chest X-ray platform is being deployed across 64 NHS trusts through the government’s AI Diagnostic Fund. The platform can screen for up to 124 findings, including signs of lung cancer.⁷

The Shortcut Problem

Impressive as these results are, they come with a crucial caveat: the quality of any ML model’s output depends entirely on the quality of its training data and the rigour of the training process. When something goes wrong in that pipeline, models can learn the wrong thing entirely, a phenomenon researchers call shortcut learning.

A well-documented example involves chest X-ray analysis. Researchers found that AI models trained on open-source radiographic data for pneumothorax had learned to associate the presence of chest drainage tubes with the diagnosis, rather than the underlying lung condition. The model performed well in testing because patients with pneumothorax often do have chest tubes, but it systematically missed cases where the tube was absent. The AI had found a statistical shortcut rather than learning genuine pathology.⁸
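The mechanism can be reproduced on a toy dataset. The following is a minimal sketch using invented counts, not real radiographic data: each synthetic record has a hypothetical `has_tube` flag, a noisy `lung_signal` standing in for the genuine pathology, and a label. The 'model' is deliberately simple: it picks whichever single feature best predicts the label on the training data.

```python
# Toy illustration of shortcut learning. All counts are invented.
# Each record is (has_tube, lung_signal, has_pneumothorax).

def make_records():
    groups = [
        # (has_tube, lung_signal, label, count)
        (1, 1, 1, 72),   # pneumothorax, tube present, signal visible
        (1, 0, 1, 18),   # pneumothorax, tube present, signal missed
        (0, 1, 1, 8),    # pneumothorax, no tube, signal visible
        (0, 0, 1, 2),    # pneumothorax, no tube, signal missed
        (0, 1, 0, 20),   # healthy, spurious signal
        (0, 0, 0, 80),   # healthy, no signal
    ]
    records = []
    for tube, signal, label, count in groups:
        records.extend([(tube, signal, label)] * count)
    return records


FEATURES = {"has_tube": 0, "lung_signal": 1}


def accuracy(records, feature_index):
    """Accuracy of the rule 'predict positive when the feature is 1'."""
    correct = sum(1 for r in records if r[feature_index] == r[2])
    return correct / len(records)


def fit_stump(records):
    """Pick the single feature with the best accuracy on the given records."""
    return max(FEATURES, key=lambda name: accuracy(records, FEATURES[name]))


def sensitivity(records, feature_index):
    """Fraction of true positives that the single-feature rule detects."""
    positives = [r for r in records if r[2] == 1]
    detected = sum(1 for r in positives if r[feature_index] == 1)
    return detected / len(positives)


records = make_records()
chosen = fit_stump(records)
no_tube = [r for r in records if r[0] == 0]  # patients without a chest tube

print("Feature chosen by training:", chosen)
print("Overall accuracy:", accuracy(records, FEATURES[chosen]))
print("Sensitivity without a tube:", sensitivity(no_tube, FEATURES[chosen]))
```

On this synthetic data the tube flag wins on overall accuracy (95% against 80% for the lung signal), yet it detects none of the pneumothorax cases where no tube is present. A headline accuracy figure alone would not reveal the failure; evaluating on the no-tube subgroup does.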

The problem runs deeper than individual misdiagnoses. A 2022 study by MIT researchers revealed that radiology AI models can accurately predict a patient’s race from chest X-rays, something even the most skilled human radiologists cannot do. Follow-up research showed that the models most accurate at predicting demographics also exhibited the largest ‘fairness gaps’: significant discrepancies in diagnostic accuracy between men and women, and between white and black patients.⁹ In other words, the models were using demographic shortcuts to arrive at their diagnoses, producing less reliable results for historically underserved groups. This is not just an American problem; any model trained on imbalanced or biased datasets will carry those biases wherever it is deployed, including into NHS pathways.
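A fairness gap of this kind can be measured during validation by scoring the model separately for each group. The sketch below is one simple way to do so, assuming the gap is defined as the largest difference in per-group accuracy; the cited research may use a different metric, and the group labels and predictions here are hypothetical.

```python
from collections import defaultdict


def per_group_accuracy(records):
    """records: iterable of (group, y_true, y_pred). Returns {group: accuracy}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {group: correct[group] / total[group] for group in total}


def fairness_gap(records):
    """Largest difference in accuracy between any two groups (assumed metric)."""
    acc = per_group_accuracy(records)
    return max(acc.values()) - min(acc.values())


# Hypothetical validation results for two groups, "A" and "B".
example = (
    [("A", 1, 1)] * 45 + [("A", 1, 0)] * 5
    + [("A", 0, 0)] * 45 + [("A", 0, 1)] * 5
    + [("B", 1, 1)] * 35 + [("B", 1, 0)] * 15
    + [("B", 0, 0)] * 40 + [("B", 0, 1)] * 10
)

for group, acc in sorted(per_group_accuracy(example).items()):
    print(f"Group {group}: accuracy {acc:.2f}")
print(f"Fairness gap: {fairness_gap(example):.2f}")
```

In this invented example the model is 90% accurate for group A and 75% accurate for group B, a gap of 15 percentage points that a single pooled accuracy of 82.5% would hide.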

The Human in the Loop

These failure modes point to a broader and more insidious risk: the tendency of humans to uncritically trust AI output. In 2023, a New York lawyer was sanctioned after submitting a legal brief drafted with ChatGPT; the brief cited six court cases that did not exist, complete with fabricated judicial quotations, because the lawyer had not verified the output. The consequences in that instance were professional embarrassment and a $5,000 fine.¹⁰ In a clinical setting, however, the equivalent error could be fatal.

An ML model that confidently misdiagnoses because it learned the wrong shortcut is only dangerous if a clinician accepts its output without scrutiny. Every deployment of ML in healthcare must therefore be designed on the assumption that the model will sometimes be wrong, and that the human in the loop is not a formality but the last line of defence.

This risk is sharpened by an awkward truth about how ML systems present their results. A language model is optimised to produce plausible-sounding prose, while a diagnostic classifier returns a confident probability score. In neither case, however, does the form of the output guarantee the soundness of the underlying judgement. A result that looks right is far more likely to be accepted without scrutiny, and that tendency has a name: ‘automation bias’. This well-documented human propensity to defer to an automated system, even when its recommendations conflict with one’s own judgement, is perhaps the single greatest threat to the safe use of AI in medicine.

Edge AI: Intelligence at the Point of Care

A particularly exciting branch of ML in healthcare is Edge AI, where models run directly on a local device rather than on remote servers. Unlike cloud-based systems, Edge AI processes data where it is captured, eliminating the need for an internet connection and drastically reducing latency.

The simplest everyday example is a fitness tracker recognising movement patterns to count steps. But the technology extends far beyond pedometers. Modern smartwatches can detect irregular heart rhythms, estimate blood oxygen levels, and monitor sleep architecture, all of which is processed on a chip smaller than a fingernail. Continuous glucose monitors equipped with Edge AI analyse blood sugar trends in real time and offer personalised management recommendations for diabetic patients. Smart clothing embedded with sensors can track heart rate, respiratory rate, and exercise posture, feeding data into on-device ML models that provide immediate feedback.

A compelling UK example of in-device ML comes from Cambridge. Neocam, a handheld screening tool invented by paediatric ophthalmologist Dr Louise Allen at Addenbrooke’s Hospital, uses digital imaging to detect congenital cataracts in newborns, the most common cause of preventable childhood blindness. In a collaboration with 42 Technology, software engineers are training an ML model on 46,000 images of babies’ eyes so that midwives can instantly tell whether the photograph they have taken is of sufficient quality for Neocam-assisted cataract diagnosis.

The device is currently being trialled across 30 UK maternity units in the NIHR-funded DIvO study, with over 140,000 babies to be screened. Early results have already identified rare visual conditions that conventional ophthalmoscope examination alone would have missed.¹¹

The efficiency gains are substantial. Because only processed summaries, rather than raw sensor streams, leave the device, edge-enabled wearables require dramatically less network bandwidth than cloud-dependent equivalents. That lighter communication load translates directly into longer battery life, with on-device inference avoiding the power cost of continuously streaming data to a remote server.
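A rough calculation shows the scale of the difference. The sensor parameters below are illustrative assumptions, not figures from any particular device: a three-axis accelerometer sampled at 50 Hz with two bytes per sample, compared with a 16-byte summary sent once a minute.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def raw_bytes_per_day(sample_rate_hz, channels, bytes_per_sample):
    """Data volume if every raw sample is streamed off the device."""
    return sample_rate_hz * channels * bytes_per_sample * SECONDS_PER_DAY


def summary_bytes_per_day(summaries_per_hour, bytes_per_summary):
    """Data volume if only periodic on-device summaries are sent."""
    return summaries_per_hour * 24 * bytes_per_summary


# Assumed (hypothetical) sensor and summary parameters.
raw = raw_bytes_per_day(sample_rate_hz=50, channels=3, bytes_per_sample=2)
summary = summary_bytes_per_day(summaries_per_hour=60, bytes_per_summary=16)

print(f"Raw stream:     {raw / 1e6:.2f} MB per day")
print(f"Summaries only: {summary / 1e3:.2f} kB per day")
print(f"Reduction:      {raw / summary:.0f}x")
```

Under these assumptions the raw stream is about 26 MB a day against roughly 23 kB of summaries, a reduction of more than a thousandfold. Real devices will differ, but the ratio illustrates why the radio, and therefore the battery, has far less work to do.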

The trend is accelerating as chip manufacturers continue to release processors designed specifically for efficient ML inference. Some companies are pursuing hybrid architectures that combine lightweight Edge AI for immediate on-device analysis with cloud-based models for more sophisticated long-term insights. This approach captures the best of both worlds: responsiveness and privacy at the edge, depth and power in the cloud.
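One way to picture such a hybrid design is an on-device model that always returns an immediate answer and queues a compact summary for later cloud analysis whenever it is unsure. This is a minimal sketch under stated assumptions: `HybridMonitor`, `toy_edge_model`, and the 0.8 threshold are all hypothetical names and values chosen for illustration, not a description of any real product.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class HybridMonitor:
    """Edge-first pipeline: answer locally, defer uncertain cases to the cloud."""
    edge_model: Callable[[List[float]], Tuple[str, float]]
    escalation_threshold: float = 0.8          # assumed confidence cut-off
    cloud_queue: List[dict] = field(default_factory=list)

    def process(self, window: List[float]) -> str:
        label, confidence = self.edge_model(window)
        if confidence < self.escalation_threshold:
            # Queue a compact summary for later upload, not the raw samples.
            self.cloud_queue.append({
                "label": label,
                "confidence": confidence,
                "mean": sum(window) / len(window),
            })
        return label  # the user gets an immediate result either way


def toy_edge_model(window: List[float]) -> Tuple[str, float]:
    """Stand-in for a real on-device model: flags windows with a high mean."""
    mean = sum(window) / len(window)
    label = "irregular" if mean > 1.0 else "normal"
    confidence = min(1.0, abs(mean - 1.0) + 0.5)
    return label, confidence


monitor = HybridMonitor(edge_model=toy_edge_model)
print(monitor.process([0.1, 0.2, 0.3]))   # clear-cut window, handled on device
print(monitor.process([1.0, 1.1, 1.2]))   # borderline window, also queued
print(len(monitor.cloud_queue), "item(s) queued for cloud analysis")
```

The design choice worth noting is that the cloud path is additive: the device never waits on the network to respond, and only borderline cases generate any upload at all.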

The Case Against the Cloud

The broader trend towards cloud-based health data processing carries undeniable benefits, including access to more powerful algorithms, centralised data aggregation, and the ability to update models remotely. But it also brings significant disadvantages that deserve scrutiny.

Latency is one. For time-critical applications such as real-time surgical assistance or cardiac arrhythmia detection, even a few hundred milliseconds of round-trip delay can matter. Security is another: every data transmission across a network expands the attack surface for sensitive health information. Because edge architectures keep more of that data on the device, they tend to expose less protected health information to a breach than systems relying on centralised processing.

Then there are the commercial pressures. Subscription-based pricing models, forced hardware obsolescence, opaque data-use policies, and the insertion of advertising into health platforms all erode user trust. It is no surprise that both consumers and healthcare providers are increasingly drawn to offline-capable solutions that offer greater control, lower recurring costs, and fewer privacy concerns.

The Regulatory Landscape: Playing Catch-Up

Regulators worldwide are racing to keep pace with the technology. In the UK, the MHRA has been reshaping its approach since launching its Software and AI as a Medical Device Change Programme in 2023.¹² In September 2025, the MHRA launched the National Commission into the Regulation of AI in Healthcare, chaired by Professor Alastair Denniston, which opened a public Call for Evidence in December 2025 and is expected to deliver its recommendations in 2026.¹³ The Commission brings together clinicians, technologists, patient groups, and representatives from all four UK nations to tackle fundamental questions about how AI medical devices should be governed.

A major practical development came in July 2025, when the MHRA announced new international reliance routes: if the US FDA, Health Canada, or Australia’s TGA has already cleared a medical device, manufacturers can use a streamlined pathway to bring it to the UK market without duplicating the full assessment. For AI developers previously facing waits of a year or more for a second regulatory audit, this could shave several months off the time to market. The MHRA is also consulting on removing the sunset clause on CE mark recognition in Great Britain, which would allow devices cleared under the EU’s regulatory framework, including those meeting the AI Act’s requirements, to remain eligible for the UK market indefinitely.¹⁴

In Europe, the EU AI Act, the world’s first horizontal AI regulation, adds a new layer of requirements on top of the existing Medical Device Regulation. Most AI-enabled medical devices will be classified as high-risk AI systems, subject to stringent obligations around data governance, robustness, transparency, human oversight, and bias detection. Full compliance for medical devices is expected by August 2027.¹⁵ Whether these overlapping frameworks will improve patient safety or simply create a regulatory maze that stifles innovation remains to be seen.

A Note on ‘Intelligence’

It is worth pausing on the word ‘intelligence’ itself. In my view, current AI systems are not intelligent in any meaningful sense. They cannot truly comprehend reality, reason about novel situations, or produce genuinely original ideas beyond the boundaries of their training data. A telling illustration is the persistent difficulty image generators have with conceptually simple details: they produce a hand with an extra finger, or a facial feature that blends into a person’s hair. The model has no understanding of anatomy, logic, or spatial relationships; it merely interpolates between patterns it has seen before.

But intelligent or not, these systems are undeniably powerful tools. The medical devices built on ML are already saving lives, catching diseases earlier, and extending the reach of healthcare into underserved communities. The technology’s limitations, including shortcut learning, demographic bias, and regulatory complexity, are serious, but they are not reasons to dismiss it. They are reasons to engage with it thoughtfully, demand transparency, insist on rigorous validation, and keep a healthy scepticism about any claim that sounds too good to be true.

The future of ML in medical devices is not a question of whether, but of how. Done well, it will be one of the most consequential advances in modern healthcare. Done carelessly, it risks embedding bias, eroding trust, and causing real harm. The stakes are high, and the choices we make now, as engineers, regulators, clinicians, and patients, will shape the outcome.
