What the promise actually requires

Last summer, while working on the early stages of what would become Kaelo, I had a conversation with a nonprofit executive in Botswana that I’ve been thinking about ever since. She was genuinely excited about what machine learning and AI could do for health systems in the country. Not about any one tool or paper, but about the broader possibility: that these technologies might finally help close the gaps that decades of careful planning and hard work had only narrowed.

I remember agreeing with her, and I still do. The potential is real. But the conversation has stayed with me because of everything we didn’t say that afternoon, all the complications that sit underneath the optimism, and all the reasons that “ML can revolutionize global health” is both true and, on its own, dangerously incomplete.

The excitement isn’t misplaced. In the last decade, a handful of ML applications in global health have moved from proof-of-concept to real deployment. The clearest case is medical imaging. Deep learning systems for diabetic retinopathy, a leading cause of preventable blindness, now perform on par with retina specialists in community screening settings. In Thailand’s national screening program, one such system achieved 94.7% accuracy and 91.4% sensitivity for vision-threatening disease, matching the retina specialists who over-read its results. Google’s model alone has supported more than 600,000 screenings worldwide, with partnerships in India and Thailand aiming for six million more over the next decade. Similar work is underway for tuberculosis, cervical cancer, and pathology. The pattern is consistent: in narrow, well-defined visual tasks, these systems can substitute for a specialist who simply doesn’t exist in most of the world.

Beyond imaging, the picture is quieter but potentially larger. Forecasting drug demand, predicting outbreaks, matching patients to services, and optimizing where to place a clinic don’t generate headlines the way a diagnostic AI does, but they shape whether and where stockouts happen. And on the newest frontier, large language models are being tested for clinical translation, note summarization, and triage support. The hype is loud and the evidence is thin, but some of these tools will matter. We just don’t yet know which.

What ties the examples together isn’t the technology but the constraint: in most of the world, expertise is the scarcest resource in health care. ML offers, in principle, a way to spread that expertise thinner without losing as much as you’d expect. That is the real source of the excitement. It is also where the complications begin.

None of this works without infrastructure, and this is where the conversation usually gets harder. A deep learning model for diabetic retinopathy is only as good as the fundus cameras deployed in clinics, the electricity to run them, the connectivity to upload images, and the registries that track which patients were referred and whether they actually came back. Most writing on AI in global health skips this part because it’s unglamorous, but the entire question of whether these tools help or just exist sits here. A model trained in California and deployed in a rural clinic without stable power isn’t a medical advance. It’s a PowerPoint slide.

Data is the other half of the infrastructure problem, and it’s the half that raises harder questions. Training useful models requires patient records at scale, which means health systems must decide how to collect, store, and share data that is by nature deeply sensitive. Differential privacy, a leading standard for sharing statistics without exposing individuals, carries a real cost: the stronger the privacy guarantee, the noisier and less useful the data. Aras Selvi, a postdoc at Princeton ORFE, gave a talk this semester on exactly this trade-off, showing that larger perturbations provide stronger privacy guarantees but result in less accurate statistics. In a wealthy country with redundant data sources, that trade-off is manageable. In a country where the national health registry may be the only dataset of its kind, calibrating that trade-off badly, so that the data is either too private to be useful or too loose to be safe, can foreclose both the research and the trust that future research depends on.
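To make the trade-off concrete, here is a minimal sketch that assumes the standard Laplace mechanism applied to a simple counting query. It is an illustration only, not the method from the talk; the function names, the patient count of 120, and the epsilon values are all hypothetical.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng):
    # Adding or removing one person changes a count by at most 1,
    # so the sensitivity is 1 and the Laplace scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(true_count, epsilon, trials=20000, seed=0):
    # Average distance between the released count and the true count.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += abs(private_count(true_count, epsilon, rng) - true_count)
    return total / trials


if __name__ == "__main__":
    true_count = 120  # hypothetical: patients with some condition in one district
    for epsilon in (0.1, 1.0, 10.0):  # smaller epsilon = stronger privacy
        print(f"epsilon={epsilon}: mean abs error = {mean_abs_error(true_count, epsilon):.2f}")
```

Under these assumptions the average error grows roughly as 1/epsilon, so every tenfold tightening of the guarantee costs about a tenfold loss in precision. For a district where the true count is small, the noise can be the same size as the signal.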

All of which brings us to the hardest part. The same property that makes modern ML powerful, its ability to learn patterns from data that no human could fully specify, also makes it opaque. A deep learning model that grades a fundus image at specialist-level accuracy cannot, in general, tell you why it flagged a particular image. It produces a number. We can build tools to approximate its reasoning after the fact (saliency maps, SHAP values, attention visualizations), but these are post-hoc explanations of an opaque model, not windows into its actual logic. Cynthia Rudin, in a widely cited 2019 paper in Nature Machine Intelligence, argues that this distinction matters more than the field usually admits. Her thesis is blunt: trying to explain black-box models, rather than creating models that are interpretable in the first place, is likely to perpetuate bad practices and can potentially cause serious harm. The accuracy-interpretability trade-off, she argues, is often overstated. For many high-stakes decisions, an inherently interpretable model performs just as well as a black box, and if we insist on the black box anyway, we are choosing opacity we did not need.

In a global health context, this choice has compounding costs. Clinicians asked to act on a model’s recommendation need some basis for trusting or overriding it. A rural clinician who cannot interrogate why the system flagged a patient is in a worse position than one working with a transparent rule, not a better one. And when the model is wrong, and it will sometimes be wrong, the inability to trace the error forward or backward means that the failure cannot easily be diagnosed, corrected, or learned from. The model simply keeps running.

Which raises the accountability question. Consider a proprietary sepsis prediction model deployed at hundreds of US hospitals over the past several years. When researchers at the University of Michigan externally validated it in 2021, they found that it predicted the onset of sepsis with an area under the curve of 0.63, substantially worse than the performance its developer had reported. The model missed 67% of sepsis cases while generating alerts on 18% of all hospitalized patients, producing exactly the combination of false negatives and alert fatigue that degrades clinical decision-making. It had been deployed widely for years before this external validation was even possible, in part because the model was proprietary and its internals were not available for scrutiny. No one, not the vendor, not the hospitals, not the clinicians, had done the work to verify that it worked. The question of who was responsible for that silence has no clean answer.
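A short worked calculation shows what those two figures imply together. The 67% miss rate and the 18% alert rate come from the paragraph above; the 7% sepsis prevalence and the cohort of 1,000 patients are assumptions made purely for illustration, and the function name is hypothetical.

```python
def alert_breakdown(n_patients, prevalence, miss_rate, alert_rate):
    # Split a cohort into caught cases, missed cases, and false alerts.
    cases = n_patients * prevalence
    missed = cases * miss_rate
    caught = cases - missed
    alerts = n_patients * alert_rate
    false_alerts = alerts - caught
    return {
        "cases": cases,
        "missed": missed,
        "caught": caught,
        "alerts": alerts,
        "false_alerts": false_alerts,
        # Share of alerts that point at a real sepsis case.
        "ppv": caught / alerts,
    }


if __name__ == "__main__":
    result = alert_breakdown(
        n_patients=1000,   # assumed cohort size
        prevalence=0.07,   # assumed; not stated in the text above
        miss_rate=0.67,    # from the external validation described above
        alert_rate=0.18,   # from the external validation described above
    )
    for name, value in result.items():
        print(f"{name}: {value:.3f}")
```

Under the assumed prevalence, most alerts point at patients who do not have sepsis, while most patients who do have it never trigger one. That is the arithmetic behind alert fatigue.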

Now imagine the same scenario in a setting with fewer resources for external validation, weaker regulatory infrastructure, and a smaller research community positioned to publish the corrective paper. The failure modes do not go away in global health. They get harder to see. This is why the interpretability question and the accountability question are really the same question, asked at different scales. A model that cannot be interrogated cannot be audited, and a model that cannot be audited cannot be trusted, no matter how confident the deployment slide deck is about what it can do.

I don’t want to end on a pessimistic note, because I don’t think the pessimistic note is the right one. The nonprofit executive I spoke with last summer was right to be excited. The tools are real, the need is real, and the potential to close gaps that have persisted for decades is not hype. But the same conversation I keep thinking about also reminds me that excitement alone is not a plan. If ML is going to revolutionize global health, and I think parts of it will, the bar for deployment has to rise to meet the stakes: local validation before rollout, interpretable methods where they work, external audit of the ones that don’t, and a clear account of who is responsible when things go wrong. These are not obstacles to the promise of the technology. They are what the promise actually requires.