ChatGPT’s AI healthcare push has a fatal flaw
More than 230 million people ask the app for health-related advice every week
OpenAI and Anthropic have both announced big plans to enter healthcare, with a consumer-focused tool called ChatGPT Health and a version of the chatbot Claude that can help clinicians figure out a diagnosis and write medical notes.
Notably absent from this flurry of announcements is Google. Its Gemini chatbot is one of the most popular and capable, so why not jump into the lucrative health market too? Perhaps because Google knows from experience that such an effort can backfire spectacularly.
Health advice is where generative artificial intelligence has some of its most exciting potential. But the newer AI companies, perhaps blinded by bravado and hype, face a fate similar to Google’s if they are not more transparent about their technology’s notorious hallucinations.
OpenAI is slowly rolling out a new feature that lets users ask questions about their health, with a separate memory and links to data from a person’s medical records or wellness apps if they choose to plug them in.
The company says ChatGPT Health is more secure and “not intended for diagnosis,” but many people already use it to work out what is ailing them.
More than 230 million people ask the app for health-related advice every week, the company says. It also announced ChatGPT for Healthcare, a version of the bot for clinicians that’s being trialed at several hospitals including Boston Children’s Hospital and Memorial Sloan Kettering Cancer Center.
Anthropic, which has had greater success than OpenAI in selling to businesses, launched a chatbot aimed at doctors. It looks the same as the consumer version of Claude, but it is trained on databases of medical data – such as diagnostic codes and healthcare-provider records, to help it generate authorisation documents – as well as academic papers from PubMed to help it walk a doctor through a potential diagnosis.
The company has given a tantalising glimpse of how that training can make Claude more accurate. When the consumer version of Claude is asked about the ICD-10 codes doctors use to classify a diagnosis or procedure, the answer is correct 75 per cent of the time, Anthropic’s chief product officer, Mike Krieger, said at a launch event earlier this month. But the doctors’ version of Claude, trained on those codes, is 99.8 per cent accurate.
What’s the accuracy rate when it comes to making a diagnosis, though? That number seems more important. When I asked Anthropic, the company couldn’t give a complete answer. It said its most powerful reasoning model, Claude Opus 4.5, achieved 92.3 per cent accuracy on MedCalc, which tests medical calculation accuracy, and 61.3 per cent on MedAgentBench, which measures whether an AI can do clinical tasks in a simulated electronic health-record system. But neither indicates how reliable the AI is with clinical recommendations. The first is a test of drug dosing and lab values; the 61.3 per cent stat is, let’s face it, a worryingly low score.
To its credit, Anthropic’s models are more honest – more likely to admit uncertainty than invent answers – than those made by OpenAI or Google, according to data compiled by Scale, the AI company recently purchased by Meta Platforms. Anthropic played up those numbers during its launch at the JPMorgan Chase Healthcare Conference in San Francisco, but such praise will ring hollow for doctors if the company can’t quantify how accurate its diagnostic tool actually is.
When I asked OpenAI about ChatGPT’s reliability with health facts, a spokeswoman said its models had become more reliable and accurate in health scenarios compared with previous versions, but she also did not provide hard numbers showing hallucination rates when giving medical advice.
AI companies have long been silent about how often their chatbots make mistakes, in part because doing so would highlight how difficult a problem this has been to solve. Instead, they will provide benchmark data showing, for instance, how well their AI models do on a medical licensing exam. But being more transparent about reliability will be critical in building trust both with clinical professionals and the public.
Alphabet’s Google learnt this the hard way. Between 2008 and 2011, it tried to create a personal health record under the banner “Google Health,” which could aggregate a person’s medical data from different doctors and hospitals in one place.
The effort failed in part because Google faced an enormous technical challenge in collating health data from incompatible systems. The bigger problem: People were creeped out at the idea of uploading their health records to a company that regularly hoovered up personal information for ads.
Public mistrust was so strong that a valiant effort by Google’s DeepMind lab to alert hospital doctors to signs of acute kidney failure was shut down in 2018 after it emerged the lab had accessed more than a million UK patient records as part of the project. A year later, the Wall Street Journal revealed another Google effort, known as Project Nightingale, to access the medical records of millions of US patients.
Both incidents were deemed scandals, and the lesson was clear: People perceived Google as untrustworthy. That makes the fate of AI companies in healthcare even more fraught. Google’s troubles came down to how it was perceived by the public, not to any errors its systems had made in processing medical records. The cost will be higher if ChatGPT or Claude makes a mistake while helping doctors make life-or-death decisions.
Perhaps it was naivety or blinkered thinking that led Dario Amodei, the chief executive of Anthropic, to address this exact point during his healthcare launch last week, even as his company provided no data to back it up. The definition of “safety” was expanding as his company entered new markets like health, he said. “Healthcare is one place you don’t want the model making stuff up,” he added. “That’s bad.” BLOOMBERG