
Half of AI Health Advice Is Wrong—And Seems Just Right

Decrypt | Jose Antonio Lanz

In brief

  • Nearly half of AI chatbot responses to health questions were rated "somewhat" or "highly" problematic in a BMJ Open audit of five major chatbots.
  • Grok produced significantly more "highly problematic" responses than statistically expected, while nutrition and athletic performance questions fared worst across all models.
  • No chatbot produced a fully accurate reference list.


Nearly half of the health and medical answers provided by today's most popular AI chatbots are wrong, misleading, or dangerously incomplete—and they're delivered with total confidence. That's the headline finding of a new peer-reviewed study published April 14 in BMJ Open.

Researchers from UCLA, the University of Alberta, and Wake Forest tested five chatbots—Gemini, DeepSeek, Meta AI, ChatGPT, and Grok—on 250 health questions covering cancer, vaccines, stem cells, nutrition, and athletic performance. The results: 49.6% of responses were problematic. Thirty percent were "somewhat problematic," and 19.6% were "highly problematic"—the kind of answer that could plausibly lead someone toward ineffective or dangerous treatment.

To stress-test the models, the team used an adversarial approach—deliberately phrasing questions to push chatbots toward bad advice. Questions included whether 5G causes cancer, which alternative therapies are better than chemotherapy, and how much raw milk to drink for health benefits.

"By default, chatbots do not access real-time data but instead generate outputs by inferring statistical patterns from their training data and predicting likely word sequences," the authors write. "They do not reason or weigh evidence, nor are they able to make ethical or value-based judgments."

That's the core problem. The chatbots aren't consulting a doctor—they're pattern-matching text. And pattern-matching on the internet, where misinformation spreads faster than corrections, produces exactly this kind of output.

The researchers continue: "This behavioural limitation means that chatbots can reproduce authoritative-sounding but potentially flawed responses." Out of 250 questions, only two prompted a refusal to answer—both from Meta AI, on anabolic steroids and alternative cancer treatments. Every other chatbot kept talking.

Performance varied by topic. Vaccines and cancer fared best—partly because high-quality research on those subjects is well-structured and widely reproduced online. Nutrition had the worst statistical performance of any category in the study, with athletic performance close behind. If you've been asking AI whether the carnivore diet is healthy, the answer you got was probably not grounded in scientific consensus.

Grok stood out for the wrong reasons. Elon Musk's chatbot was the worst performer of any model tested. Of its 50 responses, 29 (58%) were rated problematic overall—the highest share across all five chatbots. Fifteen of those (30%) were highly problematic, significantly more than expected under a random distribution. The researchers connect this directly to Grok's training data: X is a platform known for spreading health misinformation rapidly and widely.

Citations were a separate disaster. Across all models, the median completeness score for references was just 40%—and not one chatbot produced a fully accurate reference list. Models hallucinated authors, journals, and titles. DeepSeek even acknowledged it: The model told researchers its references were generated from training data patterns "and may not correspond to actual, verifiable sources."

The readability problem compounds everything else. All chatbot responses scored in the "Difficult" range on the Flesch Reading Ease scale—equivalent to college sophomore-to-senior level. That exceeds the American Medical Association's recommendation that patient education materials should not go beyond sixth-grade reading level.
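For context, the Flesch Reading Ease score is derived from average sentence length and syllables per word: scores of 30 to 50 fall in the "Difficult" college-level band the chatbot answers landed in, while the sixth-grade level the AMA recommends corresponds to roughly 80 to 90. The sketch below uses the standard formula with made-up counts purely for illustration; it is not the study's methodology.

```python
def flesch_reading_ease(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Standard Flesch Reading Ease formula.

    Higher scores mean easier text: roughly 80-90 reads at a sixth-grade
    level (the AMA's target for patient materials), while 30-50 is the
    "Difficult", college-level band.
    """
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical counts for a dense, jargon-heavy answer: long sentences and
# polysyllabic medical terms drag the score down to about 40, squarely in
# the "Difficult" band.
print(flesch_reading_ease(total_words=115, total_sentences=5, total_syllables=195))  # ~40.0
```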

In other words, these chatbots rely on the same trick politicians and professional debaters use: firing so many technical terms at you in so little time that you walk away believing they know more than they do. The harder something is to understand, the easier it is to misinterpret.

The findings echo a February 2026 Oxford study covered by Decrypt that found AI medical advice no better than traditional self-diagnosis methods. They also track with broader concerns about AI chatbots delivering inconsistent guidance depending on how questions are framed.

"As the use of AI chatbots continues to expand, our data highlight a need for public education, professional training, and regulatory oversight to ensure that generative AI supports, rather than erodes, public health," the authors conclude.

The study only tested five free-tier chatbots, and the adversarial prompting method may overstate real-world failure rates. But the authors are direct: the problem isn't the fringe cases. It's that these models are deployed at scale, used by non-experts as search engines, and configured—by design—to almost never say "I don't know."
