AI Outperforms Human Doctors in Emergency Room Diagnosis, New Study Reveals
By admin | May 03, 2026 | 2 min read
A recent study has explored how large language models perform in various medical scenarios, including real emergency room cases—where one model appeared to outperform human doctors in accuracy. Published this week in *Science*, the research was conducted by a team of physicians and computer scientists from Harvard Medical School and Beth Israel Deaconess Medical Center. The investigators ran multiple experiments to compare OpenAI’s models with human physicians.

In one experiment, they analyzed 76 patients who visited the Beth Israel emergency room, contrasting diagnoses from two attending physicians with those generated by OpenAI’s o1 and 4o models. Two other attending physicians, unaware of which diagnoses came from humans or AI, evaluated the results. “At each diagnostic touchpoint, o1 either performed nominally better than or on par with the two attending physicians and 4o,” the study noted, adding that the differences “were especially pronounced at the first diagnostic touchpoint (initial ER triage), where there is the least information available about the patient and the most urgency to make the correct decision.”
In a press release from Harvard Medical School about the study, the researchers stressed that they did not “pre-process the data at all”—the AI models received only the information available in the electronic medical record at the time of each diagnosis. Using that data, the o1 model provided “the exact or very close diagnosis” in 67% of triage cases, compared to one physician who achieved that accuracy 55% of the time and another who did so 50% of the time. “We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines,” said Arjun Manrai, who leads an AI lab at Harvard Medical School and is a lead author of the study, in the press release. However, the study did not claim that AI is ready to make real life-or-death decisions in the emergency room. Instead, it stressed that the findings demonstrate an “urgent need for prospective trials to evaluate these technologies in real-world patient care settings.”
The researchers also pointed out that they only examined how the models performed with text-based information, noting that “existing studies suggest that current foundation models are more limited in reasoning over nontext inputs.” Adam Rodman, a Beth Israel doctor and another lead author of the study, told the Guardian that there is “no formal framework right now for accountability” regarding AI diagnoses, and that patients still “want humans to guide them through life or death decisions [and] to guide them through challenging treatment decisions.”