Medical-AI: How should we evaluate performance in medical AI models?

We developed a medical AI model purpose-built for women’s health. But how should medical models be evaluated? Here’s what we did: real questions from real patients, evaluated by real doctors.

Nihar Ganju MD
3 min read · Oct 5, 2023

From benchmarks to real-world relevance

I’ve been looking at benchmark data for AI models in medicine. Examples include Med-PaLM[1][2], GPT-3.5, GPT-4, and Hippocratic AI. All of these models were assessed with simulated medical board questions.

Here is a sample of the published performance metrics on USMLE exams:

Med-PaLM: 67.2% accuracy (Singhal, Dec 2022)
Med-PaLM 2: 86.5% (Singhal, May 2023)
GPT-3.5: over 60% (Kung, Dec 2022)
GPT-4: over 86% (Nori, Apr 2023)
Hippocratic AI: 89.9% (published on its marketing site, May 2023)

While these performances make for fun headlines like “GPT passes the Step 1 medical exam,” the benchmarks remain academic demonstrations of the technology. The industry appears to be moving toward demonstrating an ever-increasing ability to ace medical exams. But what does that really mean for clinicians? From a clinician’s perspective, these technology evaluations don’t yet offer practical clinical takeaways that substantially advance medicine or medical training.

The AI model for women’s health

Our team has spent the past several months developing AI models capable of chat-based question answering in the women’s health domain. As the models improved, it was time to clinically evaluate our menopause AI model first (others will follow).

Real Patients, Real Questions

We could have used the NAMS Menopause certification question bank to evaluate the AI model. But as doctors, we have the privilege of engaging patients daily in multiple settings, and we took advantage of it. We asked followers of a menopause-centric social media channel to anonymously share their real questions and health challenges.

Then we selected 100 questions from the submission pool, asked the chat model each question, and captured the outputs as model-generated responses.
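To make that capture step concrete, here is a minimal sketch in Python. The `ask_model` function and the CSV layout are hypothetical stand-ins, not anything from our study; swap in whatever client call serves your chat model.

```python
# Minimal sketch of the question -> response capture step, assuming a
# hypothetical ask_model() wrapper around whatever chat API serves the model.
import csv

def ask_model(question: str) -> str:
    """Placeholder: call your chat model here and return its text reply."""
    raise NotImplementedError("wire this to your model's chat API")

def capture_responses(questions: list[str], out_path: str) -> None:
    """Ask the model each question and save question/answer pairs for expert review."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["question", "model_response"])
        for q in questions:
            writer.writerow([q, ask_model(q)])

# Example usage once ask_model is wired up:
# capture_responses(selected_questions, "menopause_qa_pairs.csv")
```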

What we see in these questions is that patients don’t phrase questions like board exam items. Their questions are raw, unstructured, and diverse. And they certainly don’t come with a multiple-choice set of answers to select from.

Evaluation by Real MDs

We created a document of 100 question/answer pairs: real questions alongside the model’s AI-generated responses.

Then we recruited MDs who practice menopause medicine as their core specialty. These experts rated each AI-generated response for accuracy and quality on a Likert scale.
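For readers curious how ratings like these roll up into a single headline number, here is one illustrative way to do it. The 5-point scale and the threshold of 4 (“agree” or better) are assumptions made for this sketch; our exact scoring rubric is not reproduced here.

```python
# Illustrative roll-up of expert Likert ratings into a headline percentage.
# Assumes a 5-point scale where a rating of 4 or 5 counts as acceptable;
# the study's actual scoring rubric is not reproduced here.
def percent_acceptable(ratings: list[int], threshold: int = 4) -> float:
    """Share of ratings at or above `threshold`, as a percentage."""
    if not ratings:
        raise ValueError("no ratings supplied")
    return 100.0 * sum(r >= threshold for r in ratings) / len(ratings)

accuracy_ratings = [5, 4, 4, 3, 5, 4, 2, 5, 4, 4]  # hypothetical expert scores
print(f"accuracy: {percent_acceptable(accuracy_ratings):.1f}%")  # accuracy: 80.0%
```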

The results

The menopause model described here achieved accuracy and quality ratings greater than 80%. We are proud of this. More significantly, we introduced a practical, clinically relevant method for evaluating AI models. By using a genuine real-world question dataset, assessed by experienced doctors, we established a realistic benchmark of clinical accuracy. This approach lets a doctor truly grasp the capabilities of a new AI model, and it instills trust, especially when the assessment comes from experts in the field.

Summarized key points:
✨ Real-world relevance: existing benchmarks for GPT-4, Med-PaLM 2, and Hippocratic AI rely on simulated board exam questions. Instead, we used real questions from real women.
✨ Highly accurate model: the model’s performance matches or exceeds industry benchmarks from GPT-4, Med-PaLM 2, and Hippocratic AI.
✨ Validation, not competition: we think of this not necessarily as “better than” the work from OpenAI and Google; rather, it is another validation of the technology.

We presented the study described above as an oral abstract at the 2023 NAMS conference.


Written by Nihar Ganju MD

Nihar blends his expertise as a software engineer and a doctor. He develops technology for healthcare and leads physicians in delivering digital care. He also practices medicine.