
Beware the Digital Doctor: The Dangers of Self-Diagnosing with AI

A study into why relying on artificial intelligence for medical opinions could do more harm than good


After constant news reports about how AI will be a boon to medicine, it’s understandable that some might believe that tools like ChatGPT are infallible.

 

Some might even be tempted to let AI take a crack at diagnosing those weird symptoms they’ve been experiencing. With just a few clicks, you can get a digital diagnosis that saves you the hassle of paying a visit to the doctor, right?

 

While these AI tools might seem convenient, they have the potential to lead you down a risky path. From misdiagnoses and unnecessary worry to (worst of all) overlooking something serious, AI doesn’t always get it right.

 

Before you trade your human doctor for a digital one, let’s talk about why relying on new technology to understand your health might not be the best idea.

 

The ConfidenceClub Digital Doctor Study

We wanted to put the AI tools available to the public to the test, so we decided to take twenty questions from a UK Medical School Practice Exam and prompt each tool to come up with an answer.

 

The tools we used were: 

 

  • ChatGPT 4 (OpenAI)
  • DxGPT (Foundation 29)
  • Co-Pilot (Microsoft)
  • Gemini (Google)
  • Grok (X, the platform formerly known as Twitter)

 

As prompts are crucial when utilising these tools, we asked each of the tools forty questions:

 

  1. The first twenty prompts were test questions from the practice exam, quoted verbatim.
  2. The final twenty prompts were rewrites of the test questions, in non-technical terms that a layperson would use to describe their symptoms.

 

We did this to assess the tools' ability to interpret and respond to both technical, medically precise language and more everyday, colloquial descriptions of symptoms.

 

This dual-prompt approach was designed to simulate real-world scenarios where patients might describe their symptoms in layman's terms, while healthcare professionals might use technical jargon. By testing both approaches, we aimed to evaluate how well these AI tools bridge the gap between medical terminology and common language, which is crucial for their practical application in digital healthcare.
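To make the dual-prompt setup concrete, here is a minimal Python sketch of how the paired questions could be run against each tool. The ask_model function and the example wording are illustrative placeholders only; they are not the prompts or code used in the study, which queried each tool through its own interface.

```python
# Illustrative sketch of the dual-prompt protocol: each exam item is asked
# twice per tool, once in exam wording and once in everyday language.
# ask_model() is a hypothetical placeholder, not part of the actual study.

def ask_model(tool_name: str, prompt: str) -> str:
    """Placeholder for querying one of the AI tools (web interface or API).
    Returns a canned string so the sketch runs end to end."""
    return f"[{tool_name} reply to: {prompt[:40]}...]"

# Example pair (invented wording, not a real exam question).
question_pairs = [
    {
        "technical": "A 54-year-old presents with crushing retrosternal chest "
                     "pain radiating to the left arm. What is the most likely diagnosis?",
        "layperson": "I have a heavy, squeezing pain in the middle of my chest "
                     "that goes down my left arm. What could this be?",
    },
    # ...19 more pairs in the full study
]

tools = ["ChatGPT 4", "DxGPT", "Co-Pilot", "Gemini", "Grok"]

responses = {}
for tool in tools:
    for i, pair in enumerate(question_pairs):
        responses[(tool, i, "technical")] = ask_model(tool, pair["technical"])
        responses[(tool, i, "layperson")] = ask_model(tool, pair["layperson"])
```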

 

Our methodology allowed us to compare the accuracy, consistency, and overall helpfulness of the AI responses across both types of prompts. This also provided insight into how adaptable and user-friendly these tools are, especially for individuals without a medical background who might rely on AI for preliminary medical advice or information.

 

The results of this study could help identify strengths and weaknesses in each AI tool, offering valuable feedback for doctors and patients alike. 

 

Ultimately, our goal was to see how these tools perform in situations that mimic real-world interactions, where clarity, accuracy, and the ability to understand and respond to different levels of medical knowledge are critical.

 

The Results

When scoring the answers for the results, we wanted to ensure that two criteria were met:

 

  1. That the answer was medically correct.
  2. That the answer always advised the questioner to seek further medical attention.

 

This was to ensure that the AI tools not only provided accurate medical information but also promoted responsible and safe guidance. 

 

By insisting that the tools encourage users to seek further medical attention, we aimed to mitigate the risk of individuals relying solely on AI for their healthcare needs, which could lead to dangerous outcomes if a serious condition were misdiagnosed or not identified.
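As a rough sketch of how these two criteria can be checked, the snippet below marks a response as correct only if the exam's answer phrase appears in it (matching the definition in the Methodology section) and as a referral only if it contains advice to seek medical attention. The referral phrase list and the example reply are assumptions for illustration, not the study's actual marking checklist.

```python
# Sketch of the two scoring criteria. The referral phrases below are
# illustrative assumptions, not the study's actual checklist.

REFERRAL_PHRASES = [
    "see a doctor",
    "seek medical attention",
    "consult a healthcare professional",
    "speak to your gp",
]

def score_response(response: str, expected_answer: str) -> dict:
    """Score one AI reply against the two criteria used in the study."""
    text = response.lower()
    return {
        # Criterion 1: the exam's answer phrase appears in the reply.
        "correct": expected_answer.lower() in text,
        # Criterion 2: the reply advises seeking further medical attention.
        "referral": any(phrase in text for phrase in REFERRAL_PHRASES),
    }

# Example (invented reply):
print(score_response(
    "This sounds like shingles. Please see a doctor to confirm.",
    "shingles",
))  # {'correct': True, 'referral': True}
```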

 

Results were as follows:

 

AI Tool          Technical Correct   Technical Referral   Layperson Correct   Layperson Referral
ChatGPT 4        100%                70%                  45%                 100%
DxGPT            100%                0%                   55%                 0%
Co-Pilot         60%                 85%                  35%                 100%
Gemini           85%                 50%                  35%                 100%
Grok             100%                100%                 45%                 100%
Total Average    89%                 61%                  43%                 80%

 

The study results revealed significant variations in the performance of different AI tools when answering medical questions, both in technical accuracy and in advising users to seek further medical attention.

 

  • ChatGPT 4 demonstrated excellent technical accuracy with 100% correct answers but was less consistent when responding to layperson prompts, achieving only 45% correctness. However, it was reliable in advising users to seek further medical attention, with a 100% referral rate in layperson scenarios.
  • DxGPT also achieved 100% technical accuracy but failed to consistently recommend seeking further medical attention, scoring 0% in referral rates for both technical and layperson prompts. Its correctness with layperson prompts was slightly better at 55%.
  • Co-Pilot was the worst-performing tool, with a technical correctness rate of 60% and even lower accuracy with layperson prompts at 35%. However, it excelled in referral rates, scoring 100% for layperson prompts and 85% for technical prompts.
  • Gemini performed well with an 85% correctness rate for technical prompts but dropped to 35% for layperson prompts. Its referral rates were mixed: 100% for layperson prompts but only 50% for technical prompts.
  • Grok achieved a perfect score in both technical correctness and referrals, and although its performance dropped to 45% for layperson prompts, it maintained a 100% referral rate.

 

The total averages across all tools were 89% for technical correctness, 61% for technical referrals, 43% for layperson correctness, and 80% for layperson referrals. 
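The "Total Average" row is the unweighted mean of the five tool scores in each column, which matches the published figures and can be checked with a few lines of Python:

```python
# Reproduce the "Total Average" row of the results table.
columns = {
    "Technical Correct":  [100, 100, 60, 85, 100],   # ChatGPT 4, DxGPT, Co-Pilot, Gemini, Grok
    "Technical Referral": [70, 0, 85, 50, 100],
    "Layperson Correct":  [45, 55, 35, 35, 45],
    "Layperson Referral": [100, 0, 100, 100, 100],
}

for name, scores in columns.items():
    print(f"{name}: {sum(scores) / len(scores):.0f}%")
# Technical Correct: 89%, Technical Referral: 61%,
# Layperson Correct: 43%, Layperson Referral: 80%
```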

 

These results highlight that while prompting these AI tools in technical language yields better results, fewer than half of the questions posed in layperson's terms received a correct answer.

Methodology

Data was collected using the following sources:

 

 

Where possible, like-for-like questions were used for both the technical and layperson versions. Where this was not possible, because a question relied heavily on test results rather than symptoms, an alternative question was used.

 

A correct answer is defined as the answer phrase from the medical exam appearing within the AI tool's response. Answers given only in the general area of a problem (e.g. answering 'an STI' when the condition is specifically herpes) were not logged as correct.

 

Download the full list of questions and answers here.