Using artificial intelligence to generate medical literature for urology patients: a comparison of three different large language models

Pompili, David; Richa, Yasmina; Collins, Patrick; Richards, Helen; Hennessey, Derek B

doi:10.1007/s00345-024-05146-3

Document detail

Publisher

Springer

Year

2024

Listing date

7/31/2024

Keywords

artificial intelligence (AI); large language model (LLM); patient information leaflet; ChatGPT; Google Bard; patient education; topics; content; Llama; level; reading; generated; quality; generate; PaLM; LLMs; average; PILs

Abstract

Purpose: Large language models (LLMs) are a form of artificial intelligence (AI) that uses deep learning techniques to understand, summarize and generate content.

The potential benefits of LLMs in healthcare are predicted to be immense.

The objective of this study was to examine the quality of patient information leaflets (PILs) produced by 3 LLMs on urological topics.

Methods: Prompts were created to generate PILs from 3 LLMs: ChatGPT-4, PaLM 2 (Google Bard) and Llama 2 (Meta) across four urology topics (circumcision, nephrectomy, overactive bladder syndrome, and transurethral resection of the prostate).

PILs were evaluated using a quality assessment checklist.

PIL readability was assessed by the Average Reading Level Consensus Calculator.

Results: PILs generated by PaLM 2 had the highest overall average quality score (3.58), followed by Llama 2 (3.34) and ChatGPT-4 (3.08).

PaLM 2-generated PILs were of the highest quality in all topics except transurethral resection of the prostate (TURP), and PaLM 2 was the only LLM to include images.

Medical inaccuracies were present in all generated content including instances of significant error.

Readability analysis identified PaLM 2-generated PILs as the simplest (age 14–15 average reading level).

Llama 2 PILs were the most difficult (age 16–17 average).

Conclusion: While LLMs can generate PILs that may help reduce healthcare professional workload, generated content requires clinician input for accuracy and inclusion of health literacy aids, such as images.

LLM-generated PILs were above the average reading level for adults, necessitating improvement in LLM algorithms and/or prompt design.

How satisfied patients are with LLM-generated PILs remains to be evaluated.

Pompili, David; Richa, Yasmina; Collins, Patrick; Richards, Helen; Hennessey, Derek B (2024). Using artificial intelligence to generate medical literature for urology patients: a comparison of three different large language models. Springer.