Explained: AI + Health Literacy

How Reliable Are Large Language Models for Health Literacy?

An Artificial Intelligence Practitioner’s Perspective on Trustworthiness and Responsibility, Part 2

By Temese Szalai | November 30, 2023

Read more in the series:

Part 1 | Can — and Should — We Trust Large Language Models for Health Literacy? | Publish Date: November 28, 2023

Part 3 | When LLMs May Not Be Appropriate for Health Literacy | Publish Date: December 5, 2023

Part 4 | Will LLMs Ever Really Be Trustworthy Enough for Health Literacy? | Publish Date: December 7, 2023




Today, we’ll continue our discussion of trustworthiness of large language models (LLMs) for health literacy by considering the reliability of outputs for health literacy purposes. This is the second post in a series inspired by a recent discussion on the IHA Health Literacy Solutions Center discussion list about the implications of artificial intelligence (AI) for the health literacy field.

As an AI practitioner, I believe this topic of reliability is especially relevant to health literacy professionals. Several people on the discussion list commented on issues surrounding accuracy, reliability, and suitability of AI chatbot outputs for health literacy. This post is intended to provide some insight on how we can think about the reliability of LLMs for health literacy purposes today.

Are LLMs the Right Tool for Health Literacy?

When considering the reliability of LLMs in various contexts, it’s important to think about what they were designed to do vis-à-vis the context at hand. That means asking, “Is this the right tool for the job?”

Tools like ChatGPT, Google Bard, Anthropic’s Claude, and Meta’s Llama 2, often called “foundational models,” were designed to help develop and understand general content. They can answer general user questions, generate business plans for cat cafés, write project descriptions for grant applications, and translate field values in a JavaScript Object Notation (JSON) file from English to French.

These foundational models were not designed for any specific domain. Importantly, they were not designed to be experts on medical content or health literacy. So, they aren’t the perfect tool for health literacy, even though they may still be a useful tool in some circumstances.

Generating Suitable Health Literacy Outputs

Further, LLMs are designed to mimic the data they were trained on. For foundational LLMs, this is generally crawlable digital data¹ such as that found in the Common Crawl, the BookCorpus, and similar sources.

The average reading level of these inputs is roughly college sophomore level, which results in outputs at a similar level. While recent research by Dr. Hana Haver et al., which is summarized in this post, indicates that ChatGPT can simplify responses to patient questions about lung cancer screening to about a 12th-grade level, outputs at this level aren’t typically simple enough for health literacy’s purposes.

As such studies, along with many experiences described on the discussion list, indicate, we may be disappointed with the results when we apply these tools to health literacy use cases. The reading level may be higher than it should be, or something simple may be turned into something complex.
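The reading levels discussed above can be estimated programmatically. The sketch below is a minimal, illustrative implementation of the standard Flesch-Kincaid Grade Level formula, using a naive vowel-group syllable heuristic (real readability tools count syllables more carefully); the example sentences are hypothetical.

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count groups of consecutive vowels.
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Common adjustment: a trailing silent "e" usually doesn't add a syllable.
    if word.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)

def flesch_kincaid_grade(text: str) -> float:
    # Flesch-Kincaid Grade Level:
    #   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return round(
        0.39 * (len(words) / len(sentences))
        + 11.8 * (syllables / len(words))
        - 15.59,
        1,
    )

complex_text = (
    "Managing diabetes involves a multi-faceted approach "
    "encompassing nutrition, medication adherence, and monitoring."
)
simple_text = "To manage diabetes, eat well, take your medicine, and check your sugar."

# The jargon-heavy sentence scores far above the plain-language rewrite.
print(flesch_kincaid_grade(complex_text))
print(flesch_kincaid_grade(simple_text))
```

A quick check like this can flag when an LLM's "simplified" output is still well above the target reading level, though it says nothing about accuracy or suitability.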

Even with healthcare-specific tools, like Grafi.ai, we may find their outputs unsatisfactory for health literacy’s purposes. Most of these tools are not designed with health literacy in mind. Grafi.ai, for instance, was designed to help copywriters develop health content specifically for search engine optimization (SEO).

As far as accuracy goes, at least one initial study concluded that “ChatGPT generated largely accurate information to diverse medical queries as judged by academic physician specialists, although with important limitations.” So, while suitably health literate content might not be generated, there’s a small ray of hope regarding medical accuracy.

To Use or Not to Use?

Navigating questions of reliability and whether to use these tools requires a solid understanding of the outputs you’re looking for and a consideration of which tool might be the most effective for the job.

You might be inclined not to use an LLM at all due to concerns about reliability, accuracy, appropriateness of outputs, and/or ethical issues. If you do use an LLM for health literacy work, you must attune your expectations, workflow, and prompts to align with what these tools were intended to be used for and the outputs you’re looking for.

To illustrate this, I found that Anthropic’s Claude provided better results than OpenAI’s ChatGPT on a small handful of health literacy tasks, such as rephrasing a sentence like “Managing diabetes involves a multi-faceted approach” in a friendly and easy-to-understand way. ChatGPT’s results improved when it was specifically prompted to write in the tone of Doak and Doak and given a couple of examples of desired outputs. That said, these were only initial investigations, not a systematic, thorough analysis of LLMs for health literacy needs.
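The prompting approach described above, a tone instruction plus a couple of worked examples, is commonly called few-shot prompting. The sketch below assembles such a prompt in the chat-message format most LLM APIs accept; the helper name, example sentences, and rewrites are all illustrative, not taken from the original experiments, and the actual API call is omitted.

```python
def build_health_literacy_prompt(sentence: str) -> list[dict]:
    """Assemble a few-shot chat prompt asking for a plain-language rewrite."""
    system = (
        "You rewrite health content in a friendly, easy-to-understand tone, "
        "following plain-language principles such as those of Doak and Doak. "
        "Use short sentences and everyday words."
    )
    # Two worked examples showing the desired input -> output style.
    examples = [
        (
            "Managing diabetes involves a multi-faceted approach.",
            "There are a few simple things you can do to manage your diabetes.",
        ),
        (
            "Adherence to the prescribed regimen is essential.",
            "It's important to take your medicine the way your doctor told you.",
        ),
    ]
    messages = [{"role": "system", "content": system}]
    for original, rewrite in examples:
        messages.append({"role": "user", "content": f"Rewrite: {original}"})
        messages.append({"role": "assistant", "content": rewrite})
    # The sentence we actually want rewritten goes last.
    messages.append({"role": "user", "content": f"Rewrite: {sentence}"})
    return messages

prompt = build_health_literacy_prompt(
    "Hypertension management requires lifestyle modification."
)
print(len(prompt))  # 1 system + 2 example pairs + 1 final user message = 6
```

The worked examples anchor the model to the desired register more reliably than a tone instruction alone, which matches the improvement described above.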

Can such tools be fine-tuned for health literacy use cases? Yes. Should they be? Will they be? These are questions that participants in forums like the Health Literacy Solutions Center discussion list should weigh in on and that developers of these tools should consider.

In our next post, we’ll address some of the root causes that affect the reliability and accuracy of LLMs generally, their impact on health literacy work where relevant, and strategies for navigating some of the limitations of LLMs.


1. There’s controversy surrounding how public this crawlable data is and whether it is being used legally and legitimately. For this reason, we’ve opted to avoid the term “public” data and instead refer to “crawlable” data.

About the Author



Temese Szalai has been an AI practitioner in the field of Natural Language Processing (NLP) and Machine Learning (ML) for over 20 years. She is the founder and principal of Subtextive, an independent woman-owned/woman-led provider of socio-technical AI consulting focused on AI initiatives for unstructured data for organizations large and small across a range of industries.



#IHABlog
#ArtificialIntelligence
