Explained: AI + Health Literacy

When Large Language Models May Not Be Appropriate for Health Literacy

An Artificial Intelligence Practitioner’s Perspective on Trustworthiness and Responsibility, Part 3

By Temese Szalai | December 5, 2023

Read more in the series:

Part 1 | Can — and Should — We Trust Large Language Models for Health Literacy? | Publish Date: November 28, 2023

Part 2 | How Reliable Are Large Language Models for Health Literacy? | Publish Date: November 30, 2023

Part 4 | Will LLMs Ever Really Be Trustworthy Enough for Health Literacy? | Publish Date: December 7, 2023



This post is the third in a series inspired by a recent discussion on the IHA Health Literacy Solutions Center discussion list about the implications of generative artificial intelligence (AI) powered by large language models (LLMs), sometimes referred to as chatbots, for the health literacy field. Today we’re going to confront some of the factors that make LLMs less appropriate than we’d like for health literacy’s needs. 


I’ll start with situations where LLMs misbehave, specifically “hallucinations” and bias, and then move on to the role of humans and our fallibility as another contributing factor.

Hallucinations: When AI Chatbots Make Things Up

Because LLMs imitate human intelligence, they generate “hallucinations” from time to time. These are outputs that are inaccurate, misleading, or outright false. These systems can make things up, just like humans do — mostly because they draw incorrect correlations from their training data. Like humans, they present these made-up outputs very believably.

What’s not clear is the frequency, distribution, and severity of these hallucinations. How great is the impact of hallucinations on overall accuracy and, therefore, reliability and usability for health literacy tasks? 

Recently, the New York Times published an article featuring work from a tech start-up called Vectara that, among other things, is working to assess and quantify hallucination rates across several of the more popular mass-market AI chatbots. Vectara’s research shows that hallucination rates vary considerably from chatbot to chatbot: ChatGPT’s average hallucination rate hovered around 3 percent, while a Google AI chatbot called PaLM peaked at a 27 percent hallucination rate. Other systems from Meta and Anthropic landed between 5 and 8 percent.
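To make the idea of a measured “hallucination rate” concrete, here is a minimal sketch of how such a rate might be estimated: ask a model to summarize source documents, then count how many summaries fail a factual-consistency check against their sources. This is my illustration, not Vectara’s actual method, and the summarize and is_consistent functions are hypothetical placeholders for a real LLM call and a real consistency checker.

```python
# Sketch: estimating a hallucination rate by checking model-written summaries
# against the documents they were generated from. `summarize` and
# `is_consistent` are hypothetical stand-ins, not real library calls.

from typing import Callable, List

def hallucination_rate(
    documents: List[str],
    summarize: Callable[[str], str],
    is_consistent: Callable[[str, str], bool],
) -> float:
    """Return the fraction of summaries NOT supported by their source text."""
    if not documents:
        return 0.0
    hallucinated = 0
    for doc in documents:
        summary = summarize(doc)
        if not is_consistent(doc, summary):
            hallucinated += 1
    return hallucinated / len(documents)

# Example: if 3 of 100 summaries fail the consistency check,
# the measured hallucination rate is 3 percent.
```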

Specialized LLMs like Med-PaLM 2, a healthcare-specific LLM mentioned in an earlier post, are more accurate and presumably have lower hallucination rates. Google’s research indicates that Med-PaLM’s long-form answers align with scientific consensus 92 to 93 percent of the time when compared with clinician-provided answers. The Med-PaLM site also mentions that Med-PaLM 2 scored “86.5% accuracy on the MedQA medical exam benchmark in research.”

Measuring accuracy against benchmarks and performance on medical exams isn’t the same as measuring hallucination rate. It’s a proxy, though, since hallucinations should show up as errors in these kinds of evaluations. What these numbers don’t reveal is how severe any individual hallucination might be.
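For readers curious what “accuracy on a benchmark” means in practice, here is a rough sketch of how a multiple-choice score like the MedQA figure above is typically computed. The ask_model function is a hypothetical stand-in for a real LLM call; this is my illustration, not Google’s evaluation code.

```python
# Sketch: scoring accuracy on a MedQA-style multiple-choice benchmark.
# `ask_model` is a hypothetical function that returns the model's chosen
# option letter (e.g. "A" through "E") for a question. Accuracy counts
# errors, but says nothing about how severe any single wrong answer is.

from typing import Callable, Dict, List

def benchmark_accuracy(
    questions: List[Dict[str, str]],   # each item: {"question": ..., "answer": "C"}
    ask_model: Callable[[str], str],
) -> float:
    if not questions:
        return 0.0
    correct = sum(
        1 for item in questions
        if ask_model(item["question"]).strip().upper() == item["answer"]
    )
    return correct / len(questions)

# A score of 0.865 would correspond to the reported "86.5% accuracy".
```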

Mitigating Hallucinations in LLMs

Is it possible to reduce the number, and possibly severity, of hallucinations? Absolutely. Are we going to be able to eliminate hallucinations from LLMs altogether? I, personally, doubt this, since they are in many ways a by-product of the underlying architecture. Others may be more hopeful and knowledgeable.

End users can — and should — do their own fact checking when using these generalized LLMs to mitigate the risks posed by hallucinations. Automated or semi-automated fact checking may someday be baked into LLM-based products.[1] Further, work is being done on automated detection of misinformation, including medical misinformation, which LLM hallucinations can resemble.

Finally, practices and precedents for evaluating AI outputs for appropriateness already exist. For years, search relevance and accuracy (and ad relevance, product recommendations, and so on) have been evaluated, often at large scale, to improve the suitability of algorithmic outputs. These and similar practices are part of LLM development.

All that said, keep in mind that generalized LLMs are (probably) not being fine-tuned on medical information specifically, let alone on content developed to enable and promote health literacy. So whether or not a system is hallucinating, health literacy practitioners may find its outputs less appropriate than they could or should be.

Bias: When AI Chatbots Amplify Society

No discussion of trustworthiness and responsibility in AI would be complete without some discussion of bias. This post by AI Myths provides a much more thorough and thought-provoking discussion of bias in AI, not limited to LLMs, than I can offer briefly here. 

Bias isn’t a problem that technologists created. It’s a human problem that society has perpetuated. LLMs surface this bias, and amplify it, due to the data used to train them. 

What’s the impact of LLM bias on health literacy? This article in The Lancet suggests that, because of the demographics of those who contributed the training data, some perspectives are underrepresented. That underrepresentation affects the cultural, social, and linguistic appropriateness of outputs for certain audiences, which in turn limits their suitability for health literacy needs.

Removing biased content from data sets, equalizing how different groups are represented in the data, and other technical means of identifying and filtering bias can all help reduce it.
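As a rough illustration of what “equalizing representation” can look like in practice, here is a minimal sketch of oversampling underrepresented groups in a training set. The data structure and the group field are hypothetical, and real bias work involves far more than a resampling step.

```python
# Sketch: rebalancing training data so underrepresented groups are sampled
# more evenly. The "group" field is a hypothetical label on each example.

import random
from collections import defaultdict
from typing import Dict, List

def rebalance(examples: List[Dict], group_key: str = "group") -> List[Dict]:
    """Oversample each group up to the size of the largest group."""
    by_group = defaultdict(list)
    for ex in examples:
        by_group[ex[group_key]].append(ex)
    target = max(len(group) for group in by_group.values())
    balanced = []
    for group_examples in by_group.values():
        balanced.extend(group_examples)
        # Randomly duplicate examples from smaller groups to reach the target size.
        balanced.extend(random.choices(group_examples, k=target - len(group_examples)))
    random.shuffle(balanced)
    return balanced
```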

But, at the end of the day, these techniques just aren’t enough. 

AI practitioners have a responsibility to try to eliminate, reduce, and mitigate bias in their work, which must include practices that strive to understand bias as a human problem. We must include a range of perspectives, not just those of technologists, and a range of approaches, including those that aren’t limited to technical ones, across the entire AI development lifecycle. 

As a field, we should be more open to standards, audits, and transparency around how LLMs are trained and the outputs that they generate. Responsible practices around identifying potential misuse of AI, such as through discriminatory practices, are also critical.

The good news is that we’re seeing more accountability for bias, which is promising. In this post about political bias in LLMs, Meta was quoted as saying that they will “continue to engage with the community to identify and mitigate vulnerabilities in a transparent manner and support the development of safer generative AI.” It’s a step in the right direction. 


AI's Human Nature

Will we ever eliminate all the challenges and limitations with LLMs? My personal opinion is no.

At the end of the day, one of the most important things to keep in mind about any AI is that it, at least for now, is human made. LLMs are designed to imitate our intelligence and are trained on data we, collectively as a society, create.

Even with great care, responsible practices, and dedication, there will be oversights and wrong turns. Because LLMs are human made, they reflect our own foibles, biases, misunderstandings, cultural contexts, and perspectives. And due to how LLMs work and what they can do, they can mimic and amplify our human qualities and riff on them in sometimes unexpected ways.

All that said, should we try to eliminate all the challenges and limitations with LLMs? Absolutely.



[1] For example, a technique called “Retrieval Augmented Generation” (RAG) can steer generative chatbot outputs with facts obtained from knowledge bases.
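As a rough illustration of the RAG idea described in this footnote, here is a minimal sketch: retrieve relevant passages from a trusted knowledge base and include them in the prompt so the model is steered toward documented facts. The search_knowledge_base and generate functions are hypothetical placeholders for a real retriever and a real LLM call.

```python
# Sketch of Retrieval Augmented Generation (RAG): retrieve passages from a
# knowledge base, then ask the model to answer using only those passages.
# `search_knowledge_base` and `generate` are hypothetical stand-ins.

from typing import Callable, List

def answer_with_rag(
    question: str,
    search_knowledge_base: Callable[[str, int], List[str]],
    generate: Callable[[str], str],
    top_k: int = 3,
) -> str:
    passages = search_knowledge_base(question, top_k)
    context = "\n\n".join(passages)
    prompt = (
        "Answer the question using ONLY the sources below. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)
```

Grounding the model in retrieved sources does not guarantee accuracy, but it gives the output something verifiable to point back to, which is one reason RAG is often discussed as a hallucination mitigation.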

About the Author



Temese Szalai has been an AI practitioner in the field of Natural Language Processing (NLP) and Machine Learning (ML) for over 20 years. She is the founder and principal of Subtextive, an independent woman-owned/woman-led provider of socio-technical AI consulting focused on AI initiatives for unstructured data for organizations large and small across a range of industries.

