Hallucinations are perhaps the most important limitation of generative AI to stem. This article—the second in our six-part series to help you get the most value from AI—discusses the implications of hallucinations, their causes, and seven techniques to mitigate them.
Hallucinations include both errors in fact and errors in logic. OpenAI’s FAQ states: “ChatGPT is not connected to the internet, and it can occasionally produce incorrect answers. It has limited knowledge of world and events after 2021 and may also occasionally produce harmful instructions or biased content. We'd recommend checking whether responses from the model are accurate or not.” The FAQ also notes that “ChatGPT will occasionally make up facts or ‘hallucinate’ outputs.”
The problem of Large Language Models (LLMs) generating incorrect or fictional output is compounded by the fact that the output is presented confidently and often sounds legitimate.
In our previous article (AI Strategies Series: How LLMs Do—and Do Not—Work), we discussed how language models predict likely and coherent strings of text; they do not look up data. They’re not encyclopedias, and they don’t validate their outputs. So, much like the predictive text on your phone, it’s not surprising the generated text isn’t always factual. Making matters worse, their training data is limited in scope, lacks the most recent information, and can include some false or biased information. The entire purpose of LLMs is to generate text, so they’re generally optimized to provide some answer instead of responding “I don’t know.”
That said, the makers of these models are responding to criticisms about hallucinations. In some cases, they have begun mitigating hallucinations with bolt-on solutions such as incorporating fact-checking, grounding answers with fetched data (with methods like web browser plug-ins), and improving the training data and reinforcement learning process in newer models.
LLM hallucinations have made the news and caused reputational damage. For example, the introduction of Google Bard was famously panned due to its incorrect claims about the James Webb Space Telescope, and this had a serious financial impact on Google’s stock price at the time. If you use an LLM for your work, be sure to check the results for accuracy and work within your employer’s guidelines.
These models have also famously made up various sources and citations, causing significant legal headaches that we’ll discuss in a future article.
Below is an example of an error in fact. The AI model was asked a bit of a trick question: What’s the world record for crossing the English Channel entirely on foot? The cited individual did at one point set a world record for crossing the English Channel, but there are several inaccuracies:
He crossed by swimming, not on foot.
It took him seven hours, not 14 hours 51 minutes.
His record was set in 2005, not 2020, and it was broken in 2007, well before the model’s 2021 training data cutoff.
Unsurprisingly for a predictive text model, the AI has constructed a text with reasonable or likely words, not factual data. Remember that an LLM is a language model, not a knowledge model. It shouldn’t be used like an encyclopedia or database, unless you plan to carefully check all its answers.
Here’s an example of an error-in-logic hallucination. I asked for a word similar to revolt that starts with the letter b.
As shown above in the thread of prompts and responses, the model generated incorrect answers (hallucinations) instead of saying “I don’t have a great answer for that.” It suggested the words rebellion, then uprising, then mutiny—even though none of those start with the letter b.
When given an easier starting word, this model is able to come up with b words—suggesting breeze when I asked for a word similar to wind—instead of hallucinating answers.
Luckily, there are several strategies to mitigate hallucinations, including:
Increase awareness
Use a more advanced model
Provide explicit instructions
Provide example answers
Provide full context
Validate outputs
Implement retrieval-augmented generation
Our series kickoff article focused on providing you with a more intuitive grasp of the mechanics behind Large Language Models. It aimed to shed light on both their capabilities and limitations to help you more effectively and safely use them. By knowing what to expect—and what not to expect—from a Large Language Model, you will be able to both pose questions (prompts) that result in more reliable answers and more critically evaluate them. In other words, you’ll learn how to consistently work with the model’s strengths and capabilities.
One easy approach to stem errors of logic is to use a more advanced language model. GPT-3.5 is currently the free model from OpenAI, while the paid ChatGPT Plus subscription provides access to GPT-4, which will generally produce better results for tasks requiring sophisticated language or logic. That said, GPT-4 has tradeoffs—higher latency, higher cost, and according to some 2023 studies, a drop in accuracy—so this is not a foolproof solution.
Software engineers evaluating the best language model for a specific task can leverage research that provides a comprehensive analysis of various models, including LLMs from Meta, Anthropic, Google, OpenAI, and open-source models. This research assesses models’ factuality as well as criteria such as logical abilities and avoidance of harm. That said, it’s important to keep in mind that many of these models—and therefore their relative strengths and weaknesses—do evolve over time.
Source: "FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets" (arXiv preprint)
Returning to the example of synonyms for revolt that begin with the letter b, GPT-4 (the more advanced model) is able to come up with several solutions.
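For developers who call the models through an API rather than the chat interface, switching to a more advanced model is often a one-line change. Below is a minimal sketch, assuming the OpenAI Python SDK and an API key in the environment; the model names and prompt are illustrative, and availability and pricing differ by provider and plan.

```python
# Minimal sketch: run the same prompt against two models and compare the output.
# Assumes the OpenAI Python SDK (openai>=1.0) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()
prompt = "Suggest five words similar to 'revolt' that start with the letter b."

for model in ("gpt-3.5-turbo", "gpt-4"):  # illustrative model names
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # lower temperature makes the output more deterministic
    )
    print(model, "->", response.choices[0].message.content)
```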
Prompt engineering—the practice of carefully crafting prompts to produce optimal AI outputs—provides our next three techniques. You can get more accurate results by explicitly insisting on them in your prompt. When I instruct the less advanced model to confirm that its five synonyms for revolt do indeed start with b, it returns all b words (though it oddly says bustle doesn’t start with b).
But insisting on accuracy doesn’t always work, as shown in the next example. The same prompt in a new chat returned all non-b words—and incorrectly stated that two of the five words start with b.
For non-trivial questions, an instruction-based technique that often results in better answers is the “chain of thought” technique. In this approach, you ask the model to break down a problem into manageable chunks and explain its thinking step by step as it works toward a final solution. Finally, a prompt that explicitly instructs the model that no answer is better than an incorrect answer will often prevent hallucinated results.
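To make this concrete, here is one way explicit instructions, step-by-step reasoning, and an “I don’t know” escape hatch might be combined in a single prompt. This is a sketch rather than a canonical template; the system message wording and model name are illustrative.

```python
# Illustrative prompt combining explicit instructions, chain-of-thought reasoning,
# and permission to say "I don't know" instead of guessing.
from openai import OpenAI

client = OpenAI()
system_instructions = (
    "Work through the problem step by step before giving a final answer. "
    "Verify each candidate answer against every constraint in the question. "
    "If no answer meets all the constraints, say 'I don't know' rather than guessing."
)
question = "Give me five words similar to 'revolt' that start with the letter b."

response = client.chat.completions.create(
    model="gpt-4",  # illustrative model name
    messages=[
        {"role": "system", "content": system_instructions},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)
```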
Providing examples of correct answers in your prompt is another prompt-engineering technique that can help you get better answers from AI. In my prompt, I gave two examples of b words similar to revolt (betray and backlash). This time the AI model successfully gave me another word.
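For readers scripting this rather than typing into a chat window, a few-shot prompt might look like the sketch below. The example answers mirror the ones above, and the resulting prompt would be sent to the model the same way as in the earlier sketches.

```python
# Few-shot prompting: include examples of the kind of answer you want directly in the prompt.
# The example answers (betray, backlash) mirror the ones used in the article.
few_shot_prompt = (
    "Give me another word similar to 'revolt' that starts with the letter b.\n\n"
    "Examples of acceptable answers:\n"
    "- betray\n"
    "- backlash\n\n"
    "Your answer:"
)
# few_shot_prompt can then be sent as the user message, as in the earlier sketches.
```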
A final technique we can use from prompt engineering is to provide full context in our prompt. To illustrate this, I asked the model for the top themes in FactSet’s Q1 2021 earnings call from December 2020 (our fiscal year is September through August), which is prior to the model’s training cutoff date. The three themes (and supporting points) the AI model identified—strong financial performance, continued investment in innovation, and expansion of product offerings—look convincing at first glance.
But comparing the AI-generated summary with our actual earnings transcript reveals that the specific financial figures cited in the model’s first theme are incorrect. The quarter was indeed strong, and the numbers cited were in the right ballpark, but they were not the actual, publicly reported performance numbers.
The model’s second theme is investment in innovation. While innovation is a frequent topic of FactSet earnings calls, the word innovation does not appear in the Q1 2021 earnings transcript.
The third theme is expansion of products, specifically mentioning ESG and private markets. While ESG is mentioned in the actual publicly reported transcript (because of an acquisition of an ESG data firm), private markets were not mentioned that quarter.
So in the end, all three themes from the model are wrong. They sound good, and if you asked someone familiar with FactSet to guess themes from any random quarter’s earnings transcript, the model’s output would sound plausible. And that’s exactly what the LLM does: It predicts (i.e., makes up) a plausible answer to a given question. The text generated is predictive based on common language patterns, not factual based on research.
The best solution to this problem is to provide full context. In this case, paste the entire text of the actual transcript into the prompt. Large Language Models excel at language manipulation, including summarization, theme identification, and sentiment analysis. When given the full text, not only are the resulting three themes a good representation of the transcript, but the supporting subpoints—client retention, the Truvalue Labs ESG acquisition, and lower travel and office costs—are all mentioned several times throughout the transcript.
These are relevant and accurate themes for the Q1 2021 earnings call. When provided the actual text from the call transcript, the AI model performed significantly better than when it relied solely on its “blended smoothie” of training data to generate answers.
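Programmatically, providing full context simply means including the source text in the prompt itself. The sketch below assumes the transcript has been saved to a local text file (the file name is hypothetical) and that the chosen model’s context window is large enough to hold it.

```python
# Sketch: supply the full transcript as context instead of relying on training data.
# The file name is hypothetical; the transcript text can come from any source.
from openai import OpenAI

client = OpenAI()

with open("factset_q1_2021_earnings_call.txt", encoding="utf-8") as f:
    transcript = f.read()

prompt = (
    "Using only the transcript below, identify the top three themes and the "
    "supporting points for each. Do not use any outside information.\n\n"
    f"Transcript:\n{transcript}"
)

response = client.chat.completions.create(
    model="gpt-4",  # illustrative; long transcripts may require a large-context model
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```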
That said, asking for high-level themes in an earnings transcript is a relatively simple request for an LLM; the models may not perform as well when asked more challenging questions about SEC filings without an advanced approach.
It’s also important to know that without extra plug-ins or enhancements, language models are not able to access website text from URLs you paste into a prompt. You need to pass in the full text from the webpage, not a web address.
If you do pass in a URL, you may get an answer that makes it appear the AI model has accessed the original text, but it’s actually hallucinating an answer based on the words in the URL. In our example, it sees the terms “factset” and “earnings call” in the text of the URL, and it once again makes a best guess at the themes for a FactSet earnings call. But the output is generic and doesn’t reflect the actual transcript.
You can prove this by using TinyURL to shorten the URL, removing full-word clues. This time, the AI model reveals that it cannot browse external links, and it requests more context from you.
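If your source material lives on a webpage, the workaround is to fetch and extract the text yourself before prompting. Here is a rough sketch, assuming the requests and beautifulsoup4 packages and a placeholder URL:

```python
# Sketch: fetch the page yourself and pass its text to the model, rather than pasting a URL.
# Assumes the requests and beautifulsoup4 packages; the URL is a placeholder.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/earnings-call-transcript"
html = requests.get(url, timeout=30).text
page_text = BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)

prompt = (
    "Summarize the top themes in the text below. Use only this text.\n\n"
    f"{page_text}"
)
# `prompt` can then be sent to the model as in the earlier sketches.
```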
Fact-checking AI outputs is critical to catch hallucinations. Take a risk-based approach to generative AI and always validate outputs—especially for fact-based, higher-stakes use cases and those outside the wheelhouse of Large Language Models.
For example, generative AI is well-suited to:
Creative writing that’s unbound by factual constraints
Brainstorming and generating ideas
Proposing alternate wording for style or clarity
Jogging your memory for the forgotten name of a book or noteworthy figure you describe
Language models are intended for—and really good at—text manipulation. You should still review their output, but rewording, summarizing, reformatting, changing tone, or extracting specific text or themes are all great uses of LLMs.
In many cases, you’ll also get good answers to general knowledge questions that show up a lot in a model’s training data. For example, it will suggest sleep, nutrition, exercise, friendship, and stress management as common ways to help maintain wellbeing.
However, when asking about higher-risk or highly regulated industries (e.g., legal, medical, or financial questions), a more exacting discipline like math or coding, or specific references or citations, it’s essential to carefully review the output and validate it against a trusted source. Think of generative AI as your overconfident, eager-to-please intern, not your expert teacher, and be sure to double-check its work.
We’ve reviewed a number of techniques that interactive users of AI models can employ, but perhaps the most important strategy to avoid hallucinations is one only available to engineers building software products on top of an LLM. Retrieval-augmented generation, or RAG, is the programmatic version of providing context in a prompt. It’s also known as grounding the answers in facts.
With RAG, the software first looks for an answer to the user’s question in a trusted database. For example, a user help chatbot would look through existing help documentation. It would then combine the best matches from its database with the text of the user’s question and let the LLM format a conversational response to the user. This technique vastly reduces hallucinations because it does not rely on the training data smoothie to generate factual answers.
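A stripped-down illustration of the pattern appears below. To keep it self-contained, the “retrieval” step is a simple keyword-overlap ranking standing in for a real vector or semantic search, and the help articles are placeholders; production RAG systems typically use embeddings and a vector database for retrieval.

```python
# Minimal retrieval-augmented generation (RAG) sketch. Retrieval here is a simple
# keyword-overlap ranking standing in for a real semantic search over trusted documents.
from openai import OpenAI

HELP_DOCS = [
    "How to reset your password: open Settings, choose Security, then Reset Password.",
    "How to export a report: open the report, click Export, and choose PDF or Excel.",
    "How to contact support: use the in-app chat or email the support desk.",
]

def retrieve(question: str, docs: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by how many of the question's words they contain."""
    q_words = set(question.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:top_k]

def answer(question: str) -> str:
    context = "\n\n".join(retrieve(question, HELP_DOCS))
    prompt = (
        "Answer the user's question using only the documentation below. "
        "If the documentation does not contain the answer, say so.\n\n"
        f"Documentation:\n{context}\n\nQuestion: {question}"
    )
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(answer("How do I reset my password?"))
```

Because the model is instructed to answer only from the retrieved documentation, and to say so when the documentation doesn’t contain the answer, the response stays grounded in the trusted source rather than the training data smoothie.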
RAG will come up several times throughout our six-part AI strategies series because it’s a useful technique to overcome both hallucinations and several additional challenges we’ll discuss. Engineers considering fine-tuning a language model will find that adding a RAG approach will improve the overall results with respect to accuracy, explainability, incorporating up-to-date knowledge, and permitting user-based security. Read our quick explanation Reducing Hallucinations from Generative AI to learn more about RAG.
Generative AI can help organizations increase productivity, enhance client and employee experiences, and accelerate business priorities. Understanding the implications of hallucinations (including errors in fact and logic), their causes, and techniques to mitigate them will help you become a more effective user of AI technologies.
In the meantime, watch for part three in this six-part series next week: explainability. If you missed part one, check it out now: How LLMs Do—and Do Not—Work. You can also visit FactSet Artificial Intelligence.
This blog post is for informational purposes only. The information contained in this blog post is not legal, tax, or investment advice. FactSet does not endorse or recommend any investments and assumes no liability for any consequence relating directly or indirectly to any action or inaction taken based on the information contained in this article.