
AI Strategies Series: Inconsistent and Outdated Responses

Written by Lucy Tancredi | Feb 21, 2024

So far in our AI strategies series we’ve discussed LLM capabilities and limitations, hallucinations, and lack of explainability. Now let’s turn to part four, a discussion of why LLMs provide inconsistent responses to the same prompt and generate outdated answers.

Inconsistent responses to the exact same prompt happen because large language models are intentionally built with some allowance for variability. In a previous installment of this series, we looked at an example of an LLM completing the phrase “the students opened their...”. The most common completion is the word books, but the model sometimes returns laptops instead. To determine what comes next, language models compute a probability distribution over potential next words given the sequence of words so far.
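To make that concrete, here is a minimal Python sketch that samples a completion from a toy next-word distribution for the classroom example. The words and probabilities are invented for illustration and are not taken from any real model.

```python
import random

# Toy next-word probabilities for "the students opened their ..."
# (illustrative numbers only, not produced by any real model)
next_word_probs = {
    "books": 0.55,
    "laptops": 0.25,
    "notebooks": 0.12,
    "backpacks": 0.06,
    "doors": 0.02,
}

# Sampling from the distribution usually yields "books",
# but occasionally returns a lower-probability word such as "laptops".
words, probs = zip(*next_word_probs.items())
completion = random.choices(words, weights=probs, k=1)[0]
print("the students opened their", completion)
```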

Programmers can control how creative (or variable) an LLM’s response is—i.e., how likely it is to pick words lower on the calculated probability list—with a temperature setting that ranges from 0 (lowest) to 1 (highest). Setting a very low temperature yields the most predictable, consistent response, while setting a high temperature generates a wider range of answers.
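As a rough sketch of what the temperature knob does under the hood, the snippet below rescales a set of made-up word scores (logits) before converting them to probabilities. The scores are assumptions for illustration: a low temperature concentrates nearly all the probability on the top word, while a higher temperature spreads it across the alternatives.

```python
import math

# Made-up scores (logits) for candidate next words -- illustrative only
logits = {"books": 4.0, "laptops": 2.5, "notebooks": 1.5, "doors": -1.0}

def temperature_softmax(logits, temperature):
    """Convert logits to probabilities, scaled by temperature.

    Lower temperature sharpens the distribution toward the top-ranked word;
    higher temperature flattens it, giving lower-ranked words more chance.
    """
    t = max(temperature, 1e-6)  # avoid division by zero at temperature 0
    scaled = {word: score / t for word, score in logits.items()}
    max_score = max(scaled.values())  # subtract the max for numerical stability
    exps = {word: math.exp(score - max_score) for word, score in scaled.items()}
    total = sum(exps.values())
    return {word: value / total for word, value in exps.items()}

print(temperature_softmax(logits, 0.1))  # nearly all probability on "books"
print(temperature_softmax(logits, 1.0))  # probability spread across alternatives
```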

The introduction of variability allows for unexpected or imaginative output. This may be desirable when you’re looking for a creative answer or want a variety of responses, such as asking a model for its favorite food. In that instance, a low temperature always returns “pizza,” while a high temperature yields more varied suggestions.

Source: Temperature as explained in a course by DeepLearning.AI

Most LLMs are trained to be well-aligned with human expectations. Even when asking for a high-temperature (creative) response, you’re unlikely to get an improbable completion like “the students opened their clouds” or “my favorite food is cars.”

The temperature setting is a common technique software engineers use with LLM APIs to control how consistent responses to the same prompt are. Unfortunately, users of ChatGPT can’t set a temperature interactively through the user interface (except in a special developer sandbox). However, they can ask the model to be more or less creative directly in the prompt.

Below, the model generated “the students opened their books” when I asked for a likely completion, and it generated “opened their portals” when I asked for a “very creative, unlikely” completion.
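For developers calling a model through an API rather than the ChatGPT interface, temperature is set explicitly on the request. Below is a minimal sketch using the OpenAI Python SDK; the model name and prompt are placeholders rather than the exact setup used in this article.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# temperature=0 requests the most deterministic completion the model can give;
# raising it toward 1 allows more varied responses to the same prompt.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Complete the phrase: the students opened their"}],
    temperature=0,
)
print(response.choices[0].message.content)
```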

Let’s now look at an example of inconsistent responses to the exact same prompt.

I asked an AI model to give me Tesla’s market cap. I received a variety of answers citing different months before its outdated training data cutoff in late 2021. Sometimes the answer was within the range of Tesla’s actual market cap for the month cited, and sometimes it was not. But the main point is that it gave me different answers every time I asked the exact same, fact-based question.

In our installment on ways to overcome hallucinations, we discussed how using a better model can sometimes give better answers. Will that work here?

Not necessarily. When I used GPT-4 instead of GPT-3.5, the response narrowed to a specific day in 2021 rather than returning the market cap for an entire month. But its answer was a billion dollars off the actual value that day.

In subsequent attempts, the AI model still gave me different answers from different days or months. When I used another one of our strategies to stem hallucination—insisting on accuracy in the prompt—it correctly stated that it could not respond.

The examples above also demonstrate outdated knowledge, which happens because the model is only trained with data up to a specific date. Retraining with new data is computationally intensive, and doing so daily would be prohibitively expensive. In addition, legal and ethical concerns likely prevent model providers from using up-to-the-minute training data that has not been vetted. In November 2023, OpenAI released a new model, GPT-4 Turbo, with training data through April 2023.

The real solution for both outdated and inconsistent responses, as in our discussions of hallucinations and explainability, is Retrieval-Augmented Generation (RAG). Below is an answer from FactSet Mercury, which first uses the RAG method to fetch the factual answer from a governed FactSet database, and then uses generative AI to present the response in a conversational way.

The answer is correct and, moreover, current as of the date I asked the question. If I repeat the question, I get the exact same answer every time. Mercury also goes beyond providing just an answer. It displays a time-series chart of the data, shows the prove-it data and the sources behind the calculation, and suggests next best actions such as viewing the current capitalization, comparable companies, or price summary.
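For readers curious what the RAG pattern looks like in code, here is a generic, minimal sketch rather than Mercury’s actual implementation: the retrieval function, data values, and model name are hypothetical placeholders. The key idea is that the factual answer comes from a current, governed data source, and the LLM is only asked to phrase it conversationally.

```python
from openai import OpenAI

client = OpenAI()

def fetch_market_cap(ticker: str) -> str:
    """Hypothetical retrieval step: look up the current value in a governed,
    up-to-date data source instead of relying on the model's training data."""
    # Placeholder -- a real implementation would query a database or data API here.
    return f"{ticker} market capitalization as of today: <value from governed database>"

def answer_with_rag(question: str) -> str:
    # 1. Retrieve the factual, current answer from a trusted source.
    facts = fetch_market_cap("TSLA")
    # 2. Ask the LLM only to present those facts conversationally,
    #    with temperature=0 so repeated questions return the same answer.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Answer using only the facts provided."},
            {"role": "user", "content": f"Facts: {facts}\n\nQuestion: {question}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

print(answer_with_rag("What is Tesla's market cap?"))
```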

Conclusion

Generative AI can help organizations increase productivity, enhance client and employee experiences, and accelerate business priorities. Understanding the tools to overcome inconsistent responses and outdated knowledge will help you get the most value from AI technologies.

In the meantime, watch for part five next week in this six-part series: security and data privacy. If you missed the previous articles, check them out:

AI Strategies Series: How LLMs Do—and Do Not—Work

AI Strategies Series: 7 Ways to Overcome Hallucinations

AI Strategies Series: Explainability


This blog post is for informational purposes only. The information contained in this blog post is not legal, tax, or investment advice. FactSet does not endorse or recommend any investments and assumes no liability for any consequence relating directly or indirectly to any action or inaction taken based on the information contained in this article.