
Using Large Language Models to Converse with Your Data

Data Science and AI

By Yogendra Miraje  |  April 10, 2024

The emergence of generative AI has amplified the importance of reliable data in factual, data-centric decision-making processes. With instruction-tuned Large Language Models (all LLM references in this article are instruction-tuned LLMs) such as ChatGPT on the scene, interaction with machines has advanced significantly and opened new possibilities for conversing with data in natural language—similar to talking with a colleague.

This breakthrough offers potential for seamless data interactions through text. The convenience of data retrieval is now at the user's fingertips, where a chat experience provides a promising, simplified alternative to navigation through traditional data-product interfaces that often involve a learning curve.

The recent launch of the beta release of a Large Language Model-based knowledge agent—FactSet Mercury, which supports junior banker workflows and enhances fact-based decision making—exemplifies this advancement. Mercury users can effortlessly prompt, for example, "Show me the top 50 banks by assets in California," among numerous other queries.

Constructing a chatbot system that harnesses vast quantities of enterprise data is a challenging yet rewarding task. Let's delve into how FactSet is addressing these challenges, paving the way in data-driven innovation.

Retrieval Augmented Generation

LLMs generate reasonable responses based on their training data, which is mostly drawn from the public Internet. However, to get answers from non-public data, which is generally the case in an enterprise setting, one must supplement the LLM with that data.

This method of augmenting responses is known as Retrieval Augmented Generation (RAG). LLMs can be used in the context of proprietary data to generate factually correct answers, as this approach combines LLM reasoning with factual data. The advantages of RAG are two-fold: there is no need to re-train or finetune the LLM, and hallucinations are reduced since answers are derived from the proprietary data instead of the LLM directly.
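
To make the pattern concrete, a minimal sketch of the augmentation step might look like the following. The helper names, the `call_llm` client, and the sample snippet are hypothetical illustrations, not FactSet's implementation.

def build_grounded_prompt(question: str, snippets: list[str]) -> str:
    """Compose a prompt that instructs the LLM to answer only from the
    retrieved proprietary snippets, keeping the answer grounded in the data."""
    context = "\n\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical usage; the snippet would normally come from a retrieval step.
prompt = build_grounded_prompt(
    "What was Acme Bank's asset growth last quarter?",
    ["Acme Bank reported total assets of $12.3B, up 4% quarter over quarter."],
)
# response = call_llm(prompt)  # call_llm stands in for any LLM client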

To understand how to derive the answers in context of the data, let us look at the types of data. In a typical enterprise environment, data is primarily categorized into two types: unstructured and structured.

Unstructured data doesn't follow a specific format or structure. Examples of unstructured data commonly found in enterprise settings include text data (e.g., emails and documents), news articles, and transcripts. These data sources are rich in information, primarily for qualitative insights.

Using the RAG model for unstructured data like text involves combing through a vast collection of pre-indexed documents from trusted knowledge sources such as your company’s proprietary data or third-party data. The objective is to identify the most relevant documents for the user's prompts and then generate a response to the user’s question in context of those documents.
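
A simplified sketch of that retrieval step is shown below, assuming a generic embedding model is available. The embed function is a stand-in rather than a specific library, and real systems typically use a purpose-built vector index instead of brute-force similarity.

import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for an embedding model that maps text to a vector."""
    raise NotImplementedError("Plug in an embedding model of your choice here.")

def build_index(documents: list[str]) -> np.ndarray:
    # Pre-compute one embedding per document; done offline, ahead of user queries.
    return np.vstack([embed(doc) for doc in documents])

def retrieve(question: str, documents: list[str],
             index: np.ndarray, top_k: int = 3) -> list[str]:
    # Rank documents by cosine similarity to the question embedding.
    q = embed(question)
    sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q) + 1e-9)
    best = np.argsort(sims)[::-1][:top_k]
    return [documents[i] for i in best]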

Structured data, in contrast, is organized in a defined manner, often in rows and columns. Database tables are a classic example. Structured data is often used in quantitative analyses.

However, the application of RAG using structured data poses unique challenges. Unlike text data, structured data cannot be pre-indexed, necessitating a different approach for retrieval and integration with the language model.

Structured Data RAG

Let’s go back to the example user question mentioned before: “Show me the top 50 banks by assets in California.”

While the question appears straightforward, accurately answering it involves several intricate steps; a simplified end-to-end sketch follows the list below. The chatbot must:

  • Understand the question: Given the chatbot is integrated with both structured and unstructured data, it needs to understand whether this question needs structured data or unstructured data or a combination of both. Since the data required to answer the example question is present in tables, we classify it as a structured data question.

  • Identify the required data elements: The chatbot must discern the different elements of data that answer the question. In our example, that includes banks, asset sizes, and locations in California.

  • Determine the data source: The chatbot must know where to find this information. Depending on the setup, this could be a table (or tables) in a database, multiple tables across multiple databases, or some data-provisioning service or API.

  • Retrieve the data: After locating the data source, the chatbot needs to retrieve the relevant data. This could involve executing a database query either directly or through a data-provisioning layer.

  • Perform necessary operations: To present the top 50 banks in California, the chatbot must sort them based on their assets and select the top 50. This involves not just retrieving data but also applying the correct sorting and filtering logic.

  • Generate a user-friendly response: Finally, the chatbot must present the information in a clear, concise manner. This could be a simple text response, a table, or even a visual representation such as a chart, depending on the question and chatbot capabilities.
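
Here is a deliberately simplified sketch of how those steps might be wired together. The routing logic, the table and field names, and the SQL output are hypothetical placeholders, not Mercury's actual implementation.

from dataclasses import dataclass, field

@dataclass
class StructuredAnswer:
    text: str                                   # user-facing summary
    rows: list = field(default_factory=list)    # underlying result set

def classify_question(question: str) -> str:
    """Step 1: route the question to structured vs. unstructured handling.
    In practice this is often an LLM classification prompt."""
    return "structured"  # assumed outcome for the example question

def generate_query(question: str, schema_description: str) -> str:
    """Steps 2-3: have the LLM map the question onto known tables and fields
    and emit a query (SQL shown purely for illustration; names are hypothetical)."""
    return ("SELECT bank_name, total_assets FROM bank_financials "
            "WHERE state_name = 'California' "
            "ORDER BY total_assets DESC LIMIT 50")

def execute_query(sql: str) -> list:
    """Steps 4-5: run the query against the data layer (placeholder)."""
    raise NotImplementedError("Plug in a database connection or data API here.")

def summarize(question: str, rows: list) -> str:
    """Step 6: turn the result set into a readable answer (via the LLM or a template)."""
    return f"Here are the {len(rows)} banks that match your question."

def answer(question: str, schema_description: str) -> StructuredAnswer:
    if classify_question(question) != "structured":
        raise ValueError("Route to the unstructured (document) RAG path instead.")
    rows = execute_query(generate_query(question, schema_description))
    return StructuredAnswer(text=summarize(question, rows), rows=rows)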

The California bank example underscores the complexity and sophistication required in AI chatbots, especially when dealing with structured data. It's not just about understanding language. It’s also about effectively interacting with and processing data from various sources to provide accurate, relevant responses.

Semantically Rich Metadata

Most databases weren't designed with LLM capabilities and requirements in mind. This gap poses a significant challenge: LLMs require guidance to navigate and interpret the vast amounts of data and metadata in these databases. However, metadata—the data about data, which is crucial for understanding the content and context of the stored information—may be incomplete, missing, or incompatibly formatted.

To bridge this gap, it's essential to provide LLMs with semantically rich metadata: additional descriptive information and context about the data. This allows LLMs to effectively map user questions to the correct data sources and to the specific fields representing the granular level of data the user seeks.

For instance, in response to "Show me the top 50 banks by assets in California," the LLM must correlate the query with data fields such as bank-name, bank-id, state-name, and asset-value.

By enhancing metadata in this way, LLMs can more accurately identify and retrieve the specific pieces of information needed to answer user queries. This process involves not just recognizing key terms in a question but understanding their relevance and relationship to the data fields within the database. Thus, the effectiveness of an LLM in an enterprise setting hinges significantly on the quality and compatibility of the metadata provided alongside the data model.
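
One common way to supply that context is to maintain a field-level catalog of descriptions and synonyms that is rendered into the LLM's prompt. Below is a small hypothetical sketch with illustrative field names, not an actual FactSet schema.

# Hypothetical field catalog for an illustrative banking dataset.
FIELD_CATALOG = {
    "bank_name":   {"description": "Legal name of the bank",
                    "synonyms": ["bank", "institution"]},
    "bank_id":     {"description": "Unique identifier for the bank",
                    "synonyms": ["identifier", "id"]},
    "state_name":  {"description": "U.S. state where the bank is headquartered",
                    "synonyms": ["state", "location"]},
    "asset_value": {"description": "Total assets in USD for the latest period",
                    "synonyms": ["assets", "asset size"]},
}

def describe_schema(catalog: dict) -> str:
    """Render the catalog as plain text the LLM can read when mapping a question
    like 'top 50 banks by assets in California' onto concrete fields."""
    lines = []
    for name, meta in catalog.items():
        synonyms = ", ".join(meta["synonyms"])
        lines.append(f"- {name}: {meta['description']} (also called: {synonyms})")
    return "Available fields:\n" + "\n".join(lines)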

Code Generation and Execution

Identifying the necessary data fields is the first step in responding to a user's query. The next step is to retrieve and manipulate the data, which often requires operations such as filtering and sorting. This is achieved with programming, query languages, or a combination of both. Query languages are quite efficient in data retrieval, while programming languages provide a broad range of capabilities for data manipulations.

For an experienced software engineer, writing code for data processing is an easy task. However, getting a Large Language Model to reliably produce code that executes correctly is a different matter. Depending on the requirements, an LLM can be used for both data retrieval and data operations.

LLMs are generally quite adept at writing code for basic data operations, but the challenge intensifies when dealing with complex relationships between data fields. Therefore, performing data retrieval outside of the LLM can reduce the chances of error in the generated code and let the LLM focus on the data operations alone.

Getting an LLM to generate reliable code requires explicit guidance, often provided through what's known as prompt engineering. This involves writing instructions that effectively communicate the desired operations to the LLM.

The goal is two-fold: to guide the LLM to understand the task and generate code or commands that accurately perform the required data manipulations. Guidance is crucial for ensuring the LLM can handle the intricacies of data relationships and produce the desired outcome in response to the user's question.
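
As an illustration, a code-generation prompt for the data-operations step might look something like the sketch below. The wording, the choice of pandas, and the column names are assumptions made for the example, not a description of Mercury's prompts.

CODE_GEN_PROMPT = """You are given a pandas DataFrame named `df` with these columns:
{schema}

Write Python code that answers the user's question using only `df`.
Rules:
- Do not read or write files, make network calls, or import anything beyond pandas.
- Assign the final result to a variable named `result`.

Question: {question}
Python code:"""

prompt = CODE_GEN_PROMPT.format(
    schema="- bank_name (str)\n- state_name (str)\n- asset_value (float, USD)",
    question="Show me the top 50 banks by assets in California",
)
# generated_code = call_llm(prompt)  # call_llm stands in for any LLM client
# A well-guided model would typically produce something like:
#   result = df[df["state_name"] == "California"].nlargest(50, "asset_value")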

In some rare cases, the generated code is not executable, data is lacking, or an unexpected error occurs. Such scenarios must be handled gracefully, and an appropriate message should be shown to the end user.

Additionally, this code needs to be executed securely on the retrieved data. The output from the code should then be translated into an easily understandable, user-friendly format. A significant amount of work in prompt engineering and software engineering involves developing these components and fine-tuning them to achieve the desired responses.
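
The control flow of that execution-and-fallback layer can be sketched as follows. This is illustrative only; a production system would isolate execution in a proper sandbox with timeouts and resource limits.

import pandas as pd

def run_generated_code(code: str, df: pd.DataFrame) -> str:
    """Execute LLM-generated pandas code and fall back to a friendly message on
    failure. Illustrative control flow only, not a substitute for real sandboxing."""
    namespace = {"df": df, "pd": pd}                 # restrict what the code can reference
    try:
        exec(code, {"__builtins__": {}}, namespace)  # no builtins exposed to the code
        result = namespace.get("result")
        if result is None:
            return "Sorry, I couldn't find the data needed to answer that question."
        return result.to_string() if hasattr(result, "to_string") else str(result)
    except Exception:
        return "Sorry, something went wrong while processing your request."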

Infusing the Domain Knowledge

Users might pose questions—for example, “Are large banks more profitable than small banks?”—that are vague. Such questions don't always seek specific factual data but rather a form of analysis or insight. Hence, the chatbot must navigate through a set of assumptions to provide a meaningful answer.

To effectively respond to these types of inquiries, it's beneficial to integrate a knowledge base into the chatbot. The knowledge base infuses domain-specific knowledge, helping the chatbot understand and define key concepts such as what constitutes a large or small bank and the metrics that determine profitability.

Using this external knowledge base, the chatbot can correctly interpret the question, identify the relevant pieces of data, and construct an answer. This approach handles a broader range of queries, including those that require subjective analysis or drawing conclusions from available data.
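
In its simplest form, such a knowledge base can be a curated set of domain definitions that is matched against the question and prepended to the prompt. The definitions and thresholds below are purely illustrative assumptions.

# Purely illustrative definitions; the thresholds are hypothetical assumptions.
DOMAIN_KNOWLEDGE = [
    {"term": "large bank", "keywords": ["large"],
     "definition": "A bank with total assets above a chosen threshold, e.g. $50B."},
    {"term": "small bank", "keywords": ["small"],
     "definition": "A bank with total assets below a chosen threshold, e.g. $10B."},
    {"term": "profitability", "keywords": ["profit", "profitable"],
     "definition": "Commonly measured by return on assets (ROA) or return on equity (ROE)."},
]

def inject_domain_knowledge(question: str, knowledge: list) -> str:
    """Prepend matching domain definitions so the LLM can resolve vague terms
    (e.g. 'large', 'profitable') before deciding which data to retrieve."""
    q = question.lower()
    matched = [f"- {entry['term']}: {entry['definition']}"
               for entry in knowledge
               if any(keyword in q for keyword in entry["keywords"])]
    preamble = ("Use these definitions when answering:\n" + "\n".join(matched) + "\n\n") if matched else ""
    return preamble + f"Question: {question}"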

Conclusion

The integration of generative AI within the enterprise data realm remains largely untapped, offering vast potential benefits. As the pace of AI development accelerates across business sectors, industries, and regions, we anticipate increasingly sophisticated and innovative applications of generative AI.

To learn more, view our artificial intelligence capabilities on factset.com.

 

Graham Nelson contributed to this article. 

 

This blog post is for informational purposes only. The information contained in this blog post is not legal, tax, or investment advice. FactSet does not endorse or recommend any investments and assumes no liability for any consequence relating directly or indirectly to any action or inaction taken based on the information contained in this article.


Yogendra Miraje

Lead Machine Learning Engineer

Mr. Yogendra Miraje is the Lead Machine Learning Engineer at FactSet. In this role, he is responsible for leading the engineering efforts to integrate cutting-edge AI solutions into the FactSet data ecosystem. These solutions empower customers to discover content and derive trusted insights from data. Previously, he worked for Truvalue Labs, which FactSet acquired, and contributed to the development of foundational back-end and machine-learning technologies. Yogendra earned a Master's Degree in Computer Science from Northeastern University and a Bachelor's Degree in Engineering from India.

