When delivering new LLM-powered products—in the form of new functionalities or tools such as workstations, chatbots, or question/answer (Q/A) systems—it’s important to set up an infrastructure to securely log the flood of new usage metrics. Doing so will be extremely useful down the line, enabling you to protect proprietary data, manage costs, confirm what your clients are asking for, preempt computations, and improve overall product performance.
With products like interactive chatbots, you will be able to provide an easy and automated way for your clients to take advantage of your full suite of services. Beyond requiring increased security around the collected data, however, monitoring chats calls for a slightly different setup than what is traditionally done for general user engagement and Q/A systems.
For example, general user engagement typically only considers whether your clients have interacted with a particular product feature. Therefore, tracking these sessions and how frequently users return to the product enables you to confirm the features clients are using and identify who gains the most benefit. The data can also help you with:
A/B tests of new features
Recommendations of new products
Identification of clients at risk of deactivating their accounts
Meanwhile, Q/A systems commonly keep track of the questions users ask, helping you identify which features or tools users are struggling with and, possibly, the features they would like you to develop.
The new generation of chatbots will be an interesting mix of these two scenarios, since LLM tools will offer a conversational interface to your company’s products. This means you will need to keep track not only of how frequently people use the chatbot, but also of every product that gets called to answer the queries. You will also need to keep track of the specific questions your clients are asking. Altogether, the new approach results in a lot of text information that you will need to store and process.
But with this data, a breadth of possibilities opens up to you and other decision-makers at your company.
Better differentiation. It will be possible to differentiate similar, repeated queries as either 1) a sign of user frustration because the user is unable to get the response they want, or 2) a user repeating variations of a query that works well.
Efficient computation. The system can anticipate that certain queries will require a lot of computational resources to complete, and that a first, broad question is typically modified and filtered into a more specific query. In such cases, the product can preempt wasted computation and instead ask the user for clarification.
Ability to pre-compute. A system monitoring usage can identify continually repeated requests that require a long time to compute. In these cases, the system can simply learn to pre-compute the necessary data to speed up performance and reduce costs.
Deeper data aggregation. Modern LLMs will likely perform chain-of-thought computations, where the system parses the user query and calls multiple internal queries to aggregate data and answer the question. Because these chats might soon become the only way a client interfaces with your company’s products, it’s important to keep track of all the tools accessed—both to identify critical tools and those that can be deprecated.
Personalization. As your clients access a range of products, you can learn and retain their preferences to deliver a smooth experience across tools and workstations.
Because of these and many other benefits, it’s imperative to monitor and retain usage data for your new LLM products. However, most existing frameworks are not equipped to monitor and store this data, so it’s wise to explore bespoke solutions.
Because of the unique nature of your LLM projects, you will need to track the same data as for general applications and Q/A systems, plus a few new metrics.
The standard application metrics naturally include which product is being used, by whom, and when. The browser or operating system might also yield interesting statistics. Additional measures such as the time taken to reply and a unique identifier for the given session provide the basis of any usage-logging platform. Such data is the minimum you need to monitor user engagement, determine the overall health of a product, recommend products to your users, and determine revenue at risk.
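As a rough sketch, a baseline usage-log record covering those fields might look like the following; the field names and types are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from uuid import uuid4

@dataclass
class UsageEvent:
    """One baseline usage-log record; all field names are illustrative."""
    product: str                   # which product or feature was used
    user_id: str                   # who used it (pseudonymized where appropriate)
    session_id: str = field(default_factory=lambda: str(uuid4()))
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    client_platform: str = ""      # browser or operating system, if available
    response_time_ms: float = 0.0  # how long the reply took
```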
For Q/A systems, the typical logged information is the question and the answer itself. Studied across a user base, this information is critical to understanding what your clients are struggling with and the features they typically can’t find or that are producing errors. The same data is necessary for LLM products, as the analysis can reveal your clients’ needs.
Where LLM products excel, though, is in their ability to infer user intention and create a sequence of operations to aggregate measures and provide a proper answer. Because of this, you’ll need to build and use your new LLM products on top of your other products. Therefore, it is important to track all the products and tools the chatbot called, both to keep usage logs for those products and to analyze the efficiency of your LLM products. After all, it is useful to know whether the $10 LLM product is causing $100 of computation costs.
Be sure to also track the parameters and versions of the LLM models you’re using. They will advance and change quickly, so responses might vary greatly over time. Later, you can distinguish whether an increase in engagement was due to the success of the sales team or to the introduction of a more powerful engine. Furthermore, gathering thumbs-up and thumbs-down feedback could be incredibly useful: because the chatbots are fully automated, this feedback can be one of your metrics for gauging user happiness. Keeping track of the number of tokens needed to generate responses is also a must, as it will allow you to compute the cost/benefit tradeoff of your products.
Finally, because LLMs are neither perfect nor deterministic, track all runs, both the successful ones and those that failed. There will be times when a product makes several calls to an LLM as it aggregates data; consider a case where it generates multiple SQL queries to hit various databases. Some of those queries might have typos and not return any results. In those situations, your LLM product might gracefully recover, or it might not. Either way, you still need to track the run, as it was not free to execute. Eventually, you will have the data to define the success rates of your products.
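Pulling those LLM-specific metrics together with the baseline record sketched earlier, a hypothetical log entry might add fields like these (again, the names, types, and feedback encoding are assumptions for illustration):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LLMChatEvent(UsageEvent):
    """Illustrative LLM-specific fields layered onto the baseline UsageEvent."""
    question: str = ""                       # the client's query, stored securely
    answer: str = ""                         # the generated reply
    model_name: str = ""                     # which LLM engine served the request
    model_version: str = ""                  # engines change quickly; record the exact version
    model_parameters: dict = field(default_factory=dict)  # temperature, max tokens, etc.
    tools_called: list = field(default_factory=list)      # every internal product/tool invoked
    prompt_tokens: int = 0                   # tokens sent to the model
    completion_tokens: int = 0               # tokens generated in the response
    user_feedback: Optional[int] = None      # e.g. +1 thumbs-up, -1 thumbs-down, None if unrated
    succeeded: bool = True                   # log failed runs too; they still cost compute
    error: Optional[str] = None              # what went wrong, if anything
```

With token counts and the list of tools called recorded per run, the cost side of the $10-versus-$100 question becomes a straightforward aggregation over these events.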
The final consideration when building out your LLM products is the criteria for the database that will store all the resulting data. As mentioned in the previous section, unlike general application monitoring, here you will want to keep track of the queries and replies.
It will be a lot of text to maintain. Judging from initial tests, a typical session might involve five queries to the product, each of them using about 600 tokens, or a few kilobytes of text. For one user, the cost is negligible. But if your product handles 10,000+ user queries every day, and you also log the responses, intermediate tool calls, and metadata for each one, you’ll need to store gigabytes of data every month. And afterwards the data needs to be available for efficient analysis. Furthermore, you’ll need to secure your data to ensure privacy across clients’ queries. Anonymizing is not enough. You will need to limit access to raw data to only a handful of trusted staff and maintain strict governance with role-based access control (RBAC).
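As a back-of-the-envelope sketch, the storage requirement can be estimated as below; every figure here (bytes per token, logging overhead, query volume) is an assumption to be replaced with your own measurements.

```python
def estimate_monthly_storage_mb(queries_per_day: int = 10_000,
                                tokens_per_query: int = 600,
                                bytes_per_token: float = 4.0,
                                overhead_factor: float = 3.0) -> float:
    """Rough monthly storage estimate in MB.

    overhead_factor stands in for also logging responses, intermediate
    tool calls, and metadata alongside the raw query text; all of the
    defaults are illustrative assumptions, not measurements.
    """
    bytes_per_day = queries_per_day * tokens_per_query * bytes_per_token * overhead_factor
    return bytes_per_day * 30 / 1_000_000

# Roughly 2,160 MB (about 2 GB) per month under these assumptions; higher
# query volumes or longer conversations push this up quickly.
print(f"{estimate_monthly_storage_mb():,.0f} MB per month")
```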
Only aggregated statistics should be available for general internal analysis, and even that data must be secured. Either on purpose or by accident, your clients will undoubtedly include proprietary information in their requests, which needs to be locked down and protected. And because the raw data is kept secure, set up your underlying database to support the multiple views built on top of it, which requires fast and efficient lookups.
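One simple pattern, sketched here with purely illustrative names on top of the hypothetical records above, is to expose only an aggregation layer to general analysts while the raw events stay behind RBAC:

```python
from collections import defaultdict

def aggregate_daily_usage(events):
    """Collapse raw usage events into per-product daily counts and average latency.

    The returned rows deliberately exclude question/answer text and user
    identifiers, so they can be shared more widely than the raw logs; the
    grain and the fields kept are illustrative choices, not prescriptions.
    """
    buckets = defaultdict(lambda: {"calls": 0, "total_latency_ms": 0.0})
    for event in events:
        key = (event.product, event.timestamp.date())
        buckets[key]["calls"] += 1
        buckets[key]["total_latency_ms"] += event.response_time_ms
    return [
        {"product": product, "date": day, "calls": stats["calls"],
         "avg_latency_ms": stats["total_latency_ms"] / stats["calls"]}
        for (product, day), stats in buckets.items()
    ]
```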
Overall, it’s clear that introducing conversational functionality to your products could transform the way users engage with them. And as is well known, to truly understand client engagement it’s critical to monitor how, and whether, clients are using the tools. While you can monitor LLM tools with methods like those used to track regular engagement or Q/A systems, ultimately these tools will require bespoke solutions to be truly effective.
At FactSet, our cognitive and artificial intelligence teams have been utilizing Machine Learning and AI throughout our product since 2007. We’ve also been using Large Language Models such as Bloom, Google’s BERT, and T5 across our suite of products and services since 2018. More specifically, Large Language Models have enabled higher productivity and optimization within data extraction, natural language understanding for search and chat, text generation, text summarization, and sentiment analysis across FactSet’s digital platform.
Our use of LLMs has resonated with clients over the years. We are further investigating key areas of promise from newer LLM models, for example, for summarization and authoring, domain-specific search, and text-to-code functionality. LLM costs, commercial availability, security, and suitability to specific use cases can vary quite a bit, but we are committed to advancing our capabilities further as the LLM market evolves. We also continue our commitment to the privacy and protection of clients’ proprietary information.
This blog post is for informational purposes only. The information contained in this blog post is not legal, tax, or investment advice. FactSet does not endorse or recommend any investments and assumes no liability for any consequence relating directly or indirectly to any action or inaction taken based on the information contained in this article.