
3 Red Flags to Avoid the Snake Oil of ChatGPT Stock Pickers

Data Science and AI

By Yuri Malitsky  |  November 15, 2023

There’s a popular but dangerous trend spreading even into reputable news sources: articles claiming that ChatGPT has been used to pick stocks that deliver higher returns than professional investment firms. These portrayals of Large Language Model capabilities are wildly misleading and have the potential to cause investors real harm.

To understand why, let’s dive into the details of how LLMs work and highlight how to spot the most egregious red flags of this modern-day snake oil.

Red Flag No. 1: Survivorship Bias

First and foremost, one must understand survivorship bias. This is the erroneous tendency, when evaluating stock-picking strategies, to concentrate on those that have remained successful over time while overlooking those that have failed. It leads to an inaccurate picture of the winning strategy’s overall performance and viability.

Consider that due to the popularity of ChatGPT there are unknown numbers of teams exploring how to use the platform for stock selection. If a handful of them just so happen to strike it rich in the short term and beat the returns of investment firms or major benchmarks, that is by no means a reason to promote the strengths of ChatGPT while omitting all the failed results.
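A small simulation shows how easily survivorship bias manufactures apparent winners. The numbers below (10,000 hypothetical teams, an 8% benchmark, the return distribution) are assumptions invented purely for illustration, not market data:

```python
import random

random.seed(0)

N_PICKERS = 10_000   # hypothetical number of teams trying ChatGPT stock picks
N_PERIODS = 12       # months of track record
BENCHMARK = 0.08     # assumed benchmark annual return

def random_strategy_return():
    """A strategy with zero skill: each month's return is pure noise."""
    total = 1.0
    for _ in range(N_PERIODS):
        # Invented parameters: modest drift, high monthly variance
        total *= 1 + random.gauss(0.005, 0.04)
    return total - 1

returns = [random_strategy_return() for _ in range(N_PICKERS)]
winners = [r for r in returns if r > BENCHMARK]

print(f"{len(winners)} of {N_PICKERS} zero-skill strategies beat the benchmark")
print(f"best zero-skill return: {max(returns):.1%}")
```

Even with zero skill, thousands of these random strategies beat the benchmark by chance. An article profiling only the best of them would make randomness look like genius, which is exactly the danger of reporting a handful of lucky ChatGPT portfolios.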

It's not enough to simply comment that ChatGPT was used. The article must explain the strategy because that is the only way for other investors to determine the quality of the approach. Doing anything else simply encourages uninformed followership of ChatGPT’s predictions.

Red Flag No. 2: Prediction

When considering Large Language Models in general, one must remember that these systems generate a response to a query one word at a time by drawing from a complicated, learned probability distribution. In most cases, the predictions simply follow the most frequently mentioned stocks in the model’s training data.

At their core, LLMs are not thinking, and prediction for them has a different meaning than it does in everyday conversation. In most cases, the models do not even plan where their response is going. For example, given the sentence “The students opened their,” the LLM will generate a list of potential words to finish the sentence (e.g., books, laptops, exams, minds) and select the word most statistically likely to match the context of the request based on feedback it has received during training.
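The selection step above can be sketched in a few lines. The probabilities here are invented for illustration; a real LLM derives them from billions of parameters, but the mechanics of picking the next word are the same idea:

```python
# Toy illustration of greedy next-word selection.
# The probability values are made up for this example.
context = "The students opened their"
next_word_probs = {
    "books": 0.42,
    "laptops": 0.35,
    "exams": 0.15,
    "minds": 0.08,
}

# Greedy decoding: pick the single most probable continuation.
chosen = max(next_word_probs, key=next_word_probs.get)
print(context, chosen)
```

Note that nothing in this step evaluates whether “books” is *true* or *useful* — it is only the statistically likeliest continuation.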

Once a word is selected, the models are unable to recognize and correct mistakes. Newer systems attempt to tackle this issue, but the base ChatGPT models often cited in stock-picking articles work exactly this way. Therefore, if the LLM chooses the wrong word in its answer, it might go off on an uncontrolled tangent.

Under these circumstances, when ChatGPT comments about the best stocks to pick, the LLM is replying one word at a time with the stocks that appear most frequently in its training history. Once it picks a stock, the corresponding word selection yields a plausible sounding reason. But a reason can be made up for any stock.

This naturally leads to companies such as Tesla and Apple coming to the forefront. Both had been doing well at the time the model was trained—and just so happened to continue to do well over the course of 2023. Likewise, when asking the models about the outcome of a particular cryptocurrency, the machine is similarly repeating a summary of its underlying training data.
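The dynamic at work is essentially a popularity count. As a toy stand-in for “frequency in the training corpus,” the sketch below counts company mentions in a handful of made-up headlines; the headlines and the resulting ranking are invented for illustration:

```python
from collections import Counter

# Hypothetical headlines standing in for a training corpus.
headlines = [
    "Apple hits new high",
    "Tesla deliveries surge",
    "Apple unveils product",
    "Small-cap firm wins contract",
    "Tesla expands factory",
    "Apple beats estimates",
]

mentions = Counter()
for h in headlines:
    for name in ("Apple", "Tesla", "Small-cap firm"):
        if name in h:
            mentions[name] += 1

# The most-mentioned names dominate -- not necessarily the best-performing ones.
print(mentions.most_common())
```

A model trained on such a corpus will surface Apple and Tesla simply because they are talked about most, regardless of their future returns.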

Red Flag No. 3: Mathematical and Data Limitations

Because LLMs are word-generating tools, they should not be relied on to perform complex mathematics or deductive reasoning. Although some modern systems have begun handling simple arithmetic, LLMs should not be trusted to build the projection models that highlight the risk profile of a selected portfolio. Based on its training corpus, an LLM can suggest which classical approach to use, but it would be hard pressed to do the operations itself. Separate, specialized AI and statistical tools should perform those operations.
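As an example of the kind of calculation that belongs in a deterministic tool rather than a word generator, here is a classical risk measure computed directly. The daily returns are hypothetical placeholders; in practice they would come from a market data feed:

```python
import statistics

# Hypothetical daily portfolio returns (illustrative values only).
daily_returns = [0.004, -0.012, 0.007, 0.001, -0.003, 0.009, -0.006, 0.002]

# Classical risk measure: annualized volatility, i.e., the standard
# deviation of daily returns scaled by the square root of ~252 trading days.
daily_vol = statistics.stdev(daily_returns)
annualized_vol = daily_vol * (252 ** 0.5)

print(f"annualized volatility: {annualized_vol:.1%}")
```

A few lines of deterministic code give an exact, reproducible answer every time — precisely what a probabilistic next-word predictor cannot guarantee.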

In addition, most LLMs work from a static knowledge base that stops at their training cutoff date. For example, GPT-4 uses training data through September 2021. While there are now methods to link chatbots to the internet, even these systems will not be automatically aware of all the most recent data. It is not controversial to claim that investment decisions should be based on the latest information about company fundamentals, the market, and the economic environment.

Are There Any Uses for LLMs in Investment Planning?

Although these are the most flagrant red flags to watch for, there are potentially beneficial techniques to help investment decision-makers. One such research area uses LLMs to summarize recent articles and identify companies that are being negatively or positively targeted in the press. This begins to touch on the strengths of modern LLMs to help researchers summarize vast amounts of text data.

As always, one must be careful before putting money to work. While LLMs are remarkably good at summarization, entity extraction, and sentiment analysis, they are still prone to mistakes and hallucinations. And as yet, it is not possible to reliably test for and remove these drawbacks. While a summary might seem legitimate, there is a non-zero chance that parts of it were simply made up to look good as a reply.

So, one must ask how tolerant the underlying investment strategy is to LLM results being a little off. Furthermore, an LLM stating that a company is receiving more positive commentary is little different from traditional sentiment analysis, and it needs to be tested and verified with the same stringency.
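One lightweight form of that verification is cross-checking an LLM’s sentiment label against an independent baseline and flagging disagreements for human review. The sketch below is a toy, with an invented headline and a crude keyword baseline; a real pipeline would evaluate against properly labeled data:

```python
# Toy cross-check of a claimed LLM sentiment label against a crude
# keyword baseline. All data here is hypothetical and illustrative.
POSITIVE = {"beat", "beats", "growth", "record", "upgrade"}
NEGATIVE = {"miss", "lawsuit", "downgrade", "recall"}

def keyword_sentiment(headline: str) -> str:
    words = set(headline.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

headline = "Company reports record growth and beats estimates"
llm_label = "positive"  # pretend this label came from an LLM

baseline = keyword_sentiment(headline)
if baseline != llm_label:
    print(f"disagreement: baseline={baseline}, llm={llm_label} -> flag for review")
else:
    print(f"both agree on '{llm_label}' -- still validate against labeled data")
```

Agreement with a simple baseline is not proof of correctness, but systematic disagreement is a cheap early warning that the LLM’s labels deserve scrutiny before any capital depends on them.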

In the meantime, there are plenty of established alternative trading signals, based on classical and machine learning approaches, to assist traders. At FactSet, for example, we have developed predictive Signals in which AI models surface insights and context, such as whether a company is the target of an activism campaign, is predicted to issue a follow-on offering, or has experienced recent credit rating changes, to name just a few examples. Leveraging these alerts could very well form a solid foundation for the investment decision-making process.

There is little doubt that as development of Large Language Models continues, they will one day be extremely useful in making strategic decisions such as helping to manage a portfolio. But that day is not here yet, so in the meantime it’s critical that individuals stay focused on the time-tested principles of building portfolios and avoid scams promising market-beating returns from the supernatural stock-selection “abilities” of LLMs.


This blog post is for informational purposes only. The information contained in this blog post is not legal, tax, or investment advice. FactSet does not endorse or recommend any investments and assumes no liability for any consequence relating directly or indirectly to any action or inaction taken based on the information contained in this article.


Yuri Malitsky

Senior Vice President of Enterprise Analytics

Dr. Yuri Malitsky is Senior Vice President of Enterprise Analytics at FactSet. In his role, he leads the analysis of internal data to enhance FactSet's competitive advantage and better understand clients’ needs. His team uses machine learning, optimization, and statistical analysis models to support data-driven decision-making. Prior to FactSet, Dr. Malitsky worked in the investment banking sector at Morgan Stanley and JPMorgan. He earned a Bachelor of Computer Science degree from Cornell University and a PhD in Computer Science from Brown University.

