To appreciate how Large Language Models (LLMs) like ChatGPT work, it helps to first understand a foundational AI concept called embeddings. Embeddings are used in natural language processing (NLP) to represent words and phrases in a form that computers can process: as vectors, which are ordered lists of N numbers. Each number in a vector describes a different characteristic of the word, and together the numbers capture its meaning.
Let's explain with a food-based example (Figure 1). We can represent words like apple and bread as vectors in a multi-dimensional space, where each dimension captures a specific aspect of the food, such as taste or color, with a real number in a range from 0 to 1.
This table defines a four-dimensional vector where the dimensions represent sweetness, crunchiness, redness, and vitamin C content. Apple and orange would have high values for sweetness, while jalapeno would have a low value. Similarly, carrot would have high crunchiness, while the crunchiness value for bread would be low.
Figure 1: Food Words with Four Dimensions and Plotted in Two-Dimensional Space
If you were able to visualize this in N-dimensional space, you would see that the words apple and orange are more semantically similar to each other (because of their shared high sweetness and vitamin C content) than either is to jalapeno or bread. (In Figure 1, we simplify this visualization to two-dimensional space.) If we added an embedding for toast, its vector would be very close to the one for bread, with a slight offset due to the crunchiness factor.
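To make the idea of closeness concrete, here is a minimal sketch in Python. The numbers are hypothetical values chosen for illustration and are not read from Figure 1, and the helper names (FOOD_VECTORS, distance, nearest) are our own. The sketch measures similarity with straight-line (Euclidean) distance; cosine similarity is a common alternative in practice.

```python
import math

# Hypothetical values chosen for illustration; they are not read from Figure 1.
# Dimensions: sweetness, crunchiness, redness, vitamin C content (each 0 to 1).
FOOD_VECTORS = {
    "apple":    [0.8, 0.6, 0.5, 0.6],
    "orange":   [0.8, 0.2, 0.2, 0.9],
    "jalapeno": [0.1, 0.5, 0.1, 0.4],
    "carrot":   [0.3, 0.95, 0.3, 0.3],
    "bread":    [0.2, 0.1, 0.0, 0.0],
    "toast":    [0.2, 0.4, 0.0, 0.0],  # same as bread except for crunchiness
}


def distance(word_a, word_b):
    """Euclidean distance between two food vectors (smaller means more similar)."""
    return math.dist(FOOD_VECTORS[word_a], FOOD_VECTORS[word_b])


def nearest(word):
    """Return the other food whose vector lies closest to the given word's vector."""
    others = [w for w in FOOD_VECTORS if w != word]
    return min(others, key=lambda w: distance(word, w))


print(nearest("toast"))  # bread
print(distance("apple", "orange") < distance("apple", "jalapeno"))  # True
```

With these made-up values, toast sits closest to bread, and apple is closer to orange than to jalapeno or bread, which mirrors the relationships described above.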
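By using vector math, AI is able to “understand” both the meanings of words and the relationships between them. For example, because the offset between the vectors for bread and toast resembles the offset between the vectors for corn kernel and popcorn, it could infer that toast is related to bread in much the same way that popcorn is related to a corn kernel.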
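The relationship between word pairs can be sketched with simple vector arithmetic. In the example below, the three dimensions, the values, and the analogy helper are all hypothetical and hand-written for illustration; real embeddings are learned from data and their dimensions are not labeled this neatly.

```python
import math

# Hypothetical dimensions: wheat-based, corn-based, heat-crisped (each 0 to 1).
# All values are invented for illustration.
VECTORS = {
    "bread":       [0.9, 0.0, 0.1],
    "toast":       [0.9, 0.0, 0.9],
    "corn kernel": [0.0, 0.9, 0.1],
    "popcorn":     [0.0, 0.9, 0.9],
    "cornbread":   [0.4, 0.6, 0.2],
}


def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' by finding the word nearest to b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    # Exclude the three input words so the answer is a new word.
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return min(candidates, key=lambda w: math.dist(VECTORS[w], target))


print(analogy("bread", "toast", "corn kernel"))  # popcorn
```

Subtracting bread from toast isolates the "heat-crisped" change, and adding that change to corn kernel lands next to popcorn.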
In practice, automated algorithms determine the number and type of features that best characterize a given set of words, and the resulting embeddings typically have hundreds of dimensions. You can create embeddings for more than just single words. For example, you can create an embedding of an entire news story, and then store a set of news story embeddings in a special database (known as a “vector database”) that allows you to find the most relevant matches given a question about the news. This is an important technique used with modern Large Language Models to retrieve relevant data for a user query and reduce the risk of AI hallucinations.
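The retrieval step can be sketched as a tiny in-memory stand-in for a vector database. Everything here is an assumption made for illustration: the TinyVectorStore class, the headlines, and the three-dimensional embeddings (loosely "monetary policy", "technology", "energy") are hand-written. A real system would obtain embeddings from an embedding model and would use a dedicated vector database with an approximate search index rather than the brute-force ranking shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (closer to 1.0 means more similar)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


class TinyVectorStore:
    """Hypothetical in-memory store that ranks every item against the query."""

    def __init__(self):
        self._items = []  # list of (text, embedding) pairs

    def add(self, text, embedding):
        self._items.append((text, embedding))

    def search(self, query_embedding, top_k=2):
        ranked = sorted(
            self._items,
            key=lambda item: cosine_similarity(item[1], query_embedding),
            reverse=True,
        )
        return [text for text, _ in ranked[:top_k]]


# Invented headlines with hand-written embeddings:
# [monetary policy, technology, energy]
store = TinyVectorStore()
store.add("Central bank holds interest rates steady", [0.9, 0.1, 0.1])
store.add("Chipmaker unveils new AI processor", [0.1, 0.9, 0.1])
store.add("Oil prices climb on supply concerns", [0.1, 0.1, 0.9])
store.add("Rate outlook weighs on tech stocks", [0.6, 0.6, 0.1])

# Hand-written embedding for: "What did the central bank decide about rates?"
query_embedding = [0.8, 0.2, 0.1]
results = store.search(query_embedding, top_k=2)
print(results)
```

The two stories returned are the ones about interest rates. In a retrieval workflow, their text would be passed to the LLM alongside the user's question so that the answer is grounded in the retrieved stories.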
FactSet’s AI-generated list of private company comparables uses embeddings to compare companies in much more nuanced ways than can be done with a simple industry classification or set of curated keywords. The proximity of companies in vector space—represented as a two-dimensional cluster chart in Figure 2—provides rich information about how similar those companies actually are.
Figure 2: FactSet's AI Comps Use Embeddings to Cluster Comparable Companies at a Nuanced Level
Source: FactSet
A keyword-based approach might group all “autonomous vehicle technology” together, but an embedding-based approach will distinguish between automated parking technology and automated collision-detection technology without the need for human curation of additional keywords or categories.
Embeddings allow computers to better understand and process human language, paving the way for more advanced NLP and AI applications like Large Language Models. While Generative AI applications like ChatGPT may seem to have a human-like understanding of language, their abilities are in reality rooted in statistics and math.
To learn more, visit FactSet Artificial Intelligence and read our additional LLM articles:
This blog post is for informational purposes only. The information contained in this blog post is not legal, tax, or investment advice. FactSet does not endorse or recommend any investments and assumes no liability for any consequence relating directly or indirectly to any action or inaction taken based on the information contained in this article.