Unraveling the Sentiment Data Purchasing Process

Data Science and Technology

By Christian Cifelli  |  October 28, 2020

Much has already been said about sentiment analysis, how to leverage Natural Language Processing (NLP), and the value of sentiment data. In this article, we will provide insight into how to approach the sentiment data market, which continues to fill with new options.

Taking a broad look at the market for sentiment-related content, you find a variety of different providers, sources, approaches, and strategies. The market of sentiment content continues to expand as the value of leveraging the data becomes more widely accepted. However, the decision to invest in sentiment data brings with it additional decision points that are essential to understand when pursuing various vendors and sources. For instance, an important question to ask yourself as you begin your analysis is: What is my investment horizon? A question as simple as that serves as a filter to narrow down the sea of options.

There are additional considerations when engaging with data providers in the sentiment space that will quickly eliminate datasets that are not a fit for a given investment style and approach. The three areas on which we’ll focus here are: the source of the data used for the sentiment analysis, how the data is being tagged, and whether to use the Buy, Build, or Blend approach.

Buy, Build, or Blend

Before digging into sourcing and tagging, it’s important to delve into the three common approaches for sentiment data consumption: Buy, Build, and Blend. Simply put, the differing approaches come down to whether the preference is for raw data, derived sentiment scores, or a mix of the two. Within those three options is significant nuance that is important to understand as a data consumer.

  • Building means acquiring raw data sources and performing NLP in-house. Building a proprietary sentiment algorithm requires a large upfront investment of money and effort but provides a solution with full transparency into the output. It is especially valuable in situations where an end user will need to verify the values used in a model; it is also important when compliance is a hurdle.
  • Buying, on the other hand, means relying on third-party data providers for sentiment analysis. This approach is a viable solution when the priority is to consume a large number of sources and limited insight into the vendor's model and approach is not a barrier. Buying from a vendor can also be a cost-friendly solution as it circumvents the need to aggregate a large number of sources, store and manage data, and build out a team of in-house experts.
  • Blending data lies somewhere between those two strategies. Blending is the integration of alternative data vendors in conjunction with in-house sentiment analysis. This best-of-both-worlds solution is often valuable when you can pair your scoring of raw content with a data vendor who is analyzing and scoring the same source. Having the ability to test a non-traditional score makes it easy to gauge the usefulness of the raw data before any custom work needs to be done. Additionally, it provides a way to validate or compare your work to something in the market.
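
As a rough illustration of the Blend idea, the Python sketch below scores a few headlines with a toy in-house word list and compares the result to a vendor's scores for the same items. Everything here is hypothetical: the word lists, the sample headlines, the vendor scores, and the function names `in_house_score` and `pearson` are assumptions made for the example, and a production model would be far richer.

```python
from math import sqrt
from statistics import mean

# Hypothetical toy lexicon; a real in-house model would be far more sophisticated.
POSITIVE = {"beat", "growth", "strong", "upgrade", "record"}
NEGATIVE = {"miss", "decline", "weak", "downgrade", "lawsuit"}


def in_house_score(text: str) -> float:
    """Return (positive hits - negative hits) / total hits, in [-1, 1]."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def pearson(xs, ys) -> float:
    """Pearson correlation between two equal-length sequences of numbers."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sx * sy)


# Invented sample data: the same headlines scored in-house and by a vendor.
docs = [
    "Strong growth and record revenue",
    "Earnings miss and weak guidance after downgrade",
    "Revenue growth offset by lawsuit",
    "The company held its annual meeting",
]
vendor_scores = [0.8, -0.7, 0.1, 0.0]  # made-up vendor output

our_scores = [in_house_score(d) for d in docs]
agreement = pearson(our_scores, vendor_scores)
print(f"Correlation with vendor scores: {agreement:.2f}")
```

A high correlation suggests the in-house scoring and the vendor are reading the source similarly; a low one is a prompt to look at where and why they disagree before committing to custom work.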


When it comes to sentiment, the choice of source is often dictated by the investment approach. Different sources inherently provide data that is best suited for different users and applications. To illustrate this, we’ll begin by looking at corporate communication as a source in comparison to social media.

Corporate communication can range from a company’s earnings call to its 10-K filing. This source is often viewed as the gold standard when it comes to valuing a company, but it brings challenges such as lower frequency and few events on which to train a model. Additionally, this source can be seen as having inherent bias. On the other hand, social media provides information at a breakneck pace, which suits it well to shorter-term investment horizons. However, this data is less reliable, as contributors aren’t always viewed as experts in the field on which they are commenting. Additionally, tagging social media activity to a target investable company is often not straightforward.

Another key source, and perhaps the most popular, is news. News is a category of data that provides insight into market opinion at a pace that is rivaled only by social media. However, the difference between news and social media is that typically the sources and content of news are held in higher regard. Vendors that offer sentiment derived from news can differentiate themselves by supplying news volume (accounting for bias associated with different media outlets) and identifying articles with unique information, rather than restating other sources.
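
One simple way to think about separating articles with unique information from restatements is near-duplicate detection. The sketch below compares overlapping three-word sequences (shingles) using Jaccard similarity and flags an article as unique only if it is sufficiently different from everything seen before it. This is an illustrative technique, not a description of how any particular vendor works; the function names, the 0.5 threshold, and the sample headlines are all assumptions.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word sequences found in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


def flag_unique(articles, threshold: float = 0.5):
    """Return one bool per article: True if it is not a near-duplicate
    of any earlier article in the list (threshold is an assumed cutoff)."""
    seen = []
    flags = []
    for text in articles:
        current = shingles(text)
        flags.append(all(jaccard(current, prev) < threshold for prev in seen))
        seen.append(current)
    return flags


# Invented sample headlines, in order of arrival.
articles = [
    "Acme Corp reports record quarterly revenue on strong cloud demand",
    "Acme Corp reports record quarterly revenue on strong cloud demand today",
    "Regulators open probe into Acme Corp accounting practices",
]
print(flag_unique(articles))
```

Counting only the articles flagged as unique gives a cleaner view of news volume than a raw article count, since syndicated or lightly reworded copies no longer inflate the total.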


Tagging, as a topic of differentiation, is relevant no matter the source or approach. When it comes to raw text, a data provider may deliver their own content tagging to ensure greater usability. This may come in the form of metadata tagging, which helps contextualize the text by providing details such as its source, relevant companies, and people, as well as more document-specific information such as publish date. Additionally, data consumers might leverage a more advanced form of tagging called Named Entity Recognition (NER). NER allows elements of text to be classified as a person, place, or thing. Classifying pieces of the text prior to data parsing can save an end user valuable time.
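
To make the two kinds of tagging concrete, the sketch below shows a document record carrying metadata (source and publish date) plus entity tags. Note the assumption: real NER typically relies on trained statistical models, whereas this example substitutes a simple dictionary lookup so that it stays self-contained. The `TaggedDocument` class, the `GAZETTEER` entries, and the labels are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class TaggedDocument:
    text: str
    source: str          # metadata: where the text came from
    publish_date: date   # metadata: when it was published
    entities: dict = field(default_factory=dict)  # label -> sorted list of names


# Hypothetical lookup table standing in for a trained NER model.
GAZETTEER = {
    "Acme Corp": "ORGANIZATION",
    "Jane Doe": "PERSON",
    "Chicago": "LOCATION",
}


def tag_entities(doc: TaggedDocument) -> TaggedDocument:
    """Attach entity tags by exact string match against the gazetteer."""
    found = {}
    for name, label in GAZETTEER.items():
        if name in doc.text:
            found.setdefault(label, []).append(name)
    doc.entities = {label: sorted(names) for label, names in found.items()}
    return doc


doc = TaggedDocument(
    text="Jane Doe, CEO of Acme Corp, spoke in Chicago.",
    source="newswire",
    publish_date=date(2020, 10, 28),
)
tag_entities(doc)
print(doc.entities)
```

With tags attached up front, a consumer can filter to the companies or people they care about before running any heavier text processing.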

Providers who deliver derived sentiment values based on raw text might also tag articles or sections with events or categories. For instance, a news article may be tagged as M&A activity or an earnings release. To provide context to the score, a vendor may also include detail on the weighting, relevance, or confidence of the value. In cases when the raw data behind a score is unavailable or not reported, detailed tagging can serve as a suitable substitute.
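
As one possible way to put relevance and confidence tags to work, the sketch below combines item-level scores into a single value, weighting each item by relevance times confidence and optionally filtering by event tag. The field names, value ranges, and weighting scheme are assumptions chosen for illustration; vendors define and scale these fields differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoredItem:
    sentiment: float   # assumed range: -1 (negative) to 1 (positive)
    relevance: float   # assumed range: 0 to 1
    confidence: float  # assumed range: 0 to 1
    event: str         # hypothetical event tag, e.g. "earnings" or "m&a"


def aggregate(items, event: Optional[str] = None) -> Optional[float]:
    """Weighted average sentiment, with weight = relevance * confidence.

    If an event tag is given, only items with that tag are included.
    Returns None when no item carries any weight.
    """
    selected = [i for i in items if event is None or i.event == event]
    weights = [i.relevance * i.confidence for i in selected]
    total = sum(weights)
    if total == 0:
        return None
    return sum(w * i.sentiment for w, i in zip(weights, selected)) / total


# Invented sample items.
items = [
    ScoredItem(sentiment=0.8, relevance=1.0, confidence=0.5, event="earnings"),
    ScoredItem(sentiment=-0.4, relevance=0.5, confidence=1.0, event="m&a"),
]
print(aggregate(items))
print(aggregate(items, event="earnings"))
```

The design choice here is that a highly relevant but uncertain score and a marginally relevant but confident score end up with the same weight; a different product of the two fields, or a hard relevance cutoff, may suit a given strategy better.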


Leveraging a decision-making framework like the one described above eliminates the guesswork and unneeded burden of evaluating every data provider in the market. Thinking through which approach (Buy, Build, or Blend), data source, and tagging options best suit your investment needs offers a way to streamline what can be a time-consuming process.


Christian Cifelli

VP, Content & Technology Strategy

Mr. Christian Cifelli is Vice President, Product Strategy within the Content and Technology Solutions group at FactSet. In this role, he focuses on developing an in-depth understanding of our clients' content and integration needs, identifying market trends, and serving as a subject matter expert to inform decisions and define the content pipeline for Open:FactSet. Mr. Cifelli earned a B.S. in Finance from Villanova University.