Going Beyond Standard Insights Through Content Synergies and Symbology

Data Science and Technology

By Steve Markovits  |  August 18, 2021


In recent years, there’s been much discussion surrounding data overtaking oil as the world’s most valuable resource. The Economist previously explored how today’s data economy, now primarily managed by internet giants such as Amazon, Facebook, and Microsoft, may require a new approach to antitrust rules.

As big data grows, it’s important to recognize that in order for data to reach its full potential, disciplined work is required to clean any source of data before it can be used within existing data models or external datasets. In fact, The New York Times estimates that 50-80% of a typical data scientist’s day-to-day activity consists of wrangling and cleaning data, rather than the more insightful task of analyzing, modeling, and visualizing it.

Understanding the Benefits of a Connected Data Model

In finance, the ever-increasing desire for “alternative” data within data models alongside more traditional “core” content often means that data models must leverage a range of content sets from a variety of sources. This frequently means that time-consuming work must be carried out before the value of the data can truly be exposed, so resources can be spent evaluating content that is later deemed ineffective or of little value.

Diagram of Connectivity Between Datasets

The following diagram provides an example of connectivity using FactSet’s Entity Master database, which connects company hierarchy relationships, people data and other entity relationships, and FactSet’s Symbology Master database, which connects financial instruments between three levels of granularity. Global, regional, and listing-level identifiers can then be further mapped to industry-standard identifiers such as SEDOLs, ISINs, tickers, and CUSIPs.


Source: FactSet

One of the biggest hurdles to clear when combining datasets from separate vendors is identifier mapping: accurately establishing relationships between the entities that are present in the data. This topic is widely discussed in network and graph theory, both of which model relationship connections such as parent and ultimate parent companies and their subsidiaries and affiliates, individuals and their roles within certain organizations, and business segment linkages, as well as company supply chain relationships such as suppliers, customers, and competitors.
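
As a rough illustration of the parent and ultimate parent relationships mentioned above, the sketch below walks a small hierarchy upward until it reaches an entity with no parent. The entity IDs, the `PARENT_OF` mapping, and the `ultimate_parent` helper are all hypothetical and invented for this example; they are not taken from any FactSet database.

```python
from typing import Dict, Optional

# Hypothetical entity IDs and parent links, invented for illustration only.
PARENT_OF: Dict[str, Optional[str]] = {
    "ENT-A": None,     # top of the hierarchy
    "ENT-B": "ENT-A",  # subsidiary of ENT-A
    "ENT-C": "ENT-B",  # subsidiary of ENT-B
    "ENT-D": None,     # unrelated standalone entity
}


def ultimate_parent(entity_id: str, parent_of: Dict[str, Optional[str]]) -> str:
    """Walk parent links upward until an entity with no parent is reached."""
    seen = set()
    current = entity_id
    while parent_of.get(current) is not None:
        if current in seen:
            # Guard against malformed data where the hierarchy loops back on itself.
            raise ValueError(f"cycle detected in hierarchy at {current}")
        seen.add(current)
        current = parent_of[current]
    return current
```

Resolving every record to its ultimate parent in this way is one simple means of aggregating data from subsidiaries under a single company before joining it to other datasets.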

In this article, we will address how mapping external data sources to entity and symbology master databases can be used to enhance and expedite the data cleansing process and ensure that all data can be accurately and reliably aggregated. We will also address the value of employing more than one data source within a single data model and explore how combining various sources of data can often bring about additional insights that have the potential to be greater than the sum of their parts.

The remainder of this piece will be dedicated to providing an example data model that will look to predict industry revenue figures for a chosen supplier ahead of official reporting. This couldn’t be achieved using a single data source without relying on the bias of publicly available data from financial brokers or industry insiders.

Combining FactSet and Cortera Datasets in a Predictive Model

This example will use nowcasting techniques that rely on Cortera’s Spend Insights dataset, which is traditionally used to predict financial health and future performance by identifying trends in purchasing, spending, and payment behaviors via a network of contributing participants.

For our predictive model, we will be using Cortera’s business-to-business data to identify spending patterns of the customers of a specific target supplier identified by FactSet’s Supply Chain Relationships. This will allow us to gain insight into how customer spending directly contributes to the revenue of the supplier.

In addition to the Cortera data, we will employ the use of the following FactSet datasets in order to complete the inner workings of the model:

  • FactSet Supply Chain Relationships: This dataset provides data on an ad-hoc basis and has been built to expose business relationship interconnections among companies, providing access to the complex networks of companies’ key customers, suppliers, competitors, and strategic partners. We will be using this data to identify the customer and supplier relationships between various companies in our model.
  • FactSet RBICS with Revenue: This dataset has been designed to normalize non-standardized business segment reports by mapping companies’ segment revenues to the granular sectors of the FactSet Revere Business Industry Classification System (RBICS). We will be using this data in the model to identify the business segments, the various suppliers operating within them, and the percentage of each supplier’s total sales that each business segment provides.
  • FactSet Fundamentals: This dataset is composed of annual and interim/quarterly data and detailed historical financial statement content. We will be using this data to provide historic Sales/Revenue reporting figures for each company as a benchmarking metric in our prediction model.

To make use of these datasets within a single model, we must consolidate Cortera’s proprietary identifiers as well as FactSet’s entity- and security-level identifiers. We must also consider the differing reporting frequencies. Thanks to a clearly laid-out symbology mapping and date structure, we can make light work of this task using querying tools such as SQL or Pandas.
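
A minimal pandas sketch of this consolidation might look like the following. The column names, identifiers, and figures are all hypothetical and do not reflect the actual Cortera or FactSet schemas; the sketch only shows the general pattern of mapping a proprietary ID to an entity ID and then rolling monthly data up to a quarterly reporting date.

```python
import pandas as pd

# Hypothetical mapping from a vendor's proprietary ID to an entity ID.
id_map = pd.DataFrame({
    "vendor_id": ["V001", "V002"],
    "entity_id": ["ENT-X", "ENT-Y"],
})

# Hypothetical monthly spend observations keyed by the vendor ID.
monthly_spend = pd.DataFrame({
    "vendor_id": ["V001", "V001", "V001", "V002", "V002", "V002"],
    "month_end": pd.to_datetime([
        "2019-08-31", "2019-09-30", "2019-10-31",
        "2019-08-31", "2019-09-30", "2019-10-31",
    ]),
    "spend": [10.0, 12.0, 11.0, 5.0, 6.0, 7.0],
})

# Hypothetical quarterly sales keyed by the entity ID.
quarterly_sales = pd.DataFrame({
    "entity_id": ["ENT-X", "ENT-Y"],
    "quarter_end": pd.to_datetime(["2019-10-31", "2019-10-31"]),
    "sales": [100.0, 50.0],
})

# Translate vendor IDs to entity IDs; `validate` raises if the map has duplicates.
spend = monthly_spend.merge(id_map, on="vendor_id", how="left", validate="many_to_one")

# Attach each month to the first quarter end on or after it, per entity.
aligned = pd.merge_asof(
    spend.sort_values("month_end"),
    quarterly_sales.sort_values("quarter_end"),
    left_on="month_end",
    right_on="quarter_end",
    by="entity_id",
    direction="forward",
)

# Roll monthly spend up to the quarterly reporting frequency.
quarterly_spend = aligned.groupby(["entity_id", "quarter_end"], as_index=False).agg(
    spend=("spend", "sum"),
    sales=("sales", "first"),
)
```

Using `merge_asof` with `direction="forward"` is one design choice among several; it avoids hard-coding fiscal calendars, since each monthly observation simply attaches to the next reporting date available for that entity.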

Diagram of the Predictive Model 


Source: FactSet

Calculating Data with the Predictive Data Model

As shown in the diagram above, we use the following steps to calculate data within our predictive model:

  1. Divide data into each end-of-month period as defined in the Cortera Spend Insights database
  2. Enter a single target supplier, represented by a FactSet Entity ID
  3. Find all customers of the target supplier from the FactSet Supply Chain Relationships database
  4. Find all suppliers to the customers identified in the previous step (these can therefore be treated as competitors of our target supplier) from the FactSet Supply Chain Relationships database
  5. Identify the market share of industry spending that can be attributed to the target supplier using FactSet’s RBICS with Revenue and Fundamentals

    As a first step, we leverage FactSet RBICS with Revenue to identify the percentage of revenue that each supplier (i.e., the target and its competitors) generates within the target industry in which it operates. We then leverage FactSet Fundamentals to retrieve each supplier’s total revenue for the period, which is multiplied by the revenue percentage to give the actual revenue each supplier generates in the target industry.

    Next, the target-industry revenue values are summed across all suppliers (i.e., the target and its competitors) for each quarterly period.

    Finally, we divide the target supplier’s industry revenue by the summed industry revenue for all suppliers. This gives us the percentage of market share attributed to the target supplier.
  6. Sum the total industry spending from the Cortera dataset for all customers for each monthly period
  7. Multiply the summed customer industry spending by the percentage of market share that the target supplier takes from industry; this identifies the initial estimated customer spending on the target supplier
  8. Identify the percentage of the target supplier’s sales figure for the quarter that is made up by the estimated customer spending on the target supplier
  9. Calculate the values from steps 1-8 to generate the prediction universe
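
Steps 3 through 7 above can be sketched in a few lines of Python for a single period. Every identifier and figure below is invented for illustration, and the function names are hypothetical; this is a simplified reading of the steps rather than the model’s actual implementation.

```python
# All IDs and figures below are invented for illustration only.
SUPPLY_CHAIN = [  # (supplier, customer) relationship pairs
    ("TARGET", "CUST1"),
    ("TARGET", "CUST2"),
    ("RIVAL1", "CUST1"),
    ("RIVAL2", "CUST2"),
]
# Fraction of each supplier's revenue earned in the target industry.
INDUSTRY_REV_PCT = {"TARGET": 1.0, "RIVAL1": 0.5, "RIVAL2": 0.25}
# Each supplier's total revenue for the quarter.
TOTAL_REVENUE = {"TARGET": 200.0, "RIVAL1": 400.0, "RIVAL2": 400.0}
# Each customer's spending in the target industry for one month.
CUSTOMER_INDUSTRY_SPEND = {"CUST1": 30.0, "CUST2": 50.0}


def customers_of(supplier, pairs):
    """Step 3: all customers of the given supplier."""
    return {c for s, c in pairs if s == supplier}


def suppliers_to(customers, pairs):
    """Step 4: all suppliers to those customers (the target and its competitors)."""
    return {s for s, c in pairs if c in customers}


def market_share(target, suppliers, rev_pct, total_rev):
    """Step 5: target's industry revenue divided by all suppliers' industry revenue."""
    industry_rev = {s: rev_pct[s] * total_rev[s] for s in suppliers}
    return industry_rev[target] / sum(industry_rev.values())


def estimated_spend_on_target(target, pairs, rev_pct, total_rev, spend):
    customers = customers_of(target, pairs)
    suppliers = suppliers_to(customers, pairs)
    share = market_share(target, suppliers, rev_pct, total_rev)
    total_spend = sum(spend[c] for c in customers)  # step 6
    return share * total_spend                      # step 7


estimate = estimated_spend_on_target(
    "TARGET", SUPPLY_CHAIN, INDUSTRY_REV_PCT, TOTAL_REVENUE, CUSTOMER_INDUSTRY_SPEND
)
```

With these invented figures, the suppliers’ industry revenues are 200, 200, and 100, so the target’s market share is 40%; applied to total customer industry spending of 80, the initial estimate of customer spending on the target is 32.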

One final factor to address is the target industry mentioned in steps 5 and 6. While the industry sector used in step 5 is defined by the FactSet Revere Business Industry Classification System (RBICS) dataset, the business segments used in step 6 are defined by Cortera; therefore, they are not directly comparable. Thankfully, the RBICS data is granular enough that we can match its sectors to the Cortera segments with a relative degree of accuracy. The table below demonstrates such linkages between the two classification sources.

Cortera Segment         | Cortera Segment ID | FactSet RBICS Sector Name          | FactSet RBICS Sector ID
Information Technology  | 31                 | Technology                         | 55
Rail                    | 5                  | Rail Transportation                | 4015102010
Air                     | 1                  | Air Passenger Transportation       | 40153010
Apparel & Outdoors      | 36                 | Apparel and Necessary Products     | 201010
Advertising             | 47                 | Marketing and Advertising Services | 10101010

Sources: Cortera, FactSet
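
In code, the linkages in the table above could be held in a simple lookup keyed by Cortera Segment ID. The sketch below copies its values directly from the table; the `SectorLink` type and the `rbics_for_segment` helper are hypothetical names introduced only for this example.

```python
from typing import Dict, NamedTuple


class SectorLink(NamedTuple):
    cortera_segment: str
    rbics_sector_name: str
    rbics_sector_id: str  # kept as a string so long IDs are never reformatted


# Keyed by Cortera Segment ID; values are copied from the table above.
CORTERA_TO_RBICS: Dict[int, SectorLink] = {
    31: SectorLink("Information Technology", "Technology", "55"),
    5: SectorLink("Rail", "Rail Transportation", "4015102010"),
    1: SectorLink("Air", "Air Passenger Transportation", "40153010"),
    36: SectorLink("Apparel & Outdoors", "Apparel and Necessary Products", "201010"),
    47: SectorLink("Advertising", "Marketing and Advertising Services", "10101010"),
}


def rbics_for_segment(segment_id: int) -> SectorLink:
    """Return the RBICS sector linked to a Cortera segment, or fail loudly."""
    try:
        return CORTERA_TO_RBICS[segment_id]
    except KeyError:
        raise KeyError(f"no RBICS link defined for Cortera segment {segment_id}") from None
```

Failing loudly on an unmapped segment is a deliberate choice here: a silent default would let spending from an unmatched segment leak into the wrong industry total.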

For the purposes of this article, we will only use target suppliers that receive 100% of their revenue from a specific sector. This means that the estimate of industry revenue is also a prediction of the supplier’s total revenue for each period.

Interpreting Results

The charts shown below display the predicted sales values for a target supplier (blue line graph) as created by our data model, along with the actual reported sales values for the supplier (orange bar chart). Each quarterly period is split into individual monthly periods with three possible time-period indicators shown at the bottom of each chart:

  • -2M: Represents minus two months from the fiscal quarter reporting date. Data shown within columns with this value represent the predicted quarter end sales value two months prior to the event. For example, the -2M column for October 2019 represents the estimate for the end of October 2019 using data from the end of August 2019.
  • -1M: Represents minus one month from the fiscal quarter reporting date. Data shown within columns with this value represent the predicted quarter end sales value one month prior to the event. For example, the -1M column for October 2019 represents the estimate for the end of October 2019 using data from the end of September 2019.
  • +0M: Represents the actual fiscal quarter reporting date. Data shown within columns with this value represent both the actual reported sales values and the predicted quarter-end sales value. For example, the +0M column for October 2019 shows actual and predicted data as of October 2019.
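
The relationship between the three indicators and the month whose data feeds each estimate can be expressed as a small date calculation. The helper names below are hypothetical, and months are represented by their first day purely to keep the arithmetic simple.

```python
from datetime import date
from typing import Dict

# Offsets, in months, of each indicator relative to the fiscal quarter reporting month.
OFFSETS = {"-2M": -2, "-1M": -1, "+0M": 0}


def shift_month(year: int, month: int, offset: int) -> date:
    """Return the first day of the month `offset` months away from (year, month)."""
    index = year * 12 + (month - 1) + offset
    return date(index // 12, index % 12 + 1, 1)


def data_as_of(report_year: int, report_month: int) -> Dict[str, date]:
    """Map each indicator to the month whose data is used for the estimate."""
    return {
        label: shift_month(report_year, report_month, offset)
        for label, offset in OFFSETS.items()
    }


# For a quarter reported in October 2019, "-2M" points at August 2019,
# "-1M" at September 2019, and "+0M" at October 2019.
october_2019 = data_as_of(2019, 10)
```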

Viewing Results

MongoDB, Inc. (Information Technology)


Sources: Cortera, FactSet

Ralph Lauren Corp. (Apparel & Outdoors)


Sources: Cortera, FactSet


We have shown how our data model serves as a performance predictor that can have direct bearing on the share price of a stock. The model is intended to demonstrate how interconnecting content, including the use of datasets that go beyond standard one-to-one mapping of securities (e.g., FactSet Supply Chain Relationships), can lead to deeper insights and more opportunities to strengthen alpha and risk models.

The information contained in this article is not investment advice. FactSet does not endorse or recommend any investments and assumes no liability for any consequence relating directly or indirectly to any action or inaction taken based on the information contained in this article.

Steve Markovits

VP, Content & Technology Strategy

Mr. Steve Markovits is Vice President, Product Strategy within the Content and Technology Solutions group at FactSet. In this role, he focuses on the data integration and analysis of third-party data providers as well as serving as a subject matter expert to inform decisions and define the content pipeline offered on the Open:FactSet Marketplace.