In recent years, there’s been much discussion surrounding data overtaking oil as the world’s most valuable resource. The Economist previously explored how today’s data economy—now primarily managed by internet giants such as Amazon, Facebook, and Microsoft—may require a new approach to antitrust rules.
As big data grows, it’s important to recognize that data can only reach its full potential if disciplined work is done to clean each source before it is used within existing data models or alongside external datasets. In fact, The New York Times estimates that 50-80% of a typical data scientist’s day-to-day activity consists of wrangling and cleaning data, rather than the more insightful tasks of analyzing, modeling, and visualizing it.
In finance, the growing demand for “alternative” data alongside more traditional “core” content often means that data models must draw on content sets from a variety of sources. As a result, time-consuming preparation must be carried out before the value of the data can be exposed, and resources can be spent evaluating content that is later deemed ineffective or of little value.
The following diagram provides an example of connectivity using two FactSet databases. FactSet’s Entity Master database connects company hierarchy relationships, people data, and other entity relationships. FactSet’s Symbology Master database connects financial instruments across three levels of granularity: global, regional, and listing level. Identifiers at each of these levels can then be further mapped to industry-standard identifiers such as SEDOLs, ISINs, tickers, and CUSIPs.
Source: FactSet
One of the biggest hurdles to overcome when combining datasets from separate vendors is identifier mapping, that is, accurately establishing relationships between the entities present in the data. This topic is widely discussed in network theory and graph theory, both of which model relationship connections such as parent and ultimate parent companies and their subsidiaries and affiliates, individuals and their roles within certain organizations, and business segment linkages, as well as company supply chain relationships such as suppliers, customers, and competitors.
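To make the hierarchy idea concrete, here is a minimal sketch of resolving subsidiaries to their ultimate parent so that records from different sources can be grouped under one entity. The entity IDs, the `PARENT_OF` mapping, and both function names are hypothetical placeholders for illustration, not FactSet identifiers or an actual Entity Master API.

```python
# Hypothetical child -> parent links; a real entity master would supply these.
PARENT_OF = {
    "SUB_A1": "HOLD_A",
    "SUB_A2": "HOLD_A",
    "HOLD_A": "ULT_A",
    "SUB_B1": "ULT_B",
}


def ultimate_parent(entity_id, parent_of):
    """Follow parent links until an entity with no recorded parent is reached."""
    seen = set()
    current = entity_id
    while current in parent_of:
        if current in seen:
            # Guard against malformed hierarchies that loop back on themselves.
            raise ValueError(f"cycle detected at {current}")
        seen.add(current)
        current = parent_of[current]
    return current


def group_by_ultimate_parent(entity_ids, parent_of):
    """Bucket entity IDs under the ultimate parent they roll up to."""
    groups = {}
    for entity_id in entity_ids:
        root = ultimate_parent(entity_id, parent_of)
        groups.setdefault(root, []).append(entity_id)
    return groups


groups = group_by_ultimate_parent(["SUB_A1", "SUB_A2", "SUB_B1"], PARENT_OF)
```

Grouping at the ultimate-parent level is only one possible choice. Depending on the model, aggregating at the immediate parent or at the individual subsidiary may be more appropriate.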
In this article, we will address how mapping external data sources to entity and symbology master databases can be used to enhance and expedite the data cleansing process and ensure that all data can be accurately and reliably aggregated. We will also address the value of employing more than one data source within a single data model and explore how combining various sources of data can often bring about additional insights that have the potential to be greater than the sum of their parts.
The remainder of this piece provides an example data model that aims to predict industry revenue figures for a chosen supplier ahead of official reporting. This could not be achieved with a single data source without relying on potentially biased public information from financial brokers or industry insiders.
This example will use nowcasting techniques that rely on Cortera’s Spend Insights dataset, which is traditionally used to predict financial health and future performance by identifying trends in purchasing, spending, and payment behaviors via a network of contributing participants.
For our predictive model, we will be using Cortera’s business-to-business data to identify spending patterns of the customers of a specific target supplier identified by FactSet’s Supply Chain Relationships. This will allow us to gain insight into how customer spending directly contributes to the revenue of the supplier.
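As a rough illustration of this idea, the sketch below sums the monthly spend of a target supplier’s known customers. All identifiers, the `CUSTOMERS_OF` relationship map, the `SPEND` records, and the function name are hypothetical. The real datasets have their own schemas, and the sketch assumes that spend has already been keyed to the same entity IDs as the relationship data.

```python
from collections import defaultdict

# Hypothetical supplier -> customer relationships (entity IDs are placeholders).
CUSTOMERS_OF = {"SUPPLIER_X": {"CUST_1", "CUST_2"}}

# Hypothetical monthly spend records: (customer entity ID, "YYYY-MM", amount).
SPEND = [
    ("CUST_1", "2019-01", 120.0),
    ("CUST_2", "2019-01", 80.0),
    ("CUST_3", "2019-01", 500.0),  # not a customer of SUPPLIER_X, so ignored
    ("CUST_1", "2019-02", 150.0),
]


def customer_spend_by_month(supplier_id, customers_of, spend_records):
    """Total spend per month across the supplier's known customers."""
    customers = customers_of.get(supplier_id, set())
    totals = defaultdict(float)
    for customer_id, month, amount in spend_records:
        if customer_id in customers:
            totals[month] += amount
    return dict(totals)


monthly = customer_spend_by_month("SUPPLIER_X", CUSTOMERS_OF, SPEND)
```

A monthly series like this is only an input signal. How it is scaled or modeled against reported revenue is a separate modeling decision that this sketch does not attempt.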
In addition to the Cortera data, we will use the following FactSet datasets to complete the model:
To make use of these datasets within a single model, we must reconcile Cortera’s proprietary identifiers with FactSet’s entity- and security-level identifiers. We must also account for the differing reporting frequencies of the datasets. Thanks to a clearly laid-out symbology mapping and date structure, we can make light work of this task using querying tools such as SQL or Pandas.
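The sketch below shows one way this consolidation might look in SQL, run here through Python’s built-in `sqlite3` module so that it is self-contained. The table names (`id_map`, `monthly_spend`), column names, and identifier values are assumptions made for illustration and do not reflect the actual Cortera or FactSet schemas. It also assumes calendar quarters, whereas a real model would need each company’s fiscal calendar.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical crosswalk from a vendor's proprietary ID to an entity ID.
    CREATE TABLE id_map (vendor_id TEXT PRIMARY KEY, entity_id TEXT);
    -- Hypothetical monthly observations keyed by the vendor's ID.
    CREATE TABLE monthly_spend (vendor_id TEXT, month TEXT, amount REAL);

    INSERT INTO id_map VALUES
        ('V100', 'ENT_A'), ('V200', 'ENT_A'), ('V300', 'ENT_B');
    INSERT INTO monthly_spend VALUES
        ('V100', '2019-01', 10.0),
        ('V200', '2019-02', 20.0),
        ('V100', '2019-04', 5.0),
        ('V300', '2019-03', 7.0);
""")

# Join on the crosswalk, then roll monthly rows up to calendar quarters.
# Integer division maps months 1-3 to quarter 1, 4-6 to quarter 2, and so on.
QUARTERLY_SQL = """
    SELECT m.entity_id,
           substr(s.month, 1, 4) AS yr,
           (CAST(substr(s.month, 6, 2) AS INTEGER) + 2) / 3 AS qtr,
           SUM(s.amount) AS spend
    FROM monthly_spend AS s
    JOIN id_map AS m ON m.vendor_id = s.vendor_id
    GROUP BY m.entity_id, yr, qtr
    ORDER BY m.entity_id, yr, qtr
"""
rows = conn.execute(QUARTERLY_SQL).fetchall()
conn.close()
```

Two vendor IDs map to `ENT_A` here, so their spend is combined once the join moves everything onto the shared entity key. The same join-then-aggregate pattern could equally be written with Pandas.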
Source: FactSet
As shown in the diagram above, we use the following steps to calculate data within our predictive model:
One final factor to address is the target industry mentioned in steps 5 and 6. While the industry sector stated in step 5 is defined by the FactSet Revere Business Industry Classification System (RBICS) dataset, the business segments used in step 6 are defined by Cortera; therefore, they are not directly comparable. Thankfully, the RBICS data is granular enough that we can match its sectors to the Cortera segments with a reasonable degree of accuracy. The table below demonstrates such linkages between the two classification sources.
Cortera Segment | Cortera Segment ID | FactSet RBICS Sector Name | FactSet RBICS Sector ID |
Information Technology | 31 | Technology | 55 |
Rail | 5 | Rail Transportation | 4015102010 |
Air | 1 | Air Passenger Transportation | 40153010 |
Apparel & Outdoors | 36 | Apparel and Necessary Products | 201010 |
Advertising | 47 | Marketing and Advertising Services | 10101010 |
Sources: Cortera, FactSet
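In code, the linkages in the table above could be held as a simple lookup. The following sketch copies the five ID pairs from the table. The dictionary and function names are illustrative assumptions, and a real crosswalk would cover many more segments.

```python
# Cortera Segment ID -> FactSet RBICS sector ID, copied from the table above.
# RBICS IDs are stored as strings because they are identifiers, not quantities.
SEGMENT_TO_RBICS = {
    31: "55",          # Information Technology -> Technology
    5: "4015102010",   # Rail -> Rail Transportation
    1: "40153010",     # Air -> Air Passenger Transportation
    36: "201010",      # Apparel & Outdoors -> Apparel and Necessary Products
    47: "10101010",    # Advertising -> Marketing and Advertising Services
}

# Reverse lookup, valid here because each RBICS ID appears only once.
RBICS_TO_SEGMENT = {rbics: seg for seg, rbics in SEGMENT_TO_RBICS.items()}


def rbics_for_segment(segment_id):
    """Return the mapped RBICS ID, or None when no linkage is listed."""
    return SEGMENT_TO_RBICS.get(segment_id)


def segment_for_rbics(rbics_id):
    """Return the mapped Cortera Segment ID, or None when no linkage is listed."""
    return RBICS_TO_SEGMENT.get(rbics_id)
```

Returning `None` for unmapped IDs keeps gaps in the crosswalk visible, so that records without a linkage can be reviewed rather than silently dropped.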
For the purposes of this article, we will only use target suppliers that receive 100% of their revenue from a specific sector. For these suppliers, industry revenue and total revenue are the same, so the model’s estimate of industry revenue is also a prediction of the total revenue figure for each period.
The charts shown below display the predicted sales values for a target supplier (blue line graph) as created by our data model, along with the actual reported sales values for the supplier (orange bar chart). Each quarterly period is split into individual monthly periods with three possible time-period indicators shown at the bottom of each chart:
Sources: Cortera, FactSet
Sources: Cortera, FactSet
We have shown how our data model serves as a performance predictor that can have a direct bearing on the share price of a stock. The model is intended to demonstrate how interconnecting content, including the use of datasets that go beyond standard one-to-one mapping of securities (e.g., FactSet Supply Chain Relationships), can lead to deeper insights and more opportunities to strengthen alpha and risk models.
The information contained in this article is not investment advice. FactSet does not endorse or recommend any investments and assumes no liability for any consequence relating directly or indirectly to any action or inaction taken based on the information contained in this article.