Accelerating Feature Engineering from Semi-Structured Data with Dask

Data Science and AI

By Drake Bushnell  |  November 16, 2020

The popularity of Python is growing across all industries, and it's not hard to see why. The community, breadth of open-source packages, and ease of use make it an excellent choice for beginners and experts alike. However, speed is a trade-off that frequently presents itself when working in a high-level language such as Python rather than a low-level language such as C. The impact becomes increasingly evident as data processing moves into semi-structured and unstructured data, and as the volume being analyzed grows beyond what a single machine can handle.

In my recent work with the FactSet XML Transcripts DataFeed, I explored how Dask could be applied to accelerate a traditionally cumbersome task while keeping me in my language of choice—Python. For anyone interested in reading more about why many Python users adopt Dask for big data routines, check out the Dask documentation.

The Problem 

Replicating the 2019 paper When Managers Change Their Tone, Analysts and Investors Change Their Tune requires parsing and analyzing thousands of earnings call transcripts. For my recreation of this analysis, I used the Russell 3000 from 2011 to 2020, which translated into approximately 90,000 documents. Assuming the process for retrieving, parsing, and analyzing each transcript takes one second, it would take roughly 25 hours just to build out this new dataset! In reality, each transcript will likely take far longer than one second and require more than one attempt. How can I do this more efficiently without losing days on end?

A Solution 

Enter the open-source project, Dask. For those who are unfamiliar, "Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love." My three drivers for choosing Dask were:

  • The compatibility with Pandas, NumPy, and scikit-learn significantly reduced the learning curve
  • Dask Schedulers are designed to scale computations on a laptop or in the cloud
  • Dask is pure Python, so no need to learn Spark, Scala, or PySpark

Let's review the data pipeline to get a clear picture of where Dask introduced efficiencies to this workflow. I will need to accomplish the steps shown here:

[Figure: the end-to-end data pipeline]

The pipeline I constructed leverages a variety of tools and packages, from Microsoft SQL Server to Apache Parquet. The remainder of this article will focus on how Dask can accelerate the analysis performed in the document parsing and feature engineering steps seen below.

The Process

As shown below, both stages contain multiple steps for each transcript:

[Figure: the steps within the document parsing and feature engineering stages]

One approach to this problem is to wrap these stages in a for-loop over the list of documents and process each transcript sequentially. Another is to leverage Dask and introduce parallelism, dividing the work among the resources available on a single machine or a cluster of them.

I chose to center this workflow on dask.bag, which is an excellent choice for processing semi-structured data in parallel. The code snippet below details the methods used, but here is a plain-English translation:

  • The dask.bag collection is built with from_sequence from a list of filenames for analysis
  • Document parsing is parallelized with the map method (i.e., the parsing function is applied to every transcript in parallel)
  • Feature engineering is parallelized with a second map call in the same way
  • The output is a dask.bag collection that can then be flattened and converted to a Pandas DataFrame

[Figure: code snippet of the dask.bag workflow]
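
Here is a minimal sketch of that workflow in code. The parse_transcript and engineer_features functions are hypothetical stand-ins for the parsing and feature-engineering logic described above; they are assumed to return a parsed record and a list of per-section feature dictionaries, respectively:

    import dask.bag as db

    def parse_transcript(filename):
        # Hypothetical stand-in for the real XML parsing step.
        return {"file": filename, "sections": ["management remarks", "q&a"]}

    def engineer_features(parsed):
        # Hypothetical stand-in for the feature calculations;
        # returns one record (dict) per transcript section.
        return [{"file": parsed["file"], "section": section, "tone": 0.0}
                for section in parsed["sections"]]

    filenames = ["transcript_1.xml", "transcript_2.xml",
                 "transcript_3.xml", "transcript_4.xml"]

    # Build the bag from the filenames, map each stage over it in parallel,
    # then flatten the per-transcript lists and convert to a DataFrame.
    bag = (
        db.from_sequence(filenames)
          .map(parse_transcript)
          .map(engineer_features)
    )
    df = bag.flatten().to_dataframe().compute()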

To enable parallelism, the user defines a Dask client. A local cluster is created with the commands provided below. Here, we can see that this cluster has four cores, four workers, and about 17 GB of RAM across which to distribute our workload:

[Figure: local cluster summary showing four cores, four workers, and about 17 GB of RAM]
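
As a rough sketch of that setup (the worker and thread counts here are assumptions based on the figures quoted above), a local cluster and client can be created with dask.distributed:

    from dask.distributed import Client, LocalCluster

    # Assumed settings to mirror the cluster described above: four workers,
    # one thread each; Dask splits the machine's memory across the workers.
    cluster = LocalCluster(n_workers=4, threads_per_worker=1)
    client = Client(cluster)

    # Printing the client shows the dashboard link along with the core,
    # worker, and memory totals available to the scheduler.
    print(client)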

With Dask's visualize method, we can see the breakdown of how these steps will be processed once invoked. Building on the example above, this is the execution plan for the four transcripts we submitted:

[Figure: task graph for the four submitted transcripts]
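
To reproduce a graph like this, the visualize method is available on any Dask collection. Continuing the sketch above (so bag refers to the hypothetical collection defined there), and assuming the optional graphviz dependency is installed:

    # Render the task graph for the four example transcripts without executing it;
    # the graph is written to an image file.
    bag.flatten().to_dataframe().visualize(filename="transcript_tasks.png")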

Other Dask features, such as dask.delayed, can be used to parallelize this workflow even further. For now, let's benchmark a traditional for-loop against the dask.bag collection to see the benefits of parallelism in action.
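
As a rough way to run that comparison, the sketch below times the sequential loop against the parallel bag, reusing the hypothetical functions and bag collection from the earlier snippet; the actual numbers will depend on your hardware and the number of documents:

    import time

    # Sequential baseline: a plain for-loop over the same filenames.
    start = time.perf_counter()
    records = []
    for filename in filenames:
        records.extend(engineer_features(parse_transcript(filename)))
    loop_seconds = time.perf_counter() - start

    # Parallel version: let the Dask scheduler spread the bag across workers.
    start = time.perf_counter()
    df = bag.flatten().to_dataframe().compute()
    bag_seconds = time.perf_counter() - start

    print(f"for-loop: {loop_seconds:.2f}s  dask.bag: {bag_seconds:.2f}s")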

[Figure: runtime comparison of the for-loop and dask.bag approaches as the number of documents grows]

The Takeaways

The chart above highlights two key points. First, Dask will not always improve performance. There are several cases where native Python is better suited to a given task; until we begin to work with hundreds of documents, the performance gains are minimal.

Second, Dask reduces the time spent processing transcript data. It truly shines at this scale, where we see nearly a 50% drop in calculation time when splitting the work across four workers.

For data-hungry processes, whether in the research or production stage, calculation time matters. For researchers, leveraging tools like Dask means the ability to iterate through ideas faster and work with more data in less time, with minimal changes to the underlying code, which can be priceless. In production, it means executing your data pipeline more efficiently or increasing the amount of data used without sacrificing timeliness.

Drake Bushnell

Vice President, Product Strategist, CTS Strategy & Development

The information contained in this article is not investment advice. FactSet does not endorse or recommend any investments and assumes no liability for any consequence relating directly or indirectly to any action or inaction taken based on the information contained in this article.