Modern Data Lakes for Data-Driven Organizations

Written by Marcin Kulakowski | Oct 1, 2020

The nature of data has changed drastically over the past three decades. Most business-critical decisions made by large organizations in the 1990s were based on transactional data generated by business processes and systems. This type of structured data included data stored in data warehouses, Online Transaction Processing (OLTP) systems (e.g., Oracle, DB2, SQL Server), and other types of common data repositories.

Introduction

Infrastructures were first built around the need to manage transactional data, before data began to be generated by interactions between people and machines. That newer, semi-structured or unstructured data included web pages, social media, videos, and music. It was then followed by sensor data created primarily by machine monitoring, servers, networks, thermostats, lights, and other devices.

If we take a look at transactional data that is traditionally created by business applications and compare it to the big data of today, there are several differences in the three Vs: volume, velocity, and variety. The biggest challenge in big data initiatives is connecting employees to the right data and making sure they understand how it can be used to make better business decisions. As a result, it is very difficult for businesses to have the vision and expertise to build and operate these platforms.

With more and more data being generated, the need for advanced analytics has increased significantly. What began with descriptive analytics in the transactional world has evolved into the analytics we have today, which ranges from business intelligence dashboards with descriptive analysis all the way to machine learning for predictive analysis. It’s becoming increasingly important to consider how new and innovative technologies will continue to shape the future of data analysis.

Shortcomings of Data Warehouses

The ever-changing nature of data and analytics has caused different technologies to emerge. Data warehouses were previously built on relational technologies that served as central repositories for all the data collected by an organization’s business systems. Data is extracted, transformed, and loaded (ETL) into a data warehouse that supports the reporting, analytics, and mining of curated datasets.

Traditional data warehouses mostly take a top-down approach; they are designed first and then implemented. Since these warehouses have limited scalability, they cannot properly support all the data users and workloads in today’s world. Managing the uptick in available data and an infrastructure that supports it can be incredibly challenging. Inadequate elasticity, in the form of stiff, inflexible architectures and rigid maintenance costs, also makes things difficult. Data warehouses are forced to keep servers and software running on a 24/7 basis, and with limited volumes of data transfers and an inability to consolidate siloed datasets, this is difficult to sustain.

In such an environment, using unstructured information for deep learning and analytics is simply not feasible. In a data warehouse-centric world, if the defined structure of the data in the warehouse did not fit into your analysis or you wanted to analyze and discover unstructured massive volumes of information, you were simply out of luck. Business analysts and other users needed to rely on data professionals for all data requests and were forced to wait until raw data was collected, processed, and loaded into the data warehouse in the desired form before they could start using it. This was a very time-consuming, costly, and cumbersome process.

Strengths of Data Lakes

Over the past few years, data lakes have emerged as data management solutions that can satisfy the needs of big data and provide new levels of advanced analytics. They accept data in all formats from a variety of sources and can provide a flexible environment for making intelligent, data-driven business decisions.

There are two ways that data lakes address the shortcomings of the data warehouse. First, data lakes store data in structured, semi-structured, and unstructured formats. Second, the data schema is decided upon reading the data (i.e., rather than upon loading or writing it), so you can always modify it. This is especially helpful when you need extra information or structures from the raw data, and it increases a company’s overall agility. It also means that the data is quickly available because it does not have to be curated before it can be consumed by the processing engines. Since data lakes are so cost-effective, there is little need to throw away or archive the raw data; it is always available if needed.

With greater flexibility and accessibility and a wide range of benefits, including reduced costs, virtually unlimited storage, and various capabilities for processing both structured and unstructured data, it’s no wonder that companies are looking into emerging data warehouse technologies that are built for the cloud. One of these is Snowflake.
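To make the schema-on-read idea described above more concrete, here is a minimal sketch in plain Python. It is not tied to any particular data lake product; the sample events, field names, and the `read_with_schema` helper are all hypothetical and exist only for illustration. The point it tries to show is that the raw records are stored untouched, and each analysis applies its own schema at read time.

```python
import json

# Hypothetical raw events, stored exactly as they arrived.
# No schema was enforced when they were written.
RAW_EVENTS = [
    '{"device": "thermostat-1", "temp_c": 21.5, "ts": "2020-10-01T08:00:00"}',
    '{"device": "thermostat-2", "temp_c": "19.0", "ts": "2020-10-01T08:00:05", "room": "lab"}',
    '{"device": "light-7", "state": "on", "ts": "2020-10-01T08:00:09"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema (field name -> converter) while reading raw JSON lines.

    Records that lack one of the schema's fields are skipped at read time;
    nothing is rejected or altered in the raw data itself.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        if not all(field in record for field in schema):
            continue
        rows.append({field: convert(record[field]) for field, convert in schema.items()})
    return rows


# First analysis: only temperature readings are of interest.
temperature_schema = {"device": str, "temp_c": float}
temperatures = read_with_schema(RAW_EVENTS, temperature_schema)

# Later analysis: a different schema over the same untouched raw data.
activity_schema = {"device": str, "ts": str}
activity = read_with_schema(RAW_EVENTS, activity_schema)
```

Because the schema lives with the reader rather than the storage layer, the second analysis did not require reloading or restructuring anything; it simply read the same raw records with a different set of expectations.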

Snowflake is a fully managed Software as a Service (SaaS) data-lake platform built specifically to leverage the cloud. It scales storage and compute separately through its comprehensive on-demand resource management. In addition to reducing database maintenance and performance tuning, Snowflake also provides significant SQL compatibility, a simple process to perform data restatements via SQL merges, and other DML support (i.e., updates, deletions, and inserts). It also includes data virtualization features and supports a range of languages and frameworks such as Python, R, and Spark. However, one of the most important strengths of Snowflake is that the platform is cloud-agnostic; it is available on Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). This alone has allowed Snowflake to accomplish what other firms have not in terms of performance, cost, flexibility, deployment options, and ease of use.
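As a rough illustration of what a data restatement via a merge does, the sketch below mimics the semantics in plain Python: rows whose key matches are updated, and rows with a new key are inserted. This is not Snowflake code and does not use any Snowflake API; the `merge_restatement` function, the `order_id` key, and the sample rows are assumptions made up for this example.

```python
def merge_restatement(target, restated, key="order_id"):
    """Return a new table where matching keys are updated and new keys inserted.

    This imitates the matched/not-matched behavior of a SQL merge on
    in-memory lists of dicts. It is an illustration, not a database call.
    """
    merged = {row[key]: dict(row) for row in target}
    for row in restated:
        # Matched: overwrite the changed columns. Not matched: insert the row.
        merged[row[key]] = {**merged.get(row[key], {}), **row}
    return sorted(merged.values(), key=lambda r: r[key])


# Hypothetical target table and a batch of restated rows.
sales = [
    {"order_id": 1, "amount": 100.0},
    {"order_id": 2, "amount": 250.0},
]
restated_sales = [
    {"order_id": 2, "amount": 245.0},  # corrected amount for an existing order
    {"order_id": 3, "amount": 80.0},   # order that was missing from the target
]

result = merge_restatement(sales, restated_sales)
```

In a real warehouse the same outcome would typically be expressed as a single SQL statement, so the restatement is applied in one pass rather than as separate update and insert steps.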

Having cloud-based data-lake technologies such as Snowflake in place will enable companies to reach the fourth V of big data, which is value. This refers to the ability to discover, analyze, and transform data from high-volume or cross-platform systems into meaningful patterns and trends. Data lakes provide organizations with much more flexibility and agility, two characteristics that are critical to building a data-driven enterprise. In many ways, data warehouses are turning into the data marts of the past. Furthermore, the cloud plays a big role in reducing operational complexity so that every enterprise can pursue a data lake architecture as part of its data infrastructure strategy. Enterprises can eliminate much of the complexity of operations and at the same time enjoy the cost-of-ownership benefits while building a data lake architecture.

However, with great power comes great responsibility. Only through proper data governance, data quality management, and metadata management can organizations achieve the fifth V, which is veracity. This is where trust in the accuracy, quality, and content of the organization’s information assets lies.

Goals and Principles for Data Governance

A successful data strategy requires the following interrelated guidelines:

Aligning business strategies with data strategies
Managing the people, process, policies, and culture around the enterprise data
Leveraging and managing data for strategic advantage by applying master data management, data quality, data architecture, and data modeling
Coordinating and integrating different data sources with planning, inventory, data integration, and metadata management

To apply a structured data governance framework, four principles are needed:

Organization and People
Process and Workflow
Data Management and Measures
Culture and Communication

With those four principles in place, and with the help of technology and tools, we can establish the vision and strategy needed to tackle business goals and objectives and to solve data issues and challenges.

Too often, there are organizational and cultural silos that limit the data sharing between business organizations. That’s why companies need to break down organizational silos with good communication and governance that will encourage information sharing. Often, they need to provide the mechanism for coming together and sharing, collaborating, and learning from one another.

Data governance processes and workflows are different for data lakes and for the data sources that feed them. Data lakes, where big data exploration usually happens, are lightly governed. Data sources, on the other hand, are heavily governed with structured data models, metadata, data lineage, and so forth.

Conclusion

Metadata management and governance differ between a data lake and the variety of data sources that feed it. It’s important to consider other exploratory and rapidly changing environments such as Open Source Development, Open Data, and other platforms based on your needs. Data feeds from various sources should also have more traditional metadata management that applies data models, data lineage, and business metadata. With that in place, we should be able to answer questions such as: What does this term mean? Who is the owner or steward of the data? Who can I go to with a question?
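One way to picture the business metadata that answers those three questions is a small catalog keyed by term. The sketch below is only an illustration: the `CatalogEntry` fields, the `describe` helper, the sample term, and the contact address are all hypothetical and do not refer to any specific metadata tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """Business metadata for one term (all fields are illustrative)."""
    term: str
    definition: str      # What does this term mean?
    owner: str           # Who owns the data?
    steward: str         # Who looks after it day to day?
    contact: str         # Who can I go to with a question?
    source_system: str   # Where the data feed originates (lineage hint)


# Hypothetical catalog with a single documented term.
CATALOG = {
    entry.term: entry
    for entry in [
        CatalogEntry(
            term="net_revenue",
            definition="Gross revenue minus refunds and discounts.",
            owner="Finance",
            steward="Finance data steward",
            contact="finance-data@example.com",
            source_system="billing_db",
        ),
    ]
}


def describe(term):
    """Answer the three governance questions for a term, if it is documented."""
    entry = CATALOG.get(term)
    if entry is None:
        return f"'{term}' is not documented yet; it needs an owner and a definition."
    return (
        f"{entry.term}: {entry.definition} "
        f"Owner: {entry.owner}. Steward: {entry.steward}. "
        f"Questions: {entry.contact}."
    )


documented = describe("net_revenue")
undocumented = describe("churn_rate")
```

The undocumented case matters as much as the documented one: a term with no entry is a visible governance gap rather than a silent one.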

By establishing these guidelines and a sound data strategy, your modern cloud-based data lakes can achieve greater value and veracity, and the benefits can be prioritized and implemented in a seamless, phased-in approach that accommodates the specific requests of any organization. Transforming your business into a data-driven organization and culture is not easy, and you may encounter some roadblocks along the way.

Going forward, companies seeking to implement models need to find ways to monitor and charge for use of the data resources involved. The fact that this is much easier to do in a cloud infrastructure points to where data-driven organizations and data lakes are headed.
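A simple way to think about monitoring and charging for use is a chargeback calculation over metered usage. The sketch below assumes each team's compute hours and storage are already being recorded; the rates, team names, usage records, and the `chargeback` function are invented for illustration and do not reflect any provider's actual pricing.

```python
from collections import defaultdict

# Assumed internal rates, chosen only to make the arithmetic easy to follow.
RATE_PER_COMPUTE_HOUR = 2.00
RATE_PER_TB_MONTH = 23.00

# Hypothetical metered usage records for one billing period.
USAGE = [
    {"team": "marketing", "compute_hours": 10.0, "storage_tb": 1.0},
    {"team": "finance", "compute_hours": 4.0, "storage_tb": 0.5},
    {"team": "marketing", "compute_hours": 5.0, "storage_tb": 0.0},
]


def chargeback(usage):
    """Sum compute and storage charges per team from metered usage records."""
    totals = defaultdict(float)
    for record in usage:
        totals[record["team"]] += (
            record["compute_hours"] * RATE_PER_COMPUTE_HOUR
            + record["storage_tb"] * RATE_PER_TB_MONTH
        )
    return {team: round(cost, 2) for team, cost in totals.items()}


bill = chargeback(USAGE)
# marketing: (10 * 2 + 1 * 23) + (5 * 2) = 53.0
# finance:    4 * 2 + 0.5 * 23          = 19.5
```

The calculation itself is trivial; the hard part is having reliable per-team usage records in the first place, which is where metered cloud infrastructure helps.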
