Data Lakes, Data Lakehouses, and Everything In-Between

By Van Diamandakis
Feb 28, 2021

With all the buzz about the impending Databricks IPO – following the company’s recent massive cash infusion and staggering $28 billion valuation – it’s worth taking a minute to better understand what Databricks’ tech really represents. In this post, we’ll take a look at data lakes, data warehouses, and how Databricks’ data lakehouse paradigm proposes to offer the next step in the evolution of cloud data analytics.


What’s a Data Warehouse?

We all know what a database is. Databases are the foundation of our data-driven world, and have been for decades. But as early as the 1980s, researchers at IBM recognized the limits of databases for enterprise business decision-making. Databases excelled at handling a large volume of simple queries against a single data source – but far less so at analytical queries that draw on data from multiple sources.


The data warehouse was conceived to solve this problem. Its architecture was designed to draw data from multiple sources within an organization and to facilitate reporting and analysis. In more technical terms, the data warehouse was built to analyze relational data from line-of-business applications and transactional systems, and was optimized for fast SQL queries that support operational reporting and analysis.
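To make the idea concrete, here’s a minimal sketch of a warehouse-style reporting query. It uses an in-memory SQLite database purely as a stand-in for a real warehouse engine, and the sales/regions schema is an illustrative assumption rather than anything from a particular product:

```python
import sqlite3

# In-memory SQLite stands in for a real warehouse engine;
# the schema (sales, regions) is an illustrative assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE regions (region_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (
        sale_id   INTEGER PRIMARY KEY,
        region_id INTEGER REFERENCES regions(region_id),
        amount    REAL,
        sold_on   TEXT
    );
    INSERT INTO regions VALUES (1, 'EMEA'), (2, 'APAC');
    INSERT INTO sales VALUES
        (1, 1, 120.0, '2021-01-15'),
        (2, 1,  80.0, '2021-02-03'),
        (3, 2, 200.0, '2021-02-10');
""")

# A typical operational-reporting query: aggregate relational data
# drawn from transactional systems, grouped by a business dimension.
for name, total in conn.execute("""
    SELECT r.name, SUM(s.amount) AS total_sales
    FROM sales s JOIN regions r USING (region_id)
    GROUP BY r.name
    ORDER BY total_sales DESC
"""):
    print(name, total)
```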


However, data warehouses were designed for historical analysis, not real-time analysis. Nor were they designed to ingest and analyze the massive quantities of data that modern digital enterprises produce.


What’s a Data Lake?


Data lakes were designed from the ground up to hold big data in its raw form. Drawing on multiple sources, these repositories were architected to store structured, semi-structured, or unstructured data. By storing data in flexible, raw formats, data lakes keep it available for a wide range of future and ad hoc usage scenarios.
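As a minimal sketch of that “store now, decide later” approach, the example below lands raw JSON events in a date-partitioned directory and applies structure only at read time. The directory layout and event fields are illustrative assumptions:

```python
import json
from datetime import date
from pathlib import Path

# Land raw events exactly as they arrive ("schema on read");
# the lake root and event fields are illustrative assumptions.
lake_root = Path("datalake/raw/clickstream")
partition = lake_root / f"ingest_date={date.today().isoformat()}"
partition.mkdir(parents=True, exist_ok=True)

events = [
    {"user": "u1", "action": "view", "page": "/home"},
    {"user": "u2", "action": "click"},  # fields may vary per event
]
with open(partition / "events.json", "w") as f:
    for event in events:
        f.write(json.dumps(event) + "\n")

# Structure is imposed only at read time, for whichever use case comes up.
with open(partition / "events.json") as f:
    events_back = [json.loads(line) for line in f]
clicks = [e for e in events_back if e.get("action") == "click"]
print(clicks)
```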


Uniquely, data lakes were also designed to leverage clusters of inexpensive and scalable commodity hardware. This enabled a cost-effective scale of storage that was previously not viable.


However, as more and more data accumulated in data lakes, enterprises realized that storing data and using it were two entirely different challenges. Data began backing up in “data swamps” because organizations were unable to match the performance, security, or business-tool integration of their (more expensive but more manageable) data warehouses.


Here’s a quick summary of the differences between data lakes and data warehouses:

| | Data Lake | Data Warehouse |
| --- | --- | --- |
| Best for which users? | Data scientists | Business users |
| How is data structured? | Raw | Processed |
| How accessible is data? | Highly accessible and quick to update | More complicated and costly to make changes |
| What’s the purpose of the data? | For future use | Non-real-time analysis |

So, What’s a Data Lakehouse?


First off, although Databricks has adopted the term “data lakehouse” and is branding it, the company is actually not the first to use it. AWS used the term “lake house” in 2019 when discussing a change to Amazon Redshift Spectrum. Before that, in 2017, Snowflake claimed that one of its customers was using Snowflake to combine structured and schema-less data processing into what that customer called a “data lakehouse.”


Terminology aside, the data lakehouse was conceived to fuse the low-cost storage benefits of a data lake with the data management and data structure features of a data warehouse. The lakehouse paradigm blurs the lines between the two because it enables schema to be enforced over curated data subsets in specific data lake zones or in associated analytical databases – while still maintaining the flexibility and cost advantages of cloud data storage. The key enabler here is a structured transactional layer over the lake – in Databricks’ case, Delta Lake.
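To illustrate, here’s a minimal sketch of schema enforcement with the open-source delta-spark package (assuming a local Spark installation; the table path and columns are illustrative). An append whose schema doesn’t match the table is rejected unless schema evolution is explicitly enabled:

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Local Spark session wired up for Delta Lake (per the delta-spark docs).
builder = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/delta/events"  # illustrative local path

# Write a curated subset with an enforced schema and ACID guarantees.
spark.createDataFrame([(1, "view"), (2, "click")], ["id", "action"]) \
    .write.format("delta").mode("overwrite").save(path)

# An append with a mismatched schema is rejected (schema enforcement)...
bad = spark.createDataFrame([(3, "view", "extra")], ["id", "action", "note"])
try:
    bad.write.format("delta").mode("append").save(path)
except Exception as e:
    print("rejected:", type(e).__name__)

# ...unless schema evolution is opted into explicitly.
bad.write.format("delta").mode("append") \
    .option("mergeSchema", "true").save(path)
```

Note that the failed append never corrupts the table: Delta’s transaction log commits writes atomically, which is exactly the warehouse-style guarantee layered over inexpensive cloud object storage.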


What Does All This Mean?


The emergence of the data lakehouse paradigm seems to deliver the best of data warehouses and data lakes. This convergence should offer enterprises greater simplicity and a broader range of applications – and may well dramatically change the cloud-based analytics landscape.


Yet data lakehouses are cloud-based by design, and it’s worth recalling that many enterprises still lack fully cloud-based datasets. There are multiple paths by which organizations can move to the cloud. Adopting advanced technology like WANdisco’s LiveData Migrator can enable a nuanced, sophisticated and – most importantly – non-blocking cloud migration approach. Moreover, for companies wary of vendor lock-in, our LiveData for MultiCloud solves the growing challenge of keeping data available and consistent across multiple cloud environments in different geographies.
