Data Lakes, Data Lakehouses, and Everything In-Between

By Van Diamandakis, Mar 01, 2021

With all the buzz about the impending Databricks IPO – following the company’s recent massive cash infusion and staggering $28 billion valuation – it’s worth taking a minute to better understand what Databricks’ tech really represents. In this post, we’ll take a look at data lakes, data warehouses, and how Databricks’ data lakehouse paradigm proposes to offer the next step in the evolution of cloud data analytics.


What’s a Data Warehouse?

We all know what a database is. Databases are the foundation of our data-driven world, and have been for decades. But as early as the 1980s, researchers at IBM realized the limits of databases in enterprise business decision-making. Databases were excellent at handling a large volume of simple queries against one data source – but less so at handling a massive quantity of queries about data from multiple sources.


The data warehouse was conceived to solve this. The data warehouse architecture was designed to draw data from multiple sources within an organization, and facilitate reporting and analysis. In more technical terms, the data warehouse was designed to analyze relational data from line-of-business applications and transactional systems, and was optimized for fast SQL queries that support operational reporting and analysis.
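To make the idea concrete, here is a minimal sketch of that pattern using Python’s built-in sqlite3 module as a stand-in for a real warehouse. The two source systems, the table names (dim_customer, fact_order) and the sample rows are all hypothetical, chosen only to show data from separate systems being combined for one SQL reporting query.

```python
import sqlite3

# Hypothetical extracts from two separate line-of-business systems.
crm_customers = [(1, "Acme", "EMEA"), (2, "Globex", "APAC"), (3, "Initech", "EMEA")]
billing_orders = [(101, 1, 250.0), (102, 2, 100.0), (103, 1, 50.0), (104, 3, 75.0)]


def build_warehouse():
    """Load both sources into one relational store (in-memory SQLite here)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT)"
    )
    conn.execute(
        "CREATE TABLE fact_order (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
    )
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", crm_customers)
    conn.executemany("INSERT INTO fact_order VALUES (?, ?, ?)", billing_orders)
    conn.commit()
    return conn


def revenue_by_region(conn):
    """A typical reporting query: join the sources and aggregate."""
    return conn.execute(
        """
        SELECT c.region, SUM(o.amount)
        FROM fact_order AS o
        JOIN dim_customer AS c ON c.customer_id = o.customer_id
        GROUP BY c.region
        ORDER BY c.region
        """
    ).fetchall()


if __name__ == "__main__":
    for region, total in revenue_by_region(build_warehouse()):
        print(region, total)
```

The point of the sketch is that the structure is decided up front: both sources are shaped into tables before any question is asked, which is what makes the reporting query short and fast.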


However, data warehouses were not designed for real-time analysis, only historical. Nor were they designed to ingest and handle analysis of the massive quantities of data that modern digital enterprises produce.


What’s a Data Lake?


Data lakes were designed from the ground up to hold big data in its raw form. Drawing on multiple sources, these repositories were architected to store structured, semi-structured, or unstructured data. By enabling data to be stored in a flexible format, data lakes facilitate multiple and diverse future or ad hoc usage scenarios.
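As a rough illustration, the sketch below uses a local temporary directory as a stand-in for a data lake. The file names, fields and helper functions are hypothetical. Data lands exactly as it arrives, and structure is applied only when something reads it, an approach commonly described as schema-on-read.

```python
import csv
import json
import tempfile
from pathlib import Path


def land_raw_files(lake_dir):
    """Store source data as-is; nothing is validated or reshaped on write."""
    lake = Path(lake_dir)
    # Semi-structured: JSON lines, where the second record lacks the "ms" field.
    (lake / "clicks.jsonl").write_text(
        '{"user": "u1", "page": "/home", "ms": 120}\n'
        '{"user": "u2", "page": "/pricing"}\n'
    )
    # Structured: a CSV export.
    (lake / "orders.csv").write_text("order_id,user,amount\n101,u1,250.0\n102,u2,100.0\n")
    # Unstructured: free text.
    (lake / "support_note.txt").write_text("Customer u2 asked about invoice 102.")


def read_clicks(lake_dir):
    """Pick fields and defaults at read time, when a use case needs them."""
    rows = []
    with open(Path(lake_dir) / "clicks.jsonl") as f:
        for line in f:
            record = json.loads(line)
            rows.append(
                {"user": record["user"], "page": record["page"], "ms": record.get("ms")}
            )
    return rows


def read_order_total(lake_dir):
    """A different consumer interprets a different file in its own way."""
    with open(Path(lake_dir) / "orders.csv", newline="") as f:
        return sum(float(row["amount"]) for row in csv.DictReader(f))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as lake_dir:
        land_raw_files(lake_dir)
        print(read_clicks(lake_dir))
        print(read_order_total(lake_dir))
```

Note that nothing stopped the incomplete click record from being written. That flexibility is the attraction of the approach, and each reader has to cope with whatever it finds.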


Uniquely, data lakes were also designed to leverage clusters of inexpensive and scalable commodity hardware. This enabled a cost-effective scale of storage that was previously not viable.


However, as more and more data accumulated in data lakes, enterprises realized that storing data and using it were two entirely different challenges. Data began backing up in data swamps because organizations were unable to match the performance, security or business tool integration of their (more expensive but more manageable) data warehouses.


Here’s a quick summary of the differences between data lakes and data warehouses:


 

                                  Data Lake                                Data Warehouse

Best for which users?             Data scientists                          Business users

How is data structured?           Raw                                      Processed

How accessible is data?           Highly accessible and quick to update    More complicated and costly to make changes

What’s the purpose of the data?   For future use                           Non-real time analysis

So, What’s a Data Lakehouse?


First off, although Databricks has adopted the term “data lakehouse” and is branding it, the company is actually not the first to use it. AWS used the term “lake house” in 2019 when discussing a change in Amazon Redshift Spectrum. Before that, in 2017, Snowflake claimed that one of its customers was using Snowflake to combine structured and schema-less data processing into what that customer called a “data lakehouse.”


Terminology aside, the data lakehouse was conceived to fuse the low-cost storage benefits of a data lake with the data management and data structure features of a data warehouse. The lakehouse paradigm blurs the lines between the two because it enables schema to be enforced over curated data subsets in specific data lake zones or in associated analytical databases – while still maintaining the flexibility and cost advantages of cloud data storage. The key enabler here is a structured transactional layer over the stored files, which in Databricks’ lakehouse is Delta Lake.
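The toy sketch below illustrates the general idea of a transactional layer over plain files. It is not Delta Lake’s API or file format. The ToyTable class, the SCHEMA dictionary and the _log directory layout are all assumptions made up for this example. Rows are checked against a declared schema before anything is written, and a batch only becomes visible to readers once a commit entry is added to the log.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical schema for one curated table.
SCHEMA = {"order_id": int, "user": str, "amount": float}


class ToyTable:
    """Cheap file storage plus a commit log that decides what readers see."""

    def __init__(self, root):
        self.root = Path(root)
        self.log = self.root / "_log"
        self.log.mkdir(parents=True, exist_ok=True)

    def append(self, rows):
        # Schema enforcement: validate the whole batch before writing anything.
        for row in rows:
            if set(row) != set(SCHEMA):
                raise ValueError(f"unexpected columns: {sorted(row)}")
            for column, expected_type in SCHEMA.items():
                if not isinstance(row[column], expected_type):
                    raise ValueError(f"{column} must be {expected_type.__name__}")
        version = len(list(self.log.glob("*.json")))
        data_file = self.root / f"part-{version:05d}.json"
        data_file.write_text(json.dumps(rows))
        # The commit entry is written last; uncommitted data files are ignored.
        (self.log / f"{version:05d}.json").write_text(json.dumps({"add": data_file.name}))
        return version

    def read(self):
        # Readers follow the log, so they only ever see committed batches.
        rows = []
        for entry in sorted(self.log.glob("*.json")):
            data_name = json.loads(entry.read_text())["add"]
            rows.extend(json.loads((self.root / data_name).read_text()))
        return rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        table = ToyTable(root)
        table.append([{"order_id": 101, "user": "u1", "amount": 250.0}])
        try:
            table.append([{"order_id": "102", "user": "u2", "amount": 100.0}])
        except ValueError as err:
            print("rejected:", err)
        print(table.read())
```

A production system also has to deal with concurrent writers, updates, deletes and much larger files, none of which this sketch attempts. It only shows how a warehouse-style guarantee can sit on top of lake-style storage.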


What Does All This Mean?


The emergence of the data lakehouse paradigm seems to deliver the best of data warehouses and data lakes. This convergence should offer enterprises greater simplicity, more benefits and applications – and may well dramatically change the cloud-based analytics landscape.


Yet data lakehouses are cloud-based by design. And it’s worth recalling that many enterprises still lack fully cloud-based datasets. There are multiple paths by which organizations can move to the cloud. Adopting advanced technology like WANdisco’s LiveData Migrator can enable a nuanced, sophisticated and – most importantly – non-blocking cloud migration approach. Moreover, for companies wary of vendor lock-in, our LiveData for MultiCloud addresses the growing challenge of keeping data available and consistent across multiple cloud environments in different geographies.
