Data Lakes, Data Lakehouses, and Everything In-Between

By Van Diamandakis
Feb 28, 2021

With all the buzz about the impending Databricks IPO – following the company’s recent massive cash infusion and staggering $28 billion valuation – it’s worth taking a minute to better understand what Databricks’ tech really represents. In this post, we’ll take a look at data lakes, data warehouses, and how Databricks’ data lakehouse paradigm proposes to offer the next step in the evolution of cloud data analytics.


What’s a Data Warehouse?

We all know what a database is. Databases are the foundation of our data-driven world, and have been for decades. But as early as the 1980s, researchers at IBM recognized the limits of databases for enterprise business decision-making. Databases excelled at handling a large volume of simple queries against a single data source – but far less so at analytical queries that draw on data from multiple sources.


The data warehouse was conceived to solve this problem. Its architecture was designed to draw data from multiple sources within an organization and to facilitate reporting and analysis. In more technical terms, the data warehouse was built to analyze relational data from line-of-business applications and transactional systems, and was optimized for fast SQL queries that support operational reporting and analysis.
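To make the idea concrete, here’s a minimal sketch of a warehouse-style reporting query. It uses an in-memory SQLite database purely as a stand-in for a real warehouse engine, and the sales/regions schema is an illustrative assumption rather than anything from a particular product:

```python
import sqlite3

# In-memory SQLite stands in for a real warehouse engine;
# the schema (sales, regions) is an illustrative assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE regions (region_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (
        sale_id   INTEGER PRIMARY KEY,
        region_id INTEGER REFERENCES regions(region_id),
        amount    REAL,
        sold_on   TEXT
    );
    INSERT INTO regions VALUES (1, 'EMEA'), (2, 'APAC');
    INSERT INTO sales VALUES
        (1, 1, 120.0, '2021-01-15'),
        (2, 1,  80.0, '2021-02-03'),
        (3, 2, 200.0, '2021-02-10');
""")

# A typical operational-reporting query: aggregate relational data
# drawn from transactional systems, grouped by a business dimension.
for name, total in conn.execute("""
    SELECT r.name, SUM(s.amount) AS total_sales
    FROM sales s JOIN regions r USING (region_id)
    GROUP BY r.name
    ORDER BY total_sales DESC
"""):
    print(name, total)
```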


However, data warehouses were designed for historical analysis, not real-time analysis. Nor were they designed to ingest and analyze the massive quantities of data that modern digital enterprises produce.


What’s a Data Lake?


Data lakes were designed from the ground up to hold big data in its raw form. Drawing on multiple sources, these repositories were architected to store structured, semi-structured, or unstructured data. By storing data in flexible, raw formats, data lakes keep it available for a wide range of future and ad hoc usage scenarios.
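As a minimal sketch of that “store now, decide later” approach, the example below lands raw JSON events in a date-partitioned directory and applies structure only at read time. The directory layout and event fields are illustrative assumptions:

```python
import json
from datetime import date
from pathlib import Path

# Land raw events exactly as they arrive ("schema on read");
# the lake root and event fields are illustrative assumptions.
lake_root = Path("datalake/raw/clickstream")
partition = lake_root / f"ingest_date={date.today().isoformat()}"
partition.mkdir(parents=True, exist_ok=True)

events = [
    {"user": "u1", "action": "view", "page": "/home"},
    {"user": "u2", "action": "click"},  # fields may vary per event
]
with open(partition / "events.json", "w") as f:
    for event in events:
        f.write(json.dumps(event) + "\n")

# Structure is imposed only at read time, for whichever use case comes up.
with open(partition / "events.json") as f:
    events_back = [json.loads(line) for line in f]
clicks = [e for e in events_back if e.get("action") == "click"]
print(clicks)
```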


Uniquely, data lakes were also designed to leverage clusters of inexpensive and scalable commodity hardware. This enabled a cost-effective scale of storage that was previously not viable.


However, as more and more data accumulated in data lakes, enterprises realized that storing data and using it were two entirely different challenges. Data began backing up in “data swamps” because organizations were unable to match the performance, security, or business-tool integration of their (more expensive but more manageable) data warehouses.


Here’s a quick summary of the differences between data lakes and data warehouses:

| | Data Lake | Data Warehouse |
| --- | --- | --- |
| Best for which users? | Data scientists | Business users |
| How is data structured? | Raw | Processed |
| How accessible is data? | Highly accessible and quick to update | More complicated and costly to make changes |
| What’s the purpose of the data? | For future use | Non-real-time analysis |

So, What’s a Data Lakehouse?


First off, although Databricks has adopted the term “data lakehouse” and is branding it, the company is actually not the first to use it. AWS used the term “lake house” in 2019 when discussing a change to Amazon Redshift Spectrum. Before that, in 2017, Snowflake claimed that one of its customers was using Snowflake to combine structured and schema-less data processing into what that customer called a “data lakehouse.”


Terminology aside, the data lakehouse was conceived to fuse the low-cost storage benefits of a data lake with the data management and data structure features of a data warehouse. The lakehouse paradigm blurs the lines between the two because it enables schema to be enforced over curated data subsets in specific data lake zones or in associated analytical databases – while still maintaining the flexibility and cost advantages of cloud data storage. The key enabler here is a structured transactional layer over the lake – in Databricks’ case, Delta Lake.
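To illustrate, here’s a minimal sketch of schema enforcement with the open-source delta-spark package (assuming a local Spark installation; the table path and columns are illustrative). An append whose schema doesn’t match the table is rejected unless schema evolution is explicitly enabled:

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Local Spark session wired up for Delta Lake (per the delta-spark docs).
builder = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/delta/events"  # illustrative local path

# Write a curated subset with an enforced schema and ACID guarantees.
spark.createDataFrame([(1, "view"), (2, "click")], ["id", "action"]) \
    .write.format("delta").mode("overwrite").save(path)

# An append with a mismatched schema is rejected (schema enforcement)...
bad = spark.createDataFrame([(3, "view", "extra")], ["id", "action", "note"])
try:
    bad.write.format("delta").mode("append").save(path)
except Exception as e:
    print("rejected:", type(e).__name__)

# ...unless schema evolution is opted into explicitly.
bad.write.format("delta").mode("append") \
    .option("mergeSchema", "true").save(path)
```

Note that the failed append never corrupts the table: Delta’s transaction log commits writes atomically, which is exactly the warehouse-style guarantee layered over inexpensive cloud object storage.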


What Does All This Mean?


The emergence of the data lakehouse paradigm seems to deliver the best of data warehouses and data lakes. This convergence should offer enterprises greater simplicity and a broader range of applications – and may well dramatically change the cloud-based analytics landscape.


Yet data lakehouses are cloud-based by design, and it’s worth recalling that many enterprises still lack fully cloud-based datasets. There are multiple paths by which organizations can move to the cloud. Adopting advanced technology like WANdisco’s LiveData Migrator can enable a nuanced, sophisticated and – most importantly – non-blocking cloud migration approach. Moreover, for companies wary of vendor lock-in, our LiveData for MultiCloud solves the growing challenge of keeping data available and consistent across multiple cloud environments in different geographies.
