Enterprise Confessions: What Their Data Lake Was Missing
Posted in Industry on Apr 24, 2018
Capturing and understanding data is business-critical: for today’s enterprises, ‘no data’ equals ‘no business.’ In response, companies have rushed to fill huge data lakes, designed to pool every scrap of useful information as an enterprise resource.
Yet most CDOs, CTOs and EVPs of Data & Cloud agree that data lakes are not living up to their initial promise.
How can we build strategies and best practices that help extract true, lasting value from data lake investments and limit risks?
DISCOtecher sat down with Alex Shutt and Sam Simpson recently to hear their experiences of working on data lake initiatives, and how to gain the maximum benefit from data lake projects.
DISCOtecher: Hi Alex, Sam, great to have you with us here today. Is the term “data lake” another way of saying a well-managed repository?
SAM: Actually, most data lakes are not well-managed, so no, not really. Because storage costs keep falling, people are tempted to save as much data as they can, but with no real plan for how they’re going to use it in the future. The constant, I think, is that data loses its relevance over time, and without a clear plan for extracting value from it – and the tools to do so – the storage admins just end up managing rapidly growing data volumes that never actually accomplish anything: what’s sometimes called a ‘data graveyard.’
“Because of the decrease in storage costs, people are tempted to save as much data as they can, but with no real plan about how they’re going to use it in the future.”
DISCOtecher: It’s interesting to think of data losing value over time. Can you tell me more about that? I thought data was generally worth hanging on to.
ALEX: I think it’s not so much the data itself that loses value; it’s that its relevance and utility decrease over time. For example, you may have stored years and years of customer transactions, but if you lose the metadata around them – such as the timing and location of each transaction – then even though the data is still theoretically rich, you’ve weakened your ability to understand it and extract value for your business.
As data ages, similar lessons apply; maybe your business processes or systems have changed, which reduces the relevance of historic transactions. Perhaps data protection rules will limit how long you can store the details and metadata of expired customer accounts. The individual transaction data never changes, but if you can’t easily map the transactions to customer profiles, the insight you can gain from keeping those transactions is markedly less. The waters have become murkier, if you will! But if you can run your analytic queries while the accounts are still active, the trends and patterns observed are still valid and useful even after the source data is gone.
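Alex’s point – that transactions keep their analytic value only as long as they can still be mapped to surrounding metadata – can be made concrete with a small sketch. The dataset names, fields, and the “purged profile” below are all hypothetical, not from the interview; the idea is just that once a customer profile is gone, that customer’s transactions can no longer be attributed in a query.

```python
# Hypothetical transaction records kept long-term in the data lake.
transactions = [
    {"txn_id": 1, "customer_id": "c1", "amount": 120.0},
    {"txn_id": 2, "customer_id": "c2", "amount": 45.5},
    {"txn_id": 3, "customer_id": "c3", "amount": 80.0},
]

# Customer metadata, subject to retention rules.
# Profile "c3" has been purged, so its transactions can no longer be segmented.
profiles = {
    "c1": {"region": "EMEA"},
    "c2": {"region": "APAC"},
}

def revenue_by_region(txns, meta):
    """Aggregate revenue per region; track value we can no longer attribute."""
    totals = {}
    unmapped = 0.0
    for t in txns:
        prof = meta.get(t["customer_id"])
        if prof is None:
            # The raw transaction still exists, but without its metadata
            # it contributes nothing to the regional analysis.
            unmapped += t["amount"]
        else:
            totals[prof["region"]] = totals.get(prof["region"], 0.0) + t["amount"]
    return totals, unmapped

totals, unmapped = revenue_by_region(transactions, profiles)
```

Run the same aggregation before the metadata expires and the insight (regional trends) survives even after the source rows are deleted – which is exactly the “query while the accounts are still active” point above.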
SAM: Yes, totally agree. Cloudera started using the term Enterprise Data Hub (EDH), replacing the negative connotation of a data lake as a place where data goes to die and introducing the idea of active data inquiry.
ALEX: That’s right. Rather than pouring all your data into the lake, unmanaged – and maybe seen as simply an opex burden – the ‘hub’ term seems more dynamic, positioning data at the heart of the business.
“How a data lake is positioned within an organization is, from what we’ve seen, the main factor in whether it’s successful.”
DISCOtecher: I’m learning a lot so far. What common pitfalls catch customers, and how can these be avoided?
ALEX: How a data lake is positioned within an organization is, from what we’ve seen, the main factor in whether it’s successful. As Sam said, one thing that people do with a data lake is just pour all their data into it without planning what to do with it. This tends to happen when the data lake is a central, IT-led initiative. It’s like, “If we load all these data sources together, someone will be able to get value from it.”
But doing that without engaging the business units properly means the data that gets added is likely to be of questionable quality – it isn’t organized in the same way the business works, it can’t be easily indexed or searched, it’s not granular enough, and so on. These projects tend to be the ones that wither away and eventually get closed down, leaving a bad perception of the whole concept of data lakes. The other common approach is a business-led project, typically driven by a single department or geographic unit.
The last few years have seen a lot of excitement around big data and Hadoop, which definitely ties in with the Gartner hype cycle. The thought process is very much “if we don’t have it, our competitors will, and then they’ll have a better product, better marketing, because they’ll have more insight than we will, so, quick, create a data lake.”
In the early days of Hadoop we’d have organizations directly asking us what we knew about their competitors’ data strategies; they were paranoid they were the ones behind the curve. So things get rushed, and internal processes (or the IT department!) get bypassed. When we then get involved, the discussion goes: “We’ve built our data lake, it’s going well, we think we can get some value out of it, but we now realize we need governance, and we’re trying to add backup and disaster recovery, but it’s harder than we thought. Can you help us?”
“We’ve spoken with global organizations where the IT department has discovered multiple data lake / Hadoop projects across the business, inevitably using different technologies, and they now need new replication technologies that can replicate all these diverse data sets back into a controlled, central “hub”, where they can then apply the required processes and governance.”
DISCOtecher: Yikes. That sounds pretty tricky.
ALEX: Yes, it is! These projects start with a small POC, with a small amount of data, to see what they can do with the idea, and then more and more projects and departments start pouring their data in and querying it. Great! Then very quickly the data lake has become a huge, unwieldy beast that everyone depends on, without any of the normal infrastructure around it to protect it, scale it, and operationalize it properly. At that point things can get messy, and it’s hard to put back in the box, as it were. This is true of any software being added to existing infrastructure, but data lakes have so many stakeholders and competing requirements that the entire project can get deadlocked.
We’ve spoken with global organizations where the IT department has discovered multiple data lake / Hadoop projects across the business, inevitably using different technologies, and they now need new replication technologies that can replicate all these diverse data sets back into a controlled, central “hub”, where they can then apply the required processes and governance. Whichever starting point you come from, retrofitting the gaps after the event is much harder than setting it all up at the beginning.
DISCOtecher: That sounds like a very smart approach.
ALEX: In brief, a data lake initiative needs business and IT to work together from the start. This helps clarify what governance, security and data protection you need to put in place — then you have a solid foundation for the entire business to build on. Companies that manage that first time are, perhaps understandably, very rare, but it’s getting better. CDOs and CIOs are now starting their second or third data lake projects, having learnt lessons at previous companies!
“In brief, a data lake initiative needs business and IT to work together from the start. This helps clarify what governance, security and data protection you need to put in place — then you have a solid foundation to let the entire business build on.”
DISCOtecher: I met with a data lake subject matter expert about a year ago, and I remember them saying the same thing: that data lake development must be driven by the customer’s environment. If I understand you correctly, would you say that defining your use case up front, knowing that you will most likely need a smart data replication approach, is a recipe for success?
ALEX: Definitely. The greatest successes come from companies that haven’t rushed the initial data lake deployment and let it get out of control. The IT team has engaged across all lines of business from the start – it’s everyone’s data lake, even when it’s new and empty. At that point, with the business case, objectives and expectations established, you can put in the replication, security and related enterprise tools that form the foundation of successful data lake infrastructure.
DISCOtecher: Do you feel like it’s a budget constraint that holds back the speed at which data lake strategies are executed, is it the balancing act of other projects, or is it both?
ALEX: I think it’s more the other projects. Typically they’re small teams, often drawn from other departments and other technical realms – database administration and so on – and they’re run ragged trying to keep up with their big data effort.
DISCOtecher: What is the most forward-thinking data lake strategy you’ve seen to date?
ALEX: The strategy must be based on the recognition that a successful data lake needs to be self-service: departments and analysts in all parts of the business should be able to access the data lake to run their own queries and get their own value from it, in a controlled, governed, secure way. The key idea is that a successful data lake isn’t about the data that is or isn’t stored in it; it’s a framework, a set of processes and resources that allow the business to exploit data easily, rapidly and securely. Data lake success is all about enabling a data-led culture in the business units, with the IT team providing that capability.
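The “self-service, but governed” idea Alex describes can be sketched in miniature: every query passes through a policy check before it touches a dataset. The department names, dataset names and policy table below are purely illustrative assumptions, not any particular product’s API – real deployments would use a dedicated governance layer (e.g. the access controls in their Hadoop distribution) rather than hand-rolled checks.

```python
# Hypothetical policy table: which datasets each department may read.
POLICIES = {
    "marketing": {"web_clicks", "campaigns"},
    "finance": {"transactions", "campaigns"},
}

def run_query(department, dataset, query_fn, data):
    """Gatekeeper: refuse the query unless policy allows this access."""
    allowed = POLICIES.get(department, set())
    if dataset not in allowed:
        raise PermissionError(f"{department} may not read {dataset}")
    # Self-service: the caller brings its own query logic.
    return query_fn(data)

rows = [10, 20, 30]  # stand-in for a dataset partition
total = run_query("finance", "transactions", sum, rows)  # permitted by policy
```

The point of the design is that analysts write their own `query_fn` (self-service) while IT owns only the policy table and the gate (governance) – the two concerns stay decoupled.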
DISCOtecher: Thank you so much for your time and thoughtful discussion on the topic, Alex and Sam.