
26 Mar 2019 | DISCOtecher

Emerging trends in disaster recovery for data lakes

In this Q&A, Katherine Sheehan, Senior Solutions Architect at WANdisco, and DISCOtecher, WANdisco’s Director of Product & Channel Marketing, discuss current trends in disaster recovery and high availability for on-premises and cloud Hadoop data lake deployments.


DISCOtecher: In recent years, we’ve seen Hadoop mature and business adoption increase. How has that affected disaster-recovery and high-availability strategies?

Katherine: For Hadoop workloads, it's clear that cloud object storage is the future, and Gartner's 2017 Hype Cycle for Data Management predicts that adoption of Hadoop in the cloud will grow dramatically in the coming years. However, a large number of businesses currently rely on on-premises Hadoop deployments to drive their day-to-day operations, so the evolution toward the cloud will be a gradual process.

Because these on-premises Hadoop platforms often support critical workloads, unplanned downtime can have a significant impact on a company's bottom line. This has led to increasingly stringent service-level agreements (SLAs) for availability. In the IT department, these SLAs shift the emphasis from "Can we restore our data in a recovery scenario?" to "How quickly can we restore our data in a recovery scenario?"

Naturally, the more data in a Hadoop environment, the longer the recovery process can take; in some cases, many weeks. For businesses with SLAs measured in hours, disaster recovery and high availability are rapidly moving up the agenda, and this is a theme that surfaces again and again in our conversations with clients.


DISCOtecher: For many IT departments, open-source tools like DistCp must seem like attractive options for delivering effective disaster recovery. But are they actually fit for purpose for large, enterprise data sets?

Katherine: It’s absolutely true that tools like DistCp are a natural starting point for Hadoop disaster recovery. However, this batch-based approach to data protection has its limits—especially for large amounts of data.

When it comes to recovery point objectives (RPO), it's crucial to understand what open-source tooling can't do. If you rely on snapshots of Hadoop data taken at intervals throughout the day, a recovery scenario means the business loses any changes made since the last completed batch window. What's more, creating snapshots of large amounts of data puts significant stress on the environment, which can reduce productivity for teams trying to use the cluster for analytics workloads.
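To make the batch-window exposure concrete, here's a back-of-envelope sketch of worst-case RPO for snapshot-and-copy protection. The figures are illustrative assumptions, not numbers from this article:

```python
# Illustrative worst-case RPO for a snapshot-and-copy (DistCp-style)
# protection scheme. All figures below are hypothetical examples.

snapshot_interval_hours = 6   # assume snapshots taken four times a day
copy_duration_hours = 2       # assume the batch copy takes two hours

# Worst case: disaster strikes just as a batch copy is about to finish,
# so the last *replicated* snapshot is a full interval plus one copy
# window old. Everything written since then is lost.
worst_case_rpo_hours = snapshot_interval_hours + copy_duration_hours

print(f"Worst-case data loss window: {worst_case_rpo_hours} hours")
```

With these assumed numbers, an outage at the wrong moment could cost eight hours of changes, which is exactly the gap an hours-level SLA cannot tolerate.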

Open-source approaches to disaster recovery also tend to require manual, time-consuming processes to rebuild a Hadoop cluster and get it running again. And all the while the cluster is offline, the business is potentially running up costs of thousands of dollars per day in lost productivity, missed sales opportunities, or even compliance penalties.





DISCOtecher: So how are forward-thinking organizations solving these challenges?

Katherine: Meeting demanding SLAs for availability and data protection usually requires a continuous-availability configuration, which presents significant technical challenges for large-scale Hadoop workloads.

The typical approach is to distribute copies of data across two or more clusters, to ensure that an outage at one location won't result in loss of data or downtime for analytics services. However, the key challenge with distributed high-availability configurations like this is data consistency. With multiple clusters working on the same data, it's crucial to ensure that changes are continuously replicated across every location, which is impossible to achieve with open-source tools.
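The consistency problem comes down to ordering: if two clusters apply the same conflicting writes in different orders, their states diverge. The toy sketch below (my illustration, not WANdisco's DConE protocol) shows why an agreed total order of operations is what keeps replicas identical:

```python
# Toy illustration of why multi-cluster replication needs an agreed
# ordering of operations. This is a conceptual sketch, not DConE.

def apply_ops(ops):
    """Apply key/value writes in the given order; later writes win."""
    state = {}
    for key, value in ops:
        state[key] = value
    return state

# Two clusters receive the same two conflicting writes to one file,
# but in different local arrival orders (names are hypothetical).
ops_seen_by_a = [("report", "v-from-london"), ("report", "v-from-nyc")]
ops_seen_by_b = [("report", "v-from-nyc"), ("report", "v-from-london")]

# Naive replication: each cluster applies in arrival order -> divergence.
assert apply_ops(ops_seen_by_a) != apply_ops(ops_seen_by_b)

# Consensus-style replication: all sites first agree on one total order
# (a deterministic sort stands in for the agreement protocol here),
# so every replica converges to the same state.
assert apply_ops(sorted(ops_seen_by_a)) == apply_ops(sorted(ops_seen_by_b))
```

In a real system the "agreement" step is a distributed consensus protocol rather than a sort, but the principle is the same: once every site applies the same operations in the same order, the replicas cannot drift apart.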

Businesses might look at commercial offerings such as Cloudera Backup and Disaster Recovery (BDR), but tools like this are, at their core, just simple extensions of the functionality already offered by DistCp. As a result, Cloudera BDR and similar tools fail to solve the challenge of delivering consistency across large and fast-changing data sets on multiple clusters.

WANdisco Fusion is the only platform on the market today that delivers the continuous replication between clusters that ensures data consistency. That’s because WANdisco Fusion is powered by a unique technology called DConE, which uses consensus to keep Hadoop and object store data accessible, accurate, and consistent in distributed locations, across any mixed storage environment. Businesses can use WANdisco Fusion to ensure consistent data between on-premises Hadoop clusters and cloud instances, and even between multi-cloud platforms.





DISCOtecher: You mentioned that Hadoop in the cloud will dominate in the future. Can a hybrid cloud disaster recovery strategy make that transition easier?

Katherine: Absolutely. While we know that increasing numbers of businesses are targeting a cloud-only strategy, many enterprises are not yet ready to move away from their on-premises Hadoop deployments. Enabling a hybrid cloud scenario, with on-premises production and synchronized cloud operations, provides a perfect stepping stone. By ensuring both on-premises and cloud data sets are consistent, the business can run applications and analytics on the platform that most suits their needs. When they’re ready to make the switch to cloud, the latest version of their data is already there and ready to use.

And this isn’t just a theory: some of the world’s largest enterprises are already using WANdisco Fusion to enable hybrid- and multi-cloud disaster recovery to protect their data. For example, when AMD wanted to ensure its semiconductor manufacturing operations could run even if its primary data center went offline, the company engaged WANdisco to help deliver a solution.




As well as helping AMD move 100 TB of on-premises data to Azure without disruption, WANdisco Fusion now enables the company to continuously replicate production data to the cloud. Because data is identical in both environments, AMD’s data-driven manufacturing processes can continue as normal, even if the primary site suffers an outage. If you want to get the inside track on the project, you can watch the video here.


DISCOtecher: If people want to learn more about enabling maximum availability for their business-critical Hadoop services, what’s the next step?

Katherine: If you want to learn more about LiveData strategies, this video from WANdisco’s VP Product Management, Paul Scott-Murphy, is a great place to start. Or if you’re ready to achieve a future-ready approach to disaster recovery and high availability, click here to schedule a consultation with one of our experts, or if you’re near San Francisco, come find us at the Strata Data Conference from March 25-28!


About the author


At WANdisco, we value our relationships with industry experts and partners, and we appreciate their educational material. This blog series is made up of their opinions and ideas relevant to our followers, and we support and respect their personal viewpoints.

Twitter: @WANdisco


Katherine Sheehan is a Senior Solutions Architect at WANdisco responsible for the development of North American channel partnerships.


About WANdisco

WANdisco is the LiveData company that empowers enterprises to revolutionize their IT infrastructure with its groundbreaking distributed coordination engine (DConE) in the WANdisco Fusion platform, enabling companies to generate hyperscale economics with the same IT budget — across multiple development environments, data centers, and cloud providers. WANdisco Fusion powers hundreds of the Global 2000, including Cisco Systems, Allianz, AMD, Juniper, Morgan Stanley and more. With significant OEM relationships with IBM and Dell EMC and go-to-market partnerships with Amazon Web Services, Cisco, Microsoft Azure, Google Cloud, Oracle, Alibaba and other industry titans — WANdisco is igniting a LiveData movement worldwide.

For more information on WANdisco, visit http://www.wandisco.com
