How to achieve a disruption-free migration to Azure Data Lake Storage Gen2
By WANdisco, Feb 14, 2019 in Industry
Paul Scott-Murphy, VP of Product Management at WANdisco, and DISCOtecher, Director of Product and Channel Marketing at WANdisco, discuss how clients can get to ADLS Gen2 without downtime for their critical systems.
DISCOtecher: Can you please share why WANdisco is so excited about the ADLS Gen2 GA announcement?
Paul: The general availability of ADLS Gen2 is a significant announcement from Microsoft, and being able to take advantage of it without downtime or disruption will be important for every organization.
The benefits that come from bringing your large-scale data sets to ADLS Gen2 are huge, and Microsoft recognizes the importance of a strong ecosystem of partners to help make this happen. WANdisco is excited to be leading the way with solutions for hybrid architectures and migration strategies for data at any scale. WANdisco solutions eliminate the risk of disruption to business applications, and are compatible with the big data technologies that enterprises are using today. We make it easy to adopt ADLS Gen2.
DISCOtecher: Why is WANdisco such an important Microsoft partner for Azure customers?
Paul: As Microsoft customers know, ADLS is designed to support petabyte-scale analytics workloads with massive throughput, both of which are key capabilities for enterprise data lakes. Because ADLS Gen2 offers the familiar benefits of ADLS Gen1—such as file system semantics, structured security and scale—and the performance of Azure Blob Storage, customers can boost the cost-efficiency, performance and scale of their analytics workloads substantially by migrating from ADLS Gen1 to ADLS Gen2.
Realizing these benefits requires customers to migrate in a way that doesn’t disrupt the critical analytics workloads already running in ADLS Gen1—and we know that’s not a trivial problem.
The WANdisco Fusion platform helps Azure customers avoid the data consistency challenges of migrating large, fast-changing data sets, and that’s why WANdisco is a stand-out partner for Microsoft.
DISCOtecher: Can you give us some examples of these data consistency challenges, and how WANdisco Fusion solves them?
Paul: Enterprise IT teams face the reality that it’s not possible to move petabytes of data overnight: you have to do it over time, without stopping applications that depend on that data. While this migration is happening, you also need to continually update your cloud data to reflect any changes made on-premises. From that vantage point, the question of data consistency becomes crucial, and businesses need to ask themselves how to achieve this while replicating their changing data.
If you have analytics applications in the cloud and on-premises that need to access the same data, strong—not eventual—consistency is critical to ensure that all your users are working with the same information.
There are vendors that offer eventual consistency via change data capture tools, which is effectively a form of transaction log replay. But only WANdisco Fusion’s distributed coordination engine (DConE) enforces consistency by coordinating activities performed by big data applications against their data. Without this approach, it is impossible to prevent conflicting changes from being made to the same data in different locations (e.g. by one application working against ADLS Gen1 and other applications using ADLS Gen2). Reconciling conflicts between data at scale can be essentially impossible, so avoiding them in the first place is critical to solving the challenges that come with data migration.
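To make the difference concrete, here is a toy sketch of why an agreed order of operations avoids conflicts. It is not DConE and does not reflect how WANdisco Fusion is implemented: the `Replica` class, the two helper functions and the file path are all hypothetical, and the "agreement" step is simply a fixed list where a real system would run a distributed coordination protocol.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A toy storage endpoint that maps a path to its latest content."""
    name: str
    files: dict = field(default_factory=dict)

    def apply(self, path, content):
        self.files[path] = content


def uncoordinated(writes_a, writes_b):
    """Each site applies its own writes first, then replays the other site's."""
    a, b = Replica("gen1"), Replica("gen2")
    for path, content in writes_a + writes_b:
        a.apply(path, content)
    for path, content in writes_b + writes_a:
        b.apply(path, content)
    return a.files, b.files


def coordinated(writes_a, writes_b):
    """Both sites apply every write in one agreed order."""
    a, b = Replica("gen1"), Replica("gen2")
    # Assumption: any single order works, as long as both replicas use it.
    agreed_order = writes_a + writes_b
    for path, content in agreed_order:
        a.apply(path, content)
        b.apply(path, content)
    return a.files, b.files


if __name__ == "__main__":
    gen1_writes = [("/sales/q1.csv", "written-via-gen1")]
    gen2_writes = [("/sales/q1.csv", "written-via-gen2")]
    print("uncoordinated:", uncoordinated(gen1_writes, gen2_writes))
    print("coordinated:  ", coordinated(gen1_writes, gen2_writes))
```

In the uncoordinated case the two replicas finish with different content for the same path, which is the kind of conflict that later has to be reconciled. In the coordinated case both replicas hold identical content because they applied the same writes in the same order.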
At scale, data replication for hybrid architectures must address multiple levels of information—kind of like a layer cake. WANdisco Fusion works over these layers.
DISCOtecher: Thank you — that’s an interesting metaphor. So if the first layer of the cake is the data itself, what’s the next one?
Paul: The next layer of information is your metadata, and it’s just as important as the data itself. A good example of this is security metadata: the policies and permissions that you apply to your data to control and limit access. If the security of your data is important, you need to keep this type of metadata consistent in exactly the same way, rather than leaving your data exposed while you attempt to rebuild those policies later. Aside from the huge amount of work that would involve, it’s a prime risk area that could leave you with some very serious exposure.
If we think specifically about migrating data from ADLS Gen1 to ADLS Gen2, this is an important consideration.
Crucially, ADLS Gen1 customers who want to use ADLS Gen2 can’t afford the cost of rebuilding security policies for their data—so WANdisco Fusion really is a key enabler for these types of migration projects.
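As an illustration of what "keeping security metadata consistent" means in practice, the following minimal sketch compares permission entries between a source and a target and reports any drift. It assumes the permissions have already been exported into plain dictionaries. The `acl_drift` function, the paths and the entry strings are hypothetical and are not part of any WANdisco or Azure API.

```python
def acl_drift(source_acls, target_acls):
    """Return the paths whose permission entries differ between two stores.

    Both arguments map a path to a set of entry strings such as
    "user:alice:rwx" (an illustrative format, assumed for this sketch).
    """
    drift = {}
    for path in sorted(set(source_acls) | set(target_acls)):
        src = set(source_acls.get(path, ()))
        dst = set(target_acls.get(path, ()))
        if src != dst:
            drift[path] = {
                "missing_on_target": src - dst,
                "extra_on_target": dst - src,
            }
    return drift


if __name__ == "__main__":
    gen1 = {
        "/finance": {"user:alice:rwx", "group:audit:r-x"},
        "/logs": {"group:ops:rwx"},
    }
    gen2 = {
        "/finance": {"user:alice:rwx"},
        "/logs": {"group:ops:rwx"},
    }
    for path, diff in acl_drift(gen1, gen2).items():
        print(path, diff)
```

An empty result means the two sides carry the same permission entries for every path. Any entry listed under `missing_on_target` is a policy that would have to be rebuilt by hand if the metadata were not replicated along with the data.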
DISCOtecher: What are the other layers in the data replication model, and why are they important to consider in hybrid cloud operations?
Paul: There are multiple layers of metadata—a technology like Apache Hive is a good example. Hive allows you to apply structure to your data for analytics purposes. This metadata lets applications query big data without the need to transform its structure first. WANdisco Fusion can replicate this metadata so that hybrid architectures can take advantage of information at scale using standard analytic toolsets across multiple environments.
The final layer consists of the big data applications themselves. Bringing applications to a new storage platform can be risky if you’ve taken a big-bang approach and cut everything over at the same time, because there’s no way to fall back if something goes wrong. Having the ability to eliminate that risk by being able to test applications individually and over an extended period with a hybrid architecture is going to be critical. With WANdisco Fusion, this strategy becomes simple, and can account for failures either with individual applications, or even entire clusters.
DISCOtecher: So WANdisco Fusion also provides DR for analytics workloads?
Paul: That’s right—as you might expect, the strong data consistency that WANdisco Fusion provides also makes it a natural choice to deliver high-availability and disaster-recovery capabilities in the Azure cloud. Pranav Rastogi, Program Manager, Azure Big Data, Microsoft, touches on these capabilities in a blog, which you can read here.
DISCOtecher: If any Microsoft customer wants to learn more about using WANdisco Fusion to support a simple, disruption-free migration from ADLS Gen1 to ADLS Gen2, what should they do next?
Paul: WANdisco has been delivering solutions for migrating big data environments for many years. WANdisco Fusion is a natural fit for customers wanting to do the same for ADLS Gen2. If you want to dig into some more of the technical details, our associates on the Azure ADLS team have published a terrific blog here, or you can watch our WANdisco Fusion demo here. Or if you’re ready to explore how the solution works in the real world, click here to learn how AMD uses WANdisco Fusion to protect critical business data against disaster.