Three Big Data Migration Risks and Costs: WANdisco vs. Cloudera
Posted in Company on Dec 20, 2019
Whether you are moving an on-premises Hadoop installation to the cloud, moving data between clouds, or moving data between cloud regions, your team is working to select the lowest cost and lowest risk migration approach.
To inform your choice of migration approach, Frank Cohen, an application performance and architecture expert, published performance test results comparing two approaches in October 2019. His paper, “Comparing WANdisco LiveMigrator to Cloudera BDR For Moving Data To The Cloud A Methodology and Results,” provides a fact base for your evaluation of cloud migration approaches.
Cloudera BDR migrations create unnecessary business risk
Frank concluded that there is considerable risk in a Cloudera BDR migration approach. The top three migration risks of this technology are that it:
- disrupts operation of on-premises applications,
- delivers inconsistent data and risks data loss, and
- incurs costly overhead with extensive time to migrate, manual checking and repeated scans.
Cloudera BDR does not accommodate datasets that are modified while being copied. It cannot ensure that copied data will match the data in the source system unless you can guarantee that no change occurs during the copy. Because larger datasets take longer to copy to a target environment, migrating large data volumes with Cloudera BDR requires extensive disruption to on-premises applications. This is directly at odds with the need to maintain the enterprise SLAs expected of mission-critical workloads that benefit from the scale and capabilities of Hadoop.
Moving cold, static datasets is simple, while moving changing datasets with enterprise SLAs is very challenging. Cloudera BDR has not been designed with these requirements in mind. It requires the disruption of on-premises application operations during migration. Can you afford to adopt a migration strategy that will fail your enterprise SLAs?
Reconciling data copied by BDR against the source environment cannot simply be added on top of Cloudera BDR to solve these problems. Large datasets take time to scan for this reconciliation, and by the time a single pass is complete, the data will have already changed, making it impossible to bring target data to a consistent state. The copy itself is also slow: for example, 1 PB of data takes about 100 days to migrate over a 1 Gb/s network. Data transfer appliances that can overcome WAN bandwidth limitations actually provide limited end-to-end benefit, because the load and unload times remain high with limited bandwidth into and out of the devices, and because there are additional overheads for the processes surrounding their use for data transfer.
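The transfer-time figure above can be checked with simple arithmetic. The sketch below is illustrative only: it assumes a decimal petabyte (10^15 bytes), a link that carries nothing but migration traffic, and a hypothetical `utilization` parameter to stand in for protocol overhead and contention. None of these names or values come from Frank Cohen's paper.

```python
def transfer_days(data_bytes: float, link_bits_per_second: float,
                  utilization: float = 1.0) -> float:
    """Days needed to move data_bytes over a link at a sustained utilization."""
    seconds = (data_bytes * 8) / (link_bits_per_second * utilization)
    return seconds / 86_400  # seconds per day


PETABYTE = 10**15            # assumption: decimal petabyte, in bytes
GIGABIT_PER_SECOND = 10**9   # 1 Gb/s link

# Best case: the link runs at full line rate for the whole migration.
ideal_days = transfer_days(PETABYTE, GIGABIT_PER_SECOND)

# Assumed 90% sustained utilization (a hypothetical figure for illustration).
derated_days = transfer_days(PETABYTE, GIGABIT_PER_SECOND, utilization=0.9)

print(f"{ideal_days:.1f} days at full line rate")
print(f"{derated_days:.1f} days at 90% utilization")
```

Under these assumptions the best case is roughly 93 days, and a modest allowance for overhead pushes it past 100, which is consistent with the "about 100 days" estimate.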
Data will need to continue to change during migration. Any technology that relies on data remaining static provides no solution when data accuracy is important. Even determining whether data are consistent is challenging at scale, because iterating through a large dataset to identify differences itself takes time, and data will continue to change while that is underway.
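To make the consistency problem concrete, here is a toy model of a single scan-and-compare pass over a dataset that changes while the scan runs. It is not how Cloudera BDR or LiveMigrator is implemented; the in-memory dictionaries, the `scan_for_differences` function, and the `writer` callback are all hypothetical stand-ins for file systems and application writes.

```python
import hashlib


def checksum(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()


def scan_for_differences(source, target, on_step=None):
    """Compare files one at a time, as a single reconciliation pass would."""
    differences = []
    for step, path in enumerate(sorted(source)):
        if on_step is not None:
            on_step(step)  # simulates application writes landing mid-scan
        if path not in target or checksum(source[path]) != checksum(target[path]):
            differences.append(path)
    return differences


# Hypothetical source and target stores that start out identical.
source = {"/data/a": b"v1", "/data/b": b"v1", "/data/c": b"v1"}
target = dict(source)


def writer(step):
    # After /data/a has already been checked, the application rewrites it.
    if step == 1:
        source["/data/a"] = b"v2"


reported = scan_for_differences(source, target, on_step=writer)
actually_stale = [p for p in sorted(source) if source[p] != target.get(p)]

print("scan reported:", reported)
print("actually stale:", actually_stale)
```

The scan reports no differences, yet `/data/a` is stale on the target, because the write landed after that file had been checked. Repeating the scan only moves the window; it does not close it while writes continue.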
Cloudera BDR applies this older approach of copying data that is assumed to remain static. It suffers from excessive migration time, from datasets that are inconsistent across source and target environments due to continued change, and from having no mechanism to determine or ensure data consistency. It has these shortcomings even in an environment that does not suffer hardware or system failures. Real-world clusters incur failures all the time, and any technology that relies on perfect operation of a distributed environment is simply doomed to fail as a result.
The following quotes from Frank Cohen's performance evaluation emphasize these challenges:
“BDR requires manual intervention after an outage and may require the entire BDR job to be rerun, typically with checksums on, which consumes significant resources.”
“BDR requires scripts to be maintained”
“BDR requires every data node in every cluster to be able to communicate with every other, ports have to be configured for every single data node in every participating cluster.”
The overhead of attempting a non-disruptive, no-downtime big data migration is significant. What resources do you need to manage, schedule and maintain BDR migration scripts? How prone is the batch approach to delays? What resources are needed when transfers fail or are interrupted? What resources are needed to account for changes in the data during the migration?
WANdisco provides lower migration risk and cost
Alternatively, WANdisco LiveMigrator provides a fully automated big data migration to the cloud with no application downtime during migration, no risk of data loss, and no inconsistencies even when your datasets are under active change. Unlike Cloudera BDR, WANdisco LiveMigrator quickly and continuously replicates changes to provide cloud data that is:
- always available,
- always accurate and protected, and
- at the lowest IT cost.
To avoid the risk of business disruption during migration, LiveMigrator offers 100% business continuity for hybrid, multi-region and cloud environments with the continued operation of on-premises clusters. With no impact to donor cluster operations during migration, Live Migration is the approach companies use to meet their critical SLAs.
To eliminate the business risk of poor data quality, LiveMigrator is an automated approach to big data migration that provides validation of data consistency between the shared systems. As changes can occur anywhere in the donor system, Live Migration ensures that the beneficiary has consistent data on completion without data loss.
LiveMigrator minimizes IT resources with speed to migrate and no code maintenance. Frank Cohen’s test demonstrated that “WANdisco LiveMigrator’s performance is 38 times superior to Cloudera BDR in measurements of Time To Available Data (TTAD).” LiveMigrator provides automated replication across all major commercial Hadoop distributions, storage, and analytic services. It lowers project cost, reduces project completion time, and speeds the adoption of new cloud services. LiveMigrator provides data replication that is compatible with all cloud vendors and does not require a “Big Bang” cut over for applications.
With a proven, automated path to compelling cloud technologies, cost structures, and analysis opportunities, leading companies are avoiding the risks and costs of Cloudera BDR’s big data migration approach. Live Migration offers the IT team automated migration at scale across all major commercial Hadoop distributions to cloud with a single scan of the source storage, even while data continues to change. Live Migration does not require scripts, code maintenance, transfer devices, scheduling, or reviewing. Migrate to the cloud without risk using LiveMigrator.
About the author
As VP of Product Management at WANdisco, Paul has overall responsibility for the definition and management of WANdisco's product strategy, the delivery of product to market and its success. This includes direction of the product management team, product strategy, requirements definitions, feature management and prioritization, roadmaps, coordination of product releases with customer and partner requirements, user testing and feedback.