Three Considerations for Hadoop-to-Cloud Migration
By Tony Velcich, Aug 03, 2021
Enterprises consider their data and analytics platforms strategic assets that are crucial to digital transformation and business continuity. Yet even as these systems increasingly form the foundations of enterprise business models, some of them remain a massive challenge to organizations. On-premises Hadoop deployments are an example of this — they are complex, unscalable, and increasingly a burden for IT departments.
That’s why more and more enterprises are migrating away from Hadoop and towards modern cloud-based platforms.
There are numerous forces driving enterprises away from Hadoop. Often, it’s a combination of Hadoop’s inherent limitations and demands from the field for advanced analytics services that Hadoop can’t effectively provide. More specifically, enterprise teams are looking to leave Hadoop due to:
Inadequate performance
Enterprises are discovering that Hadoop can’t keep up with their business goals. If only samples of big data can be processed rather than entire petabyte-scale datasets, or if computations can’t be completed even in weeks or months, let alone days, then the viability of a Hadoop deployment is clearly in question.
Unreliable and unscalable
When clusters can’t scale up to meet computing requirements or scale down to cut costs, enterprises relying on Hadoop are frequently left in data, productivity, and budgetary limbo. And the problem isn’t just with the usage and output of these systems — maintaining, patching, and upgrading Hadoop is an operational and human resources burden, too.
Questionable long-term viability
We’ve discussed in previous articles the (rather dire) long-term outlook for on-premises Hadoop. And we’re not the only ones who think so. Even enterprises still strategically committed to Hadoop question the platform’s technological viability and the business stability of its vendors. This is leading enterprises to view Hadoop not only as an impediment, but also as a liability.
Three top Hadoop-to-cloud migration considerations
Once the decision to move away from Hadoop has been made, here are three questions to take into consideration before implementation:
1. What’s the scale of the data migration?
As a rule, the larger the scale, the more complex the migration. And while numerous options exist for small data volumes, few of these work well at scale. Migrating large volumes of data takes time. So, if you’re migrating data over a network, make sure to calculate the time it will take based on your network’s bandwidth while taking into consideration the schedule and size of other workloads.
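That transfer-time calculation is simple but worth doing explicitly. Here is a minimal sketch; the dataset size, link speed, and utilization figures are illustrative assumptions, not numbers from this article:

```python
# Rough estimate of network transfer time for a large data migration.
# All input figures below are illustrative assumptions.

def transfer_days(dataset_tb: float, link_gbps: float, utilization: float) -> float:
    """Days to move dataset_tb terabytes over a link_gbps link, assuming
    only a fraction `utilization` of the link is available to the
    migration (the rest is consumed by other scheduled workloads)."""
    dataset_bits = dataset_tb * 1e12 * 8           # terabytes -> bits
    effective_bps = link_gbps * 1e9 * utilization  # usable bits per second
    seconds = dataset_bits / effective_bps
    return seconds / 86_400                        # seconds -> days

# Example: 1 PB (1,000 TB) over a 10 Gbps link with 50% available bandwidth
print(round(transfer_days(1000, 10, 0.5), 1))  # -> 18.5 days
```

Even under these optimistic assumptions (sustained throughput, no retries or reconciliation passes), a petabyte takes weeks, which is why the schedule and size of competing workloads matter so much.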
2. What amount of data changes occur in your Hadoop environment?
Business disruption is a top concern for planned Hadoop migration projects, and handling on-premises data changes during migration is a key challenge noted by enterprises that have already migrated Hadoop data to the cloud. Handling this is difficult because typical Hadoop production environments are very active, with high levels of data ingest and updates. Measurements at one of our customers’ implementations showed peak loads for their on-premises Hadoop deployment reaching upwards of 100,000 file system events per second, with loads over a 24-hour period averaging 20,000 file system events per second. This ongoing activity adds to migration time and complexity, leaving enterprises with three options for managing changes during migration:
Don’t allow changes to happen (leads to system downtime and business disruption)
Develop a custom solution to manage changes
Leverage tools (like WANdisco) that are purpose-built to handle changes
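To see why the first option is rarely acceptable, consider the backlog of changes that accumulates while a migration runs. A small sketch using the average event rate quoted above; the one-week migration window is an assumed figure for illustration:

```python
# Estimate the file system events that accumulate during a migration
# window, using the 20,000 events/second average cited in the article.
# The one-week migration duration is an illustrative assumption.

AVG_EVENTS_PER_SEC = 20_000
MIGRATION_DAYS = 7

backlog = AVG_EVENTS_PER_SEC * MIGRATION_DAYS * 86_400  # 86,400 s per day
print(f"{backlog:,} events to reconcile")  # -> 12,096,000,000 events to reconcile
```

Roughly twelve billion events in a week is far beyond what manual reconciliation or a freeze window can absorb, which is what pushes enterprises toward option two or three.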
3. Will your migration approach require manual or custom development efforts?
There are a number of Hadoop-to-cloud data migration methodologies and approaches, each with its own considerations. For example, data transfer devices like the Azure Data Box can get petabyte-scale datasets from Point A to Point B. Yet these solutions may require system downtime or some method for handling data changes that occur during the transfer process. Similarly, network-based data transfer with manual reconciliation of data changes may work for small volumes, but isn’t viable at scale.
Hadoop comes packaged with DistCp, a free tool that is frequently used to start data migration projects…but less so to finish them. The problem is that DistCp was designed for inter/intra-cluster copy of data at a specific point in time — not for ongoing changes. DistCp requires multiple passes and custom code or scripts to accommodate changing data, making it impractical for an enterprise-class migration.
Finally, there are next-gen automated migration tools (like WANdisco LiveData Migrator) that allow migrations to occur while production data continues to change — with no system downtime or business disruption. These solutions enable IT resources to focus on strategic development efforts, not on migration code.
The bottom line
As enterprises migrate away from Hadoop in favor of cloud-based platforms, they are looking more closely not just at the end results of migration, but at the process itself. Large-scale data migration is a massive enterprise project — there’s no question. Yet by choosing the right tools for the job — tools that enable business data to flow freely and core business functions to continue unhindered, even during petabyte-scale migration — the viability of this strategic shift increases dramatically.
Tony is an accomplished product management and marketing leader with over 25 years of experience in the software industry. He is currently responsible for product marketing at WANdisco, helping to drive go-to-market strategy, content, and activities. Tony has a strong background in data management, having worked at leading database companies including Oracle, Informix, and TimesTen, where he led strategy for areas such as big data analytics for the telecommunications industry, sales force automation, and sales and customer experience analytics.