Beware of data migration traps
April 15 2021
You want to modernize your data infrastructure and move your Hadoop deployment to the cloud? Here are five questions you need to ask first.
1 What is the scale of the data migration?
There are several ways to transfer small amounts of data to the cloud, particularly if the data is static and unchanging. The danger lies in assuming that the same approaches will work with a large volume of data, especially when that data is changing while moving to the cloud. If the data set is large and static, enterprises need to calculate whether there is enough time and bandwidth to transfer it over the network before starting the migration, or enough time to load it onto a bulk transfer device (e.g. AWS Snowball or Azure Data Box), ship the device to the cloud service provider, and have the data uploaded.
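That calculation is worth doing explicitly before committing to either path. A minimal sketch, with all figures (data size, link speed, usable bandwidth fraction, device turnaround times) as illustrative assumptions:

```python
# Back-of-the-envelope comparison: network transfer vs. bulk transfer device.
# Every input figure here is an assumption to be replaced with your own.

def network_days(data_tb: float, link_gbps: float, usable_fraction: float = 0.5) -> float:
    """Days to move data_tb terabytes over a link, using only a fraction
    of the bandwidth so day-to-day business traffic is not starved."""
    bits = data_tb * 1e12 * 8                          # terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * usable_fraction)
    return seconds / 86_400                            # seconds -> days

def device_days(load_days: float, shipping_days: float, upload_days: float) -> float:
    """Total turnaround for a bulk transfer device: load it, ship it,
    then wait for the provider to upload the contents."""
    return load_days + shipping_days + upload_days

# Example: 500 TB over a 10 Gbit/s link, half of it usable for migration
print(f"network: {network_days(500, 10):.1f} days")    # ~9.3 days
# Example: assume 3 days to load, 5 days shipping, 2 days provider upload
print(f"device:  {device_days(3, 5, 2):.1f} days")     # 10.0 days
```

At this (hypothetical) scale the two options are close; as data volume grows or available bandwidth shrinks, the device wins decisively.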
The real challenge arises when migrating a large volume of data that actively changes during the process. In this situation, the approaches that may work on small data sets will not be effective; they require system downtime that leads to significant business disruption and failed data migration projects. Companies that choose to transfer data across the network often fail to consider all of the other business processes sharing the same network resources for their day-to-day operations. Even a dedicated network channel needs to be factored in, since enterprises usually can't use all of their bandwidth for the migration without impacting other users and processes.
Enterprises need a throttling mechanism in place so the migration causes no negative business impact. In many instances, companies that turn on the faucet and start moving data wind up saturating the pipe and impacting other parts of the business. They are then forced to shut off the migration and restart it at the end of the business day.
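The throttling idea can be sketched in a few lines. This is a simplified illustration, not a production transfer tool: the copy loop sleeps whenever it gets ahead of a configured byte-per-second budget.

```python
# A minimal sketch of bandwidth throttling for a migration copy loop.
import time

def throttled_copy(src, dst, limit_bytes_per_sec: int, chunk_size: int = 1 << 20):
    """Copy the src file object to dst, sleeping as needed to stay
    under limit_bytes_per_sec averaged over the whole transfer."""
    start = time.monotonic()
    sent = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        sent += len(chunk)
        # If we are ahead of the allowed rate, sleep until back on pace.
        expected_elapsed = sent / limit_bytes_per_sec
        actual_elapsed = time.monotonic() - start
        if expected_elapsed > actual_elapsed:
            time.sleep(expected_elapsed - actual_elapsed)
```

Real migration tools add scheduling on top of this, e.g. raising the limit outside business hours instead of stopping the transfer entirely.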
2 How will you maintain consistent data between the source and target during migration?
When you need to migrate data that is actively changing—either new data is being ingested, or existing data is being updated or deleted—you've got a choice to make. You can either freeze the data at the source until the migration is complete, or allow the data to continue to change at the source. If you choose the latter, you need to figure out how to account for those changes so that when the migration is complete, you don't end up with a copy that's already badly out of date.
To prevent data inconsistencies between source and target, find a way to identify and migrate any changes that may have occurred. The typical approach is to perform multiple iterations to rescan the data set and catch changes since the last iteration. This method allows you to iteratively approach a consistent state. However, if you've got a big enough volume of data and it's changing frequently, it may be impossible to ever catch up with the changes being made. This is a fairly complicated problem and many times people don't really anticipate the full impact it will have on their resources and business.
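The rescan loop can be sketched abstractly. In this illustration, `list_files` and `copy_file` are hypothetical stand-ins for whatever storage APIs you actually use; the loop converges only if each pass finds fewer changes than the last:

```python
# A sketch of the iterative rescan approach: each pass copies only the paths
# whose modification time changed since the previous scan.

def iterative_sync(list_files, copy_file, max_passes: int = 10) -> int:
    """list_files() -> {path: mtime}; copy_file(path) transfers one file.
    Returns the number of passes taken to reach a consistent state."""
    seen: dict = {}
    for n in range(1, max_passes + 1):
        snapshot = list_files()
        changed = [p for p, mtime in snapshot.items() if seen.get(p) != mtime]
        if not changed:
            return n            # nothing changed since the last pass
        for path in changed:
            copy_file(path)
        seen = snapshot
    # The source is changing faster than we can copy: we never catch up.
    raise RuntimeError("migration did not converge")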
The other option is to freeze the data at the source to prevent any changes from occurring. This certainly makes the migration task a lot simpler. With this approach you can be confident that the data copy you made to upload to the new location—whether over a network connection or via a bulk transfer device—is consistent with what exists at the source because there weren't any changes allowed during the migration process.
The problem with this approach is that it requires system downtime and results in disruption to your business. These systems are almost always business critical, and bringing them down or freezing them for an extended period of time usually isn't acceptable to the business processes that rely on them. Using a bulk transfer device, it can take days to weeks to perform the transfer. If you transfer data over a dedicated network connection, the duration depends on the bandwidth you have available: moving a petabyte of data over a one-gigabit link takes over 90 days. For the vast majority of organizations, days, weeks or months of downtime and business disruption are just not acceptable.
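The 90-day figure falls directly out of the arithmetic, and that is the optimistic case: it assumes the link runs at 100% utilization for the entire transfer.

```python
# Verifying the transfer-time figure: one petabyte over a one-gigabit link.
PETABYTE_BITS = 1e15 * 8      # 1 PB = 1e15 bytes = 8e15 bits
LINK_BPS = 1e9                # 1 Gbit/s
seconds = PETABYTE_BITS / LINK_BPS
days = seconds / 86_400
print(f"{days:.0f} days")     # ~93 days at full, uninterrupted utilization
```

Throttle the link to half its capacity to protect other business traffic and the figure doubles to roughly half a year.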
3 How will you handle manual reconciliation of the migration process or any outage?
If you stopped the migration or incurred an outage, how do you determine the point from which to recover—how much of the data has already been correctly migrated? Depending on the tools you're using, will it even be possible to resume from that point, or will you effectively have to start the process over from the beginning? It's a complex problem, and relying on a manual process adds significant risk and cost whenever you have to unexpectedly interrupt and resume the migration. Any attempt to manually synchronize data is resource intensive, costly and error prone. It is difficult to do this manually across two environments, and significantly more complicated across multiple environments.
Organizations with deep technical expertise in Hadoop will be familiar with DistCp (distributed copy), and often want to leverage this free open-source tool to develop their own custom migration scripts. However, DistCp was designed for inter/intra-cluster copying, and not for large scale data migrations. DistCp only supports unidirectional data copying for a specific point-in-time. It does not cater to changing data and requires multiple scans of the source to pick up changes made between each run. These restrictions introduce many complex problems. Organizations are better off utilizing their valuable resources on development and innovations using the new cloud environment, rather than building their own migration solutions.
4 Will you need a hybrid environment that supports changes at both source and target?
The use of hybrid cloud deployments is increasingly popular. That may entail the use of a public cloud environment together with a private cloud or an organization’s traditional on-premises infrastructure. For a true hybrid cloud scenario changes need to be able to occur in any location, and the change needs to be propagated to the other system. Approaches that only account for unidirectional data movement do not support true hybrid cloud scenarios because they require a source/target relationship.
This is further complicated when you go beyond just two endpoints. We are seeing more and more distributed environments where there isn’t just one source and a destination, but multiple cloud regions for redundancy purposes or even across multiple cloud providers. To avoid locking yourself into a single point solution, you need to be able to manage live data across multiple endpoints. In this case you need a solution that can replicate changes across multiple environments and resolve any potential data change conflicts, preferably before they arise.
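To make the conflict problem concrete, here is a deliberately minimal sketch of multi-endpoint replication using last-writer-wins resolution. The `Endpoint`/`replicate` names are illustrative assumptions, and real replication systems are far richer (vector clocks, quorum reads, application-level merge rules), but it shows why a plain source/target copy tool cannot handle concurrent writes in two regions:

```python
# Minimal last-writer-wins replication across multiple endpoints (illustrative).
from dataclasses import dataclass, field

@dataclass
class Endpoint:
    name: str
    data: dict = field(default_factory=dict)   # key -> (timestamp, value)

    def write(self, key, value, ts):
        self.data[key] = (ts, value)

def replicate(endpoints):
    """Merge all endpoints so each converges on the latest write per key."""
    merged = {}
    for ep in endpoints:
        for key, (ts, value) in ep.data.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)       # later timestamp wins
    for ep in endpoints:
        ep.data = dict(merged)

# Two regions update the same key concurrently; after replication both agree.
us, eu = Endpoint("us-east"), Endpoint("eu-west")
us.write("report.csv", "v1", ts=100)
eu.write("report.csv", "v2", ts=105)
replicate([us, eu])
print(us.data["report.csv"])   # (105, 'v2') on both endpoints
```

Note that last-writer-wins silently discards the losing write—one reason the text above recommends resolving conflicts before they arise rather than after.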
5 What application dependencies exist that cause data gravity?
Data gravity refers to the ability of data to attract applications, services and other data. The greater the amount of data, the greater the force (gravity) it has to attract more applications and services. Data gravity also often drives dependencies between applications.
For example, there might be one application that takes output from another application as its input that may in turn feed other applications that are further downstream. Business units or users who designed a given application will know what their inputs are, but they may not be aware of everyone that is using the data they have created. It becomes very easy to miss a dependency. When the application is moved to the cloud, the resulting generated data will not be synchronized back down to the on-premises environment, and suddenly other applications further down the workflow aren’t getting current data.
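Finding those downstream consumers before migrating is a graph traversal. A sketch, with an entirely hypothetical dependency map standing in for whatever catalog or lineage tooling you have:

```python
# Sketch: before migrating a dataset, walk a dependency graph to find every
# downstream application that consumes its output. Names are illustrative.
from collections import deque

deps = {
    "sales_etl":      ["daily_report", "forecast_model"],
    "daily_report":   ["exec_dashboard"],
    "forecast_model": [],
    "exec_dashboard": [],
}

def downstream(app: str, deps: dict) -> set:
    """Return every transitive consumer of `app` (breadth-first search)."""
    seen, queue = set(), deque(deps.get(app, []))
    while queue:
        consumer = queue.popleft()
        if consumer not in seen:
            seen.add(consumer)
            queue.extend(deps.get(consumer, []))
    return seen

print(sorted(downstream("sales_etl", deps)))
# ['daily_report', 'exec_dashboard', 'forecast_model']
```

Move `sales_etl` to the cloud without this check and `exec_dashboard`—two hops away, owned by a different team—is the application that quietly goes stale.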
Many enterprises fail when attempting to migrate their data to the cloud. Answering these five questions can make the difference between a successful migration and falling into a data migration trap that wastes time and money and can harm the business.