Leverage a Data-First Strategy for Your AWS Cloud Migration
By Tony Velcich, Oct 12, 2021
When you migrate from Hadoop to AWS, data needs to be available immediately to realize the business and technological benefits of cloud computing. Here’s how using a data-first approach will de-risk your AWS data migration.
All the cost benefits of moving applications as part of an AWS cloud migration won’t matter if the data isn’t available to deliver business value. A retail executive who can’t open the performance dashboard on Monday morning to see how stores in their territory performed over the weekend will not be pleased with a big bang migration that leaves data unavailable while it runs, and the fallout will land on the IT department. That’s why companies benefit from taking a data-first approach to their AWS cloud migration initiatives.
In a recent virtual event, WANdisco Chief Technology Officer Paul Scott-Murphy outlined how a data-first approach can deliver business benefits faster while reducing the risk of the migration itself.
Key takeaways from the session are below:
A data-first approach to migration makes it possible for data scientists to immediately start using cloud-scale analytics platforms in AWS. Data becomes the central element of migrating to the cloud. The approach accounts for both the volume of data sitting in on-premises systems and the fact that datasets change over time, and it provides a way to migrate data so that it is immediately available in the cloud.
Migrate data early as part of your AWS cloud migration
Storage is the first piece of cloud migration. In many AWS cloud migrations, the foundation for data migration is AWS’s own storage service, Amazon S3. Amazon S3 was the first cloud-scale service for cloud storage and is the foundation for data lakes on AWS. It can hold the datasets that on-premises platforms like Hadoop store today, and it offers several storage classes. Amazon S3 also sees continuous improvement and cost reductions, making it well suited to large-scale storage, even more so than on-premises platforms.
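To make the storage step concrete, here is a minimal sketch of how a file path in Hadoop (HDFS) might map to an object location in S3 while keeping the existing directory layout. The function name, bucket name, and prefix are hypothetical and chosen only for illustration; real migration tooling handles this mapping for you, and nothing here reflects how any particular product does it.

```python
from urllib.parse import urlparse


def hdfs_to_s3_uri(hdfs_path: str, bucket: str, prefix: str = "") -> str:
    """Map an HDFS path to an S3 URI, reusing the directory layout as the object key.

    `bucket` and `prefix` are illustrative parameters, not values from the article.
    """
    parsed = urlparse(hdfs_path)
    # Drop the scheme and NameNode host:port; keep only the file path.
    key = parsed.path.lstrip("/")
    if prefix:
        key = f"{prefix.strip('/')}/{key}"
    return f"s3://{bucket}/{key}"


if __name__ == "__main__":
    # Hypothetical source path and target bucket.
    print(hdfs_to_s3_uri(
        "hdfs://namenode:8020/data/sales/2021/part-0000.parquet",
        bucket="example-data-lake",
        prefix="hadoop",
    ))
```

Keeping the directory layout intact in the object key means that jobs which address data by path need only a change of URI scheme and root, which fits the goal of migrating without application changes.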
Migrate metadata as part of AWS data migration
Metadata is the next piece of a data-first migration strategy, and it’s essential to use the right tools so that the metadata is accessible when needed. The AWS Glue Data Catalog works as a central metadata repository accessible from services provided by AWS and its partners. Using the Glue Data Catalog is essential for a cloud migration strategy from platforms like Hadoop.
Previously, companies needed technologies like Apache Hive to hold metadata. In AWS, however, the Glue Data Catalog stores metadata about data sources, transformations, and targets for transformations. Unlike other services, the Glue Data Catalog is fully managed and fully Hive-compatible, which lets companies open up metadata previously stored in Hive to a broader range of cloud services.
Analyze data using EMR
The third step is compute. Amazon EMR is one of the central AWS services for running analytic workloads in the cloud. It is a cloud big data platform that runs technologies like Spark, Hive, and HBase. The advantages of using EMR include its elasticity, security, and flexibility, as well as its industry-leading low total cost of ownership, according to IDC. Using EMR opens up use cases for datasets in the cloud, including machine learning, ETL, clickstream analysis, and other workloads.
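As an illustration of how the metadata and compute steps connect, the sketch below builds the cluster configuration that, as far as the AWS EMR documentation describes it, points Hive and Spark on an EMR cluster at the Glue Data Catalog rather than a cluster-local Hive metastore. The helper function name is hypothetical, and the classification and property names should be checked against current AWS documentation before use.

```python
import json

# Client factory class that AWS documents for using the Glue Data Catalog
# as the Hive metastore on EMR (verify against current documentation).
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configurations() -> list:
    """Return EMR configuration entries for Hive and Spark (hypothetical helper)."""
    return [
        {
            "Classification": classification,
            "Properties": {"hive.metastore.client.factory.class": GLUE_FACTORY},
        }
        # "hive-site" covers Hive; "spark-hive-site" covers Spark SQL.
        for classification in ("hive-site", "spark-hive-site")
    ]


if __name__ == "__main__":
    # The JSON form can be supplied as the cluster's configurations
    # when the cluster is created.
    print(json.dumps(glue_catalog_configurations(), indent=2))
```

With this in place, tables migrated from Hive into the Glue Data Catalog can be queried from the cluster by name, without the cluster maintaining its own copy of the metadata.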
Often, WANdisco customers will leverage storage, metadata, and compute, as well as third-party services like Databricks and Snowflake. This lets them run analytics against large datasets that stretch beyond basic storage use cases. Taking a data-first approach to migration enables many of the analytic platforms available in AWS to work with data that was previously locked up on-premises, quickly and without business disruption.
The difference a data-first approach makes
A data-first migration means that companies use data as the central element for their migrations to AWS or the cloud. But to do this, they need to consider what happens with data in their on-premises environment. For example, data in Hadoop doesn’t remain stationary; it changes constantly as new data is ingested, and the volume involved could be hundreds of terabytes or petabytes.
A data-first approach considers the large volume of data, how the data changes, and that the business may directly depend on the data being available at all times. Data migration cannot disrupt business operations, so there must be a way to migrate data immediately. This means that data is available in the cloud and that changes to data occurring on-premises are also available immediately in the cloud. This is live data, and migrating it requires a solution that works without interrupting the business. The solution needs to be simple to introduce, require no application changes, and scale to the volume of data involved. Supporting AWS data migration and availability at any scale without data loss, without data inconsistencies, and without disrupting data operations is the definition of a data-first migration approach.
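The idea of migrating live data can be sketched with a toy model: copy what exists at the start, let applications keep writing to the source during the copy, then replay those changes on the target so both sides end up identical. This is only an illustration under simplifying assumptions (in-memory dictionaries stand in for Hadoop and S3, and all names are hypothetical); it is not a description of how WANdisco’s or any other product works internally.

```python
from typing import Dict, List, Optional, Tuple

# A change event: (operation, path, data). Operation is "put" or "delete".
Change = Tuple[str, str, Optional[bytes]]


def apply_change(store: Dict[str, bytes], change: Change) -> None:
    """Apply a single put or delete event to a store."""
    op, path, data = change
    if op == "put":
        store[path] = data
    elif op == "delete":
        store.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def migrate_live(
    source: Dict[str, bytes],
    target: Dict[str, bytes],
    changes_during_copy: List[Change],
) -> None:
    """Bulk-copy a snapshot, then replay changes made while the copy ran."""
    # Record what exists when the migration starts.
    snapshot = dict(source)
    # Applications keep writing to the source; nothing is paused.
    for change in changes_during_copy:
        apply_change(source, change)
    # Copy the initial snapshot to the target.
    target.update(snapshot)
    # Replay the captured changes, in order, so the target catches up.
    for change in changes_during_copy:
        apply_change(target, change)


if __name__ == "__main__":
    on_prem = {"/sales/sat.csv": b"sat", "/sales/sun.csv": b"sun"}
    cloud: Dict[str, bytes] = {}
    changes: List[Change] = [
        ("put", "/sales/mon.csv", b"mon"),
        ("delete", "/sales/sat.csv", None),
        ("put", "/sales/sun.csv", b"sun-v2"),
    ]
    migrate_live(on_prem, cloud, changes)
    print(cloud == on_prem)
```

The point of the model is the ordering guarantee: because changes are replayed in the order they occurred, updates and deletions made during the copy are not lost or reverted, which is what “without data loss” and “without data inconsistencies” require.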
A data-first strategy means moving as much of your live data into the cloud as fast as possible to take advantage of cloud-scale storage, analytics, and new capabilities.
Tony is an accomplished product management and marketing leader with over 25 years of experience in the software industry. Tony is currently responsible for product marketing at WANdisco, helping to drive go-to-market strategy, content, and activities. Tony has a strong background in data management, having worked at leading database companies including Oracle, Informix, and TimesTen, where he led strategy for areas such as big data analytics for the telecommunications industry, sales force automation, and sales and customer experience analytics.