Successfully Building Data Lakes
By Tony Velcich, Sr. Director Product Marketing, WANdisco
Feb 11, 2021
As enterprises feel the urgency to accelerate their digital transformation, they still need to take the time to plan the development of their data lakes if they are to take full advantage of advanced data analytics. In Gartner’s recent report, Building Data Lakes Successfully — Part 1 — Architecture, Ingestion, Storage and Processing,¹ there are several recommendations for data and analytics technical professionals responsible for building data lakes.
“‘Begin with the end in mind.’ Improve the chances of success by building data lakes iteratively for the specific requirements of certain business groups, sets of users or analytics use cases, rather than taking a ‘big bang,’ enterprise wide approach.
Learn lessons from data warehouse experiences. Do not build a data lake hoping that the enterprise will figure out how to use it. Build proof of value rather than proof of concept before jumping into data lake implementations. Organizations with good engineering skills and the right use cases have implemented data lakes successfully. With data lakes, it is very easy to get into a ‘technology tail-chase trail.’ Avoid that and stay focused on business outcomes from the data lake.
Avoid doing ad hoc ingestion or data processing on a data lake. Build a framework or leverage third-party vendors to provide self-service-driven ingest, storage provisioning and processing.”
Once a company has set out its objectives and determined how its data will be used, it can confidently move toward achieving those goals and migrating its data. Data lakes let enterprises leverage advanced cloud capabilities, including new and innovative AI and machine learning services that could not be run on-premises, but this does not mean that all the challenges are behind them. Gartner also reports:
“For organizations moving to the cloud data lake from on-premises deployments, there are multiple challenges to be resolved. The growing divide between on-premises and cloud data silos isn’t going to go away soon. Some of these challenges include:
Keeping data synchronized, which is one of the biggest challenges faced by hybrid solutions. Challenges of synchronizing data are a way to identify when data has changed and a mechanism to propagate changes to the corresponding copies...”
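The two synchronization challenges Gartner names, detecting that data has changed and propagating that change to the corresponding copy, can be illustrated with a minimal sketch. The `fingerprint` and `sync` functions are hypothetical names for illustration, and plain Python dictionaries stand in for the on-premises and cloud stores:

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Detect change: a content hash differs whenever the data differs."""
    return hashlib.sha256(data).hexdigest()


def sync(source: dict, target: dict) -> list:
    """Propagate change: copy new or modified entries from source to target."""
    copied = []
    for path, data in source.items():
        if path not in target or fingerprint(target[path]) != fingerprint(data):
            target[path] = data
            copied.append(path)
    return copied
```

Running `sync` repeatedly keeps the target copy consistent with the source; production tools must also handle deletions, bidirectional conflicts, and incremental change capture, which is where hybrid deployments get difficult.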
Gartner offers the following considerations and best practices when migrating data from on-premises to cloud:
“Always make copies of data read-only if possible.
Avoid maintaining more than two copies of data. Keep only one copy on-premises and another only in the cloud.
Leverage tools…to help manage your data synchronization.
If a dataset is accessed in both environments, organizations need to establish a primary storage location for it in one environment and maintain a synchronized copy in the other.
Develop a strategy to resolve conflicts.
When large datasets are transferred to the cloud, ensure data always goes through a dedicated network line with a certain speed from the organization data center to the cloud data center.”
WANdisco designed LiveData Plane to address these issues. It provides enterprises with active-active Hadoop data replication across multiple distributed and diverse IT environments, regardless of geographic location, Hadoop distribution or cloud storage provider. Leveraging WANdisco’s patented Distributed Coordination Engine, LiveData Plane uses consensus technology to ensure continuous availability and consistency of actively used data across any combination of Hadoop distributions and cloud storage. This eliminates the need to keep one copy on-premises and a read-only copy in the cloud, and enterprises no longer need to designate a primary storage location in one environment while maintaining a synchronized copy in the other to avoid data conflicts. With LiveData Plane, users can address these challenges directly rather than working around them.
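As a rough illustration of the general idea behind consensus-based active-active replication (a toy sketch, not WANdisco’s actual implementation), the code below uses a single in-memory `Sequencer` as a stand-in for a consensus service. Because every replica applies writes in the same agreed order, all replicas converge to the same state even when writes originate at different sites:

```python
class Sequencer:
    """Stand-in for a consensus service: assigns one global order to all writes."""

    def __init__(self):
        self.log = []  # the agreed total order of operations

    def propose(self, op) -> int:
        self.log.append(op)
        return len(self.log) - 1  # position in the global order


class Replica:
    """One site's copy of the data; accepts writes and applies the shared log."""

    def __init__(self, sequencer: Sequencer):
        self.seq = sequencer
        self.state = {}
        self.applied = 0  # how far into the shared log this replica has applied

    def write(self, key, value):
        self.seq.propose((key, value))  # writes go through the agreed order

    def catch_up(self):
        for key, value in self.seq.log[self.applied:]:
            self.state[key] = value
        self.applied = len(self.seq.log)
```

In a real deployment the sequencer’s role is played by a distributed consensus protocol, so there is no single point of failure; the key property is that every replica applies the same operations in the same order.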
Migrating the data
When enterprises are ready to begin migrating data from on-premises to the cloud, they face key decisions about how to migrate while keeping their data consistent. Gartner offers some considerations and best practices for migration.
“Leverage commercial third-party vendor tools or managed services to move on-premises file system data to object storage in a public cloud in an automated way.”
When it comes to migrating data lakes, WANdisco’s LiveData Migrator is a powerful tool that enables enterprises to migrate with zero data loss, even while the data are actively changing. Business operations can continue as usual during migration because all ongoing data changes are replicated to the target cloud environment. An automated, self-service solution, LiveData Migrator easily migrates data at any scale, within minutes, from on-premises to any public cloud without business downtime and without requiring dedicated engineers or consultants. A complete and continuous migration requires only a single scan of the source datasets; ongoing changes are then processed as they occur.
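The single-scan-plus-ongoing-changes pattern can be sketched as follows. This is a hypothetical illustration, not LiveData Migrator’s implementation: it assumes writes made during the scan are captured in a change journal and replayed after the scan completes, so nothing is lost even though the source keeps changing:

```python
from queue import Queue


def migrate(source: dict, changes: Queue, target: dict) -> None:
    """Copy everything in one scan, then replay changes captured meanwhile."""
    for path, data in list(source.items()):  # the single scan of the source
        target[path] = data
    while not changes.empty():  # ongoing changes journaled during the scan
        path, data = changes.get()
        target[path] = data  # replaying the journal supersedes stale copies
```

Because the journal is replayed after the scan, a file modified mid-scan ends up in the target with its latest contents; a continuous migration simply keeps consuming the journal instead of stopping.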
Data lake migration is a critical part of enterprises’ digital transformations, and with advance planning, a strong framework, and the best tools, the process can be completed without complications. With the right steps, enterprises can take full advantage of their data in the cloud to gain fast, insightful business analytics for the benefit of all stakeholders.
1 Gartner, “Building Data Lakes Successfully — Part 1 — Architecture, Ingestion, Storage and Processing” by Sumit Pal, October 7, 2020.