Successfully Building Data Lakes

By Tony Velcich, Sr. Director Product Marketing, WANdisco, Feb 11, 2021

As enterprises feel the urgency to accelerate their digital transformation, they need to take the time to plan the development of their data lakes so that they can take advantage of advanced data analytics. Gartner’s recent report, “Building Data Lakes Successfully — Part 1 — Architecture, Ingestion, Storage and Processing” [1], offers several recommendations for data and analytics technical professionals responsible for building data lakes:


“‘Begin with the end in mind.’ Improve the chances of success by building data lakes iteratively for the specific requirements of certain business groups, sets of users or analytics use cases, rather than taking a ‘big bang,’ enterprise wide approach.


Learn lessons from data warehouse experiences. Do not build a data lake hoping that the enterprise will figure out how to use it. Build proof of value rather than proof of concept before jumping into data lake implementations. Organizations with good engineering skills and the right use cases have implemented data lakes successfully. With data lakes, it is very easy to get into a ‘technology tail-chase trail.’ Avoid that and stay focused on business outcomes from the data lake.


Avoid doing ad hoc ingestion or data processing on a data lake. Build a framework or leverage third-party vendors to provide self-service-driven ingest, storage provisioning and processing.”


Once a company has set out its objectives and determined how its data will be used, it can confidently move toward achieving its goals and migrating its data. Data lakes can leverage advanced cloud capabilities, including new AI and machine learning capabilities that could not be run on-premises, but moving to the cloud does not mean that all the challenges are behind the company. Gartner also reports:


“For organizations moving to the cloud data lake from on-premises deployments, there are multiple challenges to be resolved. The growing divide between on-premises and cloud data silos isn’t going to go away soon. Some of these challenges include:

 

  •    Keeping data synchronized, which is one of the biggest challenges faced by hybrid solutions. Challenges of synchronizing data are a way to identify when data has changed and a mechanism to propagate changes to the corresponding copies...”


Gartner offers the following considerations and best practices when migrating data from on-premises to cloud:


  •    “Always make copies of data read-only if possible.

  •    Avoid maintaining more than two copies of data. Keep only one copy on-premises and another only in the cloud.

  •    Leverage tools…to help manage your data synchronization.

  •    If a dataset is accessed in both environments, organizations need to establish a primary storage location for it in one environment and maintain a synchronized copy in the other.

  •    Develop a strategy to resolve conflicts.

  •    When large datasets are transferred to the cloud, ensure data always goes through a dedicated network line with a certain speed from the organization data center to the cloud data center.”
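
The pattern of one primary location plus one synchronized copy can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not something taken from the Gartner report or from any WANdisco product: both copies are assumed to be local directories, changes are identified by comparing SHA-256 content digests, and every conflict is resolved in favor of the primary. The function names (file_digest, snapshot, sync_one_way) are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Read in 1 MiB chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its content digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def sync_one_way(primary: Path, replica: Path) -> dict[str, list[str]]:
    """Make the replica match the primary; the primary always wins."""
    replica.mkdir(parents=True, exist_ok=True)
    src, dst = snapshot(primary), snapshot(replica)

    # Identify what changed: new or modified files, and files removed upstream.
    copied = [rel for rel, digest in src.items() if dst.get(rel) != digest]
    deleted = [rel for rel in dst if rel not in src]

    # Propagate the changes to the corresponding copy.
    for rel in copied:
        target = replica / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(primary / rel, target)
    for rel in deleted:
        (replica / rel).unlink()

    return {"copied": copied, "deleted": deleted}
```

Running sync_one_way a second time with no intervening writes reports nothing to copy or delete. The "primary always wins" rule is the simplest possible conflict strategy; it only works if the replica is treated as read-only, which is why the two recommendations appear together in the list above.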


WANdisco designed LiveData Plane to address the synchronization issues Gartner describes. It provides enterprises with active-active Hadoop data replication across multiple distributed and diverse IT environments, regardless of geographic location, Hadoop distribution or cloud storage provider. Built on WANdisco’s patented Distributed Coordination Engine, LiveData Plane uses consensus technology to ensure continuous availability and consistency of actively used data across any combination of Hadoop distributions and cloud storage. That eliminates the need to keep one copy on-premises and one in the cloud with one of them read-only. Nor do enterprises need to establish a primary storage location and maintain a synchronized copy in another environment to avoid data conflicts. With LiveData Plane, users can handle these challenges directly rather than working around them.


Migrating the data

When enterprises are ready to begin migrating data from on-premises to the cloud, they must decide how to migrate while keeping their data consistent. Gartner offers the following best practice for migration:


“Leverage commercial third-party vendor tools or managed services to move on-premises file system data to object storage in a public cloud in an automated way.”


When it comes to migrating data lakes, WANdisco’s LiveData Migrator enables enterprises to migrate data with zero data loss, even while the data are actively changing. Business operations can continue as usual during migration because all ongoing data changes are replicated to the target cloud environment. An automated, self-service solution, LiveData Migrator migrates data at any scale from on-premises to any public cloud, can be started within minutes, and requires no business downtime, engineers or consultants. It migrates the source datasets in a single scan and processes the changes that occur while that scan is running, which results in a complete and continuous data migration.
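
The general idea of combining a single scan with replay of ongoing changes can be illustrated with a toy model. The Python sketch below is not LiveData Migrator's implementation, which this post does not describe; it assumes the source and target are in-memory dictionaries, that every write made during the scan is also appended to a change log, and that each logged change carries the full new content. All names (Change, apply_change, migrate, on_scan_step) are hypothetical.

```python
from collections import deque
from typing import Callable, Optional

# A change is (path, new_content); new_content is None when the path was deleted.
Change = tuple[str, Optional[bytes]]


def apply_change(target: dict[str, bytes], change: Change) -> None:
    """Apply one logged change to the target store."""
    path, content = change
    if content is None:
        target.pop(path, None)  # deleting a path that was never copied is a no-op
    else:
        target[path] = content


def migrate(
    source: dict[str, bytes],
    target: dict[str, bytes],
    changes: deque[Change],
    on_scan_step: Optional[Callable[[str], None]] = None,
) -> None:
    """Copy source to target in one scan, then replay changes made meanwhile."""
    # Phase 1: a single pass over the listing taken when migration starts.
    for path in list(source):
        if path in source:  # the path may have been deleted since the listing
            target[path] = source[path]
        if on_scan_step is not None:
            on_scan_step(path)  # hook used here to simulate concurrent writers

    # Phase 2: replay, in order, every change logged while the scan was running.
    while changes:
        apply_change(target, changes.popleft())


# Simulated workload: while "a" is being scanned, the application keeps writing.
source = {"a": b"1", "b": b"2", "c": b"3"}
target: dict[str, bytes] = {}
log: deque[Change] = deque()


def writer(path: str) -> None:
    if path == "a":
        source["b"] = b"2-new"
        log.append(("b", b"2-new"))
        source["d"] = b"4"
        log.append(("d", b"4"))
        del source["c"]
        log.append(("c", None))


migrate(source, target, log, on_scan_step=writer)
```

Because each logged change carries the full new state of a path, replaying the log in order is idempotent: it does not matter whether the scan had already picked up a given change. That property is what lets the target converge on the source without pausing writers in this toy model.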


Data lake migration is a critical part of enterprises’ digital transformations. With some advance planning, a strong framework, and the best tools, the process can be completed without complications. With the right steps, enterprises can take full advantage of their data in the cloud to gain fast and insightful business analytics for the benefit of all stakeholders.

   

[1] Gartner, “Building Data Lakes Successfully — Part 1 — Architecture, Ingestion, Storage and Processing” by Sumit Pal, October 7, 2020.