Accelerating Time to Value for Cloud Analytics and Machine Learning
By Van Diamandakis, Jul 02, 2020
Machine learning (ML) in the cloud is powering a whole new generation of intelligent and predictive cloud analytics solutions like Azure Databricks and Azure Synapse. The benefits of cloud economics, tooling and flexibility, along with next-level insights to drive real-time business decisions, are the primary drivers behind the growing trend of on-premises data lake migrations to the cloud.
Cloud analytics services like Synapse are designed to collect and analyze current, actionable data, delivering insights into processes and workflows that can impact business operations. But what if you need those insights immediately, always accurate and up to date, and in the hands of employees and experts who are working simultaneously across the globe? IT stakeholders are turning to the cloud for faster, more accurate and timelier business insights, especially in the face of COVID-19, when companies are looking to operate as economically as possible and millions are forced into remote working locations.
Even before the pandemic, a 2019 survey by TechTarget found that 27% of respondents planned to deploy cloud analytics in 2020. That same study points to increased use of cloud technology as the number two activity that companies are employing to improve employee experience and productivity, and notes that 38% of companies planned to bolster their cloud technology in 2020. Based on our conversations with the experts at AWS and Azure, that number is higher today. Hindsight is also 2020!
There are multiple reasons that organizations are moving their data lakes and analytics capabilities to the cloud. First among them is cost: the move streamlines the workforce, so even though the migration process involves start-up costs, the long-term cost-benefit analysis plays out in their favor. Companies can also run faster and lighter with cloud analytics, with no need to run dedicated client-side applications and with IT teams freed from coordinating upgrades across an entire infrastructure. In our experience across our customer base at WANdisco and in working with CSPs like Azure and AWS, we have found that, on average, the total cost of ownership of a 1PB Hadoop data lake on premises is $2M over a three-year period. Managing that same 1PB in AWS S3 or Azure ADLS Gen2 storage costs $900,000 over three years.
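To make the comparison concrete, here is a minimal sketch of the arithmetic using only the two three-year figures quoted above. It ignores one-time migration costs, which this post does not quantify, and the function name `savings` is purely illustrative.

```python
# Three-year total cost of ownership for 1PB, as quoted in the text (USD).
ON_PREM_TCO = 2_000_000
CLOUD_TCO = 900_000
YEARS = 3


def savings(on_prem: int, cloud: int) -> tuple[int, float]:
    """Return absolute savings and savings as a fraction of the on-premises cost."""
    absolute = on_prem - cloud
    return absolute, absolute / on_prem


absolute, fraction = savings(ON_PREM_TCO, CLOUD_TCO)
print(f"Three-year savings: ${absolute:,} ({fraction:.0%})")
print(f"Average per year:   ${absolute / YEARS:,.0f}")
```

On these numbers the cloud option comes out $1.1M cheaper over three years, or 55% of the on-premises cost.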
The question is how to migrate that 1PB data lake as rapidly as possible (time to value) with zero downtime, while keeping the data consistent on premises and in the cloud during the migration, because business-critical data is always changing. Architects and data teams have two choices.
The first choice is to use various flavors of open source DistCp tools and scripts, which is the manual approach to a data lake migration. Don't be fooled by fancy names from the Hadoop or cloud vendors. It’s all DistCp under the covers. What’s wrong with this approach? It’s an IT project, and 61% of IT projects either fail or suffer cost and SLA overruns. Here’s what you have to do in this scenario:
Find a project manager to run the entire project
Find a business analyst to define requirements
Peel off a Hadoop and cloud architect to review requirements and design a solution
Tap into an already overworked development team to take on the DistCp scripting work within an existing sprint
Do unit testing and then validation testing
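As an illustration of what the validation step in the list above can involve, here is a minimal sketch that compares a source inventory with a target inventory. It assumes each side has already been exported as a mapping from file path to checksum. The name `diff_inventories` and the sample paths are hypothetical, and this is not how DistCp or any vendor tool validates a copy.

```python
from typing import Dict, List, NamedTuple


class InventoryDiff(NamedTuple):
    missing: List[str]     # on the source but absent from the target
    mismatched: List[str]  # on both sides, but the checksums differ
    extra: List[str]       # on the target only


def diff_inventories(source: Dict[str, str], target: Dict[str, str]) -> InventoryDiff:
    """Compare two path-to-checksum mappings and report every discrepancy."""
    missing = sorted(p for p in source if p not in target)
    mismatched = sorted(p for p in source if p in target and source[p] != target[p])
    extra = sorted(p for p in target if p not in source)
    return InventoryDiff(missing, mismatched, extra)


# Hypothetical inventories: one file changed and one was created after the copy ran.
source = {"/data/a": "c1", "/data/b": "c2-new", "/data/c": "c3"}
target = {"/data/a": "c1", "/data/b": "c2-old"}

result = diff_inventories(source, target)
print(result)
```

Because the source keeps changing while a copy runs, a check like this has to be repeated until the diff is empty, which is part of why manual migrations drag on.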
How long can this take?
We have seen teams struggle for months and even years depending on data volume and business requirements around acceptable application downtime, data availability and data consistency. We’ve seen companies put 8-10 people on projects, fail after 6 months, then pay $1M to a systems integrator and fail after another 9 months. OUCH.
There is a better way. And forward-looking companies like AMD, Daimler, and many others have figured it out. How? By leveraging modern technology to automate data lake migration and replication to the cloud with WANdisco LiveData Cloud Services. Why? Because of the patented Distributed Coordination Engine platform.
What is that?
This innovation is founded on fundamental IP for forming consensus in a distributed network. This is an extremely hard problem to solve, and to this day some people believe that it cannot be solved. So what is this problem at a high level? If you have a network of nodes distributed across the world, with little to no knowledge of the distance and bandwidth between them, how can you get the nodes to coordinate with one another without worrying about any failure scenarios?
The solution is the application of a consensus algorithm, and the gold standard in consensus is an algorithm called Paxos. Our Chief Scientist, Dr. Yeturu Aahlad, an expert in distributed systems, devised the first, and even now only, commercialised version of Paxos. By doing so, he solved a problem that had been puzzling computer scientists for years.
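For readers who have not met Paxos before, the sketch below shows the textbook single-decree form of the algorithm running in memory. It is an illustration only: it is not WANdisco's Distributed Coordination Engine, it has no network, no message loss and no node failures, and the class, function and value names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    promised: int = -1                 # highest proposal number promised so far
    accepted_n: int = -1               # number of the last accepted proposal
    accepted_v: Optional[str] = None   # value of the last accepted proposal

    def prepare(self, n: int) -> Optional[Tuple[int, Optional[str]]]:
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return (self.accepted_n, self.accepted_v)
        return None  # a higher-numbered proposal was already promised

    def accept(self, n: int, v: str) -> bool:
        """Phase 2: accept unless a higher-numbered promise has been made."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_v = v
            return True
        return False


def propose(acceptors: List[Acceptor], n: int, value: str) -> Optional[str]:
    """Run one proposal round; return the chosen value, or None if it failed."""
    majority = len(acceptors) // 2 + 1
    promises = [p for p in (a.prepare(n) for a in acceptors) if p is not None]
    if len(promises) < majority:
        return None
    # If any acceptor already accepted a value, adopt the highest-numbered one.
    _, prev_v = max(promises, key=lambda p: p[0])
    if prev_v is not None:
        value = prev_v
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None


nodes = [Acceptor() for _ in range(3)]
first = propose(nodes, n=1, value="write-A")
second = propose(nodes, n=2, value="write-B")  # must adopt the chosen value
stale = propose(nodes, n=1, value="write-C")   # outdated number is rejected
print(first, second, stale)
```

The second proposer asks for "write-B" but ends up confirming "write-A": once a majority has accepted a value, later rounds can only re-propose it. That property is what lets every node agree on one outcome.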
WANdisco’s LiveData Cloud Services are based on this core IP, including our products that focus on analytical data and on the challenge of migrating this data to the cloud while keeping it consistent in multiple locations.
As businesses demand that data be available in increasingly decentralized environments, the old mechanisms for providing and managing data are no longer sufficient. Moreover, the amount of data is rising exponentially, which leads to a phenomenon called data gravity. The larger the volume of data, the greater the challenge of providing it in a distributed environment, allowing changes to it in any environment, and ensuring it remains consistent across all environments. Additionally, regulatory and compliance requirements make it even more challenging for data managers to fulfil business needs.
About the author
Van Diamandakis, SVP of Marketing, WANdisco
Van is a proven Silicon Valley technology executive with over 25 years of operational experience that draws upon his track record leading global marketing transformations, driving to meaningful financial events including IPOs and acquisitions. Van has been at the forefront of B2B technology marketing and brings a unique ability to marry creativity, data, technology and leadership skills to rapidly build brand equity and successfully navigate tech companies through inflection points, accelerating revenue growth and valuation.