That data is being crunched by analytics, machine learning (ML), and artificial intelligence (AI) models to detect patterns in behavior and gather insights. Developing and maintaining an on-premises data lake to make sense of the ingested data is a complex undertaking. To maximize the value of data and use it as the basis for critical decisions, the data platform must be flexible and cost-effective.

In this post, I will outline a solution for building a hybrid data lake with Alluxio to leverage analytics and AI on Amazon Web Services (AWS) alongside a multi-petabyte on-premises data lake. Alluxio calls this approach "zero-copy" hybrid cloud: a cloud migration strategy that does not require copying data to Amazon Simple Storage Service (Amazon S3) first. The hybrid data lake approach detailed in this post allows complex on-premises data pipelines to coexist with a modern, flexible, and secure computing paradigm on AWS.

Alluxio is an AWS Advanced Technology Partner with the AWS Data & Analytics Competency that enables incremental migration of a data lake to AWS.

Solution Overview

Data platforms are being built with decoupled storage and compute to scale capacity independently.
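Alluxio sits in that decoupled layer by mounting the on-premises store into its own namespace, so compute engines on AWS read data on demand rather than waiting for a bulk copy into S3. As a minimal sketch of the idea, assuming the Alluxio CLI is installed and on the PATH (the namenode host, port, and paths below are hypothetical placeholders, not values from this post):

```python
import subprocess

# Mount an on-premises HDFS directory into the Alluxio namespace.
# Equivalent to running:
#   alluxio fs mount /mnt/onprem hdfs://namenode.internal:8020/warehouse
# After mounting, compute on AWS reads the HDFS data on demand through
# the alluxio:///mnt/onprem path, with no upfront copy to S3.
subprocess.run(
    ["alluxio", "fs", "mount",
     "/mnt/onprem",
     "hdfs://namenode.internal:8020/warehouse"],
    check=True,
)
```

Because Alluxio caches hot data close to the compute engines, repeated reads do not have to traverse the network back to the on-premises cluster.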
Applications and data catalogs remain unchanged even when hot or archival data is selectively moved from on-premises storage to Amazon S3. Once the data is in S3, it becomes available to other AWS services such as:

- Amazon SageMaker to build, train, and deploy machine learning models.
- AWS Glue with crawlers to create a data catalog for the migrated data (see the sketch at the end of this section).
- Amazon Athena and Amazon QuickSight for business analytics and visualization.

Getting Started

In the following tutorial, I will look at how to use Alluxio to bridge the gap between on-premises data and compute engines on AWS.
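To make the Glue and Athena steps from the list above concrete before the tutorial begins, here is a minimal boto3 sketch: it crawls a migrated S3 prefix into the Glue Data Catalog and then queries it in place with Athena. The crawler name, IAM role, database, table, and bucket paths are hypothetical placeholders, not values from this post:

```python
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

# Catalog the data that was migrated to S3. The crawler infers table
# schemas and registers them in the Glue Data Catalog.
glue.create_crawler(
    Name="hybrid-lake-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="hybrid_lake",
    Targets={"S3Targets": [{"Path": "s3://example-migrated-bucket/warehouse/"}]},
)
glue.start_crawler(Name="hybrid-lake-crawler")  # runs asynchronously

# Once the crawler has populated the catalog, Athena can query the
# migrated data in place; query results land in the given S3 location.
athena.start_query_execution(
    QueryString="SELECT * FROM events LIMIT 10",
    QueryExecutionContext={"Database": "hybrid_lake"},
    ResultConfiguration={
        "OutputLocation": "s3://example-migrated-bucket/athena-results/"
    },
)
```

QuickSight can then use the same Athena table as a data source for dashboards and visualization.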