Defining a DevOps strategy for a data lake demands extensive planning and coordination across multiple teams. Such a strategy typically goes through several development and test cycles before it matures enough to support a data lake in a production environment.
We accomplish this by applying the separation of concerns (SoC) design principle to data lake infrastructure and ETL jobs through dedicated source code repositories, adopting a centralized deployment model built on CDK Pipelines, and enabling ETL pipelines with the AWS CDK from the start.
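To make the centralized deployment model concrete, here is a minimal sketch in AWS CDK v2 (Python) of a self-mutating CDK pipeline sourced from a dedicated ETL repository, keeping ETL code separate from infrastructure code. The repository name and branch are hypothetical placeholders, not the exact values used in the post.

```python
# Minimal sketch: one CDK pipeline per dedicated source repository (SoC).
import aws_cdk as cdk
from aws_cdk import pipelines


class EtlPipelineStack(cdk.Stack):
    def __init__(self, scope, construct_id, **kwargs):
        super().__init__(scope, construct_id, **kwargs)

        # Self-mutating pipeline sourced from the dedicated ETL repository.
        pipelines.CodePipeline(
            self, "DataLakeEtlPipeline",
            synth=pipelines.ShellStep(
                "Synth",
                input=pipelines.CodePipelineSource.git_hub(
                    "example-org/data-lake-etl", "main"  # hypothetical repo and branch
                ),
                commands=[
                    "pip install -r requirements.txt",
                    "npx cdk synth",
                ],
            ),
        )


app = cdk.App()
EtlPipelineStack(app, "EtlPipelineStack")
app.synth()
```

Because the pipeline is self-mutating, a push to the ETL repository updates both the pipeline definition and the ETL jobs it deploys, without touching the infrastructure repository.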
For further detail, we discuss data formats, AWS Glue jobs, ETL transformation logic, data cataloging, auditing, notifications, orchestration, and data analysis in the AWS CDK Pipelines for Data Lake ETL Deployment GitHub repository.
In this post, we showed how to use CDK Pipelines to deploy the infrastructure and data processing ETL jobs of your data lake across dev, test, and production AWS environments.
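As a rough illustration of that multi-environment promotion, the sketch below adds dev, test, and production stages to a CDK pipeline, gating production behind a manual approval. The account IDs, region, and the contents of `DataLakeStage` are placeholder assumptions, not the post's exact stacks.

```python
# Sketch: promoting the same data lake stacks through dev, test, and prod.
import aws_cdk as cdk
from aws_cdk import pipelines


class DataLakeStage(cdk.Stage):
    def __init__(self, scope, construct_id, **kwargs):
        super().__init__(scope, construct_id, **kwargs)
        # Instantiate the data lake infrastructure / ETL stacks here.


def add_environments(pipeline: pipelines.CodePipeline, app: cdk.App) -> None:
    # Hypothetical account IDs; one AWS account per environment.
    for name, account in [("Dev", "111111111111"),
                          ("Test", "222222222222"),
                          ("Prod", "333333333333")]:
        stage = DataLakeStage(
            app, f"DataLake{name}",
            env=cdk.Environment(account=account, region="us-east-1"),
        )
        # Require a human sign-off before the production deployment runs.
        pre = [pipelines.ManualApprovalStep("PromoteToProd")] if name == "Prod" else []
        pipeline.add_stage(stage, pre=pre)
```

Deploying each stage into its own account keeps the environments isolated while the pipeline itself remains the single, centralized path to production.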