This project is a data pipeline illustrating some best practices in pipeline management, such as unit testing, monitoring and observability, and serverless architecture. The project is built using the AWS CDK (Python), Docker, Former2, and the AWS Console Recorder.
Ultimately, this project illustrates what is possible when leveraging the AWS CDK to combine infrastructure, CI/CD, and development into a singular, increasingly popular practice known as DataOps.
Initially, this project was developed as a single stack. As the complexity grew, I decided to break the project into several stacks with inter-stack dependencies, all governed by the constraint that every stack lives in the same account and region. Once I broke the project up, I opted to introduce a monitoring stack, which creates a CloudWatch alarm monitoring the throughput of the Kinesis Firehose created in the data pipeline stack; a rough sketch follows.
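As a minimal sketch of what that monitoring stack might look like (assuming CDK v2 for Python; the metric choice, thresholds, and the delivery_stream_name parameter handed over from the pipeline stack are illustrative, not the repo's actual code):

```python
from aws_cdk import Duration, Stack
from aws_cdk import aws_cloudwatch as cloudwatch
from constructs import Construct


class MonitoringStack(Stack):
    """Alarms on the throughput of the Firehose created in the pipeline stack."""

    def __init__(self, scope: Construct, construct_id: str,
                 delivery_stream_name: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # AWS/Firehose emits IncomingBytes per delivery stream; summing it over
        # five-minute windows approximates throughput.
        throughput = cloudwatch.Metric(
            namespace="AWS/Firehose",
            metric_name="IncomingBytes",
            dimensions_map={"DeliveryStreamName": delivery_stream_name},
            statistic="Sum",
            period=Duration.minutes(5),
        )

        # Alarm when throughput drops below the threshold, which would signal a
        # stalled pipeline (threshold and periods here are illustrative).
        throughput.create_alarm(
            self,
            "FirehoseThroughputAlarm",
            threshold=1,
            evaluation_periods=3,
            comparison_operator=cloudwatch.ComparisonOperator.LESS_THAN_THRESHOLD,
            treat_missing_data=cloudwatch.TreatMissingData.BREACHING,
        )
```

Because the delivery stream name is passed in from the data pipeline stack, CDK wires up the inter-stack dependency automatically.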
Below is the reference architecture for this project. All of this was developed over the course of a weekend. Hopefully, this illustrates some of the strengths of combining infrastructure and development into a singular practice.
Prior to configuring your virtualenv, ensure you have the invoke and poetry libraries installed globally for your Python version.
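Both are available from PyPI, so one way to install them is:
$ pip install invoke poetry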
To manually create a virtualenv on MacOS and Linux:
$ python3 -m venv .venv
After the init process completes and the virtualenv is created, you can use the following step to activate your virtualenv.
$ source .venv/bin/activate
If you are on a Windows platform, you would activate the virtualenv like this:
% .venv\Scripts\activate.bat
Once the virtualenv is activated, you can install the required dependencies.
$ pip install -r requirements.txt
$ poetry install
Ensure that the pre-commit hooks are configured using the following command:
$ inv install-hooks
Note: your git workflow will now look something like:
git add <file>
git commit
git add .
-- if there are code corrections
git commit
-- to verify that the pre-commit hooks are resolved
:q
-- to exit the message prompt and utilize the more robust command below
git cz
-- to make a descriptive commit to the repo
Configure the tasks.py file such that AWS_PROFILE is set to the AWS CLI profile you want to work out of.
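For reference, a stripped-down sketch of what tasks.py might contain, assuming the invoke library (the profile name and task bodies here are illustrative, not the repo's actual code):

```python
import os

from invoke import task

# Set this to the AWS CLI profile you want the tasks to run against.
AWS_PROFILE = "default"
os.environ["AWS_PROFILE"] = AWS_PROFILE


@task
def ls(c):
    """List all stacks in the app (invoked as `inv ls`)."""
    c.run(f"cdk ls --profile {AWS_PROFILE}")


@task
def install_hooks(c):
    """Configure the repo's pre-commit hooks (invoke exposes this as `inv install-hooks`)."""
    c.run("pre-commit install")
```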
At this point, you can determine the names of the available stacks:
$ inv ls
To deploy the stacks, just use:
$ cdk deploy --all --profile <profile name>
This will stand up all the necessary infrastructure for the CDK Data Pipeline.
To add additional dependencies, for example other CDK libraries, just use:
$ poetry add <library name>
inv ls      -- list all stacks in the app
inv synth   -- emits the synthesized CloudFormation template
inv deploy  -- deploy this stack to your default AWS account/region
inv diff    -- compare deployed stack with current state
cdk docs    -- open CDK documentation
Enjoy!