This project is about building an Airflow ETL pipeline for Sparkify. The company wants to automate and monitor its data warehousing ETL on AWS. The source data resides in S3 and needs to be processed into Sparkify's data warehouse in Amazon Redshift. The source datasets consist of JSON logs that describe user activity in the application and JSON metadata about the songs the users listen to. The company also wants data quality tests to run against the datasets after the ETL steps have been executed, to catch any discrepancies in the data.
There are two datasets that reside in S3:
- Song data: `s3://udacity-dend/song_data`
- Log data: `s3://udacity-dend/log_data`
- `Create_tables.sql`: Contains the CREATE TABLE SQL statements.
- `sparkify_dag`: Has all the imports, task templates, and task dependencies in place.
- `stage_redshift.py`: Loads data from S3 to Redshift. The operator creates and runs a SQL COPY statement based on the parameters provided (a sketch follows this list).
- `load_dimension.py`: Loads and transforms data from the staging tables to the dimension tables.
- `load_fact.py`: Loads and transforms data from the staging tables to the fact tables.
- `data_quality`: Creates the data quality operator, which is used to run checks on the data itself.
- `sql_queries.py`: Contains the SQL transformations.
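To illustrate the staging step, here is a minimal sketch of how `StageToRedshiftOperator` might build and run the COPY statement. It is not the exact project code: the constructor parameters, the Airflow 1.10-style hook imports, and the JSON format option are assumptions.

```python
from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator


class StageToRedshiftOperator(BaseOperator):
    """Copies JSON data from S3 into a Redshift staging table (sketch)."""

    copy_sql = """
        COPY {table}
        FROM '{s3_path}'
        ACCESS_KEY_ID '{access_key}'
        SECRET_ACCESS_KEY '{secret_key}'
        FORMAT AS JSON '{json_option}'
    """

    def __init__(self, redshift_conn_id="", aws_credentials_id="",
                 table="", s3_bucket="", s3_key="",
                 json_option="auto", *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.aws_credentials_id = aws_credentials_id
        self.table = table
        self.s3_bucket = s3_bucket
        self.s3_key = s3_key
        # 'auto' or the S3 path of a JSONPaths file, depending on the dataset.
        self.json_option = json_option

    def execute(self, context):
        # Read the AWS keys from the Airflow connection and open a Redshift connection.
        credentials = AwsHook(self.aws_credentials_id).get_credentials()
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)

        s3_path = "s3://{}/{}".format(self.s3_bucket, self.s3_key)
        redshift.run(StageToRedshiftOperator.copy_sql.format(
            table=self.table,
            s3_path=s3_path,
            access_key=credentials.access_key,
            secret_key=credentials.secret_key,
            json_option=self.json_option,
        ))
```

In the DAG, this operator would be instantiated twice, once for the log data and once for the song data, pointing at the corresponding S3 prefixes.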
I use Airflow to create the ETL pipeline. The data pipeline steps consist of:
- Load data from S3 to the staging tables in Amazon Redshift. For this task I created `StageToRedshiftOperator` in the `stage_redshift.py` file. The operator creates and runs a SQL COPY statement based on the parameters provided.
- Load data from the staging tables to the dimension tables. For this I created `LoadDimensionOperator`. Dimension loads are often done with the truncate-insert pattern, where the target table is emptied before the load.
- Load data from the staging tables to the fact tables. For this I created `LoadFactOperator`.
- Check the data quality. For this I created `DataQualityOperator` (see the sketch after this list). The operator's main functionality is to receive one or more SQL-based test cases along with the expected results and execute the tests. For each test, the test result is compared with the expected result; if there is no match, the operator raises an exception and the task retries and eventually fails.
- Finally, it's very important to define the task dependencies (the sketch after this list also shows an example of the wiring).
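Below is a minimal sketch of the `DataQualityOperator` behaviour described above. The `dq_checks` parameter shape (a list of dicts with `check_sql` and `expected_result`), the hook import path, and the task IDs in the trailing wiring comment are assumptions used for illustration.

```python
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator


class DataQualityOperator(BaseOperator):
    """Runs SQL test cases and compares the results against expectations (sketch)."""

    def __init__(self, redshift_conn_id="", dq_checks=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        # Each check looks like:
        # {"check_sql": "SELECT COUNT(*) FROM users WHERE userid IS NULL",
        #  "expected_result": 0}
        self.dq_checks = dq_checks or []

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for check in self.dq_checks:
            records = redshift.get_records(check["check_sql"])
            result = records[0][0]
            # On a mismatch, fail the task; Airflow retries it and
            # eventually marks it as failed.
            if result != check["expected_result"]:
                raise ValueError(
                    "Data quality check failed: {} returned {}, expected {}".format(
                        check["check_sql"], result, check["expected_result"]))
            self.log.info("Data quality check passed: %s", check["check_sql"])


# In the DAG file, the task dependencies are then wired with Airflow's
# bitshift operators, for example (task IDs are illustrative):
#   start_operator >> [stage_events, stage_songs]
#   [stage_events, stage_songs] >> load_songplays_fact
#   load_songplays_fact >> [load_users_dim, load_songs_dim, load_artists_dim, load_time_dim]
#   [load_users_dim, load_songs_dim, load_artists_dim, load_time_dim] >> run_quality_checks >> end_operator
```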
After the DAG is finished, I go to the Redshift query editor to check the data.
Query Example in Redshift query editor
To set up and run the project:
- Create an IAM user in AWS and attach the policies `AdministratorAccess`, `AmazonRedshiftFullAccess`, and `AmazonS3FullAccess`.
- Create a Redshift cluster.
- Connect Airflow and AWS (AWS credentials): run `/opt/airflow/start.sh`, click on the Admin tab, and select Connections. Then create an Amazon Web Services connection, entering the Access key in the Login field and the Secret key in the Password field, taken from the IAM user credentials.
- Connect Airflow to the AWS Redshift cluster: create a Postgres connection with the Redshift credentials.
- Run `/opt/airflow/start.sh`.
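If you prefer to script the two connections instead of creating them in the Admin > Connections page, they can also be added programmatically. This is only a sketch: the connection IDs `aws_credentials` and `redshift`, the database name, the user, and all placeholder values are assumptions; substitute your own IAM and cluster details.

```python
# Sketch: create the two Airflow connections programmatically instead of
# through the UI. Connection IDs and placeholder values are assumptions.
from airflow import settings
from airflow.models import Connection

session = settings.Session()

session.add(Connection(
    conn_id="aws_credentials",                 # Amazon Web Services connection
    conn_type="aws",
    login="<IAM_USER_ACCESS_KEY_ID>",          # Access key goes in Login
    password="<IAM_USER_SECRET_ACCESS_KEY>",   # Secret key goes in Password
))

session.add(Connection(
    conn_id="redshift",                        # Postgres connection to the Redshift cluster
    conn_type="postgres",
    host="<cluster-endpoint>.redshift.amazonaws.com",
    schema="dev",                              # database name (assumption)
    login="awsuser",                           # database user (assumption)
    password="<DB_PASSWORD>",
    port=5439,                                 # default Redshift port
))

session.commit()
```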
Created on 28/09/2022