A music streaming startup, Sparkify, has grown their user base and song database and wants to move their processes and data onto the cloud. Their data resides in S3, in a directory of JSON logs on user activity in the app, as well as a directory with JSON metadata on the songs in their app.
In this project, we build an ETL pipeline that loads this data into a data warehouse hosted on Amazon Redshift.
Datasets used in this project are provided in two public S3 buckets by Udacity; an optional way to peek at them is sketched after the list.
- Song data: s3://udacity-dend/song_data
- Log data: s3://udacity-dend/log_data
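To sanity-check the input locations before running the pipeline, you can list a few objects from the public bucket. This is a minimal, optional sketch: it assumes boto3 is installed and your AWS credentials can read the bucket, and the prefix song_data/A/A/A/ is just an example subfolder.

```python
# Optional sanity check: list a few song-data files from the public bucket.
# Assumes boto3 is installed and your AWS credentials can read udacity-dend.
import boto3

s3 = boto3.resource("s3", region_name="us-west-2")
bucket = s3.Bucket("udacity-dend")

for i, obj in enumerate(bucket.objects.filter(Prefix="song_data/A/A/A/")):
    print(obj.key)
    if i >= 4:  # only show the first few keys
        break
```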
The warehouse uses a star schema with one fact table and four dimension tables (an illustrative DDL sketch follows this list):
- songplays (fact) - records in event data associated with song plays: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
- users (dimension) - users in the app: user_id, first_name, last_name, gender, level
- songs (dimension) - songs in the music database: song_id, title, artist_id, year, duration
- artists (dimension) - artists in the music database: artist_id, name, location, latitude, longitude
- time (dimension) - timestamps of songplays records broken down into units: start_time, hour, day, week, month, year, weekday
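As a concrete illustration of the schema above, here is a minimal sketch of Redshift DDL for the songplays fact table, written as a Python SQL string so it can sit alongside the project's other queries. The column types and the DISTKEY/SORTKEY choices are assumptions for illustration, not prescribed by the project.

```python
# Illustrative DDL only: column types and distribution/sort keys are assumptions.
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INT IDENTITY(0, 1) PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL SORTKEY,
        user_id     INT NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR DISTKEY,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
"""
```

The dimension tables follow the same pattern with their columns listed above.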
Set up the dwh.cfg file (excluded from version control by .gitignore). Expected format of dwh.cfg (a sketch of how the scripts read it follows the block):
[CLUSTER]
HOST=''
DB_NAME='dev'
DB_USER='awsuser'
DB_PASSWORD='Passw0rd'
DB_PORT=5439
[IAM_ROLE]
ARN='IAM Role arn'
[S3]
LOG_DATA='s3://udacity-dend/log_data'
LOG_JSONPATH='s3://udacity-dend/log_json_path.json'
SONG_DATA='s3://udacity-dend/song_data'
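A minimal sketch of how create_tables.py and etl.py can read dwh.cfg and open a connection to the cluster, assuming psycopg2 is installed. Section and key names match the file format above, and the quoted values pass straight through into a libpq-style connection string.

```python
# Read dwh.cfg and connect to the Redshift cluster (sketch, assumes psycopg2).
import configparser
import psycopg2

config = configparser.ConfigParser()
config.read("dwh.cfg")

# The [CLUSTER] keys are ordered host, dbname, user, password, port.
conn = psycopg2.connect(
    "host={} dbname={} user={} password={} port={}".format(*config["CLUSTER"].values())
)
cur = conn.cursor()
```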
Run create_tables.py to (re)create the tables, then etl.py to load the data; a sketch of the staging load is shown below.
$ python create_tables.py
$ python etl.py
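For intuition, here is a sketch (not the project's exact code) of the first step etl.py performs: a COPY that bulk-loads the raw JSON logs from S3 into a staging table, using the paths and IAM role from dwh.cfg. The staging_events table name and the us-west-2 region are assumptions for this sketch.

```python
# Sketch of the staging load in etl.py. Assumes a staging_events table already
# exists (created by create_tables.py) and that dwh.cfg is filled in as above.
import configparser
import psycopg2

config = configparser.ConfigParser()
config.read("dwh.cfg")

copy_events = """
    COPY staging_events FROM {}
    IAM_ROLE {}
    REGION 'us-west-2'
    FORMAT AS JSON {};
""".format(
    config.get("S3", "LOG_DATA"),      # 's3://udacity-dend/log_data'
    config.get("IAM_ROLE", "ARN"),     # role that lets Redshift read from S3
    config.get("S3", "LOG_JSONPATH"),  # maps log JSON fields to table columns
)

conn = psycopg2.connect(
    "host={} dbname={} user={} password={} port={}".format(*config["CLUSTER"].values())
)
with conn, conn.cursor() as cur:
    # The star-schema tables are then filled with INSERT ... SELECT from staging.
    cur.execute(copy_events)
```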
Starter Template by Udacity
Reference: AWS Redshift documentation