The delivery of the dataset has been greatly appreciated by Amazing Prime as a result would like us to implement a code to receive updates daily. Assisting Britta, we created an automated pipeline that takes in new data, performs the appropriate transformations, and loads the data into existing tables. Refactoring the code from this module proved necessary and the creation of one function that takes in the three files, Wikipedia data, Kaggle metadata, and the MovieLens, rating data and performs the ETL process by adding the data to a PostgreSQL database.
Wikipedia holds up its’ reputation on maintaining an enormous amount of information, including budgets and box office returns, cast and crew, production, and distribution, and so much more. Previously, Britta's coworkers created a script to go through a list of movies on Wikipedia from 1990 to 2018 and extract the data from the sidebar into a JSON. Unfortunately, her coworker can't find the script anymore and just has the JSON file. We'll need to load in the JSON file into a Pandas DataFrame.
• Products developed: ETL Deliverable 1, ETL Deliverable 2 and ETL Deliverable 3
• Data Tools: Jupyter Notebook, PostgreSQL, pgAdmin
• Software: pgAdmin 4.26, Python 3.8.3, Visual Studio Code 1.50.0, Flask Version 1.0.2
- The function converts the Wikipedia JSON file to a Pandas DataFrame, and the DataFrame is displayed in the ETL_function_test.ipynb file.
- An ETL function is written to read in the three data files.
- The function converts the Kaggle metadata file to a Pandas DataFrame, and the DataFrame is displayed in the ETL_function_test.ipynb file.
- The function converts the MovieLens ratings data file to a Pandas DataFrame, and the DataFrame is displayed in the ETL_function_test.ipynb file. Check the wiki_movies_df DataFrame:
- The TV shows are filtered out, and the wiki_movies_df DataFrame is created
- A try-except block is used to catch errors while extracting the IMDb IDs with a regular expression and dropping duplicate IDs.
- The extraction and transformation of the Wikipedia data in the ETL function does the following:
- The cleaned Wikipedia data is converted to a Pandas DataFrame, and the DataFrame is displayed in the ETL_clean_wiki_movies.ipynb file.
- The extraction and transformation of the Kaggle metadata using the ETL function.
- The extraction and transformation of the MovieLens ratings data using the ETL function. Transform and merge the ratings DataFrame:
3.The movies_with_ratings_df and the movies_df DataFrames are displayed in the ETL_clean_kaggle_data.ipynb file as well as below:
- The data from the movies_df DataFrame replaces the current data in the movies table in the SQL database, as determined by the movies_query.png.
- The data from the MovieLens rating CSV file is added to the ratings table in the SQL database, as determined by the ratings_query.png.
3.The elapsed time to add the data to the database is displayed in the ETL_create_database.ipynb file. The elapsed time: