Climate Change 1995-2020 (ETL Project Group 8)

Background

This repository contains a group project that revolved around the concept of ETL (Extracting, Transforming and Loading). We were given a week to select 2 or more sources of data and work closely with our team through all phases of the ETL process. After looking through various datasets, our group decided upon doing data analysis on climate changes over time in the US in the last 25 years (1995-2020).

Data Sources

Kaggle Dataset: Daily Temperature of Major Cities

https://www.kaggle.com/sudalairajkumar/daily-temperature-of-major-cities

Environmental Protection Agency Pre-Generated Data Files

https://aqs.epa.gov/aqsweb/airdata/download_files.html#Annual

Extract

Original Data Sources and How Our Data Was Formatted

Our original data came from two sources. The first, was a Kaggle dataset documenting Daily Temperature of Major Cities. The second was a collection of Pre-Generated Data files of Air Data from the US Environmental Protection Agency (EPA)'s website. You can find both of these sources in the Resources folder of our repository under the Daily Temperature of Major Cities and EPA_annual_conc_by_monitor folders respectively. Each folder contains a zip of the data as well as a link which takes you to the source's page online.

Transform

Data Cleaning/Transformation That Was Required

While there was a lot of data in the Kaggle dataset concerning cities around the world, our team figured the best way to utilize both datasets was to focus solely on the US.

We first took both datasets and cleaned them up via separate Jupyter Notebooks. In the us_temp_analysis notebook, we used Pandas with the Daily Temperature of Major Cities data and created a dataframe with only the US data rows via the Country column. That dataframe was then exported as a CSV called US_Temp_1995_2020.csv. The CSV has the following columns:

TempReading
Region
Country
State
City
Month
Day
Year
AvgTemperature

As a reference, we pulled the State and City columns to create another csv using a dataframe that contained a list of all the unique cities in the US_Temp_1995_2020.csv called US_Temp_City_List.csv.

For the us_epa_analysis notebook, we loaded all of the individual files from 1995-2020. Then we extracted each of the 4 readings in the file into individual dataframes for each reading and pulled the following columns:

Parameter Name
Year
Sample Duration
Pollutant Standard
Metric Used
Method Name
Units of Measure
Max Value
Max DateTime
State Name
County Name
City Name

You can find each reading here: First, Second, Third, Fourth.

Load

Our Final Database, Tables/Collections, Why This Was Chosen

In the final database, we pulled the US Temp Analysis data from Excel and transferred by each individual cell, read each cell for a row, combined it into rows and then imported the files into the database as two separate segments.

As for the US EPA Analysis data, we imported all of the excel documents into Pandas and then loaded it as an "insert many" into MongoDB.

Below is a picture representation of the MongoDB ClimateData:

Analysis

In order to demonstrate the data we placed into MongoDB, data was extracted from the EPA Dataset for ozone readings for Indianapolis, IN. Readings were averaged for each year and a line chart was created showing the parts per million ozone readings over time.

Below is an image of the line chart created from the data:

We found that the ozone parts per million(ppm) varies over the years, but stays within a range of roughly .075 to .100 ppm. This does not indicate an increasing or decreasing trend in the average ozone levels over time.

We also extracted data from the temperature dataset for temperature readings in Indianapolis, IN. These readings were also averaged for each year and a line chart was created showing the variance in temperature readings over time.

Here is an image of the temperature line chart created from the data:

It appears there is a slight positive trend in the average temperature. While it varies from year to year even a few degrees difference can cause drastic effects on sensitive organisms.

Name		Name	Last commit message	Last commit date
Latest commit History 62 Commits
Images		Images
Resources		Resources
.gitignore		.gitignore
README.md		README.md
Upload_Data.ipynb		Upload_Data.ipynb
final_analysis.ipynb		final_analysis.ipynb
initial_analysis.ipynb		initial_analysis.ipynb
us_epa_analysis.ipynb		us_epa_analysis.ipynb
us_temp_analysis.ipynb		us_temp_analysis.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Climate Change 1995-2020 (ETL Project Group 8)

Background

Data Sources

Extract

Original Data Sources and How Our Data Was Formatted

Transform

Data Cleaning/Transformation That Was Required

Load

Our Final Database, Tables/Collections, Why This Was Chosen

Analysis

About

Releases

Packages

Contributors 3

Languages

doublealogic/Climate-Change-ETL-Analysis

Folders and files

Latest commit

History

Repository files navigation

Climate Change 1995-2020 (ETL Project Group 8)

Background

Data Sources

Extract

Original Data Sources and How Our Data Was Formatted

Transform

Data Cleaning/Transformation That Was Required

Load

Our Final Database, Tables/Collections, Why This Was Chosen

Analysis

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages