This module was an introduction to the fundamentals of data analysis, including the acquisition, cleaning and exploration of data sets. As part of the assessment we had to complete four tasks outlined below.
- Readme: README.md
- Jupyter Notebook: project2020.ipynb
- PDF of Jupyter Notebook: project2020.pdf
- Images Folder: images
- Power Production CSV: powerproduction.csv
If you have any issues viewing project2020.ipynb on GitHub, you can use Jupyter NBViewer, the web application behind the Jupyter Notebook Viewer, at https://nbviewer.jupyter.org/
Jupyter is a free, open-source, interactive web tool known as a computational notebook, which researchers can use to combine software code, computational output, explanatory text and multimedia resources in a single document. (Source:downloads.com)
provides you with an alternative to the Windows default command prompt utility through a more capable console emulator that also comes packing a good-looking graphical user interface. (Source:downloads.com)
Problem statement: In this project we had to perform and explain simple linear regression using Python on the powerproduction dataset available on Moodle.
The goal is to accurately predict wind turbine power output from wind speed values using the data set as a basis. Our submission had to be in the form of a git repository containing, at a minimum, the following items:
- Jupyter notebook that performs simple linear regression on the data set.
- In that notebook, an explanation of your regression and an analysis of its accuracy.
- Standard items in a git repository such as a README.
To enhance our submission, we had to consider comparing simple linear regression to other types of regression on this data set.
The project is divided into nine sections outlined below.
1.0 Wind Energy
2.0 Simple Linear Regression
3.0 Importing the required libraries
4.0 Creating dataframe
5.0 Data exploration
6.0 Summary statistics
7.0 Simple linear regression algorithm
8.0 Polynomial regression algorithm
9.0 References
1.0 Wind Energy: This section of my project provides an outline of energy demand in the world and how wind energy contributes to meeting it. I also provide some facts and figures on the contribution of wind energy both in Ireland and the wider world, along with the criteria applied when choosing locations for wind turbines.
Learnings: When researching wind energy and wind turbines for the project I gained a new appreciation for energy demand and the factors influencing the location of wind turbines. I also learned to download and add additional functionality in Jupyter Notebooks, including using the add-ons "Autopep8", "spellchecker" and "Table of Contents(2)". Autopep8 is excellent for formatting your code correctly, while spellchecker comes in handy in correcting those pesky spelling mistakes.
Topics: research, wind turbines, energy supply and demand.
2.0 Simple Linear Regression: This section covers linear regression and the equation used to perform it.
Learnings: linear regression, Scikit-Learn.
Topics: regression, linear regression.
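As a rough illustration of the equation behind simple linear regression, a straight line y = m*x + c can be fitted by least squares. The sketch below uses NumPy's polyfit on made-up values (not the powerproduction data), and the variable names are assumptions chosen for the example.

```python
import numpy as np

# Made-up illustrative values, not taken from powerproduction.csv.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])  # these points lie on y = 2x + 1

# A degree-1 polyfit returns the least-squares slope (m) and intercept (c).
m, c = np.polyfit(x, y, 1)

# Predict y for a new x value using y = m*x + c.
y_new = m * 6.0 + c
print(f"slope={m:.2f}, intercept={c:.2f}, prediction at x=6: {y_new:.2f}")
```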
3.0 Importing the required libraries: This section covers importing the libraries used for this project. I used Pandas for importing and manipulating the dataset, NumPy for generating arrays, and Matplotlib and Seaborn for data visualisation. I also used Scikit-Learn for its train_test_split() function and its regression models.
Learnings: I learned two things in this section. I got an understanding of Scikit-Learn and a way of making plots clearer by using the magic command %config InlineBackend.figure_format = 'retina'.
Topics: NumPy, Pandas, Scikit-Learn, retina display magic command.
4.0 Creating dataframe: This is where I read in the data from the powerproduction.csv and create a dataframe.
Learnings: Dataframe creation.
Topics: Pandas, dataframes.
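A minimal sketch of the dataframe creation step, assuming the CSV has a header row with the two columns described in section 5.0. In the notebook the file would be read with pd.read_csv("powerproduction.csv"); here a few made-up rows stand in for the file so the example runs on its own.

```python
import io

import pandas as pd

# Made-up stand-in rows; in the notebook this would be
#   df = pd.read_csv("powerproduction.csv")
csv_text = "speed,output\n0.0,0.0\n5.0,4.2\n10.0,21.7\n"

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
```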
5.0 Data exploration: In this section I explore the powerproduction.csv dataset. There are only two columns in the dataset, speed and output. Both are floats, and there are 500 records in the dataset with no null values. I also provide some initial observations, which are aligned with the research I carried out on wind turbine technology.
Learnings: What I took away from this section was the importance of taking your time and carrying out detailed research on the topic the data relates to, in order to better understand the data and its attributes.
Topics: Research, wind turbines, Pandas methods.
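The checks described in this section (shape, column types and null values) can be sketched with a few Pandas methods. The rows below are made up for illustration, and the variable names are assumptions rather than the ones used in the notebook.

```python
import pandas as pd

# Made-up stand-in rows; the notebook works on powerproduction.csv.
df = pd.DataFrame({
    "speed": [0.0, 5.0, 10.0, 15.0, 20.0],
    "output": [0.0, 4.2, 21.7, 78.5, 98.3],
})

n_rows, n_cols = df.shape        # number of records and columns
column_types = df.dtypes         # data type of each column
null_counts = df.isnull().sum()  # missing values per column

print(n_rows, n_cols)
print(column_types)
print(null_counts)
```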
6.0 Summary statistics: This section creates some summary statistics and plots the data to see what it looks like. From this visualisation it was clear that fitting a linear regression to it would be difficult and that an alternative type would need to be found.
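A short sketch of how summary statistics might be produced with Pandas. The data here is synthetic: an S-shaped curve is assumed purely for illustration and is not the real powerproduction data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumed S-shaped curve, not the real dataset).
speed = np.linspace(0, 25, 50)
output = 100 / (1 + np.exp(-(speed - 12.5) / 2))
df = pd.DataFrame({"speed": speed, "output": output})

# describe() reports count, mean, std, min, quartiles and max per column.
summary = df.describe()
print(summary)

# In a notebook, df.plot.scatter(x="speed", y="output") would plot the data
# (this requires Matplotlib).
```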
7.0 Simple linear regression algorithm: Section 7 provides an explanation of the linear regression I carried out on the dataset, covering data preparation, training the algorithm, making predictions and then evaluating them.
Learnings: I found linear regression easier to understand and implement than polynomial regression. However, getting a grasp of evaluating the outcome was a struggle.
Topics: Scikit-Learn, train_test_split(), LinearRegression, polyfit, intercept, slope, mean absolute error, mean squared error, root mean squared error.
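The steps in this section (prepare the data, train, predict, evaluate) can be sketched with Scikit-Learn as follows. This is an illustration under stated assumptions, not the notebook's actual code: the data is synthetic, and the 80/20 split and random_state value are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumed S-shaped curve plus noise).
rng = np.random.default_rng(0)
speed = np.linspace(0, 25, 200)
output = 100 / (1 + np.exp(-(speed - 12.5) / 2)) + rng.normal(0, 3, speed.size)

# Scikit-Learn expects a 2-D feature array, so reshape to one column.
X = speed.reshape(-1, 1)
y = output

# Hold back 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train the model and predict on the unseen rows.
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluate with the three error measures listed in the topics.
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f"slope={model.coef_[0]:.3f}, intercept={model.intercept_:.3f}")
print(f"MAE={mae:.3f}, MSE={mse:.3f}, RMSE={rmse:.3f}")
```

All three measures are in terms of prediction error on the held-back rows, so lower values indicate a better fit. RMSE is in the same units as the output column, which makes it the easiest of the three to interpret.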
8.0 Polynomial regression algorithm: Section 8 provides an explanation of the polynomial regression I carried out on the dataset, covering data preparation, training the algorithm, making predictions and then evaluating them.
Learnings: As stated earlier, I found polynomial regression harder to understand and implement.
Topics: Scikit-Learn, train_test_split(), Polynomial, PolynomialFeatures, LinearRegression.
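A minimal sketch of polynomial regression with Scikit-Learn, compared against a straight-line fit on the same split. The data is synthetic, and the degree of 3 is an assumption made for illustration; the notebook may use a different degree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumed S-shaped curve plus noise).
rng = np.random.default_rng(0)
speed = np.linspace(0, 25, 200)
output = 100 / (1 + np.exp(-(speed - 12.5) / 2)) + rng.normal(0, 3, speed.size)

X = speed.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, output, test_size=0.2, random_state=0
)

# Expand the single feature x into [x, x^2, x^3] (degree 3 is an assumption).
poly = PolynomialFeatures(degree=3, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Polynomial regression is linear regression on the expanded features.
poly_model = LinearRegression().fit(X_train_poly, y_train)
rmse_poly = np.sqrt(mean_squared_error(y_test, poly_model.predict(X_test_poly)))

# Straight-line fit on the same split, for comparison.
lin_model = LinearRegression().fit(X_train, y_train)
rmse_lin = np.sqrt(mean_squared_error(y_test, lin_model.predict(X_test)))

print(f"linear RMSE={rmse_lin:.3f}, polynomial RMSE={rmse_poly:.3f}")
```

The transformer is fitted on the training rows only and then reused on the test rows, so both sets are expanded in the same way.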