Harry Potter and a Data Scientist

Subject PDF:

Cook Book:

DSLR (Datascience X Logistic Regression)

On no! Since its creation, the famous school of wizards, Hogwarts, had never known such an offense. The forces of evil have bewitched the Sorting Hat. It no longer responds, and is unable to fulfill his role of sorting the students to the houses.

The new academic year is approaching. Gladly, the Professor McGonagall was able to take action in such a stressful situation, since it is impossible for Hogwarts not to welcome new students. . . She decided to call on you, a muggle "datascientist" who is able to create miracles with the tool which all muggles know how to use: a "computer". Despite the intrinsic reluctance of many wizards, the director of the school welcomes you to his office to explain the situation. You are here because his informant discovered that you are able to recreate a magic Sorting Hat using your muggle tools.

You explain to him that in order for your "muggle" tools to work, you need students data. Hesitantly, Professor McGonagall gives you a dusty spellbook. Fortunately for you, a simple "Digitalis!" and the book turned into a USB stick.

Data Visualization

Histogram

Which Hogwarts course has a homogeneous score distribution between all four houses ?

python3 histogram.py -d

-d: Display all the histograms.
-f: Show histogram of the feature 'n'.

Scatter plot

What are the two features that are similar ?

python3 scatter_plot.py [-f1{n1} -f2{n2}]

-f1: precise the first feature to use.
-f2: precise the second feature to use.
n1 and n2: index of the features to use

Pair plot

python3 pair_plot.py

Data Analysis:

Some features are homogenous or coherant with other ones, so there existance is not necessary for training the model and can give use= a complex hypothesis that will cause 'Overfitting' Our choice was to remove:

Arithmancy: Homogenous
Astronomy: Similar to 'Defense Against the Dark Arts'
Transfiguration: Semi similar to 'History of Magic'
Potions: Semi homogenous
Care of Magical Creatures: Semi homogenous

Training the model

`python3 logreg_train.py [-BGD | -SGD] <_train dataset_>`

-BGD: Batch Gradient Descent Algorithm
-SGD: Stochastic Gradient Descent Algorithm

Output a file named: ./weights.csv that contain the weights of the model.

In the end of the training the program output the: (using the training set)

Accuracy of the model 98.06%
Confusion Matrix
F1 Score
Balanced Accuracy 98.71%

Predict with the model

`python3 logreg_predict.py [-p] <_dataset_> <_weights_>`

-p: Print the result with the students names in the stdout

Output a file named: ./houses.csv that contain the Indexs and the predicted house affected to the students.

Packages needed

pip3 install pandas
pip3 install matplotlib
pip3 install seaborn

Grade

✔️ 125 [ Accuracy: (training data: 98.06%) (evaluation data: 99%) ]
Miss McGonagall is very happy for the results 🎉🥳

Owners:

Oussama Belouche 1337
Anas Elouargui 1337

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Harry Potter and a Data Scientist

Subject PDF:

Cook Book:

DSLR (Datascience X Logistic Regression)

Data Visualization

Histogram

python3 histogram.py -d

Scatter plot

python3 scatter_plot.py [-f1{n1} -f2{n2}]

Pair plot

python3 pair_plot.py

Data Analysis:

Training the model

`python3 logreg_train.py [-BGD | -SGD] <_train dataset_>`

Predict with the model

`python3 logreg_predict.py [-p] <_dataset_> <_weights_>`

Packages needed

Grade

Owners:

Files

README.md

Latest commit

History

README.md

File metadata and controls

Harry Potter and a Data Scientist

Subject PDF:

Cook Book:

DSLR (Datascience X Logistic Regression)

Data Visualization

Histogram

python3 histogram.py -d

Scatter plot

python3 scatter_plot.py [-f1{n1} -f2{n2}]

Pair plot

python3 pair_plot.py

Data Analysis:

Training the model

python3 logreg_train.py [-BGD | -SGD] <_train dataset_>

Predict with the model

python3 logreg_predict.py [-p] <_dataset_> <_weights_>

Packages needed

Grade

Owners:

`python3 logreg_train.py [-BGD | -SGD] <_train dataset_>`

`python3 logreg_predict.py [-p] <_dataset_> <_weights_>`