Homework

Solution:

Note: sometimes your answer doesn't match one of the options exactly. That's fine. Select the option that's closest to your solution.

Dataset

In this homework, we will use the California Housing Prices data from Kaggle.

Here's a wget-able link:

wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv

We'll keep working with the 'median_house_value' variable, and we'll turn it into the target of a classification task.

Features

For the rest of the homework, you'll need to use only these columns:

  • 'latitude',
  • 'longitude',
  • 'housing_median_age',
  • 'total_rooms',
  • 'total_bedrooms',
  • 'population',
  • 'households',
  • 'median_income',
  • 'median_house_value',
  • 'ocean_proximity',

Data preparation

  • Select only the features from above and fill in the missing values with the median of the corresponding column.
  • Create a new column rooms_per_household by dividing the column total_rooms by the column households.
  • Create a new column bedrooms_per_room by dividing the column total_bedrooms by the column total_rooms.
  • Create a new column population_per_household by dividing the column population by the column households.
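
The preparation steps above can be sketched with pandas as follows. This is a minimal sketch, not the official solution: the helper name prepare and the three-row toy frame are hypothetical stand-ins for the real housing.csv, and the median fill is restricted to numeric columns on the assumption that only those have gaps.

```python
import numpy as np
import pandas as pd

COLUMNS = [
    "latitude", "longitude", "housing_median_age", "total_rooms",
    "total_bedrooms", "population", "households", "median_income",
    "median_house_value", "ocean_proximity",
]


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Select the homework columns, fill gaps, and add the ratio features."""
    df = df[COLUMNS].copy()
    # A median only exists for numeric columns, so restrict the fill to those.
    numeric = df.select_dtypes(include="number").columns
    df[numeric] = df[numeric].fillna(df[numeric].median())
    df["rooms_per_household"] = df["total_rooms"] / df["households"]
    df["bedrooms_per_room"] = df["total_bedrooms"] / df["total_rooms"]
    df["population_per_household"] = df["population"] / df["households"]
    return df


# Hypothetical toy rows standing in for pd.read_csv("housing.csv").
toy = pd.DataFrame({
    "latitude": [37.9, 34.1, 36.5],
    "longitude": [-122.2, -118.3, -119.8],
    "housing_median_age": [41, 21, 30],
    "total_rooms": [500, 1000, 900],
    "total_bedrooms": [100.0, np.nan, 300.0],
    "population": [300, 400, 900],
    "households": [100, 200, 300],
    "median_income": [8.3, 3.1, 2.5],
    "median_house_value": [450000.0, 200000.0, 100000.0],
    "ocean_proximity": ["NEAR BAY", "<1H OCEAN", "INLAND"],
})
prepared = prepare(toy)
print(prepared[["total_bedrooms", "rooms_per_household"]])
```

With the real data, the same function would be applied to the frame returned by pd.read_csv on the downloaded file.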

Question 1

What is the most frequent observation (mode) for the column ocean_proximity?

Options:

  • NEAR BAY
  • <1H OCEAN
  • INLAND
  • NEAR OCEAN
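
One way to look up the mode of a categorical column is sketched below. The category labels cat_a, cat_b and cat_c are deliberately made up, so the toy counts say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy values; replace with df["ocean_proximity"] from the real data.
ocean_proximity = pd.Series(
    ["cat_a", "cat_b", "cat_a", "cat_c"], name="ocean_proximity"
)

# value_counts() shows the full frequency table, sorted by count.
counts = ocean_proximity.value_counts()

# mode() returns a Series (there can be ties); take the first entry.
most_frequent = ocean_proximity.mode()[0]
print(counts)
print(most_frequent)
```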

Question 2

  • Create the correlation matrix for the numerical features of your train dataset.
    • In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
  • What are the two features that have the biggest correlation in this dataset?

Options:

  • total_bedrooms and households
  • total_bedrooms and total_rooms
  • population and households
  • population_per_household and total_rooms
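
A possible way to find the most strongly correlated pair is to take the absolute correlation matrix, mask out the diagonal and the duplicated lower triangle, and pick the maximum. The helper name top_correlated_pair and the toy columns x, y and z below are hypothetical; using the absolute value is an assumption about what "biggest" means here.

```python
import numpy as np
import pandas as pd


def top_correlated_pair(df: pd.DataFrame):
    """Return (feature_a, feature_b, |correlation|) for the strongest pair."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only entries strictly above the diagonal, so every pair
    # appears once and self-correlations (always 1.0) are excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    feature_a, feature_b = pairs.idxmax()
    return feature_a, feature_b, float(pairs.max())


# Hypothetical toy frame: y is almost a multiple of x, z is unrelated.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 6.0, 8.0, 10.5],
    "z": [5.0, 1.0, 4.0, 2.0, 3.0],
})
a, b, value = top_correlated_pair(toy)
print(a, b, round(value, 3))
```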

Make median_house_value binary

  • We need to turn the median_house_value variable from numeric into binary.
  • Let's create a variable above_average which is 1 if the median_house_value is above its mean value and 0 otherwise.
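
The binarization described above amounts to one comparison against the column mean. A minimal sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical toy values standing in for the real column.
df = pd.DataFrame({"median_house_value": [100000.0, 200000.0, 450000.0]})

mean_value = df["median_house_value"].mean()
# The comparison yields booleans; astype(int) maps True/False to 1/0.
df["above_average"] = (df["median_house_value"] > mean_value).astype(int)
print(df)
```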

Split the data

  • Split your data into train/val/test sets, with a 60%/20%/20% distribution.
  • Use Scikit-Learn for that (the train_test_split function) and set the seed to 42.
  • Make sure that the target value (median_house_value) is not in your dataframe.
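
Since train_test_split produces only two parts per call, a common approach is to call it twice: first hold out 20% for test, then take 25% of the remaining 80% (which is 20% of the total) for validation. The sketch below assumes that approach; the helper name split_60_20_20 and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_60_20_20(df: pd.DataFrame, seed: int = 42):
    """Split df into train/val/test parts of 60%/20%/20%."""
    df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=seed)
    # 0.25 of the remaining 80% equals 20% of the original data.
    df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=seed)
    return [part.reset_index(drop=True) for part in (df_train, df_val, df_test)]


# Hypothetical toy frame with 100 rows.
toy = pd.DataFrame({
    "median_income": np.linspace(1.0, 10.0, 100),
    "median_house_value": np.linspace(50_000.0, 500_000.0, 100),
})
toy["above_average"] = (
    toy["median_house_value"] > toy["median_house_value"].mean()
).astype(int)

df_train, df_val, df_test = split_60_20_20(toy)

# Pull the binary target out of each part, then drop the numeric original
# so it cannot leak into the features.
y_train, y_val, y_test = (
    part.pop("above_average").to_numpy() for part in (df_train, df_val, df_test)
)
for part in (df_train, df_val, df_test):
    del part["median_house_value"]
```

If the original median_house_value is needed later (Question 6 uses it), it can be saved per split before the deletion.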

Question 3

  • Calculate the mutual information score between above_average and ocean_proximity. Use the training set only.
  • Round it to 2 decimals using round(score, 2).
  • What is their mutual information score?

Options:

  • 0.26
  • 0
  • 0.10
  • 0.16
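
Scikit-Learn's mutual_info_score accepts two label sequences, so it can be applied directly to the target and the categorical column. The toy frame below uses made-up categories in which the category fully determines the target; the resulting score is unrelated to the real answer.

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

# Hypothetical toy training set: cat_a rows are all 1, cat_b rows are all 0.
df_train = pd.DataFrame({
    "ocean_proximity": ["cat_a", "cat_a", "cat_b", "cat_b"],
    "above_average": [1, 1, 0, 0],
})

score = mutual_info_score(df_train["above_average"], df_train["ocean_proximity"])
rounded = round(score, 2)
print(rounded)
```

The score is measured in nats (natural logarithm), so a perfectly informative balanced binary split like the one above gives ln 2.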

Question 4

  • Now let's train a logistic regression model.
  • Remember that we have one categorical variable ocean_proximity in the data. Include it using one-hot encoding.
  • Fit the model on the training dataset.
    • To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    • model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
  • Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

Options:

  • 0.60
  • 0.72
  • 0.84
  • 0.95
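
The steps above can be sketched as follows. One-hot encoding is done here with DictVectorizer, which is one of several options (OneHotEncoder or pd.get_dummies would also work), so treat that choice as an assumption. The data is a randomly generated toy set with made-up categories, and its accuracy has no bearing on the real answer.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical toy data: the target depends only on median_income.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "median_income": rng.uniform(0.0, 10.0, size=n),
    "ocean_proximity": rng.choice(["cat_a", "cat_b", "cat_c"], size=n),
})
y = (toy["median_income"] > 5.0).astype(int).to_numpy()

df_train, df_val = toy.iloc[:150], toy.iloc[150:]
y_train, y_val = y[:150], y[150:]

# DictVectorizer one-hot encodes string values and passes numbers through.
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(df_train.to_dict(orient="records"))
X_val = dv.transform(df_val.to_dict(orient="records"))  # transform only, no refit

model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

accuracy = round(accuracy_score(y_val, model.predict(X_val)), 2)
print(accuracy)
```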

Question 5

  • Let's find the least useful feature using the feature elimination technique.
  • Train a model with all these features (using the same parameters as in Q4).
  • Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
  • For each feature, calculate the difference between the original accuracy and the accuracy without the feature.
  • Which of the following features has the smallest difference?
    • total_rooms
    • total_bedrooms
    • population
    • households

Note: the difference doesn't have to be positive.
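
The elimination loop can be sketched as below. The helper names train_and_score and elimination_differences, and the toy features signal and noise, are hypothetical. The sketch reports the signed differences; whether "smallest" should be read as signed or absolute is left open, and the last line shows the absolute-value reading as one possible interpretation.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def train_and_score(df_train, y_train, df_val, y_val, features):
    """Fit the Q4 model on the given features and return validation accuracy."""
    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(df_train[features].to_dict(orient="records"))
    X_val = dv.transform(df_val[features].to_dict(orient="records"))
    model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    return accuracy_score(y_val, model.predict(X_val))


def elimination_differences(df_train, y_train, df_val, y_val, features):
    """Return baseline accuracy and, per feature, baseline minus accuracy without it."""
    baseline = train_and_score(df_train, y_train, df_val, y_val, features)
    diffs = {}
    for feature in features:
        reduced = [f for f in features if f != feature]
        diffs[feature] = baseline - train_and_score(df_train, y_train, df_val, y_val, reduced)
    return baseline, diffs


# Hypothetical toy data: "signal" determines the target, "noise" does not.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "signal": rng.uniform(0.0, 10.0, size=200),
    "noise": rng.normal(size=200),
})
y = (toy["signal"] > 5.0).astype(int).to_numpy()

baseline, diffs = elimination_differences(
    toy.iloc[:150], y[:150], toy.iloc[150:], y[150:], ["signal", "noise"]
)
# One possible reading of "smallest difference": smallest absolute value.
least_useful = min(diffs, key=lambda f: abs(diffs[f]))
print(baseline, diffs, least_useful)
```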

Question 6

  • For this question, we'll see how to use a linear regression model from Scikit-Learn.
  • We'll need to use the original column 'median_house_value'. Apply the logarithmic transformation to this column.
  • Fit the Ridge regression model (model = Ridge(alpha=a, solver="sag", random_state=42)) on the training data.
  • This model has a parameter alpha. Let's try the following values: [0, 0.01, 0.1, 1, 10]
  • Which of these alphas leads to the best RMSE on the validation set? Round your RMSE scores to 3 decimal digits.

If there are multiple options, select the smallest alpha.

Options:

  • 0
  • 0.01
  • 0.1
  • 1
  • 10
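
A minimal sketch of the alpha search follows. It uses np.log1p for the logarithmic transformation, which is an assumption (np.log would be the other natural choice), and a randomly generated toy dataset with two numeric features instead of the real one. Ties on the rounded RMSE are broken in favour of the smallest alpha, as the task asks.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Hypothetical toy data: prices are log-linear in two standardized features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
price = np.exp(12.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.1, size=300))

y = np.log1p(price)  # log(1 + price); assumed form of the log transformation
X_train, X_val = X[:200], X[200:]
y_train, y_val = y[:200], y[200:]

alphas = [0, 0.01, 0.1, 1, 10]
scores = {}
for a in alphas:
    model = Ridge(alpha=a, solver="sag", random_state=42)
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
    scores[a] = round(float(rmse), 3)

# Sort key (rmse, alpha): lowest rounded RMSE first, smallest alpha on ties.
best_alpha = min(alphas, key=lambda a: (scores[a], a))
print(scores, best_alpha)
```

On the real data, the categorical column would first need the same one-hot encoding as in Question 4, and the "sag" solver may emit convergence warnings if the features are on very different scales.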

Submit the results

  • Submit your results here: https://forms.gle/vQXAnQDeqA81HSu86
  • You can submit your solution multiple times. In this case, only the last submission will be used.
  • If your answer doesn't match the options exactly, select the closest one.

Deadline

The deadline for submitting is 26 September (Monday), 23:00 CEST.

After that, the form will be closed.