Solution:
Note: sometimes your answer doesn't match one of the options exactly. That's fine. Select the option that's closest to your solution.
In this homework, we will use the California Housing Prices data from Kaggle.
Here's a wget-able link:
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv
We'll keep working with the 'median_house_value' variable, and we'll transform it to a classification task.
For the rest of the homework, you'll need to use only these columns:
- 'latitude'
- 'longitude'
- 'housing_median_age'
- 'total_rooms'
- 'total_bedrooms'
- 'population'
- 'households'
- 'median_income'
- 'median_house_value'
- 'ocean_proximity'
- Select only the features from above and fill in the missing values with median.
- Create a new column rooms_per_household by dividing the column total_rooms by the column households from the dataframe.
- Create a new column bedrooms_per_room by dividing the column total_bedrooms by the column total_rooms from the dataframe.
- Create a new column population_per_household by dividing the column population by the column households from the dataframe.
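The preparation steps above could be sketched as follows. This is a minimal illustration on a made-up four-row frame (the values are hypothetical, not taken from housing.csv), and it assumes that "median" means the per-column median of the numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for housing.csv (values are made up).
df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "population": [322.0, 2401.0, 496.0, 558.0],
    "households": [126.0, 1138.0, 177.0, 219.0],
})

# Fill missing values in the numeric columns with each column's median.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Derived ratio features described in the list above.
df["rooms_per_household"] = df["total_rooms"] / df["households"]
df["bedrooms_per_room"] = df["total_bedrooms"] / df["total_rooms"]
df["population_per_household"] = df["population"] / df["households"]
```

With the real dataset, the same lines would follow a `pd.read_csv` call on the downloaded file and a selection of the columns listed earlier.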
What is the most frequent observation (mode) for the column ocean_proximity?
Options:
NEAR BAY
<1H OCEAN
INLAND
NEAR OCEAN
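One way to look up the mode of a categorical column is sketched below. The series is made up for illustration, so its result says nothing about the answer for the real data.

```python
import pandas as pd

# Hypothetical values; the real column comes from housing.csv.
s = pd.Series(
    ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"],
    name="ocean_proximity",
)

# mode() returns a Series (there can be ties); take the first entry.
most_frequent = s.mode()[0]

# value_counts() shows the full frequency table, sorted by count.
counts = s.value_counts()
```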
- Create the correlation matrix for the numerical features of your train dataset.
- In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
- What are the two features that have the biggest correlation in this dataset?
Options:
- total_bedrooms and households
- total_bedrooms and total_rooms
- population and households
- population_per_household and total_rooms
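A possible way to find the most correlated pair is to compute the correlation matrix and scan every distinct pair of columns. The sketch below uses synthetic columns (feat_a, feat_b, and feat_c are hypothetical names) and ranks pairs by absolute correlation, which is an assumption about what "biggest" means here.

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)

# Synthetic numeric features; feat_b is nearly a copy of feat_a.
toy = pd.DataFrame({
    "feat_a": a,
    "feat_b": a + rng.normal(scale=0.1, size=200),
    "feat_c": rng.normal(size=200),
})

# Pearson correlation between every pair of columns.
corr = toy.corr()

# Visit each unordered pair once, skipping the diagonal.
pairs = {
    (x, y): abs(corr.loc[x, y])
    for x, y in combinations(corr.columns, 2)
}
best_pair = max(pairs, key=pairs.get)
```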
- We need to turn the median_house_value variable from numeric into binary.
- Let's create a variable above_average which is 1 if the median_house_value is above its mean value and 0 otherwise.
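The binarization can be sketched in two lines of pandas. The frame below is a made-up example whose mean is exactly 300000, so only the last row ends up above average.

```python
import pandas as pd

# Hypothetical prices; their mean is 300000.
prices = pd.DataFrame(
    {"median_house_value": [100000.0, 200000.0, 300000.0, 600000.0]}
)

mean_value = prices["median_house_value"].mean()

# Strictly above the mean -> 1, otherwise 0.
prices["above_average"] = (
    prices["median_house_value"] > mean_value
).astype(int)
```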
- Split your data in train/val/test sets, with 60%/20%/20% distribution.
- Use Scikit-Learn for that (the train_test_split function) and set the seed to 42.
- Make sure that the target value (median_house_value) is not in your dataframe.
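A 60%/20%/20% split can be obtained by calling train_test_split twice: first hold out 20% for test, then take 25% of the remaining 80% for validation. The sketch below runs on a synthetic 100-row frame; dropping above_average together with median_house_value from the feature frames is an assumption about how the targets are meant to be kept separate.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared dataframe.
data = pd.DataFrame({
    "x": np.arange(100),
    "median_house_value": np.arange(100) * 1000.0,
    "above_average": np.arange(100) % 2,
})

# 20% for test, then 25% of the remaining 80% (= 20% overall) for validation.
df_full_train, df_test = train_test_split(data, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)

# Keep the targets aside, then remove them from the feature frames.
y_train = df_train["above_average"].values
y_val = df_val["above_average"].values
y_test = df_test["above_average"].values

target_cols = ["median_house_value", "above_average"]
df_train = df_train.drop(columns=target_cols)
df_val = df_val.drop(columns=target_cols)
df_test = df_test.drop(columns=target_cols)
```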
- Calculate the mutual information score between above_average and ocean_proximity. Use the training set only.
- Round it to 2 decimals using round(score, 2).
- What is their mutual information score?
Options:
- 0.26
- 0
- 0.10
- 0.16
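The calculation can be sketched with mutual_info_score from sklearn.metrics. The four-row training frame below is invented: the category determines the label completely and both classes are balanced, so the score equals ln 2, which is unrelated to the answer for the real data.

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

# Hypothetical training rows; the label is fully determined by the category.
train = pd.DataFrame({
    "ocean_proximity": ["INLAND", "INLAND", "NEAR BAY", "NEAR BAY"],
    "above_average": [0, 0, 1, 1],
})

# mutual_info_score works on label arrays and uses the natural logarithm.
score = mutual_info_score(train["above_average"], train["ocean_proximity"])
rounded = round(score, 2)
```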
- Now let's train a logistic regression.
- Remember that we have one categorical variable ocean_proximity in the data. Include it using one-hot encoding.
- Fit the model on the training dataset.
- To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
- Calculate the accuracy on the validation dataset and round it to 2 decimal digits.
Options:
- 0.60
- 0.72
- 0.84
- 0.95
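One common way to one-hot encode and fit the model is DictVectorizer followed by LogisticRegression with the parameters given above. The sketch below uses synthetic data and a simple positional train/validation split, both of which are assumptions made only to keep the example self-contained; the homework does not prescribe DictVectorizer specifically.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data: the label depends only on median_income.
rng = np.random.default_rng(42)
n = 200
income = rng.uniform(0, 10, size=n)
toy = pd.DataFrame({
    "median_income": income,
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY"], size=n),
})
y = (income > 5).astype(int)

train_df, val_df = toy.iloc[:150], toy.iloc[150:]
y_train, y_val = y[:150], y[150:]

# DictVectorizer one-hot encodes string values and passes numbers through.
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_df.to_dict(orient="records"))
X_val = dv.transform(val_df.to_dict(orient="records"))

model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

acc = round(accuracy_score(y_val, model.predict(X_val)), 2)
```

Fitting the vectorizer on the training rows only, and reusing it for validation, keeps the column layout identical between the two matrices.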
- Let's find the least useful feature using the feature elimination technique.
- Train a model with all these features (using the same parameters as in Q4).
- Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
- For each feature, calculate the difference between the original accuracy and the accuracy without the feature.
- Which of the following features has the smallest difference?
total_rooms
total_bedrooms
population
households
Note: the difference doesn't have to be positive
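The elimination loop might look like the sketch below. It runs on synthetic numeric features (signal, noise_a, and noise_b are hypothetical names), and it reads "smallest difference" as smallest in absolute value, which is an interpretation rather than something the text states.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data: only "signal" carries information about the label.
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
y = (toy["signal"] > 0).astype(int).values

train_df, val_df = toy.iloc[:200], toy.iloc[200:]
y_train, y_val = y[:200], y[200:]


def val_accuracy(features):
    """Train on the given feature subset and return validation accuracy."""
    model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
    model.fit(train_df[features], y_train)
    return accuracy_score(y_val, model.predict(val_df[features]))


all_features = list(toy.columns)
base_acc = val_accuracy(all_features)

# Accuracy change when each feature is left out in turn.
diffs = {}
for feature in all_features:
    remaining = [c for c in all_features if c != feature]
    diffs[feature] = base_acc - val_accuracy(remaining)

# Assumption: "smallest" is taken in absolute value.
least_useful = min(diffs, key=lambda f: abs(diffs[f]))
```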
- For this question, we'll see how to use a linear regression model from Scikit-Learn.
- We'll need to use the original column 'median_house_value'. Apply the logarithmic transformation to this column.
- Fit the Ridge regression model (model = Ridge(alpha=a, solver="sag", random_state=42)) on the training data.
- This model has a parameter alpha. Let's try the following values: [0, 0.01, 0.1, 1, 10]
- Which of these alphas leads to the best RMSE on the validation set? Round your RMSE scores to 3 decimal digits.
If there are multiple options, select the smallest alpha.
Options:
- 0
- 0.01
- 0.1
- 1
- 10
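The alpha search can be sketched as a loop that records the rounded validation RMSE for each value and breaks ties in favour of the smallest alpha. The data below is synthetic, and np.log1p is used for the logarithmic transformation as an assumption; np.log would be an equally plausible reading of the instructions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Synthetic features and a price-like target (made up for illustration).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
coefs = np.array([0.5, -0.2, 0.1])
price = np.exp(11 + X @ coefs + rng.normal(scale=0.1, size=300))

# Assumption: log1p as the logarithmic transformation of the target.
y = np.log1p(price)

X_train, X_val = X[:200], X[200:]
y_train, y_val = y[:200], y[200:]

alphas = [0, 0.01, 0.1, 1, 10]
scores = {}
for a in alphas:
    model = Ridge(alpha=a, solver="sag", random_state=42)
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
    scores[a] = round(float(rmse), 3)

# Lowest rounded RMSE wins; ties go to the smallest alpha.
best_alpha = min(alphas, key=lambda a: (scores[a], a))
```

Computing the RMSE as the square root of mean_squared_error avoids relying on version-specific keyword arguments.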
- Submit your results here: https://forms.gle/vQXAnQDeqA81HSu86
- You can submit your solution multiple times. In this case, only the last submission will be used
- If your answer doesn't match options exactly, select the closest one
The deadline for submitting is 26 September (Monday), 23:00 CEST.
After that, the form will be closed.