---
format:
html:
number-depth: 3
css: summary-format.css
---
# Generalized linear models (GLM)
## Linear regression
### Getting the Least Squares Line of a sample
As the *population regression line* is unobserved, the *least squares line* of a sample is a good estimate of it. To obtain it we follow these steps:
1. Define the function to fit.
$$
\hat{y} = \hat{\beta}_{0} + \hat{\beta}_{1} x
$$
2. Define how to calculate **residuals**.
$$
e_{i} = y_{i} - \hat{y}_{i}
$$
3. Define the **residual sum of squares (RSS)**.
$$
RSS = e_{1}^2 + e_{2}^2 + \dots + e_{n}^2
$$
4. Use calculus (or let the computer estimate numerically) to find the coefficients that minimize the RSS (see the sketch after this list).
$$
\hat{\beta}_{1} = \frac{\Sigma_{i=1}^{n}(x_{i}-\overline{x})(y_{i}-\overline{y})}
{\Sigma_{i=1}^{n}(x_{i}-\overline{x})^2}
, \quad
\hat{\beta}_{0} = \overline{y} - \hat{\beta}_{1}\overline{x}
$$
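As a quick check of these formulas, here is a minimal sketch (using the built-in `mtcars` data purely as an illustration, not part of the original example) that computes the coefficients by hand and compares them with `lm()`:
```r
# Minimal sketch: least squares coefficients by hand vs. lm(), on mtcars
x <- mtcars$wt
y <- mtcars$mpg

beta_1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta_0 <- mean(y) - beta_1 * mean(x)

c(beta_0 = beta_0, beta_1 = beta_1)
coef(lm(mpg ~ wt, data = mtcars))  # should match the manual estimates
```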
### Getting confidence intervals of coefficients
To estimate the **population regression line** we can calculate **confidence intervals** for the sample coefficients, defining a range in which we expect to find the population values with a given **confidence level**.
> If we use a 95% confidence level, we need to keep in mind that, over many repeated samples, only 95% of the intervals produced at this **confidence level** would contain the true value (the parameter).
To generate confidence intervals we would need to know the variance of the *random error*:
$$
\sigma^2 = Var(\epsilon)
$$
But as we cannot compute that variance directly, an alternative is to estimate it from the residuals, provided they meet the following conditions:
1. Each residual has the common variance $\sigma^2$, so the variance of the error terms should not be related to the value of the response.
2. The residuals are uncorrelated. For example, if $\epsilon_{i}$ is positive, that provides little or no information about the sign of $\epsilon_{i+1}$.
If these conditions fail, we end up underestimating the true standard errors, reducing the probability that an interval at a given confidence level contains the true value of the parameter, and understating the *p-values* associated with the model.
$$
\sigma \approx RSE = \sqrt{\frac{RSS}{(n-p-1)}}
$$
Now we can calculate the **standard error** of each coefficient and build the confidence intervals (a quick sketch follows the formulas).
$$
SE(\hat{\beta_{0}})^2 = \sigma^2
\left[\frac{1}{n}+
\frac{\overline{x}^2}
{\Sigma_{i=1}^{n} (x_{i}-\overline{x})^2}
\right]
$$
$$
SE(\hat{\beta_{1}})^2 = \frac{\sigma^2}
{\Sigma_{i=1}^{n} (x_{i} - \overline{x})^2}
$$
$$
\hat{\beta_{1}} \pm 2 \cdot SE(\hat{\beta_{1}}), \quad \hat{\beta_{0}} \pm 2 \cdot SE(\hat{\beta_{0}})
$$
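In R these standard errors and intervals come directly from the fitted model; a minimal sketch (again using `mtcars` only as an illustration):
```r
# Sketch: coefficient standard errors and 95% confidence intervals
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)$coefficients   # estimates and their standard errors
confint(fit, level = 0.95)  # confidence intervals for beta_0 and beta_1
```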
### Insights to extract
#### Confirm the relationship between the Response and Predictors
Use the regression's **overall p-value** (based on the F-statistic) to confirm that at **least one predictor** is related to the response, avoiding the interpretative problems associated with the number of observations (_n_) or predictors (_p_).
$$
H_{0}: \beta_{1} = \beta_{2} = \dots = \beta_{p} = 0
$$
$$
H_{a}: \text{at least one } \beta_{j} \text{ is non-zero}
$$
#### Accuracy of the model (relationship strength)
If we want to know how well the model fits the data we have two options (a quick sketch follows the formulas):
- **Residual standard error (RSE)**: Even if the model were correct, the actual values of $y$ would differ from the true regression line by approximately *this many units*, on average. To express it as a percentage error we can calculate $RSE/\overline{y}$
- **The $R^2$ statistic**: The proportion of variance explained by taking as a reference the **total sum of squares (TSS)**.
$$
TSS = \Sigma(y_{i} - \overline{y})^2
$$
$$
R^2 = \frac{TSS - RSS}{TSS}
$$
$$
R^2 =
\begin{cases}
Cor(X, Y)^2 & \text{Simple Linear Regression} \\
Cor(Y,\hat{Y})^2 & \text{Multiple Linear Regression}
\end{cases}
$$
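Both quantities are reported by `summary()`; a minimal sketch on `mtcars` (illustrative only, not part of the original example):
```r
# Sketch: RSE, R^2 and an approximate percentage error
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)$sigma                      # residual standard error (RSE)
summary(fit)$r.squared                  # R^2
summary(fit)$sigma / mean(mtcars$mpg)   # RSE relative to the mean response
```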
#### Confirm the relationship between the Response and each predictor
To answer that, we can test whether a particular subset of $q$ of the coefficients is zero.
$$
H_{0}: \beta_{p-q+1} = \beta_{p-q+2} = \dots = \beta_{p} = 0
$$
In this case, the F-statistic reports the **partial effect** of adding an extra variable to the model (the order matters), which we can use to apply a *variable selection* technique. The classical approach is to:
1. Fit a model for each variable combination $2^p$.
2. Select the best model based on *Mallow’s Cp*, *Akaike information criterion (AIC)*, *Bayesian information criterion (BIC)*, and *adjusted* $R^2$ or plot various model outputs, such as the residuals, in order to search for patterns.
But consider that if we have $p = 30$ we would need to fit $2^{30} = 1{,}073{,}741{,}824$ models, which is far too many. Some alternative approaches for this task:
- Forward selection
- Backward selection (cannot be used if $p > n$)
- Mixed selection
#### Size of association between each predictor and the response.
To check this we look at the *confidence intervals* of each $\hat{\beta}_{j}$, since the true $\beta_{j}$ should lie within that range.
#### Predicting future values
If we want to predict the average response $f(X)$ we can use confidence intervals, but if we want to predict an individual response $Y = f(X) + \epsilon$ we need to use prediction intervals, as they account for the uncertainty associated with $\epsilon$, the irreducible error.
### Standard linear regression model assumptions
- The **additivity assumption** means that the association between a predictor $X_{j}$ and the response $Y$ does not depend on the values of the other predictors, which is violated when there is an *interaction (synergy) effect*.
- The **linearity assumption** states that the change in the response Y associated with a one-unit change in $X_{j}$ is constant, regardless of the value of $X_{j}$.
#### Including an interaction term
This approach relaxes the *additivity assumption* that models usually carry.
- **2 quantitative variables**
It consists of adding an extra term whose coefficient multiplies two or more variables.
$$
\begin{split}
Y & = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} X_{1} X_{2} + \epsilon \\
& = \beta_{0} + (\beta_{1} + \beta_{3} X_{2}) X_{1} + \beta_{2} X_{2} + \epsilon \\
& = \beta_{0} + \tilde{\beta}_{1} X_{1} + \beta_{2} X_{2} + \epsilon
\end{split}
$$
After adding the interaction term we can interpret the change as making one of the original coefficients a function of the other variable. Now we can say that $\beta_{3}$ *represents __the change in the effectiveness of__* $X_{1}$ *associated with a one-unit increase in* $X_{2}$.
It is very important to respect the **hierarchical principle**, which states that if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant, since ***omitting them would alter the meaning of the interaction***.
- **1 quantitative and 1 qualitative variable**
If $X_{1}$ is quantitative and $X_{2}$ is qualitative:
$$
\hat{Y} =
\begin{cases}
(\beta_{0} + \beta_{2}) + (\beta_{1} + \beta_{3})X_{1} & \text{if }X_{2} \text{ is TRUE}\\
\beta_{0} + \beta_{1}X_{1} & \text{if }X_{2} \text{ is FALSE}
\end{cases}
$$
Adding the $\beta_{3}$ interaction allows the slope of the line, not just its intercept, to change with $X_{2}$ (see the sketch after the figure).
![](img/03-factor-interaction.png){fig-align="center"}
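In R's formula syntax both kinds of interaction can be written directly; a minimal sketch on `mtcars` (illustrative only, not part of the original example):
```r
# Sketch: interaction terms in the formula interface
lm(mpg ~ wt * hp, data = mtcars)          # expands to wt + hp + wt:hp (hierarchical principle)
lm(mpg ~ wt * factor(am), data = mtcars)  # quantitative x qualitative: a different slope per level
```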
#### Polynomial regression
This approach relaxes the *linearity assumption* that models usually carry. It consists of including transformed versions of the predictors.
$$
\begin{split}
Y & = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} \\
& = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{1}^2
\end{split}
$$
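A minimal sketch of a quadratic fit in R (illustrative only, using `mtcars` as an assumed example dataset):
```r
# Sketch: adding a quadratic term to a linear model
lm(mpg ~ wt + I(wt^2), data = mtcars)  # raw polynomial term
lm(mpg ~ poly(wt, 2), data = mtcars)   # orthogonal polynomial basis; same fitted values
```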
### Possible problems
#### The target doesn't follow a normal distribution
We have two ways to detect this problem:
1. Plot the target variable with a histogram
![](img/46-ames-sale-price-distribution.png){fig-align="center"}
2. Plotting the residuals and checking whether the plot shows a funnel shape, which indicates a *non-constant variance (heteroscedasticity) of the error terms*.
![](img/05-non-constance-variance.png){fig-align="center"}
To solve this problem we can apply transformations to the target variable.
1. Applying a concave function such as $\log{Y}$ or $\sqrt{Y}$. If the target variable contains 0s we can set the `offset` argument to 1.
```r
# Log-transform the outcome; the offset is added before taking the log
ames_recipe <-
  recipe(Sale_Price ~ ., data = ames_train) %>%
  step_log(all_outcomes(), offset = 0)
```
2. Applying the **Box-Cox transformation**, which is more flexible than the log transformation and will find an appropriate transformation from a family of power transforms that makes **the variable as close as possible to a normal distribution**. Its lambda ($\lambda$) value varies from -5 to 5 and is selected based on the training data (see the sketch after this list).
$$
y(\lambda) =
\begin{cases}
\frac{Y^\lambda - 1}{\lambda}, \text{if } \lambda \neq 0 \\
\log{(Y)}, \text{if } \lambda = 0
\end{cases}
$$
3. If your data contain values $y \leq 0$, use the **Yeo-Johnson transformation**, which is very similar to Box-Cox but does not require strictly positive values.
```r
ames_recipe <-
recipe(Sale_Price ~ ., data = ames_train) %>%
step_YeoJohnson(all_outcomes())
```
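For completeness, the Box-Cox transformation mentioned in item 2 has an analogous recipe step; a minimal sketch, assuming the same `ames_train` data as in the snippet above:
```r
# Sketch: Box-Cox transformation of the outcome; lambda is estimated from the training data
ames_recipe <-
  recipe(Sale_Price ~ ., data = ames_train) %>%
  step_BoxCox(all_outcomes())
```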
To interpret our predictions correctly it's important to remember that we need to **revert this transformation after getting the prediction**.
These transformations reduce the **relative error** of each prediction.
#### Collinearity
**Collinearity** refers to the situation in which two or more predictor variables are closely related (highly correlated) to one another. It reduces the accuracy of the estimates of the regression coefficients and causes the standard error of $\hat{\beta}_{j}$ to grow. That reduces the **power of the hypothesis test**, that is, the probability of correctly detecting a non-zero coefficient.
Looking at the correlation matrix of the predictors can be useful, but it is possible for collinearity to exist between three or more variables even if no pair of variables has a particularly high correlation (**multicollinearity**).
|Detection method|Solutions|
|:---------------|:--------|
|The best way to assess multicollinearity is to compute the **variance inflation factor (VIF)**: the ratio of the variance of $\hat{\beta}_{j}$ when fitting the full model to the variance of $\hat{\beta}_{j}$ if fit on its own. Its lowest possible value is 1, and values of 5 or 10 indicate problematic collinearity|1. Drop one of the problematic variables from the regression. </br> </br> 2. Combine the collinear variables together into a single predictor|
$$
\text{VIF}(\hat{\beta}_{j}) = \frac{1}
{1 - R_{X_{j}|X_{-j}}^2}
$$
Where $R_{X_{j}|X_{-j}}^2$ is the $R^2$ from a regression of $X_{j}$ onto all of the other predictors.
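A minimal sketch of the VIF computation, assuming the `car` package is available (on `mtcars`, illustrative only):
```r
# Sketch: variance inflation factors for a multiple regression
library(car)

fit <- lm(mpg ~ wt + disp + hp, data = mtcars)
vif(fit)  # values around 5-10 or above suggest problematic collinearity
```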
#### Non-linearity of the response-predictor relationships
|Detection method|Solutions|
|:---------------|:--------|
|Plot the **residuals versus predicted values** $\hat{y}_{i}$. Ideally, the residual plot will show no discernible pattern. The presence of a pattern may indicate a problem with some aspect of the linear model.|A simple approach is to use non-linear transformations of the predictors, such as $\log{X}$, $\sqrt{X}$, and $X^2$, in the regression model|
![](img/04-residuals-predicted-values.png){fig-align="center"}
#### Correlation of error terms
If there is correlation among the errors, then the estimated standard errors of the coefficients will be biased **leading to prediction intervals being narrower** than they should be.
|Detection method|Solutions|
|:---------------|:--------|
|1. Plot the residuals from our model as a function of time or execution order. If the errors are uncorrelated, then there should be no discernible pattern. </br> </br> 2. Check if some observation have been exposed to the same environmental factors|Good experimental design is crucial in order to mitigate these problems|
#### Outliers
An outlier is a point for which $y_{i}$ is far from the value predicted by the model. Sometimes they have little effect on the least squares line, but they *inflate the RSE* (increasing the model's p-values) and *deflate the* $R^2$.
|Detection method|Solutions|
|:---------------|:--------|
|Plot the **studentized residuals**, computed by dividing each residual $e_{i}$ by its estimated standard error, and look for points whose absolute value is greater than 3|They can be removed if they are due to an error in data collection. Otherwise, they may indicate a deficiency with the model, such as a missing predictor.|
![](img/06-studentized-residuals-plot.png){fig-align="center"}
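A minimal sketch of that check in R (on `mtcars`, illustrative only):
```r
# Sketch: flag observations with |studentized residual| > 3
fit <- lm(mpg ~ wt + hp, data = mtcars)
which(abs(rstudent(fit)) > 3)  # indices of candidate outliers (possibly none)
```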
#### High-leverage points
Observations with **high leverage** have an unusual value for $x_{i}$. High leverage observations tend to have a sizable impact on the estimated regression line and any problems with these points may invalidate the entire fit.
|Detection method|Solutions|
|:---------------|:--------|
|Compute the leverage statistic and look for observations whose value greatly exceeds the average leverage, $(p + 1)/n$. Leverage values are always between $1/n$ and $1$|Make sure that the value is correct and not a data collection problem|
$$
h_{i} = \frac{1}{n} +
\frac{(x_{i} - \overline{x})^2}
{\Sigma_{i'=1}^n(x_{i'} - \overline{x})^2}
$$
In a *multiple linear regression*, it is possible to have an observation that is well within the range of each individual predictor’s values, but that is unusual in terms of the full set of predictors.
![](img/07-studentized-residuals-leverage-plot.png){fig-align="center"}
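A minimal sketch of the leverage check (on `mtcars`, illustrative only):
```r
# Sketch: leverage statistics compared with the average leverage (p + 1)/n
fit <- lm(mpg ~ wt + hp, data = mtcars)
h <- hatvalues(fit)
which(h > 2 * mean(h))  # mean(h) equals (p + 1)/n for a least squares fit
```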
### Avoid using for classification problems
There are better models for this kind of problem. For example, for a binary response the linear **discriminant analysis (LDA)** procedure yields the same classifications as linear regression. Other reasons are:
- A regression method cannot accommodate a qualitative response with more than two classes.
- A regression method will not provide meaningful estimates of $Pr(Y|X)$ as some of our estimates might be outside the [0, 1] interval.
### Benefits of Replacing plain least squares fitting
- **Prediction Accuracy:** If $n$ is not much larger than $p$, there can be a lot of variability in the least squares fit, resulting in overfitting and consequently poor predictions on future observations not used in model training. And if $p > n$, there is no longer a unique least squares coefficient estimate: the variance is infinite, so the method cannot be used at all. As an alternative, we can reduce the variance at the cost of a small increase in bias (*constraining* and *shrinking* the coefficients).
- **Model Interpretability:** Some methods can exclude irrelevant variables from a multiple regression model (*feature selection* or *variable selection*).
### Coding example
To perform **Linear Regression** we just need to create the model specification by using **lm** engine.
```{r}
library(ISLR2)
library(tidymodels)

set.seed(123)
# Train/test split of the Bikeshare data
BikeshareSpit <- initial_split(Bikeshare)

# Model specification: ordinary least squares via the "lm" engine
lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

# Recipe: dummy-encode the nominal predictors
lm_rec_spec <-
  recipe(bikers ~ mnth + hr + workingday + temp + weathersit,
         data = Bikeshare) %>%
  step_dummy(all_nominal_predictors())

# Fit on the training set and evaluate on the test set
workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(lm_rec_spec) %>%
  last_fit(split = BikeshareSpit) %>%
  collect_metrics()
```
## Extending Linear Regression (Data Transformations)
### Polynomial regression
It extends the *linear model* by adding extra predictors, obtained by **raising each of the original predictors to a power**.
As a result, if the response is a numeric variable, we can fit a model of the following form:
$$
y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \dots + \beta_d x_i^d + \epsilon_i
$$
![](img/35-polynomial-4degree-regression.png){fig-align="center"}
On the other hand, we can use *logistic regression* with the same structure to predict the probability of a particular class:
$$
\Pr(y_i > 250|x_i) = \frac{\exp(\beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \dots + \beta_d x_i^d)}
{1 + \exp(\beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \dots + \beta_d x_i^d)}
$$
![](img/36-polynomial-4degree-logistic-regression.png){fig-align="center"}
### Piecewise constant regression
They **cut the range of a variable into K distinct regions** (known as *bins*) in order to produce a qualitative variable. This has the effect of fitting a piecewise constant function.
If we define the cutpoints $c_1, c_2, \dots, c_K$ in the range of *X*, we can create *dummy variables* to represent each range. For example, if $c_1 \leq x_i < c_2$ is `TRUE` then $C_1(x_i) = 1$, and we repeat that process for each value of $X$ and each range. As a result we can fit a *linear regression* on the new variables (see the sketch at the end of this section).
$$
y_i = \beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + \dots + \beta_K C_K(x_i) + \epsilon_i
$$
![](img/37-step-function-regression.png){fig-align="center"}
On the other hand, we can use *logistic regression* with the same structure to predict the probability of a particular class:
$$
\Pr(y_i > 250|x_i) = \frac{\exp(\beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + \dots + \beta_K C_K(x_i))}
{1 + \exp(\beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + \dots + \beta_K C_K(x_i))}
$$
![](img/38-step-function-logistic-regression.png){fig-align="center"}
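A minimal sketch of a piecewise constant fit in R, assuming the `Wage` data from `ISLR2` (illustrative only):
```r
# Sketch: step functions via cut(), which builds the bin dummies automatically
library(ISLR2)

lm(wage ~ cut(age, 4), data = Wage)                               # piecewise constant regression
glm(I(wage > 250) ~ cut(age, 4), data = Wage, family = binomial)  # logistic version
```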
### Piecewise polynomial regression (Natural Spline)
It consists of fitting separate low-degree polynomials over different regions of *X*. For example, a **piecewise cubic polynomial** with a single knot at a point *c* takes the form:
$$
y_i =
\begin{cases}
\beta_{01} + \beta_{11} x_i + \beta_{21} x_i^2 + \beta_{31} x_i^3 + \epsilon_i & \text{if } x_i<c\\
\beta_{02} + \beta_{12} x_i + \beta_{22} x_i^2 + \beta_{32} x_i^3 + \epsilon_i & \text{if } x_i \geq c
\end{cases}
$$
As each polynomial has four parameters, we are using a total of **8 degrees of freedom** to fit this model. Applying it to the `Wage` data reveals a problem: the model is **too flexible**, and to solve it we need to **constrain** it to be continuous at `age = 50`.
![](img/39-piecewise-cubic-example.png){fig-align="center"}
Even after applying the **continuity constraint**, the plot still shows an unnatural V-shape. This can be solved by also applying the **continuity constraint** to the *first* and *second* derivatives of the function, which leaves a model with **5 degrees of freedom**.
![](img/40-constrained-piecewise-cubic-example.png){fig-align="center"}
In this context, a **natural spline** refers to a *regression spline* with the additional constraints of maintaining **linearity at the boundaries**.
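A minimal sketch of a natural spline fit, assuming the `splines` package and the `Wage` data from `ISLR2` (illustrative only):
```r
# Sketch: natural cubic spline of age, constrained to be linear beyond the boundary knots
library(splines)
library(ISLR2)

fit <- lm(wage ~ ns(age, df = 4), data = Wage)
summary(fit)
```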
## Subset Selection
This approach involves identifying a subset of the *p* predictors that we believe to be related to the response. We then fit a model using least squares on the reduced set of variables.
### Best Subset Selection
It fits all *p* models that contain exactly one predictor, all $\left( \begin{array}{c} p \\ 2 \end{array} \right) = p (p-1)/2$ models that contain exactly two predictors, and so forth. Then it selects the *best* model based on smallest RSS or the largest $R^2$.
- **Algorithm 6.1**
1. Let $\mathcal{M}_0$ denote the null model, which simply predicts the sample mean for each observation.
2. For $k = 1, 2, \dots, p$:
- Fit all $\left( \begin{array}{c} p \\ k \end{array} \right)$ models that contain exactly k predictors.
- Pick the best among these $\left( \begin{array}{c} p \\ k \end{array} \right)$ models using the smallest RSS or the *deviance* for classification (negative two times the maximized log-likelihood), and call it $\mathcal{M}_k$
3. Select a single best model from among $\mathcal{M}_0, \dots, \mathcal{M}_p$ using cross-
validated prediction error, $C_p$ (AIC), BIC or adjusted $R^2$.
This method is very computationally expensive as it needs to fit $2^p$ models. Consider that if your data has 20 predictors, there are over one million possibilities. Such an enormous search space can also lead to **overfitting** and **high variance of the coefficient estimates**.
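A minimal sketch of subset selection, assuming the `leaps` package and the `Hitters` data from `ISLR2` are available (illustrative only):
```r
# Sketch: best subset selection; pick the model size with the lowest criterion
library(leaps)
library(ISLR2)

hitters_cc <- na.omit(Hitters)  # drop rows with missing Salary
best_fit <- regsubsets(Salary ~ ., data = hitters_cc, nvmax = 19)
summary(best_fit)$bic  # BIC of the best model of each size
# For stepwise selection, add method = "forward" or method = "backward"
```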
### Stepwise Selection: Forward Stepwise Selection
It begins with a model containing no predictors, and then adds the predictor that gives the greatest additional improvement to the fit, one at a time, until all of the predictors are in the model. As a result we need to fit $1+p(p+1)/2$ models.
- **Algorithm 6.2**
1. Let $\mathcal{M}_0$ denote the null model, which simply predicts the sample mean for each observation.
2. For $k = 0, \dots, p-1$:
- Consider all $p-k$ models that augment the predictors in $\mathcal{M}_k$ with one additional predictor.
- Choose the best among these $p-k$ models, and call it $\mathcal{M}_{k+1}$. Here best is defined as having smallest RSS.
3. Select a single best model from among $\mathcal{M}_0, \dots, \mathcal{M}_p$ using cross-
validated prediction error, $C_p$ (AIC), BIC or adjusted $R^2$.
Though it tends to do well in practice, it is not guaranteed to find the best possible model out of all $2^p$ models containing subsets of the *p* predictors, but can be used even if $n < p$.
### Stepwise Selection: Backward Stepwise Selection
It begins with the full least squares model containing all *p* predictors, and then iteratively removes the least useful predictor, one at a time. As a result we need to fit $1+p(p+1)/2$ models.
- **Algorithm 6.3**
1. Let $\mathcal{M}_p$ denote the full model, which contains all *p* predictors.
2. For $k = p, p -1, \dots , 1$:
- Consider all *k* models that contain all but one of the predictors in $\mathcal{M}_k$, for a total of $k-1$ predictors.
- Choose the best among these *k* models, and call it $\mathcal{M}_{k-1}$. Here best is defined as having smallest RSS.
3. Select a single best model from among $\mathcal{M}_0, \dots, \mathcal{M}_p$ using cross-
validated prediction error, $C_p$ (AIC), BIC or adjusted $R^2$.
This method doesn't guarantee to yield the best model containing a subset of the *p* predictors, and we need to confirm that $n \geq p$.
### Stepwise Selection: Hybrid Approaches
In this method variables are added sequentially, as in forward selection, but after adding each new variable the method may also remove any variables that no longer provide an improvement in the model fit. Such an approach attempts to more closely mimic *Best Subset Selection* while retaining the computational advantages of *forward* and *backward stepwise selection*.
### Choosing the Optimal Model
As the model which contains all of the predictors will always have the smallest RSS, we need to estimate the test error rate by:
- Making an adjustment to the training error to account for the bias due to overfitting.
- Using either a validation set approach or a cross-validation approach.
Let's see the methods that rely on adjusting the training error:
|Method|Formula|Interpretation|
|:----:|:-----:|:----------|
|$C_p$|$C_p = \frac{1}{n} (\text{RSS} + 2d\hat{\sigma}^2)$|$2d\hat{\sigma}^2$ represent the penalty of adding new predictors. This method is a good estimation of **test MSE** if the $\hat{\sigma}^2$ is an unbiased estimate of the $\sigma^2$|
|Akaike information criterion|$\text{AIC} = \frac{1}{n} (\text{RSS} + 2d\hat{\sigma}^2)$|The book omits irrelevant constants to show that $C_p$ and $\text{AIC}$ are proportional to each other|
|Bayesian information criterion|$\text{BIC} = \frac{1}{n}(\text{RSS} +\log(n) d \hat{\sigma}^2)$|After omitting irrelevant constants, we can see that $\log n > 2$ for any $n > 7$; as a consequence, this metric places a heavier penalty on models with many variables than $C_p$ does and tends to select models with fewer predictors|
|Adjusted $R^2$|$\text{Adjusted} \; R^2 = 1 - \frac{\text{RSS}/(n - d -1)}{\text{TSS}/(n-1)}$|A large value of adjusted $R^2$ indicates a model with a small test error, even though the metric doesn't rely on rigorous theoretical justifications.|
*Where:*
- $n$: Number of observations
- $d$: Number of predictors
- $\hat{\sigma}^2$: Estimate of the variance of the error $\epsilon$ associated with each response
![](img/25-cp-bic-adj-r2-example.png){fig-align="center" width=90% height=90%}
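These adjusted criteria are easy to read off a fitted model; a minimal sketch with `broom` (loaded with tidymodels), on `mtcars` as an illustrative stand-in:
```r
# Sketch: adjustment-based criteria for a single candidate model
library(broom)

fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
glance(fit)[, c("adj.r.squared", "AIC", "BIC")]  # compare these across candidate models
```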
## Shrinkage
This approach involves fitting a model involving all *p* predictors and shrinks the estimated coefficients towards zero. Depending on what type of shrinkage is performed, some of the coefficients may be estimated exactly at zero, performing some variable selection.
### Ridge Regression
Rather than minimizing the RSS alone over the regression coefficients $\beta_0, \beta_1, \dots, \beta_p$, this method minimizes the RSS plus a **shrinkage penalty**, which has the effect of shrinking the estimates of $\beta_j$ towards zero:
$$
RSS + \lambda \sum_{j=1}^p \beta_j^2
$$
*Where:*
- $\lambda$: It's a tuning parameter $\geq 0$ that can be calculated using cross-validation.
- $\text{RSS}$: The residual sum of squares.
As a result, ridge regression will produce a different set of coefficient estimates, $\hat{\beta}_\lambda^R$, for each $\lambda$ value. By plotting the new coefficients against the *penalty* used we can see how the coefficients move towards zero without reaching exactly 0. It may also be useful to plot the $\ell_2$ norm ($\| \beta \|_2 = \sqrt{\sum_{j=1}^p \beta_j^2}$) of the *ridge* coefficients as a proportion of the *least squares* norm, to see by how much the ridge coefficient estimates have been shrunken towards zero.
![](img/26-ridge-regression-change.png){fig-align="center" width=90% height=90%}
As the selected $\lambda$ is sensitive to the scale of the predictors, it is good practice to standardize the predictors with the following formula so that they all have a standard deviation of 1:
$$
\tilde{x}_{ij} =
\frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij}-\bar{x}_j)^2}} =
\frac{x_{ij}}{\sigma_j}
$$
When the number of variables *p* is almost as large as the number of observations *n*, the least squares estimates are extremely variable. This method **increases the bias** by reducing the *flexibility* through $\lambda$ in order to **reduce the variance** and find a lower error rate, without the heavy computation required by the *best subset selection* method.
![](img/27-ridge-regression-bias-variance-trade-off.png){fig-align="center" width=90% height=90%}
### Lasso Regression
*Ridge regression* doesn't set coefficients exactly to zero (unless $\lambda = \infty$). That doesn't affect the model's accuracy, but it doesn't help when we have to interpret a model with many predictors. To overcome that problem, *lasso regression* performs ***variable selection*** by minimizing the following function:
$$
RSS + \lambda \sum_{j=1}^p |\beta_j|
$$
*Where:*
- $\lambda$: It's a tuning parameter $\geq 0$ that can be calculated using cross-validation.
- $\text{RSS}$: The residual sum of squares.
As a consequence, the method uses the $\ell_1$ penalty ($\| \beta \|_1 = \sum_{j=1}^p |\beta_j|$) instead of the $\ell_2$.
![](img/28-lasso-regression-change.png){fig-align="center" width=90% height=90%}
### Coding example
To perform a **Ridge** or **Lasso** regression we need to use the function `linear_reg` and define the `mixture` argument depending on the regression we want to perform.
|Model|Mixture|
|:----|:------|
|Ridge Regression|mixture = **0**|
|Lasso Regression|mixture = **1**|
As both regressions depend on the `penalty` parameter, we need to use cross-validation to estimate the best one. But if you want to explore the parameter yourself you have the following option.
- Fit a model and explore different penalties using `tidy`, `augment`, `predict` or `autoplot` functions.
```{r}
Hitters <-
  as_tibble(Hitters) %>%
  filter(!is.na(Salary))

# Ridge regression (mixture = 0); glmnet fits a whole path of penalties
ridge_fit <-
  linear_reg(mixture = 0, penalty = 0) %>%
  set_mode("regression") %>%
  set_engine("glmnet") %>%
  fit(Salary ~ ., data = Hitters)

# Inspect coefficients and predictions at a specific penalty
tidy(ridge_fit, penalty = 11498)
predict(ridge_fit, new_data = Hitters, penalty = 11498)

# Coefficient paths across the penalty grid
autoplot(ridge_fit) +
  scale_x_log10(labels = scales::comma_format(accuracy = 1)) +
  theme_light()
```
Let's see an example using the tidymodels approach, by following these steps:
1. Split the data into training and testing sets.
```{r}
set.seed(18)
Hitters_split <- initial_split(Hitters, strata = "Salary")
Hitters_train <- training(Hitters_split)
Hitters_test <- testing(Hitters_split)
```
2. Define the workflow to use.
```{r}
# Recipe: handle new factor levels, dummy-encode, drop zero-variance columns
# and normalize the predictors (glmnet expects standardized inputs)
ridge_recipe <-
  recipe(formula = Salary ~ ., data = Hitters_train) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())

# Ridge specification with a tunable penalty
ridge_spec <-
  linear_reg(penalty = tune(), mixture = 0) %>%
  set_mode("regression") %>%
  set_engine("glmnet")

ridge_workflow <-
  workflow() %>%
  add_recipe(ridge_recipe) %>%
  add_model(ridge_spec)
```
3. Use cross-validation to estimate the testing error and tune the penalty.
```{r}
set.seed(40)
Hitters_fold <- vfold_cv(Hitters_train, v = 10)
```
4. Define the penalties to check, where `range` gives the limits of $x$ in $10^x$ and `levels` the number of values used to cover that range.
```{r}
penalty_grid <- grid_regular(penalty(range = c(-5, 5)), levels = 50)
```
5. Let's fit our models.
```{r}
tune_res <-
tune_grid(ridge_workflow,
resamples = Hitters_fold,
grid = penalty_grid)
```
6. Check how the test error changes with a plot, or just export a table.
```{r}
autoplot(tune_res)+
scale_x_log10(labels = scales::comma_format(accuracy = 1))+
theme_light()
collect_metrics(tune_res) %>%
filter(.metric == "rsq") %>%
arrange(desc(mean))
```
7. Select the best model configuration and update the workflow to use that parameter.
```{r}
best_penalty <- select_best(tune_res, metric = "rsq")
best_penalty
ridge_final <- finalize_workflow(ridge_workflow, best_penalty)
ridge_final
```
8. Fit the final model and validate the performance.
```{r}
ridge_final_fit <- fit(ridge_final, data = Hitters_train)
augment(ridge_final_fit, new_data = Hitters_test) %>%
rsq(truth = Salary, estimate = .pred)
```
To perform a Lasso regression we don't need to repeat all the code.
```{r}
# Lasso: same recipe as before, but mixture = 1
lasso_spec <-
  linear_reg(penalty = tune(), mixture = 1) %>%
  set_mode("regression") %>%
  set_engine("glmnet")

lasso_workflow <-
  workflow() %>%
  add_recipe(ridge_recipe) %>%
  add_model(lasso_spec)

lasso_penalty_grid <- grid_regular(penalty(range = c(-3, 2)), levels = 50)

lasso_tune_res <-
  tune_grid(lasso_workflow,
            resamples = Hitters_fold,
            grid = lasso_penalty_grid)

autoplot(lasso_tune_res) +
  scale_x_log10(labels = scales::comma_format(accuracy = 1)) +
  theme_light()
```
## Dimension Reduction
This method projects the $p$ predictors into an $M$-dimensional subspace. If $Z_1, Z_2, \dots, Z_M$ represent $M < p$ linear combinations $Z_m = \sum_{j=1}^p \phi_{jm}X_j$ of **ALL our original predictors**, based on some constants $\phi_{1m}, \phi_{2m}, \dots, \phi_{pm}$, then we can use the new variables to fit a linear regression model by least squares.
$$
y_i = \theta_0 + \sum_{m=1}^M \theta_m z_{im} + \epsilon_i,
\qquad i=1, \dots, n.
$$
This is not a feature selection method, as each of the *M* principal components used in the regression is a linear combination of all *p* of the original features. In this sense, PCR is more closely related to *ridge regression* than to the *lasso*.
To select the $\phi_{jm}$'s we will discuss two different ways:
### Principal Components Regression (PCR)
PCR assumes that *the directions in which* $X_1, \dots, X_p$ *show the most variation are the directions associated with* $Y$. If that is true, then fitting the model to $Z_1, \dots, Z_M$ will lead to better results than using the original variables $X_1, \dots, X_p$.
To perform a *principal components analysis* (PCA):
1. It's recommended to standardize each predictor to have the same scale.
$$
\tilde{x}_{ij} =
\frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij}-\bar{x}_j)^2}} =
\frac{x_{ij}}{\sigma_j}
$$
2. Select as the *first principal component* the direction along which the data *vary the most*.
![](img/29-pca-data-example.png){fig-align="center" width=90% height=90%}
3. Project the observations onto the first component, to capture the largest possible variance.
![](img/30-pca-projection.png){fig-align="center" width=90% height=90%}
4. Then maximize $\text{Var}(\phi_{11} \times (x_1-\overline{x_1}) + \phi_{21} \times (x_2-\overline{x_2}))$ subject to $\phi_{11}^2 + \phi_{21}^2 = 1$ to get **the principal component loadings**. As a result, $Z_1$ is a weighted average of the two variables.
5. Repeat the process until having *p* distinct principal components, each *perpendicular* to the previous ones.
In general, this method performs better when only a **few principal components** are needed.
![](img/31-pca-test-error.png){fig-align="center" width=90% height=90%}
![](img/32-pca-test-error-less-components.png){fig-align="center" width=90% height=90%}
The **number of principal components**, M, is typically chosen by **cross-validation**.
### Partial Least Squares (PLS)
The PLS approach attempts to find directions that help explain both the response and the predictors by placing the **highest weight** on the variables that are most **strongly related to the response**.
1. Standardize the predictors and **response**.
$$
\tilde{x}_{ij} =
\frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij}-\bar{x}_j)^2}} =
\frac{x_{ij}}{\sigma_j}
$$
2. Compute the first direction $Z_1$ by setting each $\phi_{j1}$ equal to the coefficient from the simple linear regression of $Y \sim X_j$, which is also proportional to the correlation between $Y$ and $X_j$.
3. Adjust each of the variables for $Z_1$, by regressing each variable on $Z_1$ and taking residuals.
4. Compute $Z_2$ using this *orthogonalized data* (the residuals) in exactly the same way that $Z_1$ was computed from the original data, and repeat the process $M$ times.
5. Use least squares to fit a linear model to predict $Y$ using $Z_1, \dots, Z_M$
To select the number $M$ we can use **cross-validation**.
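A minimal recipe-level sketch of PLS, assuming `recipes::step_pls()` and the backend it requires (the mixOmics package) are available, with the same `Hitters_train` data used above (illustrative only):
```r
# Sketch: a PLS recipe analogous to the PCA one; num_comp can be tuned as before
pls_recipe <-
  recipe(Salary ~ ., data = Hitters_train) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_outcomes(), all_predictors()) %>%
  step_pls(all_predictors(), outcome = "Salary", num_comp = tune())
```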
### PCR vs PLS
The next figure illustrates how each method works.
![](img/67-PCR-vs-PLS-diagrams.png){fig-align="center"}
The next figure illustrates the behaviour of the first two components under each method:
- **PCR**: They have very **little relationship** to the response variable.
- **PLS**: They have a much **stronger association** with the response variable.
![](img/68-PCR-vs-PLS-example.png){fig-align="center"}
::: {.callout-note appearance="default"}
### Performance Note
In practice, PLS often performs no better than ridge regression or PCR. While the supervised dimension reduction of PLS can reduce bias, it also has the potential to **increase variance**, so the overall benefit of PLS relative to PCR **is a wash**.
:::
### Coding example
To run a **PCR** or a **PLS** we just need to specify a linear model.
```{r}
lm_spec <-
linear_reg() %>%
set_mode("regression") %>%
set_engine("lm")
```
And tune our recipe to determine the `threshold` and the number of components to use, `num_comp`.
```{r}
# Same preprocessing as before, plus PCA on the predictors;
# both the variance threshold and the number of components are tuned
pca_recipe <-
  recipe(formula = Salary ~ ., data = Hitters_train) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(),
           threshold = tune(),
           num_comp = tune())

pca_workflow <-
  workflow() %>%
  add_recipe(pca_recipe) %>%
  add_model(lm_spec)

# Grid over the variance threshold and 1-20 components
threshold_grid <-
  grid_regular(threshold(),
               num_comp(c(1, 20)),
               levels = 5)

pca_tune_res <-
  tune_grid(pca_workflow,
            resamples = Hitters_fold,
            grid = threshold_grid)
```
As we can see below, the number of components doesn't have any effect on the test error.
```{r}
autoplot(pca_tune_res)+
scale_x_log10(labels = scales::comma_format(accuracy = 1))+
theme_light()
best_threshold <- select_best(pca_tune_res, metric = "rmse")
best_threshold
final_pcr_fit <-
finalize_workflow(pca_workflow, best_threshold) |>
fit(data = Hitters_train)
final_pcr_fit |>
augment(new_data = Hitters_test) |>
rmse(truth = Salary, estimate = .pred)
```
But thanks to the `DALEXtra` package we can interpret these complex models.
```{r}
library(DALEXtra)

set.seed(1807)
explain_tidymodels(
  final_pcr_fit,
  data = Hitters_train %>% select(-Salary),
  y = Hitters_train$Salary,
  label = "RDA",
  verbose = FALSE
) %>%
  model_parts() |>  # permutation-based variable importance
  plot()
```
## Poisson Regression
If the data meet the following conditions, we can fit a model based on the **Poisson Distribution**:
- The outcome variable is the result of **counting**, so $Y \in \{ 0, 1, 2, 3, \dots \}$.
- $Ave(Y) \approx Var(Y)$; if not, you can use the **Quasipoisson Distribution**.
The *Poisson Distribution* follows this probability function:
$$
Pr(Y = k) = \frac{e^{-\lambda} \lambda^{k}}
{k!}
$$
Where:
- $\lambda$ must be greater than 0. It represents both the expected number of events $E(Y)$ and the variance $Var(Y)$.
- $k$ represents the number of events we want to evaluate given $\lambda$. It must be an integer greater than or equal to 0.
So it makes sense that the value we want to predict with our regression is $\lambda$, using the following structure:
$$
\log{ \left( \lambda(X_1, X_2, \dots , X_p) \right)} = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p
$$
If we select this model, we need to be aware of how to interpret the coefficients. For example, if $\beta_1 = -0.08$ for a categorical variable, then by calculating $e^{-0.08}$ we can conclude that only ___`r round(exp(-0.08) * 100, 2) |> paste0("%")`___ of the baseline events (those related to $\beta_0$) are expected to happen (see the sketch below).
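A minimal sketch of that interpretation (illustrative only):
```r
# Sketch: a Poisson coefficient acts as a multiplicative change in the expected rate
beta_1 <- -0.08
exp(beta_1)  # ~0.923: the expected count is about 92.3% of the baseline rate
```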
### Coding example
To perform **Poisson Regression** we just need to create the model specification by loading the **poissonreg** package and using the **glm** engine.
```{r}
library(poissonreg)

# Poisson regression specification using the "glm" engine
pois_spec <- poisson_reg() %>%
  set_mode("regression") %>%
  set_engine("glm")

# Reuse the same preprocessing recipe as the linear model
pois_rec_spec <- lm_rec_spec

workflow() %>%
  add_model(pois_spec) %>%
  add_recipe(pois_rec_spec) %>%
  last_fit(split = BikeshareSpit) %>%
  collect_metrics()
```
## Logistic Regression
It models the **probability** $p(X) = Pr(Y=1|X)$ that $Y$ belongs to a particular category given some predictors, by assuming that $Y$ follows a **Bernoulli Distribution**. The model computes the probability using the ***logistic function***, which produces an S shape between 0 and 1:
$$
p(X) = \frac{e^{\beta_{0}+\beta_{1}X}}
{1+e^{\beta_{0}+\beta_{1}X}}
$$
![](img/09-logistic-function-example.png){fig-align="center"}
As the function returns probabilities, it is the analyst's responsibility to define a **threshold** to make classifications.
### Estimating coefficients
To estimate $\hat{\beta}_{0}$ and $\hat{\beta}_{1}$ the method used is called *maximum likelihood*, which consists of maximizing the **likelihood function**. It is worth noting that the *least squares approach* is in fact a special case of maximum likelihood.
$$
\ell(\beta_{0}, \beta_{1}) = \prod_{i:y_{i} = 1}p(x_{i})\prod_{i':y_{i'} = 0}(1-p(x_{i'}))
$$
### Multiple regression
We can also generalize the *logistic function*, as you can see below.
$$
p(X) = \frac{e^{\beta_{0}+\beta_{1}X_{1}+\dots+\beta_{p}X_{p}}}
{1+e^{\beta_{0}+\beta_{1}X_{1}+\dots+\beta_{p}X_{p}}}
$$
### Interpreting the model
To understand how each variable influences the probability $p(X)$, we manipulate the *logistic function* until we have a linear combination on the right-hand side.
$$
\underbrace{ \log{ \left( \overbrace{\frac{p(X)}{1 - p(X)}}^\text{odds ratio} \right)} }_\text{log odds or logit} = \beta_{0}+\beta_{1}X
$$
As we can see, the result of the linear combination is the $\log$ of the *odds ratio*, known as the **log odds** or **logit**.
The **odds ratio** of an event presents the likelihood that the event will occur as a proportion of the likelihood that the event won't occur. It can take any value between $0$ and $\infty$: low probabilities are close to $0$, high probabilities are close to $\infty$, and an even chance equals 1. For example, if we have $\text{odds ratio} = 2$, we can say that the event is 2 times more likely to happen than not.
Applying $\log{(\text{odds ratio})}$ makes it easier to compare the effect of variables: values below 1 become negative on a symmetric scale, and a non-significant odds ratio of 1 becomes 0. To get an idea, an odds ratio of 2 has the same size of effect as one of 0.5 (just in the opposite direction), which is hard to see at first glance, but after applying the $\log$ we can see that $\log{(2)} = 0.69$ and $\log{(0.5)} = -0.69$.
In the end, $p(X)$ will increase as $X$ increases whenever $\beta_{1}$ is positive, even though the relationship between them is not a linear one (a short sketch follows).
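A minimal sketch of the log-odds, odds-ratio and probability scales in base R (using `mtcars` only as an illustration, not part of the original example):
```r
# Sketch: a simple logistic fit and its three interpretation scales
fit <- glm(am ~ wt, data = mtcars, family = binomial)

coef(fit)                                                      # log odds (logit) scale
exp(coef(fit))                                                 # odds-ratio scale
predict(fit, newdata = data.frame(wt = 3), type = "response")  # probability scale
```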
#### Understanding a confounding paradox
|Simple Regression|Multiple Regression|
|:---------------|:-------------------|
|![](img/10-Default-table-4_2.png){fig-align="center"}|![](img/11-Default-table-4_3.png){fig-align="center"}|
|The positive coefficient for student indicates that for **over all values of balance and income**, a student is *more* likely to default than a non-student.|The negative coefficient for student indicates that for a **fixed value of balance and income**, a student is less likely to default than a non-student.|
![](img/12-Default-multiple-plot-4_3.png){fig-align="center" width=60% height=60%}
The problem lies in the fact that *student* and *balance* are **correlated**. As a consequence, a student is riskier than a non-student if no information about the student's credit card balance is available. However, that student is less risky than a non-student with the *same credit card balance*!
![](img/13-Default-simple-plot-4_3.png){fig-align="center" width=60% height=60%}
#### Multinomial Logistic Regression
We can also generalize the *logistic function* to support more than 2 categories ($K > 2$) by defining, by convention, the last category $K$ as the **baseline**.
For $k = 1, \dotsc, K-1$ we use the function:
$$
Pr(Y = k|X= x) = \frac{e^{\beta_{k0}+\beta_{k1}x_{1}+\dots+\beta_{kp}x_{p}}}
{1+\sum_{l=1}^{K-1}e^{\beta_{l0}+\beta_{l1}x_{1}+\dots+\beta_{lp}x_{p}}}
$$
For $k=K$, we use the function:
$$
Pr(Y = K|X= x) = \frac{1}
{1+\sum_{l=1}^{K-1}e^{\beta_{l0}+\beta_{l1}x_{1}+\dots+\beta_{lp}x_{p}}}