---
format:
html:
number-depth: 3
css: summary-format.css
---
# Generalized linear models (GLM)
## Linear regression
### Getting the Least Squares Line of a sample
As the *population regression line* is unobserved, the *least squares line* of a sample is a good estimate of it. To obtain it we follow these steps:
1. Define the function to fit.
$$
\hat{y} = \hat{\beta}_{0} + \hat{\beta}_{1} x
$$
2. Define how to calculate **residuals**.
$$
e_{i} = y_{i} - \hat{y}_{i}
$$
3. Define the **residual sum of squares (RSS)**.
$$
RSS = e_{1}^2 + e_{2}^2 + \dots + e_{n}^2
$$
4. Use calculus (or let the computer estimate numerically) to find the coefficients that minimize the RSS (see the sketch after this list).
$$
\hat{\beta}_{1} = \frac{\Sigma_{i=1}^{n}(x_{i}-\overline{x})(y_{i}-\overline{y})}
{\Sigma_{i=1}^{n}(x_{i}-\overline{x})^2}
, \quad
\hat{\beta}_{0} = \overline{y} - \hat{\beta}_{1}\overline{x}
$$
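As a quick check of these formulas, here is a minimal sketch (using the built-in `mtcars` data purely as an illustration, not part of the original example) that computes the coefficients by hand and compares them with `lm()`:
```r
# Minimal sketch: least squares coefficients by hand vs. lm(), on mtcars
x <- mtcars$wt
y <- mtcars$mpg

beta_1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta_0 <- mean(y) - beta_1 * mean(x)

c(beta_0 = beta_0, beta_1 = beta_1)
coef(lm(mpg ~ wt, data = mtcars))  # should match the manual estimates
```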
### Getting confidence intervals of coefficients
To estimate the **population regression line** we can calculate **confidence intervals** for the sample coefficients, defining a range in which we expect to find the population values with a given **confidence level**.
> If we use a 95% confidence level, we need to keep in mind that, over many repeated samples, only 95% of the intervals produced at this **confidence level** would contain the true value (the parameter).
To generate confidence intervals we would need to know the variance of the *random error*:
$$
\sigma^2 = Var(\epsilon)
$$
But as we cannot compute that variance directly, an alternative is to estimate it from the residuals, provided they meet the following conditions:
1. Each residual has the common variance $\sigma^2$, so the variance of the error terms should not be related to the value of the response.
2. The residuals are uncorrelated. For example, if $\epsilon_{i}$ is positive, that provides little or no information about the sign of $\epsilon_{i+1}$.
If these conditions fail, we end up underestimating the true standard errors, reducing the probability that an interval at a given confidence level contains the true value of the parameter, and understating the *p-values* associated with the model.
$$
\sigma \approx RSE = \sqrt{\frac{RSS}{(n-p-1)}}
$$
Now we can calculate the **standard error** of each coefficient and build the confidence intervals (a quick sketch follows the formulas).
$$
SE(\hat{\beta_{0}})^2 = \sigma^2
\left[\frac{1}{n}+
\frac{\overline{x}^2}
{\Sigma_{i=1}^{n} (x_{i}-\overline{x})^2}
\right]
$$
$$
SE(\hat{\beta_{1}})^2 = \frac{\sigma^2}
{\Sigma_{i=1}^{n} (x_{i} - \overline{x})^2}
$$
$$
\hat{\beta_{1}} \pm 2 \cdot SE(\hat{\beta_{1}}), \quad \hat{\beta_{0}} \pm 2 \cdot SE(\hat{\beta_{0}})
$$
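In R these standard errors and intervals come directly from the fitted model; a minimal sketch (again using `mtcars` only as an illustration):
```r
# Sketch: coefficient standard errors and 95% confidence intervals
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)$coefficients   # estimates and their standard errors
confint(fit, level = 0.95)  # confidence intervals for beta_0 and beta_1
```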
### Insights to extract
#### Confirm the relationship between the Response and Predictors
Use the regression's **overall p-value** (based on the F-statistic) to confirm that at **least one predictor** is related to the response, avoiding the interpretative problems associated with the number of observations (_n_) or predictors (_p_).
$$
H_{0}: \beta_{1} = \beta_{2} = \dots = \beta_{p} = 0
$$
$$
H_{a}: \text{at least one } \beta_{j} \text{ is non-zero}
$$
#### Accuracy of the model (relationship strength)
If we want to know how well the model fits the data we have two options (a quick sketch follows the formulas):
- **Residual standard error (RSE)**: Even if the model were correct, the actual values of $y$ would differ from the true regression line by approximately *this many units*, on average. To express it as a percentage error we can calculate $RSE/\overline{y}$
- **The $R^2$ statistic**: The proportion of variance explained by taking as a reference the **total sum of squares (TSS)**.
$$
TSS = \Sigma(y_{i} - \overline{y})^2
$$
$$
R^2 = \frac{TSS - RSS}{TSS}
$$
$$
R^2 =
\begin{cases}
Cor(X, Y)^2 & \text{Simple Linear Regression} \\
Cor(Y,\hat{Y})^2 & \text{Multiple Linear Regression}
\end{cases}
$$
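Both quantities are reported by `summary()`; a minimal sketch on `mtcars` (illustrative only, not part of the original example):
```r
# Sketch: RSE, R^2 and an approximate percentage error
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)$sigma                      # residual standard error (RSE)
summary(fit)$r.squared                  # R^2
summary(fit)$sigma / mean(mtcars$mpg)   # RSE relative to the mean response
```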
#### Confirm the relationship between the Response and each predictor
To answer that, we can test whether a particular subset of $q$ of the coefficients is zero.
$$
H_{0}: \beta_{p-q+1} = \beta_{p-q+2} = \dots = \beta_{p} = 0
$$
In this case, the F-statistic reports the **partial effect** of adding an extra variable to the model (the order matters), which we can use to apply a *variable selection* technique. The classical approach is to:
1. Fit a model for each variable combination $2^p$.
2. Select the best model based on *Mallow’s Cp*, *Akaike information criterion (AIC)*, *Bayesian information criterion (BIC)*, and *adjusted* $R^2$ or plot various model outputs, such as the residuals, in order to search for patterns.
But consider that if we have $p = 30$ we would need to fit $2^{30} = 1{,}073{,}741{,}824$ models, which is far too many. Some alternative approaches for this task:
- Forward selection
- Backward selection (cannot be used if $p > n$)
- Mixed selection
#### Size of association between each predictor and the response.
To check this we look at the *confidence intervals* of each $\hat{\beta}_{j}$, since the true $\beta_{j}$ should lie within that range.
#### Predicting future values
If we want to predict the average response $f(X)$ we can use confidence intervals, but if we want to predict an individual response $Y = f(X) + \epsilon$ we need to use prediction intervals, as they account for the uncertainty associated with $\epsilon$, the irreducible error.
### Standard linear regression model assumptions
- The **additivity assumption** means that the association between a predictor $X_{j}$ and the response $Y$ does not depend on the values of the other predictors, which is violated when there is an *interaction (synergy) effect*.
- The **linearity assumption** states that the change in the response Y associated with a one-unit change in $X_{j}$ is constant, regardless of the value of $X_{j}$.
#### Including an interaction term
This approach relaxes the *additivity assumption* that models usually carry.
- **2 quantitative variables**
It consists of adding an extra term whose coefficient multiplies two or more variables.
$$
\begin{split}
Y & = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} X_{1} X_{2} + \epsilon \\
& = \beta_{0} + (\beta_{1} + \beta_{3} X_{2}) X_{1} + \beta_{2} X_{2} + \epsilon \\
& = \beta_{0} + \tilde{\beta}_{1} X_{1} + \beta_{2} X_{2} + \epsilon
\end{split}
$$
After adding the interaction term we can interpret the change as making one of the original coefficients a function of the other variable. Now we can say that $\beta_{3}$ *represents __the change in the effectiveness of__* $X_{1}$ *associated with a one-unit increase in* $X_{2}$.
It is very important to respect the **hierarchical principle**, which states that if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant, since ***omitting them would alter the meaning of the interaction***.
- **1 quantitative and 1 qualitative variable**
If $X_{1}$ is quantitative and $X_{2}$ is qualitative:
$$
\hat{Y} =
\begin{cases}
(\beta_{0} + \beta_{2}) + (\beta_{1} + \beta_{3})X_{1} & \text{if }X_{2} \text{ is TRUE}\\
\beta_{0} + \beta_{1}X_{1} & \text{if }X_{2} \text{ is FALSE}
\end{cases}
$$
Adding the $\beta_{3}$ interaction allows the slope of the line, not just its intercept, to change with $X_{2}$ (see the sketch after the figure).
![](img/03-factor-interaction.png){fig-align="center"}
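In R's formula syntax both kinds of interaction can be written directly; a minimal sketch on `mtcars` (illustrative only, not part of the original example):
```r
# Sketch: interaction terms in the formula interface
lm(mpg ~ wt * hp, data = mtcars)          # expands to wt + hp + wt:hp (hierarchical principle)
lm(mpg ~ wt * factor(am), data = mtcars)  # quantitative x qualitative: a different slope per level
```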
#### Polynomial regression
This approach relaxes the *linearity assumption* that models usually carry. It consists of including transformed versions of the predictors.
$$
\begin{split}
Y & = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} \\
& = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{1}^2
\end{split}
$$
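A minimal sketch of a quadratic fit in R (illustrative only, using `mtcars` as an assumed example dataset):
```r
# Sketch: adding a quadratic term to a linear model
lm(mpg ~ wt + I(wt^2), data = mtcars)  # raw polynomial term
lm(mpg ~ poly(wt, 2), data = mtcars)   # orthogonal polynomial basis; same fitted values
```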
### Possible problems
#### The target doesn't follow a normal distribution
We have two ways to detect this problem:
1. Plot the target variable with a histogram
![](img/46-ames-sale-price-distribution.png){fig-align="center"}
2. Plotting the residuals and checking whether the plot shows a funnel shape, which indicates a *non-constant variance (heteroscedasticity) of the error terms*.
![](img/05-non-constance-variance.png){fig-align="center"}
To solve this problem we can apply transformations to the target variable.
1. Applying a concave function such as $\log{Y}$ or $\sqrt{Y}$. If the target variable contains 0s we can set the `offset` argument to 1.
```r
# Log-transform the outcome; the offset is added before taking the log
ames_recipe <-
  recipe(Sale_Price ~ ., data = ames_train) %>%
  step_log(all_outcomes(), offset = 0)
```
2. Applying the **Box-Cox transformation**, which is more flexible than the log transformation and will find an appropriate transformation from a family of power transforms that makes **the variable as close as possible to a normal distribution**. Its lambda ($\lambda$) value varies from -5 to 5 and is selected based on the training data (see the sketch after this list).
$$
y(\lambda) =
\begin{cases}
\frac{Y^\lambda - 1}{\lambda}, \text{if } \lambda \neq 0 \\
\log{(Y)}, \text{if } \lambda = 0
\end{cases}
$$
3. If your data contain values $y \leq 0$, use the **Yeo-Johnson transformation**, which is very similar to Box-Cox but does not require strictly positive values.
```r
ames_recipe <-
recipe(Sale_Price ~ ., data = ames_train) %>%
step_YeoJohnson(all_outcomes())
```
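For completeness, the Box-Cox transformation mentioned in item 2 has an analogous recipe step; a minimal sketch, assuming the same `ames_train` data as in the snippet above:
```r
# Sketch: Box-Cox transformation of the outcome; lambda is estimated from the training data
ames_recipe <-
  recipe(Sale_Price ~ ., data = ames_train) %>%
  step_BoxCox(all_outcomes())
```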
To interpret our predictions correctly it's important to remember that we need to **revert this transformation after getting the prediction**.
These transformations reduce the **relative error** of each prediction.
#### Collinearity
**Collinearity** refers to the situation in which two or more predictor variables are closely related (highly correlated) to one another. It reduces the accuracy of the estimates of the regression coefficients and causes the standard error of $\hat{\beta}_{j}$ to grow. That reduces the **power of the hypothesis test**, that is, the probability of correctly detecting a non-zero coefficient.
Looking at the correlation matrix of the predictors can be useful, but it is possible for collinearity to exist between three or more variables even if no pair of variables has a particularly high correlation (**multicollinearity**).
|Detection method|Solutions|
|:---------------|:--------|
|The best way to assess multicollinearity is to compute the **variance inflation factor (VIF)**: the ratio of the variance of $\hat{\beta}_{j}$ when fitting the full model to the variance of $\hat{\beta}_{j}$ if fit on its own. Its lowest possible value is 1, and values of 5 or 10 indicate problematic collinearity|1. Drop one of the problematic variables from the regression. </br> </br> 2. Combine the collinear variables together into a single predictor|
$$
\text{VIF}(\hat{\beta}_{j}) = \frac{1}
{1 - R_{X_{j}|X_{-j}}^2}
$$
Where $R_{X_{j}|X_{-j}}^2$ is the $R^2$ from a regression of $X_{j}$ onto all of the other predictors.
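A minimal sketch of the VIF computation, assuming the `car` package is available (on `mtcars`, illustrative only):
```r
# Sketch: variance inflation factors for a multiple regression
library(car)

fit <- lm(mpg ~ wt + disp + hp, data = mtcars)
vif(fit)  # values around 5-10 or above suggest problematic collinearity
```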
#### Non-linearity of the response-predictor relationships
|Detection method|Solutions|
|:---------------|:--------|
|Plot the **residuals versus predicted values** $\hat{y}_{i}$. Ideally, the residual plot will show no discernible pattern. The presence of a pattern may indicate a problem with some aspect of the linear model.|A simple approach is to use non-linear transformations of the predictors, such as $\log{X}$, $\sqrt{X}$, and $X^2$, in the regression model|
![](img/04-residuals-predicted-values.png){fig-align="center"}
#### Correlation of error terms
If there is correlation among the errors, then the estimated standard errors of the coefficients will be biased **leading to prediction intervals being narrower** than they should be.
|Detection method|Solutions|
|:---------------|:--------|
|1. Plot the residuals from our model as a function of time or execution order. If the errors are uncorrelated, then there should be no discernible pattern. </br> </br> 2. Check if some observation have been exposed to the same environmental factors|Good experimental design is crucial in order to mitigate these problems|
#### Outliers
An outlier is a point for which $y_{i}$ is far from the value predicted by the model. Sometimes they have little effect on the least squares line, but they *inflate the RSE* (increasing the model's p-values) and *deflate the* $R^2$.
|Detection method|Solutions|
|:---------------|:--------|
|Plot the **studentized residuals**, computed by dividing each residual $e_{i}$ by its estimated standard error, and look for points whose absolute value is greater than 3|They can be removed if they are due to an error in data collection. Otherwise, they may indicate a deficiency with the model, such as a missing predictor.|
![](img/06-studentized-residuals-plot.png){fig-align="center"}
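A minimal sketch of that check in R (on `mtcars`, illustrative only):
```r
# Sketch: flag observations with |studentized residual| > 3
fit <- lm(mpg ~ wt + hp, data = mtcars)
which(abs(rstudent(fit)) > 3)  # indices of candidate outliers (possibly none)
```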
#### High-leverage points
Observations with **high leverage** have an unusual value for $x_{i}$. High leverage observations tend to have a sizable impact on the estimated regression line and any problems with these points may invalidate the entire fit.
|Detection method|Solutions|
|:---------------|:--------|
|Compute the leverage statistic and look for observations whose value greatly exceeds the average leverage, $(p + 1)/n$. Leverage values are always between $1/n$ and $1$|Make sure that the value is correct and not a data collection problem|
$$
h_{i} = \frac{1}{n} +
\frac{(x_{i} - \overline{x})^2}
{\Sigma_{i'=1}^n(x_{i'} - \overline{x})^2}
$$
In a *multiple linear regression*, it is possible to have an observation that is well within the range of each individual predictor’s values, but that is unusual in terms of the full set of predictors.
![](img/07-studentized-residuals-leverage-plot.png){fig-align="center"}
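A minimal sketch of the leverage check (on `mtcars`, illustrative only):
```r
# Sketch: leverage statistics compared with the average leverage (p + 1)/n
fit <- lm(mpg ~ wt + hp, data = mtcars)
h <- hatvalues(fit)
which(h > 2 * mean(h))  # mean(h) equals (p + 1)/n for a least squares fit
```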
### Avoid using for classification problems
There are better models for this kind of problem. For example, for a binary response the linear **discriminant analysis (LDA)** procedure yields the same classifications as linear regression. Other reasons are:
- A regression method cannot accommodate a qualitative response with more than two classes.
- A regression method will not provide meaningful estimates of $Pr(Y|X)$ as some of our estimates might be outside the [0, 1] interval.
### Benefits of Replacing plain least squares fitting
- **Prediction Accuracy:** If $n$ is not much larger than $p$, there can be a lot of variability in the least squares fit, resulting in overfitting and consequently poor predictions on future observations not used in model training. And if $p > n$, there is no longer a unique least squares coefficient estimate: the variance is infinite, so the method cannot be used at all. As an alternative, we can reduce the variance at the cost of a small increase in bias (*constraining* and *shrinking* the coefficients).
- **Model Interpretability:** Some methods can exclude irrelevant variables from a multiple regression model (*feature selection* or *variable selection*).
### Coding example
To perform **Linear Regression** we just need to create the model specification by using **lm** engine.
```{r}
library(ISLR2)
library(tidymodels)

set.seed(123)
# Train/test split of the Bikeshare data
BikeshareSpit <- initial_split(Bikeshare)

# Model specification: ordinary least squares via the "lm" engine
lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

# Recipe: dummy-encode the nominal predictors
lm_rec_spec <-
  recipe(bikers ~ mnth + hr + workingday + temp + weathersit,
         data = Bikeshare) %>%
  step_dummy(all_nominal_predictors())

# Fit on the training set and evaluate on the test set
workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(lm_rec_spec) %>%
  last_fit(split = BikeshareSpit) %>%
  collect_metrics()
```
## Extending Linear Regression (Data Transformations)
### Polynomial regression
It extends the *linear model* by adding extra predictors, obtained by **raising each of the original predictors to a power**.
As a result, if the response is a numeric variable, we can fit a model of the following form:
$$
y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \dots + \beta_d x_i^d + \epsilon_i
$$
![](img/35-polynomial-4degree-regression.png){fig-align="center"}
On the other hand, we can use *logistic regression* with the same structure to predict the probability of a particular class:
$$
\Pr(y_i > 250|x_i) = \frac{\exp(\beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \dots + \beta_d x_i^d)}
{1 + \exp(\beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \dots + \beta_d x_i^d)}
$$
![](img/36-polynomial-4degree-logistic-regression.png){fig-align="center"}
### Piecewise constant regression
They **cut the range of a variable into K distinct regions** (known as *bins*) in order to produce a qualitative variable. This has the effect of fitting a piecewise constant function.
If we define the cutpoints $c_1, c_2, \dots, c_K$ in the range of *X*, we can create *dummy variables* to represent each range. For example, if $c_1 \leq x_i < c_2$ is `TRUE` then $C_1(x_i) = 1$, and we repeat that process for each value of $X$ and each range. As a result we can fit a *linear regression* on the new variables (see the sketch at the end of this section).
$$
y_i = \beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + \dots + \beta_K C_K(x_i) + \epsilon_i
$$
![](img/37-step-function-regression.png){fig-align="center"}
On the other hand, we can use *logistic regression* with the same structure to predict the probability of a particular class:
$$
\Pr(y_i > 250|x_i) = \frac{\exp(\beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + \dots + \beta_K C_K(x_i))}
{1 + \exp(\beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + \dots + \beta_K C_K(x_i))}
$$
![](img/38-step-function-logistic-regression.png){fig-align="center"}
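A minimal sketch of a piecewise constant fit in R, assuming the `Wage` data from `ISLR2` (illustrative only):
```r
# Sketch: step functions via cut(), which builds the bin dummies automatically
library(ISLR2)

lm(wage ~ cut(age, 4), data = Wage)                               # piecewise constant regression
glm(I(wage > 250) ~ cut(age, 4), data = Wage, family = binomial)  # logistic version
```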
### Piecewise polynomial regression (Natural Spline)
It consists of fitting separate low-degree polynomials over different regions of *X*. For example, a **piecewise cubic polynomial** with a single knot at a point *c* takes the form:
$$
y_i =
\begin{cases}
\beta_{01} + \beta_{11} x_i + \beta_{21} x_i^2 + \beta_{31} x_i^3 + \epsilon_i & \text{if } x_i<c\\
\beta_{02} + \beta_{12} x_i + \beta_{22} x_i^2 + \beta_{32} x_i^3 + \epsilon_i & \text{if } x_i \geq c
\end{cases}
$$
As each polynomial has four parameters, we are using a total of **8 degrees of freedom** to fit this model. Applying it to the `Wage` data reveals a problem: the model is **too flexible**, and to solve it we need to **constrain** it to be continuous at `age = 50`.
![](img/39-piecewise-cubic-example.png){fig-align="center"}
Even after applying the **continuity constraint**, the plot still shows an unnatural V-shape. This can be solved by also applying the **continuity constraint** to the *first* and *second* derivatives of the function, which leaves a model with **5 degrees of freedom**.
![](img/40-constrained-piecewise-cubic-example.png){fig-align="center"}
In this context, a **natural spline** refers to a *regression spline* with the additional constraints of maintaining **linearity at the boundaries**.
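A minimal sketch of a natural spline fit, assuming the `splines` package and the `Wage` data from `ISLR2` (illustrative only):
```r
# Sketch: natural cubic spline of age, constrained to be linear beyond the boundary knots
library(splines)
library(ISLR2)

fit <- lm(wage ~ ns(age, df = 4), data = Wage)
summary(fit)
```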
## Subset Selection
This approach involves identifying a subset of the *p* predictors that we believe to be related to the response. We then fit a model using least squares on the reduced set of variables.
### Best Subset Selection
It fits all *p* models that contain exactly one predictor, all $\left( \begin{array}{c} p \\ 2 \end{array} \right) = p (p-1)/2$ models that contain exactly two predictors, and so forth. Then it selects the *best* model based on smallest RSS or the largest $R^2$.
- **Algorithm 6.1**
1. Let $\mathcal{M}_0$ denote the null model, which simply predicts the sample mean for each observation.
2. For $k = 1, 2, \dots, p$:
- Fit all $\left( \begin{array}{c} p \\ k \end{array} \right)$ models that contain exactly k predictors.
- Pick the best among these $\left( \begin{array}{c} p \\ k \end{array} \right)$ models using the smallest RSS or the *deviance* for classification (negative two times the maximized log-likelihood), and call it $\mathcal{M}_k$
3. Select a single best model from among $\mathcal{M}_0, \dots, \mathcal{M}_p$ using cross-
validated prediction error, $C_p$ (AIC), BIC or adjusted $R^2$.
This method is very computationally expensive as it needs to fit $2^p$ models. Consider that if your data has 20 predictors, there are over one million possibilities. Such an enormous search space can also lead to **overfitting** and **high variance of the coefficient estimates**.
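A minimal sketch of subset selection, assuming the `leaps` package and the `Hitters` data from `ISLR2` are available (illustrative only):
```r
# Sketch: best subset selection; pick the model size with the lowest criterion
library(leaps)
library(ISLR2)

hitters_cc <- na.omit(Hitters)  # drop rows with missing Salary
best_fit <- regsubsets(Salary ~ ., data = hitters_cc, nvmax = 19)
summary(best_fit)$bic  # BIC of the best model of each size
# For stepwise selection, add method = "forward" or method = "backward"
```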
### Stepwise Selection: Forward Stepwise Selection
It begins with a model containing no predictors, and then adds the predictor that gives the greatest additional improvement to the fit, one at a time, until all of the predictors are in the model. As a result we need to fit $1+p(p+1)/2$ models.
- **Algorithm 6.2**
1. Let $\mathcal{M}_0$ denote the null model, which simply predicts the sample mean for each observation.
2. For $k = 0, \dots, p-1$:
- Consider all $p-k$ models that augment the predictors in $\mathcal{M}_k$ with one additional predictor.
- Choose the best among these $p-k$ models, and call it $\mathcal{M}_{k+1}$. Here best is defined as having smallest RSS.
3. Select a single best model from among $\mathcal{M}_0, \dots, \mathcal{M}_p$ using cross-
validated prediction error, $C_p$ (AIC), BIC or adjusted $R^2$.
Though it tends to do well in practice, it is not guaranteed to find the best possible model out of all $2^p$ models containing subsets of the *p* predictors, but can be used even if $n < p$.
### Stepwise Selection: Backward Stepwise Selection
It begins with the full least squares model containing all *p* predictors, and then iteratively removes the least useful predictor, one at a time. As a result we need to fit $1+p(p+1)/2$ models.
- **Algorithm 6.3**
1. Let $\mathcal{M}_p$ denote the full model, which contains all *p* predictors.
2. For $k = p, p -1, \dots , 1$:
- Consider all *k* models that contain all but one of the predictors in $\mathcal{M}_k$, for a total of $k-1$ predictors.
- Choose the best among these *k* models, and call it $\mathcal{M}_{k-1}$. Here best is defined as having smallest RSS.
3. Select a single best model from among $\mathcal{M}_0, \dots, \mathcal{M}_p$ using cross-
validated prediction error, $C_p$ (AIC), BIC or adjusted $R^2$.
This method doesn't guarantee to yield the best model containing a subset of the *p* predictors, and we need to confirm that $n \geq p$.
### Stepwise Selection: Hybrid Approaches
In this method variables are added sequentially, as in forward selection, but after adding each new variable the method may also remove any variables that no longer provide an improvement in the model fit. Such an approach attempts to more closely mimic *Best Subset Selection* while retaining the computational advantages of *forward* and *backward stepwise selection*.
### Choosing the Optimal Model
As the model which contains all of the predictors will always have the smallest RSS, we need to estimate the test error rate by:
- Making an adjustment to the training error to account for the bias due to overfitting.
- Using either a validation set approach or a cross-validation approach.
Let's see the methods that rely on adjusting the training error:
|Method|Formula|Interpretation|
|:----:|:-----:|:----------|
|$C_p$|$C_p = \frac{1}{n} (\text{RSS} + 2d\hat{\sigma}^2)$|$2d\hat{\sigma}^2$ represent the penalty of adding new predictors. This method is a good estimation of **test MSE** if the $\hat{\sigma}^2$ is an unbiased estimate of the $\sigma^2$|
|Akaike information criterion|$\text{AIC} = \frac{1}{n} (\text{RSS} + 2d\hat{\sigma}^2)$|The book omits irrelevant constants to show that $C_p$ and $\text{AIC}$ are proportional to each other|
|Bayesian information criterion|$\text{BIC} = \frac{1}{n}(\text{RSS} +\log(n) d \hat{\sigma}^2)$|After omitting irrelevant constants, we can see that $\log n > 2$ for any $n > 7$; as a consequence, this metric places a heavier penalty on models with many variables than $C_p$ does and tends to select models with fewer predictors|
|Adjusted $R^2$|$\text{Adjusted} \; R^2 = 1 - \frac{\text{RSS}/(n - d -1)}{\text{TSS}/(n-1)}$|A large value of adjusted $R^2$ indicates a model with a small test error, even though the metric doesn't rely on rigorous theoretical justifications.|
*Where:*
- $n$: Number of observations
- $d$: Number of predictors
- $\hat{\sigma}^2$: Estimate of the variance of the error $\epsilon$ associated with each response
![](img/25-cp-bic-adj-r2-example.png){fig-align="center" width=90% height=90%}
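These adjusted criteria are easy to read off a fitted model; a minimal sketch with `broom` (loaded with tidymodels), on `mtcars` as an illustrative stand-in:
```r
# Sketch: adjustment-based criteria for a single candidate model
library(broom)

fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
glance(fit)[, c("adj.r.squared", "AIC", "BIC")]  # compare these across candidate models
```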
## Shrinkage
This approach involves fitting a model involving all *p* predictors and shrinks the estimated coefficients towards zero. Depending on what type of shrinkage is performed, some of the coefficients may be estimated exactly at zero, performing some variable selection.
### Ridge Regression
Rather than minimizing the RSS alone over the regression coefficients $\beta_0, \beta_1, \dots, \beta_p$, this method minimizes the RSS plus a **shrinkage penalty**, which has the effect of shrinking the estimates of $\beta_j$ towards zero:
$$
RSS + \lambda \sum_{j=1}^p \beta_j^2
$$
*Where:*
- $\lambda$: It's a tuning parameter $\geq 0$ that can be calculated using cross-validation.
- $\text{RSS}$: The residual sum of squares.
As a result, ridge regression will produce a different set of coefficient estimates, $\hat{\beta}_\lambda^R$, for each $\lambda$ value. By plotting the new coefficients against the *penalty* used we can see how the coefficients move towards zero without reaching exactly 0. It may also be useful to plot the $\ell_2$ norm ($\| \beta \|_2 = \sqrt{\sum_{j=1}^p \beta_j^2}$) of the *ridge* coefficients as a proportion of the *least squares* norm, to see by how much the ridge coefficient estimates have been shrunken towards zero.
![](img/26-ridge-regression-change.png){fig-align="center" width=90% height=90%}
As the selected $\lambda$ is sensitive to the scale of the predictors, it is good practice to standardize the predictors with the following formula so that they all have a standard deviation of 1:
$$
\tilde{x}_{ij} =
\frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij}-\bar{x}_j)^2}} =
\frac{x_{ij}}{\sigma_j}
$$
When the number of variables *p* is almost as large as the number of observations *n*, the least squares estimates are extremely variable. This method **increases the bias** by reducing the *flexibility* through $\lambda$ in order to **reduce the variance** and find a lower error rate, without the heavy computation required by the *best subset selection* method.
![](img/27-ridge-regression-bias-variance-trade-off.png){fig-align="center" width=90% height=90%}
### Lasso Regression
*Ridge regression* doesn't set coefficients exactly to zero (unless $\lambda = \infty$). That doesn't affect the model's accuracy, but it doesn't help when we have to interpret a model with many predictors. To overcome that problem, *lasso regression* performs ***variable selection*** by minimizing the following function:
$$
RSS + \lambda \sum_{j=1}^p |\beta_j|
$$
*Where:*
- $\lambda$: It's a tuning parameter $\geq 0$ that can be calculated using cross-validation.
- $\text{RSS}$: The residual sum of squares.
As a consequence, the method uses the $\ell_1$ penalty ($\| \beta \|_1 = \sum_{j=1}^p |\beta_j|$) instead of the $\ell_2$.
![](img/28-lasso-regression-change.png){fig-align="center" width=90% height=90%}
### Coding example
To perform a **Ridge** or **Lasso** regression we need to use the function `linear_reg` and define the `mixture` argument depending on the regression we want to perform.
|Model|Mixture|
|:----|:------|
|Ridge Regression|mixture = **0**|
|Lasso Regression|mixture = **1**|
As both regressions depend on the `penalty` parameter, we need to use cross-validation to estimate the best one. But if you want to explore the parameter yourself you have the following option.
- Fit a model and explore different penalties using `tidy`, `augment`, `predict` or `autoplot` functions.
```{r}
Hitters <-
  as_tibble(Hitters) %>%
  filter(!is.na(Salary))

# Ridge regression (mixture = 0); glmnet fits a whole path of penalties
ridge_fit <-
  linear_reg(mixture = 0, penalty = 0) %>%
  set_mode("regression") %>%
  set_engine("glmnet") %>%
  fit(Salary ~ ., data = Hitters)

# Inspect coefficients and predictions at a specific penalty
tidy(ridge_fit, penalty = 11498)
predict(ridge_fit, new_data = Hitters, penalty = 11498)

# Coefficient paths across the penalty grid
autoplot(ridge_fit) +
  scale_x_log10(labels = scales::comma_format(accuracy = 1)) +
  theme_light()
```
Let's see an example using the tidymodels approach, by following these steps:
1. Split the data into training and testing sets.
```{r}
set.seed(18)
Hitters_split <- initial_split(Hitters, strata = "Salary")
Hitters_train <- training(Hitters_split)
Hitters_test <- testing(Hitters_split)
```
2. Define the workflow to use.
```{r}
# Recipe: handle new factor levels, dummy-encode, drop zero-variance columns
# and normalize the predictors (glmnet expects standardized inputs)
ridge_recipe <-
  recipe(formula = Salary ~ ., data = Hitters_train) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())

# Ridge specification with a tunable penalty
ridge_spec <-
  linear_reg(penalty = tune(), mixture = 0) %>%
  set_mode("regression") %>%
  set_engine("glmnet")

ridge_workflow <-
  workflow() %>%
  add_recipe(ridge_recipe) %>%
  add_model(ridge_spec)
```
3. Use cross-validation to estimate the testing error and tune the penalty.
```{r}
set.seed(40)
Hitters_fold <- vfold_cv(Hitters_train, v = 10)
```
4. Define the penalties to check, where `range` gives the limits of $x$ in $10^x$ and `levels` the number of values used to cover that range.
```{r}
penalty_grid <- grid_regular(penalty(range = c(-5, 5)), levels = 50)
```
5. Let's fit our models.
```{r}
tune_res <-
tune_grid(ridge_workflow,
resamples = Hitters_fold,
grid = penalty_grid)
```
6. Check how the test error changes with a plot, or just export a table.
```{r}
autoplot(tune_res)+
scale_x_log10(labels = scales::comma_format(accuracy = 1))+
theme_light()
collect_metrics(tune_res) %>%
filter(.metric == "rsq") %>%
arrange(desc(mean))
```
7. Select the best model configuration and update the workflow to use that parameter.
```{r}
best_penalty <- select_best(tune_res, metric = "rsq")
best_penalty
ridge_final <- finalize_workflow(ridge_workflow, best_penalty)
ridge_final
```
8. Fit the final model and validate the performance.
```{r}
ridge_final_fit <- fit(ridge_final, data = Hitters_train)
augment(ridge_final_fit, new_data = Hitters_test) %>%
rsq(truth = Salary, estimate = .pred)
```
To perform a Lasso regression we don't need to repeat all the code.
```{r}
# Lasso: same recipe as before, but mixture = 1
lasso_spec <-
  linear_reg(penalty = tune(), mixture = 1) %>%
  set_mode("regression") %>%
  set_engine("glmnet")

lasso_workflow <-
  workflow() %>%
  add_recipe(ridge_recipe) %>%
  add_model(lasso_spec)

lasso_penalty_grid <- grid_regular(penalty(range = c(-3, 2)), levels = 50)

lasso_tune_res <-
  tune_grid(lasso_workflow,
            resamples = Hitters_fold,
            grid = lasso_penalty_grid)

autoplot(lasso_tune_res) +
  scale_x_log10(labels = scales::comma_format(accuracy = 1)) +
  theme_light()
```
## Dimension Reduction
This method projects the $p$ predictors into an $M$-dimensional subspace. If $Z_1, Z_2, \dots, Z_M$ represent $M < p$ linear combinations $Z_m = \sum_{j=1}^p \phi_{jm}X_j$ of **ALL our original predictors**, based on some constants $\phi_{1m}, \phi_{2m}, \dots, \phi_{pm}$, then we can use the new variables to fit a linear regression model by least squares.
$$
y_i = \theta_0 + \sum_{m=1}^M \theta_m z_{im} + \epsilon_i,
\qquad i=1, \dots, n.
$$
This is not a feature selection method, as each of the *M* principal components used in the regression is a linear combination of all *p* of the original features. In this sense, PCR is more closely related to *ridge regression* than to the *lasso*.
To select the $\phi_{jm}$'s we will discuss two different ways:
### Principal Components Regression (PCR)
PCR assumes that *the directions in which* $X_1, \dots, X_p$ *show the most variation are the directions associated with* $Y$. If that is true, then fitting the model to $Z_1, \dots, Z_M$ will lead to better results than using the original variables $X_1, \dots, X_p$.
To perform a *principal components analysis* (PCA):
1. It's recommended to standardize each predictor to have the same scale.
$$
\tilde{x}_{ij} =
\frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij}-\bar{x}_j)^2}} =
\frac{x_{ij}}{\sigma_j}
$$
2. Select as the *first principal component* the direction along which the data *vary the most*.
![](img/29-pca-data-example.png){fig-align="center" width=90% height=90%}
3. Project the observations onto the first component, to capture the largest possible variance.
![](img/30-pca-projection.png){fig-align="center" width=90% height=90%}
4. Then maximize $\text{Var}(\phi_{11} \times (x_1-\overline{x_1}) + \phi_{21} \times (x_2-\overline{x_2}))$ subject to $\phi_{11}^2 + \phi_{21}^2 = 1$ to get **the principal component loadings**. As a result, $Z_1$ is a weighted average of the two variables.
5. Repeat the process until having *p* distinct principal components, each *perpendicular* to the previous ones.
In general, this method performs better when only a **few principal components** are needed.
![](img/31-pca-test-error.png){fig-align="center" width=90% height=90%}
![](img/32-pca-test-error-less-components.png){fig-align="center" width=90% height=90%}
The **number of principal components**, M, is typically chosen by **cross-validation**.
### Partial Least Squares (PLS)
The PLS approach attempts to find directions that help explain both the response and the predictors by placing the **highest weight** on the variables that are most **strongly related to the response**.
1. Standardize the predictors and **response**.
$$
\tilde{x}_{ij} =
\frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n(x_{ij}-\bar{x}_j)^2}} =
\frac{x_{ij}}{\sigma_j}
$$
2. Compute the first direction $Z_1$ by setting each $\phi_{j1}$ equal to the coefficient from the simple linear regression of $Y \sim X_j$, which is also proportional to the correlation between $Y$ and $X_j$.
3. Adjust each of the variables for $Z_1$, by regressing each variable on $Z_1$ and taking residuals.
4. Compute $Z_2$ using this *orthogonalized data* (the residuals) in exactly the same way that $Z_1$ was computed from the original data, and repeat the process $M$ times.
5. Use least squares to fit a linear model to predict $Y$ using $Z_1, \dots, Z_M$
To select the number $M$ we can use **cross-validation**.
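A minimal recipe-level sketch of PLS, assuming `recipes::step_pls()` and the backend it requires (the mixOmics package) are available, with the same `Hitters_train` data used above (illustrative only):
```r
# Sketch: a PLS recipe analogous to the PCA one; num_comp can be tuned as before
pls_recipe <-
  recipe(Salary ~ ., data = Hitters_train) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_outcomes(), all_predictors()) %>%
  step_pls(all_predictors(), outcome = "Salary", num_comp = tune())
```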
### PCR vs PLS
The next figure illustrates how each method works.
![](img/67-PCR-vs-PLS-diagrams.png){fig-align="center"}
The next figure illustrates the behaviour of the first two components under each method:
- **PCR**: They have very **little relationship** to the response variable.
- **PLS**: They have a much **stronger association** with the response variable.
![](img/68-PCR-vs-PLS-example.png){fig-align="center"}
::: {.callout-note appearance="default"}
### Performance Note
In practice, PLS often performs no better than ridge regression or PCR. While the supervised dimension reduction of PLS can reduce bias, it also has the potential to **increase variance**, so the overall benefit of PLS relative to PCR **is a wash**.
:::
### Coding example
To run a **PCR** or a **PLS** we just need to specify a linear model.
```{r}
lm_spec <-
linear_reg() %>%
set_mode("regression") %>%
set_engine("lm")
```
And tune our recipe to determine the `threshold` and the number of components to use, `num_comp`.
```{r}
# Same preprocessing as before, plus PCA on the predictors;
# both the variance threshold and the number of components are tuned
pca_recipe <-
  recipe(formula = Salary ~ ., data = Hitters_train) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(),
           threshold = tune(),
           num_comp = tune())

pca_workflow <-
  workflow() %>%
  add_recipe(pca_recipe) %>%
  add_model(lm_spec)

# Grid over the variance threshold and 1-20 components
threshold_grid <-
  grid_regular(threshold(),
               num_comp(c(1, 20)),
               levels = 5)

pca_tune_res <-
  tune_grid(pca_workflow,
            resamples = Hitters_fold,
            grid = threshold_grid)
```
As we can see below, the number of components doesn't have any effect on the test error.
```{r}
autoplot(pca_tune_res)+
scale_x_log10(labels = scales::comma_format(accuracy = 1))+
theme_light()
best_threshold <- select_best(pca_tune_res, metric = "rmse")
best_threshold
final_pcr_fit <-
finalize_workflow(pca_workflow, best_threshold) |>
fit(data = Hitters_train)
final_pcr_fit |>
augment(new_data = Hitters_test) |>
rmse(truth = Salary, estimate = .pred)
```
But thanks to the `DALEXtra` package we can interpret these complex models.
```{r}
library(DALEXtra)

set.seed(1807)
explain_tidymodels(
  final_pcr_fit,
  data = Hitters_train %>% select(-Salary),
  y = Hitters_train$Salary,
  label = "RDA",
  verbose = FALSE
) %>%
  model_parts() |>  # permutation-based variable importance
  plot()
```
## Poisson Regression
If the data meet the following conditions, we can fit a model based on the **Poisson Distribution**:
- The outcome variable is the result of **counting**, so $Y \in \{ 0, 1, 2, 3, \dots \}$.
- $Ave(Y) \approx Var(Y)$; if not, you can use the **Quasipoisson Distribution**.
The *Poisson Distribution* follows this probability function:
$$
Pr(Y = k) = \frac{e^{-\lambda} \lambda^{k}}
{k!}
$$
Where:
- $\lambda$ must be greater than 0. It represents both the expected number of events $E(Y)$ and the variance $Var(Y)$.
- $k$ represents the number of events we want to evaluate given $\lambda$. It must be an integer greater than or equal to 0.
So it makes sense that the value we want to predict with our regression is $\lambda$, using the following structure:
$$
\log{ \left( \lambda(X_1, X_2, \dots , X_p) \right)} = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p
$$
If we select this model, we need to be aware of how to interpret the coefficients. For example, if $\beta_1 = -0.08$ for a categorical variable, then by calculating $e^{-0.08}$ we can conclude that only ___`r round(exp(-0.08) * 100, 2) |> paste0("%")`___ of the baseline events (those related to $\beta_0$) are expected to happen (see the sketch below).
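A minimal sketch of that interpretation (illustrative only):
```r
# Sketch: a Poisson coefficient acts as a multiplicative change in the expected rate
beta_1 <- -0.08
exp(beta_1)  # ~0.923: the expected count is about 92.3% of the baseline rate
```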
### Coding example
To perform **Poisson Regression** we just need to create the model specification by loading the **poissonreg** package and using the **glm** engine.
```{r}
library(poissonreg)

# Poisson regression specification using the "glm" engine
pois_spec <- poisson_reg() %>%
  set_mode("regression") %>%
  set_engine("glm")

# Reuse the same preprocessing recipe as the linear model
pois_rec_spec <- lm_rec_spec

workflow() %>%
  add_model(pois_spec) %>%
  add_recipe(pois_rec_spec) %>%
  last_fit(split = BikeshareSpit) %>%
  collect_metrics()
```
## Logistic Regression
It models the **probability** $p(X) = Pr(Y=1|X)$ that $Y$ belongs to a particular category given some predictors, by assuming that $Y$ follows a **Bernoulli Distribution**. The model computes the probability using the ***logistic function***, which produces an S shape between 0 and 1:
$$
p(X) = \frac{e^{\beta_{0}+\beta_{1}X}}
{1+e^{\beta_{0}+\beta_{1}X}}
$$
![](img/09-logistic-function-example.png){fig-align="center"}
As the function returns probabilities, it is the analyst's responsibility to define a **threshold** to make classifications.
### Estimating coefficients
To estimate $\hat{\beta}_{0}$ and $\hat{\beta}_{1}$ the method used is called *maximum likelihood*, which consists of maximizing the **likelihood function**. It is worth noting that the *least squares approach* is in fact a special case of maximum likelihood.
$$
\ell(\beta_{0}, \beta_{1}) = \prod_{i:y_{i} = 1}p(x_{i})\prod_{i':y_{i'} = 0}(1-p(x_{i'}))
$$
### Multiple regression
We can also generalize the *logistic function*, as you can see below.
$$
p(X) = \frac{e^{\beta_{0}+\beta_{1}X_{1}+\dots+\beta_{p}X_{p}}}
{1+e^{\beta_{0}+\beta_{1}X_{1}+\dots+\beta_{p}X_{p}}}
$$
### Interpreting the model
To understand how each variable influences the probability $p(X)$, we manipulate the *logistic function* until we have a linear combination on the right-hand side.
$$
\underbrace{ \log{ \left( \overbrace{\frac{p(X)}{1 - p(X)}}^\text{odds ratio} \right)} }_\text{log odds or logit} = \beta_{0}+\beta_{1}X
$$
As we can see, the result of the linear combination is the $\log$ of the *odds ratio*, known as the **log odds** or **logit**.
The **odds ratio** of an event presents the likelihood that the event will occur as a proportion of the likelihood that the event won't occur. It can take any value between $0$ and $\infty$: low probabilities are close to $0$, high probabilities are close to $\infty$, and an even chance equals 1. For example, if we have $\text{odds ratio} = 2$, we can say that the event is 2 times more likely to happen than not.
Applying $\log{(\text{odds ratio})}$ makes it easier to compare the effect of variables: values below 1 become negative on a symmetric scale, and a non-significant odds ratio of 1 becomes 0. To get an idea, an odds ratio of 2 has the same size of effect as one of 0.5 (just in the opposite direction), which is hard to see at first glance, but after applying the $\log$ we can see that $\log{(2)} = 0.69$ and $\log{(0.5)} = -0.69$.
In the end, $p(X)$ will increase as $X$ increases whenever $\beta_{1}$ is positive, even though the relationship between them is not a linear one (a short sketch follows).
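A minimal sketch of the log-odds, odds-ratio and probability scales in base R (using `mtcars` only as an illustration, not part of the original example):
```r
# Sketch: a simple logistic fit and its three interpretation scales
fit <- glm(am ~ wt, data = mtcars, family = binomial)

coef(fit)                                                      # log odds (logit) scale
exp(coef(fit))                                                 # odds-ratio scale
predict(fit, newdata = data.frame(wt = 3), type = "response")  # probability scale
```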
#### Understanding a confounding paradox
|Simple Regression|Multiple Regression|
|:---------------|:-------------------|
|![](img/10-Default-table-4_2.png){fig-align="center"}|![](img/11-Default-table-4_3.png){fig-align="center"}|
|The positive coefficient for student indicates that for **over all values of balance and income**, a student is *more* likely to default than a non-student.|The negative coefficient for student indicates that for a **fixed value of balance and income**, a student is less likely to default than a non-student.|
![](img/12-Default-multiple-plot-4_3.png){fig-align="center" width=60% height=60%}
The problem lies in the fact that *student* and *balance* are **correlated**. As a consequence, a student is riskier than a non-student if no information about the student's credit card balance is available. However, that student is less risky than a non-student with the *same credit card balance*!
![](img/13-Default-simple-plot-4_3.png){fig-align="center" width=60% height=60%}
#### Multinomial Logistic Regression
We can also generalize the *logistic function* to support more than 2 categories ($K > 2$) by defining, by convention, the last category $K$ as the **baseline**.
For $k = 1, \dotsc, K-1$ we use the function:
$$
Pr(Y = k|X= x) = \frac{e^{\beta_{k0}+\beta_{k1}x_{1}+\dots+\beta_{kp}x_{p}}}
{1+\sum_{l=1}^{K-1}e^{\beta_{l0}+\beta_{l1}x_{1}+\dots+\beta_{lp}x_{p}}}
$$
For $k=K$, we use the function:
$$
Pr(Y = K|X= x) = \frac{1}
{1+\sum_{l=1}^{K-1}e^{\beta_{l0}+\beta_{l1}x_{1}+\dots+\beta_{lp}x_{p}}}