---
title: "Linear Discriminant Analysis"
output:
  rmarkdown::github_document:
    toc: yes
---
The purpose of this notebook is to assess the quality of the work carried out in the clustering part: the more stable our clustering is, the better we will be able to classify observations into the clusters.
Moreover, an LDA also lets us focus on the variables that best explain the separation between clusters.
# My functions
```{r}
# helper functions for multiclass evaluation (accuracy_rate, F1_rate)
source("C:/Users/u32118508/OneDrive - UPEC/Bureau/Machine_learning_journey/A_journey_in_Machine_Learning/00_functions_multiclass.R")
```
# Libraries
```{r}
library(tidyverse) # data manipulation and visualization
library(caret)     # machine learning workflow (confusionMatrix)
library(MASS)      # Modern Applied Statistics (lda)
library(readxl)    # load the Excel data
library(klaR)      # LDA feature selection (greedy.wilks)
```
# Data
```{r}
data=read_excel("C:/Users/u32118508/OneDrive - UPEC/Bureau/Machine_learning_journey/A_journey_in_Machine_Learning/OUTPUT/output_tp1.xlsx")
head(data)
```
LDA does not handle categorical features directly; this is why the categorical features were transformed into binary features upstream.
We do not use full one-hot encoding, in order to avoid multicollinearity (one level per variable is dropped).
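For reference, a minimal sketch of this kind of encoding (the factor column `sex` below is purely hypothetical; the actual encoding was already done in the clustering notebook):

```{r}
# a two-level factor becomes a single binary column: dropping the reference
# level avoids the exact multicollinearity that full one-hot encoding creates
toy <- data.frame(sex = factor(c("F", "M", "F")))
model.matrix(~ sex, data = toy)[, -1, drop = FALSE]  # keeps only "sexM"
```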
# MODELS
Split the data into training (80%) and test set (20%)
```{r}
set.seed(123)
train.size=0.8
train.index<- sample.int(dim(data)[1],round(dim(data)[1] * train.size ))
train.sample=data[train.index, ]
test.sample=data[-train.index, ]
```
## First model
Fit the model
```{r}
model <- lda(cluster~., data = train.sample)
str(model)
```
Make predictions
```{r}
train.pred<- model %>% predict(train.sample)
test.pred <- model %>% predict(test.sample)
```
How should we evaluate the model? Let's look at the accuracy, the weighted (balanced) accuracy and the micro/macro F1, since we are in a multiclass setting.
For more details: https://www.cse.iitk.ac.in/users/purushot/papers/macrof1.pdf
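The helpers `accuracy_rate()` and `F1_rate()` come from `00_functions_multiclass.R` and are not reproduced here; as a rough, purely illustrative sketch of what such multiclass metrics compute (the names below are not the actual implementation):

```{r}
# overall accuracy: share of correctly predicted observations
accuracy_sketch <- function(pred, obs) {
  mean(as.character(pred) == as.character(obs))
}

# macro F1: unweighted average of the per-class F1 scores
macro_f1_sketch <- function(pred, obs) {
  pred <- as.character(pred); obs <- as.character(obs)
  f1 <- sapply(unique(obs), function(k) {
    tp        <- sum(pred == k & obs == k)
    precision <- if (sum(pred == k) == 0) 0 else tp / sum(pred == k)
    recall    <- tp / sum(obs == k)
    if (precision + recall == 0) 0 else 2 * precision * recall / (precision + recall)
  })
  mean(f1)
}
```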
TRAIN SET
```{r}
accuracy_rate(factor(train.pred$class),factor(train.sample$cluster))
F1_rate(factor(train.pred$class),factor(train.sample$cluster))
```
TEST SET
```{r}
accuracy_rate(factor(test.pred$class),factor(test.sample$cluster))
F1_rate(factor(test.pred$class),factor(test.sample$cluster))
```
===> The model seems quite complex; the performance drops on the test data.
For more detail on the performance:
```{r}
confusionMatrix(factor(train.pred$class),factor(train.sample$cluster))
confusionMatrix(factor(test.pred$class),factor(test.sample$cluster))
```
## Nice to Know !!
1. Wilks' lambda: a MANOVA test statistic whose p-value is difficult to compute directly; a solution is Rao's transformation (the definition of the lambda itself is recalled right after this list).
Wilks' lambda is the preferred indicator for the statistical evaluation of the model. It indicates to what extent the class centers are distinct from each other in the representation space. It varies between 0 and 1: the closer to 0, the better the model.
2. We can use **Wilks' lambda** to drive model selection, because this statistic shows what happens when a feature X is removed.
3. R combines 2 approaches to LDA:
the posterior probabilities are based on the predictive (probabilistic) approach,
but the score functions (`scaling`) are computed with the geometric approach (canonical axes, as in PCA).
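For reference, the lambda used later in the from-scratch computation is the ratio of determinants

$$\Lambda = \frac{\det(W)}{\det(T)}$$

where $W$ is the pooled within-class covariance matrix and $T$ the total covariance matrix, both estimated on the training data.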
## Forward selection
Feature selection with `greedy.wilks`:
```{r}
model.forward <- greedy.wilks(cluster ~ .,data=train.sample, niveau = 0.1)
print(model.forward)
```
Absences, G3, G1, G2 and age are the most important features for separating the classes.
Let's fit the model with them:
```{r}
model.best <- lda(model.forward$formula, data = train.sample)
```
Make predictions
```{r}
train.pred2<- predict(model.best, train.sample)
test.pred2<- predict(model.best, test.sample)
```
How to evaluate the model?
```{r}
#TRAIN SET
accuracy_rate(factor(train.pred2$class),factor(train.sample$cluster))
F1_rate(factor(train.pred2$class),factor(train.sample$cluster))
```
```{r}
#TEST SET
accuracy_rate(factor(test.pred2$class),factor(test.sample$cluster))
F1_rate(factor(test.pred2$class),factor(test.sample$cluster))
```
====> This model is parsimonious.
====> Its metrics hold up on both data sets, which was not the case for the first model.
# Interpretable LDA
## From canonical (PCA-like) LDA functions to easy-to-use scores
We now construct our linear discriminant functions.
From the raw values of the features, we build linear, easy-to-implement score functions that can be used to decide which class to assign.
R gives us discriminant functions expressed on canonical (PCA-like) axes, and these are not easy to use: for a new student, we would first need his coordinates on those axes before being able to compute the scores. This is why, in this step, I derive the linear functions myself, so that the canonical coordinates are no longer required.
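As a reminder of the algebra implemented step by step below (standard LDA, with $a_j$ the columns of `scaling`, $\bar{x}$ the overall mean, $c_{kj}$ the class centroids projected on the canonical axes and $\pi_k$ the class priors):

$$d_j(x) = a_j^{\top}(x - \bar{x}), \qquad F_k(x) = \sum_j c_{kj}\, d_j(x) - \frac{1}{2}\sum_j c_{kj}^{2} + \log \pi_k$$

Expanding $d_j(x)$ shows that $F_k$ is linear in the raw features: $F_k(x) = \beta_k^{\top}x + b_k$ with $\beta_k = \sum_j c_{kj}\, a_j$, which is exactly what the chunks below compute.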
Let's first look at the canonical functions returned by R:
```{r}
print(model.best$scaling)
```
R does display the coefficients of the canonical functions, obtained from a projection into the factorial space.
However, R does not provide the intercepts, so we need to compute them ourselves.
Overall means of the selected features:
```{r}
xb <- colMeans(data.frame(train.sample[,-which(names(train.sample) == "cluster")])[,c('absences','G3','G2','G1','age')])
print(xb)
```
This is how to get the intercepts of the canonical (LD) functions:
```{r}
const_ <- apply(model.best$scaling,2,function(v){-sum(v*xb)})
print(const_)
```
Class centroids (conditional means) projected into the canonical space:
```{r}
cond_Xb <- sapply(1:ncol(model.best$scaling),function(j)
{sapply(model.best$lev,
function(niveau){sum(model.best$scaling[,j]*model.best$means[niveau,]) + const_[j]})
}
)
colnames(cond_Xb) <- paste("LD",1:ncol(model.best$scaling),sep="")
rownames(cond_Xb) <- model.best$lev
print(cond_Xb)
```
From the canonical-space functions to linear coefficients on the raw features:
```{r}
coef_ <- sapply(model.best$lev,function(niveau)
{rowSums(sapply(1:ncol(model.best$scaling),
function(j){model.best$scaling[,j]*cond_Xb[niveau,j]}))
})
print(coef_)
```
From the canonical-space quantities to the intercepts of the linear classification functions:
```{r}
intercept_ <- sapply(model.best$lev,function(niveau)
{sum(mapply(prod,const_,cond_Xb[niveau,]))-0.5*sum(cond_Xb[niveau,]^2)+log(model.best$prior[niveau])})
names(intercept_) <- levels(factor(train.sample$cluster))
print(intercept_)
```
Below are the resulting score functions:
**Cluster 1**: $$F_1 = -30.99 + 3.76*absences - 1.06*G3 + 0.27*G1 - 0.25*G2 - 0.08*age$$
**Cluster 2**: $$F_2 = 7.56 - 0.57*absences + 0.04*G3 - 0.05*G2 - 0.36*G1 - 0.18*age$$
**Cluster 3**: $$F_3 = 19.7 - 0.4*absences - 2.37*G3 - 0.24*G2 - 0.33*G1 + 0.19*age$$
**Cluster 4**: $$F_4 = -26.47 - 0.8*absences + 0.56*G3 + 0.7*G2 + 0.59*G1 + 0.11*age$$
**Cluster 5**: $$F_5 = 3.53 + 1.01*absences - 0.47*G3 - 0.14*G2 - 0.79*G1 + 0.22*age$$
**Cluster 6**: $$F_6 = -17.13 + 0.8*absences + 0.24*G3 + 0.52*G2 + 0.2*G1 - 0.07*age$$
For each given individual, one simply computes the score functions directly and assigns the class with the maximum of the 6 scores.
The scores can also be turned into posterior probabilities with a softmax:
$$P(k \mid x) = \frac{\exp(F_k)}{\sum_{i=1}^{6}\exp(F_i)}$$
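Here is a minimal sketch of how these functions can be applied (the feature values of `new_student` are invented purely for illustration; `coef_` and `intercept_` are the objects computed above):

```{r}
# hypothetical new student (values invented for illustration only)
new_student <- c(absences = 4, G3 = 12, G2 = 11, G1 = 10, age = 16)

# align the feature order with the rows of coef_
x <- new_student[rownames(coef_)]

# one linear score per cluster: F_k(x) = beta_k' x + b_k
scores <- as.vector(t(coef_) %*% x) + intercept_
names(scores) <- colnames(coef_)
print(scores)

# predicted cluster = highest score
print(names(which.max(scores)))

# softmax to turn the scores into posterior probabilities
probs <- exp(scores - max(scores))
print(round(probs / sum(probs), 3))
```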
## Compute Wilks' lambda from scratch and derive its p-value
```{r}
Xtrain=data.frame(train.sample[,-which(names(train.sample) == "cluster")])
# number of features
p <- ncol(Xtrain)
# dataset length
n <- nrow(Xtrain)
# number of clusters
K <- nlevels(factor(train.sample$cluster))
# degrees of freedom: numerator
ddlSuppNum <- K - 1
# degrees of freedom: denominator
ddlSuppDenom <- n - K - p + 1
# total covariance matrix
TOT <- cov(Xtrain)
# WITHIN - pooled within-class covariance matrix
WIT <- (1.0/(n-K)) *Reduce("+",
lapply(levels(factor(train.sample$cluster)),function(niveau)
{(sum(factor(train.sample$cluster)==niveau)-1)*cov(Xtrain[factor(train.sample$cluster)==niveau,])}))
# Wilks' lambda
# intermediate (biased, divided-by-n) matrices
WITprim <- (n-K)/n*WIT
TOTprim <- (n-1)/n*TOT
# ratio of determinants -- Wilks' lambda
LW <- det(WITprim)/det(TOTprim)
print(LW)
## the lower the value, the better
# p-value via Bartlett's chi-square approximation of Wilks' lambda
stat <- -(n - 1 - 0.5*(p + K)) * log(LW)
pvalue <- pchisq(stat, df = p*(K - 1), lower.tail = FALSE)
print(pvalue)
```
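As an optional sanity check (not part of the original workflow): base R's `manova()` reports the same Wilks statistic on these features, with a p-value based on Rao's F approximation rather than the chi-square approximation used above:

```{r}
# cross-check of Wilks' lambda with a one-way MANOVA on the same training features
fit.manova <- manova(as.matrix(Xtrain) ~ factor(train.sample$cluster))
summary(fit.manova, test = "Wilks")
```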
## Feature importance by Wilks' lambda
```{r}
# vector of F statistics (one per feature)
FTest <- numeric(p)
# vector of p-values
pvalueFTest <- numeric(p)
# loop over the features
for (j in 1:p){
  # Wilks' lambda of the model without feature j
  LWvar <- det(WITprim[-j,-j])/det(TOTprim[-j,-j])
  # partial F statistic for feature j
  FTest[j] <- ddlSuppDenom / ddlSuppNum * (LWvar/LW - 1)
  # p-value
  pvalueFTest[j] <- pf(FTest[j],ddlSuppNum,ddlSuppDenom,lower.tail=FALSE)
}
temp <- data.frame(var=colnames(Xtrain),FValue=FTest,pvalue=round(pvalueFTest,6))
print(temp)
# same table, sorted by F value
temp <- temp %>% arrange(FValue)
print(temp)
```