Verticapy 1.0.1 - Unit test failure for three regression metrics (Aic_score, Bic_score, R2_score) #1207

okankcb · 2024-04-22T13:42:58Z

okankcb
Apr 22, 2024

Hello,

While conducting unit tests with Titanic data on Verticapy 1.0.1, I noticed a discrepancy between the AIC, BIC, and R2 scores calculated by Verticapy and those calculated using scikit-learn.

Here are the steps to reproduce the issue:

1. Use Titanic data from Verticapy:

from verticapy.datasets import load_titanic
titanic_recette = load_titanic(schema=schema, name='titanic_recette')
titanic_recette.eval("col_1", "1")
titanic_recette.cumsum(column="col_1", name="col_id")
titanic_recette.drop(columns=["col_1"])
df_titanic = titanic_recette.to_pandas()

2. Train a linear regression model with Verticapy and Python:

from verticapy import drop
from verticapy.machine_learning.vertica import LinearRegression
drop(schema + ".lr_titanic_recette22", method="model")
model = LinearRegression(name=schema + ".lr_titanic_recette22")
model.fit(titanic_recette, ["fare", "age"], "survived")
model.predict(titanic_recette, ["fare", "age"], "survived_pred")

import sklearn.linear_model as lr
df_titanic_copy = df_titanic.copy()
df_titanic_copy = df_titanic_copy.loc[df_titanic_copy["age"].notna()]
df_titanic_copy = df_titanic_copy.loc[df_titanic_copy["fare"].notna()]
reg = lr.LinearRegression().fit(df_titanic_copy[["fare", "age"]], df_titanic_copy["survived"])
df_titanic_copy["survived_pred"] = reg.predict(df_titanic_copy[["fare", "age"]])
df_titanic_copy["survived_pred"] = df_titanic_copy["survived_pred"].round(6)
df_titanic_copy["survived"] = df_titanic_copy["survived"].round(6)

3. Calculate AIC, BIC, and R2 scores with Verticapy:

import numpy as np
from verticapy.machine_learning.metrics import aic_score, bic_score, r2_score
vpy_aic = aic_score("survived", "survived_pred", titanic_recette, k=2)
vpy_bic = bic_score("survived", "survived_pred", titanic_recette, k=2)
vpy_R2score = np.round(r2_score("survived", "survived_pred", titanic_recette), 10)

4. Calculate AIC, BIC, and R2 scores with Scikit-learn:

import numpy as np
import sklearn.metrics

n = len(df_titanic_copy)
k = 3  # 2 variables + intercept
mse = sklearn.metrics.mean_squared_error(df_titanic_copy["survived"], df_titanic_copy["survived_pred"])
py_aic = n * np.log(mse) + 2 * k          # AIC penalty: 2k
py_bic = n * np.log(mse) + k * np.log(n)  # BIC penalty: k * ln(n)
py_R2score = np.round(sklearn.metrics.r2_score(df_titanic_copy["survived"], df_titanic_copy["survived_pred"]), 10)

Results obtained:

Verticapy AIC Score: -1863.62512466986
Verticapy BIC Score: -1848.30522239793
Verticapy R2 Score: 0.0813242788

Scikit-learn AIC Score: -1503.0605080304076
Scikit-learn BIC Score: -1488.3492662576539
Scikit-learn R2 Score: 0.0783248919

The AIC, BIC, and R2 scores obtained with Verticapy are significantly different from those calculated using scikit-learn.
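
For reference, the least-squares forms of these criteria are usually written AIC = n * ln(MSE) + 2k and BIC = n * ln(MSE) + k * ln(n), where k counts the estimated parameters. Below is a minimal sketch under that assumption. The helper name `gaussian_aic_bic` and the toy data are hypothetical, and a library may count k differently or add a small-sample correction, so this is not necessarily what Verticapy computes.

```python
import numpy as np


def gaussian_aic_bic(y_true, y_pred, k):
    """Return (aic, bic) for a least-squares fit with k estimated parameters.

    Uses AIC = n*ln(MSE) + 2k and BIC = n*ln(MSE) + k*ln(n), i.e. the
    Gaussian log-likelihood with additive constants dropped.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    mse = np.mean((y_true - y_pred) ** 2)
    fit_term = n * np.log(mse)  # shared goodness-of-fit term
    return fit_term + 2 * k, fit_term + k * np.log(n)


# Toy data, made up for illustration (not the Titanic values).
y = np.array([0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0])
p = np.array([0.2, 0.7, 0.1, 0.6, 0.9, 0.4, 0.8, 0.3, 0.5, 0.7])
aic, bic = gaussian_aic_bic(y, p, k=3)
```

With these definitions the two criteria differ only by their penalty terms, so BIC - AIC = k * ln(n) - 2k regardless of the data.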

With Verticapy 0.12, the same test gives identical scores between Verticapy and scikit-learn.
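
A tolerance-based comparison along these lines is one way such a unit test could be written. This is only a sketch: the helper name `scores_match` and the tolerance are hypothetical, and the two values are the R2 scores reported above.

```python
import math


def scores_match(a, b, rel_tol=1e-6):
    """Return True when two metric values agree within a relative tolerance."""
    return math.isclose(a, b, rel_tol=rel_tol)


vpy_r2 = 0.0813242788  # R2 reported for Verticapy 1.0.1
skl_r2 = 0.0783248919  # R2 reported for scikit-learn

# The gap is far larger than rounding noise, so the check fails.
r2_agrees = scores_match(vpy_r2, skl_r2)
```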

Please examine this issue and keep me informed of any updates or solutions that may be provided.

Best regards,
Okan.K

Answered by oualib

Aug 2, 2024

Solved here: #1254


oualib · 2024-06-17T15:11:47Z

oualib
Jun 17, 2024
Maintainer

Hi @okankcb, we will need to investigate why the differences are so large. I've created an issue to track this bug: #1236

0 replies

oualib · 2024-08-02T09:47:21Z

oualib
Aug 2, 2024
Maintainer

The problem is due to missing values: it seems the new version is sensitive to them. It will be fixed in the next PR.
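
To illustrate how missing values can shift a metric, here is a minimal sketch with made-up data. It assumes one particular failure mode, where the residuals skip rows without a prediction but the total sum of squares still uses every row. The thread does not say which code path was actually affected, so treat this as an illustration and not as a description of the Verticapy bug.

```python
import numpy as np

# Toy data (made up): the third row has no prediction, as happens when a
# predictor is missing for that row.
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
p = np.array([0.1, 0.8, np.nan, 0.3, 0.6])

# Consistent handling: drop incomplete rows before computing anything.
mask = ~np.isnan(p)
y_c, p_c = y[mask], p[mask]
r2_complete = 1.0 - np.sum((y_c - p_c) ** 2) / np.sum((y_c - y_c.mean()) ** 2)

# Mixed handling (assumed for illustration): residuals ignore the NaN row,
# while the mean of y and the total sum of squares use all rows.
ss_res = np.nansum((y - p) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_mixed = 1.0 - ss_res / ss_tot
```

The two R2 values differ even though the predictions are the same, which is why filtering rows with missing `age` or `fare` the same way on both sides matters for the comparison.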

1 reply

oualib Aug 2, 2024
Maintainer

Solved here: #1254

Answer selected by oualib
