Data Leakage when normalize the train data and test data together? #14

howie1013 · 2022-02-20T05:14:37Z

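Is there data leakage when the train data and test data are normalized together? In the code below, the scaler is fit on the full target column before the train/test split: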

dataset = pd.read_csv('Finaldata_with_Fourier.csv', parse_dates=['Date'])
...
y_value = pd.DataFrame(dataset.iloc[:, 3])
y_scaler = MinMaxScaler(feature_range=(-1, 1))
y_scaler.fit(y_value)
y_scale_dataset = y_scaler.fit_transform(y_value)
X, y, yc = get_X_y(X_scale_dataset, y_scale_dataset)
y_train, y_test, = split_train_test(y)
yc_train, yc_test, = split_train_test(yc)
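To make the concern concrete, here is a minimal sketch on synthetic data. The series and the 70/30 chronological split are assumptions for illustration, not the project's CSV. It compares the statistics a scaler learns when it is fit on the whole series with those it learns from the training period alone.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Synthetic, steadily rising "price" series (an assumption for illustration,
# not the project's data).
prices = np.linspace(100.0, 200.0, 101).reshape(-1, 1)
train_size = round(len(prices) * 0.7)  # chronological 70/30 split
train, test = prices[:train_size], prices[train_size:]

# Leaky: the scaler is fit on the whole series, including the test period.
leaky_scaler = MinMaxScaler(feature_range=(-1, 1)).fit(prices)

# Clean: the scaler is fit on the training period only.
clean_scaler = MinMaxScaler(feature_range=(-1, 1)).fit(train)

leaky_max = leaky_scaler.data_max_[0]  # maximum comes from the test period
clean_max = clean_scaler.data_max_[0]  # maximum of the training period
print(f"max seen by leaky scaler: {leaky_max}")
print(f"max seen by clean scaler: {clean_max}")
```

When the scaler is fit on everything, the training rows are scaled with a maximum that only occurs in the test period, so information about future prices reaches the model during training.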


windowshopr · 2023-01-15T04:52:10Z

Yep

nova-land · 2023-02-13T17:46:55Z

The data leakage in this project is serious. I wonder how the paper got through academic peer review...

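import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# `dataset` is the DataFrame loaded earlier; split chronologically, 70% for training
train_size = round(len(dataset) * 0.7)
print(f'Training Data Size: {train_size}')
train_data = dataset[0:train_size]
test_data = dataset[train_size:]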

X_train = pd.DataFrame(train_data)
X_test = pd.DataFrame(test_data)
y_train = pd.DataFrame(train_data['Close'])
y_test = pd.DataFrame(test_data['Close'])

# Normalize the data: fit the scalers on the training split only,
# then apply the already-fitted scalers to the test split
X_scaler = MinMaxScaler(feature_range=(-1, 1))
y_scaler = MinMaxScaler(feature_range=(-1, 1))
X_train = X_scaler.fit_transform(X_train)
y_train = y_scaler.fit_transform(y_train)
X_test = X_scaler.transform(X_test)
y_test = y_scaler.transform(y_test)

This snippet generates the normalised data without data leakage.
After this correction, though, the scaler will probably not work well, which makes the model useless.

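The prices in the testing period are far higher than those in the training period, so a common method such as z-score or MinMax scaling, once fit on the training data only, maps the test prices outside the range the model saw during training.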
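A small sketch of that effect, using made-up numbers rather than the project's prices: a `MinMaxScaler` fit on a lower-priced training period sends every later, higher price above the upper bound of `feature_range`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up prices (assumption for illustration): the test period sits
# entirely above the training period.
train = np.linspace(100.0, 170.0, 71).reshape(-1, 1)
test = np.linspace(171.0, 200.0, 30).reshape(-1, 1)

# Fit on the training period only, then reuse the fitted scaler on the test period.
scaler = MinMaxScaler(feature_range=(-1, 1))
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)

print("train range:", train_scaled.min(), train_scaled.max())
print("test range: ", test_scaled.min(), test_scaled.max())
```

The training values fill [-1, 1] exactly, while all test values land above 1, so the model would have to extrapolate beyond anything it was trained on.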

So the project needs either adaptive normalisation or a different model target (y_value), such as the percentage change in price or a trend classification.
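Roughly what I mean, as a sketch on a synthetic series. The 1% growth rate, the 20-step window, and the rolling z-score as one possible reading of "adaptive normalisation" are all my assumptions here, not something taken from the project.

```python
import numpy as np
import pandas as pd

# Synthetic close prices growing 1% per step (illustrative only).
close = pd.Series(100.0 * 1.01 ** np.arange(100), name="Close")

# Option A: use the percentage change as the target instead of the price level.
pct_change = close.pct_change().dropna()
train_size = round(len(pct_change) * 0.7)
train_target = pct_change.iloc[:train_size]
test_target = pct_change.iloc[train_size:]

# Option B: rolling z-score built from past observations only.
# shift(1) keeps the current price out of its own mean and std.
window = 20  # hypothetical window length
past = close.shift(1)
rolling_mean = past.rolling(window).mean()
rolling_std = past.rolling(window).std()
adaptive = ((close - rolling_mean) / rolling_std).dropna()

print(train_target.describe())
print(test_target.describe())
print(adaptive.describe())
```

With either option the test-period values stay in the same range as the training-period values even though the raw price keeps rising, and no statistic is computed from data later than the row being scaled.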
