Data Leakage when normalize the train data and test data together? #14

Open
howie1013 opened this issue Feb 20, 2022 · 2 comments

@howie1013

howie1013 commented Feb 20, 2022

Isn't there data leakage when the train data and test data are normalized together?

dataset = pd.read_csv('Finaldata_with_Fourier.csv', parse_dates=['Date'])
...
y_value = pd.DataFrame(dataset.iloc[:, 3])
# The scaler is fitted on the full series here, before the train/test split below
y_scaler = MinMaxScaler(feature_range=(-1, 1))
y_scaler.fit(y_value)
y_scale_dataset = y_scaler.fit_transform(y_value)
X, y, yc = get_X_y(X_scale_dataset, y_scale_dataset)
y_train, y_test, = split_train_test(y)
yc_train, yc_test, = split_train_test(yc)
@windowshopr

Yep

@nova-land

The data leakage in this project is serious. I wonder how the academic peer reviewers accepted the paper...

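import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Chronological split: the first 70% of rows form the training set
train_size = round(len(dataset) * 0.7)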
print(f'Training Data Size: {train_size}')
train_data = dataset[0:train_size]
test_data = dataset[train_size:]

X_train = pd.DataFrame(train_data)
X_test = pd.DataFrame(test_data)
y_train = pd.DataFrame(train_data['Close'])
y_test = pd.DataFrame(test_data['Close'])

# Fit the scalers on the training data only,
# then apply the already-fitted scalers to the test data
X_scaler = MinMaxScaler(feature_range=(-1, 1))
y_scaler = MinMaxScaler(feature_range=(-1, 1))
X_train = X_scaler.fit_transform(X_train)
y_train = y_scaler.fit_transform(y_train)
X_test = X_scaler.transform(X_test)
y_test = y_scaler.transform(y_test)

This snippet generates the normalised data without data leakage.
After this correction, however, the scaler will probably not work well, which makes the model useless.

The prices in the testing period are far higher than those in the training period, so common methods such as z-score or MinMax scaling will not be useful: the test values fall outside the range the scaler was fitted on.
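
To make the range problem concrete, here is a small dependency-free sketch. The prices are made up (not taken from Finaldata_with_Fourier.csv), and `fit_min_max` and `scale` are hypothetical helpers that mirror the min-max formula with feature_range=(-1, 1); they are not part of the project.

```python
def fit_min_max(values):
    """Return the (min, max) of the values the scaler is fitted on."""
    return min(values), max(values)


def scale(values, lo, hi, feature_range=(-1.0, 1.0)):
    """Map [lo, hi] linearly onto feature_range; values outside [lo, hi] are not clipped."""
    a, b = feature_range
    return [a + (v - lo) * (b - a) / (hi - lo) for v in values]


# Hypothetical prices: the test period trades well above the training period.
train_prices = [10.0, 12.0, 15.0, 20.0]
test_prices = [25.0, 30.0, 40.0]

lo, hi = fit_min_max(train_prices)          # fitted on the training data only
train_scaled = scale(train_prices, lo, hi)  # stays inside [-1, 1]
test_scaled = scale(test_prices, lo, hi)    # every value lands above 1

print(train_scaled)
print(test_scaled)
```

A model trained only on inputs and targets in [-1, 1] then has to extrapolate to values it never saw, which is why the leak-free scaling alone does not rescue the model.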

So the project needs either adaptive normalisation or a different model target (y_value), such as the percentage change or a trend classification.
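
As a rough illustration of the percentage-change option, the sketch below uses made-up prices and a hypothetical helper `pct_change` (pandas offers `Series.pct_change` for the same purpose). It only shows that returns can share a range across periods even when price levels do not; it is not a claim about how the actual dataset behaves.

```python
def pct_change(prices):
    """Relative change between consecutive prices: (p[t] - p[t-1]) / p[t-1]."""
    return [(cur - prev) / prev for prev, cur in zip(prices, prices[1:])]


# Hypothetical prices: the test period trades at twice the training level.
train_prices = [100.0, 110.0, 99.0, 108.9]
test_prices = [200.0, 220.0, 198.0, 217.8]

# Each series yields one value fewer than it has prices,
# because the first price has no predecessor.
train_target = pct_change(train_prices)
test_target = pct_change(test_prices)

print(train_target)
print(test_target)
```

With a target like this, a scaler fitted on the training returns is far less likely to be pushed out of range by the test period, although the modelling problem itself changes from predicting a price to predicting a change.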
