Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build(datasets): upgrade s3fs to newer calver #348

Conversation

MatthiasRoels
Copy link
Contributor

@MatthiasRoels MatthiasRoels commented Sep 26, 2023

Description

s3fs version is already 3+ years old. The aim of this PR is to upgrade it to their new calver versioning approach.

Fixes #337

Development notes

When s3fs switched to calver versioning, they switched from using botocore to aiobotocore, which is an async client for AWS services. As a result it became much more difficult to use moto for unit testing as many people faced issues like the one described here aio-libs/aiobotocore#755. To fix it, we had to switch to using moto in server mode.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: Matthias Roels <matthias.roels21@gmail.com>
Since s3fs is switched to aiobotocore, tests started to fail with message
`'MockRawResponse' object has no attribute 'raw_headers'`. This is a bug
in aiobotocore. Using moto server does seem to resolve the issues

Signed-off-by: Matthias Roels <matthias.roels21@gmail.com>
Signed-off-by: Matthias Roels <matthias.roels21@gmail.com>
@MatthiasRoels MatthiasRoels changed the title Build/datasets upgrade s3fs to newer calver Build(datasets) upgrade s3fs to newer calver Sep 26, 2023
@noklam noklam self-assigned this Sep 26, 2023
@noklam noklam added the Community Issue/PR opened by the open-source community label Sep 26, 2023
@noklam
Copy link
Contributor

noklam commented Sep 26, 2023

It's interesting that you need moto-server to resolve the issue. Did you get some weird authentication error? This is something I encountered in the past, with newer moto version the issue seems to resolve itself automatically.

Does it fails on all version on specific Python Version?

@MatthiasRoels
Copy link
Contributor Author

@noklam: No I didn't encounter any weird authn errors. If you don't use moto in server mode, you always run into these kind of errors:

AttributeError: 'AWSResponse' object has no attribute 'raw_headers'

For some reason, it occurs only in some test files. The tests for PandasCSVDataset run fine for example (although it's just one test involving S3). When I google the error, 2 fixes consistently appear; either you patch some boto function calls or you use moto in sever mode. The later looked like the most elegant one to fix the failing tests.

As a side note: I noticed there is a lot of inconsistency in the types of tests between different dataset classes. Is that intentional?

@noklam
Copy link
Contributor

noklam commented Sep 26, 2023

@MatthiasRoels that's fine, I have research this in the past I think server mode is the right way.

Signed-off-by: Matthias Roels <matthias.roels21@gmail.com>
Signed-off-by: Matthias Roels <matthias.roels21@gmail.com>
Signed-off-by: Matthias Roels <matthias.roels21@gmail.com>
@MatthiasRoels MatthiasRoels marked this pull request as ready for review September 27, 2023 08:02
Key=S3_KEY_PATH + "/" + video_name,
Filename=str(tmp_file),
)
with open(tmp_file, "wb") as file:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we could do tmp_file.open("wb") since it's already a pathlib object

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't knew that was possible :-). Do you want me to change it?

kedro-datasets/setup.py Outdated Show resolved Hide resolved
Comment on lines +60 to +64
def credentials():
return {
"key": "fake_access_key",
"secret": "fake_secret_key",
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why there are 2 sets of credentials?

Do we need key and secret or the AWS_XXXXX?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One is for credentials file, the other one are env vars. The env vars are required for bucket setup etc while the credentials are used in the Dataset classes.

@MatthiasRoels MatthiasRoels changed the title Build(datasets) upgrade s3fs to newer calver build: kedro-datasets - upgrade s3fs to newer calver Oct 2, 2023
Signed-off-by: Matthias Roels <matthias.roels21@gmail.com>
@MatthiasRoels MatthiasRoels changed the title build: kedro-datasets - upgrade s3fs to newer calver build(datasets): upgrade s3fs to newer calver Oct 2, 2023
@MatthiasRoels
Copy link
Contributor Author

MatthiasRoels commented Oct 12, 2023

Seems like the unit tests failed on (windows-latest, 3.8). Unfortunately, I have no access to a Windows machine for debugging and testing. And to be honest, I have never written software on a Windows OS before... So it would be nice if someone can help here. Thanks!

@merelcht
Copy link
Member

@MatthiasRoels I've tried running this on Gitpod (ubuntu) and the tests also fail there, so it's not just a windows problem. You can also see that in the builds: https://github.com/kedro-org/kedro-plugins/actions/runs/6557080300/job/17809502926?pr=348 I guess it was just the windows 3.8 that failed first.

@noklam
Copy link
Contributor

noklam commented Nov 20, 2023

Update branch to see if the problem persist

@merelcht merelcht mentioned this pull request Nov 20, 2023
4 tasks
@merelcht merelcht removed the Community Issue/PR opened by the open-source community label Nov 27, 2023
merelcht and others added 9 commits November 27, 2023 11:58
Signed-off-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
…k tests

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
…com:MatthiasRoels/kedro-plugins into build/datasets-upgrade-s3fs-to-newer-calver
@merelcht merelcht self-assigned this Nov 30, 2023
@merelcht merelcht mentioned this pull request Nov 30, 2023
4 tasks
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
@merelcht merelcht force-pushed the build/datasets-upgrade-s3fs-to-newer-calver branch from 750f31a to 402439e Compare November 30, 2023 16:20
@MatthiasRoels
Copy link
Contributor Author

Thanks @merelcht for taking over, I wanted to work on this for a while, but I’m in the middle of a cloud migration so I’m way to busy 🙁

@merelcht
Copy link
Member

merelcht commented Dec 1, 2023

Hi @MatthiasRoels no worries. I've been trying to fix the issues described by people in aio-libs/aiobotocore#755. It's not straightforward though, so I'm using another branch to experiment and not pollute this PR too much.

This was referenced Dec 4, 2023
@merelcht
Copy link
Member

merelcht commented Dec 5, 2023

@MatthiasRoels I opened #463 where I've got most tests passing now and marked the two failing ones with @pytest.mark.xfail

@merelcht
Copy link
Member

merelcht commented Dec 7, 2023

Thanks again for opening this PR @MatthiasRoels #463 is now merged, so I'll close this PR. After our 2.0.0 release on datasets we will review our mocking approach and perhaps we can use the server approach you tried.

@merelcht merelcht closed this Dec 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

s3fs dependency is more than 3 years old.
6 participants