Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Iceberg]Support non-identity partition predicate for thoroughly push down #21999

Merged

Conversation

hantangwangd
Copy link
Member

@hantangwangd hantangwangd commented Feb 24, 2024

Description

This PR extends Iceberg capabilities to pushdown thoroughly predicates on partitioning columns. Besides pushdown thoroughly predicates on identity partitioning columns, it also pushdown thoroughly predicates if they align with partitioning boundaries.

It is also helpful for DELETE, as it allows to do metadata-only delete in more cases, where we don't really needing to do a row-level delete.

Test Plan

  • Added test cases for pushing down predicates on columns of various partition transforms

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== NO RELEASE NOTE ==

@hantangwangd hantangwangd requested a review from a team as a code owner February 24, 2024 17:25
@hantangwangd hantangwangd marked this pull request as draft February 24, 2024 18:21
@hantangwangd
Copy link
Member Author

Convert to draft as this PR depending on #21959. So it should be reviewed after #21959 is merged.

@hantangwangd hantangwangd changed the title Support non-identity partition predicate for thoroughly push down [Iceberg]Support non-identity partition predicate for thoroughly push down Feb 25, 2024
@tdcmeehan
Copy link
Contributor

Is this a cherry pick? At first glance, I don't see substantial similarities with the referenced PR.

@hantangwangd
Copy link
Member Author

Is this a cherry pick? At first glance, I don't see substantial similarities with the referenced PR.

I referenced the implementation of trino, which may evolve gradually through a series of PRs. Is it suitable to be described as a "cherry-pick"?

@tdcmeehan
Copy link
Contributor

Based on my reading of the code, this wouldn't qualify as a cherry pick as it's substantially different, and even the way it's performed is different between the two systems. I think we can remove the cherry pick reference.

@hantangwangd
Copy link
Member Author

Based on my reading of the code, this wouldn't qualify as a cherry pick as it's substantially different, and even the way it's performed is different between the two systems. I think we can remove the cherry pick reference.

OK, fixed! Thanks for the suggestion.

@hantangwangd hantangwangd force-pushed the support_non_identity_column_pushdown branch from bb7daeb to 90f6f9d Compare February 29, 2024 01:44
@hantangwangd hantangwangd marked this pull request as ready for review February 29, 2024 02:23
@tdcmeehan tdcmeehan self-assigned this Mar 6, 2024
Copy link
Contributor

@tdcmeehan tdcmeehan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some initial comments, I'll need more time to review this.

Copy link
Contributor

@tdcmeehan tdcmeehan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of making changes directly to Type, can we just have a new utility method in the Iceberg connector which takes in the Type, the Object, and from those two things infers the primtive type of the Object from the SQL Type? The helper methods on Type don't directly relate to Block usage, which is what we expect for Types.

@hantangwangd
Copy link
Member Author

hantangwangd commented Mar 6, 2024

Instead of making changes directly to Type, can we just have a new utility method in the Iceberg connector which takes in the Type, the Object, and from those two things infers the primtive type of the Object from the SQL Type? The helper methods on Type don't directly relate to Block usage, which is what we expect for Types.

Very reasonable, I agree that we should not modify common Type just for some specific uses.

Should we move the method getPreviousValue and getNextValue to a utility class as well? And should we place the utility methods in iceberg connector or common TypeUtils? If we place them in iceberg connector, then should we just need to support the types needed by iceberg's truncate/year/month/day/hour transform?

@hantangwangd hantangwangd force-pushed the support_non_identity_column_pushdown branch 3 times, most recently from 658b0f1 to 18366d6 Compare March 7, 2024 05:30
@tdcmeehan
Copy link
Contributor

Instead of making changes directly to Type, can we just have a new utility method in the Iceberg connector which takes in the Type, the Object, and from those two things infers the primtive type of the Object from the SQL Type? The helper methods on Type don't directly relate to Block usage, which is what we expect for Types.

Very reasonable, I agree that we should not modify common Type just for some specific uses.

Should we move the method getPreviousValue and getNextValue to a utility class as well? And should we place the utility methods in iceberg connector or common TypeUtils? If we place them in iceberg connector, then should we just need to support the types needed by iceberg's truncate/year/month/day/hour transform?

I think we can start in the Iceberg connector, and extract if or when an additional use case comes along.

@hantangwangd
Copy link
Member Author

I think we can start in the Iceberg connector, and extract if or when an additional use case comes along.

Good idea! Have done this, please take a look, thanks a lot!

@hantangwangd hantangwangd requested a review from tdcmeehan March 8, 2024 18:23
@hantangwangd hantangwangd force-pushed the support_non_identity_column_pushdown branch from 18366d6 to 1873dc3 Compare March 10, 2024 02:31
Copy link
Contributor

@ZacBlanco ZacBlanco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff! The core changes LGTM. Mainly a few nits on the tests

@hantangwangd hantangwangd force-pushed the support_non_identity_column_pushdown branch from 1873dc3 to faee439 Compare March 20, 2024 03:20
@hantangwangd
Copy link
Member Author

hantangwangd commented Mar 20, 2024

Hi @ZacBlanco, I have combined getPreviousValue() and getNextValue() to a single function getAdjacentValue(), and used anyNot(...) to extract functions thoroughlyPushdown(...) and notThoroughlyPushdown(...) which could have a more clearer meaning. Then refactored test cases based on these functions, and moved the left-unwrap TODOs to separate @Ignore test functions. Please take a look when convenient, thanks!

ZacBlanco
ZacBlanco previously approved these changes Mar 26, 2024
tdcmeehan
tdcmeehan previously approved these changes Apr 12, 2024
@hantangwangd hantangwangd dismissed stale reviews from tdcmeehan and ZacBlanco via 198d79f April 12, 2024 01:26
@hantangwangd hantangwangd force-pushed the support_non_identity_column_pushdown branch from faee439 to 198d79f Compare April 12, 2024 01:26
@hantangwangd
Copy link
Member Author

@tdcmeehan @ZacBlanco just resolved the conflict in IcebergPlanOptimizer.

ZacBlanco
ZacBlanco previously approved these changes Apr 12, 2024
@ZacBlanco
Copy link
Contributor

Don't forget to squash your commits

@hantangwangd
Copy link
Member Author

Don't forget to squash your commits

OK, done!

@hantangwangd hantangwangd merged commit b9d9de1 into prestodb:master Apr 12, 2024
56 checks passed
@hantangwangd hantangwangd deleted the support_non_identity_column_pushdown branch April 12, 2024 18:15
@@ -172,11 +247,21 @@ private static Block bucketInteger(Block block, int count)
return bucketBlock(block, count, position -> bucketHash(INTEGER.getLong(block, position)));
}

private static int bucketValueInteger(Block block, int position, int count)
{
return bucketValue(block, count, position, pos -> bucketHash(INTEGER.getLong(block, pos)));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and for some of the other newly added methods: Is it intended that this calls bucketValue(..., count, position, ...) even though the signature of the method is bucketValue(..., int position, int count, ...) (that is, the arguments are switched)?

Sorry for this drive-by and late review comment, hopefully it is useful though. I wasn't sure whether a separate issue or discussion was justified.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out. I think in some cases the code will still work but bucketing will technically be incorrect. We should switch the arguments. I'll open a PR later today to fix this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch @Marcono1234, thanks for pointing out this, seems we didn't have sufficient test cases to cover these codes. So perhaps we need to supplement some test cases that can cover these code logics in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#24309 fixes this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants