feat: hilbert clustering #17045

Draft
wants to merge 5 commits into
base: main

zhyass
Member

@zhyass zhyass commented Dec 12, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

This PR refines Hilbert clustering in Databend by adopting a range-based partitioning approach. It samples the cluster keys, assigns each row a range partition ID per key, and computes a Hilbert index from those IDs for efficient pruning and clustering. Key changes include:

  1. Removing the old Hilbert clustering logic.

  2. Excluding stable segments from reclustering to preserve optimal clustering results.

ALTER TABLE t RECLUSTER is equivalent to the following query:

WITH _keys_bound AS (
  SELECT 
    range_bound(1024, 1000)(a) AS a_bound, 
    range_bound(1024, 1000)(b) AS b_bound 
  FROM 
    default.t
), 
_source_data AS (
  SELECT 
    t.*, 
    hilbert_index(
      [
        hilbert_key(cast(ifnull(range_partition_id(t.a, _keys_bound.a_bound), 1023) AS uint16)),
        hilbert_key(cast(ifnull(range_partition_id(t.b, _keys_bound.b_bound), 1023) AS uint16))
      ],
      2
    ) AS _hilbert_index 
  FROM 
    default.t, 
    _keys_bound
) 
SELECT 
  * EXCLUDE(_hilbert_index) 
FROM 
  _source_data 
ORDER BY 
  _hilbert_index
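
For readers unfamiliar with the technique, the query above can be approximated by the following Python sketch. It is an illustration only, not Databend's implementation: the helper names mirror the SQL functions but their exact semantics (boundary selection, tie handling at a boundary) are assumptions, the sketch uses the whole column as its sample, and the Hilbert mapping is the textbook 2-D `xy2d` algorithm rather than whatever `hilbert_index` does internally. NULL keys are sent to the last partition, matching the `ifnull(..., 1023)` fallback in the query.

```python
from bisect import bisect_left


def range_bounds(sample, partitions):
    """Pick up to partitions-1 boundary values from a sample (equi-depth split).

    Assumption: stands in for range_bound(...); the real sampling may differ.
    """
    ordered = sorted(sample)
    n = len(ordered)
    if n == 0:
        return []
    return [ordered[(i * n) // partitions] for i in range(1, partitions)]


def range_partition_id(value, bounds, null_id):
    """Return the index of the range that value falls into; NULLs get null_id."""
    if value is None:
        return null_id
    return bisect_left(bounds, value)


def hilbert_index_2d(x, y, order):
    """Map (x, y), each in [0, 2**order), to its distance along a Hilbert curve."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/reflect the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def hilbert_sort(rows, partitions=4):
    """Order (a, b) rows by the Hilbert index of their range partition IDs."""
    order = (partitions - 1).bit_length()  # bits needed per dimension
    null_id = partitions - 1
    a_bounds = range_bounds([a for a, _ in rows if a is not None], partitions)
    b_bounds = range_bounds([b for _, b in rows if b is not None], partitions)

    def key(row):
        a_id = range_partition_id(row[0], a_bounds, null_id)
        b_id = range_partition_id(row[1], b_bounds, null_id)
        return hilbert_index_2d(a_id, b_id, order)

    return sorted(rows, key=key)


if __name__ == "__main__":
    rows = [(1, 9), (8, 2), (3, 3), (None, 5), (7, 7), (2, 1), (9, 9), (4, 6)]
    for row in hilbert_sort(rows):
        print(row)
```

Working on partition IDs instead of raw values means each dimension is reduced to a small, evenly populated integer range before the curve is applied, so skewed or non-numeric keys do not distort the ordering.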

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

@zhyass zhyass marked this pull request as draft December 12, 2024 13:44
@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Dec 12, 2024
@zhyass zhyass force-pushed the feature_cluster_table branch 5 times, most recently from 88111a3 to 059bb42 Compare December 30, 2024 13:51
@zhyass zhyass marked this pull request as ready for review December 31, 2024 07:03
@zhyass zhyass added the ci-cloud Build docker image for cloud test label Dec 31, 2024

@zhang2014
Member

Maybe we should add a performance test for Hilbert clustering?

@zhyass zhyass added ci-cloud Build docker image for cloud test and removed ci-cloud Build docker image for cloud test labels Dec 31, 2024

@zhyass zhyass force-pushed the feature_cluster_table branch 2 times, most recently from e5ea1e6 to 595cff9 Compare January 1, 2025 09:02
@zhyass zhyass marked this pull request as draft January 1, 2025 17:09
@zhyass zhyass force-pushed the feature_cluster_table branch from 6ba61fd to c0c2183 Compare January 2, 2025 04:25
@zhyass zhyass added ci-cloud Build docker image for cloud test and removed ci-cloud Build docker image for cloud test labels Jan 2, 2025
Contributor

github-actions bot commented Jan 2, 2025

Docker Image for PR

  • tag: pr-17045-5895f17-1735794560

note: this image tag is only available for internal use,
please check the internal doc for more details.

@zhyass zhyass force-pushed the feature_cluster_table branch from c0c2183 to 2d183eb Compare January 2, 2025 05:47
@zhyass zhyass force-pushed the feature_cluster_table branch from b336e03 to ffdc058 Compare January 2, 2025 16:43
Labels
  • ci-cloud: Build docker image for cloud test
  • pr-feature: this PR introduces a new feature to the codebase
2 participants