Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Microbatch should have a weekly batch size option #11141

Open
3 tasks done
jschintz-nytimes opened this issue Dec 12, 2024 · 0 comments
Open
3 tasks done

[Feature] Microbatch should have a weekly batch size option #11141

jschintz-nytimes opened this issue Dec 12, 2024 · 0 comments
Labels
enhancement New feature or request triage

Comments

@jschintz-nytimes
Copy link

jschintz-nytimes commented Dec 12, 2024

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Add week as a batch size option
Also allow the user to set what the start day of the week is.

Describe alternatives you've considered

I hack my through the microbatch - executing a daily microbatch - but with a filter to only include Mondays.
This works fine for my daily job

  {# Custom microbatch selection for windowing #}
    {%- set microbatch_day = model.config.__dbt_internal_microbatch_event_time_start|default(modules.datetime.datetime(1899, 1, 1,0,0,0))-%}
    {% set batch_7_days_ago = ( microbatch_day - modules.datetime.timedelta(days=7)).strftime("%Y-%m-%d") %}
    {% set batch_1_days_ago = ( microbatch_day - modules.datetime.timedelta(days=1)).strftime("%Y-%m-%d") %}

with

    filter_microbatch__sessions as (

        select * from {{ ref('our__sessions').render() }}
        --override default microbatch behavior with 7 day window
        where date BETWEEN '{{ batch_7_days_ago }}' AND '{{ batch_1_days_ago }}'
            -- only run 1 day per week -- Mondays
            and EXTRACT(dayofweek FROM DATE('{{ batch_7_days_ago }}')) = 2
           )
select * from filter_microbatch__sessions

But that wastes a lot of execution time during full-backfills. And my users want to do a lot of backfills.
I probably should use smarter jinja, instead of sql, to figure out when not to execute.

Who will this benefit?

Any user of microbatch incremental strategy

This could also help with large backfills -
For the normal incremental run - Users could set an environment variable for 'day' for their daily incremental with a 3 day lookback.
But if I'm processing 3-7 years of data - shouldn't I query at the week level? Running 1K queries can take a very long time just to process 1K btache. Wouldn't 150 be much more efficient for larger windows of time?

Specifically to bigquery - what about table quota limits? Is some method necessary to reduce the number of table operations?

Are you interested in contributing this feature?

I think so

Anything else?

Microbatch has helped my org a lot - thanks for this feature!

@jschintz-nytimes jschintz-nytimes added enhancement New feature or request triage labels Dec 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request triage
Projects
None yet
Development

No branches or pull requests

1 participant