Commit 459d156 (1 parent: 95c090b)

* roadmap post december 2024
* fix yml spacing
* fix code snippet format

Showing 1 changed file with 207 additions and 0 deletions.

@@ -0,0 +1,207 @@
# dbt: Play On (December 2024)

Oh hi there :)

We’ve had the opportunity to talk a lot about our investments in open source over the past few months:

- [I [Grace] renewed my vows with the dbt Community in a Vegas wedding ceremony officiated by Elvis](https://youtu.be/DC9sbZBYzpI?si=uK9Aie6Jl-FIHggm) (aka [@jtcohen6](https://github.com/jtcohen6)).
- [We spoke about dbt Labs’ continued commitment to Open Source in the Coalesce Keynote](https://youtu.be/I72yUtrmhbY?si=Iu6s9WXdHnFtyCvi).
- [We wrote about the importance of extensibility for how we build features in dbt Core.](https://www.getdbt.com/blog/dbt-core-v1-9-is-ga#:~:text=Extensibility%20is%20what%20powers%20the%20community)

All of these points and themes remain incredibly relevant to our team and are fueling our vision as we prepare for the new year:

- we're committed to defending dbt Core as the open source standard for data transformation, which will remain licensed under Apache 2.0
- the dbt framework will continue to be shaped by a collaborative effort between you (the community) and us (the maintainers)
- when we add something new to the standard, we are committing to the long term, which means we are intentional about *how* and *when* we expand its breadth - in the meantime, lean into dbt's extensibility (and show us how you're doing it!)

Now that we’ve gotten into our new rhythm…

Now that last year’s focus on stability has earned us the right to ship awesome new additions to the dbt framework…

Now that we’re actually seeing a *much* higher % of projects running the latest & greatest dbt…

…the next 6-12 months will feel similar to the last. dbt will keep getting better at doing what it does - being a mission-critical piece of your data stack, and a delightful part of your work. To that end, stability always comes first. And, we will be shipping some exciting new features to dbt Core.

# Oh What A Year It’s Been!

This year, we released two new minor versions of dbt Core.

| **Version** | **When** | **Namesake** | **Stuff** |
| --- | --- | --- | --- |
| [v1.8](https://docs.getdbt.com/docs/dbt-versions/core-upgrade/upgrading-to-v1.8) | May | [Julian Abele](https://github.com/dbt-labs/dbt-core/releases/tag/v1.8.0) | Unit testing. Decoupling of dbt-core and adapters. Flags for managing changes to legacy behaviors. |
| [v1.9](https://docs.getdbt.com/docs/dbt-versions/core-upgrade/upgrading-to-v1.9) | December | [Dr. Susan La Flesche Picotte](https://github.com/dbt-labs/dbt-core/releases/tag/v1.9.0) | Microbatch incremental strategy. New configurations and spec for snapshots. Standardizing support for Iceberg. |

In the [last roadmap post](https://github.com/dbt-labs/dbt-core/blob/main/docs/roadmap/2023-11-dbt-tng.md), we committed to prioritizing all-around interface stability. This included decoupling dbt-core and adapters, introducing [behavior change flags](https://docs.getdbt.com/reference/global-configs/behavior-changes) to give you time to adjust to ~~breaking~~ changes, and improving the stability of metadata artifacts. You can read more about those efforts [here](https://www.notion.so/E-team-offsite-competitive-deep-dive-SQLMesh-SDF-November-2024-129bb38ebda7807ca023ddf198fb1279?pvs=21).

Stability means you can upgrade with confidence.

Stability means fewer disruptions for our adapters and integrations in the dbt ecosystem.

Stability means we’ve earned the right to **ship some big new features**.

This past year, we were able to ship some long-awaited additions and enhancements to the dbt Core framework:

- [**Unit tests**](https://github.com/dbt-labs/dbt-core/discussions/8275) allow you to validate your SQL modeling logic on a small set of static inputs before you materialize your full model in production.
- [**Snapshots**](https://github.com/dbt-labs/dbt-core/discussions/7018) got the glow up they deserved - with new configurations and spec to make capturing your data changes easier to configure, run, and customize.
- [**Microbatch**](https://github.com/dbt-labs/dbt-core/discussions/10672) incremental models enable you to optimize your largest datasets, by transforming your timeseries data in discrete periods with their own SQL queries, rather than all at once.
- [**Iceberg**](https://docs.getdbt.com/blog/icebeg-is-an-implementation-detail) table format (an open standard for storing data and accessing metadata) is standardized across adapters, enabling you to store your data in a way that promises interoperability with multiple compute engines.

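To make the microbatch idea concrete, here is a minimal Python sketch of splitting a time range into daily batches and building one filtered query per batch. It is an illustration only, not dbt's implementation: the helper names `daily_batches` and `batch_query` are hypothetical, and the relation `my_db.my_schema.fct_orders` with its `order_at` column is borrowed from the sample-mode example later in this post.

```python
from datetime import date, timedelta


def daily_batches(start: date, end: date):
    """Yield (batch_start, batch_end) pairs covering [start, end), one day each."""
    current = start
    while current < end:
        following = current + timedelta(days=1)
        yield current, following
        current = following


def batch_query(relation: str, event_time: str, batch_start: date, batch_end: date) -> str:
    """Build a query restricted to a single batch window (hypothetical helper)."""
    return (
        f"select * from {relation} "
        f"where {event_time} >= '{batch_start.isoformat()}' "
        f"and {event_time} < '{batch_end.isoformat()}'"
    )


# Three days of data become three independent queries, one per day.
batches = list(daily_batches(date(2024, 12, 1), date(2024, 12, 4)))
queries = [
    batch_query("my_db.my_schema.fct_orders", "order_at", start, end)
    for start, end in batches
]
```
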
We were also able to close out some highly-upvoted “paper cuts”:

- `--empty` flag limits the `ref`s and `source`s to zero rows, which you can use for schema-only dry runs that validate your model SQL and run unit tests
- dbt now issues a single (batch) query when calculating source freshness through metadata, instead of executing a query per source
- improvements to `state:modified` help reduce the risk of false positives due to environment-aware logic
- you can now document your data tests with `description`s

A lot of these features are things that you (the community) have been discussing and experimenting with for years. To everyone who opened an issue, commented on a discussion, joined us for a feedback session, developed a package, or contributed to our code base… however you made your voice heard this year, **thank you** for continuing to care, for continuing to lean in, for continuing to help shape dbt.

# New Year’s Resolutions

To start off the new year, we’re focusing on three major areas of development to the dbt framework:

- **typed macros** - configure Python type annotations for better macro validation
- **catalogs** - first-class support for materializing dbt models into external catalogs, providing a warehouse-agnostic interface for managing data in object storage
- **sample mode** - limit your data to smaller, time-based samples for faster development and CI testing

## Typed Macros

In the simplest terms, [macros](https://docs.getdbt.com/docs/build/jinja-macros#macros) are pieces of code, written in Jinja, that can be reused throughout your project; they are similar to "functions" in other programming languages.

In practice, you can reference a macro in a model’s SQL (or config block, or hook) to:

- make your SQL code more DRY by abstracting snippets of SQL into reusable “functions”
- use the results of one query to generate a set of logic
- change the way your project builds based on the current environment

Or, if you really need to, run arbitrary SQL in your warehouse via `dbt run-operation`.

Macros can depend on inputs, vars, `env_vars`, or even other macros, *and* there are “special” macros for defining custom generic tests and materializations.

This immense flexibility is one of the great benefits of macros: you can use them to solve **a lot** of different problems.

However, on the flip side, this flexibility (macros can be whatever you want them to be) makes it challenging to validate that your macros are doing what you expect.

Without a way to define expected types for the inputs and outputs of macros, our adapter maintainers struggle to validate that a built-in macro override will produce the correct output. Furthermore, *any* analytics engineer writing a macro in dbt should be able to validate the expected behavior.

One of the things we aim to work on in the coming months is an interface for **typed macros.**

Imagine being able to codify the expected types* for the inputs and outputs of your macros:

```sql
{# Proposed syntax only: typed arguments, with the return type after the argument list #}
{% macro cents_to_dollars(column_name: str, scale: int = 2) -> str %}
    ({{ column_name }} / 100)::numeric(16, {{ scale }})
{% endmacro %}
```

**Note: These are a subset of Python types (internal to dbt), not the data types within the data warehouse.*

Then, we could issue warnings at parse time when usage of a macro violates these types. By adding the ability to configure type expectations, user-created macros become more predictable, and [built-in macros become easier to override](https://github.com/dbt-labs/dbt-core/issues/9164).

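As a rough illustration of what a parse-time check might look like, the Python sketch below validates call arguments against annotations and collects warnings instead of raising. Everything here is an assumption made for illustration: `check_call` is a hypothetical helper, the macro is modeled as a plain Python function, and only simple types such as `str` and `int` are handled. It says nothing about how dbt will actually implement typed macros.

```python
import inspect
from typing import get_type_hints


def cents_to_dollars(column_name: str, scale: int = 2) -> str:
    """Python stand-in for the Jinja macro, used only to carry the annotations."""
    return f"({column_name} / 100)::numeric(16, {scale})"


def check_call(func, *args, **kwargs) -> list:
    """Return one warning per argument whose value does not match its annotation.

    Hypothetical helper: handles plain classes (str, int, ...) only,
    not generics such as list[str].
    """
    hints = get_type_hints(func)
    bound = inspect.signature(func).bind(*args, **kwargs)
    bound.apply_defaults()
    warnings = []
    for name, value in bound.arguments.items():
        expected = hints.get(name)
        if expected is not None and not isinstance(value, expected):
            warnings.append(
                f"{func.__name__}: argument '{name}' expected {expected.__name__}, "
                f"got {type(value).__name__}"
            )
    return warnings


# A well-typed call produces no warnings; passing scale as a string produces one.
ok = check_call(cents_to_dollars, "amount_cents", scale=4)
bad = check_call(cents_to_dollars, "amount_cents", scale="4")
```
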
Should this type checking be on by default, or something you opt into? Should we also support dbt-specific types, such as `Relation`? Head over to the [GitHub discussion](https://github.com/dbt-labs/dbt-core/discussions/11158) to participate in the conversation!

## Catalogs

In `v1.9`, we shipped a set of standard configs to materialize dbt models in Iceberg table format:

```sql
{{
    config(
        materialized = "table",
        table_format = "iceberg",
        external_volume = "s3_iceberg_snow"
    )
}}

...
```

Supporting the Iceberg table format was our first step towards empowering users to adopt Iceberg as a standard storage format for their critical datasets.

In the coming months, we want to add first-class support for “catalogs” in dbt. Catalogs, such as Glue or Iceberg REST, operate at a level of abstraction above specific Iceberg tables, and they can provide a warehouse-agnostic interface for managing a large number of datasets in object storage.

Imagine a new top-level `catalogs.yml` that tells dbt about the catalog integrations you want to write to:

```yaml
catalogs:
  - catalog_name: my_first_catalog
    write_integrations:
      - integration_name: prod_glue_write_integration
        external_volume: my_prod_external_volume
        table_format: iceberg
        catalog_type: glue
```

Then, in your model’s configuration, simply specify the `catalog` field:

```sql
{{
    config(
        catalog = "my_first_catalog"
    )
}}
...
```

These new configurations will enable dbt to template the correct DDL statements for the platform (`CREATE GLUE ICEBERG TABLE` vs. `CREATE ICEBERG TABLE`, etc.). Now, when you run the above model, it will be materialized as an Iceberg table in S3 and registered in the AWS Glue catalog.

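One way to picture that templating step is the Python sketch below, which looks up a model's `catalog` in a parsed `catalogs.yml` and picks the DDL prefix from the integration's `catalog_type`. The function name `ddl_prefix_for`, the choice of the first write integration, and the fallback to `CREATE ICEBERG TABLE` for non-Glue catalog types are all assumptions; the real spec is still under discussion.

```python
# The catalogs.yml example from this post, as it might look once parsed into Python.
CATALOGS = {
    "catalogs": [
        {
            "catalog_name": "my_first_catalog",
            "write_integrations": [
                {
                    "integration_name": "prod_glue_write_integration",
                    "external_volume": "my_prod_external_volume",
                    "table_format": "iceberg",
                    "catalog_type": "glue",
                }
            ],
        }
    ]
}


def ddl_prefix_for(model_catalog: str, catalogs: dict) -> str:
    """Pick a DDL prefix for a model based on its configured catalog.

    Hypothetical helper: uses the first write integration and treats
    every non-Glue catalog type as plain `CREATE ICEBERG TABLE`.
    """
    for catalog in catalogs["catalogs"]:
        if catalog["catalog_name"] == model_catalog:
            integration = catalog["write_integrations"][0]
            if integration["catalog_type"] == "glue":
                return "CREATE GLUE ICEBERG TABLE"
            return "CREATE ICEBERG TABLE"
    raise KeyError(f"unknown catalog: {model_catalog}")


prefix = ddl_prefix_for("my_first_catalog", CATALOGS)
```
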
We believe the approach of writing datasets to a platform-agnostic storage layer, and registering those datasets with a similarly agnostic *catalog service,* will become important to the technical foundations of the dbt workflow, the [Analytics Development Lifecycle (ADLC)](https://www.getdbt.com/resources/guides/the-analytics-development-lifecycle), moving forward. Support for materializing Iceberg tables in external catalogs is another step down this path.

To read more about catalogs and participate in shaping this feature, head over to the [discussion on GitHub](https://github.com/dbt-labs/dbt-core/discussions/11171)!

## Sample Mode

We began our [“event_time” journey](https://github.com/dbt-labs/dbt-core/discussions/10672) with microbatch incremental models. Next up is [sample mode](https://github.com/dbt-labs/dbt-core/discussions/10672#:~:text=%F0%9F%8C%80%5BNext%5D%20%E2%80%9CSample%E2%80%9D%20mode%20for%20dev%20%26%20CI%20runs) - we want to support a pattern for speeding up development and CI testing by filtering your dataset to a *time-limited sample.*

Imagine your model contains a `ref` statement like so:

```sql
select * from {{ ref('fct_orders') }}
```

During standard runs, this compiles to:

```sql
select * from my_db.my_schema.fct_orders
```

But during a *sample* run, dbt could automatically filter down large tables to the last X days of data using the same `event_time` column used for microbatch. The exact syntax here is TBD, but imagine you run something like `dbt run --sample 3`, which then compiles your code to:

```sql
select * from my_db.my_schema.fct_orders
-- the event_time column in fct_orders is 'order_at'
where order_at > dateadd(day, -3, current_date)
```

No more [overriding the `source` or `ref` macro](https://discourse.getdbt.com/t/limiting-dev-runs-with-a-dynamic-date-range/508) to hack together this functionality.

Built-in support for “sample mode” (filtering to a consistent time-based “slice” across all models) means faster development and faster testing because you’re running on *less* data. This could be configured for a specific dbt invocation, a CI environment, or as a set-it-and-forget-it default for everyone who’s developing on your team’s project.

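A minimal Python sketch of how a `ref` could resolve differently in sample mode follows. `render_ref` and its parameters are hypothetical, the `dateadd` call uses Snowflake-style argument order, and wrapping the filter in a subquery is just one possible design choice (it keeps the filter attached to the relation wherever the `ref` appears in a query), not a description of the final feature.

```python
from typing import Optional


def render_ref(
    relation: str,
    event_time: Optional[str] = None,
    sample_days: Optional[int] = None,
) -> str:
    """Render a relation name, optionally limited to the last `sample_days` days.

    Hypothetical helper: relations without an event_time column, and runs
    without a sample window, fall back to the plain relation name.
    """
    if event_time is None or sample_days is None:
        return relation
    return (
        f"(select * from {relation} "
        f"where {event_time} > dateadd(day, -{sample_days}, current_date))"
    )


# Standard run: the ref resolves to the bare relation.
standard = render_ref("my_db.my_schema.fct_orders")

# Sample run over the last 3 days, using 'order_at' as the event_time column.
sampled = render_ref("my_db.my_schema.fct_orders", event_time="order_at", sample_days=3)
```
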
We’ll be opening an additional sample-mode-specific GitHub discussion in January, but feel free to queue up your thoughts [here](https://github.com/dbt-labs/dbt-core/discussions/10672) or [here](https://github.com/dbt-labs/dbt-core/issues/8378) in the meantime!

## and who could forget, paper cuts!

Those are the big things we plan to tackle in the coming months. As always, we want to tackle some smaller `paper_cut`s as well. Your upvotes and comments help us prioritize these, so please make some noise if there’s something you care a whole lot about. Some that are top of mind for me already:

- **DRY-er YML**, including:
    - [Defining vars configs outside dbt_project.yml](https://github.com/dbt-labs/dbt-core/issues/2955)
    - [Add ability to import/include YAML from other files](https://github.com/dbt-labs/dbt-core/issues/9695)
- **Enhancements to model versions and contracts**, including:
    - [Automatically create view/clone of latest version](https://github.com/dbt-labs/dbt-core/issues/7442)
    - [Support constraints independently from enforcing a full model contract](https://github.com/dbt-labs/dbt-core/issues/10195)
- and [warnings when configs are misspelled](https://github.com/dbt-labs/dbt-core/issues/8942) :)

# [**Call me**, **beep me** if you wanna reach me](https://www.youtube.com/watch?v=GIgLqN_rAXU)

If one of *your* New Year’s resolutions is to be more involved in the dbt community, here are some of the many ways you can contribute:

- open, upvote, and comment on GitHub [issues](https://docs.getdbt.com/community/resources/oss-expectations#issues)
- start a [discussion](https://docs.getdbt.com/community/resources/oss-expectations#discussions) or Discourse post when you’ve got a Big Idea
- engage in conversations in the [dbt Community Slack](https://www.getdbt.com/community/join-the-community)
- [contribute code](https://docs.getdbt.com/community/resources/oss-expectations#pull-requests) back to one of our open source repos, for one of our issues tagged `help_wanted` or `good_first_issue`, and our engineering team will work with you to get it over the finish line
- join us on Zoom for feedback sessions (which we’ll post about in Slack and in the relevant GitHub discussion)
- share your creative solutions in a blog post, at a dbt meetup, or by talking at Coalesce

Your feedback and thoughts are incredibly valuable to us, so make yourself heard this year. We’re excited to build some awesome new features together.

[Your loving wife](https://youtu.be/DC9sbZBYzpI?si=xCGWoQDK-w13Fz6U&t=1594), Grace