-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make MAXIMUM_SEED_SIZE_MIB configurable #11177
base: main
Are you sure you want to change the base?
Conversation
Co-authored by: Noah Holm <32292420+noppaz@users.noreply.github.com> Co-authored by: Jeremy Cohen <jeremy@dbtlabs.com>
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #11177 +/- ##
==========================================
- Coverage 88.93% 88.88% -0.05%
==========================================
Files 186 186
Lines 24054 24075 +21
==========================================
+ Hits 21392 21400 +8
- Misses 2662 2675 +13
Flags with carried forward coverage won't be shown. Click here to find out more.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎩 Given a my_small_seed.csv
(<1 MB) and my_large_seed.csv
(~3 MB)
Using dbt-core@main
:
% dbt parse && mv target/manifest.json state && dbt ls -s state:modified --state state
16:29:42 Running with dbt=1.10.0-a1
16:29:42 Registered adapter: duckdb=1.9.1
16:29:42 Performance info: /Users/jerco/dev/scratch/testy/target/perf_info.json
16:29:43 Running with dbt=1.10.0-a1
16:29:43 Registered adapter: duckdb=1.9.1
16:29:43 Found 1 operation, 1 model, 2 seeds, 424 macros
16:29:43 Found a seed (testy.my_large_seed) >1MB in size at the same path, dbt cannot tell if it has changed: assuming they are the same
16:29:43 The selection criterion 'state:modified' does not match any enabled nodes
16:29:43 No nodes selected!
Switching to dbt-core@jerco/redo-pr-7125
(without recreating state/manifest.json
):
% dbt ls -s state:modified --state state
16:30:03 Running with dbt=1.10.0-a1
16:30:03 Registered adapter: duckdb=1.9.1
16:30:03 Found 1 operation, 1 model, 2 seeds, 424 macros
16:30:03 Found a seed (testy.my_large_seed) >1MiB in size at the same path, dbt cannot tell if it has changed: assuming they are the same
16:30:03 The selection criterion 'state:modified' does not match any enabled nodes
16:30:03 No nodes selected!
This proves that the checksum
of my_small_seed
has not changed.
Now, let's set the config and redo:
% export DBT_MAXIMUM_SEED_SIZE_MIB=10
% dbt parse && mv target/manifest.json state && dbt ls -s state:modified --state state
16:30:45 Running with dbt=1.10.0-a1
16:30:45 Registered adapter: duckdb=1.9.1
16:30:45 Performance info: /Users/jerco/dev/scratch/testy/target/perf_info.json
16:30:46 Running with dbt=1.10.0-a1
16:30:46 Registered adapter: duckdb=1.9.1
16:30:46 Found 1 operation, 1 model, 2 seeds, 424 macros
16:30:46 The selection criterion 'state:modified' does not match any enabled nodes
16:30:46 No nodes selected!
Manually edit one row in the large seed:
% dbt ls -s state:modified --state state
16:36:58 Running with dbt=1.10.0-a1
16:36:58 Registered adapter: duckdb=1.9.1
16:36:58 Found 1 operation, 1 model, 2 seeds, 424 macros
testy.my_large_seed
# We don't want to calculate a hash of this file. Use the path. | ||
source_file = SourceFile.big_seed(match) | ||
else: | ||
file_contents = load_file_contents(match.absolute_path, strip=True) | ||
checksum = FileHash.from_contents(file_contents) | ||
checksum = FileHash.from_path(match.absolute_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have confirmed that this is not a "breaking" change, insofar as the same seed produces the same checksum before and after this change.
@@ -60,6 +61,27 @@ def from_contents(cls, contents: str, name="sha256") -> "FileHash": | |||
checksum = hashlib.new(name, data).hexdigest() | |||
return cls(name=name, checksum=checksum) | |||
|
|||
@classmethod | |||
def from_path(cls, path: str, name="sha256") -> "FileHash": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non-breaking change in the dbt/artifacts/
directory, so I will add the artifact_minor_upgrade
label to this PR
resolves #7117
resolves #7124
Reapply changes from #7125. (This proved easier than rebasing the commits directly.)
Problem
We apply an arbitrary limit of 1 MiB to seeds (CSVs), for the specific purpose of hashing contents and comparing those hashed contents during
state:modified
comparison. (That's a mebibyte, not a megabyte, for those who care to distinguish between the two).That is, dbt doesn't raise an error if it detects a seed larger than
MAXIMUM_SEED_SIZE_MIB
, but certain features become unavailable and dbt does not make any guarantees about acceptable performance.We could adjust this for inflation (1 MiB in 2020 is worth ~1.2 MiB today), but this PR takes the more forward-looking approach of making the value configurable by end users. Users can instruct dbt to compare the contents of larger seeds, so long as they're willing to "pay the price" of hashing the contents of large seeds.
We will need to update docs: https://docs.getdbt.com/reference/node-selection/state-comparison-caveats
Solution
From #7125:
Checklist