Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-86] [Feature] Write All Store Test Failures Into One Table with an Unstructured JSON/VARIANT/SUPER Column #4613

Open
1 task done
codigo-ergo-sum opened this issue Jan 24, 2022 · 18 comments
Labels
dbt tests Issues related to built-in dbt testing functionality enhancement New feature or request

Comments

@codigo-ergo-sum
Copy link

Is there an existing feature request for this?

  • I have searched the existing issues

Describe the Feature

Right now store-failures in dbt writes out the rows returned for each failed test into a separate table if --store-failures is specified. This feature is definitely helpful from a developer perspective to be able to immediately see what rows failed in a test by querying the returned table.

It is somewhat less useful in its current incarnation in a production environment where jobs and tests may run many times per day and the operations team would like to have a unified, easily queryable database-persisted record of all tests. Even better perhaps would be a more automated way of pushing test failures back to users.

One way of doing this would be to create one table where all test failures are written to. The big obstacle to this on some platforms, such as BigQuery, has been the lack of a semi-structured native table/column format. Snowflake already has VARIANT and Redshift now has SUPER. This is important because different failed tests on different tables are naturally going to have different schemas that can't easily be reconciled in one table where columns have to be declared ahead of time.

BigQuery has now just added a JSON datatype: https://cloud.google.com/bigquery/docs/reference/standard-sql/json-data

What if we now had a table with a set of columns where all test failures would be stored:

  1. Failed Test Name (this is where being able to explicitly name tests rather than use the default generated name gets really interesting)
  2. Failed Test Model Name
  3. Failed Test Timestamp
  4. Failed Test Owner
  5. Failed Test Description (I'm less sold on this column, maybe it's overkill)
  6. Failed Test Data - Where the entire row which failed gets inserted as JSON for future querying as needed

Maybe even more interesting would be the ability to group different failed tests into different long-term tables for storage. Something like this in dbt_project.yml:

tests:  
  +store_failures: true
  
  my_project:
    folder_1:
      +table_for_failed_tests_for_this_folder: really_sensitive_test_failures_table
    folder_2:
      +table_for_failed_tests_for_this_folder: not_so_sensitive_test_failures_table

Then I could go back and write a report which issues a query to the database like this:

SELECT * FROM not_so_sensitive_test_failures_table WHERE owner = 'data_curation@mycompany.com' AND failed_test_timestamp > (CURRENT_TIMESTAMP - INTERVAL 24 HOURS)

What say ye?

Describe alternatives you've considered

Writing our own macros to parse INFORMATION_SCHEMA to find failed tests in the schema that dbt creates for storing failures and then unifying everything that dbt had put in there into one JSON-style table. But doing it right in dbt would be a lot more elegant.

Who will this benefit?

Anybody trying to store test failures over time and then push out responsibility for correction outside of the data engineering team.

Are you interested in contributing this feature?

Yes - I poked around in the code for 10 minutes in dbt-core and didn't obviously see where to start editing but definitely interested

Anything else?

No response

@codigo-ergo-sum codigo-ergo-sum added enhancement New feature or request triage labels Jan 24, 2022
@github-actions github-actions bot changed the title [Feature] Write All Store Test Failures Into One Table with an Unstructured JSON/VARIANT/SUPER Column [CT-86] [Feature] Write All Store Test Failures Into One Table with an Unstructured JSON/VARIANT/SUPER Column Jan 24, 2022
@codigo-ergo-sum
Copy link
Author

codigo-ergo-sum commented Jan 24, 2022

Another possible related thought on this. What if the table into which test results are stored actually became a node in the DAG? Given that tests are already nodes in the DAG, these tables/models would become nodes downstream of each test. And then it would be possible to have models which in turn are built on top of the these failure storage tables which would be very helpful for the purposes of data quality management.

That idea could be combined with my original proposal in this ticket to allow larger combined tables, with the semi-structured format suggested, to then be used by downstream analytics. If the failure storage tables were treated as similar to snapshots, they also then wouldn't be truncated/dropped-and-recreated on the next dbt run. Which would also make it easier to see, if when dbt is re-run, the same error is repeated again or whether it no longer exists.

@jtcohen6 jtcohen6 added Team:Execution dbt tests Issues related to built-in dbt testing functionality labels Jan 24, 2022
@iknox-fa
Copy link
Contributor

iknox-fa commented Jan 25, 2022

Hi @codigo-ergo-sum!
Thanks for such a well thought out feature request!

A few ideas:

  • A lot of this can be achieved via the cloud metadata API. The main thing missing in this approach would be the actual content of the stored failure rows, so the feature here would be to provide a linking id from the metadata API to the stored test failures. It's also only valid if you use cloud :)

  • Another approach would be to use on-run-end hooks + the results object. This would require some macro writing, but it's fairly straightforward to implement.

  • All of that said, some version of this idea has been coming up quite a bit recently so I'm going to open a new ticket (scoped slightly smaller) to give some careful consideration to how we handle store-failures-- we can certainly make it more useful for this kind of thing.

Would either or those first two ideas be suitable (or close to suitable) for your use case?

@iknox-fa
Copy link
Contributor

Oh, and regarding adding the stored failures as a node in the DAG-- I love the idea! We'll have to see where #4624 takes us

@iknox-fa iknox-fa removed the triage label Jan 25, 2022
@jleacox
Copy link

jleacox commented Jan 28, 2022

FYI we are getting around this by setting up the returned failed test record columns to be all nearly identical, with only minor variations between them (also using BQ and batched runs). That way we can just simply run a SELECT * FROM project.dataset.* I agree would be awesome to have this in the DAG and to be able to combine and store failed records from batch runs incrementally.

@iknox-fa iknox-fa self-assigned this Jan 31, 2022
@codigo-ergo-sum
Copy link
Author

I'm going to try using on-run-end post-hook combined with a macro, and then we also just got JSON support enabled for our BQ environment today (it's a special thing that has to be turned on by request.) So I'll report back :).

@mathcoder3141
Copy link

mathcoder3141 commented Apr 13, 2022

@codigo-ergo-sum how'd you go on your test for this? We're looking to start cleaning up our schemas and having something like this would be incredibly valuable :)

I started to store all our testing failures but we sparsely use it so it's getting hard to traverse the schema it stores it to :(

@codigo-ergo-sum
Copy link
Author

codigo-ergo-sum commented May 21, 2022

So I did this initial take for BigQuery, basing off of code in this issue (which was for Snowflake): #4099:

{% macro centralize_test_failures(results) %}

  {%- set test_results = [] -%}
  {%- for result in results -%}
    {%- if result.node.resource_type == 'test' and result.status == 'fail' and (
          result.node.config.get('store_failures') or flags.STORE_FAILURES
      )
    -%}
      {%- do test_results.append(result) -%}
    {%- endif -%}
  {%- endfor -%}
  
  {%- set central_tbl -%} {{ target.schema }}.test_failure_central {%- endset -%}
  
  {{ log("Centralizing test failures in " + central_tbl, info = true) if execute }}

  create or replace table {{ central_tbl }} (test_name STRING, test_failures_json JSON, test_ts timestamp) as (
  
  {% for result in test_results %}

    select
      '{{ result.node.name }}' as test_name,
      to_json(t) as test_failures_json,
      current_timestamp() as test_ts
      
    from {{ result.node.relation_name }} as t

    {{ "union all" if not loop.last }}
  {% endfor %}
  
  )

{% endmacro %}

A few thoughts:

  1. We have about 3700 tests in our project. The original version of the code for this macro didn't select specifically only tests that fail, so that means a very, very long SQL query unioning all the 3700 tables even though we would only want the ones with rows. BigQuery actually gives an error and won't run the query if I run all the tests in the project, this is the error message: The query is too large. The maximum standard SQL query length is 1024.00K characters, including comments and white space characters. So I changed from result.status != 'skipped' to result.status == 'fail which trimmed things down. In projects that have a lot of tests that fail though, the problem will still arise so longer-term likely need to have some sort of method of batching a certain number of the failed test SQL aggregation together and then appending to it.
  2. Some tests don't seem to actually record the whole contents of the row, which is what we want. In particular, I noticed that accepted_values (part of dbt core) and also dbt_utils.not_accepted_values don't record the whole row. I'll have to follow up on this.
  3. I'm interesting in capturing the "owner" field from the meta object so we could start filtering different tests and exposing them to different groups that way. But I think there were some issues in getting the meta object data from the Results object. I'll have to go back and look more at that.
  4. I'm not crazy about having this be part of an on-run-end hook. We want our full DAG to run and not stop when tests fail. So we do this:
dbt --warn-error --no-partial-parse build --full-refresh --exclude test_type:schema test_type:data && \
dbt test

So we'd have the on-run-end hook run as part of the first step and then the second again. We don't really want that, we just want the tests to be collected and stored as part of the second step. I'd almost rather have this code run as a model and not an on-run-end hook but I'm not sure that I would be able to capture the results correctly then. Also running it as a model would then let us have other models which look at this output data downstream and perhaps do further downstream transformations on this aggregated test data. This would fit nicely in with having the --store-failures tests themselves be nodes in the DAG, it could all be more cleanly orchestrated.

Time permitting, I'll try to poke at this more next week.

@ismailsimsek
Copy link

ismailsimsek commented Aug 18, 2022

+1 this is correct way of storing test results instead current one.

  • an option to limit the number of rows saved could be helpful. like save only 10 rows out of failed test records!
  • also and option to print the compiled test query couldl be helpful too. only when the test fails!

@joellabes
Copy link
Contributor

  • an option to limit the number of rows saved could be helpful. like save only 10 rows out of failed test records!

Have a look at https://docs.getdbt.com/reference/resource-configs/limit - I think you'll find it does what you need

  • also and option to print the compiled test query couldl be helpful too. only when the test fails!

Do you mean printing to the console, or just accessing the compiled code in general? All compiled tests are stored in your target directory where you can look at them - https://docs.getdbt.com/docs/faqs/checking-logs. If you wanted the whole compiled code to be printed to the console, I'd suggest opening a separate issue for that.

@iknox-fa iknox-fa removed their assignment Dec 4, 2022
@bashyroger
Copy link

Also see my comment on this closed ticket that is definitely related: #2593 (comment)

@unstoppable-allensun
Copy link

@codigo-ergo-sum Have you made any more progress with this using on-run-end hook?

@ismailsimsek
Copy link

+1 is there any progress on this?

@martanthony
Copy link

Given the planned introduction of unit testing in 1.8, it'd be great to see this put on the roadmap for 1.9!

Copy link
Contributor

github-actions bot commented Aug 6, 2024

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Aug 6, 2024
Copy link
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Aug 13, 2024
@martanthony
Copy link

Still interested in this.

@joellabes joellabes reopened this Aug 13, 2024
@paulfry-payroc
Copy link

Still interested in this.

same

@github-actions github-actions bot removed the stale Issues that have gone stale label Aug 14, 2024
@ismailmuller
Copy link

same

Still interested in this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dbt tests Issues related to built-in dbt testing functionality enhancement New feature or request
Projects
None yet
Development

No branches or pull requests