dbt docs browser tab crashes with ~100K source tables #3026

Closed
1 of 5 tasks
panasenco opened this issue Jan 23, 2021 · 5 comments
Labels: artifacts, bug, dbt-docs, performance, stale

Comments

@panasenco
Contributor

panasenco commented Jan 23, 2021

Describe the bug

The dbt docs browser tab crashes when it tries to load all ~100K source tables in our data warehouse.

Steps To Reproduce

We have 200 source applications, and some of them have thousands of tables. See the model yml file inside big.zip, which approximately replicates the size of our data warehouse.

After adding the above yml file to the models folder, run dbt docs generate and then dbt docs serve.
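
The contents of big.zip are not reproduced in this thread. As a rough stand-in, a sources file of similar scale could be generated with a sketch like the one below. The source names, table names, and the 200 x 500 split are assumptions chosen for illustration, not the reporter's actual file.

```python
def build_sources_yaml(num_sources: int = 200, tables_per_source: int = 500) -> str:
    """Return dbt sources YAML declaring num_sources * tables_per_source tables."""
    lines = ["version: 2", "", "sources:"]
    for s in range(num_sources):
        lines.append(f"  - name: app_{s:03d}")  # hypothetical source name
        lines.append("    tables:")
        for t in range(tables_per_source):
            lines.append(f"      - name: table_{t:04d}")  # hypothetical table name
    return "\n".join(lines) + "\n"


def count_tables(yaml_text: str) -> int:
    """Count table entries by their indentation (six spaces before the dash)."""
    return sum(1 for line in yaml_text.splitlines() if line.startswith("      - name: "))
```

Writing the returned text to a .yml file in the models folder (the file name is up to you) and then running dbt docs generate and dbt docs serve should approximate the scale described above. The defaults declare 200 x 500 = 100,000 tables.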

Expected behavior

Expected to be able to navigate the docs site.

Screenshots and log output

Firefox: [screenshot]

Chrome: [screenshot]

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: SQL Server) (selected)

The output of dbt --version:

installed version: 0.19.0-rc2
   latest version: 0.18.1

Your version of dbt is ahead of the latest release!

Plugins:
  - bigquery: 0.19.0rc2
  - postgres: 0.19.0rc2
  - redshift: 0.19.0rc2
  - snowflake: 0.19.0rc2
  - sqlserver: 0.19.0rc2

The operating system you're using:
Windows 10

The output of python --version:
Python 3.8.3

Additional context

With 40,000 tables, the site eventually loads.

How difficult would it be to add a docs generation mode where each table gets its own html page and we avoid loading the entire manifest.json into browser memory?

@panasenco panasenco added the bug and triage labels Jan 23, 2021
@jtcohen6
Contributor

jtcohen6 commented Jan 23, 2021

Hey @panasenco, thanks for raising this. As you can see in dbt-labs/dbt-docs#170 as well, the docs site is due for a refactor, now that there are projects orders of magnitude bigger than what it was initially designed to support a few years ago.

I think the prescription here may be more severe than the one in that issue: If a manifest.json file is too big to load into the browser, I'm not sure there's anything we can do by way of cleverer JavaScript alone. For that reason, I'm keeping the issue in this repo for now. We may need to revisit a conversation we've considered previously: splitting JSON artifacts into more files, using a different file format, ...

How large is your manifest.json today with ~100k source tables? With ~40k tables?

@jtcohen6 jtcohen6 added the dbt-docs, artifacts, and performance labels and removed the triage label Jan 23, 2021
@panasenco
Contributor Author

Thanks @jtcohen6!

The file manifest.json is 89MB with ~100k tables and 38MB with ~40k tables, so approximately 1KB/table.
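
The per-table figure follows directly from the two measurements. A quick check, assuming 1 MB = 1000 KB and treating the approximate table counts as exact:

```python
def kb_per_table(manifest_mb: float, table_count: int) -> float:
    """Average manifest size per table in KB, taking 1 MB = 1000 KB."""
    return manifest_mb * 1000 / table_count


print(kb_per_table(89, 100_000))  # 0.89
print(kb_per_table(38, 40_000))   # 0.95
```

Both cases land a little under 1 KB per table, so the manifest size appears to grow roughly linearly with the number of source tables.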

I'm considering taking a stab at this (currently browsing through the source code). Thinking back to the conversations you had, could you share what refactoring approach you liked the most? What's the most ideal way forward?

@panasenco
Contributor Author

panasenco commented Jan 23, 2021

Looking at the source code, the quickest hack would be to:

  • Generate multiple manifest.json files by redefining the write() method of WritableManifest
  • Somehow rewrite loadProject to load only one manifest at a time, then call loadProject on every table or database change.
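
The first step can be illustrated outside of dbt as a post-processing pass over an existing manifest.json, rather than by redefining WritableManifest.write(). The sketch below assumes the manifest has a top-level "sources" mapping keyed by unique ID, with each entry carrying a "source_name" field. The function names and the output file layout are hypothetical, and the docs site would not read these files without the loadProject change described in the second step.

```python
import json
from collections import defaultdict
from pathlib import Path


def split_sources_by_name(manifest: dict) -> dict:
    """Group the manifest's source entries by their source_name field."""
    groups = defaultdict(dict)
    for unique_id, source in manifest.get("sources", {}).items():
        groups[source["source_name"]][unique_id] = source
    return dict(groups)


def write_split_manifests(manifest_path: str, out_dir: str) -> list:
    """Write one smaller JSON file per source and return the written paths."""
    manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for source_name, sources in sorted(split_sources_by_name(manifest).items()):
        # Keep the top-level metadata and include only this source's tables.
        part = {"metadata": manifest.get("metadata", {}), "sources": sources}
        target = out / f"manifest_{source_name}.json"  # hypothetical naming scheme
        target.write_text(json.dumps(part), encoding="utf-8")
        written.append(str(target))
    return written
```

Each output file holds only one source's tables, so its size scales with that source rather than with the whole warehouse.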

However, this hack wouldn't be pretty and would probably break countless things. I'll just wait for you guys to implement it properly.

In the meantime, I'll write a script to generate dbt doc sites for each of my source databases separately as a workaround. Then people can refer to these individual static sites when discovering data, and use a separate site for actual models.

@panasenco
Contributor Author

For anyone looking for a workaround, I just wrote the first version of my Python script that generates an index of multiple dbt documentation sites:

https://github.com/panasenco/dbt-docs-index

You can now split your dbt documentation into multiple separate sites and use this script to build an index of them.
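
For readers who only want the general idea, an index of several static sites can be as simple as a page of links. The sketch below is not the linked script and makes no claim about how it works. It assumes each site was generated into its own subdirectory of a common root and that each subdirectory contains an index.html; the function name and page layout are hypothetical.

```python
import html
from pathlib import Path
from urllib.parse import quote


def build_index_html(root_dir: str, title: str = "dbt documentation sites") -> str:
    """Return an HTML page linking to each subdirectory of root_dir that has an index.html."""
    items = []
    for child in sorted(Path(root_dir).iterdir()):
        if child.is_dir() and (child / "index.html").is_file():
            href = quote(child.name) + "/index.html"  # URL-encode the directory name
            label = html.escape(child.name)           # escape the visible link text
            items.append(f'    <li><a href="{href}">{label}</a></li>')
    links = "\n".join(items)
    safe_title = html.escape(title)
    return (
        "<!DOCTYPE html>\n"
        f"<html>\n<head><title>{safe_title}</title></head>\n"
        f"<body>\n  <h1>{safe_title}</h1>\n  <ul>\n{links}\n  </ul>\n</body>\n</html>\n"
    )
```

Writing the returned string to an index.html in the root directory gives a single landing page from which each per-source site can be opened separately.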

@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the stale label Oct 23, 2021