-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/generate_model_import_ctes #74
Conversation
This is an awesome idea! I'm going to document my findings in this PR as I'm testing. I wanted to test two things:
I ran into my first issue, which is that the macro doesn't get the correct references and output the right import CTEs when the explicit reference has the database name attached. Here is my model code: WITH
-- Only orders which tie back to a real customer record
ro as (
select
o.o_orderkey,
c.c_custkey
from raw_tpch.tpch_sf1.orders o
inner join raw_tpch.tpch_sf1.customer c
on o.o_custkey = c.c_custkey
),
rl as (
SELECT c.c_custkey
,c.c_name
,sum(revenue) AS revenue_lost
,c_acctbal
,n.n_name
,c_address
,c_phone
,c_comment
FROM ro left join (SELECT l.l_orderkey AS ORDER_ID, o.o_custkey AS CUSTOMER_ID, SUM(l.l_extendedprice * (1 - l.l_discount)) AS REVENUE
FROM raw_tpch.tpch_sf1.lineitem l LEFT JOIN raw_tpch.tpch_sf1.orders o ON l.l_orderkey = o.o_orderkey
WHERE o.o_orderdate >= '1994-01-01' AND o.o_orderdate < '1994-04-01' AND l.l_returnflag = 'R' GROUP BY o.o_custkey, l.l_orderkey
) lo on lo.ORDER_ID = ro.o_orderkey and lo.CUSTOMER_ID = ro.c_custkey
LEFT JOIN raw_tpch.tpch_sf1.customer c ON c.c_custkey = lo.CUSTOMER_ID LEFT JOIN raw_tpch.tpch_sf1.nation n ON c.c_nationkey = n.n_nationkey
WHERE lo.CUSTOMER_ID is not null GROUP BY c.c_custkey,c.c_name,c.c_acctbal,c.c_phone,n.n_name,c.c_address,c.c_comment ORDER BY revenue_lost DESC
)
select * from rl and here is the compiled code from the macro: with tpch_sf1 as (
select * from raw_tpch.tpch_sf1
-- CAUTION: It's best practice to use the ref or source function instead of a direct reference
)
WITH
-- Only orders which tie back to a real customer record
ro as (
select
o.o_orderkey,
c.c_custkey
from tpch_sf1.orders o
inner join tpch_sf1.customer c
on o.o_custkey = c.c_custkey
),
rl as (
SELECT c.c_custkey
,c.c_name
,sum(revenue) AS revenue_lost
,c_acctbal
,n.n_name
,c_address
,c_phone
,c_comment
FROM ro left join (SELECT l.l_orderkey AS ORDER_ID, o.o_custkey AS CUSTOMER_ID, SUM(l.l_extendedprice * (1 - l.l_discount)) AS REVENUE
FROM tpch_sf1.lineitem l LEFT JOIN tpch_sf1.orders o ON l.l_orderkey = o.o_orderkey
WHERE o.o_orderdate >= '1994-01-01' AND o.o_orderdate < '1994-04-01' AND l.l_returnflag = 'R' GROUP BY o.o_custkey, l.l_orderkey
) lo on lo.ORDER_ID = ro.o_orderkey and lo.CUSTOMER_ID = ro.c_custkey
LEFT JOIN tpch_sf1.customer c ON c.c_custkey = lo.CUSTOMER_ID LEFT JOIN tpch_sf1.nation n ON c.c_nationkey = n.n_nationkey
WHERE lo.CUSTOMER_ID is not null GROUP BY c.c_custkey,c.c_name,c.c_acctbal,c.c_phone,n.n_name,c.c_address,c.c_comment ORDER BY revenue_lost DESC
)
select * from rl Update
Output: with raw_tpch_tpch_sf1_customer as (
select * from raw_tpch.tpch_sf1.customer
-- CAUTION: It's best practice to use the ref or source function instead of a direct reference
)
raw_tpch_tpch_sf1_lineitem as (
select * from raw_tpch.tpch_sf1.lineitem
-- CAUTION: It's best practice to use the ref or source function instead of a direct reference
)
raw_tpch_tpch_sf1_nation as (
select * from raw_tpch.tpch_sf1.nation
-- CAUTION: It's best practice to use the ref or source function instead of a direct reference
)
raw_tpch_tpch_sf1_orders as (
select * from raw_tpch.tpch_sf1.orders
-- CAUTION: It's best practice to use the ref or source function instead of a direct reference
)
tpch_sf1 as (
select * from raw_tpch.tpch_sf1
-- CAUTION: It's best practice to use the ref or source function instead of a direct reference
)
WITH
-- Only orders which tie back to a real customer record
ro as (
select
o.o_orderkey,
c.c_custkey
from raw_tpch_tpch_sf1_orders o
inner join raw_tpch_tpch_sf1_customer c
on o.o_custkey = c.c_custkey
),
rl as (
SELECT c.c_custkey
,c.c_name
,sum(revenue) AS revenue_lost
,c_acctbal
,n.n_name
,c_address
,c_phone
,c_comment
FROM ro left join (SELECT l.l_orderkey AS ORDER_ID, o.o_custkey AS CUSTOMER_ID, SUM(l.l_extendedprice * (1 - l.l_discount)) AS REVENUE
FROM raw_tpch_tpch_sf1_lineitem l LEFT JOIN raw_tpch_tpch_sf1_orders o ON l.l_orderkey = o.o_orderkey
WHERE o.o_orderdate >= '1994-01-01' AND o.o_orderdate < '1994-04-01' AND l.l_returnflag = 'R' GROUP BY o.o_custkey, l.l_orderkey
) lo on lo.ORDER_ID = ro.o_orderkey and lo.CUSTOMER_ID = ro.c_custkey
LEFT JOIN raw_tpch_tpch_sf1_customer c ON c.c_custkey = lo.CUSTOMER_ID LEFT JOIN raw_tpch_tpch_sf1_nation n ON c.c_nationkey = n.n_nationkey
WHERE lo.CUSTOMER_ID is not null GROUP BY c.c_custkey,c.c_name,c.c_acctbal,c.c_phone,n.n_name,c.c_address,c.c_comment ORDER BY revenue_lost DESC
)
select * from rl |
@christineberger Can I ask what you think should be done in this situation? Can we make the assumption that |
@graciegoheen That's a really good question that I'm not sure there's a straight answer to, but we could always just make a v1 decision - I think in order to avoid conflicts, maybe we just treat them as separate import CTEs and the user will need to resolve that themselves. (I think) we can assume that if there's an explicit reference without a database name / schema name, it's referring to the target location. If not, then we know the explicit location - so we could name a references like this:
I haven't thought this all the way through, but what I'm trying to say is yes to separate import CTES! |
Also, I'm just documenting what is working and what's not working, and those can either be limitations of the v1 or fixed in this PR - There's absolutely no issue in getting it working "for the most part" as long as we have the limitations documented! |
{%- set from_regexes = { | ||
'from_ref':'(?i)(from|join)\s+({{\s*ref\s*\()([^)]+)(\)\s*}})', | ||
'from_source':'(?i)(from|join)\s+({{\s*source\s*\([^)]+,)([^)]+)(\)\s*}})', | ||
'from_table_1':'(?i)(from|join)\s+([\[`\"]?\w+[\]`\"]?)\.([\[`\"]?\w+[\]`\"]?)', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may or may not be an issue depending on how this is restructured to capture 3-part vs. 2-part vs 1-part explicit references, but in the output while accounting for a 3-part reference, I'm seeing an additional cte for raw_tpch.sf_1
with no table. This is because one of the regex matches are capturing the pattern for a 2-part reference. To fix this temporarily I added a $
to the end of from_table_1
in order to tell it "just match this and nothing after it", but there are probably many ways you could be more concise or offer more flexibility without limiting the regex match in that way. Something to consider as part of the explicit references piece!
Output:
with
raw_tpch_tpch_sf1_customer as (
select * from raw_tpch.tpch_sf1.customer
-- CAUTION: It's best practice to use the ref or source function instead of a direct reference
),
raw_tpch_tpch_sf1_lineitem as (
select * from raw_tpch.tpch_sf1.lineitem
-- CAUTION: It's best practice to use the ref or source function instead of a direct reference
),
raw_tpch_tpch_sf1_nation as (
select * from raw_tpch.tpch_sf1.nation
-- CAUTION: It's best practice to use the ref or source function instead of a direct reference
),
raw_tpch_tpch_sf1_orders as (
select * from raw_tpch.tpch_sf1.orders
-- CAUTION: It's best practice to use the ref or source function instead of a direct reference
),
-- Only orders which tie back to a real customer record
ro as (
select
o.o_orderkey,
c.c_custkey
from raw_tpch_tpch_sf1_orders o
inner join raw_tpch_tpch_sf1_customer c
on o.o_custkey = c.c_custkey
),
rl as (
SELECT c.c_custkey
,c.c_name
,sum(revenue) AS revenue_lost
,c_acctbal
,n.n_name
,c_address
,c_phone
,c_comment
FROM ro left join (SELECT l.l_orderkey AS ORDER_ID, o.o_custkey AS CUSTOMER_ID, SUM(l.l_extendedprice * (1 - l.l_discount)) AS REVENUE
FROM raw_tpch_tpch_sf1_lineitem l LEFT JOIN raw_tpch_tpch_sf1_orders o ON l.l_orderkey = o.o_orderkey
WHERE o.o_orderdate >= '1994-01-01' AND o.o_orderdate < '1994-04-01' AND l.l_returnflag = 'R' GROUP BY o.o_custkey, l.l_orderkey
) lo on lo.ORDER_ID = ro.o_orderkey and lo.CUSTOMER_ID = ro.c_custkey
LEFT JOIN raw_tpch_tpch_sf1_customer c ON c.c_custkey = lo.CUSTOMER_ID LEFT JOIN raw_tpch_tpch_sf1_nation n ON c.c_nationkey = n.n_nationkey
WHERE lo.CUSTOMER_ID is not null GROUP BY c.c_custkey,c.c_name,c.c_acctbal,c.c_phone,n.n_name,c.c_address,c.c_comment ORDER BY revenue_lost DESC
)
select * from rl
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if in another version we can work on pulling out subqueries into CTEs 🤔 If they have any joins
anywhere after "from", we can assume that they have aliased the subquery with or without using AS
. We can take the alias and create a CTE for it using the subquery text. If there's no joins within the entire from
statement, then we could just temporarily name them. I imagine this will warrant an entire macro to itself!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a re-work on the replacing logic to be more precise - using regex sub instead of replace. Let me know if you still have this issue!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also love the subquery idea - that would be a really cool addition!
@graciegoheen I don't think this has been mentioned yet, but I tested it out with the formatting of joins below, and it only caught the first table.
Results:
|
Co-authored-by: Benoit Perigaud <8754100+b-per@users.noreply.github.com>
@jeremyholtzman TBH I didn't even know you could do joins like that - need to think of a way to actually capture that with regex... |
…labs/dbt-codegen into feature/generate_model_ctes
@graciegoheen an idea for you to handle cases like the one that @jeremyholtzman surfaced:
Here's a non-exhaustive list of parsers (your mileage will vary): ExampleSQLGlot is pretty cool -- it has a parser, transpiler, and optimizer, and (at the time of this writing) it has some level of dialect support for bigquery, clickhouse, duckdb, hive, mysql, oracle, postgres, presto, snowflake, spark, sqlite, starrocks, tableau (?!), and trino. (Kinda off-topic, but transpiling means it can read from one dialect and write out to another.) InstallIf you are using python3 -m venv venv
source venv/bin/activate
python3 -m pip install sqlglot RunLaunch Python in interactive mode: python Create a SQL query using the Postgres dialect: from sqlglot import parse_one, exp
# find all tables (x, y, z)
for table in parse_one("SELECT * FROM x JOIN y JOIN z", read="postgres").find_all(exp.Table):
print(table.name) This and other examples in their documentation. |
Crazy idea # 1
Crazy idea # 2(Which is not actually related to this pull request!)
This dialect could replace database-specific keywords, functions, etc with their cross-database macro equivalents. Idea # 2 would be most relevant for code that needs/wants to be cross-database compatible, like packages on the dbt Hub. |
I personally feel that we might tweak the regexes to make them work in most of the cases (even if they won't work 100% of the time) but avoid using third-party libraries. (I'd be keen to have as many tests as possible on the regexes though because backward compatibility can be difficult to identify with regexes whenever we do a change)
|
@graciegoheen there are commas missing after the import CTEs in my output: Original Queryselect customer_id,
customer_name,
coalesce(purchase_total,0) as purchase_total,
coalesce(lr.revenue_lost, 0) as return_total,
coalesce(purchase_total - return_total, 0) as lifetime_value,
(coalesce(return_total, 0)/coalesce(purchase_total,0))*100 as return_pct,
case when coalesce(purchase_total, 0) = 0 then 'red' when return_pct <= 25 then 'green' when return_pct <= 50 then 'yellow' when return_pct <= 75 then 'orange' when return_pct <= 100 then 'red' end as customer_status
from
(select c.c_custkey as customer_id,
c.c_name as customer_name,
round(sum(customer_cost), 2) as purchase_total
from {{ source('TPCH_SF1', 'CUSTOMER') }} c
left join {{ ref('order_line_items') }} oli
on c.c_custkey = oli.customer_id
where item_status != 'R'
group by c_custkey, c_name
order by c_custkey) co
left join
{{ ref('lost_revenue') }} lr
on co.customer_id = lr.c_custkey Outputwith lost_revenue as (
select * from {{ ref('lost_revenue') }}
)
order_line_items as (
select * from {{ ref('order_line_items') }}
)
source_customer as (
select * from {{ source('TPCH_SF1', 'CUSTOMER') }}
-- CAUTION: It's best practice to create staging layer for raw sources
)
select customer_id,
customer_name,
coalesce(purchase_total,0) as purchase_total,
coalesce(lr.revenue_lost, 0) as return_total,
coalesce(purchase_total - return_total, 0) as lifetime_value,
(coalesce(return_total, 0)/coalesce(purchase_total,0))*100 as return_pct,
case when coalesce(purchase_total, 0) = 0 then 'red' when return_pct <= 25 then 'green' when return_pct <= 50 then 'yellow' when return_pct <= 75 then 'orange' when return_pct <= 100 then 'red' end as customer_status
from
(select c.c_custkey as customer_id,
c.c_name as customer_name,
round(sum(customer_cost), 2) as purchase_total
from source_CUSTOMER c
left join order_line_items oli
on c.c_custkey = oli.customer_id
where item_status != 'R'
group by c_custkey, c_name
order by c_custkey) co
left join lost_revenue lr
on co.customer_id = lr.c_custkey |
@graciegoheen -- I am also missing commas! I ran this against my coalesce demo project from last year Input-- if it's not abundantly clear this is a completely fabricated query on data that exists nowhere
select
c.id as claim_id,
p.name_f || ' ' || p.name_l as pat_name,
p.dob,
d.name as doctor_name,
d.npi,
(select max(specialty) from development.dbt_dconnors_ehr_data.doc_specialties ds where ds.doc_id = d.id) as spec_name,
prim_diag.icd_10_code as primary_diag,
prim_diag.icd_10_code_descrip as primary_diag_desc,
sec_diag.icd_10_code as sec_diag,
sec_diag.icd_10_code_descrip as sec_diag_desc,
c.ClaimNumber,
ca.total_claim_amnt,
ca.paid_amnt
from
development.dbt_dconnors_ehr_data.patients p
join development.dbt_dconnors_billing_data.claims c
on p.id = c.pat_id
left join (
select
cl.claim_id,
sum(chrgamnt) total_claim_amnt,
sum(paid_amnt) paid_amnt
from development.dbt_dconnors_billing_data.claim_line cl
where status in ('billed', 'adjudicated', 'closed')
group by 1
) ca
on c.id = ca.claim_id
left join (
select
claim_id,
icd10 as icd_10_code,
icd10desc as icd_10_code_descrip,
split_part(icd10, '.', 1) as parent_icd10
from development.dbt_dconnors_billing_data.claim_diagnosis
where pos = 1
) as prim_diag
on c.id = prim_diag.claim_id
left join (
select
claim_id,
icd10 as icd_10_code,
icd10desc as icd_10_code_descrip,
split_part(icd10, '.', 1) as parent_icd10
from development.dbt_dconnors_billing_data.claim_diagnosis
where pos = 2
) as sec_diag
on c.id = sec_diag.claim_id
left join development.dbt_dconnors_ehr_data.doctors d
on c.doc_id = d.id
where c.test = false
and p.email not like '%@test-patient.com'
and c.bill_attmps > 0 Outputwith development_dbt_dconnors_billing_data_claim_diagnosis as (
select * from development.dbt_dconnors_billing_data.claim_diagnosis
-- CAUTION: It's best practice to use the ref or source function instead of a direct reference
)
development_dbt_dconnors_billing_data_claim_line as (
select * from development.dbt_dconnors_billing_data.claim_line
-- CAUTION: It's best practice to use the ref or source function instead of a direct reference
)
development_dbt_dconnors_billing_data_claims as (
select * from development.dbt_dconnors_billing_data.claims
-- CAUTION: It's best practice to use the ref or source function instead of a direct reference
)
development_dbt_dconnors_ehr_data_doc_specialties as (
select * from development.dbt_dconnors_ehr_data.doc_specialties
-- CAUTION: It's best practice to use the ref or source function instead of a direct reference
)
development_dbt_dconnors_ehr_data_doctors as (
select * from development.dbt_dconnors_ehr_data.doctors
-- CAUTION: It's best practice to use the ref or source function instead of a direct reference
)
development_dbt_dconnors_ehr_data_patients as (
select * from development.dbt_dconnors_ehr_data.patients
-- CAUTION: It's best practice to use the ref or source function instead of a direct reference
)
-- if it's not abundantly clear this is a completely fabricated query on data that exists nowhere
select
c.id as claim_id,
p.name_f || ' ' || p.name_l as pat_name,
p.dob,
d.name as doctor_name,
d.npi,
(select max(specialty) from development_dbt_dconnors_ehr_data_doc_specialties ds where ds.doc_id = d.id) as spec_name,
prim_diag.icd_10_code as primary_diag,
prim_diag.icd_10_code_descrip as primary_diag_desc,
sec_diag.icd_10_code as sec_diag,
sec_diag.icd_10_code_descrip as sec_diag_desc,
c.ClaimNumber,
ca.total_claim_amnt,
ca.paid_amnt
from development_dbt_dconnors_ehr_data_patients p
join development_dbt_dconnors_billing_data_claims c
on p.id = c.pat_id
left join (
select
cl.claim_id,
sum(chrgamnt) total_claim_amnt,
sum(paid_amnt) paid_amnt
from development_dbt_dconnors_billing_data_claim_line cl
where status in ('billed', 'adjudicated', 'closed')
group by 1
) ca
on c.id = ca.claim_id
left join (
select
claim_id,
icd10 as icd_10_code,
icd10desc as icd_10_code_descrip,
split_part(icd10, '.', 1) as parent_icd10
from development_dbt_dconnors_billing_data_claim_diagnosis
where pos = 1
) as prim_diag
on c.id = prim_diag.claim_id
left join (
select
claim_id,
icd10 as icd_10_code,
icd10desc as icd_10_code_descrip,
split_part(icd10, '.', 1) as parent_icd10
from development_dbt_dconnors_billing_data_claim_diagnosis
where pos = 2
) as sec_diag
on c.id = sec_diag.claim_id
left join development_dbt_dconnors_ehr_data_doctors d
on c.doc_id = d.id
where c.test = false
and p.email not like '%@test-patient.com'
and c.bill_attmps > 0 |
Leaving some additional comments here, but I'm not exactly sure if you wanted to account for these scenarios or not! I personally don't think they are worth holding up this PR for, so happy to approve after the missing commas after multiple ctes issue is fixed. This macro can be iterated on for these if we think they're worth it, and people should always be checking what's generated and tweak the output if they need to anyway. Scenario 1 | Joins by CommaInput{{
config(
materialized='table'
)
}}
SELECT MIN(ps_supplycost) AS min_supply_cost, PS_PARTKEY AS PARTKEY FROM
{{ source('TPCH_SF1', 'PARTSUPP') }},
{{ source('TPCH_SF1', 'SUPPLIER') }},
{{ source('TPCH_SF1', 'NATION') }},
{{ source('TPCH_SF1', 'REGION') }}
WHERE s_suppkey = ps_suppkey
AND s_nationkey = n_nationkey
AND n_regionkey = r_regionkey
AND r_name = 'EUROPE'
GROUP BY PS_PARTKEY Output{{
config(
materialized='table'
)
}}
with source_partsupp as (
select * from {{ source('TPCH_SF1', 'PARTSUPP') }}
)
SELECT MIN(ps_supplycost) AS min_supply_cost, PS_PARTKEY AS PARTKEY FROM source_PARTSUPP,
{{ source('TPCH_SF1', 'SUPPLIER') }},
{{ source('TPCH_SF1', 'NATION') }},
{{ source('TPCH_SF1', 'REGION') }}
WHERE s_suppkey = ps_suppkey
AND s_nationkey = n_nationkey
AND n_regionkey = r_regionkey
AND r_name = 'EUROPE'
GROUP BY PS_PARTKEY Scenario 2 | Non-explicit table referencesI do concede this is a bad practice and probably an edge case since it would require the person's target to have all objects that they could be referring to, but if someone is migrating a script this very well could be the case. Inputwith
select
e.p_name AS Part_Name,
e.p_retailprice AS RetailPrice,
e.s_name AS Supplier_Name,
e.p_mfgr AS Part_Manufacturer,
e.s_address AS SuppAddr,
e.s_phone AS Supp_Phone,
ps.PS_AVAILQTY AS Num_Available
from {{ ref('EUR_LOWCOST_BRASS_SUPPLIERS') }} e
LEFT JOIN {{ source('TPCH_SF1', 'SUPPLIER') }} s on e.S_NAME = s.S_NAME
LEFT JOIN
partsupp ps
on e.P_PARTKEY = ps.PS_PARTKEY
and s.S_SUPPKEY = ps.PS_SUPPKEY
where n_name = 'UNITED KINGDOM' Outputwith eur_lowcost_brass_suppliers as (
select * from {{ ref('EUR_LOWCOST_BRASS_SUPPLIERS') }}
),
source_supplier as (
select * from {{ source('TPCH_SF1', 'SUPPLIER') }}
-- CAUTION: It's best practice to create staging layer for raw sources
),
select
e.p_name AS Part_Name,
e.p_retailprice AS RetailPrice,
e.s_name AS Supplier_Name,
e.p_mfgr AS Part_Manufacturer,
e.s_address AS SuppAddr,
e.s_phone AS Supp_Phone,
ps.PS_AVAILQTY AS Num_Available
from EUR_LOWCOST_BRASS_SUPPLIERS e
LEFT JOIN source_SUPPLIER s on e.S_NAME = s.S_NAME
LEFT JOIN
partsupp ps
on e.P_PARTKEY = ps.PS_PARTKEY
and s.S_SUPPKEY = ps.PS_SUPPKEY
where n_name = 'UNITED KINGDOM' |
@christineberger and @dave-connors-3 the missing comma issue should now be resolved! Thank you Christine for identifying those outstanding issues. I think I'm okay with saying "Joins by Comma" is out of scope for this PR (but curious if other people disagree). "Non-explicit table references" would be tricky to capture, because the syntax for that is identical to selecting from a cte. How often does that come up? We could also call these out as two known issues (either in the README or actually opening up github issues) that won't currently be covered by this macro. That way, at least no one is surprised that it didn't work for their use case. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested this out on BigQuery and it worked! Only change I think should be made is to the README, and be explicit in the instructions to call codegen.generate_model_import_ctes to prevent errors and to match other documentation.
Not worth changing but a heads up: the only other thing I noticed was that if you have some import CTEs but not all, it will create semi-duplicate CTEs at the top (I'd already created the dependencies import CTE)
BUT not something I think you should change or worry about. The use case is clearly for the code that doesn't have any import CTEs rather than some. And extra CTEs won't hurt even if they do already have some import CTEs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried a few queries and it worked fine. Awesome work Grace, this will be super valuable for refactoring code!
My only comment would be to slightly change the code to generate SQL that follows our new conventions for CTEs, i.e.
cte1 as (
...
),
cte2 as (
...
),
with 1 blank row between CTEs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🦞
@graciegoheen Are you ready for this to be merged and released? I haven't done any sort of review of this, but I did have a peek at the readme. My one piece of feedback is that I assumed this was taking a string of code and reformatting it; turns out it takes a model and reformats raw_sql or raw_code. Curious as to why that was the approach (as opposed to audit_helper's pattern where you create your sql strings and pass them into the macro), but not against it at all! However, I think it would be useful to clarify that in your README - as a foolish optimist who has never worked on proserv, I assume that models are already clean and follow dbt best practices, so I couldn't work out what import CTEs you were going to create. Maybe you say "make a model with the janky code, and then point this macro at it" or something? Or consider doing it as
|
@joellabes Yes, I'm feeling ready! Great feedback. Since all of the other macros in this package take either a source or model name as the input, I was following the same pattern for this one. If this belongs in
So, I was imagining the workflow where you've created your model in dbt, copy/pasted in your old code, then want to generate the same sql with references to tables pulled out into import CTEs. To me, opening up a statement tab, calling this macro, compiling the code, and copy/pasting in the compiled output feels more intuitive (in the context of the other |
Yeah I still thought this would happen in a scratchpad - I would have imagined this workflow:
By doing that, you're not pasting the same file over itself (in your current approach, you make the model containing the legacy code, then paste the compiled code over the top of the same file). You're going to use it though so if you're happy with the current workflow then I defer to you 🙇 |
@joellabes I'd like to stick with the current workflow - but I'll update the README to add some additional clarity! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's rumble 🎉
Feature/generate_model_import_ctes
This is a:
main
dev/
branchdev/
branchDescription & motivation
I created a macro to generate the SQL for a given model with all references pulled up into import CTEs, which you can then paste back into the model.
Checklist