-
Notifications
You must be signed in to change notification settings - Fork 3.1k
feat(ingest/sql): column logic + join extraction #13426
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
❌ 1 Tests Failed:
View the top 1 failed test(s) by shortest run time
View the full list of 1 ❄️ flaky tests
To view more test analytics, go to the Test Analytics Dashboard |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome!
beyond self-join, what's the coverage for other join variants such as inner/outer/cross joins? they could be explicitely tested
should user-facing doc be updated with the new features?
@sgomezvillamor this should work for all join types, including ones that use subqueries or CTEs There's no user-facing portion of this yet, so I don't want to add any user-facing docs around this until it shows up in the product somewhere |
@@ -1383,7 +1390,16 @@ def _gen_lineage_for_downstream( | |||
if self.can_generate_query(query_id) | |||
else None | |||
), | |||
confidenceScore=queries_map[query_id].confidence_score, | |||
confidenceScore=query.confidence_score, | |||
transformOperation=( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we format as a string and not convert it into a JSON object to make it parsable by machine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is shown in the UI as-is
@@ -26,6 +26,7 @@ | |||
"downstreams": [ | |||
"urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:mssql,demodata.foo.age_dist,PROD),age)" | |||
], | |||
"transformOperation": "COPY: [persons].[age] AS [age]", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this correct with the square brackets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes - mssql uses square brackets for quotes
Adds a couple new features to our SQL parser
is_direct_copy
bool.on
clause used and (2) the full list of tables/columns involved in the join. The latter is more reliable when there's CTEs.The column logic extraction is also now integrated into the
SqlParsingAggregator
, so our main warehouse integrations should benefit from it. The join logic extraction is not yet used anywhere.All tests are updated, and I also added a couple new ones for tricky edge cases (e.g. self-joins).