GH-46572: [Python] expose filter option to python for join #46566
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
or
See also: |
cc @richardliaw, since I discussed this with Richard and he suggested I give this a try. It may be helpful for the Ray project too. Thanks! |
Hi @xingyu-long, thank you for opening a PR! |
@AlenkaF Thanks for taking a look! I just opened an issue to track this (#46572). The failing tests are probably related to the corresponding Python callers / function definitions, but could you take a look first? Since the main part is enabling the join option in _acero.pyx, I'd like to get some feedback from the community on this part and see if it makes sense. Thanks! |
Thanks for opening this issue! I've marked the PR as a draft and updated the title. Regarding the call in Also, please make sure to connect it with CC: @raulcd for any additional thoughts. |
Thanks for the PR. In principle this looks good to me. I would just make it the last argument of the function signature. As we are not using keyword-only arguments, this change alters the positional order of the signature, provoking an unnecessary breaking change for users.
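To illustrate the concern about positional callers, here is a minimal sketch in plain Python. The functions `join_v1`, `join_mid`, and `join_last` are hypothetical stand-ins invented for this example; they are not the pyarrow signatures.

```python
# Hypothetical stand-ins for a join function; NOT the real pyarrow signature.
def join_v1(keys, join_type="left outer", use_threads=True):
    return {"keys": keys, "join_type": join_type, "use_threads": use_threads}


# New parameter inserted in the middle: existing positional calls shift.
def join_mid(keys, join_type="left outer", filter_expression=None,
             use_threads=True):
    return {"keys": keys, "join_type": join_type,
            "filter_expression": filter_expression,
            "use_threads": use_threads}


# New parameter appended last: existing positional calls keep their meaning.
def join_last(keys, join_type="left outer", use_threads=True,
              filter_expression=None):
    return {"keys": keys, "join_type": join_type,
            "filter_expression": filter_expression,
            "use_threads": use_threads}


# A call written against join_v1 that passes use_threads positionally.
old_call = ("id", "inner", False)

before = join_v1(*old_call)
mid = join_mid(*old_call)    # False silently lands in filter_expression
last = join_last(*old_call)  # False still lands in use_threads
```

With the parameter in the middle, the old call silently passes `False` as the filter and re-enables threading. With the parameter last, the old call behaves as before.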
Thanks for the suggestion @AlenkaF @raulcd, I just updated the code. By the way, I observed two things while writing tests for this:
Is this intended behavior? For example, let's assume that we have two tables which have some common fields ( but if we could apply the filter on both tables before joining them, wouldn't it be more efficient? That's why I'd like to confirm the expected behavior of this filter in the C++ implementation.
In [54]: import pandas as pd
...: import pyarrow as pa
...: import pyarrow.compute as pc
...: df1 = pd.DataFrame({'id': [1, 2, 3],
...: 'year': [2020, 2022, 2019]})
...: df2 = pd.DataFrame({'id': [3, 4],
...: 'n_legs': [5, 100],
...: 'animal': ["Brittle stars", "Centipede"]})
...: t1 = pa.Table.from_pandas(df1)
...: t2 = pa.Table.from_pandas(df2)
In [55]: t1.join(t2, 'id', join_type="right outer").combine_chunks()
Out[55]:
pyarrow.Table
year: int64
id: int64
n_legs: int64
animal: string
----
year: [[2019,null]]
id: [[3,4]]
n_legs: [[5,100]]
animal: [["Brittle stars","Centipede"]]
# and then we apply filter expression with intended mismatch here
In [56]: t1.join(t2, 'id', join_type="right outer", filter_expression=pc.equal(pc.field("n_legs"), 200)).combine_chunks()
Out[56]:
pyarrow.Table
year: int64
id: int64
n_legs: int64
animal: string
----
year: [[null,null]]
id: [[3,4]]
n_legs: [[5,100]]
animal: [["Brittle stars","Centipede"]]
it seems we didn't return empty, instead, we return the
btw, it seems fine with inner join type.
In [57]: t1.join(t2, 'id', join_type="inner", filter_expression=pc.equal(pc.field("n_legs"), 200)).combine_chunks()
Out[57]:
pyarrow.Table
id: int64
year: int64
n_legs: int64
animal: string
----
id: []
year: []
n_legs: []
animal: []
this one seems like a bug to me, but I am not sure, @AlenkaF @raulcd could you provide some feedback on these two questions? Thanks! |
I am no expert on this area. |
Thank you @xingyu-long for contributing this! I'd first address your concern of:
Yes, this is expected by SQL semantic. And this is also the difference between you put an expression within Conceptually, all subexpressions in
Yes this is necessary for preserving the SQL-like join semantic - as long as you write the filter in the
In this case you can just do the filter ahead of join, e.g.,
As long as it is what you needed.
|
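For readers following along, the difference between a filter inside the join and a filter applied ahead of the join can be sketched with a toy model in plain Python. This is only an illustration of the SQL-style semantics described above, using the data from the earlier example; the helper `right_outer_join` and the predicate `never_true` are hypothetical names, and this is not how Acero implements joins.

```python
# Toy model of a right outer join with a residual filter, in plain Python.
# It sketches SQL-style semantics only; it is NOT Acero's implementation.
def right_outer_join(left, right, key, residual=None):
    # Assumes `left` is non-empty so its column names can be read.
    left_cols = [c for c in left[0] if c != key]
    out = []
    for r in right:
        # A left row matches only if the key is equal AND the residual
        # filter (if any) accepts the combined row.
        matches = [
            l for l in left
            if l[key] == r[key]
            and (residual is None or residual({**l, **r}))
        ]
        if matches:
            out.extend({**l, **r} for l in matches)
        else:
            # An unmatched right row is kept; left columns become None.
            out.append({**{c: None for c in left_cols}, **r})
    return out


left = [
    {"id": 1, "year": 2020},
    {"id": 2, "year": 2022},
    {"id": 3, "year": 2019},
]
right = [
    {"id": 3, "n_legs": 5, "animal": "Brittle stars"},
    {"id": 4, "n_legs": 100, "animal": "Centipede"},
]


def never_true(row):
    # Intentional mismatch, mirroring n_legs == 200 in the example above.
    return row["n_legs"] == 200


# Filter as part of the join condition: both right rows survive,
# and the left-side column is null for each of them.
in_join = right_outer_join(left, right, "id", residual=never_true)

# Filter applied to the right input before joining: no rows remain.
pre_filtered = right_outer_join(
    left, [r for r in right if never_true(r)], "id")
```

In this model, `in_join` keeps the rows with `id` 3 and 4 and sets `year` to `None` for both, which mirrors the right outer join output shown earlier, while `pre_filtered` is empty.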
If my above comment addresses your concern, I'll in turn review the code. Thank you @xingyu-long . |
Thanks @zanmato1984 for your explanation, it makes sense. I should probably mention more details about this usage in the function docstring then. In the meantime, feel free to review the changes, since they just expose to Python what the C++ side already does. |
Some nits.
I see. So if I understand this correctly, ideally we should assign distinct names to the key columns on both sides before using a filter expression, since output_suffix_for_left only applies to the output at the end of the workflow, right? (Sorry if this is a dumb question...) I.e., something like this won't work:
import pyarrow.compute as pc
from pyarrow.acero import Declaration, HashJoinNodeOptions
# left_source and right_source are table source declarations defined elsewhere.
join_opts = HashJoinNodeOptions(
    "inner", left_keys="key", right_keys="key",
    output_suffix_for_left="_left", output_suffix_for_right="_right",
    filter=pc.equal(pc.field('key_left'), 2))  # <-- fails: 'key_left' is not found in either input schema
joined = Declaration(
    "hashjoin", options=join_opts, inputs=[left_source, right_source])
result = joined.to_table()
If we don't use a filter at all, we are OK with the same column name on both sides, and output_suffix_for_left only helps with the output. @zanmato1984 |
Sorry, I made a mistake. You are right about this. Thanks for clarifying. If you want to write a similar test case, let's just work around the constraint and use unique column names. |
Thanks for confirming it! The tests I added for test |
LGTM.
(I pushed a commit merely changing some line orders.)
Thanks @zanmato1984! I really appreciate it! I will wait for other approvals. |
Rationale for this change
The C++ implementation supports a filter while performing a join, but it is not exposed to Python. I think it would be good to have, so users can avoid an explicit additional filter operation on their side.
What changes are included in this PR?
Support a filter expression in the Python binding.
Are these changes tested?
Yes, added a new test, test_hash_join_with_filter.
Are there any user-facing changes?
It exposes one more argument to users, i.e., filter_expression for Table.join and Dataset.join.
Note: I added [draft] to this change since I'd like to get feedback from reviewers first; then we can change the frontend calls, i.e., the Table and Dataset pxi files.