-
Notifications
You must be signed in to change notification settings - Fork 126
SNOW-1321662: Update merge behavior to pandas 2.x #1577
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leaving some questions.
78c78f2
to
3b5a11e
Compare
3b5a11e
to
cb9d891
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
left_df, right_df, how=how, left_index=True, right_index=True, sort=sort | ||
) | ||
if how == "inner" and sort is False: | ||
pytest.skip("pandas bug: https://github.com/pandas-dev/pandas/issues/55774") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this bug has been fixed in pandas 2.2.1. why is it a problem for us?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it's not fixed completely. ordering is incorrect for single to mulit join case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please comment on the pandas issue with a reproducer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH, I don't think we should spend much energy on fixing or worrying about pandas bugs. Pandas merge implementation (especially when multi-index is involved) is super buggy. We should strive to provide a consistent/documented behavior to snowpark pandas users not worry about pandas bugs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still recommend that we report pandas bugs when we find them.
- it's a service to the rest of the open source community. In this case it looks like BUG: pd.merge using "inner" with multi-index doesn't preserve left index order pandas-dev/pandas#55774 will be closed and there's no record of the fact that the issue hasn't really been fixed. We don't know how many users will also find this bug and not find an existing issue for it. If they want a fix, they have to report the issue and then wait for a fix. This project and Snowflake in general benefit a lot from pandas, and filing bugs is a simple way to give back to the pandas community.
- we use pandas methods in our implementation, so if someone fixes a pandas issue, that makes our implementation simpler or less buggy.
- we use pandas a lot in our tests, so if someone fixes a pandas issue, our tests can be simpler to maintain, understand, and extend. Also, having a self-contained bug report makes it easier to understand what pandas behaviors we would like to rely on but cannot, and to learn when a bug has been fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM besides @sfc-gh-mvashishtha's comments about native pandas bugs.
cb9d891
to
02c9b61
Compare
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes SNOW-1321662
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
Update merge behavior to match with pandas 2.x
change 1: merge/join with how='outer' always performs sorting.
change 2: for merge/join between frames with multi index, order of index columns is now:
right index columns + remaining left index columns (if how != right)
left index columns + remaining right index columns otherwise
Native pandas has fixed lot of ordering related bugs in 2.x https://github.com/pandas-dev/pandas/pull/54611/files
So also removed custom ordering logic from tests which is no longer required.
Also updated tests and removed xfails.