Skip to content

SNOW-1321662: Update merge behavior to pandas 2.x #1577

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 15, 2024

Conversation

sfc-gh-nkumar
Copy link
Contributor

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1321662

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
  3. Please describe how your code solves the related issue.

Update merge behavior to match with pandas 2.x
change 1: merge/join with how='outer' always performs sorting.
change 2: for merge/join between frames with multi index, order of index columns is now:
right index columns + remaining left index columns (if how != right)
left index columns + remaining right index columns otherwise

Native pandas has fixed lot of ordering related bugs in 2.x https://github.com/pandas-dev/pandas/pull/54611/files
So also removed custom ordering logic from tests which is no longer required.

Also updated tests and removed xfails.

Copy link
Contributor

@sfc-gh-mvashishtha sfc-gh-mvashishtha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leaving some questions.

@sfc-gh-nkumar sfc-gh-nkumar force-pushed the nkumar-SNOW-1321662-merge-tests branch 3 times, most recently from 78c78f2 to 3b5a11e Compare May 14, 2024 20:36
@sfc-gh-nkumar sfc-gh-nkumar added the NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md label May 14, 2024
@sfc-gh-nkumar sfc-gh-nkumar force-pushed the nkumar-SNOW-1321662-merge-tests branch from 3b5a11e to cb9d891 Compare May 14, 2024 21:35
Copy link
Contributor

@sfc-gh-mvashishtha sfc-gh-mvashishtha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

left_df, right_df, how=how, left_index=True, right_index=True, sort=sort
)
if how == "inner" and sort is False:
pytest.skip("pandas bug: https://github.com/pandas-dev/pandas/issues/55774")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this bug has been fixed in pandas 2.2.1. why is it a problem for us?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it's not fixed completely. ordering is incorrect for single to mulit join case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please comment on the pandas issue with a reproducer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH, I don't think we should spend much energy on fixing or worrying about pandas bugs. Pandas merge implementation (especially when multi-index is involved) is super buggy. We should strive to provide a consistent/documented behavior to snowpark pandas users not worry about pandas bugs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still recommend that we report pandas bugs when we find them.

  1. it's a service to the rest of the open source community. In this case it looks like BUG: pd.merge using "inner" with multi-index doesn't preserve left index order pandas-dev/pandas#55774 will be closed and there's no record of the fact that the issue hasn't really been fixed. We don't know how many users will also find this bug and not find an existing issue for it. If they want a fix, they have to report the issue and then wait for a fix. This project and Snowflake in general benefit a lot from pandas, and filing bugs is a simple way to give back to the pandas community.
  2. we use pandas methods in our implementation, so if someone fixes a pandas issue, that makes our implementation simpler or less buggy.
  3. we use pandas a lot in our tests, so if someone fixes a pandas issue, our tests can be simpler to maintain, understand, and extend. Also, having a self-contained bug report makes it easier to understand what pandas behaviors we would like to rely on but cannot, and to learn when a bug has been fixed.

Copy link
Contributor

@sfc-gh-joshi sfc-gh-joshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM besides @sfc-gh-mvashishtha's comments about native pandas bugs.

@sfc-gh-nkumar sfc-gh-nkumar force-pushed the nkumar-SNOW-1321662-merge-tests branch from cb9d891 to 02c9b61 Compare May 14, 2024 23:47
@sfc-gh-nkumar sfc-gh-nkumar merged commit e7dffea into main May 15, 2024
31 checks passed
@sfc-gh-nkumar sfc-gh-nkumar deleted the nkumar-SNOW-1321662-merge-tests branch May 15, 2024 03:28
@github-actions github-actions bot locked and limited conversation to collaborators May 15, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants