Joining on a nullable column #32306
Comments
Yea, I would agree that we should not join on NA.
There is an existing issue about the fact that right now we see NaNs as matching, which can blow up your merge if you have NaNs in both the left and right key columns (since you get a cross-product combination of rows).
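To make the cross product concrete, here is a minimal sketch with float keys. The frame and column names are made up for illustration; the row counts follow from the NaN-matches-NaN behaviour described above.

```python
import numpy as np
import pandas as pd

# Two NaN keys on each side, plus one ordinary key that matches once.
left = pd.DataFrame({"key": [1.0, np.nan, np.nan], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [1.0, np.nan, np.nan], "right_val": ["x", "y", "z"]})

merged = left.merge(right, on="key", how="inner")

# Rows whose key is NaN: every left NaN row is paired with every right NaN row.
nan_rows = merged[merged["key"].isna()]

print(len(merged))    # 5: one row for key 1.0 plus 2 x 2 NaN pairings
print(len(nan_rows))  # 4
```

With n NaN keys on the left and m on the right, the merge gains n * m extra rows, which is why this can blow up quickly on real data.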
The issue is #22491, but we apparently have many other related, possibly duplicate, issues (from a quick search): #28522, #22618, #25138, #7473 (we should probably clean that up a bit). But in any case, it would be good to directly have the "correct" behaviour for the new nullable dtypes (where we have fewer concerns about backwards compatibility).
cc @jorisvandenbossche @WillAyd Should the behavior be that NaN does not match NaN for all NaN types? This would break a few tests. Can we do that from the perspective of backwards compatibility, or should we warn first before fixing this?
For backwards-incompatible changes like this, we would go through a deprecation cycle first.
OK, thanks very much, will have a look in the coming days.
I just ran into this problem. This behaviour is unlike any SQL join behaviour I have encountered so far, and completely unexpected, since usually What is the status of this? In my opinion, this should be deprecated as fast as possible. Meanwhile, I think a big disclaimer should be put into the merge documentation that this is the current behaviour, as a warning to everyone using this.
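For comparison, the SQL behaviour referred to here can be checked with Python's built-in sqlite3 module. This is only an illustrative sketch; the table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE left_t (key INTEGER, left_val TEXT)")
conn.execute("CREATE TABLE right_t (key INTEGER, right_val TEXT)")

# Python None is stored as SQL NULL.
conn.executemany("INSERT INTO left_t VALUES (?, ?)", [(1, "a"), (None, "b")])
conn.executemany("INSERT INTO right_t VALUES (?, ?)", [(1, "x"), (None, "y")])

# In SQL, NULL = NULL is not true, so the NULL-keyed rows are never paired.
rows = conn.execute(
    "SELECT l.key, l.left_val, r.right_val "
    "FROM left_t AS l JOIN right_t AS r ON l.key = r.key"
).fetchall()
conn.close()

print(rows)  # [(1, 'a', 'x')]
```

Only the row with key 1 survives the inner join; the two NULL-keyed rows are dropped rather than matched with each other.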
This is highly non-trivial, since we have a lot of functions matching NaN with NaN. If we want to do this, we have to discuss the scope of the deprecation first.
@phofl happy to discuss this. What do you mean by "a lot of functions matching NaN with NaN"? I can think of In my opinion, our target should be to have the "correct" behaviour for all nullable dtypes, even though this means going through a deprecation cycle.
@jorisvandenbossche For the sake of consistency, I would propose making sure that the behaviour for the new nullable dtypes does not differ from the old, even if that means that the behaviour initially is "wrong" for the new types. Otherwise, this might cause even more confusion.
Some examples:
I think all of these are fine as-is, because (at least in my opinion) they behave just as someone would expect if they were familiar with Python, but not pandas. In contrast, the behaviour of Solution proposal
At a later point, we could discuss changing the default behaviour. However, I think as long as a warning is displayed and there is an easy option to switch behaviour, this might not even be necessary. Note: As a comparison, based on this thread, Matlab seems to behave like SQL for joins.
I am not opposed to changing it. I looked into this a few months back but got stuck on all the cases to consider, because merge uses some of the functions mentioned above quite heavily. If we diverge there, this gets pretty difficult real quick. So the discussion about this may be easy, but the implementation is not as straightforward without adding a lot of complexity. If you would like to try this, I think this would certainly be welcome.
Has there been any additional discussion on this? I came across this behavior in my work, and I was genuinely surprised that this was the default behavior of Pandas. The merge: match_na solution with a deprecation warning that @kasuteru proposed is a great idea. In the interim, a warning should be put into the merge documentation. This should be a high priority...
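Until something like the proposed option exists, one possible stopgap for inner joins is to drop rows with null keys on both sides before merging. A minimal sketch, assuming a single key column; `inner_merge_skip_null_keys` is a hypothetical helper, not part of pandas, and it is not the proposed match_na parameter.

```python
import numpy as np
import pandas as pd


def inner_merge_skip_null_keys(left, right, on):
    """Hypothetical helper: inner merge that never pairs null keys.

    Rows whose key is null are removed from both sides first, so the
    result only contains rows with an unambiguous key match.
    """
    left_clean = left.dropna(subset=[on])
    right_clean = right.dropna(subset=[on])
    return left_clean.merge(right_clean, on=on, how="inner")


left = pd.DataFrame({"key": [1.0, 2.0, np.nan, np.nan], "left_val": list("abcd")})
right = pd.DataFrame({"key": [1.0, np.nan, np.nan, 3.0], "right_val": list("wxyz")})

result = inner_merge_skip_null_keys(left, right, on="key")
print(result)
```

For left or outer joins, the null-key rows would have to be kept and carried through unmatched, which this sketch does not attempt.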
Don't really have any suggestions here, but just wanted to expose this: Currently
The proposed
Just checking: what's the latest update on this? Was it addressed?
The issue is still open.
Not sure if this belongs here, but under certain conditions,
Mentioning here because other issues describing this behavior have been merged here (e.g., #25138), but all of the discussion above seems to refer only to
(Above is from 1.0.1 and master)
I think when joining on a nullable column we should not be matching NA with NA and should only be joining where we have unambiguous equality (as in SQL). Also worth noting that this is the same as what happens when we have NaN, which also seems incorrect, so could be an opportunity to fix this behavior?
Expected Output
cc @jorisvandenbossche