Skip to content

Joining on a nullable column #32306

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
dsaxton opened this issue Feb 27, 2020 · 17 comments
Open

Joining on a nullable column #32306

dsaxton opened this issue Feb 27, 2020 · 17 comments
Labels
Bug Deprecate Functionality to remove in pandas NA - MaskedArrays Related to pd.NA and nullable extension arrays Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@dsaxton
Copy link
Member

dsaxton commented Feb 27, 2020

import pandas as pd

left = pd.DataFrame({"a": [1, 2, pd.NA]}, dtype="Int64")
right = pd.DataFrame({"b": [1, 2, pd.NA]}, dtype="Int64")

pd.merge(left, right, how="inner", left_on="a", right_on="b")
#       a     b
# 0     1     1
# 1     2     2
# 2  <NA>  <NA>

(Above is from 1.0.1 and master)

I think when joining on a nullable column we should not be matching NA with NA and should only be joining where we have unambiguous equality (as in SQL). Also worth noting that this is the same as what happens when we have NaN which also seems incorrect, so could be an opportunity to fix this behavior?

pd.merge(left.astype(float), right.astype(float), how="inner", left_on="a", right_on="b")
#      a    b
# 0  1.0  1.0
# 1  2.0  2.0
# 2  NaN  NaN

Expected Output

      a     b
0     1     1
1     2     2

cc @jorisvandenbossche

@WillAyd
Copy link
Member

WillAyd commented Feb 27, 2020

Yea I would agree that we should not join on NA

@jorisvandenbossche
Copy link
Member

There is an existing issue about the fact that right now we see NaNs as matching, which can blow up your merge if you have NaNs in both left and right key column (since you get a cross product combination of rows).
I need to look up the issue, but I think there is some agreement that we don't like that behaviour, so that would be a nice thing to fix with NAs.

@jorisvandenbossche jorisvandenbossche added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Mar 2, 2020
@jorisvandenbossche
Copy link
Member

The issue is #22491, but we apparently have many other related, possibly duplicate, issues (from a quick search): #28522, #22618, #25138, #7473 (we should probably clean that up a bit)

But in any case, it would be good to directly have the "correct" behaviour for the new nullable dtypes (where we have less concerns about backwards compatibility)

@jorisvandenbossche jorisvandenbossche added this to the Contributions Welcome milestone Mar 3, 2020
@mroeschke mroeschke added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode and removed Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Apr 28, 2020
@phofl
Copy link
Member

phofl commented Oct 18, 2020

cc @jorisvandenbossche c @WillAyd

The behavior should be, that NaN does not match NaN for all nan types? This would break a few tests. Can we do that from the perspective of backwards compatibility or should we warn first before fixing this?

@WillAyd
Copy link
Member

WillAyd commented Oct 19, 2020

For backwards incompatible changes like this we would just go through a deprecation cycle first

@phofl
Copy link
Member

phofl commented Oct 19, 2020

Ok thanks very much, will have a Look in the coming days

@kasuteru
Copy link
Contributor

kasuteru commented Aug 3, 2021

I just ran into this problem. This behaviour is unlike any SQL join behaviour I have encountered so far, and completely unexpected, since usually NaN == NaN evaluates to False. I only caught this bug in my code by sheer luck.

What is the status of this?

In my opinion, this should be deprecated as fast as possible. Meanwhile, I think a big disclaimer should be put into the merge documentation that this is the current behaviour, as a warning to everyone using this.

@phofl
Copy link
Member

phofl commented Aug 3, 2021

This is highly non trivial since we have a lot of functions matching NaN with NaN. If we would want to do this we have to discuss the scope of the deprecation first.

@kasuteru
Copy link
Contributor

kasuteru commented Aug 3, 2021

@phofl happy to discuss this.

What do you mean by "a lot of functions matching NaN with NaN"? I can think of df.compare, but there it makes sense to say np.Nan==np.Nan. Are there other functions we need to consider?

In my opinion, our target should be to have the "correct" behaviour for all nullable dtypes, even though this means going through a deprecation cycle.

But in any case, it would be good to directly have the "correct" behaviour for the new nullable dtypes (where we have less concerns about backwards compatibility)

@jorisvandenbossche For the sake of conistency, I would propose making sure that behaviour for the new nullable dtypes does not differ from the old, even if that means that the behaviour initially is "wrong" for the new types. Otherwise, this might cause even more confusion.

@phofl
Copy link
Member

phofl commented Aug 3, 2021

Some examples:

  • union
  • drop_duplicates
  • unique
  • intersection
  • and so on

@kasuteru
Copy link
Contributor

kasuteru commented Aug 3, 2021

Some examples:

  • union
  • drop_duplicates
  • unique
  • intersection
  • and so on

I think all of these are fine as-is, because (at least in my opinion) they behave just as someone would expect if they were familiar with Python, but not pandas.

In contrast, the behaviour of df.merge is the opposite of what one would usually expect. Looking at this thread and the similar ones that @mroeschke closed, I feel like there is universal agreement that the default should be to not match NaN with NaN in the case of a merge, or at least display a warning.

Solution proposal
A solution proposal would be to add an additional parameter to merge: match_na. This is how R's dplyr handles it. This way, the user can explicitely state what they want. To ensure backwards compatibility, in a first step, we could leave the default to True, but display a warning if the parameter is not explicitely set and NA values match. The results would be:

  • The behaviour does not change for old code if there are no NA rows that are matched to other NA rows
  • The behaviour does not change for old code if there are NA rows that match, but a warning is displayed.
  • If someone wants to get rid of the warning and the matching is intentional, they could explicitely specify match_na=True.
  • If someone wants SQL-like behaviour (which I would bet is the majority), they can specify match_na=False.

At a later point, we could discuss about changing the default behaviour - however, I think as long as a warning is displayed and there is an easy option to switch behaviour, this might not even be necessary.

Note: As a comparison, based on this thread, Matlab seems to behave like SQL for joins.

@phofl
Copy link
Member

phofl commented Aug 3, 2021

I am not opposed to changing it. I looked into this a few months back but got stuck on all the cases to consider, because merge uses some of the functions mentioned above quite heavily. If we diverge there, this gets pretty difficult real quick.

So the discuission about this may be easy, but the implementation is not as straighforward without adding a lot of complexity. If youl would like to try this, I think this would certainly be welcome.

@TheGustafson
Copy link

TheGustafson commented Nov 12, 2021

Has there been any additional discussion on this? I came across this behavior in my work, and I was genuinely surprised that this was the default behavior of Pandas.

The merge: match_na solution with a deprecation warning that @kasuteru proposed is a great idea.

In the interim, a warning should be put into the merge documentation. This should be a high priority...

@NickCrews
Copy link
Contributor

Don't really have any suggestions here, but just wanted to expose this:

Currently pd.NA and np.nan are treated as the same key when merging, even though they are treated as different elsewhere (eg drop_duplicates()) :

import numpy as np
import pandas as pd

left = pd.DataFrame({"a": ["a", "b", pd.NA]}, dtype=object)
right = pd.DataFrame({"b": ["c", np.nan, pd.NA]}, dtype=object)

pd.merge(left, right, how="inner", left_on="a", right_on="b")
#      a     b
# 0  <NA>   NaN
# 1  <NA>  <NA>

The proposed match_na solution above would have to consider this case.

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@fg-pt
Copy link

fg-pt commented Feb 6, 2023

just checking what's the latest update on this? was it addressed?

@phofl
Copy link
Member

phofl commented Feb 6, 2023

The issue is still open

@tcquinn
Copy link

tcquinn commented Dec 31, 2023

Not sure if this belongs here, but under certain conditions, NaN also matches on non-NaN values. Here is a small example (Pandas 2.1.4):

a = pd.DataFrame({
    'idx_a':['A', 'A', 'B'],
    'idx_b':['alpha', 'beta', 'gamma'],
    'idx_c': [1.0, 1.0, 1.0],
    'x':[10, 20, 30]
}).set_index(['idx_a', 'idx_b', 'idx_c'])

b = pd.DataFrame({
    'idx_b':['gamma', 'delta', 'epsilon', np.nan, np.nan],
    'idx_c': [1.0, 1.0, 1.0, 1.0, 1.0],
    'y':[100, 200, 300, 400, 500]
}).set_index(['idx_b', 'idx_c'])

c = a.join(
    b,
    how='inner',
    on=['idx_b', 'idx_c']
)

print(a)
                    x
idx_a idx_b idx_c    
A     alpha 1.0    10
      beta  1.0    20
B     gamma 1.0    30

print(b)
                y
idx_b   idx_c     
gamma   1.0    100
delta   1.0    200
epsilon 1.0    300
NaN     1.0    400
        1.0    500

print(c)
                    x    y
idx_a idx_b idx_c         
B     gamma 1.0    30  100
            1.0    30  400
            1.0    30  500

Mentioning here because other issues describing this behavior have been merged here (e.g., #25138), but all of the discussion above seems to refer only to NaN matching on NaN.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Deprecate Functionality to remove in pandas NA - MaskedArrays Related to pd.NA and nullable extension arrays Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests