Joining on a nullable column #32306

dsaxton · 2020-02-27T18:46:41Z

import pandas as pd

left = pd.DataFrame({"a": [1, 2, pd.NA]}, dtype="Int64")
right = pd.DataFrame({"b": [1, 2, pd.NA]}, dtype="Int64")

pd.merge(left, right, how="inner", left_on="a", right_on="b")
#       a     b
# 0     1     1
# 1     2     2
# 2  <NA>  <NA>

(Above is from 1.0.1 and master)

I think when joining on a nullable column we should not be matching NA with NA and should only be joining where we have unambiguous equality (as in SQL). Also worth noting that this is the same as what happens when we have NaN which also seems incorrect, so could be an opportunity to fix this behavior?

pd.merge(left.astype(float), right.astype(float), how="inner", left_on="a", right_on="b")
#      a    b
# 0  1.0  1.0
# 1  2.0  2.0
# 2  NaN  NaN

Expected Output

      a     b
0     1     1
1     2     2

cc @jorisvandenbossche

The text was updated successfully, but these errors were encountered:

WillAyd · 2020-02-27T20:20:13Z

Yea I would agree that we should not join on NA

jorisvandenbossche · 2020-02-27T22:36:22Z

There is an existing issue about the fact that right now we see NaNs as matching, which can blow up your merge if you have NaNs in both left and right key column (since you get a cross product combination of rows).
I need to look up the issue, but I think there is some agreement that we don't like that behaviour, so that would be a nice thing to fix with NAs.

jorisvandenbossche · 2020-03-03T12:10:42Z

The issue is #22491, but we apparently have many other related, possibly duplicate, issues (from a quick search): #28522, #22618, #25138, #7473 (we should probably clean that up a bit)

But in any case, it would be good to directly have the "correct" behaviour for the new nullable dtypes (where we have less concerns about backwards compatibility)

phofl · 2020-10-18T12:54:01Z

cc @jorisvandenbossche c @WillAyd

The behavior should be, that NaN does not match NaN for all nan types? This would break a few tests. Can we do that from the perspective of backwards compatibility or should we warn first before fixing this?

WillAyd · 2020-10-19T21:16:24Z

For backwards incompatible changes like this we would just go through a deprecation cycle first

phofl · 2020-10-19T21:17:37Z

Ok thanks very much, will have a Look in the coming days

kasuteru · 2021-08-03T11:34:52Z

I just ran into this problem. This behaviour is unlike any SQL join behaviour I have encountered so far, and completely unexpected, since usually NaN == NaN evaluates to False. I only caught this bug in my code by sheer luck.

What is the status of this?

In my opinion, this should be deprecated as fast as possible. Meanwhile, I think a big disclaimer should be put into the merge documentation that this is the current behaviour, as a warning to everyone using this.

phofl · 2021-08-03T12:35:34Z

This is highly non trivial since we have a lot of functions matching NaN with NaN. If we would want to do this we have to discuss the scope of the deprecation first.

kasuteru · 2021-08-03T13:23:43Z

@phofl happy to discuss this.

What do you mean by "a lot of functions matching NaN with NaN"? I can think of df.compare, but there it makes sense to say np.Nan==np.Nan. Are there other functions we need to consider?

In my opinion, our target should be to have the "correct" behaviour for all nullable dtypes, even though this means going through a deprecation cycle.

But in any case, it would be good to directly have the "correct" behaviour for the new nullable dtypes (where we have less concerns about backwards compatibility)

@jorisvandenbossche For the sake of conistency, I would propose making sure that behaviour for the new nullable dtypes does not differ from the old, even if that means that the behaviour initially is "wrong" for the new types. Otherwise, this might cause even more confusion.

phofl · 2021-08-03T13:25:34Z

Some examples:

union
drop_duplicates
unique
intersection
and so on

kasuteru · 2021-08-03T14:40:10Z

Some examples:

union

drop_duplicates

unique

intersection

and so on

I think all of these are fine as-is, because (at least in my opinion) they behave just as someone would expect if they were familiar with Python, but not pandas.

In contrast, the behaviour of df.merge is the opposite of what one would usually expect. Looking at this thread and the similar ones that @mroeschke closed, I feel like there is universal agreement that the default should be to not match NaN with NaN in the case of a merge, or at least display a warning.

Solution proposal
A solution proposal would be to add an additional parameter to merge: match_na. This is how R's dplyr handles it. This way, the user can explicitely state what they want. To ensure backwards compatibility, in a first step, we could leave the default to True, but display a warning if the parameter is not explicitely set and NA values match. The results would be:

The behaviour does not change for old code if there are no NA rows that are matched to other NA rows
The behaviour does not change for old code if there are NA rows that match, but a warning is displayed.
If someone wants to get rid of the warning and the matching is intentional, they could explicitely specify match_na=True.
If someone wants SQL-like behaviour (which I would bet is the majority), they can specify match_na=False.

At a later point, we could discuss about changing the default behaviour - however, I think as long as a warning is displayed and there is an easy option to switch behaviour, this might not even be necessary.

Note: As a comparison, based on this thread, Matlab seems to behave like SQL for joins.

phofl · 2021-08-03T15:04:29Z

I am not opposed to changing it. I looked into this a few months back but got stuck on all the cases to consider, because merge uses some of the functions mentioned above quite heavily. If we diverge there, this gets pretty difficult real quick.

So the discuission about this may be easy, but the implementation is not as straighforward without adding a lot of complexity. If youl would like to try this, I think this would certainly be welcome.

TheGustafson · 2021-11-12T02:12:25Z

Has there been any additional discussion on this? I came across this behavior in my work, and I was genuinely surprised that this was the default behavior of Pandas.

The merge: match_na solution with a deprecation warning that @kasuteru proposed is a great idea.

In the interim, a warning should be put into the merge documentation. This should be a high priority...

NickCrews · 2022-01-18T18:20:04Z

Don't really have any suggestions here, but just wanted to expose this:

Currently pd.NA and np.nan are treated as the same key when merging, even though they are treated as different elsewhere (eg drop_duplicates()) :

import numpy as np
import pandas as pd

left = pd.DataFrame({"a": ["a", "b", pd.NA]}, dtype=object)
right = pd.DataFrame({"b": ["c", np.nan, pd.NA]}, dtype=object)

pd.merge(left, right, how="inner", left_on="a", right_on="b")
#      a     b
# 0  <NA>   NaN
# 1  <NA>  <NA>

The proposed match_na solution above would have to consider this case.

fg-pt · 2023-02-06T15:38:58Z

just checking what's the latest update on this? was it addressed?

phofl · 2023-02-06T15:47:46Z

The issue is still open

tcquinn · 2023-12-31T18:13:58Z

Not sure if this belongs here, but under certain conditions, NaN also matches on non-NaN values. Here is a small example (Pandas 2.1.4):

a = pd.DataFrame({
    'idx_a':['A', 'A', 'B'],
    'idx_b':['alpha', 'beta', 'gamma'],
    'idx_c': [1.0, 1.0, 1.0],
    'x':[10, 20, 30]
}).set_index(['idx_a', 'idx_b', 'idx_c'])

b = pd.DataFrame({
    'idx_b':['gamma', 'delta', 'epsilon', np.nan, np.nan],
    'idx_c': [1.0, 1.0, 1.0, 1.0, 1.0],
    'y':[100, 200, 300, 400, 500]
}).set_index(['idx_b', 'idx_c'])

c = a.join(
    b,
    how='inner',
    on=['idx_b', 'idx_c']
)

print(a)
                    x
idx_a idx_b idx_c    
A     alpha 1.0    10
      beta  1.0    20
B     gamma 1.0    30

print(b)
                y
idx_b   idx_c     
gamma   1.0    100
delta   1.0    200
epsilon 1.0    300
NaN     1.0    400
        1.0    500

print(c)
                    x    y
idx_a idx_b idx_c         
B     gamma 1.0    30  100
            1.0    30  400
            1.0    30  500

Mentioning here because other issues describing this behavior have been merged here (e.g., #25138), but all of the discussion above seems to refer only to NaN matching on NaN.

jorisvandenbossche added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Mar 2, 2020

jorisvandenbossche added this to the Contributions Welcome milestone Mar 3, 2020

mroeschke added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode and removed Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Apr 28, 2020

This was referenced Oct 25, 2020

DEPR: Add FutureWarning when matching NaN with NaN in merge/join #37405

Closed

ENH: Validate merge is not made on NA values in the merge keys #37518

Closed

lithomas1 added the Deprecate Functionality to remove in pandas label Feb 8, 2021

kasuteru mentioned this issue Aug 3, 2021

Add warning about NA merges #42865

Merged

mroeschke mentioned this issue Jan 20, 2022

REF: merge_asof dont use values_for_argsort #45475

Merged

4 tasks

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

Uh oh!

Joining on a nullable column #32306

Joining on a nullable column #32306

Comments

dsaxton commented Feb 27, 2020

Expected Output

WillAyd commented Feb 27, 2020

Uh oh!

jorisvandenbossche commented Feb 27, 2020

Uh oh!

jorisvandenbossche commented Mar 3, 2020

Uh oh!

phofl commented Oct 18, 2020

Uh oh!

WillAyd commented Oct 19, 2020

Uh oh!

phofl commented Oct 19, 2020

Uh oh!

kasuteru commented Aug 3, 2021

Uh oh!

phofl commented Aug 3, 2021

Uh oh!

kasuteru commented Aug 3, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

phofl commented Aug 3, 2021

Uh oh!

kasuteru commented Aug 3, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

phofl commented Aug 3, 2021

Uh oh!

TheGustafson commented Nov 12, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NickCrews commented Jan 18, 2022

Uh oh!

fg-pt commented Feb 6, 2023

Uh oh!

phofl commented Feb 6, 2023

Uh oh!

tcquinn commented Dec 31, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kasuteru commented Aug 3, 2021 •

edited

Loading

kasuteru commented Aug 3, 2021 •

edited

Loading

TheGustafson commented Nov 12, 2021 •

edited

Loading

tcquinn commented Dec 31, 2023 •

edited

Loading