two dataframes outer join on null values #7473

socheon · 2014-06-16T12:30:58Z

In pandas, unlike SQL, the rows seemed to be joining on null values. Is this a bug?
related SO: http://stackoverflow.com/questions/23940181/pandas-merging-with-missing-values/23940686#23940686

Code snippet

import pandas as pd 
import numpy as np

df1 = pd.DataFrame(
    [[1, None],
    [2, 'y']],
    columns = ['A', 'B']
)
print df1

df2 = pd.DataFrame(
    [['y', 'Y'],
    [None, 'None1'],
    [None, 'None2']],
    columns = ['B', 'C']
)
print df2

print df1.merge(df2, on='B', how='outer')

Output

   A     B
0  1  None
1  2     y

      B      C
0     y      Y
1  None  None1
2  None  None2

   A     B      C
0  1  None  None1
1  1  None  None2
2  2     y      Y

You can see row 0 in df1 unexpectedly joins to both rows in df2.

I would expect the correct answer to be

   A       B      C
0  1      None  None
1  2       y      Y
2 None None None1
3 None None None2

The text was updated successfully, but these errors were encountered:

jreback · 2014-06-16T12:37:58Z

see the linked question as well.

jreback · 2014-06-16T13:43:23Z

Using None as a missing value is not very standard in pandas, use np.nan

In [18]: df1
Out[18]: 
   A     B
0  1  None
1  2     y

In [19]: df1.fillna(np.nan)
Out[19]: 
   A    B
0  1  NaN
1  2    y

In [20]: df2
Out[20]: 
      B      C
0     y      Y
1  None  None1
2  None  None2

In [21]: df2.fillna(np.nan)
Out[21]: 
     B      C
0    y      Y
1  NaN  None1
2  NaN  None2

You can easily drop the missing values before merging.

In [22]: df1.fillna(np.nan).dropna().merge(df2.fillna(np.nan).dropna(),on='B',how='outer')
Out[22]: 
   A  B  C
0  2  y  Y

This is the default, as normally this would be up to the user to drop before merging.
(because its not always obvious what to drop, e.g. could easily drop based on the 'on' column, but, I suspect letting the user have control is better).

could do this a bit better I suppose.

socheon · 2014-06-16T14:20:30Z

Thanks. I think the outer join should follow the standard SQL otherwise it could be confusing.
There are some examples here

http://databases.about.com/library/weekly/aa051203a.htm
en.wikipedia.org/wiki/Join_(SQL)

In your example, you dropped the rows with null values before doing the outer join. But in the two examples above, the rows are not dropped.

In the first example, student Alan has course NULL. After joining to the instructor table, the Alan row is not dropped. The final result just says Alan has course NULL and instructor NULL.

In the second example, similarly we see that John with a Department ID NULL is not dropped in a full outer join.

A solution was suggested in the SO example above for left join
foo.merge(bar[pd.notnull(bar.id)], how='left', on='id')

but for a full outer join we need to do something like

pd.concat([
foo.merge(bar[pd.notnull(bar.id)], how='left', on='id'),
foo[pd.notnull(foo.id)].merge(bar, how='right', on='id')
])

But I think is unnatural and inconsistent with the standard SQL.

jreback · 2014-06-16T14:25:07Z

my example is very similar to what your soln shows.

I think this is not well defined ATM in pandas. Welcome a pull-request to address this. (I don't know if changing this to SQL behavior will break anything; if it does then that would need to be address).

pwrignall · 2020-08-19T13:50:17Z

@socheon Bit late after your comment but I notice this is still open and it helped me recently. I didn't quite get what I was looking for using your snippet but a bit of adjustment returned what I wanted and is more similar to the results expected doing a full outer join in SQL:

def merge_fullouter(df1, df2, on_col):
    """
    Does a SQLesque full outer join on two data frames without matching null values
    Input: df1, df2, Pandas data frames, data frames to full outer join
           on_col, string, the column on which to join the data frames by
    """
    return pd.concat([
        df1[pd.notnull(df1[on_col])].merge(df2[pd.notnull(df2[on_col])], how="outer", on=on_col),
        df1[pd.isnull(df1[on_col])],
        df2[pd.isnull(df2[on_col])]
])

mroeschke · 2021-07-21T04:37:41Z

Closing as duplicate of #32306 with a more recent discussion on the future policy we want.

jreback added Docs labels Jun 16, 2014

jreback added the API Design label Jun 16, 2014

jreback added this to the Someday milestone Jun 16, 2014

jreback added the Missing-data label Jun 16, 2014

jorisvandenbossche mentioned this issue Mar 3, 2020

Joining on a nullable column #32306

Open

mroeschke removed the API Design label Apr 11, 2021

mroeschke closed this as completed Jul 21, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

two dataframes outer join on null values #7473

two dataframes outer join on null values #7473

socheon commented Jun 16, 2014

jreback commented Jun 16, 2014

Uh oh!

jreback commented Jun 16, 2014

Uh oh!

socheon commented Jun 16, 2014

Uh oh!

jreback commented Jun 16, 2014

Uh oh!

pwrignall commented Aug 19, 2020

Uh oh!

mroeschke commented Jul 21, 2021

Uh oh!

Uh oh!

two dataframes outer join on null values #7473

two dataframes outer join on null values #7473

Comments

socheon commented Jun 16, 2014

jreback commented Jun 16, 2014

Uh oh!

jreback commented Jun 16, 2014

Uh oh!

socheon commented Jun 16, 2014

Uh oh!

jreback commented Jun 16, 2014

Uh oh!

pwrignall commented Aug 19, 2020

Uh oh!

mroeschke commented Jul 21, 2021

Uh oh!