Skip to content

two dataframes outer join on null values #7473

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
socheon opened this issue Jun 16, 2014 · 6 comments
Closed

two dataframes outer join on null values #7473

socheon opened this issue Jun 16, 2014 · 6 comments
Labels
Docs Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@socheon
Copy link

socheon commented Jun 16, 2014

In pandas, unlike SQL, the rows seemed to be joining on null values. Is this a bug?
related SO: http://stackoverflow.com/questions/23940181/pandas-merging-with-missing-values/23940686#23940686

Code snippet

import pandas as pd 
import numpy as np

df1 = pd.DataFrame(
    [[1, None],
    [2, 'y']],
    columns = ['A', 'B']
)
print df1

df2 = pd.DataFrame(
    [['y', 'Y'],
    [None, 'None1'],
    [None, 'None2']],
    columns = ['B', 'C']
)
print df2

print df1.merge(df2, on='B', how='outer')

Output

   A     B
0  1  None
1  2     y

      B      C
0     y      Y
1  None  None1
2  None  None2

   A     B      C
0  1  None  None1
1  1  None  None2
2  2     y      Y

You can see row 0 in df1 unexpectedly joins to both rows in df2.

I would expect the correct answer to be

   A       B      C
0  1      None  None
1  2       y      Y
2 None None None1
3 None None None2
@jreback
Copy link
Contributor

jreback commented Jun 16, 2014

see the linked question as well.

@jreback
Copy link
Contributor

jreback commented Jun 16, 2014

Using None as a missing value is not very standard in pandas, use np.nan

In [18]: df1
Out[18]: 
   A     B
0  1  None
1  2     y

In [19]: df1.fillna(np.nan)
Out[19]: 
   A    B
0  1  NaN
1  2    y

In [20]: df2
Out[20]: 
      B      C
0     y      Y
1  None  None1
2  None  None2

In [21]: df2.fillna(np.nan)
Out[21]: 
     B      C
0    y      Y
1  NaN  None1
2  NaN  None2

You can easily drop the missing values before merging.

In [22]: df1.fillna(np.nan).dropna().merge(df2.fillna(np.nan).dropna(),on='B',how='outer')
Out[22]: 
   A  B  C
0  2  y  Y

This is the default, as normally this would be up to the user to drop before merging.
(because its not always obvious what to drop, e.g. could easily drop based on the 'on' column, but, I suspect letting the user have control is better).

could do this a bit better I suppose.

@socheon
Copy link
Author

socheon commented Jun 16, 2014

Thanks. I think the outer join should follow the standard SQL otherwise it could be confusing.
There are some examples here

In your example, you dropped the rows with null values before doing the outer join. But in the two examples above, the rows are not dropped.

In the first example, student Alan has course NULL. After joining to the instructor table, the Alan row is not dropped. The final result just says Alan has course NULL and instructor NULL.

In the second example, similarly we see that John with a Department ID NULL is not dropped in a full outer join.

A solution was suggested in the SO example above for left join
foo.merge(bar[pd.notnull(bar.id)], how='left', on='id')

but for a full outer join we need to do something like

pd.concat([
foo.merge(bar[pd.notnull(bar.id)], how='left', on='id'),
foo[pd.notnull(foo.id)].merge(bar, how='right', on='id')
])

But I think is unnatural and inconsistent with the standard SQL.

@jreback
Copy link
Contributor

jreback commented Jun 16, 2014

my example is very similar to what your soln shows.

I think this is not well defined ATM in pandas. Welcome a pull-request to address this. (I don't know if changing this to SQL behavior will break anything; if it does then that would need to be address).

@pwrignall
Copy link

@socheon Bit late after your comment but I notice this is still open and it helped me recently. I didn't quite get what I was looking for using your snippet but a bit of adjustment returned what I wanted and is more similar to the results expected doing a full outer join in SQL:

def merge_fullouter(df1, df2, on_col):
    """
    Does a SQLesque full outer join on two data frames without matching null values
    Input: df1, df2, Pandas data frames, data frames to full outer join
           on_col, string, the column on which to join the data frames by
    """
    return pd.concat([
        df1[pd.notnull(df1[on_col])].merge(df2[pd.notnull(df2[on_col])], how="outer", on=on_col),
        df1[pd.isnull(df1[on_col])],
        df2[pd.isnull(df2[on_col])]
])

@mroeschke
Copy link
Member

Closing as duplicate of #32306 with a more recent discussion on the future policy we want.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests

4 participants