sort_index for MultiIndex DataFrame silently fills NaN in index column #25818

Open
dddping opened this issue Mar 21, 2019 · 1 comment

dddping commented Mar 21, 2019

Code Sample

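import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_tuples([('A0', 'B0'), ('A0', 'B1'),
                   ('A1', 'B0'), ('A1', 'B1'), ('A3', np.nan)], names=['ia', 'ib'])
df = pd.DataFrame(np.arange(10).reshape(5, 2), index=mi, columns=['bar', 'foo'])
df2 = df.copy()
# Swap the labels of level 'ib' so its levels are unsorted (assignment instead of inplace=True).
df2.index = df2.index.set_levels(['B1', 'B0'], level=1)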

df2
        bar  foo
ia ib
A0 B1     0    1
   B0     2    3
A1 B1     4    5
   B0     6    7
A3 NaN    8    9

# Calling sort_index on df2 fills in B0 for the (A3, NaN) row; calling it on df does not.
df2.sort_index()
       bar  foo
ia ib
A0 B0    2    3
   B1    0    1
A1 B0    6    7
   B1    4    5
A3 B0    8    9

Problem description

I suspect that unsorted levels in the MultiIndex of the DataFrame cause the NaN to be filled in automatically.
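To make the suspicion concrete, here is a minimal pure-Python sketch of how a level could be stored as sorted-or-unsorted labels (`levels`) plus per-row positions (`codes`), with -1 standing for NaN. The functions `decode`, `sort_level_naive`, and `sort_level_safe` are hypothetical names made up for this illustration, and the naive remapping is only an assumption about how such a fill could arise. It is not taken from the pandas source, although it does reproduce the values reported in this issue.

```python
# Toy model of one MultiIndex level: `levels` holds the distinct labels,
# `codes` holds one position per row, and -1 marks a missing value (NaN).
# Illustration only, NOT pandas's actual implementation.

def decode(levels, codes):
    """Turn codes back into labels; -1 means missing (None here)."""
    return [None if c == -1 else levels[c] for c in codes]

def sort_level_naive(levels, codes):
    """Sort the labels and remap codes by plain lookup (hypothetical bug)."""
    order = sorted(range(len(levels)), key=lambda i: levels[i])
    new_levels = [levels[i] for i in order]
    reverse = [0] * len(levels)
    for new_pos, old_pos in enumerate(order):
        reverse[old_pos] = new_pos
    # reverse[-1] wraps around to the last entry instead of staying -1.
    return new_levels, [reverse[c] for c in codes]

def sort_level_safe(levels, codes):
    """Same remapping, but -1 is passed through untouched."""
    new_levels, _ = sort_level_naive(levels, [])
    position = {label: pos for pos, label in enumerate(new_levels)}
    return new_levels, [-1 if c == -1 else position[levels[c]] for c in codes]

# Level 'ib' of df2: labels were swapped to ['B1', 'B0'], last row is NaN.
levels, codes = ["B1", "B0"], [0, 1, 0, 1, -1]
naive = decode(*sort_level_naive(levels, codes))
safe = decode(*sort_level_safe(levels, codes))
print(naive)  # ['B1', 'B0', 'B1', 'B0', 'B0']
print(safe)   # ['B1', 'B0', 'B1', 'B0', None]
```

In this model the missing entry turns into 'B0' only when the level has to be re-sorted, which would explain why sort_index on df (whose levels are already sorted) leaves the NaN alone.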

Note: I ended up with a DataFrame whose MultiIndex levels are unsorted after a broadcast operation.
Here is the code:

def mklbl(prefix, n):
    return ["%s%s" % (prefix, i) for i in range(n)]

miindex = pd.MultiIndex.from_product([mklbl('A', 3),
                                      mklbl('B', 2),
                                      mklbl('C', 2),
                                      mklbl('D', 2)],names=['ia','ib','ic','id'])

micolumns = ['foo','bah']

dfmi = pd.DataFrame(np.arange(len(miindex) * len(micolumns))
                      .reshape((len(miindex), len(micolumns))),
                    index=miindex,
                    columns=micolumns).sort_index().sort_index(axis=1)

dfmi = dfmi.drop('A2')
# The first index level now lists more values than the data actually holds,
# which may cause the level to change from ['A0', 'A1', 'A2'] to ['A1', 'A0'].
bs = dfmi.loc[('A0', 'B0')].copy().rename({'D1': 'D2'})
# The broadcast below produces NaN in some rows.
(dfmi + bs).sort_index()
               bah   foo
ic id ia ib
C0 D0 A0 B0    2.0   0.0
         B1   10.0   8.0
      A1 B0   18.0  16.0
         B1   26.0  24.0
   D1 A0 B0    NaN   NaN
         B1    NaN   NaN
      A1 B0    NaN   NaN
         B1    NaN   NaN
   D2 A0 NaN   NaN   NaN
C1 D0 A0 B0   10.0   8.0
         B1   18.0  16.0
      A1 B0   26.0  24.0
         B1   34.0  32.0
   D1 A0 B0    NaN   NaN
         B1    NaN   NaN
      A1 B0    NaN   NaN
         B1    NaN   NaN
   D2 A0 NaN   NaN   NaN # again, the (D2, NaN, NaN) row was changed

Expected Output

sort_index should not change the contents of the DataFrame.
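The expectation above can be phrased as a small check: sorting may reorder the index entries but must not change which entries exist. The sketch below is one way to test that on plain lists of tuples. The helper names `_normalize` and `same_index_entries` are hypothetical, and the `before` and `after` lists are copied by hand from the df2 output in this report rather than computed.

```python
import math
from collections import Counter

def _normalize(entry):
    """Replace float NaN with a marker, since NaN != NaN breaks comparisons."""
    return tuple(
        "<NaN>" if isinstance(v, float) and math.isnan(v) else v for v in entry
    )

def same_index_entries(before, after):
    """True if both sequences of index tuples hold the same entries, in any order."""
    return Counter(map(_normalize, before)) == Counter(map(_normalize, after))

# Entries of df2.index before and after sort_index(), copied from the report.
before = [("A0", "B1"), ("A0", "B0"), ("A1", "B1"), ("A1", "B0"), ("A3", float("nan"))]
after = [("A0", "B0"), ("A0", "B1"), ("A1", "B0"), ("A1", "B1"), ("A3", "B0")]

# A pure reordering keeps the entries; the reported result does not.
print(same_index_entries(before, sorted(before[:4]) + before[4:]))  # True
print(same_index_entries(before, after))  # False
```

With pandas available, the same check could presumably be run as `same_index_entries(df2.index.tolist(), df2.sort_index().index.tolist())`.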

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.1
pytest: 4.1.0
pip: 10.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.16.2
scipy: None
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.3
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: 4.7.1
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@erezinman

Another, perhaps simpler example:

import pandas as pd
import numpy as np
idx = pd.MultiIndex(levels=[['a', 'b', 'c'], [16., 32., 4.]], codes=[[0, 1, 2, 2], [-1, -1, 1, 1]]) 
df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=idx) 
df 
#      a    b     c
#    NaN  NaN  32.0  32.0
# 0    0    1     2     3
# 1    4    5     6     7
# 2    8    9    10    11
# 3   12   13    14    15

df.sort_index(axis=1) 
#      a    b     c
#    4.0  4.0  32.0  32.0
# 0    0    1     2     3
# 1    4    5     6     7
# 2    8    9    10    11
# 3   12   13    14    15

The problem still persists three years later.
