Skip to content

How should I store frames with multiindex columns in CSV? #21976

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
alexitkes opened this issue Jul 19, 2018 · 8 comments
Closed

How should I store frames with multiindex columns in CSV? #21976

alexitkes opened this issue Jul 19, 2018 · 8 comments
Labels
IO Data IO issues that don't fit into a more specific label

Comments

@alexitkes
Copy link
Contributor

alexitkes commented Jul 19, 2018

Hello.

I tried to save a dataframe with MultiIndex used as columns to a CSV file and load it back, but I had no luck.

# Create a fame with multiindex columns
frame = pd.DataFrame({('AAPL', 'OPEN'): [1, 2, 3, 4], ('AAPL', 'CLOSE'): [1, 2, 3, 4], ('MSFT', 'OPEN'): [1, 2, 3, 4], ('MSFT', 'CLOSE'): [1, 2, 3, 4]})
# Make sure it was created as wanted.
frame
#   AAPL       MSFT     
#   CLOSE OPEN CLOSE OPEN
# 0     1    1     1    1
# 1     2    2     2    2
# 2     3    3     3    3
# 3     4    4     4    4

# Try to convert the frame to CSV
s1 = frame.to_csv()
s2 = frame.to_csv(tupleize_cols=True)
# FutureWarning displayed - tupleize_cols is deprecated.

print(s1)
# ,AAPL,AAPL,MSFT,MSFT
# ,CLOSE,OPEN,CLOSE,OPEN
# 0,1,1,1,1
# 1,2,2,2,2
# 2,3,3,3,3
# 3,4,4,4,4

print(s2)
# ,"('AAPL', 'CLOSE')","('AAPL', 'OPEN')","('MSFT', 'CLOSE')","('MSFT', 'OPEN')"
# 0,1,1,1,1
# 1,2,2,2,2
# 2,3,3,3,3
# 3,4,4,4,4

# Read the CSV strings back to DataFrames
f1 = pd.read_csv(StringIO(s1))
f2 = pd.read_csv(StringIO(s2), tupleize_cols=True)
# Warning about tupleize_cols here

# Both frames does not look like the original one.
f1
#    Unnamed: 0   AAPL AAPL.1   MSFT MSFT.1
# 0         NaN  CLOSE   OPEN  CLOSE   OPEN
# 1         0.0      1      1      1      1
# 2         1.0      2      2      2      2
# 3         2.0      3      3      3      3
# 4         3.0      4      4      4      4

f2
#    Unnamed: 0  ('AAPL', 'CLOSE')  ('AAPL', 'OPEN')  ('MSFT', 'CLOSE')  ('MSFT', 'OPEN')
# 0           0                  1                 1                  1                 1
# 1           1                  2                 2                  2                 2
# 2           2                  3                 3                  3                 3
# 3           3                  4                 4                  4                 4

As you see, both frames don't have multiindexed columns as original one. So, how should I save a DataFrame with multiindexed columns to CSV file and load it back to get a frame same to the original one?

I also tried to save as JSON, but also encountered problems. Here is what the frame shown above is converted to.

frame.to_json()
'{"["AAPL","CLOSE"]":{"0":1,"1":2,"2":3,"3":4},"["AAPL","OPEN"]":{"0":1,"1":2,"2":3,"3":4},"["MSFT","CLOSE"]":{"0":1,"1":2,"2":3,"3":4},"["MSFT","OPEN"]":{"0":1,"1":2,"2":3,"3":4}}'

So, tupleized multiindexed column names are obviously incorrectly quoted.

With best regards,

Alex.

INSTALLED VERSIONS

commit: None
python: 3.4.2.final.0
python-bits: 32
OS: Linux
OS-release: 3.16.0-6-686-pae
machine: i686
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0.dev0+318.g272bbdc
pytest: 3.6.3
pip: 1.5.6
setuptools: 5.5.1
Cython: 0.28.4
numpy: 1.14.5
scipy: None
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@alexitkes
Copy link
Contributor Author

I tried to experiment more with JSON, but found the only way to produce a valid JSON output is to specify orient='split', but it is still read incorrectly.

s = frame.to_json(orient='split')
# s = '{"columns":[["AAPL","CLOSE"],["AAPL","OPEN"],["MSFT","CLOSE"],["MSFT","OPEN"]],"index":[0,1,2,3],"data":[[1,1,1,1],[2,2,2,2],[3,3,3,3],[4,4,4,4]]}'

pd.read_json(StringIO(s), orient='split')
#  AAPL CLOSE
#  AAPL  OPEN
#  MSFT CLOSE
#  MSFT  OPEN
# 0    1     1
# 1    2     2
# 2    3     3
# 3    4     4

# This is still not what I wanted. Here is what this frame uses as columns.
pd.read_json(StringIO(s), orient='split').columns
# MultiIndex(levels=[['AAPL', 'CLOSE'], ['AAPL', 'OPEN'], ['CLOSE', 'MSFT'], ['MSFT', 'OPEN']],
#           labels=[[0, 1], [0, 1], [1, 0], [0, 1]])

# While this is the original one.
frame.columns
# MultiIndex(levels=[['AAPL', 'MSFT'], ['CLOSE', 'OPEN']],
#           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

So far I don't know any way to save a dataframe with multiindexed columns to either CSV or JSON and load it back properly...

@TomAugspurger
Copy link
Contributor

I suspect you'll have trouble with this in most storage formats, since hierarchical columns are somewhat unique to pandas. You may be best of manually flattening your columns before and after IO.

@TomAugspurger TomAugspurger added the IO Data IO issues that don't fit into a more specific label label Jul 19, 2018
@WillAyd
Copy link
Member

WillAyd commented Jul 19, 2018

The best be here may be orient="table" though I'm not sure what the JSON Table Schema specification says about hierarchical columns (indices are fine). Any investigation or PRs there are certainly welcome

Here's a workaround to maintain those via transposition. You need a non-numeric index to account for the fact that numeric column labels are not valid in the JSON Table schema:

>>> frame.index = list('abcd')
>>> pd.read_json(frame.T.to_json(orient="table"), orient="table").T
  AAPL       MSFT      
  OPEN CLOSE OPEN CLOSE
a    1     1    1     1
b    2     2    2     2
c    3     3    3     3
d    4     4    4     4

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jul 19, 2018 via email

@chris-b1
Copy link
Contributor

read_csv can handle hierarchical columns - but they must be specified on the way in with a list to the header arg.

from io import StringIO
buf = StringIO()
frame.to_csv(buf)
buf.seek(0)

In [109]: pd.read_csv(buf, header=[0,1], index_col=0)
Out[109]: 
  AAPL       MSFT     
 CLOSE OPEN CLOSE OPEN
0     1    1     1    1
1     2    2     2    2
2     3    3     3    3
3     4    4     4    4

@jbrockmendel
Copy link
Member

Could we use some trickery like using a \xa0 to denote what “columns
Are actually part of an MI?

@alexitkes
Copy link
Contributor Author

Thank you, CSV version works fine, while JSON version still requires frame index to be a string list and crashes if timestamps or integers are used as index.

Additionally, it's looking like even frames with multi-level row indexes are stored in JSON a bit incorrectly if orient='split' used.

In [30]: frame
Out[30]: 
            0  1  2  3
AAPL CLOSE  1  2  3  4
     OPEN   1  2  3  4
MSFT CLOSE  1  2  3  4
     OPEN   1  2  3  4

In [31]: pd.read_json(StringIO(frame.to_json(orient='split')), orient='split')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
...
ValueError: Shape of passed values is (4, 4), indices imply (4, 2)

In [32]: frame.to_json(orient='split')
Out[32]: '{"columns":[0,1,2,3],"index":[["AAPL","CLOSE"],["AAPL","OPEN"],["MSFT","CLOSE"],["MSFT","OPEN"]],"data":[[1,2,3,4],[1,2,3,4],[1,2,3,4],[1,2,3,4]]}'

@WillAyd
Copy link
Member

WillAyd commented Jul 20, 2018

Can you open a separate bug for the JSON orient="split" issue? That does seem off.

As a side note on orient="table":

@WillAyd WillAyd closed this as completed Jul 20, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

No branches or pull requests

5 participants