How should I store frames with multiindex columns in CSV? #21976

alexitkes · 2018-07-19T11:07:41Z

Hello.

I tried to save a dataframe with MultiIndex used as columns to a CSV file and load it back, but I had no luck.

# Create a fame with multiindex columns
frame = pd.DataFrame({('AAPL', 'OPEN'): [1, 2, 3, 4], ('AAPL', 'CLOSE'): [1, 2, 3, 4], ('MSFT', 'OPEN'): [1, 2, 3, 4], ('MSFT', 'CLOSE'): [1, 2, 3, 4]})
# Make sure it was created as wanted.
frame
#   AAPL       MSFT     
#   CLOSE OPEN CLOSE OPEN
# 0     1    1     1    1
# 1     2    2     2    2
# 2     3    3     3    3
# 3     4    4     4    4

# Try to convert the frame to CSV
s1 = frame.to_csv()
s2 = frame.to_csv(tupleize_cols=True)
# FutureWarning displayed - tupleize_cols is deprecated.

print(s1)
# ,AAPL,AAPL,MSFT,MSFT
# ,CLOSE,OPEN,CLOSE,OPEN
# 0,1,1,1,1
# 1,2,2,2,2
# 2,3,3,3,3
# 3,4,4,4,4

print(s2)
# ,"('AAPL', 'CLOSE')","('AAPL', 'OPEN')","('MSFT', 'CLOSE')","('MSFT', 'OPEN')"
# 0,1,1,1,1
# 1,2,2,2,2
# 2,3,3,3,3
# 3,4,4,4,4

# Read the CSV strings back to DataFrames
f1 = pd.read_csv(StringIO(s1))
f2 = pd.read_csv(StringIO(s2), tupleize_cols=True)
# Warning about tupleize_cols here

# Both frames does not look like the original one.
f1
#    Unnamed: 0   AAPL AAPL.1   MSFT MSFT.1
# 0         NaN  CLOSE   OPEN  CLOSE   OPEN
# 1         0.0      1      1      1      1
# 2         1.0      2      2      2      2
# 3         2.0      3      3      3      3
# 4         3.0      4      4      4      4

f2
#    Unnamed: 0  ('AAPL', 'CLOSE')  ('AAPL', 'OPEN')  ('MSFT', 'CLOSE')  ('MSFT', 'OPEN')
# 0           0                  1                 1                  1                 1
# 1           1                  2                 2                  2                 2
# 2           2                  3                 3                  3                 3
# 3           3                  4                 4                  4                 4

As you see, both frames don't have multiindexed columns as original one. So, how should I save a DataFrame with multiindexed columns to CSV file and load it back to get a frame same to the original one?

I also tried to save as JSON, but also encountered problems. Here is what the frame shown above is converted to.

frame.to_json()
'{"["AAPL","CLOSE"]":{"0":1,"1":2,"2":3,"3":4},"["AAPL","OPEN"]":{"0":1,"1":2,"2":3,"3":4},"["MSFT","CLOSE"]":{"0":1,"1":2,"2":3,"3":4},"["MSFT","OPEN"]":{"0":1,"1":2,"2":3,"3":4}}'

So, tupleized multiindexed column names are obviously incorrectly quoted.

With best regards,

Alex.

INSTALLED VERSIONS

commit: None
python: 3.4.2.final.0
python-bits: 32
OS: Linux
OS-release: 3.16.0-6-686-pae
machine: i686
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0.dev0+318.g272bbdc
pytest: 3.6.3
pip: 1.5.6
setuptools: 5.5.1
Cython: 0.28.4
numpy: 1.14.5
scipy: None
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

alexitkes · 2018-07-19T13:23:55Z

I tried to experiment more with JSON, but found the only way to produce a valid JSON output is to specify orient='split', but it is still read incorrectly.

s = frame.to_json(orient='split')
# s = '{"columns":[["AAPL","CLOSE"],["AAPL","OPEN"],["MSFT","CLOSE"],["MSFT","OPEN"]],"index":[0,1,2,3],"data":[[1,1,1,1],[2,2,2,2],[3,3,3,3],[4,4,4,4]]}'

pd.read_json(StringIO(s), orient='split')
#  AAPL CLOSE
#  AAPL  OPEN
#  MSFT CLOSE
#  MSFT  OPEN
# 0    1     1
# 1    2     2
# 2    3     3
# 3    4     4

# This is still not what I wanted. Here is what this frame uses as columns.
pd.read_json(StringIO(s), orient='split').columns
# MultiIndex(levels=[['AAPL', 'CLOSE'], ['AAPL', 'OPEN'], ['CLOSE', 'MSFT'], ['MSFT', 'OPEN']],
#           labels=[[0, 1], [0, 1], [1, 0], [0, 1]])

# While this is the original one.
frame.columns
# MultiIndex(levels=[['AAPL', 'MSFT'], ['CLOSE', 'OPEN']],
#           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

So far I don't know any way to save a dataframe with multiindexed columns to either CSV or JSON and load it back properly...

TomAugspurger · 2018-07-19T13:48:12Z

I suspect you'll have trouble with this in most storage formats, since hierarchical columns are somewhat unique to pandas. You may be best of manually flattening your columns before and after IO.

WillAyd · 2018-07-19T15:20:58Z

The best be here may be orient="table" though I'm not sure what the JSON Table Schema specification says about hierarchical columns (indices are fine). Any investigation or PRs there are certainly welcome

Here's a workaround to maintain those via transposition. You need a non-numeric index to account for the fact that numeric column labels are not valid in the JSON Table schema:

>>> frame.index = list('abcd')
>>> pd.read_json(frame.T.to_json(orient="table"), orient="table").T
  AAPL       MSFT      
  OPEN CLOSE OPEN CLOSE
a    1     1    1     1
b    2     2    2     2
c    3     3    3     3
d    4     4    4     4

TomAugspurger · 2018-07-19T15:25:04Z

IIRC JSON Table Schema requires string column labels, so hierarchical columns are probably out of scope for orient='table'.

…

On Thu, Jul 19, 2018 at 10:21 AM William Ayd ***@***.***> wrote: The best be here *may* be orient="table" though I'm not sure what the JSON Table Schema specification says about hierarchical columns (indices are fine). Any investigation or PRs there are certainly welcome Here's a workaround to maintain those via transposition. You need a non-numeric index to account for the fact that numeric column numbers are not valid in the JSON Table schema: >>> frame.index = list('abcd')>>> pd.read_json(frame.T.to_json(orient="table"), orient="table").T AAPL MSFT OPEN CLOSE OPEN CLOSE a 1 1 1 1 b 2 2 2 2 c 3 3 3 3 d 4 4 4 4 — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#21976 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABQHIssvUIDVN1FULaNUytXbUXsk-HELks5uIKPlgaJpZM4VWM9O> .

chris-b1 · 2018-07-19T19:03:53Z

read_csv can handle hierarchical columns - but they must be specified on the way in with a list to the header arg.

from io import StringIO
buf = StringIO()
frame.to_csv(buf)
buf.seek(0)

In [109]: pd.read_csv(buf, header=[0,1], index_col=0)
Out[109]: 
  AAPL       MSFT     
 CLOSE OPEN CLOSE OPEN
0     1    1     1    1
1     2    2     2    2
2     3    3     3    3
3     4    4     4    4

jbrockmendel · 2018-07-20T04:34:28Z

Could we use some trickery like using a \xa0 to denote what “columns
Are actually part of an MI?

alexitkes · 2018-07-20T15:14:18Z

Thank you, CSV version works fine, while JSON version still requires frame index to be a string list and crashes if timestamps or integers are used as index.

Additionally, it's looking like even frames with multi-level row indexes are stored in JSON a bit incorrectly if orient='split' used.

In [30]: frame
Out[30]: 
            0  1  2  3
AAPL CLOSE  1  2  3  4
     OPEN   1  2  3  4
MSFT CLOSE  1  2  3  4
     OPEN   1  2  3  4

In [31]: pd.read_json(StringIO(frame.to_json(orient='split')), orient='split')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
...
ValueError: Shape of passed values is (4, 4), indices imply (4, 2)

In [32]: frame.to_json(orient='split')
Out[32]: '{"columns":[0,1,2,3],"index":[["AAPL","CLOSE"],["AAPL","OPEN"],["MSFT","CLOSE"],["MSFT","OPEN"]],"data":[[1,2,3,4],[1,2,3,4],[1,2,3,4],[1,2,3,4]]}'

WillAyd · 2018-07-20T16:16:24Z

Can you open a separate bug for the JSON orient="split" issue? That does seem off.

As a side note on orient="table":

Timestamp support is being added in ENH: Add Timedelta Support to JSON Reader with orient=table (#21140) #21827
Integers are not valid column labels but should be fine as an index; again if you have an example you can provide please open as a bug in a separate issue

TomAugspurger added the IO Data IO issues that don't fit into a more specific label label Jul 19, 2018

WillAyd closed this as completed Jul 20, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

How should I store frames with multiindex columns in CSV? #21976

How should I store frames with multiindex columns in CSV? #21976

alexitkes commented Jul 19, 2018 •

edited

Loading

INSTALLED VERSIONS

alexitkes commented Jul 19, 2018

Uh oh!

TomAugspurger commented Jul 19, 2018

Uh oh!

WillAyd commented Jul 19, 2018 •

edited

Loading

Uh oh!

TomAugspurger commented Jul 19, 2018 via email

Uh oh!

chris-b1 commented Jul 19, 2018

Uh oh!

jbrockmendel commented Jul 20, 2018

Uh oh!

alexitkes commented Jul 20, 2018

Uh oh!

WillAyd commented Jul 20, 2018

Uh oh!

Uh oh!

How should I store frames with multiindex columns in CSV? #21976

How should I store frames with multiindex columns in CSV? #21976

Comments

alexitkes commented Jul 19, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

INSTALLED VERSIONS

alexitkes commented Jul 19, 2018

Uh oh!

TomAugspurger commented Jul 19, 2018

Uh oh!

WillAyd commented Jul 19, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TomAugspurger commented Jul 19, 2018 via email

Uh oh!

chris-b1 commented Jul 19, 2018

Uh oh!

jbrockmendel commented Jul 20, 2018

Uh oh!

alexitkes commented Jul 20, 2018

Uh oh!

WillAyd commented Jul 20, 2018

Uh oh!

alexitkes commented Jul 19, 2018 •

edited

Loading

WillAyd commented Jul 19, 2018 •

edited

Loading