Description
DataFrame.sum() appears to always create a temporary copy of the entire data frame in memory.
Code Sample
First, we create a large, 3.7 GB DataFrame with many columns:
import pandas as pd
import numpy as np
p = 500
n = 1000000
dtype = np.float64
df = pd.DataFrame(np.arange(p*n, dtype=dtype).reshape((p, n)))
df.info()  # info() prints the summary itself and returns None
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Columns: 1000000 entries, 0 to 999999
dtypes: float64(1000000)
memory usage: 3.7 GB
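As a sanity check, the reported size can be derived from the shape and dtype alone. This is a minimal sketch that assumes pandas reports memory in units of 1024 bytes (so "GB" here means 2**30 bytes). The names itemsize and frame_bytes are introduced only for illustration.

```python
import numpy as np

p = 500        # rows, as in the example above
n = 1000000    # columns, as in the example above
itemsize = np.dtype(np.float64).itemsize  # 8 bytes per float64 value

# Size of the raw data block, ignoring the (small) index overhead.
frame_bytes = p * n * itemsize
print(frame_bytes)                     # 4000000000
print(round(frame_bytes / 2**30, 1))   # 3.7
```

Doubling this figure gives roughly 7.5 GB, which is what a full temporary copy of the frame would add up to.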
Next, we want to sum over the rows:
# sum over rows
s = df.sum(axis=0) # this step requires > 7GB of memory (!!!)
s.to_frame().info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 1 columns):
0 1000000 non-null float64
dtypes: float64(1)
memory usage: 7.6 MB
By monitoring the memory consumption of the Python process with top during this step (execution takes 2-3 seconds on my machine), I can see that consumption temporarily doubles, indicating that a copy of the entire frame is created in memory. However, the following code achieves the same result without creating a copy:
y = df.values.sum(axis=0) # requires < 4 GB of memory
y = pd.Series(y, index=df.columns)
y.to_frame().info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 1 columns):
0 1000000 non-null float64
dtypes: float64(1)
memory usage: 7.6 MB
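The comparison above was made by watching top. A more repeatable way to compare the two approaches might be Python's tracemalloc module, sketched below on a scaled-down frame. The helper peak_traced_bytes is hypothetical and not part of pandas. As far as I know, NumPy only reports its buffers to tracemalloc from version 1.13 on Python 3.6 or later, which is newer than the versions listed in this report, so the numbers will depend on the installed versions.

```python
import tracemalloc

import numpy as np
import pandas as pd


def peak_traced_bytes(func):
    """Run func() and return the peak bytes tracemalloc saw allocated during the call."""
    tracemalloc.start()
    try:
        func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


# Scaled-down frame (about 40 MB) so the comparison runs quickly.
small = pd.DataFrame(np.arange(50 * 100000, dtype=np.float64).reshape((50, 100000)))

peak_pandas = peak_traced_bytes(lambda: small.sum(axis=0))
peak_numpy = peak_traced_bytes(lambda: small.values.sum(axis=0))

print("DataFrame.sum peak: %.1f MB" % (peak_pandas / 2**20))
print("ndarray.sum peak:   %.1f MB" % (peak_numpy / 2**20))
```

Only allocations made while tracing is active are counted, so the frame itself is excluded and a full temporary copy would show up as a peak close to the frame's own size.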
Problem description
Creating a copy of the data frame seems unnecessary for summing (NumPy does it without creating a copy). The current implementation of DataFrame.sum() makes it impossible to sum over a data frame when there isn't enough memory available to hold a second copy of it.
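Until this is addressed, one possible workaround that stays within pandas is to sum the frame in blocks of columns, so that any temporary copy is limited to one block. This is only a sketch: the helper chunked_column_sum and its chunk_cols parameter are hypothetical, and it assumes all columns are numeric and that the result fits in float64.

```python
import numpy as np
import pandas as pd


def chunked_column_sum(df, chunk_cols=100000):
    """Sum each column (axis=0), processing one block of columns at a time.

    Hypothetical helper: if DataFrame.sum() copies its input, the copy is
    limited to chunk_cols columns instead of the whole frame.
    """
    n_cols = df.shape[1]
    out = np.empty(n_cols, dtype=np.float64)
    for start in range(0, n_cols, chunk_cols):
        stop = min(start + chunk_cols, n_cols)
        # Sum only the current block of columns.
        out[start:stop] = df.iloc[:, start:stop].sum(axis=0).values
    return pd.Series(out, index=df.columns)


# Small usage example: 3 rows, 4 columns, processed in blocks of 3 columns.
small = pd.DataFrame(np.arange(12, dtype=np.float64).reshape((3, 4)))
result = chunked_column_sum(small, chunk_cols=3)
print(result.tolist())  # [12.0, 15.0, 18.0, 21.0]
```

A smaller chunk_cols lowers the peak memory overhead at the cost of more Python-level loop iterations.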
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-24-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.20.2
pytest: 3.1.2
pip: 9.0.1
setuptools: 36.0.1
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 6.1.0
sphinx: 1.6.1
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 0.9.6
lxml: None
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None