Description
Code Sample, a copy-pastable example if possible
import pandas as pd
import numpy as np
d = {'l': ['left', 'right', 'left', 'right', 'left', 'right'],
     'r': ['right', 'left', 'right', 'left', 'right', 'left'],
     'v': [-1, 1, -1, 1, -1, np.nan]}
df = pd.DataFrame(d)
Problem description
When a grouped DataFrame contains a value of np.nan, the output is not aligned with numpy.sum or pandas.Series.sum. I would expect NaN, as is given by the skipna=False flag for pd.Series.sum and also pd.DataFrame.sum:
In [235]: df.v.sum(skipna=False)
Out[235]: nan
However, this behavior is not reflected in the pandas.DataFrame.groupby object:
In [237]: df.groupby('l')['v'].sum()['right']
Out[237]: 2.0
and cannot be forced by applying the np.sum function directly:
In [238]: df.groupby('l')['v'].apply(np.sum)['right']
Out[238]: 2.0
See this StackOverflow post for a workaround.
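For reference, here is a minimal sketch of one possible workaround. It is not necessarily the approach from the linked post. It assumes that np.sum, when handed a Series, defers to Series.sum with the default skipna=True, which would explain why apply(np.sum) still drops the NaN. The helper name nan_propagating_sum is made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'l': ['left', 'right', 'left', 'right', 'left', 'right'],
    'v': [-1, 1, -1, 1, -1, np.nan],
})

# np.sum on a Series goes through Series.sum (skipna=True by default),
# while np.sum on the raw ndarray propagates NaN.
s = pd.Series([1.0, np.nan])
series_total = np.sum(s)
raw_total = np.sum(s.values)


def nan_propagating_sum(values):
    # Hypothetical helper: sum one group's Series without skipping NaN.
    return values.sum(skipna=False)


# Each group is passed to the helper as a Series, so skipna can be set explicitly.
result = df.groupby('l')['v'].apply(nan_propagating_sum)
print(result)
```

With this, the 'right' group comes out as NaN and the 'left' group is summed as usual. The cost is that apply runs a Python function per group, so it is likely slower than the built-in groupby sum on large frames.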
Expected Output
In [238]: df.groupby('l')['v'].apply(np.sum)['right']
Out[238]: nan
and
In [237]: df.groupby('l')['v'].sum(skipna=False)['right']
Out[237]: nan
Output of pd.show_versions()
pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 32.3.1
Cython: 0.25.2
numpy: 1.12.0
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.4
lxml: 3.7.0
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.43.0
pandas_datareader: None