Description
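I seem to have encountered a bug when using
DataFrame.apply(np.std) or DataFrame.groupby(...).agg(np.std): for constant data the result is not close to 0 and differs from np.std on the underlying values.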
tested on:
- debian 7.9 (wheezy), pandas version: 0.16.2.dev
- ubuntu 15.10, pandas version: 0.15.0
- more version details for the tested snippets below
test code:
import numpy as np
import pandas as pd
print("pandas version: ", pd.__version__)
# why do the next two results differ?
s = pd.Series([1.11] * 10)
print(np.std(s))
#1.88486436615e-08
print(np.std(s.values))
#2.22044604925e-16
# why is the result significantly != 0 for constant data?
# and why does it change with the number of elements?
print(np.std(pd.Series([53426.7756333882,] * 50)))
#0.0011048543456
print(np.std(pd.Series([53426.7756333882,] * 123)))
#0.000704429402084
# doing that with data frames
df = pd.DataFrame([538512.198638,] * 123)
print(df.apply(np.std))
#0 0.030867
# dtype: float64
print(np.std(df))
#0 0.030867
# dtype: float64
print(np.std(df.values))
#2.32830643654e-10
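To make the comparison above easier to re-run on other versions, here is a minimal sketch that puts both code paths side by side. It assumes that np.std(Series) ends up in the Series' own std with ddof=0 (not verified here for every numpy/pandas combination); compare_std is a hypothetical helper name, not part of pandas.

```python
import numpy as np
import pandas as pd


def compare_std(values):
    # Hypothetical helper: population std (ddof=0) computed once through
    # pandas and once through NumPy on the raw ndarray.
    s = pd.Series(values, dtype="float64")
    return s.std(ddof=0), np.std(s.values)


for values in ([1.11] * 10, [53426.7756333882] * 50, [538512.198638] * 123):
    via_pandas, via_numpy = compare_std(values)
    print(len(values), via_pandas, via_numpy)
```

For constant input both numbers should be at rounding-error level; a large gap between them points at the pandas side.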
With the following data set, std even returns NaN.
Oddly, this happens once the value has enough digits after the decimal point.
import numpy as np
import pandas as pd
print("pandas version: ", pd.__version__)
df = pd.DataFrame([538512.1986379109,]*126)
df['uid']=15
df.set_index('uid', inplace=True)
# why do I occasionally even get NaN for identical values?
# NaN appears if there are enough digits after the decimal point
print(df.groupby(level=0).agg([np.mean, np.std]))
#                  0
#               mean std
# uid
# 15   538512.198638 NaN
df['bla'] = 911.
print(df.groupby(level=0).agg([np.mean, np.std]))
#                  0       bla
#               mean std  mean std
# uid
# 15   538512.198638 NaN   911   0
df['bla'] = 911.12351171571542312243214
print(df.groupby(level=0).agg([np.mean, np.std]))
#                  0       bla
#               mean std  mean std
# uid
# 15   538512.198638 NaN   911 NaN
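A possible workaround, sketched under the assumption that the problem sits in the pandas reduction path and not in NumPy itself: aggregate with a function that reduces the raw ndarray of each group. std_on_values is a hypothetical helper, and the column name "x" is chosen only for this example.

```python
import numpy as np
import pandas as pd


def std_on_values(column):
    # Hypothetical helper: hand the raw ndarray to NumPy so that the
    # pandas implementation of std is bypassed.
    return np.std(column.values)


df = pd.DataFrame({"x": [538512.1986379109] * 126})
df["uid"] = 15
df.set_index("uid", inplace=True)

result = df.groupby(level=0).agg(std_on_values)
print(result)
```

This only sidesteps the symptom; it does not explain why the built-in path returns NaN.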
bug hypotheses:
this looks like a problem in the details of how pandas applies functions
is the size of the container that the function is mapped over calculated incorrectly? (e.g. the total number of elements in the whole container is used instead of the length of the single-column Series)
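An alternative hypothesis, which is an assumption on my side and not confirmed for pandas: the variance might be computed with the one-pass shortcut E[x^2] - E[x]^2, which loses precision when the mean is large compared to the spread. The sketch below uses only NumPy to compare that shortcut with a two-pass computation on the same constant data; both function names are hypothetical.

```python
import numpy as np


def one_pass_variance(values):
    # Shortcut formula E[x^2] - E[x]^2: two nearly equal large numbers are
    # subtracted, so rounding errors can dominate the result.
    x = np.asarray(values, dtype=np.float64)
    return np.mean(x * x) - np.mean(x) ** 2


def two_pass_variance(values):
    # Subtract the mean first, then average the squared deviations.
    x = np.asarray(values, dtype=np.float64)
    return np.mean((x - np.mean(x)) ** 2)


data = [538512.1986379109] * 126
print("one-pass variance:", one_pass_variance(data))
print("two-pass variance:", two_pass_variance(data))
```

If the one-pass value comes out slightly negative, taking its square root yields NaN, which would fit the NaN results above. The mean in the groupby output is correct, which speaks somewhat against a wrong element count.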
version details: (ubuntu 15.10)
INSTALLED VERSIONS
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-43-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: de_AT.UTF-8
pandas: 0.15.0
nose: 1.3.6
Cython: 0.23.3
numpy: 1.8.2
scipy: 0.14.1
statsmodels: 0.5.0
IPython: 2.3.0
sphinx: 1.2.3
patsy: 0.4.0
dateutil: 2.2
pytz: 2014.10
bottleneck: None
tables: 3.2.2
numexpr: 2.4.3
matplotlib: 1.4.2
openpyxl: 2.3.0-b1
xlrd: 0.9.4
xlwt: 0.7.5
xlsxwriter: None
lxml: 3.4.4
bs4: 4.3.2
html5lib: 0.999
httplib2: 0.9
apiclient: None
rpy2: 2.6.2
sqlalchemy: 1.0.8
pymysql: 0.6.2.None
psycopg2: 2.6.1 (dt dec mx pq3 ext lo64)