BUG: pandas std broken, erratic behavior #11524

Closed
@Qu-Bit

I seem to have encountered a bug when using
DataFrame.apply(np.std) or DataFrame.groupby(...).agg(np.std).

tested on:

  • debian 7.9 (wheezy), pandas version: 0.16.2.dev
  • ubuntu 15.10, pandas version: 0.15.0
    • more version details for the tested snippets below

test code:

import numpy as np
import pandas as pd
print("pandas version: ", pd.__version__)


# why is there a difference ?
s = pd.Series([1.11] * 10)
print(np.std(s))
#1.88486436615e-08
print(np.std(s.values))
#2.22044604925e-16

# why is the result significantly != 0 ?
# why does it differ with the number of (identical) values ?
print(np.std(pd.Series([53426.7756333882,] * 50)))
#0.0011048543456
print(np.std(pd.Series([53426.7756333882,] * 123)))
#0.000704429402084

# doing that with data frames
df = pd.DataFrame([538512.198638,] * 123)
print(df.apply(np.std))
#0    0.030867
# dtype: float64
print(np.std(df))
#0    0.030867
# dtype: float64
print(np.std(df.values))
#2.32830643654e-10

With the following data set, std even returns NaN.
Oddly, this only happens once the values have enough digits after the decimal point.

import numpy as np
import pandas as pd
print("pandas version: ", pd.__version__)

df = pd.DataFrame([538512.1986379109,]*126)
df['uid']=15
df.set_index('uid', inplace=True)

# why do I occasionally even get NaN for identical values?
# NaN appears if there are enough digits after the decimal point
print(df.groupby(level=0).agg([np.mean, np.std]))
#                 0    
#              mean std
#uid                   
#15   538512.198638 NaN
df['bla'] = 911.
print(df.groupby(level=0).agg([np.mean, np.std]))
#                 0      bla    
#              mean std mean std
#uid                            
#15   538512.198638 NaN  911   0
df['bla'] = 911.12351171571542312243214
print(df.groupby(level=0).agg([np.mean, np.std]))
#                 0      bla    
#              mean std mean std
#uid                            
#15   538512.198638 NaN  911 NaN
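
For comparison, here is a minimal sketch of running the same reductions on the underlying NumPy arrays, which is the path that gave the near-zero results above (np.std(s.values)). The helper name ndarray_std and the column name "value" are made up for this illustration and are not part of pandas. Note that NumPy's std defaults to ddof=0 while pandas' own .std() defaults to ddof=1, so the helper computes the population standard deviation.

```python
import numpy as np
import pandas as pd


def ndarray_std(col):
    # Hypothetical helper: take the population std (ddof=0) on the raw
    # float64 array so that NumPy's ndarray code path does the reduction.
    return float(np.std(np.asarray(col, dtype=np.float64)))


# Same constant data as in the report; "value" is an assumed column name.
df = pd.DataFrame({"value": [538512.1986379109] * 126})
df["uid"] = 15
df = df.set_index("uid")

per_column = df.apply(ndarray_std)                # one result per column
per_group = df.groupby(level=0).agg(ndarray_std)  # one result per uid and column

print(per_column)
print(per_group)
```

This only sidesteps the symptom; it does not explain why the pandas code path behaves differently.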

bug hypotheses:
This looks like a problem in the details of how pandas applies functions.
Is the size of the container being mapped over calculated incorrectly (e.g. the total number of elements in the whole container is used instead of the length of the single column Series)?
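
Another hypothesis, which this report does not confirm, is floating-point cancellation: a variance computed in one pass as sum(x^2) - sum(x)^2 / n subtracts two huge, nearly equal numbers, so for identical values the difference can come out as a small positive number or even slightly negative (and the square root of a negative number is NaN). The sketch below contrasts that formula with a two-pass computation that subtracts the mean first. Both function names are made up for this illustration, and nothing here is taken from the pandas source.

```python
import math

import numpy as np


def std_one_pass(values):
    # Hypothetical illustration: population std via sum(x^2) - sum(x)^2 / n.
    x = np.asarray(values, dtype=np.float64)
    n = x.size
    var = (np.sum(x * x) - np.sum(x) ** 2 / n) / n
    # Rounding can push var slightly below zero; report that case as NaN.
    return math.sqrt(var) if var >= 0 else float("nan")


def std_two_pass(values):
    # Population std after centering the data on its mean.
    x = np.asarray(values, dtype=np.float64)
    return float(np.sqrt(np.mean((x - x.mean()) ** 2)))


data = [538512.1986379109] * 126
print("one-pass:", std_one_pass(data))
print("two-pass:", std_two_pass(data))
print("np.std  :", np.std(np.asarray(data)))
```

If the one-pass result is noticeably non-zero or NaN while the two-pass result stays near zero, that would match the symptoms above, but it would still need to be checked against the actual pandas implementation.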

version details: (ubuntu 15.10)

INSTALLED VERSIONS

commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-43-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: de_AT.UTF-8

pandas: 0.15.0
nose: 1.3.6
Cython: 0.23.3
numpy: 1.8.2
scipy: 0.14.1
statsmodels: 0.5.0
IPython: 2.3.0
sphinx: 1.2.3
patsy: 0.4.0
dateutil: 2.2
pytz: 2014.10
bottleneck: None
tables: 3.2.2
numexpr: 2.4.3
matplotlib: 1.4.2
openpyxl: 2.3.0-b1
xlrd: 0.9.4
xlwt: 0.7.5
xlsxwriter: None
lxml: 3.4.4
bs4: 4.3.2
html5lib: 0.999
httplib2: 0.9
apiclient: None
rpy2: 2.6.2
sqlalchemy: 1.0.8
pymysql: 0.6.2.None
psycopg2: 2.6.1 (dt dec mx pq3 ext lo64)
