Skip to content

new pandas broke correlation #4

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
nzjrs opened this issue Oct 15, 2014 · 5 comments
Closed

new pandas broke correlation #4

nzjrs opened this issue Oct 15, 2014 · 5 comments

Comments

@nzjrs
Copy link

nzjrs commented Oct 15, 2014

Works on buzz

python ./rotation-analysis.py --uuid c3769f8c05ea11e48e436c626d3a008a --zfilt trim --rfilt trim --lenfilt 5 --reindex --arena flycave --cached --outdir /tmp/

Fails on new pandas

Traceback (most recent call last):
  File "./rotation-analysis.py", line 97, in <module>
    curve.plot_correlation_analysis(args, combine, correlations, correlation_options)
  File "/home/stowers/ros-flycave.electric.boost1.46/strawlab_freeflight_experiments/src/analysislib/curvature.py", line 364, in plot_correlation_analysis
    max_latencies_shift[_current_condition] = (latencies[ccef_m.argmax()], ccef_m.max())

looking into the code

def _correlate(df, cola, colb, shift=0):
    return df[cola].shift(shift).corr(df[colb])

returns nan if any series has nan in it. depressingly

def _correlate(df, cola, colb, shift=0):
    _df = df#.dropna(subset=[cola,colb])
    return _df[cola].shift(shift).corr(_df[colb])

also returns nan

@sdvillal
Copy link
Contributor

Looking into it, I already did a nasty quick fix over these lines in my branch here ebe8b17 (sorry for the commit not to be properly isolated, anyway that is not a proper solution). I spotted it while working with Etienne correlation plots, but did not blame pandas for it at that moment.

Right now I get this traceback that indicates a list indexing problem:

Traceback (most recent call last):
  File "./rotation-analysis.py", line 97, in <module>
    curve.plot_correlation_analysis(args, combine, correlations, correlation_options)
  File "/home/santi/Proyectos/imp/software/strawlab_freeflight_experiments/src/analysislib/curvature.py", line 360, in plot_correlation_analysis
    max_latencies_shift[_current_condition] = (latencies[ccef_m.argmax()], ccef_m.max())
TypeError: list indices must be integers, not float

(Note that I get the error in a different line, so I guess you are looking at some modified working copy).

pandas doc states that corr should ignore missing values

Will keep looking at it soon.

@sdvillal
Copy link
Contributor

Mistery solved: there is nothing broken. The offending vector is constant, and correlation with a constant vector is undefined (div by 0). Therefore the correct result is nan.

What I guess they have changed is the behavior or argmax:

import pandas as pd
import numpy as np
argmax = pd.Series(data=[np.nan, np.nan, np.nan]).argmax()  # this is now nan
max_latency = latencies[argmax]  # and we cannot index with nan

I will now write a fix and push

@nzjrs
Copy link
Author

nzjrs commented Oct 15, 2014

I'd like a test too. In general, no touching this stuff without a test.
On 15/10/2014 5:33 PM, "Santi Villalba" [email protected] wrote:

Mistery solved: there is nothing broken. The offending vector is constant,
and correlation with a constant vector is undefined (div by 0). Therefore
the correct result is nan.

What I guess they have changed is the behavior or argmax:

import pandas as pdimport numpy as npargmax = pd.Series(data=[np.nan, np.nan, np.nan]).argmax() # this is now nanmax_latency = latencies[argmax] # and we cannot index with nan

I will now write a fix and push


Reply to this email directly or view it on GitHub
#4 (comment)
.

@nzjrs
Copy link
Author

nzjrs commented Oct 15, 2014

There are already some correlation tests in test_correlation.py. add one
with a constant vector and make sure I'm already testing one with partial
nans.

I'll run the test on buzz and on my desktop
On 15/10/2014 5:50 PM, "Santi Villalba" [email protected] wrote:

Closed #4
#4
via dba01c9
dba01c9
.


Reply to this email directly or view it on GitHub
#4 (comment)
.

@sdvillal
Copy link
Contributor

I have written some tests in 2a97fb4. I did my best but I found hard to really test what we want to test with the current all-in-one design of plot_correlation_analysis.

Some thoughts:

  • Since this has revealed a bug, you might want to rerun all rotation experiments analyses. Otherwise we will have wrong numbers whenever a constant series had happened.
  • I do not know if having constant 0 stimulus time-series (e.g. rotation_rate in the uuid you reported) can be a sympton of a problem or it is just normal.
  • I expect that tests in machines with pandas 0.12 will fail after 31415f0. We could rewrite that code so that it works in both 0.12 and 0.14 (e.g. if we use np.argmax), but that has the risk of hiding again regressions that would allow computing undefined correlations. In my mind it is better to let it fail, but maybe you do not want to update pandas yet in, let's say, buzz.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants