new pandas broke correlation #4

nzjrs · 2014-10-15T08:07:57Z

Works on buzz

python ./rotation-analysis.py --uuid c3769f8c05ea11e48e436c626d3a008a --zfilt trim --rfilt trim --lenfilt 5 --reindex --arena flycave --cached --outdir /tmp/

Fails on new pandas

Traceback (most recent call last):
  File "./rotation-analysis.py", line 97, in <module>
    curve.plot_correlation_analysis(args, combine, correlations, correlation_options)
  File "/home/stowers/ros-flycave.electric.boost1.46/strawlab_freeflight_experiments/src/analysislib/curvature.py", line 364, in plot_correlation_analysis
    max_latencies_shift[_current_condition] = (latencies[ccef_m.argmax()], ccef_m.max())

looking into the code

def _correlate(df, cola, colb, shift=0):
    return df[cola].shift(shift).corr(df[colb])

returns nan if any series has nan in it. depressingly

def _correlate(df, cola, colb, shift=0):
    _df = df#.dropna(subset=[cola,colb])
    return _df[cola].shift(shift).corr(_df[colb])

also returns nan

The text was updated successfully, but these errors were encountered:

sdvillal · 2014-10-15T14:46:18Z

Looking into it, I already did a nasty quick fix over these lines in my branch here ebe8b17 (sorry for the commit not to be properly isolated, anyway that is not a proper solution). I spotted it while working with Etienne correlation plots, but did not blame pandas for it at that moment.

Right now I get this traceback that indicates a list indexing problem:

Traceback (most recent call last):
  File "./rotation-analysis.py", line 97, in <module>
    curve.plot_correlation_analysis(args, combine, correlations, correlation_options)
  File "/home/santi/Proyectos/imp/software/strawlab_freeflight_experiments/src/analysislib/curvature.py", line 360, in plot_correlation_analysis
    max_latencies_shift[_current_condition] = (latencies[ccef_m.argmax()], ccef_m.max())
TypeError: list indices must be integers, not float

(Note that I get the error in a different line, so I guess you are looking at some modified working copy).

pandas doc states that corr should ignore missing values

Will keep looking at it soon.

sdvillal · 2014-10-15T15:33:34Z

Mistery solved: there is nothing broken. The offending vector is constant, and correlation with a constant vector is undefined (div by 0). Therefore the correct result is nan.

What I guess they have changed is the behavior or argmax:

import pandas as pd
import numpy as np
argmax = pd.Series(data=[np.nan, np.nan, np.nan]).argmax()  # this is now nan
max_latency = latencies[argmax]  # and we cannot index with nan

I will now write a fix and push

nzjrs · 2014-10-15T16:29:34Z

I'd like a test too. In general, no touching this stuff without a test.
On 15/10/2014 5:33 PM, "Santi Villalba" [email protected] wrote:

Mistery solved: there is nothing broken. The offending vector is constant,
and correlation with a constant vector is undefined (div by 0). Therefore
the correct result is nan.

What I guess they have changed is the behavior or argmax:

import pandas as pdimport numpy as npargmax = pd.Series(data=[np.nan, np.nan, np.nan]).argmax() # this is now nanmax_latency = latencies[argmax] # and we cannot index with nan

I will now write a fix and push

—
Reply to this email directly or view it on GitHub
#4 (comment)
.

nzjrs · 2014-10-15T16:35:10Z

There are already some correlation tests in test_correlation.py. add one
with a constant vector and make sure I'm already testing one with partial
nans.

I'll run the test on buzz and on my desktop
On 15/10/2014 5:50 PM, "Santi Villalba" [email protected] wrote:

Closed #4
#4
via dba01c9
dba01c9
.

—
Reply to this email directly or view it on GitHub
#4 (comment)
.

sdvillal · 2014-10-16T13:58:28Z

I have written some tests in 2a97fb4. I did my best but I found hard to really test what we want to test with the current all-in-one design of plot_correlation_analysis.

Some thoughts:

Since this has revealed a bug, you might want to rerun all rotation experiments analyses. Otherwise we will have wrong numbers whenever a constant series had happened.
I do not know if having constant 0 stimulus time-series (e.g. rotation_rate in the uuid you reported) can be a sympton of a problem or it is just normal.
I expect that tests in machines with pandas 0.12 will fail after 31415f0. We could rewrite that code so that it works in both 0.12 and 0.14 (e.g. if we use np.argmax), but that has the risk of hiding again regressions that would allow computing undefined correlations. In my mind it is better to let it fail, but maybe you do not want to update pandas yet in, let's say, buzz.

sdvillal closed this as completed in dba01c9 Oct 15, 2014

sdvillal mentioned this issue Oct 16, 2014

index out of bounds in correlation plot #6

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

new pandas broke correlation #4

new pandas broke correlation #4

nzjrs commented Oct 15, 2014

sdvillal commented Oct 15, 2014

Uh oh!

sdvillal commented Oct 15, 2014

Uh oh!

nzjrs commented Oct 15, 2014

Uh oh!

nzjrs commented Oct 15, 2014

Uh oh!

sdvillal commented Oct 16, 2014

Uh oh!

new pandas broke correlation #4

new pandas broke correlation #4

Comments

nzjrs commented Oct 15, 2014

sdvillal commented Oct 15, 2014

Uh oh!

sdvillal commented Oct 15, 2014

Uh oh!

nzjrs commented Oct 15, 2014

Uh oh!

nzjrs commented Oct 15, 2014

Uh oh!

sdvillal commented Oct 16, 2014

Uh oh!