
ENH: Generalize cut/melt to handle datetime input #6582

Closed

8one6 opened this issue Mar 10, 2014 · 19 comments
Labels
Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) · Datetime (datetime data dtype) · Dtype Conversions (unexpected or buggy dtype conversions) · Enhancement

Comments

@8one6

8one6 commented Mar 10, 2014

see also #6434

Originally from:
http://stackoverflow.com/questions/22286930/is-it-possible-to-use-cut-on-a-collection-of-datetimes?noredirect=1#comment33862629_22286930

Would it be possible to enhance the cut function to handle datetime inputs?

For example, say I have the following setup:

import pandas as pd
import numpy as np

np.random.seed(0)
obs_list = ['John', 'Paul', 'George', 'Ringo']

date_indices = [pd.date_range('2014-01-01', '2014-02-01', freq=f) for f in ['1d', '3h', '25min']]

dfs = [pd.DataFrame({'observer': np.random.choice(obs_list, len(di)),
                     'value': np.random.randn(len(di)) + (100 * i)},
                    index=di)
       for i, di in enumerate(date_indices)]

bigframe = pd.concat(dfs)

I would like to split up bigframe into 7 "bins" and explore the properties within each bin. I'd like the bins to be sequential, and for each to contain (approximately) the same number of data points.

Ideally, I'd be able to do something like:

means = bigframe.groupby(pd.cut(bigframe.index, 7))['value'].mean()

I think I can get something like that by doing:

bigframe['dt_value'] = bigframe.index.map(lambda x: x.value)
means = bigframe.groupby(pd.cut(bigframe['dt_value'], 7))['value'].mean()

but that doesn't seem particularly elegant, and it winds up with bins labelled by the raw integer datetime values instead of readable datetimes.

@hayd
Contributor

hayd commented Mar 10, 2014

Could you post an example input and output? How would this be different from a resample?

@8one6
Author

8one6 commented Mar 10, 2014

For example, say I have two columns of data: the first is timestamps and the second is temperature. The data is sampled unevenly. I was hoping to use cut to bin the times into 5 categories, each of which contains an equal number of measurements. Is this something resample can handle?

@hayd
Contributor

hayd commented Mar 10, 2014

I don't think resample can do equal-sized bins. Do you mind editing an example into the original question? I think that'll make it clearer.
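
A toy sketch of the distinction (data and names are hypothetical, not from this thread): resample partitions by fixed time width, so with irregular sampling the member counts per bin vary, whereas the ask here is equal counts per bin:

import numpy as np
import pandas as pd

np.random.seed(1)
# 50 irregularly spaced observation times within one day
offsets = np.sort(np.random.randint(0, 24 * 3600, 50))
idx = pd.Timestamp('2014-01-01') + pd.to_timedelta(offsets, unit='s')
temps = pd.Series(20 + np.random.randn(50), index=idx)
print(temps.resample('6h').count())  # fixed-width time bins: counts vary per bin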

@8one6
Author

8one6 commented Mar 11, 2014

Took a shot at it.

@jreback
Contributor

jreback commented Mar 11, 2014

You just want something like this?

In [28]: df = bigframe.reset_index()

In [29]: df.groupby(df.index % 7).apply(lambda x: pd.Series({'start': x['index'].iloc[0], 'end': x['index'].iloc[-1], 'value': x['value'].mean()}))
Out[29]: 
                  end      start       value
0 2014-01-31 23:20:00 2014-01-01  184.452251
1 2014-01-31 23:45:00 2014-01-02  184.793878
2 2014-02-01 00:10:00 2014-01-03  184.790279
3 2014-01-31 21:40:00 2014-01-04  184.704260
4 2014-01-31 22:05:00 2014-01-05  184.999369
5 2014-01-31 22:30:00 2014-01-06  185.152102
6 2014-01-31 22:55:00 2014-01-07  185.076598

[7 rows x 3 columns]

@8one6
Author

8one6 commented Mar 11, 2014

I believe your example only gives the desired behavior when the data is already in chronological order. Part of what I'd hoped to get from cut is that it puts the data in order before binning it. I think if you did a sort before your groupby, this would be exactly the behavior I'm after. My hope was that, since this is exactly what cut does for a living anyway, the functionality could be incorporated into cut itself.
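
A minimal sketch of that "sort first, then partition into equal-count sequential groups" idea, continuing the session above (the 7-way split follows the earlier example; this is an illustration, not pandas API):

# sort chronologically, then assign each row to one of 7 near-equal-count,
# sequential groups by integer position
df = bigframe.sort_index().reset_index()
bin_size = -(-len(df) // 7)  # ceiling division keeps group sizes near-equal
means = df.groupby(df.index // bin_size)['value'].mean()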

@jreback
Contributor

jreback commented Mar 11, 2014

cut is really more for continuous data binning; this is just integer partitioning, which is what you would use groupby for. I think we have an open issue about creating a partition function to do almost exactly this; see #4059.

The problem with cut in this scenario is that you would get potentially really odd dates if you treat the datetime space as continuous-like. I don't think that's a reasonable solution.

@8one6
Author

8one6 commented Mar 11, 2014

I'm not sure I follow. In my example above, the data is intentionally sampled pretty unevenly (a combination of daily, 3-hourly and 25-minutely data). So I would have thought that treating time as a continuous variable to bin was appropriate in that context.

@jreback
Contributor

jreback commented Mar 11, 2014

Here is what I would expect if pd.cut handled datetime64[ns] (except I don't print the left-right bins), but it's the same idea. I don't think this is intuitive.

In [27]: g, bins = pd.cut(bigframe.index.asi8, 7, retbins=True)

In [28]: df = bigframe.groupby(g)['value'].mean()

In [29]: df
Out[29]: 
(1388531721000000000, 1388917114285714176]    184.568407
(1388917114285714176, 1389299828571428608]    185.379917
(1389299828571428608, 1389682542857142784]    184.464207
(1389682542857142784, 1390065257142857216]    185.334258
(1390065257142857216, 1390447971428571392]    184.395181
(1390447971428571392, 1390830685714285824]    185.390527
(1390830685714285824, 1391213400000000000]    184.447259
Name: value, dtype: float64

In [30]: df.index = [pd.Timestamp(int(b)) for b in bins][:7]

In [31]: df
Out[31]: 
2013-12-31 23:15:21              184.568407
2014-01-05 10:18:34.285714176    185.379917
2014-01-09 20:37:08.571428608    184.464207
2014-01-14 06:55:42.857142784    185.334258
2014-01-18 17:14:17.142857216    184.395181
2014-01-23 03:32:51.428571392    185.390527
2014-01-27 13:51:25.714285824    184.447259
Name: value, dtype: float64

@jreback
Contributor

jreback commented Mar 11, 2014

You would need some sort of 'rounding' I think

In [45]: df.index = [pd.Timestamp(int(b / 1e12) * 10**12) for b in bins][:7]

In [46]: df
Out[46]: 
2013-12-31 23:03:20    184.568407
2014-01-05 10:00:00    185.379917
2014-01-09 20:23:20    184.464207
2014-01-14 06:46:40    185.334258
2014-01-18 16:53:20    184.395181
2014-01-23 03:16:40    185.435758
2014-01-27 13:40:00    184.398994
Name: value, dtype: float64

@hayd
Contributor

hayd commented Mar 11, 2014

Perhaps it would help to provide a use case for when you'd want this.

I think I'm with Jeff here; a priori it's a strange result... (even with rounding)

@jreback
Contributor

jreback commented Mar 11, 2014

Though I think this should at least have a better error message...

@jreback jreback added this to the 0.14.0 milestone Mar 11, 2014
@8one6
Author

8one6 commented Mar 11, 2014

I'll see if I can think of another case, but I'd put out there that I think Jeff's two results above are quite intuitive and useful. Put another way: someone said above that cut is generally used to bin continuous variables, so why wouldn't time qualify in an intuitive way? For example, I can imagine wanting to make histograms of time measurements (an experiment where lots of people report the time that they see some event, say). And I might want to report the summary stats for the groups of people who observed the event at similar times. (Maybe I ask lots of people when they saw a lunar eclipse reach maximum coverage, and I also know their latitudes and longitudes, so I want to bin observers based on when they saw the max coverage of the eclipse and then see whether they all live near each other.)

@jreback
Contributor

jreback commented Mar 11, 2014

@8one6 not a problem with this result; maybe we could just automatically round to second precision (in theory you should be able to pass this in). Not hard to do.

Want to do a PR for this then?

It gets you into some juicy internals!
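
A minimal sketch of that second-precision rounding, reusing `bins` from the earlier In [27] example (an illustration, not a proposed pandas API):

# truncate each nanosecond bin edge to whole seconds before relabelling
df.index = [pd.Timestamp(int(b) // 10**9 * 10**9) for b in bins][:7]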

@8one6
Author

8one6 commented Mar 11, 2014

@jreback I've never actually contributed to a major project like this, and am very shaky on Git in general. Best place to start? Best ways to avoid making people angry with potentially-shaky code? I assume I'd need to write a test or two? (Never done that.)

@jreback
Contributor

jreback commented Mar 11, 2014

https://github.com/pydata/pandas/wiki

clone the repo

look for where the tests for cut are, currently in pandas/tests/test_algos.py

devise a simple test or tests that show the new behavior (see the sketch below)

run the tests; they should fail

do a fix

run until the tests pass (and others don't break)

submit a PR

just post here if you need help
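
A hedged sketch of what such a test might look like (the function name and placement are assumptions, not the actual pandas test-suite layout; on 2014-era pandas this would fail, which is the point):

import pandas as pd

def test_cut_datetime():
    # 10 daily timestamps split into 3 equal-width bins; every timestamp
    # should receive a bin code, and all 3 bins should be populated
    idx = pd.date_range('2014-01-01', periods=10, freq='D')
    codes = pd.cut(idx, 3, labels=False)
    assert len(codes) == 10
    assert set(codes) == {0, 1, 2}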

@8one6
Author

8one6 commented Mar 11, 2014

So I was being a bit silly about this. I think I meant qcut everywhere I've written cut in my examples. The goal isn't to get bins of equal extent; it's to get bins with equal numbers of members. So it should be qcut, not cut, right? But I think the same issues apply. Does that sound right?
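
For concreteness, a hedged sketch of the same workaround with qcut, so each of the 7 bins holds (approximately) the same number of rows (variable names follow the earlier examples):

# quantile-based bins on the integer datetime values, then relabel
labels, edges = pd.qcut(bigframe.index.asi8, 7, retbins=True, labels=False)
means = bigframe.groupby(labels)['value'].mean()
means.index = [pd.Timestamp(int(e)) for e in edges[:-1]]  # label by left edge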

@jreback
Contributor

jreback commented Mar 11, 2014

Same idea (and I think the fix will work for both of those functions), e.g. test for both.

@jreback jreback changed the title ENH: Generalize 'cut' to handle datetime input ENH: Generalize cut/melt to handle datetime input Mar 22, 2014
@jreback jreback modified the milestones: 0.15.0, 0.14.0 Apr 6, 2014
@jreback jreback removed this from the 0.16.0 milestone Mar 3, 2015
@jreback jreback modified the milestones: Next Major Release, 0.16.0 Mar 3, 2015
@mroeschke
Member

cut supports datetime data now and is well tested. Closing
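
Per this closing comment, modern pandas handles the originally requested spelling directly; a quick sketch using the setup from the top of the thread (exact Interval label formatting varies by pandas version):

means = bigframe.groupby(pd.cut(bigframe.index, 7))['value'].mean()

pd.qcut accepts datetime input as well, for the equal-count variant discussed above.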
