
ENH: Generalize cut/melt to handle datetime input #6582

Closed

8one6 opened this issue Mar 10, 2014 · 19 comments
Labels
Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) · Datetime (datetime data dtype) · Dtype Conversions (unexpected or buggy dtype conversions) · Enhancement

Comments

@8one6

8one6 commented Mar 10, 2014

see also #6434

Originally from:
http://stackoverflow.com/questions/22286930/is-it-possible-to-use-cut-on-a-collection-of-datetimes?noredirect=1#comment33862629_22286930

Would it be possible to enhance the cut function to handle datetime inputs?

For example, say I have the following setup:

import pandas as pd
import numpy as np

np.random.seed(0)
obs_list = ['John', 'Paul', 'George', 'Ringo']

date_indices = [pd.date_range('2014-01-01', '2014-02-01', freq=f) for f in ['1d', '3h', '25min']]

dfs = [pd.DataFrame({'observer': np.random.choice(obs_list, len(di)),
                     'value': np.random.randn(len(di)) + (100 * i)},
                    index=di)
       for i, di in enumerate(date_indices)]

bigframe = pd.concat(dfs)

I would like to split up bigframe into 7 "bins" and explore the properties within each bin. I'd like the bins to be sequential, and for each to contain (approximately) the same number of data points.

Ideally, I'd be able to do something like:

means = bigframe.groupby(pd.cut(bigframe.index, 7))['value'].mean()

I think I can get something like that by doing:

bigframe['dt_value'] = bigframe.index.map(lambda x: x.value)
means = bigframe.groupby(pd.cut(bigframe['dt_value'], 7))['value'].mean()

but that doesn't seem particularly elegant, and it winds up with bins labelled by the raw integer datetime values instead of readable datetimes.

@hayd
Contributor

hayd commented Mar 10, 2014

Could you post an example input and output? How would this be different from a resample?

@8one6
Author

8one6 commented Mar 10, 2014

For example, say I have two columns of data: the first is timestamps and the second is temperature. The data is sampled unevenly. I was hoping to use cut to bin the times into 5 categories, each of which contains an equal number of measurements. Is this something resample can handle?

@hayd
Contributor

hayd commented Mar 10, 2014

I don't think resample can do equal-sized bins. Do you mind editing an example into the original question? I think that'll make it clearer.
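
A toy sketch of the distinction (data and names are hypothetical, not from this thread): resample partitions by fixed time width, so with irregular sampling the member counts per bin vary, whereas the ask here is equal counts per bin:

import numpy as np
import pandas as pd

np.random.seed(1)
# 50 irregularly spaced observation times within one day
offsets = np.sort(np.random.randint(0, 24 * 3600, 50))
idx = pd.Timestamp('2014-01-01') + pd.to_timedelta(offsets, unit='s')
temps = pd.Series(20 + np.random.randn(50), index=idx)
print(temps.resample('6h').count())  # fixed-width time bins: counts vary per bin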

@8one6
Author

8one6 commented Mar 11, 2014

Took a shot at it.

@jreback
Contributor

jreback commented Mar 11, 2014

You just want something like this?

In [28]: df = bigframe.reset_index()

In [29]: df.groupby(df.index % 7).apply(lambda x: pd.Series({'start': x['index'].iloc[0], 'end': x['index'].iloc[-1], 'value': x['value'].mean()}))
Out[29]: 
                  end      start       value
0 2014-01-31 23:20:00 2014-01-01  184.452251
1 2014-01-31 23:45:00 2014-01-02  184.793878
2 2014-02-01 00:10:00 2014-01-03  184.790279
3 2014-01-31 21:40:00 2014-01-04  184.704260
4 2014-01-31 22:05:00 2014-01-05  184.999369
5 2014-01-31 22:30:00 2014-01-06  185.152102
6 2014-01-31 22:55:00 2014-01-07  185.076598

[7 rows x 3 columns]

@8one6
Author

8one6 commented Mar 11, 2014

I believe your example only gives the desired behavior when the data is already in chronological order. Part of what I'd hoped to get from cut is that it puts the data in order before binning it. I think if you did a sort before your groupby, this would be exactly the behavior I'm after. My hope was that, since this is exactly what cut does for a living anyway, the functionality could be incorporated into cut itself.
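
A minimal sketch of that "sort first, then partition into equal-count sequential groups" idea, continuing the session above (the 7-way split follows the earlier example; this is an illustration, not pandas API):

# sort chronologically, then assign each row to one of 7 near-equal-count,
# sequential groups by integer position
df = bigframe.sort_index().reset_index()
bin_size = -(-len(df) // 7)  # ceiling division keeps group sizes near-equal
means = df.groupby(df.index // bin_size)['value'].mean()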

@jreback
Contributor

jreback commented Mar 11, 2014

cut is really more for continuous data binning; this is just integer partitioning, which is what you would use groupby for. I think we have an open issue about creating a partition function to do almost exactly this; see #4059.

The problem with cut in this scenario is that you would get potentially really odd dates if you treat the datetime space as continuous-like. I don't think that's a reasonable solution.

@8one6
Author

8one6 commented Mar 11, 2014

I'm not sure I follow. In my example above, the data is intentionally sampled pretty unevenly (a combination of daily, 3-hourly and 25-minutely data). So I would have thought that treating time as a continuous variable to bin was appropriate in that context.

@jreback
Contributor

jreback commented Mar 11, 2014

Here is what I would expect if pd.cut handled datetime64[ns] (except I don't print the left-right bins), but it's the same idea. I don't think this is intuitive.

In [27]: g, bins = pd.cut(bigframe.index.asi8, 7, retbins=True)

In [28]: df = bigframe.groupby(g)['value'].mean()

In [29]: df
Out[29]: 
(1388531721000000000, 1388917114285714176]    184.568407
(1388917114285714176, 1389299828571428608]    185.379917
(1389299828571428608, 1389682542857142784]    184.464207
(1389682542857142784, 1390065257142857216]    185.334258
(1390065257142857216, 1390447971428571392]    184.395181
(1390447971428571392, 1390830685714285824]    185.390527
(1390830685714285824, 1391213400000000000]    184.447259
Name: value, dtype: float64

In [30]: df.index = [pd.Timestamp(int(b)) for b in bins][:7]

In [31]: df
Out[31]: 
2013-12-31 23:15:21              184.568407
2014-01-05 10:18:34.285714176    185.379917
2014-01-09 20:37:08.571428608    184.464207
2014-01-14 06:55:42.857142784    185.334258
2014-01-18 17:14:17.142857216    184.395181
2014-01-23 03:32:51.428571392    185.390527
2014-01-27 13:51:25.714285824    184.447259
Name: value, dtype: float64

@jreback
Contributor

jreback commented Mar 11, 2014

You would need some sort of 'rounding' I think

In [45]: df.index = [pd.Timestamp(int(b / 1e12) * 10**12) for b in bins][:7]

In [46]: df
Out[46]: 
2013-12-31 23:03:20    184.568407
2014-01-05 10:00:00    185.379917
2014-01-09 20:23:20    184.464207
2014-01-14 06:46:40    185.334258
2014-01-18 16:53:20    184.395181
2014-01-23 03:16:40    185.435758
2014-01-27 13:40:00    184.398994
Name: value, dtype: float64

@hayd
Contributor

hayd commented Mar 11, 2014

Perhaps it would help to provide a use case for when you'd want this.

I think I'm with Jeff here; a priori it's a strange result... (even with rounding)

@jreback
Contributor

jreback commented Mar 11, 2014

Though I think this should at least have a better error message...

@jreback jreback added this to the 0.14.0 milestone Mar 11, 2014
@8one6
Author

8one6 commented Mar 11, 2014

I'll see if I can think of another case, but I'd put out there that I think Jeff's two results above are quite intuitive and useful. Put another way: someone said above that cut is generally used to bin continuous variables, so why wouldn't time qualify in an intuitive way? For example, I can imagine wanting to make histograms of time measurements (an experiment where lots of people report the time that they see some event, say). And I might want to report the summary stats for the groups of people who observed the event at similar times. (Maybe I ask lots of people when they saw a lunar eclipse reach maximum coverage, and I also know their latitudes and longitudes, so I want to bin observers based on when they saw the max coverage of the eclipse and then see whether they all live near each other.)

@jreback
Contributor

jreback commented Mar 11, 2014

@8one6 not a problem with this result; maybe we could just automatically round to second precision (in theory you should be able to pass this in). Not hard to do.

Want to do a PR for this then?

It gets you into some juicy internals!
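
A minimal sketch of that second-precision rounding, reusing `bins` from the earlier In [27] example (an illustration, not a proposed pandas API):

# truncate each nanosecond bin edge to whole seconds before relabelling
df.index = [pd.Timestamp(int(b) // 10**9 * 10**9) for b in bins][:7]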

@8one6
Author

8one6 commented Mar 11, 2014

@jreback I've never actually contributed to a major project like this, and am very shaky on Git in general. Best place to start? Best ways to avoid making people angry with potentially-shaky code? I assume I'd need to write a test or two? (Never done that.)

@jreback
Contributor

jreback commented Mar 11, 2014

https://github.com/pydata/pandas/wiki

clone the repo

look for where the tests for cut are, currently in pandas/tests/test_algos.py

devise a simple test or tests that show the new behavior (see the sketch below)

run the tests; they should fail

do a fix

run until the tests pass (and others don't break)

submit a PR

just post here if you need help
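
A hedged sketch of what such a test might look like (the function name and placement are assumptions, not the actual pandas test-suite layout; on 2014-era pandas this would fail, which is the point):

import pandas as pd

def test_cut_datetime():
    # 10 daily timestamps split into 3 equal-width bins; every timestamp
    # should receive a bin code, and all 3 bins should be populated
    idx = pd.date_range('2014-01-01', periods=10, freq='D')
    codes = pd.cut(idx, 3, labels=False)
    assert len(codes) == 10
    assert set(codes) == {0, 1, 2}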

@8one6
Author

8one6 commented Mar 11, 2014

So I was being a bit silly about this. I think I meant qcut everywhere I've written cut in my examples. The goal isn't to get bins of equal extent; it's to get bins with equal numbers of members. So it should be qcut, not cut, right? But I think the same issues apply. Does that sound right?
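
For concreteness, a hedged sketch of the same workaround with qcut, so each of the 7 bins holds (approximately) the same number of rows (variable names follow the earlier examples):

# quantile-based bins on the integer datetime values, then relabel
labels, edges = pd.qcut(bigframe.index.asi8, 7, retbins=True, labels=False)
means = bigframe.groupby(labels)['value'].mean()
means.index = [pd.Timestamp(int(e)) for e in edges[:-1]]  # label by left edge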

@jreback
Contributor

jreback commented Mar 11, 2014

Same idea (and I think the fix will work for both of those functions), e.g. test for both.

@jreback jreback changed the title ENH: Generalize 'cut' to handle datetime input ENH: Generalize cut/melt to handle datetime input Mar 22, 2014
@jreback jreback modified the milestones: 0.15.0, 0.14.0 Apr 6, 2014
@jreback jreback removed this from the 0.16.0 milestone Mar 3, 2015
@jreback jreback modified the milestones: Next Major Release, 0.16.0 Mar 3, 2015
@mroeschke
Member

cut supports datetime data now and is well tested. Closing
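
Per this closing comment, modern pandas handles the originally requested spelling directly; a quick sketch using the setup from the top of the thread (exact Interval label formatting varies by pandas version):

means = bigframe.groupby(pd.cut(bigframe.index, 7))['value'].mean()

pd.qcut accepts datetime input as well, for the equal-count variant discussed above.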
