
BUG: DataFrame.loc[i:j] works differently for DatetimeIndex #13869


Closed
jzwinck opened this issue Aug 1, 2016 · 7 comments
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves

Comments

@jzwinck
Contributor

jzwinck commented Aug 1, 2016

This code all "works," but in a surprising way:

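import numpy as np
import pandas as pd
t0 = 1234567890123400000  # nanoseconds since the epoch
df1 = pd.DataFrame(index=pd.DatetimeIndex([t0, t0 + 1000, t0 + 2000, t0 + 3000]))
df2 = pd.DataFrame(index=range(4))
df1.loc[:2, 'a'] = np.arange(2)  # accepts 2 values: [:2] acts like a positional slice
df2.loc[:2, 'a'] = np.arange(3)  # needs 3 values: label 2 is included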

We create df1 with a DatetimeIndex and df2 with an integer index. We then create a new column a in each, using .loc[] with an integer slice. With df1 we get the intuitive, normal Python slice behavior where [:2] means "the first 2 elements", whereas with df2 we get the bizarre (but documented) DataFrame.loc slice behavior where [:2] means "elements up to index 2, inclusive."

I don't see why the type of index the DataFrame has should affect the semantics of slicing with .loc[]. I happen to think the exclusive-end behavior is correct in all cases, though apparently Pandas has decided (or at least documented) that .loc[] slicing is inclusive (in which case the DatetimeIndex case looks like a bug).
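To make the two slice conventions concrete, here is a minimal sketch on an integer-indexed frame. The frame and its values are made up for illustration; only the documented .loc (label-based, end included) and .iloc (position-based, end excluded) behavior is assumed.

```python
import pandas as pd

# Hypothetical frame with an integer index 0..3, for illustration only.
df = pd.DataFrame({"a": [10, 20, 30, 40]}, index=range(4))

by_label = df.loc[:2]      # label-based: rows with labels 0, 1 and 2
by_position = df.iloc[:2]  # position-based: the first two rows only

print(len(by_label), len(by_position))
```

On an integer index the two slices look alike but select a different number of rows, which is the mismatch described above.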

Also note that trying to "read" df1.loc[:2, 'a'] (e.g. to print it) fails, saying:

TypeError: cannot do slice indexing on <class 'pandas.tseries.index.DatetimeIndex'>
    with these indexers [2] of <class 'int'>

It's sort of strange that you can assign to this slice but not read from it.

I'm using Pandas 0.18.1.

@jreback
Contributor

jreback commented Aug 1, 2016

this shouldn't allow the assignment at all. a slice of :2 is not valid for a datetimeindex.

This is a bit deep in the code.

some of the setting validation is not nearly as well tested as the getting logic. You are welcome to take a stab.

jreback added the Bug and Indexing labels Aug 1, 2016
jreback added this to the Next Major Release milestone Aug 1, 2016
@jzwinck
Contributor Author

jzwinck commented Aug 1, 2016

@jreback I thought you might say that. I am OK with it if you're sure this shouldn't be allowed, though I do find the syntax convenient so if you have another concise way to do the same thing in a more supported way I would love to know what that is.

This is not at all an area I'm familiar with so I'm not planning to try to fix it myself. Just wanted to be clear so nobody waits for me on this.

@jreback
Contributor

jreback commented Aug 3, 2016

well, you can use partial string indexing and .loc, or

pandas by definition aligns things, you need to work with it, rather than fight it.

In [10]: df1['a'] = Series(np.arange(2),index=df1.index[:2])

In [11]: df1
Out[11]: 
                              a
2009-02-13 23:31:30.123400  0.0
2009-02-13 23:31:30.123401  1.0
2009-02-13 23:31:30.123402  NaN
2009-02-13 23:31:30.123403  NaN

Or this

In [17]: df1['a'] = np.array([0,1,np.nan,np.nan])

In [18]: df1
Out[18]: 
                              a
2009-02-13 23:31:30.123400  0.0
2009-02-13 23:31:30.123401  1.0
2009-02-13 23:31:30.123402  NaN
2009-02-13 23:31:30.123403  NaN

using list-like indexers is not providing any aligning information, so you MUST fully fill out the array. Generally using Series is much more convenient.

@jzwinck
Contributor Author

jzwinck commented Aug 3, 2016

@jreback Thanks, your example using Series is the best alternative I have seen. But it is not equivalent to what I was doing, because it overwrites the entire column, even for indexes I did not want to overwrite. For example, if you run my original code and then:

df1['a'] = pd.Series(np.arange(2), index=df1.index[2:]) # last 2, not first 2

You end up with [NaN, NaN, 0, 1]--it dropped the existing values from the first two rows!
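A self-contained sketch of that overwrite, assuming a frame that already holds [0, 1, NaN, NaN] in column a. The index is rebuilt here with pd.to_datetime, which is my own choice for the illustration rather than something from the thread.

```python
import numpy as np
import pandas as pd

t0 = 1234567890123400000  # nanoseconds since the epoch, as in the original report
idx = pd.to_datetime([t0, t0 + 1000, t0 + 2000, t0 + 3000], unit="ns")

# Assumed starting state: the first two rows are already filled in.
df = pd.DataFrame({"a": [0.0, 1.0, np.nan, np.nan]}, index=idx)

# Assigning a Series replaces the whole column; rows whose labels are
# missing from the Series become NaN, so the first two values are lost.
df["a"] = pd.Series(np.arange(2), index=idx[2:])

print(df["a"].tolist())
```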

Using Series also fails if there are duplicate values in the index ("cannot reindex from a duplicate axis"), whereas integer-based slicing is not sensitive to that.

This sort of thing comes up somewhat frequently, and I think ultimately the problem arises from the fact that loc and iloc require that all the indexers are either labels or integers. In this example what I really want is semantically more like one of these hypothetical syntaxes:

df1.iloc[:2].loc['b'] =  np.arange(2)
df1.xloc[:2, 'b'] =  np.arange(2) # like iloc for first argument, loc for second

That is, I often want a way to use integer-based indexing of rows, but name-based indexing of columns. For those of us who use Pandas in concert with other NumPy-based libraries, this use case is fairly common.

The best working alternative I have come up with so far is to drop down to NumPy like this:

df1['a'].values[:2] = np.arange(2)

If the column exists already, this does what I want. But I imagine you don't intend for people to drop down to NumPy to do something simple like this.

@shoyer
Member

shoyer commented Aug 3, 2016

@jzwinck You might try the ix indexer, which does mixed labeled/integer indexing (though with some problematic fallback logic).

I usually prefer something more explicit, e.g., df.loc[df.index[:2], 'b'] = np.arange(2). You can also use methods like Index.get_loc to go from labels to integers.
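A rough sketch of both explicit routes, assuming a unique index and a column b that already exists (both are assumptions of this example, not claims about every case). The first converts positions to labels with df.index[:2]; the second converts the column label to a position with Index.get_loc and then uses .iloc.

```python
import numpy as np
import pandas as pd

t0 = 1234567890123400000  # nanoseconds since the epoch
idx = pd.to_datetime([t0, t0 + 1000, t0 + 2000, t0 + 3000], unit="ns")
df = pd.DataFrame({"b": np.full(4, np.nan)}, index=idx)

# Route 1: positions -> labels, then label-based assignment with .loc.
df.loc[df.index[:2], "b"] = np.arange(2)

# Route 2: column label -> position, then purely positional .iloc.
col = df.columns.get_loc("b")
df.iloc[2:, col] = np.arange(2) + 10

print(df["b"].tolist())
```

Both assignments touch only the selected rows, so values written by the first are still present after the second.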

@jreback
Contributor

jreback commented Aug 3, 2016

@shoyer is right. This is already a well-defined, very convenient way to do it in pandas. Mixing position-based and label-based indexing must be very explicit, on purpose.

@jzwinck
Contributor Author

jzwinck commented Aug 4, 2016

@shoyer and @jreback Thanks for the idea to use df.loc[df.index[:2]]. Unfortunately that doesn't work in some cases--I assume it's a bug so I filed #13908.

@jzwinck jzwinck closed this as completed May 13, 2019