Documentation for .iloc is misleadingly incomplete #8956

Closed
cswarth opened this issue Dec 1, 2014 · 13 comments
Labels
Docs Indexing Related to indexing on series/frames, not to indexes themselves

@cswarth
Contributor

cswarth commented Dec 1, 2014

The documentation for .iloc at http://pandas.pydata.org/pandas-docs/version/0.15.1/indexing.html#different-choices-for-indexing should mention that .iloc also takes a boolean array instead of insisting that it is "strictly integer position based".

.iloc is strictly integer position based (from 0 to length-1 of the axis), will raise IndexError if an indexer is requested and it is out-of-bounds, except slice indexers which allow out-of-bounds indexing. (this conforms with python/numpy slice semantics). Allowed inputs are:

An integer e.g. 5
A list or array of integers [4, 3, 0]
A slice object with ints 1:7

Unlike .loc immediately above, no mention is made of "A boolean array".
This is incorrect, as later in the same document an example uses .iloc and isin to retrieve elements from a Series. isin returns a list of booleans (well, strictly it is a numpy.ndarray, but converting to a list works as well).
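
A minimal sketch of the behaviour described above, assuming a small made-up Series (the data and variable names are illustrative, not taken from the pandas docs): isin yields a boolean numpy.ndarray, and .iloc accepts either that array or the equivalent list, provided it matches the length of the axis.

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption for this sketch).
s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# Index.isin returns a boolean numpy.ndarray, one entry per label.
mask = s.index.isin(["a", "c"])

# .iloc accepts the boolean array as well as an equivalent list of bools.
by_array = s.iloc[mask]
by_list = s.iloc[mask.tolist()]
print(by_array)
```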

@jreback
Contributor

jreback commented Dec 1, 2014

yep, that was an oversight; iloc can take a boolean array

want to do a PR to add a doc example and update?

@jreback jreback added Docs Indexing Related to indexing on series/frames, not to indexes themselves labels Dec 1, 2014
@jreback jreback added this to the 0.16.0 milestone Dec 1, 2014
@cswarth
Contributor Author

cswarth commented Dec 2, 2014

Sure, I'll give it a shot.

@TomAugspurger
Contributor

Is .iloc ever necessary to index with a boolean? In the example @cswarth mentions from the isin docs, s_mi.iloc[s_mi.index.isin(['a', 'c', 'e'], level=1)] should be the same as s_mi[s_mi.index.isin(['a', 'c', 'e'], level=1)], i.e. without the .iloc, I think.
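
A rough check of that equivalence, assuming a hypothetical two-level index standing in for the s_mi of the isin docs (the real example's data is not reproduced here):

```python
import pandas as pd

# Hypothetical two-level index; level 1 holds the letters being filtered.
index = pd.MultiIndex.from_product([[0, 1], ["a", "b", "c", "d"]])
s_mi = pd.Series(range(8), index=index)

# Boolean mask over the rows, computed from level 1 only.
mask = s_mi.index.isin(["a", "c", "e"], level=1)

with_iloc = s_mi.iloc[mask]
without_iloc = s_mi[mask]
print(with_iloc.equals(without_iloc))
```

With a full-length boolean mask both spellings select the same rows in this sketch, which is the point being made above.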

@jorisvandenbossche
Member

Yes, for the example here, the iloc is not needed / not the logical choice of indexer. So that example could be adapted.

But in general, I think the use case where this can come in handy is if you want to index multiple axes with mixed positional/boolean indexing. E.g., if you want to combine iloc on the columns with boolean on the rows:

df.iloc[df.index.isin(...), 0]
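
Filled in with made-up data, the mixed case might look like the sketch below. The frame, its labels, and the values passed to isin are assumptions for illustration; the elided arguments in the line above are left as they are.

```python
import pandas as pd

# Illustrative frame (an assumption for this sketch).
df = pd.DataFrame(
    {"x": [1, 2, 3], "y": [4, 5, 6]},
    index=["a", "b", "c"],
)

# Boolean mask on the rows, integer position on the columns.
rows = df.index.isin(["a", "c"])
first_col = df.iloc[rows, 0]
print(first_col)
```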

@cswarth
Contributor Author

cswarth commented Dec 2, 2014

While adding a brief mention of boolean indexers to .iloc I realized I didn't really know how they worked in all cases. For example, what happens when a boolean indexer has a different shape than the axis it is indexing? What happens if the index is too short? Or too long?

Numpy is explicit, "Boolean arrays must be of the same shape as the initial dimensions of the array being indexed", but the examples below suggest that is not true. I actually don't know what the hell numpy is doing in this case - it looks totally unexpected.

pandas.Series behaves rationally, in my opinion. But is this silent 'leftmost-favored' behavior explicitly documented? If not, should it be?

import numpy as np
from pandas import Series
x = np.arange(100, 1, -1)
print(x[[True, False, True, True, True]])
[ 99 100  99  99  99]
-c:2: FutureWarning: in the future, boolean array-likes will be handled as a boolean array index
s_mi = Series(np.arange(100, 1, -1))
print(s_mi.iloc[[True, False, True, True, True]])
0    100
2     98
3     97
4     96
dtype: int64

@cswarth
Contributor Author

cswarth commented Dec 2, 2014

NVM the numpy example, I realize my example was bogus b/c it passed an array as the index. This example works as documented.

import numpy as np
x=np.arange(100,1,-1)
print(x[True, False,  True,  True, True])
IndexError: too many indices for array

@immerrr
Contributor

immerrr commented Dec 2, 2014

On Tue, Dec 2, 2014 at 9:11 PM, cswarth wrote:

Numpy is explicit
(http://docs.scipy.org/doc/numpy/user/basics.indexing.html#boolean-or-mask-index-arrays),
"Boolean arrays must be of the same shape as the initial dimensions of the
array being indexed", but the examples below suggest that is not true. I
actually don't know what the hell numpy is doing in this case - it looks
totally unexpected.

Numpy only allows ndarrays of bool as masks; your snippet used a list of
bool. This is a common source of confusion, e.g.
http://stackoverflow.com/questions/17779468/numpy-indexing-with-a-one-dimensional-boolean-array
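
One way to sidestep the list-versus-array ambiguity is to build the mask as an explicit ndarray of dtype bool that covers the whole axis. This is only a sketch; how a bare list of bools is treated has differed between NumPy versions, as the FutureWarning earlier in the thread suggests.

```python
import numpy as np

x = np.arange(100, 1, -1)

# Build an explicit boolean ndarray with one entry per element of x.
mask = np.zeros(len(x), dtype=bool)
mask[[0, 2, 3, 4]] = True

# A boolean ndarray of matching length is treated as a mask.
selected = x[mask]
print(selected)
```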

@jorisvandenbossche
Member

I rather find the pandas behaviour a bit strange here. Shouldn't the array have the same length as the Series with boolean indexing?

@jreback
Contributor

jreback commented Dec 3, 2014

@jorisvandenbossche what do you find strange? the boolean array must be equal to the length of the axis

@jreback jreback modified the milestones: 0.15.2, 0.16.0 Dec 3, 2014
@cswarth
Contributor Author

cswarth commented Dec 3, 2014

I'm probably doing something wrong, but I don't think that's true.

import numpy as np
from pandas import Series
s = Series(np.arange(5,1,-1))
print(s)
print(s.iloc[[True, False, True]])
0    5
1    4
2    3
3    2
dtype: int64
0    5
2    3
dtype: int64

@jreback
Contributor

jreback commented Dec 3, 2014

@cswarth technically that's ok but I think it should really validate the len(boolean indexer) == len(series).

numpy actually coerces the indexer which is odd (e.g. this is like indexing [1,0,1])

In [17]: s.values[[True,False,True]]
Out[17]: array([4, 5, 4])
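
The length check suggested above could be sketched roughly as follows. Note that checked_bool_iloc is a hypothetical helper written for illustration, not a pandas function, and it does not claim to show how pandas itself validates indexers.

```python
import numpy as np
import pandas as pd


def checked_bool_iloc(series, indexer):
    """Hypothetical helper: apply a boolean mask via .iloc after a length check."""
    # Convert to a boolean ndarray so lists and arrays are handled alike.
    mask = np.asarray(indexer, dtype=bool)
    if len(mask) != len(series):
        raise IndexError(
            f"boolean indexer has length {len(mask)}, expected {len(series)}"
        )
    return series.iloc[mask]


s = pd.Series(np.arange(5, 1, -1))
print(checked_bool_iloc(s, [True, False, True, False]))
```

In this sketch a three-element indexer such as [True, False, True] raises IndexError instead of silently selecting from the leftmost entries.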

@jorisvandenbossche
Member

@jorisvandenbossche what do you find strange? the boolean array must be equal to the length of the axis

exactly that, the boolean array apparently does not need to be of the same length

numpy actually coerces the indexer which is odd (e.g. this is like indexing [1,0,1])

That is what @immerrr said above (#8956 (comment)), numpy only does boolean indexing with arrays, not with lists (so with the array, it is working in the same way as in pandas):

In [58]: s.values[[True,False,True]]
Out[58]: array([4, 5, 4])

In [59]: s.values[np.array([True,False,True])]
Out[59]: array([5, 3])

In [60]: s[[True, False, True]]
Out[60]:
0    5
2    3
dtype: int32

UPDATE: ah, you opened a new issue for that: #8976

@jreback
Contributor

jreback commented Dec 4, 2014

closed by #8970

@jreback jreback closed this as completed Dec 4, 2014