ENH: Implement DataFrame.select #61527
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4535,6 +4535,127 @@ def _get_item(self, item: Hashable) -> Series: | |
# ---------------------------------------------------------------------- | ||
# Unsorted | ||
|
||
def select(self, *args): | ||
""" | ||
Select a subset of columns from the DataFrame. | ||
|
||
Select can be used to return a DataFrame with some specific columns. | ||
This can be used to select a subset of the columns, as well as to return a | ||
DataFrame with the columns sorted in a specific order. | ||
|
||
Parameters | ||
---------- | ||
*args : hashable or a single list arg of hashable | ||
The names of the columns to return. In general this will be strings, | ||
but pandas supports other types of column names, if they are hashable. | ||
If only one argument of type list is provided, the elements of the | ||
list will be considered the names of the columns to be returned. | ||
|
||
Returns | ||
------- | ||
DataFrame | ||
The DataFrame with the selected columns. | ||
|
||
See Also | ||
-------- | ||
DataFrame.filter : To return a subset of rows, instead of a subset of columns. | ||
|
||
Examples | ||
-------- | ||
>>> df = pd.DataFrame( | ||
... { | ||
... "first_name": ["John", "Alice", "Bob"], | ||
... "last_name": ["Smith", "Cooper", "Marley"], | ||
... "age": [61, 22, 35], | ||
... } | ||
... ) | ||
|
||
Select a subset of columns: | ||
|
||
>>> df.select("first_name", "age") | ||
first_name age | ||
0 John 61 | ||
1 Alice 22 | ||
2 Bob 35 | ||
|
||
A list can also be used to specify the names of the columns to return: | ||
|
||
>>> df.select(["last_name", "age"]) | ||
last_name age | ||
0 Smith 61 | ||
1 Cooper 22 | ||
2 Marley 35 | ||
|
||
Selecting with a pattern can be done with Python expressions: | ||
|
||
>>> df.select([col for col in df.columns if col.endswith("_name")]) | ||
first_name last_name | ||
0 John Smith | ||
1 Alice Cooper | ||
2 Bob Marley | ||
|
||
All columns can be selected, but in a different order: | ||
|
||
>>> df.select("last_name", "first_name", "age") | ||
last_name first_name age | ||
0 Smith John 61 | ||
1 Cooper Alice 22 | ||
2 Marley Bob 35 | ||
|
||
Note that a DataFrame is always returned. If a single column is requested, a | ||
DataFrame with a single column is returned, not a Series: | ||
|
||
>>> df.select("age") | ||
age | ||
0 61 | ||
1 22 | ||
2 35 | ||
|
||
The ``select`` method also works when columns are a ``MultiIndex``: | ||
|
||
>>> df = pd.DataFrame( | ||
... [("John", "Smith", 61), ("Alice", "Cooper", 22), ("Bob", "Marley", 35)], | ||
... columns=pd.MultiIndex.from_tuples( | ||
... [("names", "first_name"), ("names", "last_name"), ("other", "age")] | ||
... ), | ||
... ) | ||
|
||
If column names are provided, they will select from the first level of | ||
the ``MultiIndex``: | ||
|
||
>>> df.select("names") | ||
names | ||
first_name last_name | ||
0 John Smith | ||
1 Alice Cooper | ||
2 Bob Marley | ||
|
||
To select from multiple or all levels, tuples can be used: | ||
|
||
>>> df.select(("names", "last_name"), ("other", "age")) | ||
Is it worth also showing the list variant of this, i.e.,

I gave this a try, but personally I don't think it adds too much value, as it's already explained in the parameters, and in the second example that this is possible. So, it really felt like repeating this already complex example for little gain, causing more confusion than adding value.

IMHO, I think it also shows that you can pass a list of tuples, just like in the |
||
names other | ||
last_name age | ||
0 Smith 61 | ||
1 Cooper 22 | ||
2 Marley 35 | ||
""" | ||
if args and isinstance(args[0], list): | ||
if len(args) == 1: | ||
columns = args[0] | ||
else: | ||
raise ValueError( | ||
"`DataFrame.select` supports individual columns " | ||
"`df.select('col1', 'col2',...)` or a list " | ||
"`df.select(['col1', 'col2',...])`, but not both. " | ||
"You can unpack the list if you have a mix: " | ||
"`df.select(*['col1', 'col2'], 'col3')`." | ||
) | ||
else: | ||
columns = list(args) | ||
|
||
indexer = self.columns._get_indexer_strict(columns, "columns")[1] | ||
return self.take(indexer, axis=1) | ||
|
||
@overload | ||
def query( | ||
self, | ||
|
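The argument handling in the `select` body above can be tried against a released pandas version with a stand-alone helper. This is only a minimal sketch: `select_columns` is a hypothetical name, the error message is shortened, and public label-based indexing with `.loc` stands in for the private `_get_indexer_strict` plus `take` pair from the diff, so edge cases may behave differently from the method in this PR.

```python
import pandas as pd


def select_columns(df: pd.DataFrame, *args) -> pd.DataFrame:
    """Hypothetical stand-alone analogue of the ``select`` method in the diff."""
    if args and isinstance(args[0], list):
        # A list is only accepted when it is the single argument.
        if len(args) != 1:
            raise ValueError(
                "pass individual column names or a single list, not both"
            )
        columns = args[0]
    else:
        columns = list(args)
    # Assumption: ``.loc`` replaces the private indexer used in the diff.
    return df.loc[:, columns]


df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})

subset = select_columns(df, "a", "c")       # individual names
as_list = select_columns(df, ["a", "c"])    # a single list
single = select_columns(df, "a")            # still a DataFrame, not a Series
nothing = select_columns(df)                # no columns requested
repeated = select_columns(df, "a", "d", "a")

try:
    select_columns(df, ["a", "c"], "b")     # mixing a list with names
    mixed_error = None
except ValueError as exc:
    mixed_error = exc

print(subset.columns.tolist(), single.shape, nothing.empty)
```

The helper mirrors the branches of the method: a single list is used as-is, several positional names are collected into a list, and a list combined with further arguments is rejected.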
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,98 @@ | ||
import pytest | ||
|
||
import pandas as pd | ||
from pandas import DataFrame | ||
import pandas._testing as tm | ||
|
||
|
||
@pytest.fixture | ||
def regular_df(): | ||
return DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]}) | ||
|
||
|
||
@pytest.fixture | ||
def multiindex_df(): | ||
return DataFrame( | ||
[(0, 2, 4), (1, 3, 5)], | ||
columns=pd.MultiIndex.from_tuples([("A", "c"), ("A", "d"), ("B", "e")]), | ||
) | ||
|
||
|
||
class TestSelect: | ||
def test_select_subset_cols(self, regular_df): | ||
expected = DataFrame({"a": [1, 2], "c": [5, 6]}) | ||
Why not use

I don't want the test to fail for changes in
It can make sense what you say if we think that what I'm testing is that both select and [] behave the same. But I see it as testing that select does what I want it to do, regardless of [].

I see your point. I could go either way on this |
||
result = regular_df.select("a", "c") | ||
tm.assert_frame_equal(result, expected) | ||
|
||
def test_single_value(self, regular_df): | ||
expected = DataFrame({"a": [1, 2]}) | ||
result = regular_df.select("a") | ||
assert isinstance(result, DataFrame) | ||
tm.assert_frame_equal(result, expected) | ||
|
||
def test_select_change_order(self, regular_df): | ||
expected = DataFrame({"b": [3, 4], "d": [7, 8], "a": [1, 2], "c": [5, 6]}) | ||
result = regular_df.select("b", "d", "a", "c") | ||
tm.assert_frame_equal(result, expected) | ||
|
||
def test_select_none(self, regular_df): | ||
result = regular_df.select() | ||
assert result.empty | ||
Dr-Irv marked this conversation as resolved.
|
||
|
||
def test_select_duplicated(self, regular_df): | ||
expected = ["a", "d", "a"] | ||
result = regular_df.select("a", "d", "a") | ||
assert result.columns.tolist() == expected | ||
|
||
def test_select_single_list(self, regular_df): | ||
expected = DataFrame({"a": [1, 2], "c": [5, 6]}) | ||
result = regular_df.select(["a", "c"]) | ||
tm.assert_frame_equal(result, expected) | ||
|
||
def test_select_list_and_string(self, regular_df): | ||
with pytest.raises(ValueError, match="supports individual columns"): | ||
regular_df.select(["a", "c"], "b") | ||
|
||
def test_select_missing(self, regular_df): | ||
with pytest.raises(KeyError, match=r"None of .* are in the \[columns\]"): | ||
regular_df.select("z") | ||
|
||
def test_select_not_hashable(self, regular_df): | ||
with pytest.raises(TypeError, match="unhashable type"): | ||
regular_df.select(set()) | ||
|
||
def test_select_multiindex_one_level(self, multiindex_df): | ||
expected = DataFrame( | ||
[(0, 2), (1, 3)], | ||
columns=pd.MultiIndex.from_tuples([("A", "c"), ("A", "d")]), | ||
) | ||
result = multiindex_df.select("A") | ||
tm.assert_frame_equal(result, expected) | ||
|
||
def test_select_multiindex_single_column(self, multiindex_df): | ||
expected = DataFrame( | ||
[(2,), (3,)], columns=pd.MultiIndex.from_tuples([("A", "d")]) | ||
) | ||
result = multiindex_df.select(("A", "d")) | ||
assert isinstance(result, DataFrame) | ||
tm.assert_frame_equal(result, expected) | ||
|
||
def test_select_multiindex_multiple_columns(self, multiindex_df): | ||
expected = DataFrame( | ||
[(0, 4), (1, 5)], | ||
columns=pd.MultiIndex.from_tuples([("A", "c"), ("B", "e")]), | ||
) | ||
result = multiindex_df.select(("A", "c"), ("B", "e")) | ||
tm.assert_frame_equal(result, expected) | ||
|
||
def test_select_multiindex_multiple_columns_as_list(self, multiindex_df): | ||
expected = DataFrame( | ||
[(0, 4), (1, 5)], | ||
columns=pd.MultiIndex.from_tuples([("A", "c"), ("B", "e")]), | ||
) | ||
result = multiindex_df.select([("A", "c"), ("B", "e")]) | ||
tm.assert_frame_equal(result, expected) | ||
|
||
def test_select_multiindex_missing(self, multiindex_df): | ||
with pytest.raises(KeyError, match="not in index"): | ||
multiindex_df.select("Z") |

One issue here is that it is then possible to do `df.select()`, since when you specify `*args`, you don't have to specify any arguments. Maybe change the API to this:
This then requires the first argument, which is either a hashable or a list, and the arguments after that (if provided) have to also be hashables.
This also allows better type checking for users.

That's a good point. I was giving it a try, but after checking in detail this doesn't seem to be a good idea. What about this:
Given that `df[[]]` returns an empty dataframe, I think the above example should also return an empty dataframe. And I don't think `df.select(*my_list)` should raise a `TypeError` when `df.select(my_list)` doesn't (in the case of an empty list).
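
The consistency concern in this exchange can be illustrated without pandas. The sketch below is an assumption-laden stand-in, not the snippet proposed in the review (which is not preserved in this page): `select_required_first` and `select_var_args` are hypothetical functions that only return the list of requested names, one with a mandatory first parameter and one with plain `*args` as in the diff.

```python
from __future__ import annotations

from collections.abc import Hashable


def select_required_first(
    first: Hashable | list[Hashable], *rest: Hashable
) -> list[Hashable]:
    """Hypothetical signature with a mandatory first argument."""
    if isinstance(first, list):
        if rest:
            raise ValueError("pass individual names or a single list, not both")
        return first
    return [first, *rest]


def select_var_args(*args: Hashable | list[Hashable]) -> list[Hashable]:
    """Hypothetical stand-in for the ``*args`` signature used in the diff."""
    if args and isinstance(args[0], list):
        if len(args) != 1:
            raise ValueError("pass individual names or a single list, not both")
        return args[0]
    return list(args)


my_list: list[str] = []

# With *args, an empty list and an unpacked empty list give the same result.
var_args_list = select_var_args(my_list)
var_args_unpacked = select_var_args(*my_list)

# With a required first argument, only the unpacked form fails.
required_list = select_required_first(my_list)
try:
    select_required_first(*my_list)
    unpacked_error = None
except TypeError as exc:
    unpacked_error = exc

print(var_args_list, var_args_unpacked, required_list, type(unpacked_error).__name__)
```

Unpacking an empty list passes no arguments at all, so Python itself raises `TypeError` for the missing required parameter before the function body runs.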

From a typing perspective, if you have a `*args` argument, it can't support both lists and "deconstructed" lists.
E.g., if you had `def select(*args: Hashable)` then that says you would have `Hashable` separated by commas.
I think this will do what you need:
Then `select()`, `select([])`, `select("a", "b")` and `select(["a", "b"])` will all pass typing checks, and any combination of lists and `Hashable` would fail.
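
One way to express "either a single list or comma-separated hashables" to a type checker is `typing.overload`. The sketch below is an illustration under assumptions and is not the signature suggested in this comment (that snippet is not preserved here, and a later reply mentions it used a default value, which this sketch does not): `select` is a hypothetical free function that returns the requested names instead of a DataFrame.

```python
from __future__ import annotations

from collections.abc import Hashable
from typing import overload


@overload
def select(columns: list[Hashable], /) -> list[Hashable]:
    ...


@overload
def select(*columns: Hashable) -> list[Hashable]:
    ...


def select(*args):
    """Hypothetical runtime implementation shared by both overloads."""
    if args and isinstance(args[0], list):
        if len(args) != 1:
            raise ValueError("pass individual names or a single list, not both")
        return args[0]
    return list(args)


# The four call shapes discussed in the thread.
no_args = select()
empty_list = select([])
names = select("a", "b")
name_list = select(["a", "b"])

print(no_args, empty_list, names, name_list)
```

Because `list` is not hashable, a call such as `select(["a"], "b")` should match neither overload, so a static checker is expected to flag it, while the runtime check raises `ValueError` for the same call.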

I see your point, and from a typing perspective I fully agree. But I don't want to introduce the inconsistency I mentioned above to have more accurate typing.

But my suggestion would not introduce that inconsistency. You'd be able to do `select()` and `select([])` and they'd both return an empty dataframe.

Ah, sorry, I read it on the phone and didn't see the default value. I don't like that it makes the signature significantly more difficult to understand. But I'm open to it; maybe someone else has an opinion?