ENH: Implement DataFrame.select #61527

Open · wants to merge 2 commits into main

Conversation

@datapythonista (Member)

Based on the feedback in #61522 and on the last devs call, I implemented DataFrame.select in the simplest way. It does work with MultiIndex, but it does not support equivalents to filter(regex=) or filter(like=) directly. I added examples in the docs, so users can do that easily in Python (I can add one for regex if people think it's worth it).
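For instance, a rough equivalent of filter(like=...) can be written with a comprehension under this API (the "price" substring is hypothetical, sketch only):

df.select(*[col for col in df.columns if "price" in col])  # like filter(like="price")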

The examples in the docs and the tests should make the behavior quite clear; feedback welcome.

For context, this is added so we can make DataFrame.filter focus on filtering rows, for example:

df = df.select("name", "age")
df = df.filter(df.age >= 18)

or

(df.select("name", "age")
   .filter(lambda df: df.age >= 18))

CC: @pandas-dev/pandas-core

@datapythonista added the Indexing, API Design, and Enhancement labels on May 31, 2025
Parameters
----------
*args : hashable or tuple of hashable
Contributor

Can we also support a list of hashables?

Member Author

What would be the meaning of a list? Same as a tuple, for MultiIndex?

1 Cooper Alice 22
2 Marley Bob 35
In case the columns are in a list, Python unpacking with star can be used:
Contributor

I'm not a fan of this - I'd prefer just passing the list

Member Author

I'm open to it, and it was my first idea to support both df.select("col1", "col2") and df.select(["col1", "col2"]).

But after checking in more detail, I find the second version not so readable with the double brackets, and for the case when the columns are already in a variable just a star makes it work.

And besides readability, which to me would be enough reason on its own to implement it like this, allowing a list adds a decent amount of complexity. For example, what would you do here? df.select(["col1", "col2"], "col3"). Raise? Return all columns? What about this other case: df.select(["col1", "col2"], ["col3", "col4"])? Same as the previous? And what about df.select("col1", ["col2", "col3"])? Personally, I think we shouldn't have to answer this, or make users guess much. The simplest approach seems to be good enough, if I'm not missing any use case.

Contributor

I'm open to it, and it was my first idea to support both df.select("col1", "col2") and df.select(["col1", "col2"]).

Why not support ONLY a list?

But after checking in more detail, I find the second version not so readable with the double brackets, and for the case when the columns are already in a variable just a star makes it work.

I think this is about consistency in the API. For example, with DataFrame.groupby(), you can't do df.groupby("a", "b"), you have to do df.groupby(["a", "b"]).

And besides readability, which to me would be enough reason on its own to implement it like this, allowing a list adds a decent amount of complexity. It's complexity in the implementation versus consistency of the API.

For example, what would you do here? df.select(["col1", "col2"], "col3"). Raise? Return all columns?

Raise. Only support lists or callables. And a static type checker would see that as invalid.

What about this other case: df.select(["col1", "col2"], ["col3", "col4"])? Same as the previous?

Raise. And a static type checker would see that as invalid.

And what about df.select("col1", ["col2", "col3"])?

Raise. And a static type checker would see that as invalid.

Personally, I think we shouldn't have to answer this, or make users guess much. The simplest approach seems to be good enough, if I'm not missing any use case.

I don't see why a list isn't simple (and consistent), and it allows better type checking, as well as additions to the API in the future, if we should decide to do so.

Member Author

Thanks for the detailed feedback; what you say seems reasonable. To me, there is a significant advantage in readability and usability in using df.select("col1", "col2") over df.select(["col1", "col2"]). I see your point on consistency with groupby, and while the list is still not my favorite option, it does seem reasonable. I'll let others share their opinion too, as in the end there is a trade-off and it is a question of personal preference.

Member

I'm fine with list-only.

@jbrockmendel (Member)

Slight preference for (arg) over (*arg), strong preference for supporting one, not both.

@datapythonista (Member Author)

For reference, PySpark uses *cols, and Polars uses *exprs, but also supports passing a list.

If in the future we implement pandas.col (as discussed for filter), and we'd like to support selecting expressions, the syntax of using **kwargs is, in my opinion, an even bigger improvement over passing a dict than *args is over passing a list. Not just the extra brackets, but also the extra quotes:

df.select(new_col=pd.col("old_col") * 2)

df.select({"new_col": pd.col("old_col") * 2})

I'm personally not convinced by the reasons to use a list so far. Being consistent with groupby doesn't seem so important, since by is way more complicated than what's implemented here, and groupby has many other parameters, which makes a difference. I don't think adding extra args to select is likely, given that neither PySpark nor Polars needed them, and adding them as keyword-only arguments wouldn't be a problem: df.select("col1", "col2", some_flag=True) seems perfectly clear if that's what we want to introduce. I guess there may be differences in type checkers, you surely know better than me @Dr-Irv, but this is a core Python pattern that we and everyone else are using constantly. I don't see how type checkers can have significant problems with a function that just receives an *args argument; maybe you can expand on the implications.

I feel that we're making the API more complicated for users by using a list, and I fail to see the reason for it so far. It'd be great to know more details about the reasons. I'm fine to update the PR if that's what everybody else thinks is best. But it seems like a mistake to me so far.

@Dr-Irv (Contributor) commented Jun 4, 2025

I'm personally not convinced by the reasons to use a list so far. Being consistent with groupby doesn't seem so important, since by is way more complicated than what's implemented here, and groupby has many other parameters, which makes a difference.

It's not just groupby. Here's a list of some other methods where we pass lists/sequences of columns:

  • reset_index()
  • drop() with columns argument
  • sort_values()
  • value_counts()

I feel that we're making the API more complicated for users by using a list, and I fail to see the reason for it so far. It'd be great to know more details about the reasons. I'm fine to update the PR if that's what everybody else thinks is best. But it seems like a mistake to me so far.

My point here is that the convention we have in the rest of the API that requires a sequence/list of column names is to pass a list, and not use *args.

@datapythonista (Member Author)

Thanks @Dr-Irv for putting together the list, I really appreciate it, and it's very helpful.

I'm not convinced, as all of them are different cases. You can do df.sort_values("my_col"), df.value_counts("my_col"), ... The pattern in them, as I understand it, is that the base case expects a single column, but a list is also accepted.

We could also consider this for select: assume that the base case is selecting a single column, allow df.select("my_column"), and for multiple columns use the same pattern as the methods you shared, df.select(["col1", "col2"]). To me personally, though, the base case for select is not selecting one column, unlike sort_values. And if we had statistics on how often select is used with one column, compared to sort_values or the others, I think the proportion would be very different. Also, for a single column I don't think anyone proposed using df.select("col"); my understanding is that everybody's preference is df.select(["col"]). So, I don't think consistency with all these methods is a good reason.

To me, *args literally exists in Python for the use case of select. Not using it seems unpythonic and a poor choice of API, as it makes users' code more complex to write and less readable. Personally I'd also use it for sort_values and the others. But given that those can be considered to take not N parameters but 1 or N, and that they have many other parameters, I'm not as bothered by them as I would be by using .select(arg) instead of .select(*args).

@Dr-Irv (Contributor) commented Jun 4, 2025

We could also consider this for select: assume that the base case is selecting a single column, allow df.select("my_column"), and for multiple columns use the same pattern as the methods you shared, df.select(["col1", "col2"]).

I'd be fine with having one column or a list of columns.

But this also raises the following. Compare the following possibilities:

  • df.select("col") vs. df[["col"]]
  • df.select("col1", "col2") vs. df[["col1", "col2"]]
  • df.select(["col1", "col2"]) vs. df[["col1", "col2"]]
  • df.select(lambda c: func(c)) vs. df[[c for c in df.columns if func(c)]]

It seems to me there is an existing way of selecting a subset of columns that has a concise syntax, so why would I use select?

@datapythonista (Member Author)

select will offer a much better syntax with method chaining, that's the motivation for having both select and filter. I can show you with an example if needed.

@Dr-Irv (Contributor) commented Jun 4, 2025

select will offer a much better syntax with method chaining, that's the motivation for having both select and filter. I can show you with an example if needed.

Maybe I'm missing something, but using df[["col1", "col2"]] in a method chain is equivalent to using df.select("col1", "col2") in a method chain.

@datapythonista (Member Author)

Maybe I'm missing something, but using df[["col1", "col2"]] in a method chain is equivalent to using df.select("col1", "col2") in a method chain.

You're right; it's the inconsistency, with everything else being a method, that I don't think is great:

import pandas

(pandas.read_csv("taxi_data/yellow_tripdata_2015-01.csv")
       .rename(columns={"trip_distance": "trip_miles"})
       [["pickup_longitude", "pickup_latitude", "trip_miles"]]
       .assign(distance_kms=lambda df: df.trip_miles * 1.60934)
       .drop("trip_miles", axis=1)
       .pipe(lambda df: df[df.distance_kms > 10.])
       .to_parquet("long_taxi_pickups.parquet"))

I would rather have this, so everything is a method, and the filter is explicit:

import pandas

(pandas.read_csv("taxi_data/yellow_tripdata_2015-01.csv")
       .rename(columns={"trip_distance": "trip_miles"})
       .select("pickup_longitude", "pickup_latitude", "trip_miles")
       .assign(distance_kms=lambda df: df.trip_miles * 1.60934)
       .drop("trip_miles", axis=1)
       .filter(lambda df: df.distance_kms > 10.)
       .to_parquet("long_taxi_pickups.parquet"))

To me the second option is significantly more intuitive and clear. And if in the future we decide to implement pandas.col(), this would be even better (but that's a discussion for the future):

import pandas

(pandas.read_csv("taxi_data/yellow_tripdata_2015-01.csv")
       .rename(columns={"trip_distance": "trip_miles"})
       .select("pickup_longitude", "pickup_latitude", distance_kms=pandas.col("trip_miles") * 1.60934)
       .filter(pandas.col("distance_kms") > 10.)
       .to_parquet("long_taxi_pickups.parquet"))

I think this is very similar to what PySpark and Polars do; they improved on the pandas syntax, and I think we can improve it in the same way too.

@jbrockmendel (Member)

.select("pickup_longitude", "pickup_latitude", distance_kms=pandas.col("trip_miles") * 1.60934)

I'm not comfortable with having a named-keyword assignment in here. IIUC from the dev call there was another keyword from filter being discussed as being moved to select, so then we'd have both explicit keywords and implicit keyword-assignment being mixed-and-matched.

@Dr-Irv (Contributor) commented Jun 4, 2025

You're right; it's the inconsistency, with everything else being a method, that I don't think is great:

So this is just about a syntax preference. I'd probably continue to use the current methodology of [["pickup_longitude", "pickup_latitude", "trip_miles"]] instead of select() because I'm used to the former.

But if you are trying to get people coming from Polars or PySpark to use pandas instead, then your proposal makes sense, although I think the API should accept either *args OR a list, so that people can use either, depending on what they are used to.

@datapythonista (Member Author)

Thanks for the feedback. This is something that Polars allows and I wanted to show, as it makes the example cleaner. But it's surely not part of this PR, or of any plan for the short term. We should have pandas.col implemented before having this discussion. I personally think we shouldn't introduce keyword arguments to select, even if, as you say, it was discussed. For me the problem with implementing this is that keyword arguments must come after positional arguments, making the user's life difficult when trying to provide the columns in a particular order. I probably wouldn't allow this for that reason. But I think it improves the syntax in the example I wanted to show.

Sorry if I created more confusion with it. As said, select with keyword arguments is unrelated to this PR.

@datapythonista (Member Author)

You are correct @Dr-Irv, and of course df[["col1", "col2"]] will stay. I wouldn't say it's for people coming from PySpark or Polars, even if it surely will make life easier for users using both. I think for someone new to pandas, in particular those who will use method chaining, the new syntax can make it easier to learn. df[foo] is nice in a way, but it's quite overloaded and probably difficult to understand for beginners. Selecting columns being a method like any other DataFrame method helps, in my opinion, make the operation seem less magical and complicated. But the main reason to have this, to me, is that it brings consistency and readability to pipelines with method chaining, as in the example.

@Dr-Irv (Contributor) commented Jun 4, 2025

But the main reason to have this to me is that I think it brings consistency and readability to pipelines with method chaining as in the example.

So if you look at our docs at: https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#astype

There is this example:

dft[["a", "b"]] = dft[["a", "b"]].astype(np.uint8)

If you are encouraging readability, would you suggest we update the docs/tutorials that use a pattern like the above to:

dft[["a", "b"]] = dft.select("a", "b").astype(np.uint8)

in order to encourage the use of select() ?

I'm not sure encouraging that usage is a good idea in that particular example.

Or what about at https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html#grouping , where we show

df.groupby("A")[["C", "D"]].sum()

If you think that select() is "better", then shouldn't you also introduce DataFrameGroupBy.select(), so the above would then read:

df.groupby("A").select("A", "B").sum()

so that the "suggested" syntax is the same for both selecting columns in a DataFrame and a groupby() operation?

I understand your argument about consistency within method chains if select() were available, and if you are suggesting that new users of pandas should use select() instead of the bracketed syntax, then you'd be suggesting a rewrite of some of our user guides (and examples throughout the docs). And you'd have to introduce it for groupby().

On the other hand, if this is just syntactic sugar, with no promotion of its usage as "best practice", it's not a big deal to introduce it, but I don't necessarily agree that using select() versus [["col1", "col2"]] is "easier" to learn when other parts of the API don't have it.

@datapythonista (Member Author)

It's just for method chaining that I think it's better. I don't think select is better in isolation. It's more explicit, but I don't think we should rewrite examples or encourage select. I think df[[...]] is totally fine; it's just for cases like the example I shared that I think it complicates things, and where select will be useful.

Good point about select for groupby, it does seem like a good idea, as a pipeline with method chaining and a groupby will also be clearer and more consistent with select.

@rhshadrach (Member) commented Jun 4, 2025

To me the real issue here is that in method chains you'd like one operation on each line. But when you format your code, you get this:

(
    pandas.read_csv("taxi_data/yellow_tripdata_2015-01.csv")
    .rename(columns={"trip_distance": "trip_miles"})[["pickup_longitude", "pickup_latitude", "trip_miles"]]
    .assign(distance_kms=lambda df: df.trip_miles * 1.60934)
    .drop("trip_miles", axis=1)
    .pipe(lambda df: df[df.distance_kms > 10.])
    .to_parquet("long_taxi_pickups.parquet"))
)

In addition, the expression syntax in Polars and PySpark of allowing both args and kwargs is quite powerful. Instead of having to break up selection and assign into different calls you can do everything all at once.
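For illustration, a sketch of that combined style in Polars (the column names and values here are made up for the example):

import polars as pl

df = pl.DataFrame({"trip_miles": [1.0, 12.5], "pickup_zone": ["A", "B"]})

# One call does both selection and assignment: positional expressions
# pick columns, keyword arguments create derived columns with new names.
out = df.select(
    pl.col("pickup_zone"),
    distance_kms=pl.col("trip_miles") * 1.60934,
)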

I also think select should be on groupby et al.

@datapythonista (Member Author)

We discussed this in today's call, and while not the perfect solution, everybody agreed that allowing both *args and a list is better than the alternatives.

I fully agree that supporting two different APIs is something we would want to avoid. But at the same time, forcing a list makes the API more complicated and less Pythonic (.select("col1", "col2") is clearly simpler and more pythonic than .select(["col1", "col2"])). Also, not allowing *args would be counterintuitive for users of the same method in PySpark and Polars. But not allowing a list makes it counterintuitive for users of other pandas methods such as sort_values and groupby, and the rest listed by Irv above.

So, if there are no objections, I'll move forward with the double API. The good thing is that select is very simple, both in terms of the signature and in terms of implementation. So, even with the small added complexity, I think it'll still be a very simple method.

@jbrockmendel (Member)

I'm -1 on supporting both, but I missed today's meeting and I'm +1 on over-weighting the opinions of people who show up.

Was there discussion of how to disambiguate a sequence versus a column label that is a tuple?

@datapythonista (Member Author)

Thanks @jbrockmendel for the comment. We discussed your point of view, which you previously shared, on wanting just one way. I think everybody in the call was also -1 on supporting both (including myself). It's just that in this particular case not supporting both seems even worse, for the reasons in my previous comment.

We did discuss ambiguity and tuples. I'll make sure to test every possible case, but if I'm not missing anything, the dual API doesn't make things worse.

If a list (not a tuple) is the only parameter, we make it the list of columns. Otherwise (multiple args, or a single one that is not a list) every arg is a column (and a list in any arg will raise, as lists aren't hashable). After this, we are back to the logic of using just one way.
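A minimal sketch of that dispatch logic (a hypothetical helper, not necessarily the exact code in the PR):

def _columns_from_args(args):
    # A single list argument is taken as the full list of column labels.
    if len(args) == 1 and isinstance(args[0], list):
        return args[0]
    # Otherwise each positional argument is one column label; a list
    # mixed with other arguments gets a clear error instead of a
    # confusing "unhashable type" failure later on.
    for arg in args:
        if isinstance(arg, list):
            raise TypeError(
                "a list of columns is only accepted as the sole argument, "
                "e.g. df.select(['a', 'b']) or df.select('a', 'b')"
            )
    return list(args)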

The logic with tuples will be the same as we currently have in df[[]]. IIRC in a MultiIndex they mean multiple levels, otherwise it's just a column with a tuple label.

If you can think of any case that becomes ambiguous or confusing and is not addressed above, please let me know. Happy to reconsider things if we are making this API too complicated. But I think adding a clear error message to .select([...], "one_more_col"), which is in my opinion the least obvious case, should make the method quite intuitive.

@datapythonista (Member Author)

Updated the PR to implement what we agreed. The code and behavior seem very reasonable to me. Feedback welcome.

@TomAugspurger (Contributor)

FWIW, my uninformed opinion matches @jbrockmendel's. Especially when all it costs to avoid that ambiguity is a * to unpack the sequence (example 3 below).

df.select("a", "b")     # 1. select the two columns: "a", "b" 
df.select(["a", "b"])   # 2. select the two columns: "a", "b" 
df.select(*["a", "b"])  # 3. select the two columns: "a", "b" 
df.select(("a", "b"))   # 4. select the *one* column: ("a", "b")

But this isn't the only place pandas treats tuples differently from other containers, and I think the rules to disambiguate things are relatively clear to state, so maybe this isn't a huge deal.

@datapythonista (Member Author)

If I understand correctly, it'd be better not to support example 2 (given that example 1 is considered the best API in isolation, which I think mostly everybody agrees with), so I think there is agreement.

The reason to still support it is that it was considered confusing and inconsistent for users who are used to .groupby(["col1", "col2"]), .sort_values(["col1", "col2"]), .reset_index(["col1", "col2"]) to not support .select(["col1", "col2"]).

Based on that, we have these options:

  1. Only .select("a", "b"): Not ideal for consistency with other methods
  2. Only .select(["a", "b"]): Not ideal as it's a more complex and unpythonic API
  3. Both .select("a", "b") and .select(["a", "b"]): Not ideal, but preferred by people to 1 and 2
  4. Change sort_values... to use .sort_values("a", "b"): Not a crazy idea in my opinion. sort_values, reset_index and others only support the first parameter as positional, so it would be easy to implement a transition to *args over a list. But groupby and others don't limit the positional arguments, making the transition significantly more difficult. So, probably not something we want to do immediately, and if/when we do it, select can have the same deprecation cycle as the rest if we implement 3.

Select a subset of columns from the DataFrame.
Select can be used to return a DataFrame with some specific columns.
This can be used to remove unwanted columns, as well as to return a
Contributor

I wouldn't use the word "remove" here, because it implies the columns are removed from the source DF. So instead of "remove unwanted columns", maybe say "select a subset of columns"

The names of the columns to return. In general this will be strings,
but pandas supports other types of column names, if they are hashable.
If only one argument of type list is provided, the elements of the
list will be considered the named of the columns to be returned
Contributor

Suggested change
list will be considered the named of the columns to be returned
list will be considered the names of the columns to be returned

To select from multiple or all levels, tuples can be used:
>>> df.select(("names", "last_name"), ("other", "age"))
Contributor

Is it worth also showing the list variant of this, i.e., df.select([("names", "last_name"), ("other", "age")])?

@@ -4479,6 +4479,125 @@ def _get_item(self, item: Hashable) -> Series:
# ----------------------------------------------------------------------
# Unsorted

def select(self, *args):
Contributor

One issue here is that it is then possible to do df.select(), since when you specify *args, you don't have to specify any arguments. Maybe change the API to this:

def select(self, arg0: Hashable | list[Hashable], *args: Hashable) -> pd.DataFrame:

This then requires the first argument, which is either a hashable or a list, and the arguments after that (if provided) have to also be hashables.

This also allows better type checking for users.
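A sketch of how that signature behaves, written as a standalone function (illustrative only, not the PR's code):

from __future__ import annotations

from collections.abc import Hashable

def select(arg0: Hashable | list[Hashable], *args: Hashable) -> list[Hashable]:
    # The required first parameter makes a bare select() a TypeError at
    # call time, and static type checkers flag it as invalid as well.
    if isinstance(arg0, list):
        if args:
            raise TypeError("a list of columns must be the only argument")
        return arg0
    return [arg0, *args]

# select()            # TypeError: missing 1 required positional argument
# select("a", "b")    # -> ["a", "b"]
# select(["a", "b"])  # -> ["a", "b"]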


class TestSelect:
def test_select_subset_cols(self, regular_df):
expected = DataFrame({"a": [1, 2], "c": [5, 6]})
Contributor

Why not use expected = df[["a", "c"]] ? (here and in other tests)

Comment on lines +39 to +40
result = regular_df.select()
assert result.empty
Contributor

I don't think we should allow this. See comment above related to slight change in the API.


Successfully merging this pull request may close these issues.

ENH: Implement DataFrame.select to select columns