Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
Internally, categories already distinguish between different subtypes. Consider, for example:
import pandas as pd
s = pd.Series(["foo", "bar"], dtype=object)
print(s.astype("category"))
print(s.astype("string").astype("category"))
In the first case, the resulting series' .dtype.categories is Index(['bar', 'foo'], dtype='object'); in the latter case it is Index(['bar', 'foo'], dtype='string').
However, handling these subtypes is currently a bit awkward. The proposed features are quality-of-life improvements for working with this kind of data, mainly:
- Allow direct casting to categories of a specific subtype via .astype("category[<type>]")
- Ensure that subtypes round-trip when serializing to formats that support categorical types.
The following round trip through parquet illustrates the second point:
import pandas as pd
df = pd.DataFrame({"col": ["foo", "bar"]}, dtype=object)
df = df.astype("string").astype("category")
df.to_parquet("test.parquet")
print(df["col"].dtype.categories)  # categories before writing
df = pd.read_parquet("test.parquet")
print(df["col"].dtype.categories)  # categories after reading back
Feature Description
- Make CategoricalDtype a typing.Generic parametrized by a scalar type. (⇝ relevant for pandas-stubs)
  - The fallback should be category[object] (cf. Defaults for Generics? python/mypy#4236 (comment))
- Allow type casting via .astype("category[<type>]")
  - series.astype("category[string]") should behave equivalently to series.astype("string").astype("category")
  - Allow usage in constructor methods such as read_csv(file, dtype=...) and DataFrame(..., dtype=...)
- Ensure category subtypes are maintained through serialization and loading
  - In particular, when reading parquet/feather format. (⇝ interoperability with pyarrow's dictionary type)
- Allow type checking via series.dtype == "category[string]".
  - Possibly series.dtype == "string" and pd.api.types.is_string_dtype(series) should evaluate to True if the dtype is category[string], since category acts only as a kind of wrapper and things like the Series.str accessor are still applicable. (needs discussion)
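Until such a spelling exists, the casting and type-checking behaviour described above can be approximated in user code. The following is only a sketch, not part of pandas: parse_category_spec, astype_category, and is_category_of are hypothetical helper names, and "category[<type>]" is the syntax proposed in this issue rather than something pandas accepts today. It relies solely on the two-step cast that already works.

```python
import re

import pandas as pd

# Hypothetical spec format taken from this proposal: "category[<type>]".
_CATEGORY_SPEC = re.compile(r"^category(?:\[(?P<subtype>.+)\])?$")


def parse_category_spec(spec: str) -> str:
    """Return the subtype named in a 'category[<type>]' string.

    A bare 'category' falls back to 'object', as suggested in the proposal.
    """
    match = _CATEGORY_SPEC.match(spec)
    if match is None:
        raise ValueError(f"not a category spec: {spec!r}")
    return match.group("subtype") or "object"


def astype_category(series: pd.Series, spec: str) -> pd.Series:
    """Emulate series.astype('category[<type>]') with the existing two-step cast."""
    subtype = parse_category_spec(spec)
    return series.astype(subtype).astype("category")


def is_category_of(series: pd.Series, spec: str) -> bool:
    """Emulate series.dtype == 'category[<type>]' by inspecting the categories' dtype."""
    dtype = series.dtype
    if not isinstance(dtype, pd.CategoricalDtype):
        return False
    subtype = parse_category_spec(spec)
    return bool(pd.api.types.is_dtype_equal(dtype.categories.dtype, subtype))


s = pd.Series(["foo", "bar"], dtype=object)
converted = astype_category(s, "category[string]")
print(converted.dtype.categories)
print(is_category_of(converted, "category[string]"))
```

A built-in implementation would presumably live in CategoricalDtype.construct_from_string rather than in free functions; the helpers here only illustrate the intended semantics.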
Alternative Solutions
The existing approach is to cast manually via .astype(<type>).astype("category") whenever necessary, or to explicitly construct an instance of CategoricalDtype, which, however, requires a priori knowledge of the categories.
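For comparison, the two existing approaches might look as follows. This is a minimal sketch with made-up sample data; the variable names are illustrative only. Note that the explicit variant has to list every category up front.

```python
import pandas as pd

s = pd.Series(["foo", "bar", "foo"], dtype=object)
as_string = s.astype("string")

# Approach 1: two-step cast; the categories are inferred from the data.
two_step = as_string.astype("category")

# Approach 2: explicit CategoricalDtype; the categories must be known in advance
# and are passed as an Index that already carries the desired subtype.
explicit_dtype = pd.CategoricalDtype(
    categories=pd.Index(["bar", "foo"], dtype="string")
)
explicit = as_string.astype(explicit_dtype)

print(two_step.dtype.categories)
print(explicit.dtype.categories)
```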
Additional Context
Allowing direct casting to category[<type>] when using read_csv should bring minor performance benefits.
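As a point of reference, the current workaround with read_csv needs two passes over the column: one to parse it as strings and one to convert it to a categorical. The sketch below uses an in-memory CSV with made-up data; a spelling such as dtype={"col": "category[string]"} is the proposed one and is not assumed to work today.

```python
import io

import pandas as pd

csv_text = "col\nfoo\nbar\nfoo\n"

# Pass 1: parse the column with the desired subtype.
df = pd.read_csv(io.StringIO(csv_text), dtype={"col": "string"})

# Pass 2: convert the parsed column to a categorical, keeping the string subtype.
df["col"] = df["col"].astype("category")

print(df["col"].dtype.categories)
```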