ENH: Improved CategoricalDtype subtype handling. #48515

Open
@randolf-scholz

Description

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Internally, categorical dtypes already distinguish between different subtypes of their categories. Consider for example:

import pandas as pd
s = pd.Series(["foo", "bar"], dtype=object)
print(s.astype("category"))
print(s.astype("string").astype("category"))

In the first case, the dtype.categories of the result is Index(['bar', 'foo'], dtype='object'); in the second case it is Index(['bar', 'foo'], dtype='string').

However, handling these subtypes is currently somewhat awkward. The proposed features are quality-of-life improvements for working with such data, mainly:

  1. Allow direct casting to categories of specific subtype via .astype("category[<type>]")
  2. Ensure that subtypes round-trip when serializing to formats that support categorical types.

import pandas as pd
df = pd.DataFrame({"col": ["foo", "bar"]}, dtype=object)
df = df.astype("string").astype("category")
df.to_parquet("test.parquet")
print(df["col"].dtype.categories)  # categories before writing
df = pd.read_parquet("test.parquet")
print(df["col"].dtype.categories)  # categories after reading back
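
Until the subtype survives a round trip, it can be re-applied after loading. The following is a minimal sketch, assuming a hypothetical helper named restore_category_subtype (not part of pandas) that relabels the existing categories with an Index of the desired dtype. The Series named loaded is only a stand-in for a column read back from disk.

```python
import pandas as pd


def restore_category_subtype(series: pd.Series, subtype: str) -> pd.Series:
    """Hypothetical helper: re-cast the categories of a categorical Series to `subtype`."""
    # Convert only the (usually small) categories Index to the requested dtype.
    typed_categories = series.cat.categories.astype(subtype)
    # Relabel the categories one-to-one; the values keep their category assignment.
    return series.cat.rename_categories(typed_categories)


# Stand-in for a column whose categories came back without the "string" subtype.
loaded = pd.Series(["foo", "bar", "foo"], dtype=object).astype("category")
restored = restore_category_subtype(loaded, "string")
print(restored.dtype.categories)
```

This only touches the categories, so it should stay cheap even for long columns, but it still requires the caller to know which subtype was lost.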

Feature Description

  • Make CategoricalDtype a typing.Generic parametrized by a scalar type. (⇝ relevant for pandas-stubs)
  • The fallback should be category[object] (cf. a comment on python/mypy#4236, "Defaults for Generics?")
  • Allow type casting .astype("category[<type>]")
    • series.astype("category[string]") should behave equivalently to series.astype("string").astype("category")
  • Allow usage in constructor methods such as read_csv(file, dtype=...) and DataFrame(..., dtype=...)
  • Ensure category subtypes are maintained through serialization and loading
    • In particular, when reading parquet/feather format. (⇝ interoperability with pyarrow's dictionary type)
  • Allow type checking via series.dtype == "category[string]".
    • Possibly series.dtype == "string" and pd.api.types.is_string_dtype(series) should evaluate to True if the dtype is category[string], since category acts only as a kind of wrapper and things like Series.str accessor are still applicable. (needs discussion)
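
The intended behavior of .astype("category[<type>]") can be emulated today with a small wrapper. The sketch below is only an illustration of the proposed semantics: the function astype_category and the regular expression are hypothetical, and pandas itself does not parse the "category[<type>]" string.

```python
import re

import pandas as pd

# Hypothetical spec syntax taken from this proposal: "category" or "category[<type>]".
_CATEGORY_SPEC = re.compile(r"category(?:\[(?P<subtype>[^\[\]]+)\])?")


def astype_category(series: pd.Series, spec: str) -> pd.Series:
    """Hypothetical helper: cast to a categorical whose categories have the given subtype."""
    match = _CATEGORY_SPEC.fullmatch(spec)
    if match is None:
        raise ValueError(f"not a category spec: {spec!r}")
    subtype = match.group("subtype")
    if subtype is not None:
        # Equivalent to series.astype(<type>).astype("category"), as described above.
        series = series.astype(subtype)
    return series.astype("category")


s = pd.Series(["foo", "bar"], dtype=object)
result = astype_category(s, "category[string]")
print(result.dtype.categories)
```

A plain "category" spec is passed through unchanged here, so the sketch does not implement the proposed category[object] fallback.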

Alternative Solutions

The existing options are to cast manually via .astype(<type>).astype("category") whenever necessary, or to explicitly construct an instance of CategoricalDtype, which, however, requires a priori knowledge of the categories.
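
The second alternative can be sketched as follows, assuming the categories ("bar" and "foo") are known in advance. Passing an Index that already has the desired dtype to CategoricalDtype keeps that dtype on the categories.

```python
import pandas as pd

# The categories must be known up front; their Index carries the subtype.
categories = pd.Index(["bar", "foo"], dtype="string")
dtype = pd.CategoricalDtype(categories=categories)

s = pd.Series(["foo", "bar"], dtype=object).astype(dtype)
print(s.dtype.categories)
```

Values that are not among the listed categories would become missing, which is one reason this route is less convenient than a direct cast.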

Additional Context

Allowing direct casting to category[<type>] when using read_csv should bring minor performance benefits.
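
For comparison, the current two-step approach with read_csv might look like the sketch below. The in-memory CSV content is made up for illustration. The proposal would let the dtype mapping request category[string] directly and so avoid the second pass over the column.

```python
import io

import pandas as pd

# Made-up CSV content, used only for illustration.
csv_data = io.StringIO("col\nfoo\nbar\nfoo\n")

# Step 1: parse the column with the desired subtype.
df = pd.read_csv(csv_data, dtype={"col": "string"})
# Step 2: a second pass over the column converts it to a categorical.
df["col"] = df["col"].astype("category")
print(df["col"].dtype.categories)
```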
