
error: Item "str" of "Union[str, bytes, date, datetime, timedelta, bool, int, float, complex, Timestamp, Timedelta]" has no attribute "copy" [union-attr] #453

Closed
randolf-scholz opened this issue Nov 28, 2022 · 5 comments

@randolf-scholz
Contributor

Using a tuple as the key for DataFrame.loc is typed as returning Scalar, even though a DataFrame with a MultiIndex returns a Series at runtime.

from pandas import MultiIndex, DataFrame

index = MultiIndex.from_tuples([(0, 0), (0, 1)])
df = DataFrame(range(2), index=index)
key = (0, 0)
x = df.loc[key].copy()  # mypy reports [union-attr]: Item has no attribute "copy"

I think the issue is in these lines, which seem to make flat-out wrong assumptions when the DataFrame has a MultiIndex:

@overload
def __getitem__(
    self,
    idx: tuple[int | StrLike | tuple[ScalarT, ...], int | StrLike],
) -> Scalar: ...
@overload
def __getitem__(
    self,
    idx: ScalarT
    | tuple[IndexType | MaskType | _IndexSliceTuple, ScalarT | None]
    | None,
) -> Series: ...

Please complete the following information:

  • python version 3.10.6
  • version of type checker mypy 0.991
  • version of installed pandas-stubs 1.5.2.221124
@randolf-scholz
Contributor Author

randolf-scholz commented Nov 28, 2022

This seems to be one of the weak points of the current Python typing system. To type hint this properly, one would need to know whether the DataFrame is equipped with a MultiIndex.

The only possible approach I see is to do something along the lines of

IndexType = TypeVar("IndexType", bound=Index)
ColumnType = TypeVar("ColumnType", bound=Index)

class DataFrame(NDFrame, OpsMixin, Generic[IndexType, ColumnType]):
    @property
    def loc(self: DataFrame[IndexType, ColumnType]) -> _LocIndexerFrame[IndexType, ColumnType]: ...


class _LocIndexerFrame(_LocIndexer, Generic[IndexType, ColumnType]):
    ...
    @overload
    def __getitem__(self: _LocIndexerFrame[MultiIndex, MultiIndex], key: ...) -> ...: ...
    @overload
    def __getitem__(self: _LocIndexerFrame[MultiIndex, Index], key: ...) -> ...: ...
    @overload
    def __getitem__(self: _LocIndexerFrame[Index, MultiIndex], key: ...) -> ...: ...
    @overload
    def __getitem__(self: _LocIndexerFrame[Index, Index], key: ...) -> ...: ...

But it looks super messy.
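
A minimal, pandas-free sketch of the idea above. Index, MultiIndex, Frame, Row, and Loc are hypothetical toy classes defined here for illustration; they are not pandas or pandas-stubs types, and this is not how the stubs are implemented. It only shows how overloads on the self type could let a checker choose a return type from the index type parameter.

```python
from __future__ import annotations

from typing import Any, Generic, TypeVar, overload


class Index:
    """Toy stand-in for a flat index (hypothetical)."""


class MultiIndex(Index):
    """Toy stand-in for a hierarchical index (hypothetical)."""


IndexT = TypeVar("IndexT", bound=Index)


class Row:
    """Toy stand-in for the Series returned when one row is selected."""


class Frame(Generic[IndexT]):
    """Toy frame that carries its index type as a type parameter."""

    def __init__(self, index: IndexT) -> None:
        self.index = index

    @property
    def loc(self) -> Loc[IndexT]:
        return Loc(self)


class Loc(Generic[IndexT]):
    """Toy indexer whose return type depends on the frame's index type."""

    def __init__(self, frame: Frame[IndexT]) -> None:
        self.frame = frame

    # A partial key on a hierarchical index selects a sub-frame.
    @overload
    def __getitem__(self: Loc[MultiIndex], key: str) -> Frame[Index]: ...

    # A label on a flat index selects a single row.
    @overload
    def __getitem__(self: Loc[Index], key: str) -> Row: ...

    def __getitem__(self, key: str) -> Any:
        # Runtime dispatch mirrors what the overloads promise statically.
        if isinstance(self.frame.index, MultiIndex):
            return Frame(Index())
        return Row()


sub_frame = Frame(MultiIndex()).loc["cobra"]
single_row = Frame(Index()).loc["cobra"]
print(type(sub_frame).__name__, type(single_row).__name__)
```

Even in this toy form, every combination of index types needs its own overload, which is the messiness noted above.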

@Dr-Irv
Collaborator

Dr-Irv commented Nov 29, 2022

It's actually more of an issue that .loc[] is pretty permissive.

If you use x = df.loc[key, :].copy() then it works.

The issue here is that df.loc[key] is ambiguous unless you know what is inside the DataFrame. We can't track what is dynamically changing. So with .loc[], the solution is to have users specify both the index and the columns.
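
A small runtime sketch of the suggestion above, reusing the DataFrame from the original report. It assumes pandas is installed; the variable names implicit and explicit are illustrative only. Both spellings select the same row at runtime, and the second one names the column axis explicitly.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([(0, 0), (0, 1)])
df = pd.DataFrame(range(2), index=index)
key = (0, 0)

# Tuple key alone: pandas reads it as a full MultiIndex row key.
implicit = df.loc[key]

# Row key plus an explicit "all columns" slice.
explicit = df.loc[key, :]

print(type(implicit).__name__, type(explicit).__name__)
print(list(implicit) == list(explicit))
```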

@randolf-scholz
Contributor Author

Given that the documentation explicitly shows such examples, I think that may be too much to ask. It will certainly make applying pandas-stubs to existing repositories very difficult.

I used the examples in the documentation and turned them into a typing unit-test. Currently, this test alone raises 13 typing errors.

from pandas import DataFrame, Index, MultiIndex, Series
from typing_extensions import assert_type, reveal_type

# Getting values
df = DataFrame(
    [[1, 2], [4, 5], [7, 8]],
    index=["cobra", "viper", "sidewinder"],
    columns=["max_speed", "shield"],
)

assert_type(df, DataFrame)
assert_type(df.loc["viper"], Series)
assert_type(df.loc[["viper", "sidewinder"]], DataFrame)
assert_type(df.loc["cobra", "shield"], int)
assert_type(df.loc["cobra":"viper", "max_speed"], Series)
assert_type(df.loc[[False, False, True]], DataFrame)
assert_type(
    df.loc[Series([False, True, False], index=["viper", "sidewinder", "cobra"])],
    DataFrame,
)
assert_type(df.loc[Index(["cobra", "viper"], name="foo")], DataFrame)
assert_type(df.loc[df["shield"] > 6], DataFrame)
assert_type(df.loc[df["shield"] > 6, ["max_speed"]], Series)
assert_type(df.loc[lambda df: df["shield"] == 8], DataFrame)

# Setting values
df.loc[["viper", "sidewinder"], ["shield"]] = 50
df.loc["cobra"] = 10
df.loc[:, "max_speed"] = 30
df.loc[df["shield"] > 35] = 0

# Getting values on a DataFrame with an index that has integer labels
df = DataFrame(
    [[1, 2], [4, 5], [7, 8]], index=[7, 8, 9], columns=["max_speed", "shield"]
)
assert_type(df, DataFrame)
assert_type(df.loc[7:9], DataFrame)


# Getting values with a MultiIndex
tuples = [
    ("cobra", "mark i"),
    ("cobra", "mark ii"),
    ("sidewinder", "mark i"),
    ("sidewinder", "mark ii"),
    ("viper", "mark ii"),
    ("viper", "mark iii"),
]
index = MultiIndex.from_tuples(tuples)
values = [[12, 2], [0, 4], [10, 20], [1, 4], [7, 1], [16, 36]]
df = DataFrame(values, columns=["max_speed", "shield"], index=index)
assert_type(df, DataFrame)
assert_type(df.loc["cobra"], DataFrame)
assert_type(df.loc[("cobra", "mark ii")], Series)
assert_type(df.loc["cobra", "mark i"], Series)
assert_type(df.loc[[("cobra", "mark ii")]], DataFrame)
assert_type(df.loc[[("cobra", "mark ii"), "shield"]], int)
assert_type(df.loc[("cobra", "mark i"):"viper"], DataFrame)
assert_type(df.loc[("cobra", "mark i"):("viper", "mark ii")], DataFrame)

@Dr-Irv
Collaborator

Dr-Irv commented Nov 29, 2022

Thanks for doing this. We've been taking a whack-a-mole approach to improving the typing on .loc. Clearly we need some improvements. PRs welcome!

I should note that the current approach has evolved over time in response to reports like these, and has been tested on code bases that I have.

@Dr-Irv
Collaborator

Dr-Irv commented Jan 2, 2023

I looked into your test code. I don't think the stubs can support all of these cases, because some of these ways of using pandas are ambiguous.

Note: mypy doesn't support slices with non-integer bounds. See python/mypy#2410
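
As a runtime illustration of the slice note above, the sketch below builds the same label slice two ways: with slice-literal syntax and with the built-in slice(). It assumes pandas is installed. Whether a given mypy or pandas-stubs version accepts the explicit slice(...) form without an error is an assumption to verify, not something established in this thread.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2], [4, 5], [7, 8]],
    index=["cobra", "viper", "sidewinder"],
    columns=["max_speed", "shield"],
)

# Slice-literal syntax with string bounds (the form mypy complains about).
literal = df.loc["cobra":"viper", "max_speed"]

# The same label slice built explicitly with the built-in slice().
explicit = df.loc[slice("cobra", "viper"), "max_speed"]

print(literal.equals(explicit))
```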

Here is an analysis of the failures I get and what the resolution is:

  1. assert_type(df.loc["cobra":"viper", "max_speed"], pd.Series) fails mypy, not pyright, because mypy doesn't support non-integer slices.
  2. assert_type(df.loc[lambda df: df["shield"] == 8], pd.DataFrame) fails type checking because lambda functions are untyped. If you do

    def bool_mask(df: pd.DataFrame) -> pd.Series[bool]:  # better
        return df["shield"] == 8

    assert_type(df.loc[bool_mask(df)], pd.DataFrame)

     then the stubs correctly type check that line.
  3. assert_type(df.loc["cobra"], pd.DataFrame) is ambiguous. Since we don't know whether the DataFrame is backed by a MultiIndex, the expression df.loc["cobra"] could equally select a single row labeled "cobra" from a flat index, which returns a Series. To achieve what you want, you can write df.loc[pd.IndexSlice["cobra", :]]
  4. assert_type(df.loc[("cobra", "mark ii")], pd.Series) is also ambiguous. If you had

    df = pd.DataFrame([[1, 2], [3, 4]], index=["cobra", "viper"], columns=["mark i", "mark ii"])

     then the expression df.loc[("cobra", "mark ii")] would return a scalar value, not a Series.
  5. As in (4), the test assert_type(df.loc["cobra", "mark i"], pd.Series) is also ambiguous, because with the DataFrame from (4), df.loc["cobra", "mark i"] would return a scalar.
  6. assert_type(df.loc[[("cobra", "mark ii"), "shield"]], int) is invalid code. The expression df.loc[[("cobra", "mark ii"), "shield"]] will fail with pandas. However, if you used assert_type(df.loc[("cobra", "mark ii"), "shield"], Scalar), it passes. Note that you have to use Scalar here because, from a static typing perspective, we can't infer the type of any column of a DataFrame.
  7. Both of the following lines pass pyright, but fail mypy because mypy doesn't support slices that aren't integers:

    assert_type(df.loc[("cobra", "mark i"):"viper"], pd.DataFrame)
    assert_type(df.loc[("cobra", "mark i"):("viper", "mark ii")], pd.DataFrame)

Based on this analysis, I'm going to close this issue. With static typing, we can't support pandas expressions that are ambiguous. We also can't do anything about mypy bugs!
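
The ambiguity described in items 3 and 4 can be reproduced at runtime. The sketch below assumes pandas is installed; flat and multi are illustrative names, with flat taken from item 4 and multi a shortened version of the MultiIndex example from the test code. The same two expressions produce different kinds of result depending only on the index.

```python
import pandas as pd

key = ("cobra", "mark ii")

# Flat index: the tuple is read as (row label, column label).
flat = pd.DataFrame(
    [[1, 2], [3, 4]],
    index=["cobra", "viper"],
    columns=["mark i", "mark ii"],
)

# MultiIndex: the same tuple is read as one full row key.
multi = pd.DataFrame(
    [[12, 2], [0, 4]],
    index=pd.MultiIndex.from_tuples([("cobra", "mark i"), ("cobra", "mark ii")]),
    columns=["max_speed", "shield"],
)

flat_cell = flat.loc[key]          # a single scalar value
multi_row = multi.loc[key]         # a Series holding one row
flat_row = flat.loc["cobra"]       # a Series holding one row
multi_block = multi.loc["cobra"]   # a DataFrame with every "cobra" row

for result in (flat_cell, multi_row, flat_row, multi_block):
    print(type(result).__name__)
```

A static checker sees only DataFrame in both cases, so it has no information from which to pick one of these return types.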
