error: Item "str" of "Union[str, bytes, date, datetime, timedelta, bool, int, float, complex, Timestamp, Timedelta]" has no attribute "copy" [union-attr]
randolf-scholz
opened this issue
Nov 28, 2022
· 5 comments
Using a tuple as the key for
DataFrame.loc
returns
ScalarType
.
from pandas import MultiIndex, DataFrame
index = MultiIndex.from_tuples([(0, 0), (0, 1)])
df = DataFrame(range(2), index=index)
key = (0, 0)
x = df.loc[key].copy() # raises [union-attr] Item has no attribute "copy"
I think the issue is these lines, which seem to make flat-out wrong assumptions when the DataFrame is equipped with a
MultiIndex
:
pandas-stubs/pandas-stubs/core/frame.pyi
Lines 159 to 170
68780c7
This seems to be one of the weak points of the current Python typing system. To type hint this properly, one would need to know whether the
DataFrame
is equipped with a
MultiIndex
or not.
The only possible approach I see is to do something along the lines of
IndexType = TypeVar("IndexType", bound=Index)
ColumnType = TypeVar("ColumnType", bound=Index)
class DataFrame(NDFrame, OpsMixin, Generic[IndexType, ColumnType]):
    @property
    def loc(self: DataFrame[IndexType, ColumnType]) -> _LocIndexerFrame[IndexType, ColumnType]: ...
class _LocIndexerFrame(_LocIndexer, Generic[IndexType, ColumnType]):
    @overload
    def __getitem__(self: _LocIndexerFrame[MultiIndex, MultiIndex], key: ...) -> ...: ...
    @overload
    def __getitem__(self: _LocIndexerFrame[MultiIndex, Index], key: ...) -> ...: ...
    @overload
    def __getitem__(self: _LocIndexerFrame[Index, MultiIndex], key: ...) -> ...: ...
    @overload
    def __getitem__(self: _LocIndexerFrame[Index, Index], key: ...) -> ...: ...
But it looks super messy.
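To make the idea concrete, here is a minimal runnable sketch of overloads selected by the annotation on `self`. The classes `Index`, `MultiIndex`, `Row`, and `Frame`, and the method `lookup`, are hypothetical stand-ins defined in the snippet itself; they are not the pandas or pandas-stubs classes, and whether a type checker accepts the real stubs written this way is not shown here.

```python
from typing import Any, Generic, TypeVar, overload


# Hypothetical stand-ins for illustration only; these are not the pandas classes.
class Index: ...
class MultiIndex(Index): ...
class Row: ...


IndexT = TypeVar("IndexT", bound=Index)


class Frame(Generic[IndexT]):
    def __init__(self, index: IndexT) -> None:
        self.index = index

    # The annotation on `self` picks the overload from the index type parameter.
    @overload
    def lookup(self: "Frame[MultiIndex]", key: str) -> "Frame[Index]": ...
    @overload
    def lookup(self: "Frame[Index]", key: str) -> Row: ...
    def lookup(self, key: str) -> Any:
        # Runtime behaviour mirrors the overloads: a partial key on a
        # MultiIndex yields a sub-frame, a key on a flat Index yields a row.
        if isinstance(self.index, MultiIndex):
            return Frame(Index())
        return Row()


sub_frame = Frame(MultiIndex()).lookup("cobra")  # intended static type: Frame[Index]
single_row = Frame(Index()).lookup("cobra")      # intended static type: Row
```

The sketch only works because the index type is fixed at construction. A real DataFrame can have its index reassigned later, which is part of why this approach gets messy.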
It's actually more of an issue that
.loc[]
is pretty permissive.
If you use
x = df.loc[key, :].copy()
then it works.
The issue here is that
df.loc[key]
is ambiguous unless you know what is inside the
DataFrame
. We can't track what is dynamically changing. So with
.loc[]
, the solution is to have users specify both the index and the columns.
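As a small runtime check of the suggestion above, the following sketch reuses the DataFrame from the original report and spells out both axes. The variable names `row` and `row_copy` are illustrative only.

```python
from pandas import DataFrame, MultiIndex, Series

index = MultiIndex.from_tuples([(0, 0), (0, 1)])
df = DataFrame(range(2), index=index)
key = (0, 0)

# Specifying rows and columns explicitly removes the ambiguity:
# the tuple is the row label and `:` selects every column.
row = df.loc[key, :]
assert isinstance(row, Series)

row_copy = row.copy()
```

With both axes given, the selected row is a Series, so `.copy()` is available.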
Given that
the documentation explicitly shows such examples
, I think that may be too much to ask. It will certainly make applying pandas-stubs to existing repositories very difficult.
I used the examples in the documentation and turned them into a typing unit-test. Currently, this test alone raises 13 typing errors.
from pandas import DataFrame, Index, MultiIndex, Series
from typing_extensions import assert_type, reveal_type
# Getting values
df = DataFrame(
    [[1, 2], [4, 5], [7, 8]],
    index=["cobra", "viper", "sidewinder"],
    columns=["max_speed", "shield"],
)
assert_type(df, DataFrame)
assert_type(df.loc["viper"], Series)
assert_type(df.loc[["viper", "sidewinder"]], DataFrame)
assert_type(df.loc["cobra", "shield"], int)
assert_type(df.loc["cobra":"viper", "max_speed"], Series)
assert_type(df.loc[[False, False, True]], DataFrame)
assert_type(
    df.loc[Series([False, True, False], index=["viper", "sidewinder", "cobra"])],
    DataFrame,
)
assert_type(df.loc[Index(["cobra", "viper"], name="foo")], DataFrame)
assert_type(df.loc[df["shield"] > 6], DataFrame)
assert_type(df.loc[df["shield"] > 6, ["max_speed"]], Series)
assert_type(df.loc[lambda df: df["shield"] == 8], DataFrame)
# Setting values
df.loc[["viper", "sidewinder"], ["shield"]] = 50
df.loc["cobra"] = 10
df.loc[:, "max_speed"] = 30
df.loc[df["shield"] > 35] = 0
# Getting values on a DataFrame with an index that has integer labels
df = DataFrame(
    [[1, 2], [4, 5], [7, 8]], index=[7, 8, 9], columns=["max_speed", "shield"]
)
assert_type(df, DataFrame)
assert_type(df.loc[7:9], DataFrame)
# Getting values with a MultiIndex
tuples = [
    ("cobra", "mark i"),
    ("cobra", "mark ii"),
    ("sidewinder", "mark i"),
    ("sidewinder", "mark ii"),
    ("viper", "mark ii"),
    ("viper", "mark iii"),
]
index = MultiIndex.from_tuples(tuples)
values = [[12, 2], [0, 4], [10, 20], [1, 4], [7, 1], [16, 36]]
df = DataFrame(values, columns=["max_speed", "shield"], index=index)
assert_type(df, DataFrame)
assert_type(df.loc["cobra"], DataFrame)
assert_type(df.loc[("cobra", "mark ii")], Series)
assert_type(df.loc["cobra", "mark i"], Series)
assert_type(df.loc[[("cobra", "mark ii")]], DataFrame)
assert_type(df.loc[[("cobra", "mark ii"), "shield"]], int)
assert_type(df.loc[("cobra", "mark i"):"viper"], DataFrame)
assert_type(df.loc[("cobra", "mark i"):("viper", "mark ii")], DataFrame)
Thanks for doing this. We've been taking a whack-a-mole approach to improving the typing on
.loc
. Clearly we need some improvements. PRs welcome!
I should note that the current approach has evolved over time in response to reports like these, and has been tested on code bases that I have.
I looked into your test code, and I don't think we can support all of these cases in the stubs, because some of these ways of using pandas are ambiguous.
Note: mypy doesn't support slices with non-integer bounds. See
python/mypy#2410
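A possible way to sidestep the slice-literal check is to build the slice with the `slice()` builtin. The sketch below only verifies that the two spellings select the same data at runtime. Whether a given mypy version accepts the second form against the stubs is an assumption that has not been checked here.

```python
from pandas import DataFrame

df = DataFrame(
    [[1, 2], [4, 5], [7, 8]],
    index=["cobra", "viper", "sidewinder"],
    columns=["max_speed", "shield"],
)

# Slice literal with string bounds (the form that mypy rejected in this thread).
with_literal = df.loc["cobra":"viper", "max_speed"]

# Same selection, with the slice object constructed explicitly.
with_builtin = df.loc[slice("cobra", "viper"), "max_speed"]

assert with_literal.equals(with_builtin)
```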
Here is an analysis of the failures I get and what the resolution is:
1.
assert_type(df.loc["cobra":"viper", "max_speed"], pd.Series)
fails mypy, not pyright, because mypy doesn't support non-integer slices.
2.
assert_type(df.loc[lambda df: df["shield"] == 8], pd.DataFrame)
fails type checking because
lambda
functions are untyped. If you do
def bool_mask(df: pd.DataFrame) -> pd.Series[bool]: # better
return df["shield"] == 8
assert_type(df.loc[bool_mask(df)], pd.DataFrame)
then the stubs correctly type check that line.
3.
assert_type(df.loc["cobra"], pd.DataFrame)
is ambiguous. Since we don't know if the DataFrame is backed by a
MultiIndex
or not, the expression
df.loc["cobra"]
could instead select the single row labeled
"cobra"
, which is a Series rather than a DataFrame. To achieve what you want, you can write
df.loc[pd.IndexSlice["cobra", :]]
4.
assert_type(df.loc[("cobra", "mark ii")], pd.Series)
is also ambiguous. If you had
df = pd.DataFrame([[1,2], [3,4]], index=["cobra", "viper"], columns=["mark i", "mark ii"])
then the expression
df.loc[("cobra", "mark ii")]
would return a scalar value, not a
Series
.
5. As in (4), the test
assert_type(df.loc["cobra", "mark i"], pd.Series)
is also ambiguous, because
df.loc["cobra", "mark i"]
would return a scalar for the DataFrame defined in (4).
6.
assert_type(df.loc[[("cobra", "mark ii"), "shield"]], int)
is invalid code. The expression
df.loc[[("cobra", "mark ii"), "shield"]]
will fail with pandas. However, if you used
assert_type(df.loc[("cobra", "mark ii"), "shield"], Scalar)
, it passes. Note that you have to use
Scalar
here because from a static typing perspective, we can't infer the type of any column of a
DataFrame
.
7. Both of the following lines pass pyright, but fail mypy because mypy doesn't support slices that aren't integers:
assert_type(df.loc[("cobra", "mark i"):"viper"], pd.DataFrame)
assert_type(df.loc[("cobra", "mark i"):("viper", "mark ii")], pd.DataFrame)
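The ambiguity described in items 3 to 5 can be observed at runtime. The sketch below applies the same `.loc` expressions to a flat-index DataFrame (the one from item 4) and to a MultiIndex DataFrame built from a subset of the test data. The names `flat`, `multi`, and `key` are illustrative only.

```python
import pandas as pd

# Flat index: the tuple is read as (row label, column label).
flat = pd.DataFrame(
    [[1, 2], [3, 4]], index=["cobra", "viper"], columns=["mark i", "mark ii"]
)

# MultiIndex: the same tuple is read as a complete row label.
multi = pd.DataFrame(
    [[12, 2], [0, 4]],
    index=pd.MultiIndex.from_tuples([("cobra", "mark i"), ("cobra", "mark ii")]),
    columns=["max_speed", "shield"],
)

key = ("cobra", "mark ii")
flat_result = flat.loc[key]    # a scalar
multi_result = multi.loc[key]  # a Series

assert not isinstance(flat_result, pd.Series)
assert isinstance(multi_result, pd.Series)

# A single label behaves differently too: one row versus a sub-frame.
assert isinstance(flat.loc["cobra"], pd.Series)
assert isinstance(multi.loc["cobra"], pd.DataFrame)
```

Since both frames have the static type `DataFrame`, a stub cannot tell these cases apart from the expression alone.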
Based on this analysis, I'm going to close this issue. With static typing, we can't support pandas expressions that are ambiguous. We also can't do anything about mypy bugs!