

error: Item "str" of "Union[str, bytes, date, datetime, timedelta, bool, int, float, complex, Timestamp, Timedelta]" has no attribute "copy" [union-attr]

randolf-scholz opened this issue Nov 28, 2022 · 5 comments

With a tuple as the key, DataFrame.loc is typed as returning ScalarType, even when the DataFrame has a MultiIndex and the lookup returns a Series at runtime.

from pandas import MultiIndex, DataFrame
index = MultiIndex.from_tuples([(0, 0), (0, 1)])
df = DataFrame(range(2), index=index)
key = (0, 0)
x = df.loc[key].copy()  # type checker reports [union-attr]: Item "str" of "Union[...]" has no attribute "copy"

I think the issue is these lines, which seem to make flat-out wrong assumptions when the DataFrame is equipped with a MultiIndex:

pandas-stubs/pandas-stubs/core/frame.pyi Lines 159 to 170 68780c7

This seems to be one of the weak points of the current Python typing system. To type-hint this properly, one would need to know whether the DataFrame is equipped with a MultiIndex or not.

The only possible approach I see is to do something along the lines of

IndexType = TypeVar("IndexType", bound=Index)
ColumnType = TypeVar("ColumnType", bound=Index)
class DataFrame(NDFrame, OpsMixin, Generic[IndexType, ColumnType]):
    @property
    def loc(self: DataFrame[IndexType, ColumnType]) -> _LocIndexerFrame[IndexType, ColumnType]: ...
class _LocIndexerFrame(_LocIndexer, Generic[IndexType, ColumnType]):
    # One overload per (index type, column type) combination.
    @overload
    def __getitem__(self: _LocIndexerFrame[MultiIndex, MultiIndex], key: ...) -> ...: ...
    @overload
    def __getitem__(self: _LocIndexerFrame[MultiIndex, Index], key: ...) -> ...: ...
    @overload
    def __getitem__(self: _LocIndexerFrame[Index, MultiIndex], key: ...) -> ...: ...
    @overload
    def __getitem__(self: _LocIndexerFrame[Index, Index], key: ...) -> ...: ...

But it looks super messy.
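To make the idea concrete, here is a minimal runnable sketch of the same technique (overloads selected by the type of `self` on a generic class) using toy classes rather than pandas. `ToyIndex`, `ToyMultiIndex`, and `ToyLoc` are hypothetical names invented for this illustration and are not part of pandas or pandas-stubs. It only tracks the row index, not the columns, so it is much smaller than what the stubs would need.

```python
from typing import Generic, List, Tuple, TypeVar, overload


class ToyIndex:
    """Hypothetical stand-in for a flat index (not pandas.Index)."""

    def __init__(self, labels: list) -> None:
        self.labels = list(labels)


class ToyMultiIndex(ToyIndex):
    """Hypothetical stand-in for a multi-level index; labels are tuples."""


IndexT = TypeVar("IndexT", bound=ToyIndex)


class ToyLoc(Generic[IndexT]):
    """Toy .loc-like indexer whose return type depends on the index type."""

    def __init__(self, index: IndexT, columns: List[str], rows: List[List[int]]) -> None:
        self.index = index
        self.columns = columns
        self.rows = rows

    # The overloads differ only in the annotated type of `self`.
    @overload
    def __getitem__(self: "ToyLoc[ToyMultiIndex]", key: Tuple[str, str]) -> List[int]: ...
    @overload
    def __getitem__(self: "ToyLoc[ToyIndex]", key: Tuple[str, str]) -> int: ...
    def __getitem__(self, key):
        if isinstance(self.index, ToyMultiIndex):
            # The whole tuple is one row label: return the full row.
            return self.rows[self.index.labels.index(key)]
        # Flat index: the tuple means (row label, column label).
        row_label, column_label = key
        row = self.rows[self.index.labels.index(row_label)]
        return row[self.columns.index(column_label)]


multi = ToyLoc(
    ToyMultiIndex([("cobra", "mark i"), ("cobra", "mark ii")]),
    ["max_speed", "shield"],
    [[12, 2], [0, 4]],
)
flat = ToyLoc(ToyIndex(["cobra", "viper"]), ["mark i", "mark ii"], [[1, 2], [3, 4]])

row = multi[("cobra", "mark ii")]   # whole row
cell = flat[("cobra", "mark ii")]   # single cell
```

A type checker such as mypy or pyright would be expected to pick the first overload for `multi` and the second for `flat`, because the index type is carried in the generic parameter. The cost is the one raised above: every constructor and method that can change the index would have to propagate that parameter.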

It's actually more of an issue that .loc[] is pretty permissive.

If you use x = df.loc[key, :].copy() then it works.

The issue here is that df.loc[key] is ambiguous unless you know what is inside the DataFrame. We can't track what is dynamically changing. So with .loc[], the solution is to have users specify both the index and the columns.
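A small runtime sketch of that ambiguity, using two illustrative frames (the names `multi` and `flat` are chosen for this example): the same tuple key yields a Series on one frame and a scalar on the other, and spelling out both axes removes the guesswork.

```python
import pandas as pd

key = ("cobra", "mark ii")

# Rows indexed by a two-level MultiIndex: the tuple is a single row label.
multi = pd.DataFrame(
    [[12, 2], [0, 4]],
    index=pd.MultiIndex.from_tuples([("cobra", "mark i"), ("cobra", "mark ii")]),
    columns=["max_speed", "shield"],
)

# Flat index: the same tuple is read as (row label, column label).
flat = pd.DataFrame(
    [[1, 2], [3, 4]],
    index=["cobra", "viper"],
    columns=["mark i", "mark ii"],
)

row = multi.loc[key]    # a Series holding one row
cell = flat.loc[key]    # a single value

# Specifying both the index and the columns makes the intent explicit.
row_explicit = multi.loc[key, :]
cell_explicit = flat.loc["cobra", "mark ii"]

print(isinstance(row, pd.Series), isinstance(cell, pd.Series))
```

Nothing in the expression `df.loc[key]` itself tells a static checker which of the two frames it is looking at, which is why the stubs have to pick one return type.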

Given that the documentation explicitly shows such examples, I think that may be too much to ask. It will certainly make applying pandas-stubs to existing repositories very difficult.

I used the examples in the documentation and turned them into a typing unit-test. Currently, this test alone raises 13 typing errors.

from pandas import DataFrame, Index, MultiIndex, Series
from typing_extensions import assert_type, reveal_type
# Getting values
df = DataFrame(
    [[1, 2], [4, 5], [7, 8]],
    index=["cobra", "viper", "sidewinder"],
    columns=["max_speed", "shield"],
)
assert_type(df, DataFrame)
assert_type(df.loc["viper"], Series)
assert_type(df.loc[["viper", "sidewinder"]], DataFrame)
assert_type(df.loc["cobra", "shield"], int)
assert_type(df.loc["cobra":"viper", "max_speed"], Series)
assert_type(df.loc[[False, False, True]], DataFrame)
assert_type(
    df.loc[Series([False, True, False], index=["viper", "sidewinder", "cobra"])],
    DataFrame,
)
assert_type(df.loc[Index(["cobra", "viper"], name="foo")], DataFrame)
assert_type(df.loc[df["shield"] > 6], DataFrame)
assert_type(df.loc[df["shield"] > 6, ["max_speed"]], Series)
assert_type(df.loc[lambda df: df["shield"] == 8], DataFrame)
# Setting values
df.loc[["viper", "sidewinder"], ["shield"]] = 50
df.loc["cobra"] = 10
df.loc[:, "max_speed"] = 30
df.loc[df["shield"] > 35] = 0
# Getting values on a DataFrame with an index that has integer labels
df = DataFrame(
    [[1, 2], [4, 5], [7, 8]], index=[7, 8, 9], columns=["max_speed", "shield"]
)
assert_type(df, DataFrame)
assert_type(df.loc[7:9], DataFrame)
# Getting values with a MultiIndex
tuples = [
    ("cobra", "mark i"),
    ("cobra", "mark ii"),
    ("sidewinder", "mark i"),
    ("sidewinder", "mark ii"),
    ("viper", "mark ii"),
    ("viper", "mark iii"),
]
index = MultiIndex.from_tuples(tuples)
values = [[12, 2], [0, 4], [10, 20], [1, 4], [7, 1], [16, 36]]
df = DataFrame(values, columns=["max_speed", "shield"], index=index)
assert_type(df, DataFrame)
assert_type(df.loc["cobra"], DataFrame)
assert_type(df.loc[("cobra", "mark ii")], Series)
assert_type(df.loc["cobra", "mark i"], Series)
assert_type(df.loc[[("cobra", "mark ii")]], DataFrame)
assert_type(df.loc[[("cobra", "mark ii"), "shield"]], int)
assert_type(df.loc[("cobra", "mark i"):"viper"], DataFrame)
assert_type(df.loc[("cobra", "mark i"):("viper", "mark ii")], DataFrame)

Thanks for doing this. We've been doing a whack-a-mole approach to improve the typing on .loc. Clearly we need some improvements. PRs welcome!

I should note that the current approach has evolved over time in response to reports like these, and has been tested on code bases that I have.

I looked into your test code. I don't think there is anything we can do in the stubs to support all of these cases, because some of these ways of using pandas are ambiguous.

Note: mypy doesn't support slices with non-integer bounds. See python/mypy#2410
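As an illustration, the label slice can also be built as an explicit `slice` object rather than with slice syntax. At runtime the two forms select the same rows. Whether the explicit form also avoids the mypy error is an assumption that depends on the mypy and pandas-stubs versions in use; it is not something stated in this thread.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2], [4, 5], [7, 8]],
    index=["cobra", "viper", "sidewinder"],
    columns=["max_speed", "shield"],
)

# Slice syntax with string bounds: the form affected by python/mypy#2410.
with_syntax = df.loc["cobra":"viper", "max_speed"]

# The same selection through an explicit slice object (label slices include both ends).
label_range = slice("cobra", "viper")
with_object = df.loc[label_range, "max_speed"]

print(with_syntax.equals(with_object))
```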

Here is an analysis of the failures I get and what the resolution is:

    1. assert_type(df.loc["cobra":"viper", "max_speed"], pd.Series) fails mypy, but not pyright, because mypy doesn't support non-integer slices.
    2. assert_type(df.loc[lambda df: df["shield"] == 8], pd.DataFrame) fails type checking because lambda functions are untyped. If you do

        def bool_mask(df: pd.DataFrame) -> "pd.Series[bool]":  # better; quoted because pd.Series is not subscriptable at runtime
            return df["shield"] == 8
        assert_type(df.loc[bool_mask(df)], pd.DataFrame)

    then the stubs correctly type check that line.
    3. assert_type(df.loc["cobra"], pd.DataFrame) is ambiguous. Since we don't know if the DataFrame is backed by a MultiIndex or not, the expression df.loc["cobra"] could correspond to a column named "cobra". To achieve what you want, you can write df.loc[pd.IndexSlice["cobra", :]].
    4. assert_type(df.loc[("cobra", "mark ii")], pd.Series) is also ambiguous. If you had

        df = pd.DataFrame([[1, 2], [3, 4]], index=["cobra", "viper"], columns=["mark i", "mark ii"])

    then the expression df.loc[("cobra", "mark ii")] would return a scalar value, not a Series.
    5. As in (4), the test assert_type(df.loc["cobra", "mark i"], pd.Series) is also ambiguous, because with a DataFrame like the one in (4), df.loc["cobra", "mark i"] would return a scalar.
    6. assert_type(df.loc[[("cobra", "mark ii"), "shield"]], int) is invalid code. The expression df.loc[[("cobra", "mark ii"), "shield"]] will fail with pandas. However, if you use assert_type(df.loc[("cobra", "mark ii"), "shield"], Scalar), it passes. Note that you have to use Scalar here because, from a static typing perspective, we can't infer the type of any column of a DataFrame.
    7. Both of the following lines pass pyright, but fail mypy because mypy doesn't support slices that aren't integers:

        assert_type(df.loc[("cobra", "mark i"):"viper"], pd.DataFrame)
        assert_type(df.loc[("cobra", "mark i"):("viper", "mark ii")], pd.DataFrame)  
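For reference, the explicit spellings suggested in items 3 and 6 can be checked at runtime against the MultiIndex frame from the documentation examples. This is only a sketch of what the expressions return in pandas, not a statement about how the stubs type them.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [
        ("cobra", "mark i"),
        ("cobra", "mark ii"),
        ("sidewinder", "mark i"),
        ("sidewinder", "mark ii"),
        ("viper", "mark ii"),
        ("viper", "mark iii"),
    ]
)
values = [[12, 2], [0, 4], [10, 20], [1, 4], [7, 1], [16, 36]]
df = pd.DataFrame(values, columns=["max_speed", "shield"], index=index)

# Item 3: every row under the first-level label "cobra".
cobra_rows = df.loc[pd.IndexSlice["cobra", :]]

# Item 6: a full row key plus an explicit column label selects one cell.
cell = df.loc[("cobra", "mark ii"), "shield"]

print(type(cobra_rows).__name__, cobra_rows.shape)
```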

Based on this analysis, I'm going to close this issue. With static typing, we can't support pandas expressions that are ambiguous. We also can't do anything about mypy bugs!