DataFrame.groupby().std() fails on filtered DataFrame · Issue #16174 · pandas-dev/pandas

Code Sample, a copy-pastable example if possible

from pandas import DataFrame
dicts = [{'filter_col':False, 'groupby_col':True, 'bool_col':True, 'float_col':10.5}, {'filter_col':True, 'groupby_col':True, 'bool_col':True, 'float_col':20.5}, {'filter_col':True, 'groupby_col':True, 'bool_col':True, 'float_col':30.5}]
df = DataFrame(dicts)
df_filter = df[df['filter_col'] == True]
dfgb = df_filter.groupby('groupby_col')
dfgb.std()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.5/site-packages/pandas/core/groupby.py", line 1055, in std
    return np.sqrt(self.var(ddof=ddof))
AttributeError: 'bool' object has no attribute 'sqrt'

Problem description

Required elements for the error to appear are:

groupby() is applied to a filtered DataFrame, not an original DataFrame

std(), not another aggregate function (e.g. mean()), is called on the DataFrameGroupBy object

the DataFrame contains a column of type bool

there are at least 2 rows w/ the same value of the .groupby() column (here, 'groupby_col')

In my more-complicated real-world data where I ran into the error, I would also see an Exception complaining about type float:

AttributeError: 'float' object has no attribute 'sqrt'

However, even in that case, deleting the bool column would resolve the issue.

Presumably I'll be able to work around the issue by calling .std() on individual columns of the DataFrameGroupBy object, but it seems like pandas should be able to handle this case w/o choking.
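The per-column workaround mentioned above might look like the sketch below. It reuses the frame from the code sample and selects only float_col before calling .std(), so the bool columns never reach the aggregation. The name float_std is illustrative and not part of the report.

```python
import pandas as pd

dicts = [
    {"filter_col": False, "groupby_col": True, "bool_col": True, "float_col": 10.5},
    {"filter_col": True, "groupby_col": True, "bool_col": True, "float_col": 20.5},
    {"filter_col": True, "groupby_col": True, "bool_col": True, "float_col": 30.5},
]
df = pd.DataFrame(dicts)
df_filter = df[df["filter_col"]]
dfgb = df_filter.groupby("groupby_col")

# Aggregate one numeric column at a time instead of the whole frame,
# so the bool columns are never passed to std().
float_std = dfgb["float_col"].std()
print(float_std)
```

This only sidesteps the failure; it does not give a standard deviation for the bool columns themselves.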

Expected Output

             bool_col  filter_col  float_col
groupby_col                                 
True              0.0         0.0    7.07107
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.16-gentoo
machine: x86_64
processor: Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.19.1
nose: None
pip: 7.1.2
setuptools: 30.4.0
Cython: 0.25.1
numpy: 1.10.4
scipy: 0.16.1
statsmodels: 0.6.1
xarray: None
IPython: None
sphinx: None
patsy: 0.4.1
dateutil: 2.4.2
pytz: 2016.3
blosc: None
bottleneck: 1.0.0
tables: None
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: 0.9999999
httplib2: 0.9.2
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.9.5
boto: None
pandas_datareader: None
We exclude non-numeric columns in aggregations. However, bool is valid for some of them.
In [8]: df.groupby('groupby_col').sum()
Out[8]: 
             bool_col  filter_col  float_col
groupby_col                                 
True              3.0         2.0       61.5
In [9]: df.groupby('groupby_col').mean()
Out[9]: 
             bool_col  filter_col  float_col
groupby_col                                 
True              1.0    0.666667       20.5
In [10]: df.dtypes
Out[10]: 
bool_col          bool
filter_col        bool
float_col      float64
groupby_col       bool
dtype: object
So we could fix this generally by simply casting bool columns with astype (we already cast certain columns for computation anyhow), or we could pull back and remove bool from numeric aggregations like sum/mean.
@TomAugspurger
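As a rough illustration of the first option (casting bool columns before the computation), here is a user-level sketch. It is not how pandas does or would do the cast internally; the bool_cols helper list and the choice of float64 as the target dtype are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "groupby_col": [True, True, True],
        "bool_col": [True, True, True],
        "float_col": [10.5, 20.5, 30.5],
    }
)

# Hypothetical pre-processing step: cast every bool column except the
# grouping key to float64 before aggregating.
bool_cols = [
    name
    for name, dtype in df.dtypes.items()
    if dtype == bool and name != "groupby_col"
]
as_float = df.astype({name: "float64" for name in bool_cols})

# With only float columns left, std() has nothing bool-typed to trip over.
result = as_float.groupby("groupby_col").std()
print(result)
```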
sqrt and var can also make sense for booleans, but we seem to fail when the column being aggregated has no variance.
In [5]: pd.DataFrame({"A": [1, 1, 1], "B": [True, True, True], "C": [1, 1, 1]}).groupby("A").std()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-375645985fa5> in <module>()
----> 1 pd.DataFrame({"A": [1, 1, 1], "B": [True, True, True], "C": [1, 1, 1]}).groupby("A").std()
/Users/taugspurger/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/groupby.py in std(self, ddof, *args, **kwargs)
   1080         # TODO: implement at Cython level?
   1081         nv.validate_groupby_func('std', args, kwargs)
-> 1082         return np.sqrt(self.var(ddof=ddof, **kwargs))
   1084     @Substitution(name='groupby')
AttributeError: 'bool' object has no attribute 'sqrt'
In [6]: pd.DataFrame({"A": [1, 1, 2], "B": [True, True, True], "C": [1, 1, 1]}).groupby("A").std()
Out[6]:
     B    C
A
1  0.0  0.0
2  NaN  NaN
In [7]: pd.DataFrame({"A": [1, 1, 1], "B": [True, True, False], "C": [1, 1, 1]}).groupby("A").std()
Out[7]:
         B    C
A
1  0.57735  0.0
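For reference, the 0.57735 for [True, True, False] is what you get by treating the booleans as 0/1 and taking the sample standard deviation (ddof=1, the default for pandas' std()). A quick NumPy check, with illustrative variable names:

```python
import numpy as np

values = np.array([True, True, False])

# Treat the booleans as 0/1 and use ddof=1 (sample standard deviation).
# mean = 2/3, sum of squared deviations = 2/3, divided by (n - 1) = 2 -> 1/3.
sample_std = np.std(values.astype(float), ddof=1)
print(round(sample_std, 5))  # 0.57735
```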
Really, the underlying issue is probably unrelated to groupby.
In [45]: pd.DataFrame({"A": [1, 1, 1, 1], "B": [True, True, True, True], "C": [1, 1, 1, 2]}).groupby("A").var()
Out[45]:
       B     C
A
1  False  0.25
Should the B column there be 0, not False? That would be consistent with NumPy:
In [46]: np.var([1, 1, 1, 1])
Out[46]: 0.0
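A likely explanation for the error message, offered as a guess rather than a confirmed diagnosis: if var() hands back a Python bool in an object-dtype column (the False above), np.sqrt has no numeric loop for it and falls back to looking for a .sqrt() method on the element. The sketch below shows both halves with plain NumPy; the variable names are illustrative.

```python
import numpy as np

# On a bool array NumPy computes the variance in floating point.
bool_var = np.var(np.array([True, True, True, True]))
print(bool_var, np.sqrt(bool_var))

# An object-dtype array holding a Python bool has no native sqrt loop,
# so NumPy tries to call a .sqrt() method on the element and fails.
object_values = np.array([False], dtype=object)
try:
    np.sqrt(object_values)
    sqrt_failed = False
except (AttributeError, TypeError) as exc:
    # Depending on the NumPy version this is an AttributeError or a TypeError.
    sqrt_failed = True
    print(type(exc).__name__, exc)
```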
>>> dicts = [
...     {"filter_col": False, "groupby_col": True, "bool_col": True, "float_col": 10.5},
...     {"filter_col": True, "groupby_col": True, "bool_col": True, "float_col": 20.5},
...     {"filter_col": True, "groupby_col": True, "bool_col": True, "float_col": 30.5},
... ]
>>> df = pd.DataFrame(dicts)
>>> df_filter = df[df["filter_col"] == True]
>>> dfgb = df_filter.groupby("groupby_col")
>>> dfgb.std()
             filter_col  bool_col  float_col
groupby_col
True                0.0       0.0   7.071068
Could use a test.
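A regression test along these lines might be enough. This is only a sketch: the test name is made up, the data is copied from the original report, and it assumes a pandas version that produces the output shown above.

```python
import numpy as np
import pandas as pd


def test_groupby_std_filtered_frame_with_bool_column():
    # Hypothetical test name; data taken from the original report.
    df = pd.DataFrame(
        [
            {"filter_col": False, "groupby_col": True, "bool_col": True, "float_col": 10.5},
            {"filter_col": True, "groupby_col": True, "bool_col": True, "float_col": 20.5},
            {"filter_col": True, "groupby_col": True, "bool_col": True, "float_col": 30.5},
        ]
    )
    result = df[df["filter_col"]].groupby("groupby_col").std()

    # Constant bool columns have zero spread; float_col holds 20.5 and 30.5.
    assert np.isclose(result["filter_col"].iloc[0], 0.0)
    assert np.isclose(result["bool_col"].iloc[0], 0.0)
    assert np.isclose(result["float_col"].iloc[0], np.std([20.5, 30.5], ddof=1))


test_groupby_std_filtered_frame_with_bool_column()
```

The assertions compare individual values rather than the whole frame, so the check does not depend on column order.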