from pandas import DataFrame
dicts = [
    {'filter_col': False, 'groupby_col': True, 'bool_col': True, 'float_col': 10.5},
    {'filter_col': True, 'groupby_col': True, 'bool_col': True, 'float_col': 20.5},
    {'filter_col': True, 'groupby_col': True, 'bool_col': True, 'float_col': 30.5},
]
df = DataFrame(dicts)
df_filter = df[df['filter_col'] == True]  # keep only the rows where filter_col is True
dfgb = df_filter.groupby('groupby_col')
dfgb.std()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.5/site-packages/pandas/core/groupby.py", line 1055, in std
    return np.sqrt(self.var(ddof=ddof))
AttributeError: 'bool' object has no attribute 'sqrt'
In my more complicated real-world data, where I first ran into the error, I also saw an exception complaining about type float. Even in that case, however, deleting the bool column resolved the issue.
Presumably I'll be able to work around the issue by calling .std() on individual columns of the DataFrameGroupBy object, but it seems like pandas should be able to handle this case without choking.
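The per-column workaround mentioned above might look like the following minimal sketch. It rebuilds the frame from the report and assumes that selecting only float_col from the grouped object sidesteps the failing bool columns; the names grouped and float_std are illustrative, not from the original report.

```python
import pandas as pd

# Same data as the reproduction above, written column-wise.
df = pd.DataFrame(
    {
        "filter_col": [False, True, True],
        "groupby_col": [True, True, True],
        "bool_col": [True, True, True],
        "float_col": [10.5, 20.5, 30.5],
    }
)
df_filter = df[df["filter_col"]]
grouped = df_filter.groupby("groupby_col")

# Aggregate only the float column, so no bool column reaches std().
float_std = grouped["float_col"].std()
print(float_std)
```

This only helps when the standard deviation of the bool columns is not needed.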
Expected Output
             bool_col  filter_col  float_col
groupby_col
True              0.0         0.0    7.07107
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.16-gentoo
machine: x86_64
processor: Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.19.1
nose: None
pip: 7.1.2
setuptools: 30.4.0
Cython: 0.25.1
numpy: 1.10.4
scipy: 0.16.1
statsmodels: 0.6.1
xarray: None
IPython: None
sphinx: None
patsy: 0.4.1
dateutil: 2.4.2
pytz: 2016.3
blosc: None
bottleneck: 1.0.0
tables: None
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: 0.9999999
httplib2: 0.9.2
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.9.5
boto: None
pandas_datareader: None
We exclude non-numeric columns in aggregations. However, bool is valid for some of them.
In [8]: df.groupby('groupby_col').sum()
Out[8]:
             bool_col  filter_col  float_col
groupby_col
True              3.0         2.0       61.5

In [9]: df.groupby('groupby_col').mean()
Out[9]:
             bool_col  filter_col  float_col
groupby_col
True              1.0    0.666667       20.5

In [10]: df.dtypes
Out[10]:
bool_col          bool
filter_col        bool
float_col      float64
groupby_col       bool
dtype: object
So we could fix this generally by simply astyping bool columns (we already cast certain columns for computation anyhow), or we could pull back and remove bool from numeric aggregations like sum/mean.
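As a user-side illustration of the astype idea, here is a minimal sketch that casts the bool value columns to float64 before grouping. The helper name std_with_bools_as_float is hypothetical, and this is not how pandas implements the cast internally; it only shows the effect of doing the cast up front.

```python
import pandas as pd


def std_with_bools_as_float(frame, key):
    """Hypothetical helper: cast bool value columns to float64, then take the grouped std."""
    # Leave the grouping key alone; only the value columns are cast.
    bool_cols = frame.select_dtypes(include="bool").columns.difference([key])
    casted = frame.astype({col: "float64" for col in bool_cols})
    return casted.groupby(key).std()


df = pd.DataFrame({"A": [1, 1, 1], "B": [True, True, True], "C": [1, 1, 1]})
result = std_with_bools_as_float(df, "A")
print(result)
```

Because B is already float64 when std() runs, the constant column comes out as a float zero rather than a bool.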
@TomAugspurger
sqrt and var can also make sense for booleans, but we seem to fail when the column being aggregated has no variance.
In [5]: pd.DataFrame({"A": [1, 1, 1], "B": [True, True, True], "C": [1, 1, 1]}).groupby("A").std()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-375645985fa5> in <module>()
----> 1 pd.DataFrame({"A": [1, 1, 1], "B": [True, True, True], "C": [1, 1, 1]}).groupby("A").std()

/Users/taugspurger/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/groupby.py in std(self, ddof, *args, **kwargs)
   1080         # TODO: implement at Cython level?
   1081         nv.validate_groupby_func('std', args, kwargs)
-> 1082         return np.sqrt(self.var(ddof=ddof, **kwargs))
   1084     @Substitution(name='groupby')

AttributeError: 'bool' object has no attribute 'sqrt'
In [6]: pd.DataFrame({"A": [1, 1, 2], "B": [True, True, True], "C": [1, 1, 1]}).groupby("A").std()
Out[6]:
     B    C
A
1  0.0  0.0
2  NaN  NaN

In [7]: pd.DataFrame({"A": [1, 1, 1], "B": [True, True, False], "C": [1, 1, 1]}).groupby("A").std()
Out[7]:
         B    C
A
1  0.57735  0.0
Really, the underlying issue is probably unrelated to groupby.
In [45]: pd.DataFrame({"A": [1, 1, 1, 1], "B": [True, True, True, True], "C": [1, 1, 1, 2]}).groupby("A").var()
Out[45]:
       B     C
A
1  False  0.25
Should the B column there be 0, not False? That'd be consistent with numpy:
In [46]: np.var([1, 1, 1, 1])
Out[46]: 0.0
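To make the numpy comparison concrete for actual booleans rather than integers, here is a small sketch. It assumes only numpy; the variable names are illustrative. It computes the variance of an all-True array and the sample standard deviation of a mixed one, the latter matching the [True, True, False] case shown earlier.

```python
import numpy as np

flags = np.array([True, True, True, True])

# numpy promotes the booleans to float, so the results are floats, not bools.
population_var = np.var(flags)       # ddof=0, numpy's default
sample_var = np.var(flags, ddof=1)   # ddof=1, the default used by pandas

# Sample standard deviation of a mixed boolean array: sqrt(1/3).
mixed_std = np.std(np.array([True, True, False]), ddof=1)

print(population_var, sample_var, mixed_std)
```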
>>> dicts = [
... {"filter_col": False, "groupby_col": True, "bool_col": True, "float_col": 10.5},
... {"filter_col": True, "groupby_col": True, "bool_col": True, "float_col": 20.5},
... {"filter_col": True, "groupby_col": True, "bool_col": True, "float_col": 30.5},
... ]
>>> df = pd.DataFrame(dicts)
>>> df_filter = df[df["filter_col"] == True]
>>> dfgb = df_filter.groupby("groupby_col")
>>> dfgb.std()
             filter_col  bool_col  float_col
groupby_col
True                0.0       0.0   7.071068
could use a test.
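A regression test along these lines might be a starting point. This is only a sketch: the test name is hypothetical, it uses plain assertions rather than the pandas test suite's own helpers, and it assumes a pandas version in which the behaviour shown above is already in place.

```python
import pandas as pd


def test_groupby_std_constant_bool_column():
    # Hypothetical test name. A constant bool column within a single group
    # is the case that raised AttributeError in the traceback above.
    df = pd.DataFrame({"A": [1, 1, 1], "B": [True, True, True], "C": [1, 1, 1]})
    result = df.groupby("A").std()

    assert list(result.columns) == ["B", "C"]
    assert result["B"].tolist() == [0.0]
    assert result["C"].tolist() == [0.0]
```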