Series groupby does not include zero or nan counts for all categorical labels, unlike DataFrame groupby #17605
nmusolino opened this issue Sep 20, 2017 · 9 comments · Fixed by #29690

In [1]: import pandas
In [2]: df = pandas.DataFrame({'type': pandas.Categorical(['AAA', 'AAA', 'B', 'C']),
   ...:                        'voltage': pandas.Series([1.5, 1.5, 1.5, 1.5]),
   ...:                        'treatment': pandas.Categorical(['T', 'C', 'T', 'C'])})
In [3]: df.groupby(['treatment', 'type']).count()
Out[3]:
                voltage
treatment type
C         AAA       1.0
          B         NaN
          C         1.0
T         AAA       1.0
          B         1.0
          C         NaN
In [4]: df.groupby(['treatment', 'type'])['voltage'].count()
Out[4]:
treatment  type
C          AAA     1
           C       1
T          AAA     1
           B       1
Name: voltage, dtype: int64
Problem description
When performing a groupby on categorical columns, categories with empty groups should be present in the output. That is, the multi-index of the object returned by count() should contain the Cartesian product of all the labels of the first categorical column ("treatment" in the example above) and the second categorical column ("type") by which the grouping was performed.
The behavior in cell [3] above is correct. But in cell [4], after obtaining a pandas.core.groupby.SeriesGroupBy object, the series returned by the count() method does not have entries for all combinations of the "treatment" and "type" categories.
Expected Output
The output from cell [4] should be equivalent to the output below: it should have length 6 and include entries for the index labels (C, B) and (T, C).
In [5]: df.groupby(['treatment', 'type']).count().squeeze()
Out[5]:
treatment  type
C          AAA     1.0
           B       NaN
           C       1.0
T          AAA     1.0
           B       1.0
           C       NaN
Name: voltage, dtype: float64
Workaround
Perform column access after calling count():
In [7]: df.groupby(['treatment', 'type']).count()['voltage']
Out[7]:
treatment  type
C          AAA     1.0
           B       NaN
           C       1.0
T          AAA     1.0
           B       1.0
           C       NaN
Name: voltage, dtype: float64
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.4.5.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: 1.4.8
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.7
blosc: 1.5.0
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.3
html5lib: 0.999
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.1.3
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.43.0
pandas_datareader: None
Title changed from "Series groupby does not included zero or nan counts for categoricals, unlike DataFrame groupby" to "Series groupby does not include zero or nan counts for all categorical labels, unlike DataFrame groupby" on Sep 20, 2017
          And to be more constructive, I would imagine that in your use case you want some sort of reindex first, using something like MultiIndex.from_product:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.MultiIndex.from_product.html
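A minimal sketch of that reindex idea, reusing the frame from the original report. It assumes the grouping keys are cast to object dtype so that only observed combinations are produced in the first place; the names `keys`, `full_index`, and `full_counts` are illustrative and do not come from the thread.

```python
import pandas as pd

df = pd.DataFrame({
    "type": pd.Categorical(["AAA", "AAA", "B", "C"]),
    "voltage": [1.5, 1.5, 1.5, 1.5],
    "treatment": pd.Categorical(["T", "C", "T", "C"]),
})

# Group on plain object-dtype keys: only combinations that occur are returned.
keys = [df["treatment"].astype(object), df["type"].astype(object)]
counts = df["voltage"].groupby(keys).count()

# Build the Cartesian product of all category labels ...
full_index = pd.MultiIndex.from_product(
    [df["treatment"].cat.categories, df["type"].cat.categories],
    names=["treatment", "type"],
)

# ... and reindex, filling the combinations that never occurred with 0.
full_counts = counts.reindex(full_index, fill_value=0)
print(full_counts)
```

Here `full_counts` has six rows, with 0 for (C, B) and (T, C). Using `fill_value=0` is a choice made for this sketch; leaving it out would give NaN for the missing combinations, as in cell [3].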
          This issue was created before the observed keyword was introduced in groupby(), but it seems to me that with the default of observed=False all categories should be present in the output, as @nmusolino expects.
At first sight this is true for most aggregations, but count() seems to be an exception:
import pandas as pd

pdf = pd.DataFrame({
    "category_1": pd.Categorical(list("AABBCC"), categories=list("ABCDEF")),
    "category_2": pd.Categorical(list("ABC") * 2, categories=list("ABCDEF")),
    "value": [0.1] * 6,
})
pdf.groupby(["category_1", "category_2"])["value"].sum()  # All categories present
pdf.groupby(["category_1", "category_2"])["value"].mean()  # All categories present
pdf.groupby(["category_1", "category_2"])["value"].min()  # All categories present
pdf.groupby(["category_1", "category_2"])["value"].count()  # Only observed present!
So I do think this is a bug that's still present.
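One way to check this on a given installation is to compare the length of each result with the full product of categories (6 x 6 = 36 here). This is only a sketch: the helper name `rows_per_aggregation` is hypothetical, `observed=False` is passed explicitly (which assumes a pandas version that has the keyword), and the number reported for count() depends on the installed version, since this issue is marked as fixed by #29690.

```python
import pandas as pd

pdf = pd.DataFrame({
    "category_1": pd.Categorical(list("AABBCC"), categories=list("ABCDEF")),
    "category_2": pd.Categorical(list("ABC") * 2, categories=list("ABCDEF")),
    "value": [0.1] * 6,
})


def rows_per_aggregation(frame, aggregations=("sum", "mean", "min", "count")):
    """Return the number of result rows for each named aggregation (hypothetical helper)."""
    grouped = frame.groupby(["category_1", "category_2"], observed=False)["value"]
    return {name: len(getattr(grouped, name)()) for name in aggregations}


# Size of the full Cartesian product of category labels.
expected_rows = (
    len(pdf["category_1"].cat.categories) * len(pdf["category_2"].cat.categories)
)

lengths = rows_per_aggregation(pdf)
for name, n_rows in lengths.items():
    print(f"{name}: {n_rows} rows (full product would be {expected_rows})")
```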
          @jreback is there a way of doing that, though, without killing the test framework? I don't think it is really a test-worthy case. I simply mean that if you have 20k rows indexed by three columns with arity 10k x 10k x 10k, you will get a cube ravelled to 1e12 rows with the default settings. Setting observed=True gives < 20k rows.
The new default is fine; it is probably best that folks learn to turn off the Cartesian expansion. But it could hit people when they upgrade old code.
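The same effect can be seen at a much smaller scale. The sketch below uses made-up sizes (50 rows, three categorical columns with 20 categories each) and illustrative column names; it is not taken from the thread. With observed=False the result has 20 ** 3 = 8000 rows, while observed=True yields at most one row per input row.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, arity = 50, 20  # assumed sizes, chosen to keep the example fast
labels = [f"c{i}" for i in range(arity)]

# Three categorical key columns, each declaring all `arity` categories.
df = pd.DataFrame({
    name: pd.Categorical(rng.choice(labels, size=n_rows), categories=labels)
    for name in ("a", "b", "c")
})
df["value"] = 1.0

# observed=False expands to the full Cartesian product of categories.
full = df.groupby(["a", "b", "c"], observed=False)["value"].sum()

# observed=True keeps only the key combinations present in the data.
seen = df.groupby(["a", "b", "c"], observed=True)["value"].sum()

print(len(full))  # arity ** 3
print(len(seen))  # never more than n_rows
```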
          I don't think we are talking about the same thing.
A reasonable test to block this default change would have been any test that fails due to explosion of dimensions when observed=False.
The test would need to run and try to produce an array too large to compute. If the test was runnable with observed=False, then it would have been an invalid test.
As the new default is in, there is nothing left to block and this kind of test has no value in the current state. That is probably the right state anyway, since everything must now be explicit in high-arity cases.
Below is the example above in the two cases, for reference.