This feels like a dumb question but I am stuck. Using Explorer I want to count the number of rows for each group (groups may be defined by one column, or multiple columns).

I would think this would be:

alias Explorer.{Datasets, DataFrame}
Datasets.fossil_fuels()
|> DataFrame.group_by("country")
|> DataFrame.summarise(country: [:count])

And the result would be a data frame like:

#Explorer.DataFrame<
  Polars[222 x 2]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  country_count integer [5, 5, 5, 5, 5, ...]

Instead I get this error:

%Inspect.Error{
  message: "got RuntimeError with message \"Polars(NotFound(\\\"country_count\\\"))\" 
  while inspecting %{__struct__: Explorer.DataFrame, data: %{__struct__:  
  Explorer.PolarsBackend.DataFrame, resource: #Reference<0.3895999949.566099982.184593>},
  dtypes: %{\"country\" => :string, \"country_count\" => :string}, groups: [], names: 
  [\"country\", \"country_count\"]}"

It works fine if I group by one variable and count a different one, e.g.,

Datasets.fossil_fuels()
|> DataFrame.group_by("country")
|> DataFrame.summarise(total: [:count])

returns:

#Explorer.DataFrame<
  Polars[222 x 2]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  total_count integer [5, 5, 5, 5, 5, ...]

but it seems odd that I have to summarise an arbitrary other variable. What am I missing?

From the docs:

When the dataframe has grouping variables, operations are performed per group

You can think of this as Explorer creating sub-dataframes using the grouping variable and applying operations on each group. But importantly, the sub-dataframe does not contain the variable any more (since it does not vary) and the variable cannot be used any longer.

                                 ┌───────────────────────────────────────────┐
┌──────────────────┐             │  Country 1     Country 2     Country N    │
│                  │  group_by   │ ┌─────────┐  ┌───────────┐  ┌──────────┐  │
│   Fossil Fuels   ├───────────► │ │ Sub-DF  │  │  Sub-DF   │  │  Sub-DF  │  │
│                  │             │ │         │  │           │  │          │  │
└──────────────────┘             │ └─────────┘  └───────────┘  └──────────┘  │
                                 └───────────────────────────────────────────┘
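
To make the sub-dataframe idea concrete, here is a rough sketch in plain Elixir with no Explorer involved. The `rows` list and its values are made up for illustration, and this is only an analogy for what the grouping does conceptually, not how Explorer or Polars implement it.

```elixir
# Hypothetical rows standing in for a dataframe (not the real fossil_fuels data).
rows = [
  %{country: "AFGHANISTAN", year: 2010, total: 2308},
  %{country: "AFGHANISTAN", year: 2011, total: 3338},
  %{country: "ALBANIA", year: 2010, total: 1254}
]

# group_by: one list of sub-rows per country. The value function drops the
# grouping key, mirroring how the sub-dataframe no longer carries that column.
groups = Enum.group_by(rows, & &1.country, &Map.delete(&1, :country))

# summarise: count the rows in each group.
counts = Map.new(groups, fn {country, sub_rows} -> {country, length(sub_rows)} end)

# Grouping by several columns amounts to grouping by a composite key.
counts_by_country_year = Enum.frequencies_by(rows, &{&1.country, &1.year})

IO.inspect(counts)
# => %{"AFGHANISTAN" => 2, "ALBANIA" => 1}
IO.inspect(counts_by_country_year)
```

Once the key has been removed from each sub-row, a count has to be taken over the rows themselves or over some other column, which is roughly why counting the grouping column inside the summary fails in the question above.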

As of version 0.3.0, Explorer.DataFrame.summarise_with/2 is a bit more flexible:

alias Explorer.{Datasets, DataFrame, Series}
Datasets.fossil_fuels()
|> DataFrame.group_by("country")
|> DataFrame.summarise_with(&[country_count: Series.count(&1["country"])])

returns

#Explorer.DataFrame<
  Polars[222 x 2]
  country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
  country_count integer [5, 5, 5, 5, 5, ...]