This feels like a dumb question but I am stuck. Using Explorer I want to count the number of rows for each group (groups may be defined by one column, or multiple columns).
I would think this would be:
alias Explorer.{Datasets, DataFrame}
Datasets.fossil_fuels()
|> DataFrame.group_by("country")
|> DataFrame.summarise(country: [:count])
And the result would be a data frame like:
#Explorer.DataFrame<
Polars[222 x 2]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
country_count integer [5, 5, 5, 5, 5, ...]
Instead I get this error:
%Inspect.Error{
message: "got RuntimeError with message \"Polars(NotFound(\\\"country_count\\\"))\"
while inspecting %{__struct__: Explorer.DataFrame, data: %{__struct__:
Explorer.PolarsBackend.DataFrame, resource: #Reference<0.3895999949.566099982.184593>},
dtypes: %{\"country\" => :string, \"country_count\" => :string}, groups: [], names:
[\"country\", \"country_count\"]}"
It works fine if I group and count by different variables, e.g.,
Datasets.fossil_fuels()
|> DataFrame.group_by("country")
|> DataFrame.summarise(total: [:count])
returns:
#Explorer.DataFrame<
Polars[222 x 2]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
total_count integer [5, 5, 5, 5, 5, ...]
but it seems odd that I have to summarise an arbitrary other variable. What am I missing?
From the docs:
When the dataframe has grouping variables, operations are performed per group
You can think of this as Explorer creating sub-dataframes using the grouping variable and applying operations on each group. But importantly, the sub-dataframe does not contain the variable any more (since it does not vary) and the variable cannot be used any longer.
┌───────────────────────────────────────────┐
┌──────────────────┐ │ Country 1 Country 2 Country N │
│ │ group_by │ ┌─────────┐ ┌───────────┐ ┌──────────┐ │
│ Fossil Fuels ├───────────► │ Sub-DF │ │ Sub-DF │ │ Sub-DF │ │
│ │ │ │ │ │ │ │ │ │
└──────────────────┘ │ └─────────┘ └───────────┘ └──────────┘ │
└───────────────────────────────────────────┘
As of version 0.3.0, Explorer.Series.summarise_with/2
is a bit more flexible:
alias Explorer.{Datasets, DataFrame, Series}
Datasets.fossil_fuels()
|> DataFrame.group_by("country")
|> DataFrame.summarise_with(&[country_count: Series.count(&1["country"])])
returns
#Explorer.DataFrame<
Polars[222 x 2]
country string ["AFGHANISTAN", "ALBANIA", "ALGERIA", "ANDORRA", "ANGOLA", ...]
country_count integer [5, 5, 5, 5, 5, ...]