A set of methods for aggregations on a DataFrame, created by Dataset.groupBy.
The main method is the agg function, which has multiple variants. This class also contains
some first-order statistics, such as mean and sum, for convenience.
This class was named GroupedData in Spark 1.x.
Since:
2.0.0
Dataset<Row> agg(scala.collection.immutable.Map<String,String> exprs)
(Scala-specific) Compute aggregates by specifying a map from column name to
aggregate methods.
Dataset<Row> agg(java.util.Map<String,String> exprs)
(Java-specific) Compute aggregates by specifying a map from column name to
aggregate methods.
Dataset<Row> agg(scala.Tuple2<String,String> aggExpr,
scala.collection.Seq<scala.Tuple2<String,String>> aggExprs)
(Scala-specific) Compute aggregates by specifying the column names and
aggregate methods.
static RelationalGroupedDataset apply(Dataset<Row> df,
scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> groupingExprs,
org.apache.spark.sql.RelationalGroupedDataset.GroupType groupType)
Dataset<Row> avg(scala.collection.Seq<String> colNames)
Compute the mean value for each numeric column for each group.
Dataset<Row> avg(String... colNames)
Compute the mean value for each numeric column for each group.
Dataset<Row> count()
Count the number of rows for each group.
Dataset<Row> max(scala.collection.Seq<String> colNames)
Compute the max value for each numeric column for each group.
Dataset<Row> max(String... colNames)
Compute the max value for each numeric column for each group.
Dataset<Row> mean(scala.collection.Seq<String> colNames)
Compute the average value for each numeric column for each group.
Dataset<Row> mean(String... colNames)
Compute the average value for each numeric column for each group.
Dataset<Row> min(scala.collection.Seq<String> colNames)
Compute the min value for each numeric column for each group.
Dataset<Row> min(String... colNames)
Compute the min value for each numeric column for each group.
RelationalGroupedDataset pivot(String pivotColumn)
Pivots a column of the current DataFrame and performs the specified aggregation.
RelationalGroupedDataset pivot(String pivotColumn,
java.util.List<Object> values)
Pivots a column of the current DataFrame and performs the specified aggregation.
RelationalGroupedDataset pivot(String pivotColumn,
scala.collection.Seq<Object> values)
Pivots a column of the current DataFrame and performs the specified aggregation.
Dataset<Row> sum(scala.collection.Seq<String> colNames)
Compute the sum for each numeric column for each group.
Dataset<Row> sum(String... colNames)
Compute the sum for each numeric column for each group.
apply
public static RelationalGroupedDataset apply(Dataset<Row> df,
scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> groupingExprs,
org.apache.spark.sql.RelationalGroupedDataset.GroupType groupType)
public Dataset<Row> agg(Column expr,
Column... exprs)
Compute aggregates by specifying a series of aggregate columns. Note that this function by
default retains the grouping columns in its output. To not retain grouping columns, set
spark.sql.retainGroupColumns to false.
The available aggregate methods are defined in functions.
// Selects the age of the oldest employee and the aggregate expense for each department
// Scala:
import org.apache.spark.sql.functions._
df.groupBy("department").agg(max("age"), sum("expense"))
// Java:
import static org.apache.spark.sql.functions.*;
df.groupBy("department").agg(max("age"), sum("expense"));
Note that before Spark 1.4, the default behavior was to NOT retain grouping columns. To change
to that behavior, set the config variable spark.sql.retainGroupColumns to false.
// Scala, 1.3.x:
df.groupBy("department").agg($"department", max("age"), sum("expense"))
// Java, 1.3.x:
df.groupBy("department").agg(col("department"), max("age"), sum("expense"));
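To make the semantics of the department example concrete, the following is a minimal plain-Java sketch of what grouping by department and taking max(age) and sum(expense) computes. It deliberately uses no Spark classes; the names GroupAggSketch, Employee and maxAgeAndSumExpense are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; no Spark classes are involved.
final class GroupAggSketch {
    // Hypothetical stand-in for one row of the employee DataFrame.
    static final class Employee {
        final String department;
        final int age;
        final double expense;

        Employee(String department, int age, double expense) {
            this.department = department;
            this.age = age;
            this.expense = expense;
        }
    }

    // Mimics groupBy("department").agg(max("age"), sum("expense")).
    // Each result value is {max(age), sum(expense)} for that department;
    // the map key plays the role of the retained grouping column.
    static Map<String, double[]> maxAgeAndSumExpense(List<Employee> rows) {
        Map<String, double[]> out = new TreeMap<>();
        for (Employee e : rows) {
            out.merge(
                e.department,
                new double[] {e.age, e.expense},
                (a, b) -> new double[] {Math.max(a[0], b[0]), a[1] + b[1]});
        }
        return out;
    }

    public static void main(String[] args) {
        List<Employee> rows = Arrays.asList(
            new Employee("eng", 31, 100.0),
            new Employee("eng", 45, 50.0),
            new Employee("ops", 28, 20.0));
        maxAgeAndSumExpense(rows).forEach(
            (dept, agg) -> System.out.println(dept + " " + agg[0] + " " + agg[1]));
    }
}
```

The sketch yields one result entry per distinct department, which corresponds to one output row per group in the aggregated DataFrame.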
Parameters:
expr - (undocumented)
exprs - (undocumented)
Returns:
(undocumented)
Since:
1.3.0
public Dataset<Row> mean(String... colNames)
Compute the average value for each numeric column for each group. This is an alias for avg.
The resulting DataFrame will also contain the grouping columns.
When specified columns are given, only compute the average values for them.
Parameters:
colNames - (undocumented)
Returns:
(undocumented)
Since:
1.3.0
public Dataset<Row> max(String... colNames)
Compute the max value for each numeric column for each group.
The resulting DataFrame will also contain the grouping columns.
When specified columns are given, only compute the max values for them.
Parameters:
colNames - (undocumented)
Returns:
(undocumented)
Since:
1.3.0
public Dataset<Row> avg(String... colNames)
Compute the mean value for each numeric column for each group.
The resulting DataFrame will also contain the grouping columns.
When specified columns are given, only compute the mean values for them.
Parameters:
colNames - (undocumented)
Returns:
(undocumented)
Since:
1.3.0
public Dataset<Row> min(String... colNames)
Compute the min value for each numeric column for each group.
The resulting DataFrame will also contain the grouping columns.
When specified columns are given, only compute the min values for them.
Parameters:
colNames - (undocumented)
Returns:
(undocumented)
Since:
1.3.0
public Dataset<Row> sum(String... colNames)
Compute the sum for each numeric column for each group.
The resulting DataFrame will also contain the grouping columns.
When specified columns are given, only compute the sum for them.
Parameters:
colNames - (undocumented)
Returns:
(undocumented)
Since:
1.3.0
public Dataset<Row> agg(scala.Tuple2<String,String> aggExpr,
scala.collection.Seq<scala.Tuple2<String,String>> aggExprs)
(Scala-specific) Compute aggregates by specifying the column names and
aggregate methods. The resulting DataFrame will also contain the grouping columns.
The available aggregate methods are avg, max, min, sum, count.
// Selects the age of the oldest employee and the aggregate expense for each department
df.groupBy("department").agg(
  "age" -> "max",
  "expense" -> "sum"
)
Parameters:
aggExpr - (undocumented)
aggExprs - (undocumented)
Returns:
(undocumented)
Since:
1.3.0
public Dataset<Row> agg(scala.collection.immutable.Map<String,String> exprs)
(Scala-specific) Compute aggregates by specifying a map from column name to
aggregate methods. The resulting DataFrame will also contain the grouping columns.
The available aggregate methods are avg, max, min, sum, count.
// Selects the age of the oldest employee and the aggregate expense for each department
df.groupBy("department").agg(Map(
  "age" -> "max",
  "expense" -> "sum"
))
Parameters:
exprs - (undocumented)
Returns:
(undocumented)
Since:
1.3.0
public Dataset<Row> agg(java.util.Map<String,String> exprs)
(Java-specific) Compute aggregates by specifying a map from column name to
aggregate methods. The resulting DataFrame will also contain the grouping columns.
The available aggregate methods are avg, max, min, sum, count.
// Selects the age of the oldest employee and the aggregate expense for each department
import com.google.common.collect.ImmutableMap;
df.groupBy("department").agg(ImmutableMap.of("age", "max", "expense", "sum"));
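As a rough illustration of the map form, the plain-Java sketch below resolves each aggregate method name listed above and applies it to the values of one group. It is not Spark's implementation and uses no Spark classes: NamedAggSketch, applyNamed and aggregateGroup are hypothetical names, the "method(column)" output keys are just a labeling choice for this sketch, and returning NaN for an empty group is a simplification.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; no Spark classes are involved.
final class NamedAggSketch {
    // Applies one of the named aggregate methods to the values of a single group.
    static double applyNamed(String method, List<Double> values) {
        switch (method) {
            case "avg":
                return values.stream().mapToDouble(Double::doubleValue).average().orElse(Double.NaN);
            case "max":
                return values.stream().mapToDouble(Double::doubleValue).max().orElse(Double.NaN);
            case "min":
                return values.stream().mapToDouble(Double::doubleValue).min().orElse(Double.NaN);
            case "sum":
                return values.stream().mapToDouble(Double::doubleValue).sum();
            case "count":
                return values.size();
            default:
                throw new IllegalArgumentException("Unknown aggregate method: " + method);
        }
    }

    // exprs maps column name -> aggregate method name (as in the agg call above);
    // group maps column name -> the values of that column within one group.
    static Map<String, Double> aggregateGroup(Map<String, String> exprs,
                                              Map<String, List<Double>> group) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : exprs.entrySet()) {
            String column = e.getKey();
            String method = e.getValue();
            out.put(method + "(" + column + ")", applyNamed(method, group.get(column)));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> exprs = new LinkedHashMap<>();
        exprs.put("age", "max");
        exprs.put("expense", "sum");

        Map<String, List<Double>> group = new HashMap<>();
        group.put("age", Arrays.asList(31.0, 45.0));
        group.put("expense", Arrays.asList(100.0, 50.0));

        System.out.println(aggregateGroup(exprs, group));
    }
}
```

Because a map is keyed by column name, this form can carry only one aggregate method per column.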
Parameters:
exprs - (undocumented)
Returns:
(undocumented)
Since:
1.3.0
public Dataset<Row> agg(Column expr,
scala.collection.Seq<Column> exprs)
Compute aggregates by specifying a series of aggregate columns. Note that this function by
default retains the grouping columns in its output. To not retain grouping columns, set
spark.sql.retainGroupColumns to false.
The available aggregate methods are defined in functions.
// Selects the age of the oldest employee and the aggregate expense for each department
// Scala:
import org.apache.spark.sql.functions._
df.groupBy("department").agg(max("age"), sum("expense"))
// Java:
import static org.apache.spark.sql.functions.*;
df.groupBy("department").agg(max("age"), sum("expense"));
Note that before Spark 1.4, the default behavior was to NOT retain grouping columns. To change
to that behavior, set the config variable spark.sql.retainGroupColumns to false.
// Scala, 1.3.x:
df.groupBy("department").agg($"department", max("age"), sum("expense"))
// Java, 1.3.x:
df.groupBy("department").agg(col("department"), max("age"), sum("expense"));
Parameters:
expr - (undocumented)
exprs - (undocumented)
Returns:
(undocumented)
Since:
1.3.0
public Dataset<Row> count()
Count the number of rows for each group.
The resulting DataFrame will also contain the grouping columns.
Returns:
(undocumented)
Since:
1.3.0
public Dataset<Row> mean(scala.collection.Seq<String> colNames)
Compute the average value for each numeric column for each group. This is an alias for avg.
The resulting DataFrame will also contain the grouping columns.
When specified columns are given, only compute the average values for them.
Parameters:
colNames - (undocumented)
Returns:
(undocumented)
Since:
1.3.0
public Dataset<Row> max(scala.collection.Seq<String> colNames)
Compute the max value for each numeric column for each group.
The resulting DataFrame will also contain the grouping columns.
When specified columns are given, only compute the max values for them.
Parameters:
colNames - (undocumented)
Returns:
(undocumented)
Since:
1.3.0
public Dataset<Row> avg(scala.collection.Seq<String> colNames)
Compute the mean value for each numeric column for each group.
The resulting DataFrame will also contain the grouping columns.
When specified columns are given, only compute the mean values for them.
Parameters:
colNames - (undocumented)
Returns:
(undocumented)
Since:
1.3.0
public Dataset<Row> min(scala.collection.Seq<String> colNames)
Compute the min value for each numeric column for each group.
The resulting DataFrame will also contain the grouping columns.
When specified columns are given, only compute the min values for them.
Parameters:
colNames - (undocumented)
Returns:
(undocumented)
Since:
1.3.0
public Dataset<Row> sum(scala.collection.Seq<String> colNames)
Compute the sum for each numeric column for each group.
The resulting DataFrame will also contain the grouping columns.
When specified columns are given, only compute the sum for them.
Parameters:
colNames - (undocumented)
Returns:
(undocumented)
Since:
1.3.0
pivot
public RelationalGroupedDataset pivot(String pivotColumn)
Pivots a column of the current DataFrame and performs the specified aggregation.
There are two versions of the pivot function: one that requires the caller to specify the list
of distinct values to pivot on, and one that does not. The latter is more concise but less
efficient, because Spark needs to first compute the list of distinct values internally.
// Compute the sum of earnings for each year by course with each course as a separate column
df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings")
// Or without specifying column values (less efficient)
df.groupBy("year").pivot("course").sum("earnings")
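The difference between the two pivot variants can be sketched in plain Java, without any Spark classes. In this hypothetical illustration (PivotSketch, Sale, distinctCourses and pivotSum are names invented here), the explicit-values variant needs a single pass over the rows, while the variant without values first needs an extra pass to collect the distinct pivot values. Sorting the distinct values is done only to keep the sketch deterministic, and a missing map key stands in for an empty cell.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java illustration only; no Spark classes are involved.
final class PivotSketch {
    // Hypothetical stand-in for one row with year, course and earnings columns.
    static final class Sale {
        final int year;
        final String course;
        final double earnings;

        Sale(int year, String course, double earnings) {
            this.year = year;
            this.course = course;
            this.earnings = earnings;
        }
    }

    // Extra pass needed when the caller does not supply the pivot values.
    static List<String> distinctCourses(List<Sale> rows) {
        TreeSet<String> seen = new TreeSet<>();
        for (Sale s : rows) {
            seen.add(s.course);
        }
        return new ArrayList<>(seen);
    }

    // Mimics groupBy("year").pivot("course", values).sum("earnings"):
    // one entry per year, with one summed cell per requested course.
    static Map<Integer, Map<String, Double>> pivotSum(List<Sale> rows, List<String> values) {
        Map<Integer, Map<String, Double>> out = new TreeMap<>();
        for (Sale s : rows) {
            Map<String, Double> row = out.computeIfAbsent(s.year, y -> new LinkedHashMap<>());
            if (values.contains(s.course)) {
                row.merge(s.course, s.earnings, Double::sum);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Sale> rows = Arrays.asList(
            new Sale(2012, "dotNET", 10000.0),
            new Sale(2012, "Java", 20000.0),
            new Sale(2013, "Java", 30000.0));
        // Explicit values: single pass over the data.
        System.out.println(pivotSum(rows, Arrays.asList("dotNET", "Java")));
        // No values given: distinct values are computed first, then the same pivot runs.
        System.out.println(pivotSum(rows, distinctCourses(rows)));
    }
}
```

In the sketch both calls produce the same cells; only the second pays for the additional scan that discovers the distinct courses.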
Parameters:
pivotColumn - Name of the column to pivot.
Returns:
(undocumented)
Since:
1.6.0
pivot
public RelationalGroupedDataset pivot(String pivotColumn,
scala.collection.Seq<Object> values)
Pivots a column of the current DataFrame and performs the specified aggregation.
There are two versions of the pivot function: one that requires the caller to specify the list
of distinct values to pivot on, and one that does not. The latter is more concise but less
efficient, because Spark needs to first compute the list of distinct values internally.
// Compute the sum of earnings for each year by course with each course as a separate column
df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings")
// Or without specifying column values (less efficient)
df.groupBy("year").pivot("course").sum("earnings")
Parameters:
pivotColumn - Name of the column to pivot.
values - List of values that will be translated to columns in the output DataFrame.
Returns:
(undocumented)
Since:
1.6.0
pivot
public RelationalGroupedDataset pivot(String pivotColumn,
java.util.List<Object> values)
Pivots a column of the current DataFrame and performs the specified aggregation.
There are two versions of the pivot function: one that requires the caller to specify the list
of distinct values to pivot on, and one that does not. The latter is more concise but less
efficient, because Spark needs to first compute the list of distinct values internally.
// Compute the sum of earnings for each year by course with each course as a separate column
import java.util.Arrays;
df.groupBy("year").pivot("course", Arrays.<Object>asList("dotNET", "Java")).sum("earnings");
// Or without specifying column values (less efficient)
df.groupBy("year").pivot("course").sum("earnings");
Parameters:
pivotColumn - Name of the column to pivot.
values - List of values that will be translated to columns in the output DataFrame.
Returns:
(undocumented)
Since:
1.6.0