RelationalGroupedDataset (Spark 2.2.1 JavaDoc)


public class RelationalGroupedDataset
extends Object

A set of methods for aggregations on a DataFrame, created by Dataset.groupBy.

The main method is the agg function, which has multiple variants. For convenience, this class also contains some first-order statistics such as mean and sum.

This class was named GroupedData in Spark 1.x.

Since:

2.0.0


    
Dataset<Row> agg(Column expr, Column... exprs)
    Compute aggregates by specifying a series of aggregate columns.

Dataset<Row> agg(Column expr, scala.collection.Seq<Column> exprs)
    Compute aggregates by specifying a series of aggregate columns.

Dataset<Row> agg(scala.collection.immutable.Map<String,String> exprs)
    (Scala-specific) Compute aggregates by specifying a map from column name to aggregate methods.

Dataset<Row> agg(java.util.Map<String,String> exprs)
    (Java-specific) Compute aggregates by specifying a map from column name to aggregate methods.

Dataset<Row> agg(scala.Tuple2<String,String> aggExpr, scala.collection.Seq<scala.Tuple2<String,String>> aggExprs)
    (Scala-specific) Compute aggregates by specifying the column names and aggregate methods.

static RelationalGroupedDataset apply(Dataset<Row> df, scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> groupingExprs, org.apache.spark.sql.RelationalGroupedDataset.GroupType groupType)

Dataset<Row> avg(scala.collection.Seq<String> colNames)
    Compute the mean value for each numeric column for each group.

Dataset<Row> avg(String... colNames)
    Compute the mean value for each numeric column for each group.

Dataset<Row> count()
    Count the number of rows for each group.

Dataset<Row> max(scala.collection.Seq<String> colNames)
    Compute the max value for each numeric column for each group.

Dataset<Row> max(String... colNames)
    Compute the max value for each numeric column for each group.

Dataset<Row> mean(scala.collection.Seq<String> colNames)
    Compute the average value for each numeric column for each group.

Dataset<Row> mean(String... colNames)
    Compute the average value for each numeric column for each group.

Dataset<Row> min(scala.collection.Seq<String> colNames)
    Compute the min value for each numeric column for each group.

Dataset<Row> min(String... colNames)
    Compute the min value for each numeric column for each group.

RelationalGroupedDataset pivot(String pivotColumn)
    Pivots a column of the current DataFrame and performs the specified aggregation.

RelationalGroupedDataset pivot(String pivotColumn, java.util.List<Object> values)
    Pivots a column of the current DataFrame and performs the specified aggregation.

RelationalGroupedDataset pivot(String pivotColumn, scala.collection.Seq<Object> values)
    Pivots a column of the current DataFrame and performs the specified aggregation.

Dataset<Row> sum(scala.collection.Seq<String> colNames)
    Compute the sum for each numeric column for each group.

Dataset<Row> sum(String... colNames)
    Compute the sum for each numeric column for each group.

apply

public static RelationalGroupedDataset apply(Dataset<Row> df,
                                             scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> groupingExprs,
                                             org.apache.spark.sql.RelationalGroupedDataset.GroupType groupType)

public Dataset<Row> agg(Column expr,
                        Column... exprs)

Compute aggregates by specifying a series of aggregate columns. Note that this function by default retains the grouping columns in its output. To not retain grouping columns, set spark.sql.retainGroupColumns to false.

The available aggregate methods are defined in functions.

    // Selects the age of the oldest employee and the aggregate expense for each department

    // Scala:
    import org.apache.spark.sql.functions._
    df.groupBy("department").agg(max("age"), sum("expense"))

    // Java:
    import static org.apache.spark.sql.functions.*;
    df.groupBy("department").agg(max("age"), sum("expense"));

Note that before Spark 1.4, the default behavior was to NOT retain grouping columns. To change to that behavior, set the config variable spark.sql.retainGroupColumns to false.

    // Scala, 1.3.x:
    df.groupBy("department").agg($"department", max("age"), sum("expense"))

    // Java, 1.3.x:
    df.groupBy("department").agg(col("department"), max("age"), sum("expense"));

Parameters:

expr - (undocumented)

exprs - (undocumented)

Returns:

(undocumented)

Since:

1.3.0

public Dataset<Row> mean(String... colNames)

Compute the average value for each numeric column for each group. This is an alias for avg. The resulting DataFrame will also contain the grouping columns. When columns are specified, only the average values for those columns are computed.

Parameters:

colNames - (undocumented)

Returns:

(undocumented)

Since:

1.3.0
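As a rough illustration of what "the average value for each group" means, the sketch below models df.groupBy("department").mean("age") with plain Java collections. It is not Spark code: the Employee type and the meanAgeByDepartment helper are hypothetical names introduced only for this example, and the map key plays the role of the retained grouping column.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java model of df.groupBy("department").mean("age"); NOT Spark code.
class GroupedMeanSketch {
    // Hypothetical row type standing in for a DataFrame row.
    static final class Employee {
        final String department;
        final int age;

        Employee(String department, int age) {
            this.department = department;
            this.age = age;
        }
    }

    // Groups rows by department and averages the age column within each group.
    // The map key corresponds to the grouping column kept in the result.
    static Map<String, Double> meanAgeByDepartment(List<Employee> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                e -> e.department,
                TreeMap::new,
                Collectors.averagingInt(e -> e.age)));
    }

    public static void main(String[] args) {
        List<Employee> rows = Arrays.asList(
                new Employee("eng", 30),
                new Employee("eng", 40),
                new Employee("ops", 50));
        System.out.println(meanAgeByDepartment(rows)); // prints {eng=35.0, ops=50.0}
    }
}
```

Because mean is an alias for avg, the same model applies to avg.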

public Dataset<Row> max(String... colNames)

Compute the max value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When columns are specified, only the max values for those columns are computed.

Parameters:

colNames - (undocumented)

Returns:

(undocumented)

Since:

1.3.0

public Dataset<Row> avg(String... colNames)

Compute the mean value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When columns are specified, only the mean values for those columns are computed.

Parameters:

colNames - (undocumented)

Returns:

(undocumented)

Since:

1.3.0

public Dataset<Row> min(String... colNames)

Compute the min value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When columns are specified, only the min values for those columns are computed.

Parameters:

colNames - (undocumented)

Returns:

(undocumented)

Since:

1.3.0

public Dataset<Row> sum(String... colNames)

Compute the sum for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When columns are specified, only the sums for those columns are computed.

Parameters:

colNames - (undocumented)

Returns:

(undocumented)

Since:

1.3.0

public Dataset<Row> agg(scala.Tuple2<String,String> aggExpr,
                        scala.collection.Seq<scala.Tuple2<String,String>> aggExprs)

(Scala-specific) Compute aggregates by specifying the column names and aggregate methods. The resulting DataFrame will also contain the grouping columns.

The available aggregate methods are avg, max, min, sum, count.

    // Selects the age of the oldest employee and the aggregate expense for each department
    df.groupBy("department").agg(
      "age" -> "max",
      "expense" -> "sum"
    )

Parameters:

aggExpr - (undocumented)

aggExprs - (undocumented)

Returns:

(undocumented)

Since:

1.3.0

public Dataset<Row> agg(scala.collection.immutable.Map<String,String> exprs)

(Scala-specific) Compute aggregates by specifying a map from column name to aggregate methods. The resulting DataFrame will also contain the grouping columns.

The available aggregate methods are avg, max, min, sum, count.

    // Selects the age of the oldest employee and the aggregate expense for each department
    df.groupBy("department").agg(Map(
      "age" -> "max",
      "expense" -> "sum"
    ))

Parameters:

exprs - (undocumented)

Returns:

(undocumented)

Since:

1.3.0

public Dataset<Row> agg(java.util.Map<String,String> exprs)

(Java-specific) Compute aggregates by specifying a map from column name to aggregate methods. The resulting DataFrame will also contain the grouping columns.

The available aggregate methods are avg, max, min, sum, count.

    // Selects the age of the oldest employee and the aggregate expense for each department
    import com.google.common.collect.ImmutableMap;
    df.groupBy("department").agg(ImmutableMap.of("age", "max", "expense", "sum"));

Parameters:

exprs - (undocumented)

Returns:

(undocumented)

Since:

1.3.0
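To make the "map from column name to aggregate method" idea concrete, the following sketch models it with plain Java collections. It is not Spark code and says nothing about how Spark evaluates the map internally. The AggMapSketch class, its row helper, and the method(column) naming of output entries are assumptions of this example, and only max, min, and sum are handled to keep it short.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of df.groupBy("department").agg(exprs); NOT Spark code.
// Rows are hypothetical maps from column name to value.
class AggMapSketch {
    // Folds one more value into a running aggregate (null means "no value yet").
    static double combine(String method, Double current, double next) {
        if (current == null) {
            return next;
        }
        switch (method) {
            case "max": return Math.max(current, next);
            case "min": return Math.min(current, next);
            case "sum": return current + next;
            default: throw new IllegalArgumentException("unsupported in this sketch: " + method);
        }
    }

    // Returns group key -> (output name -> aggregated value).
    static Map<String, Map<String, Double>> agg(
            List<Map<String, Object>> rows, String groupColumn, Map<String, String> exprs) {
        Map<String, Map<String, Double>> result = new TreeMap<>();
        for (Map<String, Object> row : rows) {
            String key = (String) row.get(groupColumn);
            Map<String, Double> out = result.computeIfAbsent(key, k -> new LinkedHashMap<>());
            for (Map.Entry<String, String> e : exprs.entrySet()) {
                String column = e.getKey();
                String method = e.getValue();
                String outName = method + "(" + column + ")"; // naming chosen for this sketch
                double value = ((Number) row.get(column)).doubleValue();
                out.put(outName, combine(method, out.get(outName), value));
            }
        }
        return result;
    }

    // Builds one hypothetical employee row.
    static Map<String, Object> row(String department, int age, double expense) {
        Map<String, Object> r = new LinkedHashMap<>();
        r.put("department", department);
        r.put("age", age);
        r.put("expense", expense);
        return r;
    }

    public static void main(String[] args) {
        Map<String, String> exprs = new LinkedHashMap<>();
        exprs.put("age", "max");
        exprs.put("expense", "sum");
        List<Map<String, Object>> rows = Arrays.asList(
                row("eng", 30, 100.0), row("eng", 45, 50.0), row("ops", 50, 20.0));
        // prints {eng={max(age)=45.0, sum(expense)=150.0}, ops={max(age)=50.0, sum(expense)=20.0}}
        System.out.println(agg(rows, "department", exprs));
    }
}
```

Since the expressions are keyed by column name, a map can hold at most one aggregate method per column.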

public Dataset<Row> agg(Column expr,
                        scala.collection.Seq<Column> exprs)

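Compute aggregates by specifying a series of aggregate columns. Note that this function by default retains the grouping columns in its output. To not retain grouping columns, set spark.sql.retainGroupColumns to false.

The available aggregate methods are defined in functions.

    // Selects the age of the oldest employee and the aggregate expense for each department

    // Scala:
    import org.apache.spark.sql.functions._
    df.groupBy("department").agg(max("age"), sum("expense"))

    // Java:
    import static org.apache.spark.sql.functions.*;
    df.groupBy("department").agg(max("age"), sum("expense"));

Note that before Spark 1.4, the default behavior was to NOT retain grouping columns. To change to that behavior, set the config variable spark.sql.retainGroupColumns to false.

    // Scala, 1.3.x:
    df.groupBy("department").agg($"department", max("age"), sum("expense"))

    // Java, 1.3.x:
    df.groupBy("department").agg(col("department"), max("age"), sum("expense"));

Parameters:

expr - (undocumented)

exprs - (undocumented)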

Returns:

(undocumented)

Since:

1.3.0

public Dataset<Row> count()

Count the number of rows for each group. The resulting DataFrame will also contain the grouping columns.

Returns:

(undocumented)

Since:

1.3.0

public Dataset<Row> mean(scala.collection.Seq<String> colNames)

Compute the average value for each numeric column for each group. This is an alias for avg. The resulting DataFrame will also contain the grouping columns. When columns are specified, only the average values for those columns are computed.

Parameters:

colNames - (undocumented)

Returns:

(undocumented)

Since:

1.3.0

public Dataset<Row> max(scala.collection.Seq<String> colNames)

Compute the max value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When columns are specified, only the max values for those columns are computed.

Parameters:

colNames - (undocumented)

Returns:

(undocumented)

Since:

1.3.0

public Dataset<Row> avg(scala.collection.Seq<String> colNames)

Compute the mean value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When columns are specified, only the mean values for those columns are computed.

Parameters:

colNames - (undocumented)

Returns:

(undocumented)

Since:

1.3.0

public Dataset<Row> min(scala.collection.Seq<String> colNames)

Compute the min value for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When columns are specified, only the min values for those columns are computed.

Parameters:

colNames - (undocumented)

Returns:

(undocumented)

Since:

1.3.0

public Dataset<Row> sum(scala.collection.Seq<String> colNames)

Compute the sum for each numeric column for each group. The resulting DataFrame will also contain the grouping columns. When columns are specified, only the sums for those columns are computed.

Parameters:

colNames - (undocumented)

Returns:

(undocumented)

Since:

1.3.0

pivot

public RelationalGroupedDataset pivot(String pivotColumn)

Pivots a column of the current DataFrame and performs the specified aggregation. There are two versions of the pivot function: one that requires the caller to specify the list of distinct values to pivot on, and one that does not. The latter is more concise but less efficient, because Spark needs to first compute the list of distinct values internally.

    // Compute the sum of earnings for each year by course with each course as a separate column
    df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings")

    // Or without specifying column values (less efficient)
    df.groupBy("year").pivot("course").sum("earnings")

Parameters:

pivotColumn - Name of the column to pivot.

Returns:

(undocumented)

Since:

1.6.0
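The shape of a pivoted result, and the reason the value-less variant costs an extra pass, can be modeled in plain Java. The sketch below is not Spark code: Sale, pivotSum, and distinctCourses are hypothetical names used only here. pivotSum mirrors groupBy("year").pivot("course", values).sum("earnings"), and distinctCourses stands in for the distinct-value computation that is needed when no values are supplied.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java model of df.groupBy("year").pivot("course", values).sum("earnings"); NOT Spark code.
class PivotSketch {
    // Hypothetical row type standing in for a DataFrame row.
    static final class Sale {
        final int year;
        final String course;
        final double earnings;

        Sale(int year, String course, double earnings) {
            this.year = year;
            this.course = course;
            this.earnings = earnings;
        }
    }

    // Returns year -> (pivot value -> summed earnings). Each listed pivot value becomes
    // one output column; a value with no rows in a group stays null.
    static Map<Integer, Map<String, Double>> pivotSum(List<Sale> rows, List<String> values) {
        Map<Integer, Map<String, Double>> result = new TreeMap<>();
        for (Sale s : rows) {
            Map<String, Double> columns = result.computeIfAbsent(s.year, y -> {
                Map<String, Double> m = new LinkedHashMap<>();
                for (String v : values) {
                    m.put(v, null);
                }
                return m;
            });
            if (columns.containsKey(s.course)) { // rows with unlisted pivot values are skipped
                columns.merge(s.course, s.earnings, Double::sum);
            }
        }
        return result;
    }

    // Models the extra pass needed when no values are given: collect the distinct
    // pivot values first (sorted here only to get a stable column order).
    static List<String> distinctCourses(List<Sale> rows) {
        return rows.stream().map(s -> s.course).distinct().sorted().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Sale> rows = Arrays.asList(
                new Sale(2012, "dotNET", 10000.0),
                new Sale(2012, "Java", 20000.0),
                new Sale(2012, "dotNET", 5000.0),
                new Sale(2013, "Java", 30000.0));
        // prints {2012={dotNET=15000.0, Java=20000.0}, 2013={dotNET=null, Java=30000.0}}
        System.out.println(pivotSum(rows, Arrays.asList("dotNET", "Java")));
        // Without explicit values, the distinct values are computed first, then pivoted.
        System.out.println(pivotSum(rows, distinctCourses(rows)));
    }
}
```

In this model, passing the values explicitly skips the distinctCourses pass over the data, which is the trade-off the description above refers to.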

pivot

public RelationalGroupedDataset pivot(String pivotColumn,
                                      scala.collection.Seq<Object> values)

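Pivots a column of the current DataFrame and performs the specified aggregation. There are two versions of the pivot function: one that requires the caller to specify the list of distinct values to pivot on, and one that does not. The latter is more concise but less efficient, because Spark needs to first compute the list of distinct values internally.

    // Compute the sum of earnings for each year by course with each course as a separate column
    df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings")

    // Or without specifying column values (less efficient)
    df.groupBy("year").pivot("course").sum("earnings")

Parameters:

pivotColumn - Name of the column to pivot.

values - List of values that will be translated to columns in the output DataFrame.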

Returns:

(undocumented)

Since:

1.6.0

pivot

public RelationalGroupedDataset pivot(String pivotColumn,
                                      java.util.List<Object> values)

Pivots a column of the current DataFrame and performs the specified aggregation. There are two versions of the pivot function: one that requires the caller to specify the list of distinct values to pivot on, and one that does not. The latter is more concise but less efficient, because Spark needs to first compute the list of distinct values internally.

    // Compute the sum of earnings for each year by course with each course as a separate column
    df.groupBy("year").pivot("course", Arrays.<Object>asList("dotNET", "Java")).sum("earnings");

    // Or without specifying column values (less efficient)
    df.groupBy("year").pivot("course").sum("earnings");

Parameters:

pivotColumn - Name of the column to pivot.

values - List of values that will be translated to columns in the output DataFrame.

Returns:

(undocumented)

Since:

1.6.0