Accessing values in Spark SQL collection types: array and map

Indexing into an `array`

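Let's start with how to index into an array (list) column:
// Column c is an array of structs; plain dot notation pulls one field out of every struct in the array.
// expr("c['a']") and col("c")("a") give the same result.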
scala> df.select("c.a").show(10, false)
+-----------------------------------------------------------------------+
|a                                                                      |
+-----------------------------------------------------------------------+
|[str_1, str_2, str_3, str_4, str_5, str_6, str_7, str_8, str_9, str_10]|
|[str_1, str_2, str_3, str_4, str_5, str_6, str_7, str_8, str_9, str_10]|
|[str_1, str_2, str_3, str_4, str_5, str_6, str_7, str_8, str_9, str_10]|
|[str_1, str_2, str_3, str_4, str_5, str_6, str_7, str_8, str_9, str_10]|
|[str_1, str_2, str_3, str_4, str_5, str_6, str_7, str_8, str_9, str_10]|
|[str_1, str_2, str_3, str_4, str_5, str_6, str_7, str_8, str_9, str_10]|
|[str_1, str_2, str_3, str_4, str_5, str_6, str_7, str_8, str_9, str_10]|
|[str_1, str_2, str_3, str_4, str_5, str_6, str_7, str_8, str_9, str_10]|
|[str_1, str_2, str_3, str_4, str_5, str_6, str_7, str_8, str_9, str_10]|
|[str_1, str_2, str_3, str_4, str_5, str_6, str_7, str_8, str_9, str_10]|
+-----------------------------------------------------------------------+
scala> df.select("c.a").printSchema
 |-- a: array (nullable = true)
 |    |-- element: string (containsNull = true)
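The post never shows how df is built, so here is a minimal sketch of one way to construct a DataFrame with the same column layout. The case class name Item, the object name BuildDemoDf, the key pattern key_N, and the row count of 10 are all assumptions inferred from the printed schemas and outputs, not the author's original code.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical element type; it mirrors the struct<a: string, b: int> in the printed schemas.
case class Item(a: String, b: Int)

object BuildDemoDf {
  // Builds a DataFrame with the four columns used in this post (assumed layout).
  def build(spark: SparkSession, rows: Int = 10): DataFrame = {
    import spark.implicits._
    val items: Seq[Item] = (1 to 10).map(i => Item(s"str_$i", i))
    val row = (
      items,                                            // c: array<struct<a, b>>
      items.map(it => s"key_${it.b}" -> it).toMap,      // d: map<string, struct<a, b>>
      items.map(it => it.b -> s"value_${it.b}").toMap,  // e: map<int, string>
      items.map(it => it -> s"value_${it.b}").toMap     // f: map<struct<a, b>, string>
    )
    // Every row carries the same data, as in the outputs shown in this post.
    Seq.fill(rows)(row).toDF("c", "d", "e", "f")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("collection-demo").getOrCreate()
    val df = build(spark)
    df.printSchema()
    df.select("c.a").show(2, false)
    spark.stop()
  }
}
```

In spark-shell the SparkSession already exists as spark, so only the body of build is needed there.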
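// explode is a very useful expression: it expands the elements of an array into multiple rows.
// For example:
//> SELECT explode(array(10, 20));
// 10
// 20
// Another useful function is posexplode which, as the name suggests, adds a column holding each element's index in the original array.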
scala> df.select(expr("explode(c.a)")).show
+------+
|   col|
+------+
| str_1|
| str_2|
| str_3|
| str_4|
| str_5|
| str_6|
| str_7|
| str_8|
| str_9|
|str_10|
| str_1|
| str_2|
| str_3|
| str_4|
| str_5|
| str_6|
| str_7|
| str_8|
| str_9|
|str_10|
+------+
only showing top 20 rows
scala> df.select(expr("explode(c.a)")).printSchema
 |-- col: string (nullable = true)
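The transcript only demonstrates posexplode on map columns, so the sketch below shows it on a plain array, reusing the literal array(10, 20) from the explode example above. The object name PosexplodeDemo and the local SparkSession setup are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

object PosexplodeDemo {
  // Returns the (pos, col) pairs that posexplode produces for a literal array.
  def run(spark: SparkSession): Seq[(Int, Int)] =
    spark.sql("SELECT posexplode(array(10, 20))")
      .collect()
      .map(r => (r.getInt(0), r.getInt(1))) // column 0 is pos, column 1 is col
      .toSeq

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("posexplode-demo").getOrCreate()
    run(spark).foreach { case (pos, value) => println(s"$pos -> $value") }
    spark.stop()
  }
}
```

For an array the output columns are named pos and col; for a map they are pos, key and value, as the schemas in the transcript show.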
scala> df.select(expr("explode(c)")).show
+------------+
|         col|
+------------+
|  [str_1, 1]|
|  [str_2, 2]|
|  [str_3, 3]|
|  [str_4, 4]|
|  [str_5, 5]|
|  [str_6, 6]|
|  [str_7, 7]|
|  [str_8, 8]|
|  [str_9, 9]|
|[str_10, 10]|
|  [str_1, 1]|
|  [str_2, 2]|
|  [str_3, 3]|
|  [str_4, 4]|
|  [str_5, 5]|
|  [str_6, 6]|
|  [str_7, 7]|
|  [str_8, 8]|
|  [str_9, 9]|
|[str_10, 10]|
+------------+
only showing top 20 rows
scala> df.select(expr("explode(c)")).printSchema
 |-- col: struct (nullable = true)
 |    |-- a: string (nullable = true)
 |    |-- b: integer (nullable = false)
// inline is another very useful function: it expands an array[struct[XXX]] directly into the columns of XXX, one row per element.
scala> df.select(expr("inline(c)")).show
+------+---+
|     a|  b|
+------+---+
| str_1|  1|
| str_2|  2|
| str_3|  3|
| str_4|  4|
| str_5|  5|
| str_6|  6|
| str_7|  7|
| str_8|  8|
| str_9|  9|
|str_10| 10|
| str_1|  1|
| str_2|  2|
| str_3|  3|
| str_4|  4|
| str_5|  5|
| str_6|  6|
| str_7|  7|
| str_8|  8|
| str_9|  9|
|str_10| 10|
+------+---+
only showing top 20 rows
scala> df.select(expr("inline(c)")).printSchema
 |-- a: string (nullable = true)
 |-- b: integer (nullable = false)
scala> df.select(expr("posexplode(d)")).printSchema
 |-- pos: integer (nullable = false)
 |-- key: string (nullable = false)
 |-- value: struct (nullable = true)
 |    |-- a: string (nullable = true)
 |    |-- b: integer (nullable = false)
scala> df.select(expr("posexplode(e)")).printSchema
 |-- pos: integer (nullable = false)
 |-- key: integer (nullable = false)
 |-- value: string (nullable = true)
scala> df.select(expr("posexplode(f)")).show
+---+------------+--------+
|pos|         key|   value|
+---+------------+--------+
|  0|  [str_8, 8]| value_8|
|  1|[str_10, 10]|value_10|
|  2|  [str_3, 3]| value_3|
|  3|  [str_1, 1]| value_1|
|  4|  [str_6, 6]| value_6|
|  5|  [str_5, 5]| value_5|
|  6|  [str_7, 7]| value_7|
|  7|  [str_2, 2]| value_2|
|  8|  [str_4, 4]| value_4|
|  9|  [str_9, 9]| value_9|
|  0|  [str_8, 8]| value_8|
|  1|[str_10, 10]|value_10|
|  2|  [str_3, 3]| value_3|
|  3|  [str_1, 1]| value_1|
|  4|  [str_6, 6]| value_6|
|  5|  [str_5, 5]| value_5|
|  6|  [str_7, 7]| value_7|
|  7|  [str_2, 2]| value_2|
|  8|  [str_4, 4]| value_4|
|  9|  [str_9, 9]| value_9|
+---+------------+--------+
scala> df.select(expr("posexplode(f)")).printSchema
 |-- pos: integer (nullable = false)
 |-- key: struct (nullable = false)
 |    |-- a: string (nullable = true)
 |    |-- b: integer (nullable = false)
 |-- value: string (nullable = true)
// Dot notation can look up a map value by its key.
// If the key does not exist, the result for that row is null.
scala> df.select("d.key_1").show
+----------+
|     key_1|
+----------+
|[str_1, 1]|
|[str_1, 1]|
|[str_1, 1]|
|[str_1, 1]|
|[str_1, 1]|
|[str_1, 1]|
|[str_1, 1]|
|[str_1, 1]|
|[str_1, 1]|
|[str_1, 1]|
+----------+
scala> df.select("d.key_1").printSchema
 |-- key_1: struct (nullable = true)
 |    |-- a: string (nullable = true)
 |    |-- b: integer (nullable = false)
// Numeric keys work the same way.
// For a numeric key, expr("e[1]"), expr("e['1']"), col("e")(1) and col("e")("1") all work;
// they differ only in the name of the resulting column.
scala> df.select("e.1").show
+-------+
|      1|
+-------+
|value_1|
|value_1|
|value_1|
|value_1|
|value_1|
|value_1|
|value_1|
|value_1|
|value_1|
|value_1|
+-------+
scala> df.select("e.1").printSchema
 |-- 1: string (nullable = true)
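As a small self-contained sketch of the integer-key lookups mentioned above, the code below builds a one-row DataFrame with a map<int, string> column and reads key 1 in three ways. The object name MapKeyDemo and the two-entry sample map are assumptions. Column.getItem is added as one more standard way to spell the same lookup. The string-key variants are left out because whether a string is implicitly cast to an integer key may depend on the Spark version and its settings.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr}

object MapKeyDemo {
  // Looks up key 1 in a map<int, string> column three different ways.
  def run(spark: SparkSession): Seq[String] = {
    import spark.implicits._
    val df = Seq(Map(1 -> "value_1", 2 -> "value_2")).toDF("e")
    val row = df.select(
      expr("e[1]"),        // SQL subscript syntax
      col("e")(1),         // Column.apply
      col("e").getItem(1)  // Column.getItem
    ).head()
    (0 until row.length).map(i => row.getString(i))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("map-key-demo").getOrCreate()
    println(run(spark).mkString(", "))
    spark.stop()
  }
}
```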
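With struct and array access covered, map access looks simple by comparison. Now for a harder example.
The most interesting map is f, which uses a struct as its key.

In this case we can build the struct key with a parenthesized sequence of named expressions (namedExpressionSeq):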
scala> df.select(expr("f[('str_1' AS a, 1 AS b)]")).show
+---------------------------------------------+
|f[named_struct(a, str_1 AS `a`, b, 1 AS `b`)]|
+---------------------------------------------+
|                                      value_1|
|                                      value_1|
|                                      value_1|
|                                      value_1|
|                                      value_1|
|                                      value_1|
|                                      value_1|
|                                      value_1|
|                                      value_1|
|                                      value_1|
+---------------------------------------------+
scala> df.select(expr("f[('str_1' AS a, 1 AS b)]")).printSchema
 |-- f[named_struct(a, str_1 AS `a`, b, 1 AS `b`)]: string (nullable = true)
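This construction was not pulled out of thin air. It follows from the lookup approach described in my other blog post: https://blog.csdn.net/wang_wbq/article/details/79673780
In the SqlBase.g4 file we can find the following grammar rules: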
primaryExpression
    : # earlier alternatives omitted for brevity
    | '(' namedExpression (',' namedExpression)+ ')'         #rowConstructor
    # middle alternatives omitted for brevity
    | value=primaryExpression '[' index=valueExpression ']'  #subscript
    # later alternatives omitted for brevity
valueExpression
    : primaryExpression
    # later alternatives omitted for brevity
namedExpression
    : expression (AS? (identifier | identifierList))?
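From the rules above we can see that:

1. The square brackets take a valueExpression.

2. A valueExpression can be a primaryExpression.

3. A primaryExpression can have the form '(' namedExpression (',' namedExpression)+ ')'.

4. A namedExpression in turn has the form exp AS alias.
So we can use this form to construct a struct that matches the map's key.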
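The column name in the output above shows that Spark turns the row constructor into a named_struct call, so spelling the key with named_struct directly, or with the struct function from the Column API, should be equivalent. The sketch below compares the three spellings on a tiny map column. The case class Key, the object name StructKeyDemo and the sample data are assumptions, and the equivalence is inferred from that column name rather than stated by the original post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr, lit, struct}

// Hypothetical key type; it mirrors the struct<a: string, b: int> key of column f.
case class Key(a: String, b: Int)

object StructKeyDemo {
  // Looks up the same struct key in three ways and returns the values found.
  def run(spark: SparkSession): Seq[String] = {
    import spark.implicits._
    val df = Seq(Map(Key("str_1", 1) -> "value_1", Key("str_2", 2) -> "value_2")).toDF("f")
    val row = df.select(
      expr("f[('str_1' AS a, 1 AS b)]"),                      // row constructor, as in the post
      expr("f[named_struct('a', 'str_1', 'b', 1)]"),          // explicit named_struct
      col("f")(struct(lit("str_1").as("a"), lit(1).as("b")))  // Column API
    ).head()
    (0 until row.length).map(i => row.getString(i))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("struct-key-demo").getOrCreate()
    println(run(spark).mkString(", "))
    spark.stop()
  }
}
```

The field names and types of the constructed struct have to line up with the map's key type, which is why each spelling names the fields a and b explicitly.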