impala在一个select中执行多个count distinct时会报错,比如执行
select key, count(distinct column_a), count(distinct column_b) from test_table group by key
Query submitted at: 2019-09-28 00:34:20 (Coordinator: http://DataOne-001:25000)
ERROR: AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters as count(DISTINCT column_a);
deviating function: count(DISTINCT column_b)
Consider using NDV() instead of COUNT(DISTINCT) if estimated counts are acceptable. Enable the APPX_COUNT_DISTINCT query option to
perform this rewrite automatically.
这时有几种方法:
1 使用近似值
1.1 set APPX_COUNT_DISTINCT = true
1.2 count distinct改为ndv,即ndv(column_a)
这两种方法底层实现是一样的,设置APPX_COUNT_DISTINCT会自动将count distinct改写为ndv,ndv全称为(number of distinct values),用到
Cardinality(基数计数),底层实现是类似HLLC(Hyper LogLog Counting)这种概率算法,详见参考;
An aggregate function that returns an approximate value similar to the result of
COUNT(DISTINCT
col
)
, the
"number of distinct values". It is much faster than the combination of
COUNT
and
DISTINCT
, and uses a constant amount of memory and thus is less memory-intensive for columns with high cardinality.
2 使用精确值
改写为多个子查询然后join,比如
select a.key, a.count_a, b.count_b from
(select key, count(distinct column_a) count_a from test_table group by key) a join
(select key, count(distinct column_b) count_b from test_table group by key) b on a.key = b.key
http://impala.apache.org/docs/build/html/topics/impala_ndv.html#ndv
APPX_COUNT_DISTINCT
http://impala.apache.org/docs/build/html/topics/impala_appx_count_distinct.html
https://stackoverflow.com/questions/39236076/impala-all-distinct-aggregate-functions-need-to-have-the-same-set-of-parameters
---------------------------------------------------------------- 结束啦,我是大魔王先生的分割线 :) ----------------------------------------------------------------
由于大魔王先生能力有限,文中可能存在错误,欢迎指正、补充!
感谢您的阅读,如果文章对您有用,那么请为大魔王先生轻轻点个赞,ありがとう