大叔经验分享（83）impala执行多个select distinct

link管理

链接快照平台

输入网页链接，自动生成快照
标签化管理网页链接

相关文章推荐

玩命的小虾米 · SQL ...· 7 小时前 ·

听话的感冒药 · sql server视图的作用_信息技术中心· 14 小时前 ·

奔跑的西瓜 · SQL ...· 14 小时前 ·

瘦瘦的野马 · SQL ...· 14 小时前 ·

自信的槟榔 · update 语句where ...· 20 小时前 ·

傲视众生的开水瓶 · 花少为何停播？因为张翰没谈恋爱啊！ - ...· 3 周前 ·

魁梧的凉面 · 86版《西游记》的配乐和歌曲曾被批“特别庸俗”· 1 月前 ·

豪爽的菠萝 · 2023年文化和自然遗产日丨@山西文物界的“ ...· 2 月前 ·

玉树临风的口罩 · 错误提示· 2 月前 ·

淡定的菠萝 · 湖北芯擎科技有限公司· 3 月前 ·

impala在一个select中执行多个count distinct时会报错，比如执行

select key, count(distinct column_a), count(distinct column_b) from test_table group by key

Query submitted at: 2019-09-28 00:34:20 (Coordinator: http://DataOne-001:25000)
ERROR: AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters as count(DISTINCT column_a);
deviating function: count(DISTINCT column_b)
Consider using NDV() instead of COUNT(DISTINCT) if estimated counts are acceptable. Enable the APPX_COUNT_DISTINCT query option to
perform this rewrite automatically.

这时有几种方法：

1 使用近似值

1.1 set APPX_COUNT_DISTINCT = true
1.2 count distinct改为ndv，即ndv(column_a)
这两种方法底层实现是一样的，设置APPX_COUNT_DISTINCT会自动将count distinct改写为ndv，ndv全称为（number of distinct values），用到
Cardinality（基数计数），底层实现是类似HLLC（Hyper LogLog Counting）这种概率算法，详见参考；

An aggregate function that returns an approximate value similar to the result of COUNT(DISTINCT col ) , the "number of distinct values". It is much faster than the combination of COUNT and DISTINCT , and uses a constant amount of memory and thus is less memory-intensive for columns with high cardinality.

2 使用精确值

改写为多个子查询然后join，比如

select a.key, a.count_a, b.count_b from
(select key, count(distinct column_a) count_a from test_table group by key) a join
(select key, count(distinct column_b) count_b from test_table group by key) b on a.key = b.key

http://impala.apache.org/docs/build/html/topics/impala_ndv.html#ndv

APPX_COUNT_DISTINCT

http://impala.apache.org/docs/build/html/topics/impala_appx_count_distinct.html

https://stackoverflow.com/questions/39236076/impala-all-distinct-aggregate-functions-need-to-have-the-same-set-of-parameters

---------------------------------------------------------------- 结束啦，我是大魔王先生的分割线：) ----------------------------------------------------------------

由于大魔王先生能力有限，文中可能存在错误，欢迎指正、补充！

感谢您的阅读，如果文章对您有用，那么请为大魔王先生轻轻点个赞，ありがとう

推荐文章

玩命的小虾米 · SQL Server日期加减函数DATEDIFF与DATEADD用法分析 - 80VPS主机

7 小时前

听话的感冒药 · sql server视图的作用_信息技术中心

14 小时前

奔跑的西瓜 · SQL Server查询优化方法(查询速度慢的原因很多，常见如下几种) . - 有梦就能实现 - 博客园

14 小时前

瘦瘦的野马 · SQL Server视图定义中不能使用option(recompile)提示 - 潇湘隐者 - 博客园

14 小时前

自信的槟榔 · update 语句where 条件in是个select 子句的返回结果集的用法_update where in-CSDN博客

20 小时前

傲视众生的开水瓶 · 花少为何停播？因为张翰没谈恋爱啊！ - 360娱乐，你开心就好

3 周前

魁梧的凉面 · 86版《西游记》的配乐和歌曲曾被批“特别庸俗”

1 月前

豪爽的菠萝 · 2023年文化和自然遗产日丨@山西文物界的“佼佼者”，快来领奖！_黄河新闻网

2 月前

玉树临风的口罩 · 错误提示

2 月前

淡定的菠萝 · 湖北芯擎科技有限公司

3 月前