子查询调优
子查询背景介绍
openGauss根据子查询在SQL语句中的位置把子查询分成了子查询、子链接两种形式。
-
子查询SubQuery:对应于查询解析树中的范围表RangeTblEntry,更通俗一些指的是出现在FROM语句后面的独立的SELECT语句。
-
子链接SubLink:对应于查询解析树中的表达式,更通俗一些指的是出现在where/on子句、targetlist里面的语句。
- exist_sublink:对应EXIST、NOT EXIST语句
- any_sublink:对应op ALL(select…)语句,其中OP可以是IN,<,>,=操作符
- all_sublink:对应op ALL(select…)语句,其中OP可以是IN,<,>,=操作符
- rowcompare_sublink:对应record op (select …)语句
- expr_sublink:对应(SELECT with single targetlist item …)语句
- array_sublink:对应ARRAY(select…)语句
- cte_sublink:对应with query(…)语句
-
非相关子查询None-Correlated SubQuery
子查询的执行不依赖于外层父查询的任何属性值。这样子查询具有独立性,可独自求解,形成一个子查询计划先于外层的查询求解。
select t1.c1,t1.c2 from t1 where t1.c1 in ( select c2 from t2 where t2.c2 IN (2,3,4) QUERY PLAN ---------------------------------------------------------------- Hash Join Hash Cond: (t1.c1 = t2.c2) -> Seq Scan on t1 Filter: (c1 = ANY ('{2,3,4}'::integer[])) -> Hash -> HashAggregate Group By Key: t2.c2 -> Seq Scan on t2 Filter: (c2 = ANY ('{2,3,4}'::integer[])) (9 rows)
-
子查询的执行依赖于外层父查询的一些属性值(如下列示例t2.c1 = t1.c1条件中的t1.c1)作为内层查询的一个AND-ed条件。这样的子查询不具备独立性,需要和外层查询按分组进行求解。
select t1.c1,t1.c2 from t1 where t1.c1 in ( select c2 from t2 where t2.c1 = t1.c1 AND t2.c2 in (2,3,4) QUERY PLAN ------------------------------------------------------------------------ Seq Scan on t1 Filter: (SubPlan 1) SubPlan 1 -> Seq Scan on t2 Filter: ((c1 = t1.c1) AND (c2 = ANY ('{2,3,4}'::integer[]))) (5 rows)
openGauss对SubLink的优化
QUERY PLAN
--------------------------------
Seq Scan on t1
Filter: (SubPlan 1)
SubPlan 1
-> Seq Scan on t2
Filter: (c1 = t1.c1)
(5 rows)
-
目前openGauss支持的Sublink-Release场景
-
QUERY PLAN -------------------------------------- Hash Join Hash Cond: (t1.c1 = t2.c2) -> Seq Scan on t1 -> Hash -> HashAggregate Group By Key: t2.c2 -> Seq Scan on t2 Filter: (c1 = 1) (8 rows)
-
Where子句中必须包含上一层查询的表中的列,子查询的其它部分不能含有上层查询的表中的列。其它限制如下。
-
子查询的where条件中必须含有来自上一层的列,而且此列必须和子查询本层涉及表中的列做相等判断,且这些条件必须用and连接。其它地方不能包含上层的列。其它限制条件如下。
-
子查询的Select关键字后,必须有且仅有一个输出列,此输出列必须是聚集函数(如max),并且聚集函数的参数(t2.c2)不能是来自外层表(t1)中的列。聚集函数不能是count。
select * from t1 where c1 >( select max(t2.c1) from t2 where t2.c1=t1.c1
select * from t1 where c1 >( select t2.c1 from t2 where t2.c1=t1.c1
select * from t1 where (c1,c2) >( select max(t2.c1),min(t2.c2) from t2 where t2.c1=t1.c1
-
select * from t1 where c1 >( select max(t2.c1) from t2 full join t3 on (t2.c2=t3.c2) where t2.c1=t1.c1
-
子查询的where条件中必须含有来自上一层的列,而且此列必须和子查询层涉及表中的列做相等判断,且这些条件必须用and连接。其它地方不能包含上层的上层中的列。例如:下列示例中的最内层子链接可以提升。
select * from t3 where t3.c1=( select t1.c1 from t1 where c1 >( select max(t2.c1) from t2 where t2.c1=t1.c1
基于上面的示例,再加一个条件,则不能提升,因为最内侧子查询引用了上层中的列。示例如下:
select * from t3 where t3.c1=( select t1.c1 from t1 where c1 >( select max(t2.c1) from t2 where t2.c1=t1.c1 and t3.c1>t2.c2
-
当WHERE过滤条件中有OR连接的EXIST相关SubLink,
select a, c from t1 where t1.a = (select avg(a) from t3 where t1.b = t3.b) or exists (select * from t4 where t1.c = t4.c);
-
提取where条件中,or子句中的opExpr。为:t1.a = (select avg(a) from t3 where t1.b = t3.b)
-
select a, c from t1 left join (select avg(a) avg, t3.b from t3 group by t3.b) as t3 on (t1.a = avg and t1.b = t3.b) where t3.b is not null or exists (select * from t4 where t1.c = t4.c);
-
select a, c from t1 left join (select avg(a) avg, t3.b from t3 group by t3.b) as t3 on (t1.a = avg and t1.b = t3.b) left join (select t4.c from t4 group by t4.c) where t3.b is not null or t4.c is not null;
-
-
-
目前openGauss不支持的Sublink-Release场景
除了以上场景之外都不支持Sublink提升,因此关联子查询会被计划成SubPlan+Broadcast的执行计划,当inner表的数据量较大时则会产生性能风险。
如果相关子查询中跟外层的两张表做join,那么无法提升该子查询,需要通过将父SQL创建成with子句,然后再跟子查询中的表做相关子查询查询。
select distinct t1.a, t2.a from t1 left join t2 on t1.a=t2.a and not exists (select a,b from test1 where test1.a=t1.a and test1.b=t2.a);
with temp as select * from (select t1.a as a, t2.a as b from t1 left join t2 on t1.a=t2.a) select distinct a,b from temp where not exists (select a,b from test1 where temp.a=test1.a and temp.b=test1.b);
-
出现在targetlist里的相关子查询无法提升(不含count)
explain (costs off) select (select c2 from t2 where t1.c1 = t2.c1) ssq, t1.c2 from t1 where t1.c2 > 10;
explain (costs off) select (select c2 from t2 where t1.c1 = t2.c1) ssq, t1.c2 from t1 where t1.c2 > 10; QUERY PLAN -------------------------------- Seq Scan on t1 Filter: (c2 > 10) SubPlan 1 -> Seq Scan on t2 Filter: (t1.c1 = c1) (5 rows)
with ssq as select t2.c2 from t2 select ssq.c2, t1.c2 from t1 left join ssq on t1.c1 = ssq.c2 where t1.c2 > 10;
QUERY PLAN --------------------------------- Hash Right Join Hash Cond: (ssq.c2 = t1.c1) CTE ssq -> Seq Scan on t2 -> CTE Scan on ssq -> Hash -> Seq Scan on t1 Filter: (c2 > 10) (8 rows)
可以看到出现在SSQ返回列表里的相关子查询SSQ,已经被提升成Right Join,从而避免当內表T2较大时出现SubPlan计划导致性能变差。
-
出现在targetlist里的相关子查询无法提升(带count)
select (select count(*) from t2 where t2.c1=t1.c1) cnt, t1.c1, t3.c1 from t1,t3 where t1.c1=t3.c1 order by cnt, t1.c1;
QUERY PLAN -------------------------------------------- Sort Key: ((SubPlan 1)), t1.c1 -> Hash Join Hash Cond: (t1.c1 = t3.c1) -> Seq Scan on t1 -> Hash -> Seq Scan on t3 SubPlan 1 -> Aggregate -> Seq Scan on t2 Filter: (c1 = t1.c1) (11 rows)
with ssq as select count(*) cnt, c1 from t2 group by c1 select case when ssq.cnt is null then 0 else ssq.cnt end cnt, t1.c1, t3.c1 from t1 left join ssq on ssq.c1 = t1.c1,t3 where t1.c1 = t3.c1 order by ssq.cnt, t1.c1;
QUERY PLAN ------------------------------------------- Sort Key: ssq.cnt, t1.c1 CTE ssq -> HashAggregate Group By Key: t2.c1 -> Seq Scan on t2 -> Hash Join Hash Cond: (t1.c1 = t3.c1) -> Hash Left Join Hash Cond: (t1.c1 = ssq.c1) -> Seq Scan on t1 -> Hash -> CTE Scan on ssq -> Hash -> Seq Scan on t3 (15 rows)
-
select t1.c1, t1.c2 from t1 where t1.c1 = (select agg() from t2.c2 > t1.c2);
对于非等值相关条件的SubLink目前无法提升,从语义上可以通过做2次join(一次CorrelationKey,一次rownum自关联)达到提升改写的目的。
-
select t1.c1, t1.c2 from t1, ( select t1.rowid, agg() aggref from t1,t2 where t1.c2 > t2.c2 group by t1.rowid ) dt /* derived table */ where t1.rowid = dt.rowid AND t1.c1 = dt.aggref;
-
WITH dt as select t1.rowid, agg() aggref from t1,t2 where t1.c2 > t2.c2 group by t1.rowid select t1.c1, t1.c2 from t1, derived_table where t1.rowid = derived_table.rowid AND t1.c1 = derived_table.aggref;
-
-
更多优化示例
**示例:**修改select语句,将子查询修改为和主表的join,或者修改为可以提升的subquery,但是在修改前后需要保证语义的正确性。
explain (costs off) select * from t1 where t1.c1 in (select t2.c1 from t2 where t1.c1 = t2.c2);
QUERY PLAN
--------------------------------
Seq Scan on t1
Filter: (SubPlan 1)
SubPlan 1
-> Seq Scan on t2
Filter: (t1.c1 = c2)
(5 rows)
上面事例计划中存在一个subPlan,为了消除这个subPlan可以修改语句为:
explain (costs off) select * from t1 where exists (select t2.c1 from t2 where t1.c1 = t2.c2 and t1.c1 = t2.c1);
QUERY PLAN
------------------------------------------
Hash Join
Hash Cond: (t1.c1 = t2.c2)
-> Seq Scan on t1
-> Hash
-> HashAggregate
Group By Key: t2.c2, t2.c1
-> Seq Scan on t2
Filter: (c2 = c1)
(8 rows)