例如,下列示例可以提升。
select * from t1 where c1 >(
select max(t2.c1) from t2 where t2.c1=t1.c1
select * from t1 where (c1,c2) >(
select max(t2.c1),min(t2.c2) from t2 where t2.c1=t1.c1
select * from t1 where c1 >(
select max(t2.c1) from t2 full join t3 on (t2.c2=t3.c2) where t2.c1=t1.c1
select t1.c1
from t1 where c1 >(
select max(t2.c1) from t2 where t2.c1=t1.c1
select t1.c1
from t1 where c1 >(
select max(t2.c1) from t2 where t2.c1=t1.c1 and t3.c1>t2.c2
select a, c from t1
where t1.a = (select avg(a) from t3 where t1.b = t3.b) or
exists (select * from t4 where t1.c = t4.c);
将OR连接的EXIST相关子查询OR子句的提升过程:
提取where条件中,or子句中的opExpr。为:t1.a = (select avg(a) from t3 where t1.b = t3.b)
这个op操作中包含subquery,判断是否可以提升,如果可以提升,重写subquery为:select avg(a), t3.b from t3 group by t3.b,生成not null条件t3.b is not null,并将这个opexpr用这个not null条件替换。此时SQL变为:
select a, c
from t1 left join (select avg(a) avg, t3.b from t3 group by t3.b) as t3 on (t1.a = avg and t1.b = t3.b)
where t3.b is not null or exists (select * from t4 where t1.c = t4.c);
select t1.a, t1.c from t1 left join (select avg(a) avg, t3.b from t3 group by t3.b) as t3 on (t1.a = avg and t1.b = t3.b) left join (select t5.c from t5 group by t5.c) as t5 on (t1.c = t5.c) where t3.b is not null or t5.c is not null;
目前GaussDB不支持的Sublink-Release场景
除了以上场景之外都不支持Sublink提升,因此关联子查询会被计划成SubPlan+Broadcast的执行计划,当inner表的数据量较大时则会产生性能风险。
如果相关子查询中跟外层的两张表做join,那么无法提升该子查询,需要通过将父SQL创建成with子句,然后再跟子查询中的表做相关子查询查询。
with temp as
select * from (select t1.a as a, t2.a as b from t1 left join t2 on t1.a=t2.a)
select distinct a,b
from temp
where not exists (select a,b from test1 where temp.a=test1.a and temp.b=test1.b);
gaussdb=# explain (costs off)
select (select c2 from t2 where t1.c1 = t2.c1) ssq, t1.c2
from t1
where t1.c2 > 10;
gaussdb=# explain (costs off)
select (select c2 from t2 where t1.c1 = t2.c1) ssq, t1.c2
from t1
where t1.c2 > 10;
QUERY PLAN
------------------------------------------------------
Streaming (type: GATHER)
Node/s: All datanodes
-> Seq Scan on t1
Filter: (c2 > 10)
SubPlan 1
-> Result
Filter: (t1.c1 = t2.c1)
-> Materialize
-> Streaming(type: BROADCAST)
Spawn on: All datanodes
-> Seq Scan on t2
(11 rows)
---------------------------------------
Streaming (type: GATHER)
Node/s: All datanodes
-> Hash Right Join
Hash Cond: (t2.c1 = t1.c1)
-> Seq Scan on t2
-> Hash
-> Seq Scan on t1
Filter: (c2 > 10)
(8 rows)
可以看到出现在SSQ返回列表里的相关子查询SSQ,已经被提升成Right Join,从而避免当內表t2较大时出现SubPlan+Broadcast计划导致性能变差。
出现在targetlist里的相关子查询无法提升(带count)
select (select count(*) from t2 where t2.c1=t1.c1) cnt, t1.c1, t3.c1
from t1,t3
where t1.c1=t3.c1 order by cnt, t1.c1;
QUERY PLAN
------------------------------------------------------------------
Streaming (type: GATHER)
Node/s: All datanodes
-> Sort
Sort Key: ((SubPlan 1)), t1.c1
-> Hash Join
Hash Cond: (t1.c1 = t3.c1)
-> Seq Scan on t1
-> Hash
-> Seq Scan on t3
SubPlan 1
-> Aggregate
-> Result
Filter: (t2.c1 = t1.c1)
-> Materialize
-> Streaming(type: BROADCAST)
Spawn on: All datanodes
-> Seq Scan on t2
(17 rows)
由于相关子查询出现在targetlist(查询返回列表)里,对于t1.c1=t2.c1不匹配的场景仍然需要输出值,因此使用left-outerjoin关联T1&T2确保t1.c1=t2.c1在不匹配时子SSQ能够返回不匹配的补空值,但是这里带了count语句及时在t1.c1=t2.t1不匹配时需要输出0,因此可以使用一个case-when NULL then 0 else count(*)来代替。
上述SQL语句可以改写为:
else ssq.cnt
end cnt, t1.c1, t3.c1
from t1 left join ssq on ssq.c1 = t1.c1,t3
where t1.c1 = t3.c1
order by ssq.cnt, t1.c1;
QUERY PLAN
-----------------------------------------------------
Streaming (type: GATHER)
Node/s: All datanodes
-> Sort
Sort Key: (count(*)), t1.c1
-> Hash Join
Hash Cond: (t1.c1 = t3.c1)
-> Hash Left Join
Hash Cond: (t1.c1 = t2.c1)
-> Seq Scan on t1
-> Hash
-> HashAggregate
Group By Key: t2.c1
-> Seq Scan on t2
-> Hash
-> Seq Scan on t3
(15 rows)
select t1.rowid, agg() aggref
from t1,t2
where t1.c2 > t2.c2 group by t1.rowid
) dt /* derived table */
where t1.rowid = dt.rowid AND t1.c1 = dt.aggref;
select t1.rowid, agg() aggref
from t1,t2
where t1.c2 > t2.c2 group by t1.rowid
select t1.c1, t1.c2
from t1, dt
where t1.rowid = dt.rowid AND
t1.c1 = dt.aggref;
目前GaussDB尚无高效的实现表、中间结果集的全局唯一rowid,因此目前此类场景很难改写,建议通过业务层进行规避,或者可以使用t1.xc_nodeid + t1.ctid进行rowid关联,但是xc_nodeid的重复率较高会导致join关联效率变低,而xc_node_id+ctid类型无法作为hashjoin的关联条件。
对于AGG类型为count(*)时需要进行CASE-WHEN对没有match的场景补0处理,非COUNT(*)场景NULL处理。
CTE改写方式如果有sharescan支持性能上能够更优。
gaussdb=# explain (costs off)select * from master_table as t1 where t1.a in (select t2.a from sub_table as t2 where t1.a = t2.b);
QUERY PLAN
----------------------------------------------------------
Streaming (type: GATHER)
Node/s: All datanodes
-> Seq Scan on master_table t1
Filter: (SubPlan 1)
SubPlan 1
-> Result
Filter: (t1.a = t2.b)
-> Materialize
-> Streaming(type: BROADCAST)
Spawn on: All datanodes
-> Seq Scan on sub_table t2
(11 rows)
gaussdb=# explain(costs off) select * from master_table as t1 where exists (select t2.a from sub_table as t2 where t1.a = t2.b and t1.a = t2.a);
QUERY PLAN
--------------------------------------------------
Streaming (type: GATHER)
Node/s: All datanodes
-> Hash Semi Join
Hash Cond: (t1.a = t2.b)
-> Seq Scan on master_table t1
-> Hash
-> Streaming(type: REDISTRIBUTE)
Spawn on: All datanodes
-> Seq Scan on sub_table t2
(9 rows)