不知道大家在看pg的执行计划时主要关注什么呢,cost还是执行时间?
最近碰到一个奇葩的SQL,执行计划里cost显示1.9亿,时间执行时间只有1秒多,大致如下:
EXPLAIN ANALYZE select count(*) from (select (
-> SELECT
-> string_agg(lsd4.itemname, ',') bb
-> FROM
-> (
-> SELECT
-> lsd3.itemcode,
-> lsd3.itemname
-> FROM
-> lis_sampledetail lsd3
-> WHERE
-> lsd3.sampleinfo_id = lsi.id
-> AND lsd3.sample_status != '3'
-> GROUP BY
-> lsd3.itemcode,
-> lsd3.itemname
-> ) lsd4
-> ) item_names,
-> lsi.sample_receive_time receivetime
-> FROM
-> lis_sampleinfo lsi WHERE
-> lsi.sample_status != '3'
-> ) aa
-> WHERE
-> aa.item_names IS NOT NULL
-> and aa.receivetime >= '2022-02-14 00:00:00'
-> AND aa.receivetime <= '2022-03-01 23:59:59';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=190804877.15..190804877.16 rows=1 width=8) (actual time=1601.530..1601.532 rows=1 loops=1)
-> Index Scan using idx_lis_sampleinfo_rectime_orgcode on lis_sampleinfo lsi (cost=0.43..190804824.94 rows=20883 width=0) (actual time=7.170..1601.091 rows=3745 loops=1)
Index Cond: ((sample_receive_time >= '2022-02-14 00:00:00'::timestamp without time zone) AND (sample_receive_time <= '2022-03-01 23:59:59'::timestamp without time zone))
Filter: (((sample_status)::text <> '3'::text) AND ((SubPlan 1) IS NOT NULL))
Rows Removed by Filter: 828369
SubPlan 1
-> Aggregate (cost=231.95..231.96 rows=1 width=32) (actual time=0.036..0.036 rows=1 loops=7316)
-> Group (cost=231.91..231.92 rows=2 width=39) (actual time=0.036..0.036 rows=1 loops=7316)
Group Key: lsd3.itemcode, lsd3.itemname
-> Sort (cost=231.91..231.91 rows=2 width=39) (actual time=0.035..0.035 rows=1 loops=7316)
Sort Key: lsd3.itemcode, lsd3.itemname
Sort Method: quicksort Memory: 25kB
-> Index Scan using idx_lis_sampledetail_ix01 on lis_sampledetail lsd3 (cost=0.56..231.90 rows=2 width=39) (actual time=0.033..0.033 rows=1 loops=7316)
Index Cond: ((sampleinfo_id)::text = (lsi.id)::text)
Filter: ((sample_status)::text <> '3'::text)
Rows Removed by Filter: 1
Planning Time: 1.577 ms
Execution Time: 1601.832 ms
(18 rows)
Time: 1606.280 ms (00:01.606)
可以看到“Index Scan using idx_lis_sampleinfo_rectime_orgcode on lis_sampleinfo lsi”这一步的cost竟然高达190804824.94,这张表也就几十万条数据,实际执行时间也是很快,那么这高达1.9亿的cost是怎么计算出来的呢?
不难发现,上面这个SQL用到了标量子查询。而我们都知道标量子查询类似于一个天然的嵌套循环,而且驱动表固定为主表。例如下面这个SQL:
bill=# explain analyze select id,info,(select info from t2 where t2.id=t1.id) as name
bill-# from t1 ;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Seq Scan on t1 (cost=0.00..2190194.00 rows=10000 width=69) (actual time=0.798..5411.748 rows=10000 loops=1)
SubPlan 1
-> Seq Scan on t2 (cost=0.00..219.00 rows=1 width=33) (actual time=0.270..0.540 rows=1 loops=10000)
Filter: (id = t1.id)
Rows Removed by Filter: 9999
Planning Time: 0.144 ms
Execution Time: 5412.437 ms
(7 rows)
t1和t2表均只有10000条数据,Seq Scan on t1的cost也高达2190194.00,仔细看看也不难发现,这是因为包含了标量子查询中所有循环的总cost,即219*10000,再加上t1表自身全表扫描一次的cost。
但是前面那个SQL不同的是,Index Scan using idx_lis_sampledetail_ix01 on lis_sampledetail lsd3 (cost=0.56…231.90 rows=2 width=39) (actual time=0.033…0.033 rows=1 loops=7316) 中也仅仅循环了7316次,每次的cost是231.90,这么算的话也和显示的1.9亿相差甚远。
这又是为什么呢?这就是这个标量子查询的SQL坑爹的地方:将标量子查询作为where条件。
所以问题就在于:
Filter: (((sample_status)::text <> ‘3’::text) AND ((SubPlan 1) IS NOT NULL))
Rows Removed by Filter: 828369
这一步中由于使用子查询去过滤了828369条数据,但是pg执行计划中在计算cost时还是会将这部分算上,828369乘以每次执行的cost那么就可以的出1.9亿这个夸张的cost了。
但实际上,子查询并没有实际执行这么多次,只是执行了显示的7316次,所以时间上也就没有那么夸张了。
因此我们可以得出结论:当我们使用标量子查询作为查询条件时,pg执行计划中得出的cost并不是实际的cost,它不会去过滤掉被标量子查询过滤的部分,而实际执行却是会先进行过滤。