PlannerTPC-H — TPCH-Q04

Engine Compare

Accuracy chart, rows processed ?
Scan
Scan
Seek
Seek
Join Probe
Join
Sort
Sort
Hash Build
Hash
Aggregate
Agg
Distribute
Dist
ClickHouse
Estimation Error
Est Err
Row Operations
Ops
3,849,231
3.8M
Rank
Estimation Error
Est Err
Row Operations
Ops
3,793,296
3.8M
Rank
Estimation Error
Est Err
Row Operations
Ops
5
5
Rank
Estimation Error
Est Err
Row Operations
Ops
55,935
56K
Rank
Estimation Error
Est Err
Row Operations
Ops
51,276
51K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
Rank
Databricks
Estimation Error
Est Err
Row Operations
Ops
3,849,231
3.8M
Rank
Estimation Error
Est Err
Row Operations
Ops
3,793,296
3.8M
Rank
Estimation Error
Est Err
Row Operations
Ops
5
5
Rank
Estimation Error
Est Err
Row Operations
Ops
55,935
56K
Rank
Estimation Error
Est Err
Row Operations
Ops
51,316
51K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
Rank
DuckDB
Estimation Error
Est Err
Row Operations
Ops
6,057,030
6.1M
Rank
Estimation Error
Est Err
Row Operations
Ops
3,990,214
4M
Rank
Estimation Error
Est Err
Row Operations
Ops
5
5
Rank
Estimation Error
Est Err
Row Operations
Ops
248,261
248K
Rank
Estimation Error
Est Err
Row Operations
Ops
51,276
51K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
Rank
PostgreSQL
Estimation Error
Est Err
Row Operations
Ops
111,870
112K
Rank
Estimation Error
Est Err
Row Operations
Ops
55,935
56K
Rank
Estimation Error
Est Err
Row Operations
Ops
55,935
56K
Rank
Estimation Error
Est Err
Row Operations
Ops
51,276
51K
Rank
Estimation Error
Est Err
Row Operations
Ops
55,935
56K
Rank
Estimation Error
Est Err
Row Operations
Ops
51,291
51K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
Rank
SQL Server
Estimation Error
Est Err
Row Operations
Ops
196,985
197K
Rank
Estimation Error
Est Err
Row Operations
Ops
141,050
141K
Rank
Estimation Error
Est Err
Row Operations
Ops
5
5
Rank
Estimation Error
Est Err
Row Operations
Ops
55,935
56K
Rank
Estimation Error
Est Err
Row Operations
Ops
51,276
51K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
Rank

Actual Query Plans

Query Plan per Engine ?
Query Plan
ClickHouse
Estimate    Actual  Operator
       -         5  PROJECT o_orderpriority, a1 AS order_count
       -         5  SORT o_orderpriority
       -         5  PROJECT o_orderpriority, a1
       -         5  AGGREGATE COUNT() AS a1 GROUP BY HASH o_orderpriority
       -     51276  PROJECT o_orderpriority
       -     51276  PROJECT o_orderpriority
       -     51276  RIGHT SEMI INNER JOIN HASH ON PROJECTION_94.l_orderkey = PROJECTION_91.o_orderkey
       -     55935  │└PROJECT o_orderkey AS o_orderkey_right, o_orderpriority
       -     55935   PROJECT o_orderkey, o_orderpriority
       -     55935   TABLE SCAN orders WHERE (o_orderdate >= '1995-02-01') AND (o_orderdate < '1995-05-01')
       -   3793296  PROJECT l_orderkey
       -   3793296  TABLE SCAN lineitem WHERE (l_orderkey = l_orderkey) AND (l_commitdate < l_receiptdate)
DuckDB
Estimate    Actual  Operator
       -         5  SORT o_orderpriority
       5         5  AGGREGATE count_star() GROUP BY HASH #0
   60000     51276  PROJECT o_orderpriority
   60000         0  RIGHT SEMI INNER JOIN HASH ON o_orderkey IS NOT DISTINCT FROM o_orderkey
  271684     55935  │└AGGREGATE  GROUP BY HASH #3
   60000     51276   RIGHT SEMI INNER JOIN HASH ON o_orderkey IS NOT DISTINCT FROM o_orderkey
       -         0   │└SCAN EMPTY
  673980    141050   PROJECT o_orderkey
  673980    141050   INNER JOIN HASH ON l_orderkey = o_orderkey
  271684         0   │└SCAN MATERIALISED
 1200243   3793229   FILTER l_commitdate < l_receiptdate
 6001215   6001095   TABLE SCAN lineitem
  300000     55935  TABLE SCAN orders WHERE o_orderdate >= '1995-02-01' AND o_orderdate < '1995-05-01'
Databricks
Estimate    Actual  Operator
       1         5  SORT o_orderpriority ASC NULLS FIRST
       1         5  AGGREGATE COUNT(1) GROUP BY HASH o_orderpriority
       1        40  DISTRIBUTE HASH ON o_orderpriority
       1        40  AGGREGATE COUNT(1) GROUP BY HASH o_orderpriority
   55500     51276  LEFT SEMI JOIN HASH ON o_orderkey = l_orderkey
   55500     55935  │└DISTRIBUTE HASH ON o_orderkey
 1500000     55935   TABLE SCAN orders WHERE (o_orderdate >= DATE'1995-02-01') AND (o_orderdate < DATE'1995-05-01')
   55500   3793296  DISTRIBUTE HASH ON l_orderkey
 6000000   3793296  TABLE SCAN lineitem WHERE l_commitdate < l_receiptdate
PostgreSQL
Estimate    Actual  Operator
       5         5  AGGREGATE COUNT(*) GROUP BY SORT o_orderpriority
      15        15  AGGREGATE PARTIAL COUNT(*) GROUP BY SORT o_orderpriority
   20847     51276  SORT o_orderpriority
   20847     51276  LEFT SEMI JOIN LOOP ON l_orderkey = o_orderkey
   71532     55935  │└TABLE SCAN orders AS orders WHERE (o_orderdate >= '1995-02-01') AND (o_orderdate < '1995-05-01')
  279675     55935  TABLE SEEK lineitem AS lineitem WHERE l_commitdate < l_receiptdate
SQL Server
Estimate    Actual  Operator
       5         5  SORT o_orderpriority
       5         5  PROJECT CONVERT_IMPLICIT(int,Expr1014,0) AS Expr1007
       5         5  AGGREGATE COUNT(*) AS Expr1014 GROUP BY HASH o_orderpriority
   44574     51276  INNER JOIN HASH ON l_orderkey = o_orderkey
   54864     55935  │└TABLE SCAN orders WHERE o_orderdate >= '1995-02-01' AND o_orderdate < '1995-05-01'
  180036    141050  TABLE SCAN lineitem WHERE (BLOOM(l_orderkey)) AND (l_commitdate < l_receiptdate)

Commentary

EXIST decorrelation into SEMI JOIN

This expression shoudl be decorrelated:

EXISTS (SELECT *
          FROM tpch.lineitem
         WHERE l_orderkey = o_orderkey
           AND l_commitdate < l_receiptdate)

Optimizers will generally convert this into a SEMI JOIN.

Flipping the inner and outer side of the join (for optimisers that know how to do this) will greatly reduce the amount of memory consumed and it opens up additional opportunities for bloom filtering.