PlannerTPC-H — TPCH-Q12

Engine Compare

Accuracy chart, rows processed ?
Scan
Scan
Seek
Seek
Join Probe
Join
Sort
Sort
Hash Build
Hash
Aggregate
Agg
Distribute
Dist
ClickHouse
Estimation Error
Est Err
Row Operations
Ops
1,530,837
1.5M
Rank
Estimation Error
Est Err
Row Operations
Ops
30,837
31K
Rank
Estimation Error
Est Err
Row Operations
Ops
2
2
Rank
Estimation Error
Est Err
Row Operations
Ops
1,500,000
1.5M
Rank
Estimation Error
Est Err
Row Operations
Ops
30,837
31K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
Rank
Databricks
Estimation Error
Est Err
Row Operations
Ops
1,526,820
1.5M
Rank
Estimation Error
Est Err
Row Operations
Ops
1,495,983
1.5M
Rank
Estimation Error
Est Err
Row Operations
Ops
2
2
Rank
Estimation Error
Est Err
Row Operations
Ops
30,837
31K
Rank
Estimation Error
Est Err
Row Operations
Ops
30,839
31K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
Rank
DataFusion
Estimation Error
Est Err
Row Operations
Ops
7,501,215
7.5M
Rank
Estimation Error
Est Err
Row Operations
Ops
1,500,000
1.5M
Rank
Estimation Error
Est Err
Row Operations
Ops
4
4
Rank
Estimation Error
Est Err
Row Operations
Ops
30,837
31K
Rank
Estimation Error
Est Err
Row Operations
Ops
30,857
31K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
Rank
DuckDB
Estimation Error
Est Err
Row Operations
Ops
920,591
921K
Rank
Estimation Error
Est Err
Row Operations
Ops
34,818
35K
Rank
Estimation Error
Est Err
Row Operations
Ops
2
2
Rank
Estimation Error
Est Err
Row Operations
Ops
30,837
31K
Rank
Estimation Error
Est Err
Row Operations
Ops
30,837
31K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
Rank
PostgreSQL
Estimation Error
Est Err
Row Operations
Ops
1,530,837
1.5M
Rank
Estimation Error
Est Err
Row Operations
Ops
1,500,000
1.5M
Rank
Estimation Error
Est Err
Row Operations
Ops
30,837
31K
Rank
Estimation Error
Est Err
Row Operations
Ops
30,837
31K
Rank
Estimation Error
Est Err
Row Operations
Ops
30,843
31K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
Rank
SQL Server
Estimation Error
Est Err
Row Operations
Ops
59,799
60K
Rank
Estimation Error
Est Err
Row Operations
Ops
30,837
31K
Rank
Estimation Error
Est Err
Row Operations
Ops
2
2
Rank
Estimation Error
Est Err
Row Operations
Ops
30,837
31K
Rank
Estimation Error
Est Err
Row Operations
Ops
30,837
31K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
Rank
Trino
Estimation Error
Est Err
Row Operations
Ops
7,501,215
7.5M
Rank
Estimation Error
Est Err
Row Operations
Ops
30,837
31K
Rank
Estimation Error
Est Err
Row Operations
Ops
2
2
Rank
Estimation Error
Est Err
Row Operations
Ops
1,500,000
1.5M
Rank
Estimation Error
Est Err
Row Operations
Ops
30,839
31K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
Rank

Actual Query Plans

Query Plan per Engine ?
Query Plan
Trino
Estimate    Actual  Operator
       2         2  PROJECT l_shipmode, sum AS high_line_count, sum_4 AS low_line_count
       2         2  SORT l_shipmode
       2         2  AGGREGATE SUM(sum_5) AS sum, SUM(sum_6) AS sum_4 GROUP BY HASH l_shipmode
   85780         2  DISTRIBUTE HASH ON l_shipmode
   85780         2  AGGREGATE SUM(expr_2) AS sum_5, SUM(expr_3) AS sum_6 GROUP BY PARTIAL l_shipmode
   85780     30837  PROJECT CAST((CASE WHEN (o_orderpriority IN('1-URGENT','2-HIGH')) THEN 1 ELSE 0 END) AS bigint) AS expr_2, CAST((CASE WHEN ((o_orderpriority <> '1-URGENT') AND (o_orderpriority <> '2-HIGH')) THEN 1 ELSE 0 END) AS bigint) AS expr_3, l_shipmode
   85780     30837  INNER JOIN HASH ON o_orderkey = l_orderkey
 1500000   1500000  │└TABLE SCAN orders
   85780     30837  DISTRIBUTE HASH ON l_orderkey
 6001215     30837  FILTER (l_shipmode IN('AIR','TRUCK')) AND (l_commitdate < l_receiptdate) AND (l_shipdate < l_commitdate) AND (l_receiptdate >= '1994-01-01') AND (l_receiptdate < '1995-01-01')
 6001215   6001215  TABLE SCAN lineitem
SQL Server
Estimate    Actual  Operator
       2         2  SORT l_shipmode
       2         2  AGGREGATE SUM(Expr1008) AS Expr1006, SUM(Expr1009) AS Expr1007 GROUP BY HASH l_shipmode
  311180     30837  INNER JOIN HASH ON o_orderkey = l_orderkey
  311173     30837  │└TABLE SCAN lineitem WHERE ((l_shipmode = 'AIR' OR l_shipmode = 'TRUCK') AND l_receiptdate >= '1994-01-01' AND l_receiptdate < '1995-01-01') AND (l_shipdate < l_commitdate AND l_commitdate < l_receiptdate)
  150000     28962  PROJECT CASE WHEN o_orderpriority = '1-URGENT' OR o_orderpriority = '2-HIGH' THEN 1 ELSE 0 END AS Expr1008, CASE WHEN o_orderpriority <> '1-URGENT' AND o_orderpriority <> '2-HIGH' THEN 1 ELSE 0 END AS Expr1009
  150000     28962  TABLE SCAN orders WHERE BLOOM(o_orderkey)
PostgreSQL
Estimate    Actual  Operator
       7         2  AGGREGATE SUM(CASE WHEN ((o_orderpriority = '1-URGENT') OR (o_orderpriority = '2-HIGH')) THEN 1 ELSE 0 END), SUM(CASE WHEN ((o_orderpriority <> '1-URGENT') AND (o_orderpriority <> '2-HIGH')) THEN 1 ELSE 0 END) GROUP BY SORT l_shipmode
      21         6  AGGREGATE PARTIAL SUM(CASE WHEN ((o_orderpriority = '1-URGENT') OR (o_orderpriority = '2-HIGH')) THEN 1 ELSE 0 END), PARTIAL SUM(CASE WHEN ((o_orderpriority <> '1-URGENT') AND (o_orderpriority <> '2-HIGH')) THEN 1 ELSE 0 END) GROUP BY SORT l_shipmode
   36711     30837  SORT l_shipmode
   12237     10279  INNER JOIN HASH ON o_orderkey = l_orderkey
   36711     30837  │└TABLE SCAN lineitem AS lineitem WHERE (l_shipmode IN('AIR','TRUCK')) AND (l_commitdate < l_receiptdate) AND (l_shipdate < l_commitdate) AND (l_receiptdate >= '1994-01-01') AND (l_receiptdate < '1995-01-01')
 1875000   1500000  TABLE SCAN orders AS orders
DuckDB
Estimate    Actual  Operator
       -         2  SORT l_shipmode
       7         2  AGGREGATE SUM(#1), SUM(#2) GROUP BY HASH #0
  242051     30837  PROJECT l_shipmode, CASE WHEN ((o_orderpriority = '1-URGENT') OR (o_orderpriority = '2-HIGH')) THEN 1 ELSE 0 END, CASE WHEN ((o_orderpriority != '1-URGENT') AND (o_orderpriority != '2-HIGH')) THEN 1 ELSE 0 END
  242051     30837  INNER JOIN HASH ON o_orderkey = l_orderkey
  240048     30837  │└FILTER (l_commitdate < l_receiptdate) AND (l_shipdate < l_commitdate) AND ((l_shipmode = 'AIR') OR (l_shipmode = 'TRUCK'))
 1200243    885773   TABLE SCAN lineitem WHERE l_shipmode IN('AIR','TRUCK') AND l_commitdate < '1995-01-01' AND l_receiptdate >= '1994-01-01' AND l_receiptdate < '1995-01-01' AND l_shipdate < '1995-01-01'
 1500000     34818  TABLE SCAN orders
Databricks
Estimate    Actual  Operator
       1         2  SORT l_shipmode ASC NULLS FIRST
       1         2  AGGREGATE SUM(casewhen(in(o_orderpriority,'1-URGENT'collate UTF8_BINARY,'2-HIGH'collate UTF8_BINARY),1,0)), SUM(casewhen(( NOT in(o_orderpriority,'1-URGENT'collate UTF8_BINARY,'2-HIGH'collate UTF8_BINARY)),1,0)) GROUP BY HASH l_shipmode
       1         2  DISTRIBUTE HASH ON l_shipmode
       1         2  AGGREGATE SUM(casewhen(in(o_orderpriority,'1-URGENT'collate UTF8_BINARY,'2-HIGH'collate UTF8_BINARY),1,0)), SUM(casewhen(( NOT in(o_orderpriority,'1-URGENT'collate UTF8_BINARY,'2-HIGH'collate UTF8_BINARY)),1,0)) GROUP BY HASH l_shipmode
   51600     30837  INNER JOIN HASH ON l_orderkey = o_orderkey
   51600     30837  │└DISTRIBUTE GATHER
 1500000     30837   TABLE SCAN lineitem WHERE (l_commitdate < l_receiptdate) AND (l_shipdate < l_commitdate) AND (l_receiptdate >= DATE'1994-01-01') AND (l_receiptdate < DATE'1995-01-01') AND in(l_shipmode,'AIR'collate UTF8_BINARY,'TRUCK'collate UTF8_BINARY)
 6000000   1495983  TABLE SCAN orders
DataFusion
Estimate    Actual  Operator
 1200243         2  SORT l_shipmode
 1200243         2  SORT l_shipmode
 1200243         2  PROJECT l_shipmode, high_line_count, low_line_count
 1200243         2  AGGREGATE SUM(CASE WHEN ((o_orderpriority = '1-URGENT') OR (o_orderpriority = '2-HIGH')) THEN 1 ELSE 0 END), SUM(CASE WHEN ((o_orderpriority <> '1-URGENT') AND (o_orderpriority <> '2-HIGH')) THEN 1 ELSE 0 END) GROUP BY HASH l_shipmode
 1200243        20  DISTRIBUTE HASH ON l_shipmode
 1200243        20  AGGREGATE SUM(CASE WHEN ((o_orderpriority = '1-URGENT') OR (o_orderpriority = '2-HIGH')) THEN 1 ELSE 0 END), SUM(CASE WHEN ((o_orderpriority <> '1-URGENT') AND (o_orderpriority <> '2-HIGH')) THEN 1 ELSE 0 END) GROUP BY HASH l_shipmode
 1200243     30837  INNER JOIN HASH ON l_orderkey = o_orderkey
 1200243     30837  │└DISTRIBUTE HASH ON l_orderkey
 1200243     30837   FILTER (((((l_shipmode = 'AIR') OR (l_shipmode = 'TRUCK')) AND (l_receiptdate > l_commitdate)) AND (l_shipdate < l_commitdate)) AND (l_receiptdate >= DATE'1994-01-01')) AND (l_receiptdate < DATE'1995-01-01')
 6001215   6001215   TABLE SCAN lineitem WHERE (((((l_shipmode = 'AIR') OR (l_shipmode = 'TRUCK')) AND (l_receiptdate > l_commitdate)) AND (l_shipdate < l_commitdate)) AND (l_receiptdate >= DATE'1994-01-01')) AND (l_receiptdate < DATE'1995-01-01')
 1500000   1500000  DISTRIBUTE HASH ON o_orderkey
 1500000   1500000  TABLE SCAN orders WHERE CASE MOD(HASH_REPARTITION o_orderkey,10) WHEN 0 THEN (((o_orderkey >= 833) AND (o_orderkey <= 5999072)) AND TRUE) WHEN 1 THEN (((o_orderkey >= 1316) AND (o_orderkey <= 5999394)) AND TRUE) WHEN 2 THEN (((o_orderkey >= 3073) AND (o_orderkey <= 5999202)) AND TRUE) WHEN 3 THEN (((o_orderkey >= 1031) AND (o_orderkey <= 5994951)) AND TRUE) WHEN 4 THEN (((o_orderkey >= 1985) AND (o_orderkey <= 5999493)) AND TRUE) WHEN 5 THEN (((o_orderkey >= 66) AND (o_ord...
ClickHouse
Estimate    Actual  Operator
       -         2  PROJECT l_shipmode, a1 AS high_line_count, a2 AS low_line_count
       -         2  SORT l_shipmode
       -         2  PROJECT l_shipmode, a1, a2
       -         2  AGGREGATE countIf a13 AS a1, countIf a16 AS a2 GROUP BY HASH l_shipmode
       -     30837  PROJECT l_shipmode, (o_orderpriority = '1-URGENT') OR (o_orderpriority = '2-HIGH') AS a13, (o_orderpriority <> '1-URGENT') AND (o_orderpriority <> '2-HIGH') AS a16
       -     30837  PROJECT l_shipmode, o_orderpriority
       -     30837  INNER JOIN HASH ON PROJECTION_317.l_orderkey = PROJECTION_314.o_orderkey
       -   1500000  │└PROJECT o_orderkey, o_orderpriority
       -   1500000   PROJECT o_orderpriority, o_orderkey
       -   1500000   TABLE SCAN orders
       -     30837  PROJECT l_orderkey, l_shipmode
       -     30837  PROJECT l_orderkey, l_shipmode
       -     30837  TABLE SCAN lineitem WHERE (l_commitdate < '1995-01-01') AND (l_shipdate < '1995-01-01') AND (l_receiptdate >= '1994-01-01') AND (l_receiptdate < '1995-01-01') AND l_shipmode IN('AIR','TRUCK') AND (l_shipdate < l_commitdate) AND (l_commitdate < l_receiptdate)

Commentary

Selectivity of Filters

FilterSelectivityCardinality
l_receiptdate >= '1994-01-01' AND l_receiptdate < '1995-01-01'15%909K
l_shipmode IN ('AIR', 'TRUCK')28%1.7M
l_commitdate < l_receiptdate63%3.8M
l_shipdate < l_commitdate49%2.9M

The actual selectivity of all the filters taken together is about 0.5%, leaving roughly 31K rows.

Optimal Join Order

If the optimizer can estimate the filters on lineitem correctly, the best join order is orderslineitem.

Databricks, SQL Server, and PostgreSQL get this right. DuckDB also ends up with the right join order, but largely by luck: its estimates are off by roughly 440x.

Estimating Disjunctions

The filter l_shipmode IN ('AIR', 'TRUCK') can be treated as:

If the engine can estimate equality predicates, it can often estimate this by adding the selectivity of the two individual values together.

Strictly speaking, this does not require histograms. If the engine knows the number of distinct values in l_shipmode, it can approximate:

That breaks down when data is skewed, which is where histograms or Most Common Values become important.

Histograms and Range Estimation

Like Query 14, Query 12 contains a range predicate that is straightforward to estimate with a histogram:

Without histograms, an optimizer cannot estimate that range accurately.

Correlation Between Columns

The hard part of Query 12 is not the date range, but the comparisons between columns:

Simple histograms do not help much here. In principle, an engine could keep statistics on computed expressions such as l_receiptdate - l_commitdate, then estimate the fraction of rows where the result is greater than zero. As far as the SQL Arena engines go, none of them appears to do that.

As a result, even strong optimizers often fall back to a wild guess for predicates like these.

Correlation Between Filters

Even with a good estimate for each individual filter, the engine still has to estimate the combined selectivity.

One common approach is to assume independence and simply multiply the selectivities. That gives about 1.3%, which is close to what PostgreSQL and Databricks estimate.

Another approach is to dampen later filters using a square-root progression, effectively assuming rough correlation between predicates. That lands closer to SQL Server's estimate.

More advanced options exist, such as correlated statistics or sampling before execution, but they are more expensive and harder to apply robustly.