PlannerTPC-H — TPCH-Q14

Engine Compare

Accuracy chart, rows processed ?
Scan
Scan
Seek
Seek
Join Probe
Join
Sort
Sort
Hash Build
Hash
Aggregate
Agg
Distribute
Dist
ClickHouse
Estimation Error
Est Err
Row Operations
Ops
271,636
272K
Rank
Estimation Error
Est Err
Row Operations
Ops
71,636
72K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
0
Rank
Estimation Error
Est Err
Row Operations
Ops
200,000
200K
Rank
Estimation Error
Est Err
Row Operations
Ops
71,636
72K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
Rank
Databricks
Estimation Error
Est Err
Row Operations
Ops
268,713
269K
Rank
Estimation Error
Est Err
Row Operations
Ops
197,077
197K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
0
Rank
Estimation Error
Est Err
Row Operations
Ops
71,636
72K
Rank
Estimation Error
Est Err
Row Operations
Ops
71,637
72K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
Rank
DataFusion
Estimation Error
Est Err
Row Operations
Ops
6,201,215
6.2M
Rank
Estimation Error
Est Err
Row Operations
Ops
71,636
72K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
0
Rank
Estimation Error
Est Err
Row Operations
Ops
200,000
200K
Rank
Estimation Error
Est Err
Row Operations
Ops
71,646
72K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
Rank
DuckDB
Estimation Error
Est Err
Row Operations
Ops
271,636
272K
Rank
Estimation Error
Est Err
Row Operations
Ops
71,636
72K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
0
Rank
Estimation Error
Est Err
Row Operations
Ops
200,000
200K
Rank
Estimation Error
Est Err
Row Operations
Ops
71,636
72K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
Rank
PostgreSQL
Estimation Error
Est Err
Row Operations
Ops
271,636
272K
Rank
Estimation Error
Est Err
Row Operations
Ops
71,636
72K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
0
Rank
Estimation Error
Est Err
Row Operations
Ops
200,000
200K
Rank
Estimation Error
Est Err
Row Operations
Ops
23,881
24K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
Rank
SQL Server
Estimation Error
Est Err
Row Operations
Ops
131,846
132K
Rank
Estimation Error
Est Err
Row Operations
Ops
71,636
72K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
0
Rank
Estimation Error
Est Err
Row Operations
Ops
71,636
72K
Rank
Estimation Error
Est Err
Row Operations
Ops
71,636
72K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
Rank
Trino
Estimation Error
Est Err
Row Operations
Ops
6,201,215
6.2M
Rank
Estimation Error
Est Err
Row Operations
Ops
71,636
72K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
0
Rank
Estimation Error
Est Err
Row Operations
Ops
200,000
200K
Rank
Estimation Error
Est Err
Row Operations
Ops
71,637
72K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
Rank

Actual Query Plans

Query Plan per Engine ?
Query Plan
Trino
Estimate    Actual  Operator
       1         1  PROJECT expr_3 AS promo_revenue
       1         1  PROJECT (sum * 100.0) / sum_2 AS expr_3
       1         1  DISTRIBUTE ROUND ROBIN
       1         1  AGGREGATE SUM(sum_4) AS sum, SUM(sum_5) AS sum_2
   68924         1  DISTRIBUTE GATHER
   68924         1  AGGREGATE SUM(expr) AS sum_4, SUM(expr_1) AS sum_5
   68924     71636  PROJECT CASE WHEN (p_type LIKE 'PROMO%') THEN (l_extendedprice * (1.0 - l_discount)) ELSE 0.0 END AS expr, l_extendedprice * (1.0 - l_discount) AS expr_1
   68924     71636  INNER JOIN HASH ON p_partkey = l_partkey
  200000    200000  │└TABLE SCAN part
   68924     71636  DISTRIBUTE HASH ON l_partkey
 6001215     71636  FILTER (l_shipdate >= '1996-02-01') AND (l_shipdate < '1996-03-01')
 6001215   6001215  TABLE SCAN lineitem
SQL Server
Estimate    Actual  Operator
       1         1  PROJECT 100.00 * Expr1006 / Expr1007 AS Expr1008
       1         1  PROJECT CASE WHEN Expr1021 = 0 THEN NULL ELSE Expr1022 END AS Expr1006, CASE WHEN Expr1023 = 0 THEN NULL ELSE Expr1024 END AS Expr1007
       1         1  AGGREGATE COUNT(*) AS Expr1021, SUM(Expr1020) AS Expr1022, COUNT(Expr1009) AS Expr1023, SUM(Expr1009) AS Expr1024
   71586     71636  PROJECT CASE WHEN p_type LIKE 'PROMO%' THEN l_extendedprice * (1. - l_discount) ELSE 0.0000 END AS Expr1020
   71586     71636  INNER JOIN HASH ON p_partkey = l_partkey
   71586     71636  │└PROJECT l_extendedprice * (1. - l_discount) AS Expr1009
   71586     71636   TABLE SCAN lineitem WHERE l_shipdate >= '1996-02-01' AND l_shipdate < '1996-03-01'
   20000     60210  TABLE SCAN part WHERE BLOOM(p_partkey)
PostgreSQL
Estimate    Actual  Operator
       1         1  AGGREGATE (100.00 * SUM(CASE WHEN (p_type LIKE 'PROMO%') THEN (l_extendedprice * ('1' - l_discount)) ELSE '0' END)) / SUM(l_extendedprice * ('1' - l_discount))
       3         3  AGGREGATE PARTIAL SUM(CASE WHEN (p_type LIKE 'PROMO%') THEN (l_extendedprice * ('1' - l_discount)) ELSE '0' END), PARTIAL SUM(l_extendedprice * ('1' - l_discount))
   29675     23878  INNER JOIN HASH ON l_partkey = p_partkey
  249999    200000  │└TABLE SCAN part AS part
   89025     71636  TABLE SCAN lineitem AS lineitem WHERE (l_shipdate >= '1996-02-01') AND (l_shipdate < '1996-03-01')
DuckDB
Estimate    Actual  Operator
       1         1  PROJECT promo_revenue
       -         1  AGGREGATE SUM(#0), SUM(#1)
 1258538     71636  PROJECT CASE WHEN (prefix(p_type,'PROMO')) THEN (l_extendedprice * (1.00 - l_discount)) ELSE 0.0000 END, l_extendedprice * (1.00 - l_discount)
 1258538     71636  INNER JOIN HASH ON l_partkey = p_partkey
  200000    200000  │└TABLE SCAN part
 1200243     71636  TABLE SCAN lineitem WHERE l_shipdate >= '1996-02-01' AND l_shipdate < '1996-03-01'
Databricks
Estimate    Actual  Operator
       1         1  AGGREGATE SUM(casewhen(startswith(p_type,'PROMO'collate UTF8_BINARY),(l_extendedprice * (1BD - l_discount)),0.0000BD)), SUM(l_extendedprice * (1BD - l_discount))
       1         1  DISTRIBUTE GATHER
       1         1  AGGREGATE SUM(casewhen(startswith(p_type,'PROMO'collate UTF8_BINARY),(l_extendedprice * (1BD - l_discount)),0.0000BD)), SUM(l_extendedprice * (1BD - l_discount))
   71500     71636  INNER JOIN HASH ON l_partkey = p_partkey
   71500     71636  │└DISTRIBUTE GATHER
  200000     71636   TABLE SCAN lineitem WHERE (l_shipdate >= DATE'1996-02-01') AND (l_shipdate < DATE'1996-03-01')
 6000000    197077  TABLE SCAN part
DataFusion
Estimate    Actual  Operator
       1         1  PROJECT promo_revenue
       1         1  AGGREGATE SUM(CASE WHEN p_type LIKE 'PROMO%' THEN __common_expr_1 ELSE 0.0000 END), SUM(__common_expr_1)
 1200243        10  DISTRIBUTE GATHER
 1200243        10  AGGREGATE SUM(CASE WHEN p_type LIKE 'PROMO%' THEN __common_expr_1 ELSE 0.0000 END), SUM(__common_expr_1)
 1200243     71636  PROJECT __common_expr_1, p_type
 1200243     71636  INNER JOIN HASH ON p_partkey = l_partkey
  200000    200000  │└DISTRIBUTE HASH ON p_partkey
  200000    200000   TABLE SCAN part
 1200243     71636  DISTRIBUTE HASH ON l_partkey
 1200243     71636  FILTER (l_shipdate >= DATE'1996-02-01') AND (l_shipdate < DATE'1996-03-01')
 6001215   6001215  TABLE SCAN lineitem WHERE ((l_shipdate >= DATE'1996-02-01') AND (l_shipdate < DATE'1996-03-01')) AND CASE MOD(HASH_REPARTITION l_partkey,10) WHEN 0 THEN (((l_partkey >= 5) AND (l_partkey <= 200000)) AND TRUE) WHEN 1 THEN (((l_partkey >= 6) AND (l_partkey <= 199996)) AND TRUE) WHEN 2 THEN (((l_partkey >= 10) AND (l_partkey <= 199997)) AND TRUE) WHEN 3 THEN (((l_partkey >= 1) AND (l_partkey <= 199998)) AND TRUE) WHEN 4 THEN (((l_partkey >= 12) AND (l_partkey <= 199994)) AND ...
ClickHouse
Estimate    Actual  Operator
       -         1  PROJECT (100. * a4) / a5 AS promo_revenue
       -         1  AGGREGATE sumIf(a10,a12) AS a4, SUM(a10) AS a5
       -     71636  PROJECT l_extendedprice * (1 - l_discount) AS a10, startsWith(p_type,'PROMO') AS a12
       -     71636  PROJECT l_extendedprice, l_discount, p_type
       -     71636  INNER JOIN HASH ON PROJECTION_343.l_partkey = PROJECTION_340.p_partkey
       -    200000  │└PROJECT p_partkey, p_type
       -    200000   PROJECT p_type, p_partkey
       -    200000   TABLE SCAN part
       -     71636  PROJECT l_partkey, l_extendedprice, l_discount
       -     71636  PROJECT l_partkey, l_extendedprice, l_discount
       -     71636  TABLE SCAN lineitem WHERE (l_shipdate >= '1996-02-01') AND (l_shipdate < '1996-03-01')

Commentary

Filter Selectivity

FilterSelectivityCardinality
l_shipdate >= '1996-02-01' AND l_shipdate < '1996-03-01'1%71636

The part table has 200K rows.

Optimal Join Order

After filtering, lineitem is smaller than part, so the best join order is partlineitem: build the hash table on filtered lineitem and probe it from part.

To choose that plan, the optimizer has to estimate the range l_shipdate >= '1996-02-01' AND l_shipdate < '1996-03-01' correctly.

SQL Server and Databricks get this right. DataFusion, Trino, ClickHouse, and DuckDB mis-estimate the filtered cardinality and reverse the join order. DuckDB, for example, estimates about 1.2M rows from lineitem, around 15x higher than reality.

PostgreSQL estimates the row count well, but still chooses to build on part. That is surprising because part is both larger and wider than the filtered lineitem rows needed for the join.

Histograms

The key feature behind a good estimate here is a histogram on l_shipdate.

Histograms let the optimizer reason about range predicates by splitting the domain into buckets and approximating how much of each bucket overlaps the requested range. PostgreSQL also combines histogram data with Most Common Values and null fractions when producing its estimate.

Without a histogram-like structure, an engine cannot estimate this predicate accurately.

Range Estimation Limits

DuckDB uses distinct-counting sketches such as HyperLogLog, which are useful for cardinality estimation but do not help with range predicates. Knowing the number of distinct values in a column is not enough to infer how those values are distributed across a date interval.

The broader lesson from Query 14 is simple: to estimate range predicates correctly, an optimizer needs some form of histogram statistics.