## Turning OR/AND Expressions into Filters
This expression is crucial to understanding how to optimise Q07:
```sql
(n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE')
OR
(n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY')
```
We can see that, irrespective of which branch of the OR we look at, we only want customers (via n2) that are in GERMANY or FRANCE. Similarly, we only want suppliers (via n1) in GERMANY or FRANCE.
We must either defer the combined check for the OR until later in the query, when rows from supplier and customer are recombined, or we can realise that the two instances of nation can be pre-joined into a 2-row result set.
In other words, we can say the following:

- Only pick `supplier` in FRANCE and GERMANY
- Only pick `customer` in FRANCE and GERMANY
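Pushed down to each table, these implied (but weaker) predicates look roughly like the sketch below. This assumes the same `tpch` schema used elsewhere in this post, and whether the optimiser pushes them down exactly like this is engine-specific:

```sql
-- Implied single-table predicates derived from the OR expression.
-- Neither one enforces the FRANCE/GERMANY pairing on its own; the full
-- OR still has to hold once supplier and customer rows meet again.
SELECT s_suppkey
FROM tpch.supplier
JOIN tpch.nation AS n1 ON n1.n_nationkey = s_nationkey
WHERE n1.n_name IN ('FRANCE', 'GERMANY');

SELECT c_custkey
FROM tpch.customer
JOIN tpch.nation AS n2 ON n2.n_nationkey = c_nationkey
WHERE n2.n_name IN ('FRANCE', 'GERMANY');
```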
### Pre-joining the two nation tables
We can realise that this join has an upper bound on the number of rows it can produce (= 2).
```sql
SELECT n1.n_nationkey, n1.n_name, n2.n_nationkey, n2.n_name
FROM tpch.nation AS n1
CROSS JOIN tpch.nation AS n2
WHERE (
    (n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE')
    OR (n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY')
);
```
This allows us to join n1 with n2 first, producing a tiny 2-row result set that can then be used to filter both supplier and customer.
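As a sketch of what "filter via the pre-join" means in SQL (hand-written, with a hypothetical `nation_pairs` CTE; a real optimiser would represent this internally rather than as a rewritten query):

```sql
-- Filter supplier down to the two relevant nations using the 2-row pair set;
-- customer is filtered the same way via cust_nationkey. The exact pairing is
-- re-checked (or carried along) when the two sides are joined back together.
WITH nation_pairs AS (
    SELECT n1.n_nationkey AS supp_nationkey, n2.n_nationkey AS cust_nationkey
    FROM tpch.nation AS n1
    CROSS JOIN tpch.nation AS n2
    WHERE (n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE')
       OR (n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY')
)
SELECT s_suppkey
FROM tpch.supplier
WHERE s_nationkey IN (SELECT supp_nationkey FROM nation_pairs);
```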
### Selectivity
| Filter | Selectivity | Cardinality |
|---|---|---|
| `n1.n_name IN ('GERMANY', 'FRANCE')` (supplier) | 8% | 2 |
| `n2.n_name IN ('GERMANY', 'FRANCE')` (customer) | 8% | 2 |
| `l_shipdate BETWEEN '1995-01-01' AND '1996-12-31'` | 28% | 1.7M |
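These numbers can be sanity-checked against the data. A rough check of the `l_shipdate` row counts, assuming a scale factor 1 `tpch` schema and an engine that supports the standard FILTER clause:

```sql
-- Count how many lineitem rows survive the shipdate filter (~28% at SF1)
SELECT
    count(*) AS total_rows,
    count(*) FILTER (WHERE l_shipdate BETWEEN DATE '1995-01-01'
                                          AND DATE '1996-12-31') AS matching_rows
FROM tpch.lineitem;
```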
### Join Order
We know that lineitem is the largest table, even after we apply the filter on l_shipdate. That means lineitem should be used to probe into hash tables (or do loop lookups into B-trees) built from the other tables in the query.
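Whether an engine actually puts lineitem on the probe side is visible in its plan output. For example (PostgreSQL-style EXPLAIN, assuming the `tpch` schema):

```sql
-- The filtered lineitem stream should probe a hash table built on orders
-- (or drive index lookups into orders), not the other way around.
EXPLAIN
SELECT count(*)
FROM tpch.lineitem
JOIN tpch.orders ON o_orderkey = l_orderkey
WHERE l_shipdate BETWEEN DATE '1995-01-01' AND DATE '1996-12-31';
```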
### Inferring Filter Selectivity on orders
Our filter on nation results in a reduction of customer, which in turn results in the same reduction of orders. How can we know this?
- There is a primary key / foreign key relationship from `customer` to `nation`
- We know that `n_nationkey` is a key, so it must have the same cardinality as `nation` (25 distinct rows, of which we pick 2)
- Since there is a foreign key from `customer` to `nation` via `c_nationkey`, we know that the values in `c_nationkey` are a subset of the values in `n_nationkey`
- If we have distinct value statistics (sorry, Iceberg people), we know there are 25 distinct values in `c_nationkey` - the same number of distinct values as in `n_nationkey`
- If we also have histograms (again, sorry Iceberg), we know that the distribution of `c_nationkey` is roughly uniform
- That means picking two nations in `n2` results in a 2/25 reduction of `customer` via `c_nationkey`, because the data is uniform
- We can use the same reasoning to infer that these two nations (via the transitive closure of primary/foreign keys) must result in the same reduction of `orders`
- Hence, the 8% filter on `n2` (= `nation`) must also be an 8% filter on `orders`
By using primary and foreign keys as well as statistics, the query optimiser should be able to infer what effects filters on one table have on other tables in the same query.
With this inference, we now know that orders, after we apply the filter via nation (alias n2), becomes a much smaller stream.
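If you would rather verify the inference empirically than trust the statistics, a rough check of the orders reduction (assuming the `tpch` schema; the FILTER clause is not supported by every engine) looks like this:

```sql
-- Fraction of orders whose customer sits in FRANCE or GERMANY; with a
-- roughly uniform c_nationkey this should land near 2/25 = 8%
SELECT
    1.0 * count(*) FILTER (WHERE n_name IN ('FRANCE', 'GERMANY')) / count(*)
        AS fraction_kept
FROM tpch.orders
JOIN tpch.customer ON c_custkey = o_custkey
JOIN tpch.nation ON n_nationkey = c_nationkey;
```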
### Inferring the Filter Selectivity on lineitem
Using exactly the same line of reasoning as for orders, we can infer what happens with the filter from nation (alias n1) via supplier to lineitem.
By joining via nation --> supplier --> lineitem we reduce the stream to 8% of its rows. But we have already reduced the stream to 28% via the filter on l_shipdate.
The optimiser must now make a challenging decision. What is the combined selectivity of the two filters? There are a few ways the optimiser can make an educated guess:
- The selectivity of the combined filter is the product of the individual selectivities (i.e. the answer is roughly 2% = 28% * 8%)
- The selectivity of the combined filter is that of the most selective filter (i.e. the answer is 8%)
- The selectivity is some progression over the filters, with each new filter counting for less than the previous one
- The selectivity of the combined filter can perhaps be known through composite statistics
- The optimiser will eventually learn this selectivity by running some queries and adapting to the observed conditions
Of these methods, the first three are the most common. But some databases support composite statistics (option 4), which give a good answer for filters that typically occur together. PostgreSQL has supported these kinds of composite statistics since version 14, and SQL Server has had them since around 1997. Many cloud databases don't have this feature yet.
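In PostgreSQL, such composite statistics are created explicitly with CREATE STATISTICS. A sketch (the column pair here is purely illustrative and not one that Q07 needs):

```sql
-- Extended statistics over two columns that tend to move together, so the
-- planner can estimate combined filters on them instead of assuming independence
CREATE STATISTICS lineitem_ship_receipt_stats (dependencies, mcv)
    ON l_shipdate, l_receiptdate
    FROM tpch.lineitem;
ANALYZE tpch.lineitem;
```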
Fortunately for most query engines, TPC-H uses very friendly filters that behave nicely when you combine them. All filters are completely uncorrelated, so method 1 is good enough to deliver a decent estimate.
The actual selectivity of lineitem once all filters have been applied is 2%. Since lineitem is related to orders via a primary/foreign key, the optimiser knows that reducing lineitem to 2% of its rows will result in the same reduction of orders. It can know this using similar reasoning to what we've seen before.
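Again, that 2% can be measured directly. A rough check assuming the `tpch` schema and FILTER clause support:

```sql
-- Observed combined selectivity on lineitem of the shipdate filter plus the
-- FRANCE/GERMANY filter arriving via supplier and nation (n1)
SELECT
    1.0 * count(*) FILTER (WHERE l_shipdate BETWEEN DATE '1995-01-01'
                                                AND DATE '1996-12-31'
                             AND n_name IN ('FRANCE', 'GERMANY')) / count(*)
        AS fraction_kept
FROM tpch.lineitem
JOIN tpch.supplier ON s_suppkey = l_suppkey
JOIN tpch.nation ON n_nationkey = s_nationkey;
```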
Since a reduction to 2% is better than the 8% reduction we get via nation --> customer --> orders, it is better to join lineitem with orders before joining to customer.
### Best Query Plan
We now know the best join order for Q07 if we want to minimise the amount of data flowing through the joins (which we do).
- Scan or seek `lineitem`, taking the filter on `l_shipdate`
- Reduce `lineitem` by joining to `supplier` and then `nation`, taking the filter on `n_name` (for a total reduction to 2%)
- Using this stream, join to `orders`, reducing `orders` to the same 2%
- Finally, join to `customer` and then `nation` to take the last 8% reduction, optionally using a pre-join of `n1` and `n2`
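Putting it together, here is a hand-written sketch of Q07 with the joins spelled out in that order (simplified compared with the official query text, which uses a derived table, and assuming the `tpch` schema; writing joins in a given order does not force the optimiser to execute them that way):

```sql
SELECT
    n1.n_name AS supp_nation,
    n2.n_name AS cust_nation,
    EXTRACT(year FROM l_shipdate) AS l_year,
    SUM(l_extendedprice * (1 - l_discount)) AS revenue
FROM tpch.lineitem
JOIN tpch.supplier      ON s_suppkey = l_suppkey        -- reduce lineitem via supplier...
JOIN tpch.nation AS n1  ON n1.n_nationkey = s_nationkey -- ...and the n_name filter (to ~2%)
JOIN tpch.orders        ON o_orderkey = l_orderkey      -- orders shrinks to the same ~2%
JOIN tpch.customer      ON c_custkey = o_custkey
JOIN tpch.nation AS n2  ON n2.n_nationkey = c_nationkey -- final 8% reduction via n2
WHERE l_shipdate BETWEEN DATE '1995-01-01' AND DATE '1996-12-31'
  AND ((n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY')
    OR (n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE'))
GROUP BY n1.n_name, n2.n_name, EXTRACT(year FROM l_shipdate)
ORDER BY supp_nation, cust_nation, l_year;
```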



