PlannerTPC-H — TPCH-Q07

Engine Compare

Accuracy chart, rows processed ?
Scan
Scan
Seek
Seek
Join Probe
Join
Sort
Sort
Hash Build
Hash
Aggregate
Agg
Distribute
Dist
ClickHouse
Estimation Error
Est Err
Row Operations
Ops
3,488,454
3.5M
Rank
Estimation Error
Est Err
Row Operations
Ops
5,507,073
5.5M
Rank
Estimation Error
Est Err
Row Operations
Ops
4
4
Rank
Estimation Error
Est Err
Row Operations
Ops
3,954,438
4M
Rank
Estimation Error
Est Err
Row Operations
Ops
5,924
5.9K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
Rank
Databricks
Estimation Error
Est Err
Row Operations
Ops
3,474,731
3.5M
Rank
Estimation Error
Est Err
Row Operations
Ops
3,456,395
3.5M
Rank
Estimation Error
Est Err
Row Operations
Ops
4
4
Rank
Estimation Error
Est Err
Row Operations
Ops
564,808
565K
Rank
Estimation Error
Est Err
Row Operations
Ops
5,932
5.9K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
Rank
DuckDB
Estimation Error
Est Err
Row Operations
Ops
2,121,009
2.1M
Rank
Estimation Error
Est Err
Row Operations
Ops
2,279,841
2.3M
Rank
Estimation Error
Est Err
Row Operations
Ops
4
4
Rank
Estimation Error
Est Err
Row Operations
Ops
316,220
316K
Rank
Estimation Error
Est Err
Row Operations
Ops
5,924
5.9K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
Rank
PostgreSQL
Estimation Error
Est Err
Row Operations
Ops
738,478
738K
Rank
Estimation Error
Est Err
Row Operations
Ops
588,476
588K
Rank
Estimation Error
Est Err
Row Operations
Ops
832,162
832K
Rank
Estimation Error
Est Err
Row Operations
Ops
5,924
5.9K
Rank
Estimation Error
Est Err
Row Operations
Ops
1,029,544
1M
Rank
Estimation Error
Est Err
Row Operations
Ops
5,924
5.9K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
Rank
SQL Server
Estimation Error
Est Err
Row Operations
Ops
145,857
146K
Rank
Estimation Error
Est Err
Row Operations
Ops
157,576
158K
Rank
Estimation Error
Est Err
Row Operations
Ops
4
4
Rank
Estimation Error
Est Err
Row Operations
Ops
267,177
267K
Rank
Estimation Error
Est Err
Row Operations
Ops
5,928
5.9K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
Rank

Actual Query Plans

Query Plan per Engine ?
Query Plan
ClickHouse
Estimate    Actual  Operator
       -         4  PROJECT supp_nation, cust_nation, l_year, a1 AS revenue
       -         4  SORT supp_nation, cust_nation, l_year
       -         4  PROJECT supp_nation, cust_nation, l_year, a1
       -         4  AGGREGATE SUM(volume) AS a1 GROUP BY HASH supp_nation, cust_nation, l_year
       -      5924  PROJECT n_name AS supp_nation, n_name AS cust_nation, EXTRACT(YEAR FROM l_shipdate) AS l_year, l_extendedprice * (1 - l_discount) AS volume
       -      5924  FILTER ((n_name_left = 'GERMANY') AND (n_name_right = 'FRANCE')) OR ((n_name_left = 'FRANCE') AND (n_name_right = 'GERMANY'))
       -     11723  PROJECT l_shipdate, l_extendedprice, l_discount, n_name, n_name
       -     11723  INNER JOIN HASH ON PROJECTION_167.s_suppkey = PROJECTION_146.l_suppkey
       -    148370  │└PROJECT l_suppkey, l_shipdate, l_extendedprice, l_discount, n_name AS n_name_right
       -    148370   PROJECT l_shipdate, l_extendedprice, l_discount, l_suppkey, n_name
       -    148370   INNER JOIN HASH ON PROJECTION_152.c_nationkey = PROJECTION_149.n_nationkey
       -         2   │└PROJECT n_nationkey, n_name
       -         2    PROJECT n_name, n_nationkey
       -         2    TABLE SCAN nation WHERE (n_name = 'FRANCE') OR (n_name = 'GERMANY')
       -   1828450   PROJECT c_nationkey, l_shipdate, l_extendedprice, l_discount, l_suppkey
       -   1828450   PROJECT l_shipdate, l_extendedprice, l_discount, l_suppkey, c_nationkey
       -   1828450   INNER JOIN HASH ON PROJECTION_158.o_custkey = PROJECTION_155.c_custkey
       -    150000   │└PROJECT c_custkey, c_nationkey
       -    150000    PROJECT c_custkey, c_nationkey
       -    150000    TABLE SCAN customer
       -   1828450   PROJECT o_custkey, l_shipdate, l_extendedprice, l_discount, l_suppkey
       -   1828450   PROJECT l_shipdate, l_extendedprice, l_discount, l_suppkey, o_custkey
       -   1828450   INNER JOIN HASH ON PROJECTION_164.l_orderkey = PROJECTION_161.o_orderkey
       -   1500000   │└PROJECT o_orderkey, o_custkey
       -   1500000    PROJECT o_orderkey, o_custkey
       -   1500000    TABLE SCAN orders
       -   1828450   PROJECT l_orderkey, l_shipdate, l_extendedprice, l_discount, l_suppkey
       -   1828450   PROJECT l_orderkey, l_shipdate, l_extendedprice, l_discount, l_suppkey
       -   1828450   TABLE SCAN lineitem WHERE (l_shipdate >= '1995-01-01') AND (l_shipdate <= '1996-12-31')
       -       798  PROJECT s_suppkey, n_name AS n_name_left
       -       798  PROJECT s_suppkey, n_name
       -       798  INNER JOIN HASH ON PROJECTION_173.s_nationkey = PROJECTION_170.n_nationkey
       -         2  │└PROJECT n_nationkey, n_name
       -         2   PROJECT n_nationkey, n_name
       -         2   TABLE SCAN nation WHERE (n_name = 'GERMANY') OR (n_name = 'FRANCE')
       -     10000  PROJECT s_nationkey, s_suppkey
       -     10000  PROJECT s_suppkey, s_nationkey
       -     10000  TABLE SCAN supplier
DuckDB
Estimate    Actual  Operator
       -         4  SORT shipping.supp_nation, shipping.cust_nation, shipping.l_year
   47237         4  AGGREGATE SUM(#3) GROUP BY HASH #0, #1, #2
   47238      5924  PROJECT supp_nation, cust_nation, l_year, volume
   47238      5924  PROJECT supp_nation, cust_nation, l_year, volume
   47238      5924  FILTER ((n_name = 'GERMANY') AND (n_name = 'FRANCE')) OR ((n_name = 'FRANCE') AND (n_name = 'GERMANY'))
   47238     11723  INNER JOIN HASH ON o_custkey = c_custkey
   28846     12008  │└INNER JOIN HASH ON c_nationkey = n_nationkey
       5         2   │└FILTER (n_name = 'FRANCE') OR (n_name = 'GERMANY')
      25        25    TABLE SCAN nation WHERE n_name = 'FRANCE' OR n_name = 'GERMANY'
  150000    149999   TABLE SCAN customer WHERE c_custkey <= 149999
  232742    145696  INNER JOIN HASH ON o_orderkey = l_orderkey
  230815    145703  │└INNER JOIN HASH ON l_suppkey = s_suppkey
    1923       798   │└INNER JOIN HASH ON s_nationkey = n_nationkey
       5         2    │└FILTER (n_name = 'GERMANY') OR (n_name = 'FRANCE')
      25        25     TABLE SCAN nation WHERE n_name = 'GERMANY' OR n_name = 'FRANCE'
   10000     10000    TABLE SCAN supplier
 1200243   1828450   TABLE SCAN lineitem WHERE l_shipdate >= '1995-01-01' AND l_shipdate <= '1996-12-31'
 1500000    132510  TABLE SCAN orders
Databricks
Estimate    Actual  Operator
  501000         4  SORT shipping.supp_nation ASC NULLS FIRST, shipping.cust_nation ASC NULLS FIRST, shipping.l_year ASC NULLS FIRST
  501000         4  AGGREGATE SUM(shipping.volume) GROUP BY HASH shipping.supp_nation, shipping.cust_nation, shipping.l_year
  501000         8  DISTRIBUTE HASH ON shipping.supp_nation, shipping.cust_nation, shipping.l_year
  501000         8  AGGREGATE SUM(shipping.volume) GROUP BY HASH shipping.supp_nation, shipping.cust_nation, shipping.l_year
  501000      5924  INNER JOIN HASH ON l_orderkey = o_orderkey
  501000    145511  │└INNER JOIN HASH ON s_suppkey = l_suppkey
  501000       798   │└DISTRIBUTE GATHER
  180000       798    INNER JOIN HASH ON s_nationkey = n1.n_nationkey
 6000000      6248    │└TABLE SCAN supplier
  180000         2    DISTRIBUTE GATHER
  150000         2    TABLE SCAN nation WHERE (n1.n_name = 'GERMANY'collate UTF8_BINARY) OR (n1.n_name = 'FRANCE'collate UTF8_BINARY)
   10000   1826022   TABLE SCAN lineitem WHERE (l_shipdate >= DATE'1995-01-01') AND (l_shipdate <= DATE'1996-12-31')
  501000    121324  DISTRIBUTE GATHER
     800    121324  INNER JOIN HASH ON c_custkey = o_custkey
     800     12008  │└DISTRIBUTE GATHER
   12000     12008   INNER JOIN HASH ON c_nationkey = n2.n_nationkey
      25    146214   │└TABLE SCAN customer
 1500000         2   TABLE SCAN nation WHERE (n1.n_name = 'GERMANY'collate UTF8_BINARY) OR (n1.n_name = 'FRANCE'collate UTF8_BINARY)
      25   1496243  TABLE SCAN orders
PostgreSQL
Estimate    Actual  Operator
    6058         4  AGGREGATE SUM(l_extendedprice * ('1' - l_discount)) GROUP BY SORT n_name, n_name, EXTRACT(year FROM l_shipdate)
    7572      5924  SORT n_name, n_name, EXTRACT(year FROM l_shipdate)
    2524      1974  INNER JOIN HASH ON o_orderkey = l_orderkey AND (((n_name = 'GERMANY') AND (n_name = 'FRANCE')) OR ((n_name = 'FRANCE') AND (n_name = 'GERMANY')))
  436716    437109  │└INNER JOIN LOOP ON l_suppkey = s_suppkey
    2400      2394   │└INNER JOIN LOOP ON n_nationkey = s_nationkey
   30000     30000    │└TABLE SEEK supplier AS supplier
      75        75    TABLE SEEK nation AS n1 WHERE (n1.n_name = 'GERMANY') OR (n1.n_name = 'FRANCE')
  438102    437120   TABLE SEEK lineitem AS lineitem WHERE (l_shipdate >= '1995-01-01') AND (l_shipdate <= '1996-12-31')
  150000    121323  INNER JOIN LOOP ON o_custkey = c_custkey
    5000      4002  │└INNER JOIN HASH ON c_nationkey = n_nationkey
       2         2   │└TABLE SCAN nation AS n2 WHERE (n2.n_name = 'FRANCE') OR (n2.n_name = 'GERMANY')
  187500    150000   TABLE SCAN customer AS customer
  216144    121280  TABLE SEEK orders AS orders
SQL Server
Estimate    Actual  Operator
       2         4  AGGREGATE SUM(partialagg1032) AS Expr1018 GROUP BY SORT n_name, n_name, Expr1016
      10         4  SORT n_name, n_name, Expr1016
      10         4  AGGREGATE SUM(Expr1019) AS partialagg1032 GROUP BY HASH n_name, n_name, Expr1016
    5383      5924  INNER JOIN HASH ON o_orderkey = l_orderkey AND (n_name as n_name = 'GERMANY' AND n_name as n_name = 'FRANCE' OR n_name as n_name = 'FRANCE' AND n_name as n_name = 'GERMANY')
  189129    121324  │└INNER JOIN HASH ON o_custkey = c_custkey
   12000     12008   │└INNER JOIN HASH ON c_nationkey = n2.n_nationkey
       2         2    │└TABLE SCAN nation AS n2 WHERE n_name as n_name = 'FRANCE' OR n_name as n_name = 'GERMANY'
   15000     12008    TABLE SCAN customer WHERE BLOOM(c_nationkey)
  150000    121324   TABLE SCAN orders WHERE BLOOM(o_custkey)
   14554     11723  INNER JOIN HASH ON l_suppkey = s_suppkey
     800       798  │└INNER JOIN HASH ON s_nationkey = n1.n_nationkey
       2         2   │└TABLE SCAN nation AS n1 WHERE n_name as n_name = 'FRANCE' OR n_name as n_name = 'GERMANY'
    1000       798   TABLE SCAN supplier WHERE BLOOM(s_nationkey)
   18190     11723  PROJECT datepart(year,l_shipdate) AS Expr1016, l_extendedprice * (1. - l_discount) AS Expr1019
   18190     11723  TABLE SCAN lineitem WHERE l_shipdate >= '1995-01-01' AND l_shipdate <= '1996-12-31' AND BLOOM(l_suppkey) AND BLOOM(l_orderkey)

Commentary

Turning OR/AND Expressions into filters

This expression is crucial to understanding how to optimise Q07:

  (n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE')
  OR 
  (n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY')

We can see that irrespective of which part of the OR we look at, it must be the case that we only want customer (via n2) that are in GERMANY or FRANCE. Similarly, we can know that we can only want supplier (via n1 in either) GERMANY or FRANCE.

We must either defer the combined check for the OR until later in the query when we recombine rows from supplier and customer or we can understand that both instances of nation can be prejoined into a 2-row results setse

In other words, we can say the following:

Pre-joining the two nation tables

We can realise that this join has an upper boundary on the number of rows it can produce (=2).

SELECT n1.n_nationkey, n1.n_name, n2.n_nationkey, n2.n_name
FROM tpch.nation AS n1
         CROSS JOIN tpch.nation AS n2
WHERE (
          (n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE')
              OR (n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY')
          );

This allows us to join n1 with n2 first, producing a tiny 2-row result set than can then be used to filter both supplier and customer.

Selectivity

FilterSelectivityCardinality
n1.n_name IN ('GERMANY', 'FRANCE) (customer)8%2
n2.n_name IN ('GERMANY', 'FRANCE) (supplier)8%2
l_shipdate BETWEEN '1995-01-01' AND '1996-12-31'28%1.7M

Join Order

We know that lineitem is the largest table, even after we apply the filter on l_shipdate. That means lineitem should be used to probe into hash tables (or loop look into B-trees) from the other tables in the query.

Inferring Filter Selectivity on orders

Our filters on nation results in reductions in customer - which in turn will result in the same reduction of orders. How can we know this?

By using primary and foreign keys as well as statistics, the query optimiser should be able to infer what effects filters on one table have on other tables in the same query.

With this inference, we now know that orders, after we apply the filter via nation (alias n1) becomes an even smaller stream.

Inferring the Filter Selectivity on lineitem

Using exactly the same line of reasoning as for orders we can infer what happens with the filter from nation (alias n2) via supplier to lineitem

By joining via nation --> supplier --> lineitem we have reduced the stream by 8%. But we've already reduced the stream by 28% via the filter on l_shipdate.

The optimiser must now make a challenging decision. What is the combined selectivity of the two filters? There are a few ways the optimiser can make an educated guess:

  1. The selectivity of filters is the product of the filter (i.e. the answer is 2% = 28% * 8%)
  2. The selectivity of the filters is the largest (i.e. the answer is 8%)
  3. The selectivity is some progression of filters with each new filter becomes less valuable than the previous one
  4. The selectivity of the combined filter can perhaps be known through composite statistic
  5. I will eventually learn this selectivity by running some queries and adapting to the conditions

Of these methods, the most common are the first 3. But some databases allow composite stats (option 4) to get a good answer for filters that typically occur together. PostgreSQL has supported these kinds of composite statistics since version 14, SQL Server has had them since around the year 1997. Many cloud databases don't have this feature yet.

Fortunately for most query engines - TPC-H uses very friendly filters they behave nicely when you combine them. All filters are completely uncorrelated, so method 1 is good enough to deliver a decent estimate.

The actual selectivity of lineitem once all filters have been applied is 2%. Since lineitem is related to orders via a primary/foreign key - the optimiser knows that reducing lineitem to 2% of the rows will result in the same reduction to orders. It can know this using similar reasoning that we've seen before.

Since a reduction to 2% is better than the 8% reduction we get via nation --> customer --> orders it is better to join lineitems with orders before joining to customer.

Best Query Plan

We now know what the best join order is for Q07 if we want to minimise the amount of join operations (which we do).