PlannerTPC-H — TPCH-Q07

Engine Compare

Accuracy chart, rows processed ?
Scan
Scan
Join Probe
Join
Sort
Sort
Hash Build
Hash
Aggregate
Agg
Distribute
Dist
ClickHouse
Estimation Error
Est Err
Row Operations
Ops
3,375,198
3.4M
Rank
Estimation Error
Est Err
Row Operations
Ops
7,916,027
7.9M
Rank
Estimation Error
Est Err
Row Operations
Ops
50
50
Rank
Estimation Error
Est Err
Row Operations
Ops
7,738,029
7.7M
Rank
Estimation Error
Est Err
Row Operations
Ops
1,537,150
1.5M
Rank
Estimation Error
Est Err
Row Operations
Ops
0
0
Rank
DuckDB
Estimation Error
Est Err
Row Operations
Ops
3,224,816
3.2M
Rank
Estimation Error
Est Err
Row Operations
Ops
3,358,906
3.4M
Rank
Estimation Error
Est Err
Row Operations
Ops
4
4
Rank
Estimation Error
Est Err
Row Operations
Ops
292,909
293K
Rank
Estimation Error
Est Err
Row Operations
Ops
5,457
5.5K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
0
Rank
PostgreSQL
Estimation Error
Est Err
Row Operations
Ops
2,009,312
2M
Rank
Estimation Error
Est Err
Row Operations
Ops
2,143,470
2.1M
Rank
Estimation Error
Est Err
Row Operations
Ops
5,455
5.5K
Rank
Estimation Error
Est Err
Row Operations
Ops
296,055
296K
Rank
Estimation Error
Est Err
Row Operations
Ops
5,455
5.5K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
0
Rank
SQL Server
Estimation Error
Est Err
Row Operations
Ops
1,670,963
1.7M
Rank
Estimation Error
Est Err
Row Operations
Ops
43,844
44K
Rank
Estimation Error
Est Err
Row Operations
Ops
4
4
Rank
Estimation Error
Est Err
Row Operations
Ops
1,671,922
1.7M
Rank
Estimation Error
Est Err
Row Operations
Ops
10,963
11K
Rank
Estimation Error
Est Err
Row Operations
Ops
0
0
Rank

Actual Query Plans

Query Plan per Engine ?
Query Plan
ClickHouse
Estimate    Actual  Operator
       -        50  PROJECT SUM(volume)
       -        50  SORT supp_nation, cust_nation, l_year
       -        50  GROUP BY HASH supp_nation, cust_nation, l_year AGGREGATE SUM(volume)
       -   1537150  PROJECT EXTRACT(YEAR FROM l_shipdate), n_name, n_name, 1 - l_discount, EXTRACT(YEAR FROM l_shipdate), l_extendedprice * (1 - l_discount), l_extendedprice * (1 - l_discount)
       -   1537150  FILTER (((n_name = 'GERMANY') AND (n_name = 'FRANCE')) OR ((n_name = 'FRANCE') AND (n_name = 'GERMANY'))) AND ((l_shipdate >= '1995-01-01') AND (l_shipdate <= '1996-12-31'))
       -   1537150  INNER JOIN HASH ON c_nationkey = n_nationkey
       -        25  │└TABLE SCAN nation
       -   1550168  INNER JOIN HASH ON s_nationkey = n_nationkey
       -        25  │└TABLE SCAN nation
       -   1550168  INNER JOIN HASH ON o_custkey = c_custkey
       -    150000  │└TABLE SCAN customer
       -   1550197  INNER JOIN HASH ON l_suppkey = s_suppkey
       -     10000  │└TABLE SCAN supplier
       -   1550346  INNER JOIN HASH ON l_orderkey = o_orderkey
       -   1500000  │└TABLE SCAN orders
       -   1715148  TABLE SCAN lineitem WHERE (l_shipdate >= '1995-01-01') AND (l_shipdate <= '1996-12-31')
DuckDB
Estimate    Actual  Operator
       -         4  SORT supp_nation, cust_nation, l_year
   55216         4  GROUP BY HASH #0, #1, #2 AGGREGATE SUM(#3)
   55217      5457  PROJECT supp_nation, cust_nation, l_year, volume
   55217      5457  PROJECT supp_nation, cust_nation, l_year, volume
   55217      5457  FILTER ((n_name = 'GERMANY') AND (n_name = 'FRANCE')) OR ((n_name = 'FRANCE') AND (n_name = 'GERMANY'))
   55217     10959  INNER JOIN HASH ON o_custkey = c_custkey
   28846     11904  │└INNER JOIN HASH ON c_nationkey = n_nationkey
      25         2   │└FILTER (n_name = 'FRANCE') OR (n_name = 'GERMANY')
      25        25    TABLE SCAN nation WHERE n_name = 'FRANCE' OR n_name = 'GERMANY'
  150000     11904   TABLE SCAN customer WHERE c_custkey >= 3
  271595    134140  INNER JOIN HASH ON o_orderkey = l_orderkey
  269191    134158  │└INNER JOIN HASH ON l_suppkey = s_suppkey
    1923       785   │└INNER JOIN HASH ON s_nationkey = n_nationkey
       5         2    │└FILTER (n_name = 'GERMANY') OR (n_name = 'FRANCE')
      25        25     TABLE SCAN nation WHERE n_name = 'GERMANY' OR n_name = 'FRANCE'
   10000       785    TABLE SCAN supplier
 1199764   1712292   FILTER l_suppkey >= 1
 1199764   1712292   TABLE SCAN lineitem WHERE l_shipdate >= '1995-01-01' AND l_shipdate <= '1996-12-31'
 1500000   1499785  TABLE SCAN orders
PostgreSQL
Estimate    Actual  Operator
    5728         4  GROUP BY SORT n_name, n_name, EXTRACT(EXTRACT(YEAR FROM  l_shipdate) AGGREGATE SUM(l_extendedprice * ('1' - l_discount))
    7160      5455  SORT n_name, n_name, EXTRACT(EXTRACT(YEAR FROM  l_shipdate)
    7160      5455  INNER JOIN HASH ON o_custkey = c_custkey AND (((n_name = 'GERMANY') AND (n_name = 'FRANCE')) OR ((n_name = 'FRANCE') AND (n_name = 'GERMANY')))
    5000     11905  │└INNER JOIN HASH ON c_nationkey = n_nationkey
       2         2   │└TABLE SCAN n2 WHERE (n_name = 'FRANCE') OR (n_name = 'GERMANY')
   62500    150000   TABLE SCAN customer
  172050    134160  INNER JOIN LOOP ON o_orderkey = l_orderkey
  172050    134160  │└INNER JOIN HASH ON l_suppkey = s_suppkey
    4000      3925   │└INNER JOIN HASH ON s_nationkey = n_nationkey
       2         2    │└TABLE SCAN n1 WHERE (n_name = 'GERMANY') OR (n_name = 'FRANCE')
   10000     10000    TABLE SCAN supplier
 2156855   1715150   TABLE SCAN lineitem WHERE (l_shipdate >= '1995-01-01') AND (l_shipdate <= '1996-12-31')
  134158    134158  TABLE SEEK orders
SQL Server
Estimate    Actual  Operator
       2         4  PROJECT CASE WHEN globalagg1030 = 0 THEN NULL ELSE globalagg1032 END AS Expr1013
       2         4  GROUP BY HASH , ,  AGGREGATE SUM(partialagg1029) AS globalagg1030, SUM(partialagg1031) AS globalagg1032
       8         4  SORT n_name, n_name, Expr1011
       8         4  INNER JOIN HASH ON n_nationkey as n_nationkey = c_nationkey AND (n_name as n_name = 'GERMANY' AND n_name as n_name = 'FRANCE' OR n_name as n_name = 'FRANCE' AND n_name as n_name = 'GERMANY')
       2         2  │└TABLE SEEK nation WHERE n_name as n_name = 'FRANCE' OR n_name as n_name = 'GERMANY'
      10         8  GROUP BY HASH AGGREGATE COUNT(Expr1014) AS partialagg1029, SUM(Expr1014) AS partialagg1031
   13597     10959  INNER JOIN HASH ON c_custkey = o_custkey
  150000    150000  │└TABLE SEEK customer
   13597     10959  INNER JOIN HASH ON n_nationkey as n_nationkey = s_nationkey
       2         2  │└TABLE SEEK nation WHERE n_name as n_name = 'FRANCE' OR n_name as n_name = 'GERMANY'
   16997     10959  INNER JOIN HASH ON o_orderkey = l_orderkey
 1500000   1500000  │└TABLE SEEK orders
   16997     10959  INNER JOIN HASH ON s_suppkey = l_suppkey
   10000     10000  │└TABLE SEEK supplier
   16998     10959  PROJECT datepart(EXTRACT(YEAR FROM l_shipdate) AS Expr1011, l_extendedprice * (1. - l_discount) AS Expr1014
   16998     10959  TABLE SEEK lineitem WHERE l_shipdate >= '1995-01-01' AND l_shipdate <= '1996-12-31' AND BLOOM(l_orderkey) AND BLOOM(l_suppkey)

Commentary

## Turning OR/AND Expressions into filters

This expression is crucial to understanding how to optimise Q07:

  (n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE')
  OR 
  (n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY')

We can see that irrespective of which part of the OR we look at, it must be the case that we only want customer (via n2) that are in GERMANY or FRANCE. Similarly, we can know that we can only want supplier (via n1 in either) GERMANY or FRANCE.

We must either defer the combined check for the OR until later in the query when we recombine rows from supplier and customer or we can understand that both instances of nation can be prejoined into a 2-row results setse

In other words, we can say the following:

Pre-joining the two nation tables

We can realise that this join has an upper boundary on the number of rows it can produce (=2).

SELECT n1.n_nationkey, n1.n_name, n2.n_nationkey, n2.n_name
FROM tpch.nation AS n1
         CROSS JOIN tpch.nation AS n2
WHERE (
          (n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE')
              OR (n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY')
          );

This allows us to join n1 with n2 first, producing a tiny 2-row result set than can then be used to filter both supplier and customer.

Selectivity

FilterSelectivityCardinality
n1.n_name IN ('GERMANY', 'FRANCE) (customer)8%2
n2.n_name IN ('GERMANY', 'FRANCE) (supplier)8%2
l_shipdate BETWEEN '1995-01-01' AND '1996-12-31'28%1.7M

Join Order

We know that lineitem is the largest table, even after we apply the filter on l_shipdate. That means lineitem should be used to probe into hash tables (or loop look into B-trees) from the other tables in the query.

Inferring Filter Selectivity on orders

Our filters on nation results in reductions in customer - which in turn will result in the same reduction of orders. How can we know this?

By using primary and foreign keys as well as statistics, the query optimiser should be able to infer what effects filters on one table have on other tables in the same query.

With this inference, we now know that orders, after we apply the filter via nation (alias n1) becomes an even smaller stream.

Inferring the Filter Selectivity on lineitem

Using exactly the same line of reasoning as for orders we can infer what happens with the filter from nation (alias n2) via supplier to lineitem

By joining via nation --> supplier --> lineitem we have reduced the stream by 8%. But we've already reduced the stream by 28% via the filter on l_shipdate.

The optimiser must now make a challenging decision. What is the combined selectivity of the two filters? There are a few ways the optimiser can make an educated guess:

  1. The selectivity of filters is the product of the filter (i.e. the answer is 2% = 28% * 8%)
  2. The selectivity of the filters is the largest (i.e. the answer is 8%)
  3. The selectivity is some progression of filters with each new filter becomes less valuable than the previous one
  4. The selectivity of the combined filter can perhaps be known through composite statistic
  5. I will eventually learn this selectivity by running some queries and adapting to the conditions

Of these methods, the most common are the first 3. But some databases allow composite stats (option 4) to get a good answer for filters that typically occur together. PostgreSQL has supported these kinds of composite statistics since version 14, SQL Server has had them since around the year 1997. Many cloud databases don't have this feature yet.

Fortunately for most query engines - TPC-H uses very friendly filters they behave nicely when you combine them. All filters are completely uncorrelated, so method 1 is good enough to deliver a decent estimate.

The actual selectivity of lineitem once all filters have been applied is 2%. Since lineitem is related to orders via a primary/foreign key - the optimiser knows that reducing lineitem to 2% of the rows will result in the same reduction to orders. It can know this using similar reasoning that we've seen before.

Since a reduction to 2% is better than the 8% reduction we get via nation --> customer --> orders it is better to join lineitems with orders before joining to customer.

Best Query Plan

We now know what the best join order is for Q07 if we want to minimise the amount of join operations (which we do).