TPCH-Q06

Engine Compare

Rows processed per operator, per engine:

| Engine     | Scan    | Join Probe | Sort | Hash Build | Aggregate | Distribute |
|------------|--------:|-----------:|-----:|-----------:|----------:|-----------:|
| ClickHouse | 121,014 | 0          | 0    | 0          | 121,014   | 0          |
| DuckDB     | 121,011 | 0          | 0    | 0          | 121,011   | 0          |
| PostgreSQL | 120,990 | 0          | 0    | 0          | 120,995   | 0          |
| SQL Server | 121,014 | 0          | 0    | 0          | 121,014   | 0          |

Actual Query Plans

Query Plan per Engine
ClickHouse
Estimate    Actual  Operator
       -         1  PROJECT SUM(l_extendedprice * l_discount)
       -         1  GROUP BY HASH AGGREGATE SUM(l_extendedprice * l_discount)
       -    121014  PROJECT l_extendedprice * l_discount
       -    121014  TABLE SCAN lineitem WHERE (l_shipdate >= '1994-01-01') AND (l_shipdate < '1995-01-01') AND (l_discount >= 0.04) AND (l_discount <= 0.06) AND (l_quantity < 24)
DuckDB
Estimate    Actual  Operator
       -         1  GROUP BY SIMPLE AGGREGATE sum_no_overflow #0
 1199764    121011  PROJECT l_extendedprice * l_discount
 1199764    121011  TABLE SCAN lineitem WHERE l_shipdate >= '1994-01-01' AND l_shipdate < '1995-01-01' AND l_discount >= 0.040 AND l_discount <= 0.060 AND l_quantity < 24.000
PostgreSQL
Estimate    Actual  Operator
       1         1  GROUP BY SIMPLE AGGREGATE SUM(l_extendedprice * l_discount)
       5         5  GROUP BY SIMPLE AGGREGATE PARTIALSUM(l_extendedprice * l_discount)
  154135    120990  TABLE SCAN lineitem WHERE (l_shipdate >= '1994-01-01') AND (l_shipdate < '1995-01-01') AND (l_discount >= 0.04) AND (l_discount <= 0.06) AND (l_quantity < '24')
SQL Server
Estimate    Actual  Operator
       1         1  PROJECT CASE WHEN Expr1012 = 0 THEN NULL ELSE Expr1013 END AS Expr1003
       1         1  GROUP BY HASH AGGREGATE COUNT(Expr1007) AS Expr1012, SUM(Expr1007) AS Expr1013
  385065    121014  PROJECT l_extendedprice * l_discount AS Expr1007
  385065    121014  TABLE SEEK lineitem WHERE l_shipdate >= CONVERT_IMPLICIT(date,@1,0) AND l_shipdate < CONVERT_IMPLICIT(date,@2,0) AND l_discount >= @3 AND l_discount <= @4 AND l_quantity < CONVERT_IMPLICIT(decimal(15,2),@5,0)

Commentary

## Selectivity of Filters

Selectivity is the fraction of lineitem rows that pass the filter; cardinality is the absolute number of passing rows (out of 6,001,215 at scale factor 1).

| Filter | Selectivity | Cardinality |
|--------|------------:|------------:|
| l_shipdate >= '1994-01-01' AND l_shipdate < '1995-01-01' | 14% | 856,332 |
| l_discount BETWEEN 0.03 AND 0.05 | 30% | 1,799,915 |
| l_quantity < 24 | 46% | 2,817,735 |

We notice that the l_shipdate filter is by far the most selective, passing only 14% of rows, while l_quantity < 24 is the least selective.

Due to the way TPC-H data is generated, these filters are statistically independent. That means that applying all three filters reduces the row count coming out of lineitem to 14% * 30% * 46% ~ 2% - and 2% of the ~6 million lineitem rows is roughly the ~121,000 rows the plans above actually produce.
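As a sanity check on that arithmetic, here is a minimal C sketch. The 6,001,215 row count is the size of lineitem at TPC-H scale factor 1; the per-filter selectivities come from the table above (the small gap versus the actual ~121,000 rows comes from the rounded percentages):

```c
#include <stdio.h>

int main(void) {
    /* Per-filter selectivities from the table above (fraction of rows passing). */
    double shipdate_sel = 0.14;
    double discount_sel = 0.30;
    double quantity_sel = 0.46;
    double lineitem_rows = 6001215.0;  /* lineitem at TPC-H scale factor 1 */

    /* TPC-H columns are generated independently, so the combined
       selectivity is simply the product of the individual ones. */
    double combined = shipdate_sel * discount_sel * quantity_sel;

    printf("combined selectivity: %.2f%%\n", combined * 100.0);        /* ~1.93%   */
    printf("expected output rows: %.0f\n", combined * lineitem_rows);  /* ~115,943 */
    return 0;
}
```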

## Predicate Evaluation Order

We aren't allowed indexes on any of these columns in TPC-H. We are, however, allowed to partition the data on l_shipdate if we so desire. In our test there is no ordering in the insertion of records; all data is completely randomized.

We should evaluate the most selective filter first - taking the l_shipdate range before anything else - so that most rows are rejected by the first check.

Short-circuit evaluation speeds up the query by bailing out of the remaining filters as soon as one predicate fails.

Given decent statistics - particularly a histogram - a query optimiser should be able to determine that this ordering is the best one.
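To make the ordering concrete, here is a minimal row-at-a-time sketch in C. The integer encodings are assumptions for illustration, not any engine's actual storage format: dates as days since 1970-01-01, discounts as integer hundredths.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical simplified row layout (assumed for illustration only):
   dates as days since 1970-01-01, discount as integer hundredths. */
typedef struct {
    int32_t l_shipdate;   /* 8766 = 1994-01-01, 9131 = 1995-01-01 */
    int32_t l_discount;   /* 4 = 0.04 */
    int32_t l_quantity;
} Row;

/* Predicates ordered most-selective first (14%, then 30%, then 46%).
   The && operator short-circuits: once one predicate fails, the rest
   are never evaluated, so ~86% of rows stop at the shipdate check. */
static bool q6_passes(const Row *r) {
    return r->l_shipdate >= 8766 && r->l_shipdate <  9131
        && r->l_discount >= 4    && r->l_discount <= 6
        && r->l_quantity < 24;
}
```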

## Short-Circuit Evaluation vs. Branch Elimination

When you use SIMD vectorisation for execution, you have to ask an important question:
Is it worth short-circuiting filter evaluation or is it better to just evaluate all filters and combine them?

A CPU will try to "guess" which way a branch will go, and start executing instructions along that path before the comparison result is known (this is called "branch prediction" and "speculative execution"). If the CPU guesses right, the work is already done. But if it guesses wrong, it must throw away the speculative work and restart the pipeline - a penalty on the order of 15-20 cycles per misprediction, which adds up quickly when a filter over randomized data mispredicts constantly. It might have been better to remove the branches entirely - and then pay the CPU cost of doing too many integer compares.

Depending on the execution engine, branch-free SIMD evaluation may be faster than bailing out early.
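As an illustration of the branch-free alternative, here is a sketch over a columnar batch using the same assumed integer encodings as above. Every comparison is evaluated for every row and the 0/1 results are combined with bitwise AND, so there is no data-dependent branch to mispredict, and the loop is a good candidate for compiler auto-vectorization with SIMD:

```c
#include <stddef.h>
#include <stdint.h>

/* Evaluate every predicate for every row and AND the 0/1 results.
   No early exit and no data-dependent branches - the trade-off the
   paragraph above describes. */
size_t q6_filter_branchless(size_t n,
                            const int32_t *shipdate,  /* days since epoch */
                            const int32_t *discount,  /* hundredths */
                            const int32_t *quantity,
                            uint8_t *sel)             /* output selection mask */
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t m = (uint8_t)((shipdate[i] >= 8766) & (shipdate[i] < 9131)
                            & (discount[i] >= 4)    & (discount[i] <= 6)
                            & (quantity[i] < 24));
        sel[i] = m;
        hits += m;
    }
    return hits;
}
```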

## Narrow Scans and Column Stores

Here is the DDL for lineitem, with rough comments on how large each column is on disk:

CREATE TABLE tpch.lineitem
(
    l_orderkey      INT,            -- 4B
    l_partkey       INT,            -- 4B
    l_suppkey       INT,            -- 4B
    l_linenumber    INT,            -- 4B
    l_quantity      DECIMAL(15, 2), -- 8B (more on Postgres)
    l_extendedprice DECIMAL(15, 2), -- 8B
    l_discount      DECIMAL(15, 2), -- 8B
    l_tax           DECIMAL(15, 2), -- 8B
    l_returnflag    VARCHAR(1),     -- 3B
    l_linestatus    VARCHAR(1),     -- 3B
    l_shipdate      DATE,           -- 3-4B
    l_commitdate    DATE,           -- 3-4B
    l_receiptdate   DATE,           -- 3-4B
    l_shipinstruct  VARCHAR(25),    -- 27B
    l_shipmode      VARCHAR(10),    -- 12B
    l_comment       VARCHAR(44)     -- 46B
);

The size of a row in this table is roughly 150B.

But, for this query we only need four columns:

- l_quantity (8B)
- l_extendedprice (8B)
- l_discount (8B)
- l_shipdate (4B)

Total ~28B - which is only around 20% of the total row size.

Column stores are a killer optimization for this query because they let us read only ~20% of the data, even when we fully scan the table.
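Here is a minimal sketch of why the narrow scan is cheap in a columnar layout, again using assumed simplified integer encodings. Each column is its own contiguous array, so the scan touches only the four arrays Q06 needs; the other twelve columns are never read:

```c
#include <stddef.h>
#include <stdint.h>

/* Columnar layout: one contiguous array per column. Only the four
   columns referenced below are pulled through the memory hierarchy;
   the remaining ~122B per row of lineitem is never touched. */
typedef struct {
    size_t   n_rows;
    int32_t *l_shipdate;        /* days since epoch */
    int32_t *l_discount;        /* hundredths */
    int32_t *l_quantity;
    int64_t *l_extendedprice;   /* cents */
    /* ...pointers to the other 12 lineitem columns would live here,
       but this query never dereferences them. */
} LineitemColumns;

int64_t q6_revenue(const LineitemColumns *t) {
    int64_t revenue = 0;  /* in cents * hundredths; rescale on output */
    for (size_t i = 0; i < t->n_rows; i++) {
        if (t->l_shipdate[i] >= 8766 && t->l_shipdate[i] < 9131 &&
            t->l_discount[i] >= 4    && t->l_discount[i] <= 6 &&
            t->l_quantity[i] < 24)
            revenue += t->l_extendedprice[i] * t->l_discount[i];
    }
    return revenue;
}
```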