TPCH-Q06

Engine Compare

Rows processed per operator, per engine:

| Engine     | Scan    | Join Probe | Sort | Hash Build | Aggregate | Distribute |
|------------|--------:|-----------:|-----:|-----------:|----------:|-----------:|
| ClickHouse | 121,014 | 0          | 0    | 0          | 121,014   | 0          |
| DuckDB     | 121,011 | 0          | 0    | 0          | 121,011   | 0          |
| PostgreSQL | 120,990 | 0          | 0    | 0          | 120,995   | 0          |
| SQL Server | 121,014 | 0          | 0    | 0          | 121,014   | 0          |

Actual Query Plans

Query Plan per Engine
ClickHouse
Estimate    Actual  Operator
       -         1  PROJECT SUM(l_extendedprice * l_discount)
       -         1  GROUP BY HASH AGGREGATE SUM(l_extendedprice * l_discount)
       -    121014  PROJECT l_extendedprice * l_discount
       -    121014  TABLE SCAN lineitem WHERE (l_shipdate >= '1994-01-01') AND (l_shipdate < '1995-01-01') AND (l_discount >= 0.04) AND (l_discount <= 0.06) AND (l_quantity < 24)
DuckDB
Estimate    Actual  Operator
       -         1  GROUP BY SIMPLE AGGREGATE sum_no_overflow #0
 1199764    121011  PROJECT l_extendedprice * l_discount
 1199764    121011  TABLE SCAN lineitem WHERE l_shipdate >= '1994-01-01' AND l_shipdate < '1995-01-01' AND l_discount >= 0.040 AND l_discount <= 0.060 AND l_quantity < 24.000
PostgreSQL
Estimate    Actual  Operator
       1         1  GROUP BY SIMPLE AGGREGATE SUM(l_extendedprice * l_discount)
       5         5  GROUP BY SIMPLE AGGREGATE PARTIALSUM(l_extendedprice * l_discount)
  154135    120990  TABLE SCAN lineitem WHERE (l_shipdate >= '1994-01-01') AND (l_shipdate < '1995-01-01') AND (l_discount >= 0.04) AND (l_discount <= 0.06) AND (l_quantity < '24')
SQL Server
Estimate    Actual  Operator
       1         1  PROJECT CASE WHEN Expr1012 = 0 THEN NULL ELSE Expr1013 END AS Expr1003
       1         1  GROUP BY HASH AGGREGATE COUNT(Expr1007) AS Expr1012, SUM(Expr1007) AS Expr1013
  385065    121014  PROJECT l_extendedprice * l_discount AS Expr1007
  385065    121014  TABLE SEEK lineitem WHERE l_shipdate >= CONVERT_IMPLICIT(date,@1,0) AND l_shipdate < CONVERT_IMPLICIT(date,@2,0) AND l_discount >= @3 AND l_discount <= @4 AND l_quantity < CONVERT_IMPLICIT(decimal(15,2),@5,0)

Commentary

## Selectivity of Filters

Selectivity is the fraction of lineitem rows that pass the filter; cardinality is the absolute number of passing rows (out of 6,001,215 at scale factor 1).

| Filter | Selectivity | Cardinality |
|--------|------------:|------------:|
| l_shipdate >= '1994-01-01' AND l_shipdate < '1995-01-01' | 14% | 856,332 |
| l_discount BETWEEN 0.03 AND 0.05 | 30% | 1,799,915 |
| l_quantity < 24 | 46% | 2,817,735 |

We notice that the l_shipdate filter is by far the most selective, passing only 14% of rows, while l_quantity < 24 is the least selective.

Due to the way TPC-H data is generated, these filters are statistically independent. That means that applying all three filters reduces the row count coming out of lineitem to 14% * 30% * 46% ~ 2% - and 2% of the ~6 million lineitem rows is roughly the ~121,000 rows the plans above actually produce.
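As a sanity check on that arithmetic, here is a minimal C sketch. The 6,001,215 row count is the size of lineitem at TPC-H scale factor 1; the per-filter selectivities come from the table above (the small gap versus the actual ~121,000 rows comes from the rounded percentages):

```c
#include <stdio.h>

int main(void) {
    /* Per-filter selectivities from the table above (fraction of rows passing). */
    double shipdate_sel = 0.14;
    double discount_sel = 0.30;
    double quantity_sel = 0.46;
    double lineitem_rows = 6001215.0;  /* lineitem at TPC-H scale factor 1 */

    /* TPC-H columns are generated independently, so the combined
       selectivity is simply the product of the individual ones. */
    double combined = shipdate_sel * discount_sel * quantity_sel;

    printf("combined selectivity: %.2f%%\n", combined * 100.0);        /* ~1.93%   */
    printf("expected output rows: %.0f\n", combined * lineitem_rows);  /* ~115,943 */
    return 0;
}
```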

## Predicate Evaluation Order

We aren't allowed indexes on any of these columns in TPC-H. We are, however, allowed to partition the data on l_shipdate if we so desire. In our test there is no ordering in the insertion of records; all data is completely randomized.

We should evaluate the most selective filter first - taking the l_shipdate range before anything else - so that most rows are rejected by the first check.

Short-circuit evaluation speeds up the query by bailing out of the remaining filters as soon as one predicate fails.

Given decent statistics - particularly a histogram - a query optimiser should be able to determine that this ordering is the best one.
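To make the ordering concrete, here is a minimal row-at-a-time sketch in C. The integer encodings are assumptions for illustration, not any engine's actual storage format: dates as days since 1970-01-01, discounts as integer hundredths.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical simplified row layout (assumed for illustration only):
   dates as days since 1970-01-01, discount as integer hundredths. */
typedef struct {
    int32_t l_shipdate;   /* 8766 = 1994-01-01, 9131 = 1995-01-01 */
    int32_t l_discount;   /* 4 = 0.04 */
    int32_t l_quantity;
} Row;

/* Predicates ordered most-selective first (14%, then 30%, then 46%).
   The && operator short-circuits: once one predicate fails, the rest
   are never evaluated, so ~86% of rows stop at the shipdate check. */
static bool q6_passes(const Row *r) {
    return r->l_shipdate >= 8766 && r->l_shipdate <  9131
        && r->l_discount >= 4    && r->l_discount <= 6
        && r->l_quantity < 24;
}
```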

## Short-Circuit Evaluation vs. Branch Elimination

When you use SIMD vectorisation for execution, you have to ask an important question:
Is it worth short-circuiting filter evaluation or is it better to just evaluate all filters and combine them?

A CPU will try to "guess" which way a branch will go, and start executing instructions along that path before the comparison result is known (this is called "branch prediction" and "speculative execution"). If the CPU guesses right, the work is already done. But if it guesses wrong, it must throw away the speculative work and restart the pipeline - a penalty on the order of 15-20 cycles per misprediction, which adds up quickly when a filter over randomized data mispredicts constantly. It might have been better to remove the branches entirely - and then pay the CPU cost of doing too many integer compares.

Depending on the execution engine, branch-free SIMD evaluation may be faster than bailing out early.
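As an illustration of the branch-free alternative, here is a sketch over a columnar batch using the same assumed integer encodings as above. Every comparison is evaluated for every row and the 0/1 results are combined with bitwise AND, so there is no data-dependent branch to mispredict, and the loop is a good candidate for compiler auto-vectorization with SIMD:

```c
#include <stddef.h>
#include <stdint.h>

/* Evaluate every predicate for every row and AND the 0/1 results.
   No early exit and no data-dependent branches - the trade-off the
   paragraph above describes. */
size_t q6_filter_branchless(size_t n,
                            const int32_t *shipdate,  /* days since epoch */
                            const int32_t *discount,  /* hundredths */
                            const int32_t *quantity,
                            uint8_t *sel)             /* output selection mask */
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t m = (uint8_t)((shipdate[i] >= 8766) & (shipdate[i] < 9131)
                            & (discount[i] >= 4)    & (discount[i] <= 6)
                            & (quantity[i] < 24));
        sel[i] = m;
        hits += m;
    }
    return hits;
}
```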

## Narrow Scans and Column Stores

Here is the DDL for lineitem, with rough comments on how large each column is on disk:

CREATE TABLE tpch.lineitem
(
    l_orderkey      INT,            -- 4B
    l_partkey       INT,            -- 4B
    l_suppkey       INT,            -- 4B
    l_linenumber    INT,            -- 4B
    l_quantity      DECIMAL(15, 2), -- 8B (more on Postgres)
    l_extendedprice DECIMAL(15, 2), -- 8B
    l_discount      DECIMAL(15, 2), -- 8B
    l_tax           DECIMAL(15, 2), -- 8B
    l_returnflag    VARCHAR(1),     -- 3B
    l_linestatus    VARCHAR(1),     -- 3B
    l_shipdate      DATE,           -- 3-4B
    l_commitdate    DATE,           -- 3-4B
    l_receiptdate   DATE,           -- 3-4B
    l_shipinstruct  VARCHAR(25),    -- 27B
    l_shipmode      VARCHAR(10),    -- 12B
    l_comment       VARCHAR(44)     -- 46B
);

The size of a row in this table is roughly 150B.

But, for this query we only need four columns:

- l_quantity (8B)
- l_extendedprice (8B)
- l_discount (8B)
- l_shipdate (4B)

Total ~28B - which is only around 20% of the total row size.

Column stores are a killer optimization for this query because they let us read only ~20% of the data, even when we fully scan the table.
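Here is a minimal sketch of why the narrow scan is cheap in a columnar layout, again using assumed simplified integer encodings. Each column is its own contiguous array, so the scan touches only the four arrays Q06 needs; the other twelve columns are never read:

```c
#include <stddef.h>
#include <stdint.h>

/* Columnar layout: one contiguous array per column. Only the four
   columns referenced below are pulled through the memory hierarchy;
   the remaining ~122B per row of lineitem is never touched. */
typedef struct {
    size_t   n_rows;
    int32_t *l_shipdate;        /* days since epoch */
    int32_t *l_discount;        /* hundredths */
    int32_t *l_quantity;
    int64_t *l_extendedprice;   /* cents */
    /* ...pointers to the other 12 lineitem columns would live here,
       but this query never dereferences them. */
} LineitemColumns;

int64_t q6_revenue(const LineitemColumns *t) {
    int64_t revenue = 0;  /* in cents * hundredths; rescale on output */
    for (size_t i = 0; i < t->n_rows; i++) {
        if (t->l_shipdate[i] >= 8766 && t->l_shipdate[i] < 9131 &&
            t->l_discount[i] >= 4    && t->l_discount[i] <= 6 &&
            t->l_quantity[i] < 24)
            revenue += t->l_extendedprice[i] * t->l_discount[i];
    }
    return revenue;
}
```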