First Databricks in the Arena and more updates

Today I am putting Databricks in the Arena.

Getting there has been... difficult...

I finally found a way to get estimates and actual row counts out of Databricks. Since Databricks uses Spark for planning, the results can also be read as a proxy for what Spark can do.
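If you want to poke at the estimate side yourself, Spark SQL's EXPLAIN COST is a reasonable starting point (this is not necessarily the mechanism dbprove uses, and the lineitem table name is just a TPC-H placeholder). It prints the optimized logical plan annotated with the cost-based optimizer's statistics, which are only meaningful once table statistics have been collected:

```sql
-- Collect column-level statistics so the cost-based optimizer
-- has something to estimate with (hypothetical TPC-H table).
ANALYZE TABLE lineitem COMPUTE STATISTICS FOR ALL COLUMNS;

-- Print the optimized logical plan with estimated statistics
-- (sizeInBytes, and rowCount where available) per operator.
EXPLAIN COST
SELECT l_orderkey, sum(l_extendedprice)
FROM lineitem
GROUP BY l_orderkey;
```

Actual row counts are a different story: they come from runtime metrics rather than EXPLAIN output.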

A blog post will go out on Floe about the technical details of how to get these query plans out of Databricks.

And that's not all for today...

LLM assistance and contributing

I have been adding various context documents to dbprove to help LLMs do a better job writing drivers and improving the tool.

Codex does quite well with the repo now.

It is now easier than ever to contribute to dbprove. With the existing drivers as examples, an LLM has enough information to do a decent job of writing new ones (with a bit of adult supervision). If you want to contribute a driver, or just provide a tuned Docker image you would like to see tested: get in touch! You know where to find me.

Docker Containers for dbprove

The engines that can be run locally now have Docker containers in the <root>/docker folder of the dbprove repo.

There is a launch script that simplifies rerunning dbprove if you want to do your own testing.

General Fixes and Improvements

Postgres 18

SQL Server 2022

ClickHouse

ClickHouse remains the ugly child in the EXPLAIN plan family. The parsing code is seriously messy even after a bit of cleaning.

But the real challenge is getting actual and estimated row counts out of the plan. Once ClickHouse supports this, I will update the parser.
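For context, the closest thing ClickHouse exposes today (as far as I know) is EXPLAIN ESTIMATE, which reports table-level read estimates rather than per-operator estimated-versus-actual rows. Again, the lineitem table name is just a TPC-H placeholder:

```sql
-- EXPLAIN ESTIMATE returns, per table, the estimated number of
-- parts, rows and marks the query would read.
EXPLAIN ESTIMATE
SELECT l_orderkey, sum(l_extendedprice)
FROM lineitem
GROUP BY l_orderkey;
```

That is useful for sanity-checking partition and index pruning, but not enough to line estimates up against actuals the way the other engines allow.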

Trino

TPC-H Query Analysis

The TPC-H query analysis will first be published on the Database Doctor site and then a summary will be added here to help you interpret plans (without my grumpy ravings).