Benchmarks

Overview

Benchmarks are exciting because they level the playing field for everyone. SDF aspires to conform to as many database benchmarks as possible. Crucially, SDF's execution semantics are built on top of Apache DataFusion, so we expect performance to closely mirror that engine. Configuration is minimal: SDF automatically handles all dependency management, including when to load which tables, what to keep in memory, and in what order to execute queries. For each benchmark, simply download the data with the provided hydrate.sh script and execute sdf run in your terminal. No further configuration is necessary.

Presto / Trino / AWS Athena

Many SQL engine products have their roots in the Presto engine developed at Facebook. The Trino dialect (Trino is a fork of Presto) is the default execution dialect of SDF. Trino is also the engine powering AWS Athena.

TPCH

The TPC-H benchmark is a standard benchmark used to evaluate the performance of various analytical database engines. It consists of a suite of business-oriented ad-hoc queries and concurrent data modifications. These queries are designed to model real-world decision support scenarios.

Overview of TPC-H Benchmark

Scale Factor: TPC-H defines different scale factors (SF) to represent the database size, ranging from SF 1 (1 GB) to SF 10000 (10 TB) and beyond.
Queries: There are 22 queries in the TPC-H benchmark, each designed to test different aspects of SQL engine performance, such as join operations, aggregations, and sorting.
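To make the scale factor concrete, here is a minimal sketch of how table sizes grow with SF. The base row counts are the nominal SF 1 cardinalities from the TPC-H specification; the function name estimated_rows is hypothetical, and whether the data shipped with the SDF workspace matches these counts exactly is an assumption.

```python
# Nominal row counts at SF 1 from the TPC-H specification.
# These six tables scale linearly with the scale factor.
BASE_ROWS = {
    "lineitem": 6_000_000,  # approximate; the generated count differs slightly
    "orders": 1_500_000,
    "partsupp": 800_000,
    "part": 200_000,
    "customer": 150_000,
    "supplier": 10_000,
}

# These two tables have a fixed size regardless of scale factor.
FIXED_ROWS = {"nation": 25, "region": 5}


def estimated_rows(scale_factor):
    """Return the nominal row count per table for a given scale factor."""
    if scale_factor <= 0:
        raise ValueError("scale factor must be positive")
    rows = {table: round(count * scale_factor) for table, count in BASE_ROWS.items()}
    rows.update(FIXED_ROWS)
    return rows
```

For example, estimated_rows(10) gives the nominal table sizes for a 10 GB data set, and estimated_rows(0.1) for a 0.1 GB one.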

To get started with SDF’s TPCH benchmark, please download the workspace from here.

Download the TPCH SDF workspace
Download the data via the supplied hydrate.sh script
Two scale factors are available: 0.1 GB and 10 GB
To execute the benchmark, run sdf run -e [full | tiny], depending on the scale factor
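The run step above can be wrapped in a small timing harness. This is only a sketch: the command sdf run -e with the environment names tiny and full comes from the steps above, while the helper names build_run_command and timed are hypothetical. It assumes sdf is on your PATH and that hydrate.sh has already been run in the workspace.

```python
import subprocess
import time

# Environment names taken from the `sdf run -e [full | tiny]` step.
ENVIRONMENTS = ("tiny", "full")


def build_run_command(environment=None):
    """Build the argument list for `sdf run`, optionally with `-e <environment>`."""
    if environment is None:
        return ["sdf", "run"]
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    return ["sdf", "run", "-e", environment]


def timed(command, cwd=None):
    """Run a command and return (exit_code, wall_clock_seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(command, cwd=cwd)
    elapsed = time.perf_counter() - start
    return completed.returncode, elapsed
```

Calling timed(build_run_command("tiny"), cwd="path/to/workspace") from inside a script would report the exit code and wall-clock time of one run. The workspace path here is a placeholder. Wall-clock time includes process startup and data loading, so it is a coarse measure rather than a per-query timing.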

IMDB - Join Order Benchmark

The IMDb SQL database benchmark, also known as the Join Order Benchmark, is designed, like TPC-H, to evaluate analytical database performance. It uses data derived from the Internet Movie Database (IMDb), which contains comprehensive information about movies, television programs, actors, production crew personnel, and related topics. It is based on a 2015 VLDB paper titled "How Good Are Query Optimizers, Really?"

Scale: The compressed data size is only ~1.2GB
Queries: There are 113 queries in total, divided into several sets.
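As an illustration of how the queries fall into sets: in the original Join Order Benchmark, each query is named by a template number followed by a variant letter (1a, 1b, ... 33c), and variants of the same template share a join structure. The sketch below groups names by template. Whether the SDF workspace keeps this naming is an assumption, and group_by_template is a hypothetical helper.

```python
import re
from collections import defaultdict

# Matches names such as "1a" or "33c": template number plus variant letter.
QUERY_NAME = re.compile(r"(\d+)([a-z])")


def group_by_template(query_names):
    """Group query names like '1a', '1b', '2a' by their template number."""
    groups = defaultdict(list)
    for name in query_names:
        match = QUERY_NAME.fullmatch(name)
        if match is None:
            raise ValueError(f"unexpected query name: {name!r}")
        groups[int(match.group(1))].append(name)
    # Return a plain dict ordered by template number.
    return dict(sorted(groups.items()))
```

Grouping this way makes it easier to compare timings across variants of the same template, which differ mainly in their filter predicates.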

To get started with SDF’s IMDB benchmark, please download the workspace from here.

Download the IMDB SDF workspace
Download the data via the supplied hydrate.sh script
To run the benchmark, execute sdf run

Clickbench

The ClickBench benchmark is designed to evaluate the performance of database systems using a dataset and queries derived from the real-world use cases of ClickHouse, a leading analytical database. This benchmark aims to measure how well different database systems handle large-scale analytical workloads. You can see a large-scale report of ClickBench results here.

Scale: The data in this ClickBench set is ~1 GB compressed
Queries: There are 43 queries in total

To get started with SDF’s Clickbench benchmark, please download the workspace from here.

Download the Clickbench SDF workspace
Download the data via the supplied hydrate.sh script
To run the benchmark, execute sdf run
