Leveraging DuckDB for enhanced performance in dbt projects

dbt is a robust tool that allows data analysts and engineers to transform data within their warehouses more effectively. However, the efficiency of dbt projects often depends on the choice of the underlying data warehouse. While platforms like Google’s BigQuery offer impressive per-query speed and scalability, the time taken for sequentially executing hundreds or thousands of small queries can still add up to several minutes. An efficient solution to this challenge is DuckDB.

An Introduction to DuckDB

DuckDB is an in-memory analytical database (OLAP) designed to bring SQLite-like simplicity to the world of analytics. DuckDB can directly read data from popular analytical file formats like Parquet and CSV, eliminating the need for cumbersome data import and export processes. Since DuckDB is running locally on your machine there is no need to send data over the network or wait to allocate resources on services like BigQuery.

With a dbt adapter, it is possible to run dbt models against a DuckDB instance. This integration allows analysts and engineers to leverage the efficiency of DuckDB, cutting down on query execution time and significantly improving the development iteration time. DuckDB makes for a great local environment, ‘the default target’ one, on which you work before you deploy your models further.

No Setup and No Maintenance

One of the most appealing aspects of DuckDB is its simplicity in setup and maintenance. Unlike database systems like Postgres which require configuring users before launching or involve maintenance procedures, setting up DuckDB is as simple as defining the path to a single file. There is no server to launch or any continuous maintenance required. Once you have the file path, you can start interacting with the database immediately.

Advantages of Using DuckDB in dbt Projects

Integrating DuckDB into dbt projects brings a host of benefits. Firstly, DuckDB’s execution speed is significantly superior to most data warehouses, which leads to faster development iterations. This speed advantage can be a game-changer for projects in need of pace of development with the simultaneous low-to-medium amount of data.

Speed comparison on the jaffle shop dbt example extended with a couple of additional models (link to the repo at the end of the article).

Screenshot of execution log from dbt run using DuckDB, showing 2.6 seconds execution time

In this example, it’s more than an order of magnitude (10x) improvement in execution speed.

(The differences in model numbers in “Finished …” log stems from differences in treating incremental and ephemeral types of models across adapters, most probably. Each dbt run performed 113 executions.)

Another advantage lies in its single-file architecture, which enables each developer on the team to have their local file. This eliminates the issues commonly associated with shared development or sandbox environments, such as overwriting each other’s data. Every developer can work independently, improving the overall productivity of the team.

DuckDB CLI

To enhance the user experience further, there is a Command Line Interface (CLI) package, duckcli, that manages the connection to the database. duckcli allows you to interact with DuckDB using SQL commands, providing a convenient, command-line-based interface for performing database operations.

After installing the package (for example through pip) to launch the CLI you only need one command:

duckcli -D <path>

Where <path> is the path to your single file DuckDB instance. It will be created if it doesn’t exist, therefore you can use just the filename like db.duckdb.

duckcli goes out of its way to offer a smooth experience with unusual-style CLI autocompletion.

DuckDB limitations

For starters, it’s important to note that DuckDB didn’t have its 1.0.0 release yet. This means that a lot might change and that some of the parts of DuckDB might not work as expected.

Considering its infancy, DuckDB is not supported by some of the other tools in the dbt ecosystem, for example elementary.

By definition, DuckDB will not replace your production data warehouse. For its data processing capabilities, it uses a single machine it works on, which means that TB-sized datasets will be too much for it to handle.

Getting Started with DuckDB and dbt

For those interested in experiencing the integration of DuckDB and dbt firsthand, we have got you covered with a GitHub repository similar to the jaffle shop one with some customizations. This repository offers a dbt project setup designed around DuckDB and provides a step-by-step guide on how to utilize DuckDB with dbt.

To begin, simply follow the QuickStart instructions detailed in the repository’s README. These instructions guide you through the process of setting up DuckDB, configuring the dbt project, and running your first dbt model on a DuckDB instance. The repository offers a user-friendly introduction to using DuckDB in a dbt setting, demonstrating the advantages and efficiency of this powerful combination.

Conclusion

If you need help with long execution times for your development dbt executions, consider introducing DuckDB. It can significantly improve the performance of dbt runs. Use duckcli to access the data in the database instance.

 


 

If you would like support with setting up this or other aspects of the Data Stack at your organization, do not hesitate to reach out to sales@astrafy.io !

Leave a Comment