Every successful individual knows that his or her achievement depends on a community of persons working together. — Paul Ryan
An analogy with software engineering seems the best introduction to talk about dbt packages:
Software engineers frequently modularize code into libraries. These libraries help programmers operate with leverage: they can spend more time focusing on their unique business logic, and less time implementing code that someone else has already spent the time perfecting.
dbt packages are probably one of the main benefits of dbt as it allows data code to be modularized (i.e. reused within your dbt project to keep your code DRY) and shared across the community. It is one of the tenets of software engineering that was brought to the data world. You should always keep in mind while coding to not reinvent the wheel; chances are that someone smart has already done the job or something similar than what you are trying to achieve. This doesn’t mean that there is nothing left to code but on the contrary you can jump start your work with robust packages and build on top of those. We also see packages as an excellent way to learn dbt best-practices; our best learning experience at Astrafy has been done while deep-diving in some excellent dbt packages. No courses or tutorials will replace the hands-on experience you get from getting your hands dirty with state-of-the-art code.
In this article we will go over the dbt packages we use in all our dbt projects. Those are grouped into different categories based on the use case they cover.
- dbt utils: you can’t get away in dbt without this package; It is the mother of all packages and is a prerequisite to most of the other dbt packages. This package contains generic macros that can be (re)used across dbt projects. For instance gone are the days where you had to get your head around tricky sql to get an arranged list of dates; use date_spin for this. We always have a look in the dbt-utils package before writing a new macro to check if there is already one available that can do the job.
You should start small with vanilla testing features from dbt; it is certainly more than enough for most of your use cases.
- dbt-expectations: this package is inspired from the open-source Great Expectations library. You can see it as a great extension of the dbt generic tests. It is really easy to use and fully integrated within your dbt property files. The documentation of the package is also well detailed and you should always have a look for a test there before implementing your own version.
- dbt-unit-testing: unit testing has been there forever in software development and this practice is slowly entering the data world. Unit testing makes sure that the SQL you have written leads to expected results. Those are deterministic tests that should be run only when new code is deployed. dbt-unit-testing provides you a unit testing framework to mock data and run unit tests on any of your models.
- dbt-coverage: is a single CLI tool which checks your dbt project for missing documentation and tests. This tool is really convenient to see how well your dbt project is documented and tested. You can also use it as a step in your CI pipeline and make it fail in case the coverage is below a certain threshold.
- dbt-meta-testing: this is a dbt package that does a similar job as the previous CLI package. With this package, you can integrate enforcement of documentation and testing via your “dbt_project.yml”.
- dbt-profiler: implements dbt macros for profiling database relations and creating doc blocks and table schemas (schema.yml) containing said profiles. You can see this package as a panda profiler that will automatically integrate the results of the profiling as dbt documentation within your property files.
- dbt-project-evaluator: recently released by the dbt labs team, this package checks your entire dbt project against best practices established by dbt. This package is really exhaustive in terms of checks and you will most likely not pass all the checks. We recommend starting with this package at the very beginning of your dbt project and enforce your analytics engineers to follow those best practices. They can run the package locally and correct any exceptions; this package can then run as a stage in your CI pipeline.
- audit-helper: priceless package to compare tables between them. The package comes with six macros that will allow you to go very granular in comparing columns and values between two tables.
- dbt-invoke: you should not write dbt properties files. You can have those easily generated with this package. There are other tools that can generate those files like lightdash CLI for instance but dbt-invoke focuses only on that aspect.
- dbt-erdiagram-generator: this package will translate your dbt code into dbml files that you can then use to generate ERD diagrams. This makes it really convenient for data modelling purposes. From this dbml file a ERD diagram can be created with a tool like dbdocs or dbml-renderer.
- dbt-codegen: This package can generate as well propoerties files for models but most interestingly it can generate property files for your sources. It also has some convenient macros to generate base models from your source or to refactor your models into structured models with CTEs on top.
- re_data: this package is an open-source data reliability framework for the modern data stack. Observability is key to all your db projects. You need to be proactive on the potential issues with your models and re_data gives you that layer to stay on top of your project.
- dbt-artifacts: this package builds various datamart tables and views describing the project it is installed in. It logs all the metadata about your dbt operations into tables defined around a well designed data model. You can then plug alerts and dashboarding tools on top of those tables.
- dbt-model-usage: This dbt package provides tests to let you know whether your models are still relevant to your users (this package is only compatible with BigQuery at the moment). The output it gives you is really powerful as you can immediately see if some tables are not used at all and then deprecate the associated models.
There are many more useful packages and the aforementioned list is definitely not exhaustive. It is just a list based on our working experience with dbt at Astrafy. Those packages have greatly increased our productivity and robustness of our dbt projects and we start each of our dbt projects with those.
Great thing about open-source is that you can either contribute to those packages in case you miss a feature or you can develop an entirely new package that the community will benefit from. We recently contributed to the dbt-artifacts package to make it compatible with BigQuery and this served us a lot on our projects on top of the learning experience gained.
If you enjoyed reading this article, stay tuned as we regularly publish articles. Follow Astrafy on LinkedIn to be notified for the next article ;).
If you are looking for some support on your “Modern Data Stack” implementation, feel free to reach out to us at email@example.com .
You can also check out our website at https://www.astrafy.io