dbt at scale on Google Cloud — Part 1

Innovation is the ability to see change as an opportunity — not a threat.

In this series of articles split in three parts, we will cover an end-to-end data engineering architecture on Google Cloud with dbt as the backbone. Part 1 will go over the following points:

Overall architecture
Brief word on ingestion
dbt project
Cloud Composer as Orchestrator
Distribution of data

We take a holistic approach in this first article and cover the entire stack that comes before and after dbt. In order to understand a system in its entirety, one needs to have the full picture. Part two and three will go in more details on certain aspects of the stack linked directly to dbt.

Overall architecture

The following diagram describes the architecture of a modern data engineering stack using open source technologies and Google Cloud Big Data products.

An infographic outlining a data engineering pipeline on Google Cloud, where data moves from sources like PostgreSQL and MongoDB through tools like Airbyte and dbt for transformation in BigQuery, to be finally distributed by applications like Looker and Vertex AI, orchestrated by Apache Airflow.

From left to right:

At the very left lies the transactional databases and applications. This is the genesis of all your data pipeline and any corrupted data at this stage will propagate downstream in your pipeline (“garbage in, garbage out”).
In order to move this data towards the analytical database (BigQuery), an ingestion tool comes into play. The last few years have seen the emergence of outstanding open-source and enterprise applications to handle seamlessly this move of data from OLTP [1] databases and applications towards OLAP databases. At Astrafy, we are open-source lovers and therefore have a tendency to recommend Airbyte or Meltano hosted on Google Cloud Kubernetes. Those ingestion tools have BigQuery as a sink and dump the data directly into BigQuery tables (data is an “AS IS” copy from the transactional databases / applications and arrives in BigQuery in a layer commonly called “Landing Zone”).
Once the data is in BigQuery, the extract & load part is done and the transformation part picks it up. We cannot find a better definition of dbt than the one from the official website (https://www.getdbt.com/): dbt is a development framework that combines modular SQL with software engineering best practices to make data transformation reliable, fast, and fun. The game changer here is “software engineering best practices”; data has for too long lacked robust and consistent practices that have been in place for decades in the software engineering world. dbt helps analytics engineers [2] define SQL that is DRY and robust through the use of Jinja syntax. It also gives you out-of-the-box automatic lineage and documentation about your models. Main goal of dbt is to move the data from the landing zone to the datamart area where the data will be served to various downstream applications.
During this transformation phase, testing is introduced with two flavours:

Data quality/integrity is checked on the landing zone data dumped from the different source systems. At Astrafy, we recommend using “Great Expectations’’ which is an open-source library that is becoming the open-source reference for data quality checks.
Unit testing is carried out on a sample of data isolated via dbt seeds. Those unit testing verify that the different SQL transformations yield the expected results on those samples of data.

At the far right lies the consumers of those curated data. Those consumers take mainly three forms: Dashboards, ML models and API products.

Every arrow is a potential point of failure (POF) and it is therefore essential to have the least of those and to use a dedicated framework for each of those operations that move data from one layer to another.

Ingestion

Chances are there is an existing connector on either Airbyte or Meltano for your data source so no need to reinvent the wheel with a custom connector. It will cost you significant time and effort to build something from scratch and you will certainly miss features such as “Change Data Capture”, monitoring and auto-scaling that come out of the box with those ingestion framework.

It is also of major importance to stick to the EL paradigm of extracting the data and loading the data “AS IS” in BigQuery; Airbyte and Meltano do exactly that. Transforming your data before it reaches BigQuery is a bad practice for the following main reasons:

If there is a data quality issue in the landing zone, you cannot be sure that the error comes from the source or from the transformation that has been applied in between.
It adds complexity to your ingestion tool. Each tool must do one thing and excel at it; BigQuery excels at transforming data and it should have the sole ownership of this operation.

One size does not fit all. Depending on the experience of your team, you might want to go with an enterprise-grade solution like Fivetran. In case your team has some experience with Kubernetes, then it is a no-brainer to self-host those ingestion tools on GKE using their official helm charts. You can check here a detailed article on how to deploy Airbyte on a private GKE cluster with an ingress and Identity-aware proxy set up. This gives you full protection on your deployment while having the flexibility to give access easily to your data team.

An architectural diagram showing the flow from the Airbyte UI Pod, through an Ingress, then passing through an Internal HTTPS Load Balancer, before going through an Identity-Aware Proxy, and finally reaching the end users.

dbt project

There are quite some good opinions on how to structure your dbt project and we recommend reading the following two discourse posts that give you best practices on how to structure your project and pros and cons of using one repository versus multiple repository.

https://discourse.getdbt.com/t/how-we-structure-our-dbt-projects/355

https://discourse.getdbt.com/t/how-to-configure-your-dbt-repository-one-or-many/2121

With our experience, we strongly recommend to start with one repository and to follow the structure:

A hierarchical directory structure for a dbt (Data Build Tool) project showcasing the arrangement of YML configuration files, macros, and models across different domains such as Google Analytics, Finance, and Marketing, along with shared resources and documentation.

This structure fits well with the data mesh paradigm that treats data related to a specific domain as an independent data product. In the structure above each domain (google_analytics, finance and marketing) is considered as a data product and is a self-contained dbt project. Analytics engineers can iterate quickly on their data products without losing the interoperability factor as each dbt project is a package that can be imported via the ‘packages.yml’. This means that a model within the finance folder can reference a model from marketing with a simple {{ ref(‘marketing’, ‘staging_emails’) }}.

This structure has two dbt projects not related to any data domains:

shared: this project will contain all the macros and seeds that are not common to a specific flow and are shared to all other dbt projects. All data domain dbt projects should include this “shared” dbt project in their “packages.yml” in order to use those shared macros and reference those mapping tables.
documentation: this project is used to have a global documentation for all the different dbt projects. One single pane of glass to access documentation and lineage for all the data domains. More on dbt documentation will be said in part 3 of this article.

This structure also makes the project very scalable as dbt compiles and runs are isolated from each other. In part 2 of this article, we will see how to go one step further and version those packages in order to have a neat deployment process of new code into production.

Hereafter is representation of a data mesh (taken from Martin Fowler website) and as one can notice, the dbt structure above abides to those principles. Infrastructure is shared but data domains are independent of each other while remaining interoperable. Also one can easily add a new domain without interfering with other data domains. Global governance will be covered in part 2 of this article.

A diagram of data mesh architecture showcasing domain-specific data management by cross-functional teams and data infrastructure as a platform, emphasizing interoperability through global governance and open standards.

We will now cover the “profiles.yml” file that we recommend to store at the root of your dbt project. In order for dbt to find this file, it will look in the environment variable “DBT_PROFILES_DIR” and you can set it up correctly by following this documentation. The file “profiles.yml” tells dbt how to connect to your database as well as some static configurations. At Astrafy, we recommend to create the following profiles:

One profile for your sandbox environment. This profile should have the parameter “impersonate_service_account” and your analytics engineer must have the role to impersonate this service account. This allows you to centralise access via a service account on your sandbox projects. This profile looks like this:

A configuration snippet for BigQuery in YAML format, specifying connection settings such as OAuth method, execution project, thread count, timeout, data location in the EU, query priority, number of retries, and service account impersonation.

For project and schema, you will note that we haven’t defined those as we will cover below how to dynamically generate those in order to leverage multiple Google projects and datasets. A last important parameter is the “execution_project” that we set as our Google Cloud Compute project where we have our Cloud Composer hosted (more on this set up in the infrastructure section in part 2 of this article). This centralises all BigQuery costs on only one project and it makes monitoring and budget outburst easier to detect.

We then recommend having a profile for UAT and a profile for PRD. Those are similar to the sandbox profile except that the “impersonate_service_account” is removed as dbt will use the authentication from the service account of the GKE node pool running the dbt operations.

As mentioned above, we recommend having multiple projects and datasets and not to use a static one in the “profiles.yaml” file. This has many benefits such as decoupling data layers (landing zone, staging area, data warehouse area and datamarts) to allow granular IAM permissions but also to have your models materialised into their own datasets in order to have a clear separation of your data product. dbt provides you with three built-in macros that you can override to achieve this:

Macro generate_database_name: the logic you write in this macro will define the Google project ID for your models materialization.

Snippet of code showing a dbt macro in Jinja template language named generate_database_name, which determines the Google Cloud project ID for dbt model materialization based on the environment (sandbox, UAT, production) and data phase (staging, data warehouse, data mart).

The project ID will be based on the data layer folder within which your model is located and the profile target.

Macro generate_schema_name: the logic you write in this macro will define the BigQuery dataset ID for your models materialization.

A dbt macro in Jinja template language named generate_schema_name which dynamically generates BigQuery dataset IDs by appending a specified dataset prefix to a node attribute, helping to automate and customize schema creation in data modeling.

The dataset ID is determined by the top folder within which your models are located. In the structure above, that would be “google_analytics”, “finance” and “marketing”.

Macro generate_alias_name: the logic you write in this macro will define the BigQuery table ID for your models materialisation.

A Jinja template for a dbt macro named generate_alias_name, which is used to dynamically create alias names for database objects based on the user environment or custom alias, contributing to organized and personalized data management in dbt workflows

We recommend using alias to set the name of a table (see documentation here). The rationale is that model names need to be unique across your project while you might want to have tables that have similar names in between data layers. This is possible using the alias parameter in the config block of your model and your tables will be materialised in different project IDs and datasets and therefore not collide. The macro also prepends the username of the developer to the table in the sandbox environment; this allows multiple analytics engineers to work on the same models without interfering with each other as their BigQuery tables ID will have different names.

Last word before closing this dbt section is the use of env_var and vars. Both serve the same purpose of setting variables to be used in your different dbt files but pay attention to use env_var in case those variables need to change across environments. You then set up those environment variables as environment variables when mounting the KubernetesPodOperator that will execute your dbt operations. dbt vars are to be used for static variables.

Cloud Composer as Orchestrator

Ingestion and transformation need something that glue it all together and that is the responsibility of the orchestrator. Airflow has imposed itself as the reference in recent years and its fully managed service on Google Cloud through Cloud Composer makes it an obvious choice for orchestration. It takes around 20–30 minutes to spin up an instance and you are then immediately ready to deploy some DAGs. Having it as a fully managed service has great benefits such as native integrations with other Google Cloud products, UI protected by IAP and auto-scaling among others.

Airflow DAGs contain various stages and the first one is to trigger the ingestion. This stage will make a call to Airbyte (or the ingestion tool of your choice) in order to fetch the latest data for your landing zone tables. Thereafter comes the dbt operations and each data layer stage runs as a separator Airflow stage. Those tasks run on a KubernetesPodOperator in order to separate orchestration from the execution.

Flowchart detailing an Airflow DAG with stages for data ingestion, quality checks, and dbt transformations, leading to various operational tasks like PubSub and Python execution.

There is another flavour to run the dbt stages by running each dbt model as an Airflow stage. This can be done by exploding the manifest.json rendered by dbt. This gives a lot of granularity but comes with some complexity. My colleague Andrea Bombino has written a detailed article on that subject that you can find here. This article goes over a DAG factory that we have built at Astrafy and that generates DAGs automatically based on yaml configuration files. Your analytics engineers do not need to know anything about Airflow and their DAGs will be generated automatically.

A caveat on Airlfow security is that all the users that have access to Airflow UI can run any DAGs by default. Airflow has a RBAC feature that can easily be exploited to limit access to DAGs to only authorised engineers. We will write a dedicated article on how to set this up; stay tuned.

Distribution of data

At the end of the day, all what matters is what you can get from this huge amount of data. The key factor here is to select a tool that integrates seamlessly with your BigQuery datamarts and dbt.

As far as BI is concerned, Lightdash has gained a lot of traction recently as it has been built with dbt in mind. If you want to get data quickly to your business while keeping fully your semantic layer on BigQuery, then you should consider Lightdash. At Astrafy, we have started to use Apache Superset for our internal dashboards and it has been a breeze to work with. We will soon write an article on how to deploy it easily on GKE using terraform; stay tuned. Then for an enterprise-grade solution, we recommend Looker. It certainly has a high price but its many features from LookML to handle the semantic layer, granular IAM policies to convenience of doing simple and advanced visualisations offset somehow the price you pay.

For the advanced analytics part, we might be biassed as we have only been using Google Cloud but Vertex AI will be your best friend. Google Cloud has differentiated itself from the mainstream Cloud providers mainly with its AI product offering and Vertex AI has it all included.

Last but not least with the emergence of the data mesh paradigm, your data is becoming a data product and as such there is a need to distribute it data via an API. BigQuery is not made for serving APIs so at first you will need to mirror your data on a SQL or noSQL database (for instance Google Cloud SQL or Google Cloud Datastore). Then we recommend building your backend engine using Cloud Run and then use Google Cloud API Gateway as your API framework. As a lot can be said on this topic, a dedicated article will be written on this topic; stay tuned.

References

[1] OLTP: Online transaction processing — OLAP: Online Analytical processing

[2] Analytics engineers provide clean data sets to end users, modeling data in a way that empowers end users to answer their own questions. While a data analyst spends their time analyzing data, an analytics engineer spends their time transforming, testing, deploying, and documenting data.

If you enjoyed reading this first part, do not wait and go to the second part of this article that will cover the following topics:

Infrastructure
dbt versioning
dbt style guide
dbt macros
Data Quality
Data Governance

If you are looking for some support on your dbt implementation, feel free to reach out to us at sales@astrafy.io.

dbt at scale on Google Cloud — Part 1