Welcome back to our exploration of Data Vault 2.0, with a particular focus on putting it into action with dbt. While our first part delved into the theoretical underpinnings and foundational concepts of Data Vault 2.0, this second installment brings those theories to life with a real-world example.

In this segment, we will walk you through a practical scenario, step by step, illustrating how dbt can be effectively used to implement Data Vault 2.0 structures. We’ll explore challenges, solutions, and the tangible benefits derived from this integration. By grounding our discussion in a hands-on example, our aim is to bridge the gap between theoretical knowledge and practical application, ensuring you can confidently harness the power of dbt within the Data Vault 2.0 landscape.

Join us as we take the journey from abstract concepts to tangible solutions, and see firsthand the transformative capabilities of dbt in the world of Data Vault 2.0.

Our Tech Stack: A Cohesive Ensemble for Modern Data Management

Navigating the intricacies of Data Vault 2.0, especially with the hands-on approach we’re emphasizing, requires a solid technological foundation. Let’s dive into the stack that underpins our operations and explore how each component plays its part in our data management strategy:

Google Cloud: As our primary cloud service provider, Google Cloud offers the infrastructure and scalability required to store and process vast amounts of data. Its suite of tools, integration capabilities, and security features form the backbone of our data operations.

BigQuery: Operating within the Google Cloud ecosystem, BigQuery stands out as our choice for a serverless, highly scalable data warehouse. With its fast querying capabilities and pay-as-you-go pricing model, it effortlessly handles large-scale data analytics, ensuring we can process and analyze data seamlessly.

dbt: Acting as the bridge between our raw data and analytical-ready formats, dbt (data build tool) allows us to transform data effectively within BigQuery. By abstracting complex logic and facilitating modular code practices, dbt ensures our Data Vault 2.0 structures are both efficient and maintainable.

Airflow: Workflow management becomes critical as data operations grow in complexity. Apache Airflow enables us to define, schedule, and monitor workflows with precision. Whether it’s orchestrating dbt transformations or managing dependencies, Airflow offers a visual interface and extensive integrations to ensure everything runs smoothly.

Airbyte: As a data integration platform, Airbyte plays a pivotal role in syncing data from various sources into our primary data warehouse in BigQuery. Its adaptable connectors, both for mainstream and niche data sources, ensure that we can funnel data into our system with consistency and reliability.

In essence, our chosen tech stack, anchored by Google Cloud and amplified by tools like dbt, Airflow, and Airbyte, provides a robust, agile, and scalable environment. It not only supports our Data Vault 2.0 implementations but also ensures that we can derive insights, monitor operations, and adapt to changing business needs with unparalleled ease.


DAG Data Product


Adopting the Data Mesh Paradigm with Astrafy’s Sales & Marketing Data

As we journey further into the world of data, it’s important to understand that our approach is firmly rooted in the Data Mesh paradigm. This decentralized data architecture moves away from the traditional monolithic data platforms and leans into domain-oriented decentralized data ownership and architecture. It’s an approach that aligns data solutions closely with the domain teams, promoting autonomy, faster decision-making, and more aligned data products.


What is Data Mesh?


Focusing on Astrafy’s Sales & Marketing Data

For the purposes of this article, we will zoom in on a specific domain: the sales and marketing data from our company, Astrafy. Why this focus? Sales and marketing data is often rich with insights that can transform business strategies, optimize customer journeys, and unveil growth opportunities. Given its significance, understanding how to manage, transform, and derive insights from this data using the Data Vault 2.0 approach is invaluable.

Our exclusive data source for this segment is HubSpot, a leading sales and marketing platform. All the sales and marketing data points, from lead interactions and email campaigns to customer conversion metrics, originate from this platform.

Data Ingestion via Airbyte

To ensure a seamless flow of this data into our primary data warehouse in BigQuery, we employ Airbyte. Acting as our primary data integration conduit, Airbyte has a specialized connector for HubSpot, which ensures that data syncs are reliable, timely, and free from discrepancies. It’s a tool that eliminates the traditional friction points associated with data extraction and loading, paving the way for smooth data operations.

Visualizing the Data Relationships with ERD

To truly grasp the interrelationships between different data entities and their hierarchies, we rely on Lucidchart. Lucidchart offers an intuitive platform to create Entity Relationship Diagrams (ERDs) that visually map out the data landscape.

In the subsequent sections, we will deep-dive into the specifics of transforming this data using dbt, structuring it with Data Vault 2.0 conventions, and deriving actionable insights. By focusing on Astrafy’s sales & marketing data from HubSpot, we aim to provide readers with a practical, hands-on view of the entire process, from data ingestion to insight generation. Using an ERD from Lucidchart, we will provide a detailed view of how each data entity within our sales and marketing segment is interconnected.

Every well-structured data architecture consists of layers that serve distinct functions in the data processing pipeline. This modularity ensures both flexibility and robustness in managing data. Our architectural design at Astrafy abides by this principle, and it’s worthwhile to break down each layer for a clearer understanding:

Sources

This is the foundational layer where data initially lands, ingested directly from our various data sources, such as HubSpot. It contains raw, unprocessed data in its original form. There’s no transformation or cleanup done at this stage, ensuring that the rawest version of data is always preserved for reference.

Raw

At this layer, data from the sources is slightly structured to fit into a common schema, but remains largely unprocessed. It acts as a landing zone for incoming data, prepped for further processing. Typically, any system-specific metadata or fields will be mapped or renamed to fit a standard schema, but without altering the core data.


Raw layer


Staging

The staging layer is where the raw data undergoes its first round of substantial transformations. Redundancies are removed, data types are standardized, and necessary calculations are performed. The result is a cleaner, more organized dataset ready for deeper processing. This is especially where dbt shines, allowing for scalable transformations.
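While the staging models themselves are not reproduced in this article, a common pattern with the AutomateDV dbt package (which we use for the Raw Data Vault later on) is a staging model that derives the hash keys and metadata columns the vault objects need. The sketch below assumes this pattern; the model, source, and column names are hypothetical:

```sql
{{ config(materialized='view') }}

{%- set yaml_metadata -%}
source_model: raw_hubspot_deals            # hypothetical raw-layer model
derived_columns:
  RECORD_SOURCE: "!HUBSPOT"                # constant identifying the source system
  LOAD_DATETIME: airbyte_load_date
hashed_columns:
  DEAL_HK: deal_id                         # hash key for the Deal hub
  DEAL_HASHDIFF:
    is_hashdiff: true
    columns:                               # descriptive attributes tracked for changes
      - deal_name
      - deal_stage
      - amount
{%- endset -%}

{% set metadata_dict = fromyaml(yaml_metadata) %}

{{ automate_dv.stage(
    include_source_columns=true,
    source_model=metadata_dict['source_model'],
    derived_columns=metadata_dict['derived_columns'],
    hashed_columns=metadata_dict['hashed_columns'],
    ranked_columns=none
) }}
```

The stage macro generates the SQL that adds these constants, derived columns, and hash keys on top of the raw model, so the downstream Hub, Link, and Satellite models can remain purely declarative.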


Staging layer


Raw Data Vault

This layer, often regarded as the cornerstone in modern data architecture discussions, leans heavily on the principles and structures outlined in the Data Vault 2.0 methodology. But what truly makes this layer stand out?

At its core, the Raw Data Vault layer serves as a repository. However, it’s not just any repository. The precision with which data is organized into Hubs, Links, and Satellites sets it apart. This organizational structure is the heart of the methodology, acting as a resilient framework for data ingestion, storage, and retrieval.

Let’s delve a bit deeper:

  • Creation of Hubs: Hubs play a pivotal role as they hold the unique business keys. The process of creating a Hub involves identifying these unique business entities and ensuring they are effectively indexed. By doing so, we ensure rapid and efficient data retrieval, forming the backbone of the layer.

  • Establishing Links: Once Hubs are in place, it’s essential to understand and map the intricate web of relationships between these business keys. Links come into play here. The creation of Links is an exercise in understanding the data’s relational aspects. By meticulously crafting these connections, the Data Vault layer offers a rich tapestry of interrelated data points, making the landscape both expansive and interconnected.

  • Populating Satellites: Satellites can be seen as the contextual layer that adds depth and richness to the Hubs and Links. The process of creating Satellites involves appending descriptive, temporal, and sometimes historical data to the already established Hubs and Links. This step is where the data truly comes alive, giving it dimension and detail.

Beyond its organizational prowess, the Raw Data Vault layer underscores the importance of longevity in data storage. It’s built for the long haul, prioritizing both scalability and adaptability. Its architecture is such that even if source systems undergo substantial changes, the downstream applications remain unaffected. It’s this buffer, this layer of insulation, that makes the Data Vault approach so compelling. By maintaining this solid foundation, it not only ensures that data is consistently accessible, but it also preconditions the data for more nuanced, business-centric transformations in subsequent layers. From a more technical point of view, we’ll dive into the mechanics of crafting Hubs, Links, and Satellites using dbt packages in a later section.

Business Data Vault (optional)

The Business Data Vault is an integral yet optional extension in the Data Vault architecture spectrum. Situated between the Raw Data Vault and the Information Data Vault, it acts as a mediator, enhancing the raw data with business context without heavily transforming it. Let’s explore some of the core components and processes associated with this layer.

Bridge Tables: In scenarios where complex many-to-many relationships need to be portrayed, Bridge Tables come into play. They allow for a flexible structure that bridges the gap between different Hubs or business keys, offering a nuanced representation of data relationships without compromising the integrity of the original raw data.

Point in Time Tables (PIT): When historical context is vital, PIT tables serve as the linchpin. They allow stakeholders to retrieve a snapshot of data at any given point in time, aiding in trend analysis, historical comparisons, and evaluating data’s temporal evolution. By providing a comprehensive view of how data attributes change over time, PIT tables are instrumental for businesses that need to make decisions based on past trends and patterns.
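To make PIT tables concrete, here is an illustrative query showing how one is typically used: for each business key and as-of date, the PIT table stores the satellite load timestamps that were current at that date, so that answering “what did this deal look like on a given day?” becomes a simple equi-join. The table and column names below are hypothetical; AutomateDV, the dbt package we use later for the Raw Data Vault, also ships pit and bridge macros to build these structures.

```sql
-- Illustrative only: hypothetical PIT and Satellite tables for HubSpot deals.
select
    pit.deal_hk,
    pit.as_of_date,
    sat.deal_stage,
    sat.amount
from {{ ref('pit_deal') }} as pit
left join {{ ref('sat_hubspot_deal_details') }} as sat
    on  sat.deal_hk       = pit.sat_hubspot_deal_details_pk
    and sat.load_datetime = pit.sat_hubspot_deal_details_ldts
where pit.as_of_date = date '2023-06-30'   -- snapshot of the deals as of this date
```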

Soft Business Rules: While the Raw Data Vault remains agnostic to business rules, ensuring data is in its most genuine, unaltered state, the Business Data Vault introduces soft business rules. These rules are not as rigid as hard-coded transformations but provide a layer of business logic, filtering, or enrichment to the raw data. This ensures that data is presented in a slightly more refined form, making it more aligned with business perspectives without over-transforming it.

In essence, the Business Data Vault serves as a semi-processed layer. It strikes a balance between the pristine nature of the Raw Data Vault and the highly tailored data in the Information Data Vault. By introducing components like Bridge Tables, PIT tables, and soft business rules, it presents data in a manner that is more aligned with business semantics, thereby facilitating a smoother transition to the final consumable data structures in the Information Data Vault.

Information Data Vault

Situated at the apex of our architecture, the Information Data Vault layer transforms the data from the Raw Data Vault into more consumable formats, tailored to business needs. Moving on from the Raw Data Vault’s modeling conventions, this layer adopts the “star schema” approach, further enhancing data accessibility and interpretation and effectively bridging the technical and business realms.

This is where data evolves from being merely stored to becoming actionable. The granular data of the Raw Data Vault is reshaped into user-friendly formats, often through the creation of summaries, KPIs, or other business-relevant structures. The overarching objective of this layer is to make clear, coherent data insights accessible to a diverse range of stakeholders, from data analysts to high-level decision-makers.


Raw Data Vault & Information Vault


Implementation with dbt and automate_dv

Incorporating the principles of Data Vault 2.0 into our sales-marketing data product has proven to be a rewarding endeavor. Let’s break down the implementation journey, highlighting the essential steps and the code snippets that best illustrate each phase.

In the data world, efficiency is key. While it’s tempting to load all the data every time, doing so is far from efficient, especially when dealing with large datasets. Incremental loading is therefore a game-changer. Here’s a snippet that shows how we achieve this in our sales-marketing data product, using dbt in tandem with data ingested from Airbyte:


Incremental dbt model: incremental materialization partitioned by load date, a CTE for raw company data, a subquery unnesting company contacts, and a final SELECT filtered on the load date
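Reconstructed from the screenshot above, the model looks roughly like the sketch below. Source, table, and column names are illustrative, and the exact incremental filter will depend on how partitions are selected in your setup:

```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by={
            'field': 'airbyte_load_date',
            'data_type': 'date',
            'granularity': 'day'
        }
    )
}}

with raw_companies as (

    -- Raw HubSpot companies as landed by Airbyte (hypothetical source names)
    select
        id                            as company_id,
        archived,
        contacts,                     -- array of associated contact ids
        created_at,
        updated_at,
        date(_airbyte_extracted_at)   as airbyte_load_date
    from {{ source('hubspot', 'companies') }}

),

unnested_contacts as (

    -- Give each contact associated with a company its own row
    select
        raw_companies.* except (contacts),
        contact_id
    from raw_companies
    cross join unnest(raw_companies.contacts) as contact_id

)

select
    *,
    '{{ invocation_id }}'  as dbt_invocation_id,   -- metadata for traceability
    current_timestamp()    as record_loaded_at
from unnested_contacts

{% if is_incremental() %}
-- With insert_overwrite, only the most recent partitions are reprocessed and replaced
where airbyte_load_date >= date_sub(current_date(), interval 1 day)
{% endif %}
```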


Understanding the Code

Incremental Configuration: At the very beginning, we use dbt’s config block to set the model to incremental materialization. We’ve chosen the insert_overwrite strategy, which means that when new data comes in, it replaces the existing data in the partitions it touches. Here, the partitioning is done on airbyte_load_date, with day-level granularity.

Data Extraction: Within the raw_companies Common Table Expression (CTE), we pull the necessary fields from our raw HubSpot source, such as company ID, its archival status, associated contacts, and their creation and update timestamps.

Unnesting Arrays: The HubSpot source might contain nested arrays, like the list of contacts associated with a company. We handle this in the unnested_contacts CTE by employing the UNNEST function, ensuring each contact gets its separate row.

Final Selection: Post unnesting, we select all columns and also append some metadata for traceability.

Incremental Filtering: Finally, the data is filtered based on the incremental strategy, ensuring that only the new or updated data gets loaded into the final table, optimizing performance and efficiency.

Now let’s take a peek at how we’ve set up our Hubs, Links and Satellites for the deals coming from HubSpot using dbt and the automate_dv package:

  • Hub


Creation of Hub using AutomateDV
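As a reference, a Hub model built with AutomateDV typically amounts to little more than parameter definitions and a macro call. A minimal sketch, with hypothetical staging-model and column names:

```sql
{{ config(materialized='incremental') }}

{%- set source_model = 'stg_hubspot_deals' -%}   -- hypothetical staging model
{%- set src_pk = 'DEAL_HK' -%}                   -- hashed primary key
{%- set src_nk = 'DEAL_ID' -%}                   -- natural (business) key from HubSpot
{%- set src_ldts = 'LOAD_DATETIME' -%}           -- load timestamp
{%- set src_source = 'RECORD_SOURCE' -%}         -- source system identifier

{{ automate_dv.hub(
    src_pk=src_pk,
    src_nk=src_nk,
    src_ldts=src_ldts,
    src_source=src_source,
    source_model=source_model
) }}
```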


  • Link


Creation of Link using AutomateDV


At the core is the automate_dv.t_link macro. Based on the provided metadata parameters, this macro dynamically generates the SQL required to create or update a Transactional Link table. This dynamic generation ensures the resulting Link table correctly captures relationships between Hubs, adjusting for any changes in the source data or its structure.
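A sketch of such a Transactional Link model, again with hypothetical staging-model and column names (the payload and effectivity columns are illustrative):

```sql
{{ config(materialized='incremental') }}

{%- set source_model = 'stg_hubspot_deal_company' -%}   -- hypothetical staging model
{%- set src_pk = 'DEAL_COMPANY_HK' -%}                  -- hash key of the relationship
{%- set src_fk = ['DEAL_HK', 'COMPANY_HK'] -%}          -- hash keys of the related Hubs
{%- set src_payload = ['DEAL_AMOUNT'] -%}               -- transactional attributes
{%- set src_eff = 'EFFECTIVE_FROM' -%}                  -- effectivity date of the relationship
{%- set src_ldts = 'LOAD_DATETIME' -%}
{%- set src_source = 'RECORD_SOURCE' -%}

{{ automate_dv.t_link(
    src_pk=src_pk,
    src_fk=src_fk,
    src_payload=src_payload,
    src_eff=src_eff,
    src_ldts=src_ldts,
    src_source=src_source,
    source_model=source_model
) }}
```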

  • Satellite


Creation of Satellite using AutomateDV
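And a corresponding Satellite model sketch, with hypothetical payload columns describing the deal:

```sql
{{ config(materialized='incremental') }}

{%- set source_model = 'stg_hubspot_deals' -%}                   -- hypothetical staging model
{%- set src_pk = 'DEAL_HK' -%}                                   -- hash key of the parent Hub
{%- set src_hashdiff = 'DEAL_HASHDIFF' -%}                       -- payload hash, used for change detection
{%- set src_payload = ['DEAL_NAME', 'DEAL_STAGE', 'AMOUNT'] -%}  -- descriptive attributes
{%- set src_ldts = 'LOAD_DATETIME' -%}
{%- set src_source = 'RECORD_SOURCE' -%}

{{ automate_dv.sat(
    src_pk=src_pk,
    src_hashdiff=src_hashdiff,
    src_payload=src_payload,
    src_ldts=src_ldts,
    src_source=src_source,
    source_model=source_model
) }}
```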


Let’s now dive into the details of these 3 different objects:

  • Configuration: First and foremost, we specify that our dbt models are materialized as ‘incremental’. This means, upon subsequent runs, only new or changed data gets processed, offering enhanced performance.

  • Setting the Parameters: Before invoking the creation logic for each of the 3 objects, we define a few parameters such as:

  • source_model: Refers to the staging model.

  • src_pk: Specifies the primary key.

  • src_nk: Represents the natural key from our source data.

  • src_ldts: Denotes the timestamp indicating when the data was loaded.

  • src_source: Indicates the source of the data.

  • Automating Object Creation: Finally, we invoke the macros from the automate_dv package, as illustrated in the sketches above. By feeding each macro our previously set parameters, it generates the necessary SQL to construct the Hubs, Links, and Satellites. This simplifies and standardizes the process, ensuring consistency across all objects.

Orchestration via Airflow

In modern data management, consistent and timely updates are crucial. To this end, ensuring that our data product, ‘sales-marketing’, operates like clockwork is a non-negotiable priority. Apache Airflow, a platform to programmatically author, schedule, and monitor workflows, is our chosen orchestrator for this task. Here’s how we’ve harnessed its power to run our data product at the desired frequency.

Designing the Airflow DAG

A Directed Acyclic Graph (DAG) is at the heart of any Airflow operation. For our ‘sales-marketing’ data product, we’ve designed a DAG that captures the entire lifecycle, from fetching the latest data from HubSpot to running transformations via dbt and loading the results into our BigQuery data warehouse.

Containerizing dbt

We rely on dbt for the data transformations that implement our Data Vault 2.0 modeling practices. While dbt commands can be executed locally, within our Airflow setup we use the KubernetesPodOperator: dbt runs within isolated, containerized environments on Kubernetes, ensuring both scalability and reliability for our data transformations.

Building in Resilience

One of the beauties of Airflow is its capacity for handling failures. We’ve configured retry policies that automatically attempt to re-run failed tasks. By doing so, we can ensure data consistency even in the face of transient errors or system hiccups.

Dependencies and Parallelism

Airflow’s capability to run tasks in parallel, while also honoring task dependencies, ensures that our data product is processed efficiently. We’ve defined the interdependencies between various tasks (e.g., data extraction from HubSpot should precede dbt transformations). Concurrently, independent tasks are set to run simultaneously, leveraging Airflow’s parallel execution capabilities, thus optimizing the overall run time.

Logging and Auditing

To maintain transparency and aid in troubleshooting, all operations related to the ‘sales-marketing’ data product within Airflow are logged. These logs offer invaluable insights into the execution timeline, potential bottlenecks, and areas of optimization.

Pros and Cons

Data Vault 2.0 and dbt packages offer a compelling combination for modern data warehousing solutions. But as with any technology stack, there are both advantages and potential drawbacks to consider. Let’s delve into the pros and cons of adopting this duo for your data needs.

Pros

  • Flexibility and Scalability

Data Vault 2.0’s architecture ensures that as data grows and changes, the system can easily adapt. This scalability is further enhanced when using dbt, as it can efficiently manage and transform massive datasets.

  • Historical Data Preservation

One of the standout features of Data Vault 2.0 is its ability to keep an accurate history of data changes. This is crucial for auditing, compliance, and understanding data evolution.

  • Modular Development

dbt promotes modularity, allowing teams to work on different parts of the data model concurrently. This modular structure aligns perfectly with Data Vault 2.0’s separate components, like Hubs, Satellites, and Links.

  • Streamlined Transformations

With dbt’s focus on ‘transform’ in the ELT process, it simplifies and augments Data Vault 2.0’s transformation logic, making it easier to maintain and iterate on the data model.

  • Native SQL Environment

Both dbt and Data Vault 2.0 leverage SQL, a ubiquitous language in the data world. This promotes easier collaboration and smoother integration between the two.

Cons

  • Complexity

One of the primary criticisms of Data Vault 2.0 is its complexity. While it’s rich in features, it may be overkill for simpler projects or smaller datasets.

  • Learning Curve

For teams unfamiliar with Data Vault 2.0 or dbt, there can be a steep learning curve. Proper training and resources are essential to harness their full potential.

  • Longer Implementation Time

Initial setup and configuration, especially when integrating multiple tools like dbt, Data Vault 2.0, and a cloud data platform like BigQuery, can be time-consuming.

  • Dependency on Tooling

While dbt enhances Data Vault 2.0’s capabilities, there’s an inherent risk in becoming too reliant on a specific tool or package. If the package is deprecated or doesn’t receive updates, it could impact the data workflow.

  • Costs

Especially for larger teams or massive datasets, the combined use of Data Vault 2.0, dbt, and cloud platforms can escalate costs. Monitoring and managing these costs becomes essential.

Empowering Data Decisions

Before concluding our exploration, we shift our focus to the Business Intelligence (BI) component in our data ecosystem, specifically highlighting the role of Lightdash. Our journey has taken us from data collection, transformation, and storage to the culmination of using this data for actionable insights. The integrity and full historical context maintained by our Data Vault architecture ensure that stakeholders can trust the information presented.

Lightdash has emerged as a robust platform, adept at transforming our dbt models into interactive dashboards filled with insights. The close integration of Lightdash with dbt ensures a harmonious transition from data transformation to visualization, effectively reducing discrepancies and bolstering trust in our reports. Moreover, the collaborative nature of Lightdash allows teams to collectively define and refine metrics, fostering a cohesive understanding of KPIs throughout the organization. A key advantage of our visualizations in Lightdash is the comprehensive historical data drawn from our Data Vault, which assures stakeholders of the data’s completeness and veracity.

However, Lightdash is merely one facet of the broader data consumption landscape at Astrafy. Other tools have their distinct roles tailored to our diverse needs:

TinyBird steps in when we talk about real-time analytics via APIs. It’s more than just quick querying. By integrating a Datamart table with TinyBird, departments gain the ability to directly query our BigQuery tables through TinyBird, offering a dynamic and responsive way to interact with data.

For our predictive analytics, Vertex AI is our go-to platform, providing the machine learning capabilities needed to interpret and predict based on our vast datasets.

In tying together the visualization strengths of Lightdash with our dbt and Data Vault 2.0 framework, we’ve curated a comprehensive data solution. This synergy guarantees data coherence and a structured approach, enabling actionable, data-driven insights for all Astrafy stakeholders. As our data needs diversify across BI, API-driven products, or machine learning, our suite of tools, including Lightdash, TinyBird, and Vertex AI, assures that we remain at the forefront of data innovation.

Conclusion

As we wrap up this second chapter of our exploration into dbt and Data Vault 2.0, it’s evident how these tools revolutionize the way we perceive, manipulate, and store data. Our hands-on dive into the sales-marketing data product showcased the synergy of combining best-in-class tools and methodologies, from leveraging the power of BigQuery and the automation capabilities of Airflow to harnessing the efficiency of dbt and the structured approach of Data Vault 2.0.

By walking through the real-world example of Astrafy’s sales and marketing data from HubSpot, we illuminated the practical aspects and intricacies involved. The code snippets served as a testament to how dbt packages, particularly the likes of automate_dv, can demystify complex data modeling tasks, transforming them into understandable and manageable units.

However, it’s essential to note that while these tools and methods are potent, the true strength arises from a well-thought-out implementation. Choices between packages, design decisions, and even nuances in coding can deeply impact the final data product.

Looking forward, as the world of data engineering and analytics continues to evolve, the marriage of tools and methodologies like dbt and Data Vault 2.0 will undoubtedly play a pivotal role in driving data-driven decisions and innovations. The journey we’ve embarked on with Astrafy is just one testament to the possibilities ahead.

Thank you

Thank you for reading this article! If you are looking for some support on your data vault implementation, feel free to reach out to us at sales@astrafy.io.