Introduction

Data quality has been actively debated in recent years, yet it is still not a core pillar in most organizations' data ecosystems. Apart from the issues overlooked while planning data workflows and their respective tests, there is a major issue omnipresent in all data applications: the ever-changing nature of the business and technical rules of data sources. These issues arise from several scenarios:

  • Source schema evolution — altered field names, new data types, and any other conflicting change to the expected schema.

  • Newly supported/unsupported business logic — Rules that held in the past can change, and if downstream consumers are not alerted, outdated filters or hard-coded rule checks can silently create issues.

  • Unplanned changes and miscommunication between consumers and producers — A producer might be focused on their own product and delivery agenda and disregard the implications of a change, or consumers might not be able to handle the upcoming changes in the given timeframe.

Many other scenarios are bound to happen, but the key point is that if your application or business decisions are data-driven, you should adhere to rules and agreements between whoever produces the data and whoever consumes it. The illustration below depicts the vicious circle generated by upstream changes in data.

Diagram showcasing all the typical issues and conflicts that occur when a change on an upstream pipeline happens.
A typical cascade of issues when Data changes. Image sourced from opendatacontract.

These agreements are usually called “Data Contracts”: a virtual handshake between the two parties on the expectations for incoming data. Those expectations cover the following characteristics (a minimal YAML sketch follows this list):

  • Schema — Structural rules regarding field names, data types, and any other similar aspects.

  • Producer/Owner — Who owns the data and is responsible for changes and for communicating them. Direct contacts or groups can be attached to a particular data source.

  • Consumers/Subscribers — Who consumes the data: the parties and projects that will be impacted by a change and should be followed up with. Keep in mind that each subscriber has a different capacity to react to changes. Any problem with the execution of the data pipeline should alert all subscribers, and the subscriber list also helps security teams apply access policies.

  • Freshness/Scheduling — When is this data expected to be refreshed? Any change in the schedule will impact downstream workflows that expect this data to be fresh on a given date and time.

  • Versioning — As mentioned before, data is ever-changing, and so are its contracts. Distinguish between minor changes and breaking changes, and provide a list of changes with each release.

  • Business Rules — These can be used as the core assertions in the orchestrated functional quality tests.

  • Metadata — Any details that will contribute to the data governance of this data asset.
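
As a preview of how these characteristics can be expressed in practice, here is a minimal, hypothetical sketch in the YAML format of the Data Contract Specification used later in this article. All values are placeholders, and section names such as servicelevels are based on the specification documentation and may differ between versions.

# hypothetical-contract.yaml: mapping the characteristics above to contract sections
dataContractSpecification: 0.9.3
id: example-contract
info:
  title: Example Data Product Contract
  version: 1.0.0                      # versioning of the contract itself
  owner: Example Team                 # producer/owner of the data
  contact:
    email: example-team@example.com   # direct contact for changes
models:                               # schema: fields, types, descriptions
  example_table:
    type: table
    fields:
      id:
        type: int
servicelevels:                        # freshness/scheduling expectations
  freshness:
    threshold: 25h
    timestampField: example_table.updated_at
quality:                              # business rules as executable checks
  type: SodaCL
  specification:
    checks for example_table:
      - row_count > 0

The consumer/subscriber list and additional metadata are typically maintained alongside the contract, for example in a data catalog.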

The following illustration depicts in a simple form a data contract between a data provider and a data consumer:

A diagram displaying the interactions between data producers and consumers and their respective agreement in a data contract.
Data contract and respective Data Provider and Consumer interactions. Image sourced from datamesh-manager.

Quality between producers and consumers

On the data battlefield, when something breaks, no one wins. Everyone gets their fair share of headaches, time is spent debugging and contacting the involved parties, and consumers either lose trust in the data or take action based on false information.

This problem amplifies as more dependencies are created on a particular subset of data. Let’s assume the following scenario, shown below as a simple dependency graph of a Data Mesh use case.

A lineage graph showing the node dependency between source systems and downstream data products.
Simple Data Mesh representation of a Source system and downstream dependent data products.

Many data quality discussions focus on the local impact between a producer and a consumer but rarely emphasize the real impact on a highly dependent system. This is where the “Data Mesh” principle can become a data mess.

  • How can we communicate error events or upcoming changes from Data Product #1 to the team managing Data Product #4?

  • Do we roll out changes product by product, even though the most downstream products rely on clean data from the sources?

  • What if incremental strategies are used in these intermediate data products?

Versioned datasets can be life-saving, but always ensure that your consumers can plan and implement changes on their data pipelines without creating a chain of chaos.

Additionally, data contracts should drive a more transparent workflow; no more obscure black boxes on what is done, tested, and expected. Open a communication channel and find a common ground to define those data contracts. Every producer and consumer of data needs to be aligned.

Data Quality tools

A few popular tools are available on the market and have proven their value (a minimal example of the same check in two of them follows this list):

  • Great Expectations — A Python-based open-source library that allows users to write declarative tests, called expectations, to validate, document, and profile their data.

  • Soda — A data monitoring platform that provides tools to automate monitoring, quality checks, and alerts for data anomalies directly within your data warehouse or data pipelines.

  • dbt — An SQL-based tool that transforms data in your warehouse by enabling data modeling, testing, and documentation through simple configuration files.
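
To give a feel for how differently these tools express the same rule, here is a minimal, hypothetical sketch of a simple not-null check on an order_id column, written once as a dbt schema test and once as a SodaCL check; in Great Expectations, the equivalent would be the expect_column_values_to_not_be_null expectation.

# dbt: schema.yml test attached to a model column
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - not_null

# Soda: SodaCL check on the same table
checks for orders:
  - missing_count(order_id) = 0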

So why would you choose one over the other? Here are a few pros and cons for each.

A matrix table with the pros and cons of some of the referenced data quality tools (dbt, Great Expectations, and Soda).
Data quality tool Pros and Cons based on personal experience.

Note that the comparison is based on personal experience. Each of the mentioned tools also offers cloud SaaS products that further enhance the overall data quality experience.

Data Contract CLI

So what is the correct tool to define a Data Contract? There is no absolute rule or perfect tool, and it might even be necessary to build a custom solution for your use cases.

You could rely on any of the popular data quality testing tools, such as those mentioned above, but each has its own configuration and execution methods, which might not be business-friendly and can create ambiguity and confusion for less technical collaborators. Additionally, choosing a particular tool to define and execute your contracts locks your data quality stack into that tool, with all the caveats behind it; if the tool no longer suits your requirements, a big effort is required to migrate all the tests.

So what would be an ideal tool? Let’s see some core features that could enhance our data contract declaration:

  • Easy to understand — All involved parties, no matter their business or technical background, should fully understand a data contract definition. Additionally, it should be easy to review any changes in a versioning system, such as a git repository.

  • Tool-agnostic — The contract definition should not depend on the tool it runs on, so the underlying execution tool can be swapped when needed.

  • Support simple and advanced test cases — It should be able to validate both basic schema definitions and business rules.

  • Able to import existing data/schema definitions — To increase adoption and reduce the effort of implementing data quality checks on existing data pipelines.

Here is a fairly recent tool that caught our attention: the Data Contract CLI, developed by INNOQ. It seems to meet most of our requirements and is easy to use. It allows you to:

  • Generate data contracts from existing schema definitions (SQL DDL, JSON Schema, Avro, Protobuf); a quick import example follows this list

  • Define data contracts in a simple YAML format

  • Run built-in tests against many popular data solutions (BigQuery, Postgres, Kafka, etc.)

  • Export the data contract definition to checks for popular data quality testing tools (Great Expectations, Soda, dbt)
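
For instance, to bootstrap a contract from an existing schema instead of writing it by hand, the CLI provides an import command. The flags below are based on the CLI documentation at the time of writing, and the file names are placeholders.

$ datacontract import --format sql --source orders_ddl.sql
$ datacontract import --format avro --source orders.avsc

The generated contract can then be reviewed, adjusted, and committed like any other contract file.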

Image that displays all the features and integrations of the Data Contract CLI tool.
Data Contract CLI supported frameworks and features. Image sourced from cli.datacontract.

Additionally, it provides other useful features that could enhance your overall experience such as:

  • Breaking change detection

  • Changelog generation

  • Programmatic execution in Python (see the short sketch after this list)

  • Publishing integration with Data Mesh Manager and OpenTelemetry
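
The programmatic execution looks roughly like the sketch below, based on the Python API documented in the CLI's README at the time of writing; exact import paths and method names may change between versions.

from datacontract.data_contract import DataContract

# Load the contract and run all of its checks against the configured server
data_contract = DataContract(data_contract_file="datacontract.yaml")
run = data_contract.test()

if not run.has_passed():
    # React to failed checks, e.g. stop the downstream pipeline or alert subscribers
    raise RuntimeError("Data contract validation failed")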

It’s a refreshing tool that can be a game changer in the data landscape, and we recommend checking out INNOQ’s other blogs and products.

Usage

Data Contract declaration

Let’s assume a real-life scenario where you want to build a data product that uses a table from the Sales team’s data product. Your data product will focus on forecasts based on historical sales data; keep in mind that the Sales team owns the source product, so they can edit your data sources and introduce breaking changes.

Example scenario with a source system and a downstream Sales data product, which is in turn a source of data for the Forecast data product.
Forecast data product dependency lineage.

As such, you express the need to establish a Data Contract to secure stability and the communication of any upcoming changes.

First, let’s look at the schema of one of the tables of interest. We can use an example data product from dbt-labs, the “orders” table, which you can find at dbt-labs/jaffle_shop. As demonstrated in the picture below, we want to map every field without any transformation into our staging layer, so any schema or granularity change will impact our underlying data product.

Source table mapping from the source Sales data product to the consumer Forecast data product staging layer table.

As a suggestion, the Data Contract should include all the fields in the source table, even fields you are not currently using: the addition or deletion of a field can also change granularity and context, and if you later want to use a previously unused field, its constraints are already in the contract, reducing the number of iterations on the contract.

So let’s agree on the rules with our Data provider:

  • Schema details should be static, with no data type or naming changes

  • In this scenario, no negative values are allowed in the amount field

  • The table should be populated with data; no empty sources are allowed

  • Order dates should always be within a valid range

  • Status should only support values such as placed, shipped, completed, return_pending, returned

Once the rules are agreed upon, let’s write our Data Contract Specification file using the Data Contract CLI tool. First, set up your work environment by following the instructions on the official documentation page. When your environment is ready, start by creating your first contract file.

$ datacontract init

This first command generates a template. We can then fill in our first data contract details, starting with the basic definition.

# datacontract.yaml
dataContractSpecification: 0.9.3
id: sales-contract
info:
   title: Sales Data Product Contract
   version: 0.0.1
   description: |
      Data Contract regarding all data sources from the Sales Data Product
   owner: Sales Team
   contact:
      name: Sales Team Support        # placeholder contact details
      email: sales-team@example.com

A single data contract file supports multiple tables, but you can also keep one data contract file per table. Let’s now define our orders model and its expected schema.

# datacontract.yaml
...

models:
   orders:
      description: |
        Fact table that contains all orders per client purchase.
      type: table
      fields:
        order_id:
          type: int
          description: Order identifier number.
        customer_id:
          type: int
          description: Customer identifier number. 
        order_date:
          type: date
          description: Date of the order event.
        status: 
          type: string
          description: Order status.
        amount:
          type: decimal
          description: Total amount of the order.

A quick note: at the time of writing this article, many of the features shown in the data contract specification documentation are not yet available in the Data Contract CLI tool. We can now verify the data contract’s integrity by executing the following command.

$ datacontract lint datacontract.yaml

For some of the more advanced data quality checks, you will have to pick a data quality tool to define them. Hopefully, in the future, the CLI will also provide a way to abstract these quality tests so they remain fully usable no matter which data quality tool executes them.

Since the data contract quality checks below are declared with the SodaCL type, if you later export the contract to another tool such as Great Expectations, only the model validations will be exported.

# datacontract.yaml
...

quality:
  type: SodaCL
  specification:
    checks for orders:

      - row_count > 0

      - invalid_count(status) = 0:
          name: Check if status is valid
          valid values: 
            - placed
            - shipped
            - completed
            - return_pending
            - returned
          
      - invalid_count(amount) = 0:
          name: Check if the amount is at least 0
          valid min: 0

      - failed rows:
          name: Check if orders are within valid dates
          fail condition: order_date NOT BETWEEN '2000-01-01' AND CURRENT_DATE()

Finally, set your backend connection details; in this case, we will use Google Cloud’s BigQuery, but feel free to check the supported databases in the official documentation.

# datacontract.yaml
...

servers:
  production:
     type: bigquery
     project: <GCP-PROJECT-ID>
     dataset: <BIGQUERY-DATASET-ID>

You can add multiple servers/environments, and when executing the tests, simply select the proper server name.

Now you should be able to start your tests; just make sure you are authenticated to your data platform first.
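
For BigQuery, for example, the CLI needs credentials for the project that hosts the dataset. The environment variable below is an assumption based on the CLI documentation at the time of writing, so double-check the exact name for your version.

# Assumed variable name: point the CLI to a service account key file
$ export DATACONTRACT_BIGQUERY_ACCOUNT_INFO_JSON_PATH=/path/to/service-account-key.json

Once authenticated, run the tests against the server defined in the contract: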

$ datacontract test --server production datacontract.yaml

You should see the execution status in the logs, as depicted in the following figure. If any rule is breached, the respective test will be displayed as failed. All tests executed on the BigQuery table can be audited in the BigQuery Job History panel.

Table of results with all the data quality checks executed from the data contract definition. Displays 20 checks and that the contract is valid.
Data contract CLI test execution log

Optionally, to export the data contract specification as a dedicated SodaCL YAML file, you can use the export command.

$ datacontract export --format sodacl datacontract.yaml

Development

Apart from creating the data contracts, developers should also adopt best practices while working on the data contract repository. Here are a few suggestions, which can be implemented as CI pipelines or pre-commit hooks using some of the CLI commands (a shell sketch follows this list):

  • Perform linting on all changed/new data contracts

  • Generate a changelog for each tag/release

  • Identify breaking changes by comparing changed files between commits/tags. When a breaking change occurs, bump up the major version of the data contract

  • Optionally, export to your target data quality tool format
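
A rough sketch of what these steps can look like with the CLI commands available at the time of writing; command names are based on the official documentation, and the file names are placeholders.

# Lint a changed or new data contract
$ datacontract lint contracts/sales-contract.yaml

# Generate a changelog between the released and the new contract version
$ datacontract changelog contracts/sales-contract-v1.yaml contracts/sales-contract-v2.yaml

# Detect breaking changes between two contract versions
$ datacontract breaking contracts/sales-contract-v1.yaml contracts/sales-contract-v2.yaml

# Optionally, export to the target data quality tool format
$ datacontract export --format sodacl contracts/sales-contract.yaml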

Orchestration

For reliable execution of the data contracts, we recommend building a Docker image with all the required dependencies and executing it with an orchestration tool such as Airflow (a minimal DAG sketch is shown after the workflow figure below).

You can keep a dedicated folder in your git repository for all the data contracts, such as a contracts path.

Then, before any data transformation stage, your pipeline should execute a data contract test per file; or, if all the related data sources live in a single data contract file, just test that single file.

After validating the source data, the pipeline can proceed with the data transformation workflow.

Example workflow displaying stages from the code development, CI/CD, image building and data validation/transformation execution.
Example Data Contract and Data Transformation workflow.
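
To make this concrete, below is a minimal, hypothetical Airflow DAG that runs the contract tests before the transformation step. The DAG, task names, and file paths are illustrative, and it assumes the Data Contract CLI is installed in the runtime image and exits with a non-zero code when checks fail, so the orchestrator stops the pipeline.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="forecast_data_product",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Validate the upstream Sales data against the data contract first
    validate_sales_contract = BashOperator(
        task_id="validate_sales_contract",
        bash_command="datacontract test contracts/sales-contract.yaml --server production",
    )

    # Placeholder for the actual transformation stage (e.g. a dbt run)
    run_transformations = BashOperator(
        task_id="run_transformations",
        bash_command="echo 'run transformations here'",
    )

    validate_sales_contract >> run_transformations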

Conclusion

In conclusion, maintaining high data quality is crucial for any project and organization that relies on data-driven decision-making. This article focused on the dynamic nature of data and the inevitable changes in business and technical rules that can compromise the integrity of data workloads, and on how data contracts between data producers and data consumers, together with the related data quality methods and tools, allow for more efficient and robust data change processes.

Remember that these tools and methodologies only work if all parties involved take full responsibility for the pipelines and products they generate, and that data quality should always be a first-class citizen in your data stack. A data contract is, in the end, a trust bridge between the operational plane and the analytical plane.

Data contract as an agreement between an operational plane, as a data producer, and an analytical plane, as the consumer.
Data contracts agreement between operational and analytical planes. Image sourced from Data Contracts Wrapped.

Furthermore, every tool has its pros and cons, and the decision to select one should be based on your team’s skill set and preferences. Also worth mentioning is that the introduction of a tool like the Data Contract CLI represents a significant advancement toward simplifying and abstracting data contract implementation, and that its open-source approach will hopefully foster collaborative efforts to create a tool that everyone appreciates using in their daily activities.

Additionally, an interesting blog piece, on which some of the previous topics were based, is available on the official page of the Open Data Contract concept at opendatacontract; it goes deeper into the current landscape and dives into the Problem with data and the Solution that gave rise to the data contract concept.

Thank you

If you enjoyed reading this article, stay tuned as we regularly publish technical articles on data stack technologies, concepts, and how to put it all together to have a smooth flow of data end to end. Follow Astrafy on LinkedIn to be notified of the next article.

If you are looking for support on Modern Data Stack or Google Cloud solutions, feel free to reach out to us at sales@astrafy.io.