Our stack is divided into five pillars, and within each pillar we specialize in a few technologies that together can cover virtually any of your use cases. The best stack for your company is determined by multiple factors:
- Skills of your data team
- Big Data triangle (velocity, variety, volume)
- Predisposition to open-source technologies
We ingest data from your transactional databases and applications seamlessly, using point-to-point and no-code technologies. Our preferred tools are:
Our de-facto tool for ingestion. It is open-source, runs on Kubernetes, ships more than 100 native connectors and handles Change Data Capture (CDC).
Our reference licensed solution. Its no-code ecosystem and user-friendly UI make it very accessible.
- Custom connectors:
We also develop custom connectors using open-source frameworks such as singer.io, written in Python, Golang and Rust.
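As a sketch of what a Singer-style connector does (the stream name, schema and records below are hypothetical; a production tap would use the singer-python library and handle authentication and incremental sync), the tap simply writes SCHEMA, RECORD and STATE messages as JSON lines to stdout:

```python
import json
import sys


def write_message(msg: dict) -> None:
    """Singer taps communicate by writing one JSON message per line to stdout."""
    sys.stdout.write(json.dumps(msg) + "\n")


def run_tap(records: list) -> None:
    # 1. Announce the schema of the stream (hypothetical "users" stream).
    write_message({
        "type": "SCHEMA",
        "stream": "users",
        "schema": {
            "type": "object",
            "properties": {"id": {"type": "integer"}, "email": {"type": "string"}},
        },
        "key_properties": ["id"],
    })
    # 2. Emit one RECORD message per row extracted from the source.
    for record in records:
        write_message({"type": "RECORD", "stream": "users", "record": record})
    # 3. Emit a STATE message so the next run can resume where this one stopped.
    write_message({"type": "STATE", "value": {"users": {"last_id": records[-1]["id"]}}})


if __name__ == "__main__":
    run_tap([{"id": 1, "email": "a@example.com"}, {"id": 2, "email": "b@example.com"}])
```

A matching Singer target then reads these lines from stdin and loads them into the destination, which is what makes taps and targets freely composable.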
The backbone pillar, where data from the transactional systems is transformed into a new data model that feeds several downstream applications. Orchestrating those transformations, operating them at a fast pace and keeping them visible requires mastering a combination of best-in-class tools.
Google Cloud's analytical database. It has all the ingredients for hosting massive workloads:
- Advanced SQL language
- Serverless and fully managed
- Pay-as-you-go default pricing model
- Virtually infinite scale
Queries that move data from the landing zone up to the datamart layer require a robust engineering ecosystem. dbt brings this to the table by keeping your queries DRY, documented and integrated with the mainstream modern data stack tools.
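As an illustration of the DRY style dbt enables (model and column names here are hypothetical), a downstream mart references upstream models with `ref()` instead of hard-coding dataset paths, which lets dbt resolve the right BigQuery dataset per environment and build the dependency graph, documentation and lineage:

```sql
-- models/marts/fct_orders.sql (hypothetical dbt model)
select
    o.order_id,
    o.ordered_at,
    c.customer_name
from {{ ref('stg_orders') }} as o
left join {{ ref('stg_customers') }} as c
    on o.customer_id = c.customer_id
```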
Analytical processing with BigQuery covers batch and near-real-time workloads. For streaming, real-time use cases we leverage Dataflow, Google Cloud's fully managed, serverless runner for Apache Beam.
Our de-facto orchestrator that puts it all together. We use Airflow to run dbt, trigger Dataflow jobs, notify stakeholders about new data, and more. Airflow glues together ingestion, transformation and distribution in a unified way.
- Data modelling:
We believe data must be treated as a product and modelled accordingly. Data mesh is the emerging modelling paradigm in that regard, and we use tools such as dbdiagram, dbdocs and dbml to model your data before implementing anything.
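For instance, a minimal DBML sketch of two hypothetical tables and their relationship looks like this:

```
Table customers {
  id int [pk]
  email varchar [unique, not null]
}

Table orders {
  id int [pk]
  customer_id int [ref: > customers.id]  // many-to-one relationship
  ordered_at timestamp
}
```

The same DBML file feeds dbdiagram for visual diagrams and dbdocs for shareable documentation, so the model is agreed upon before any table is created.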
Getting value out of the data is the ultimate goal. Pillars 1 and 2 are the foundations that allow this third pillar to flourish. We focus on four forms of distribution:
- Business Intelligence:
State-of-the-art dashboards using Looker, Apache Superset or Lightdash, giving you accurate, descriptive insights into your business.
- Machine Learning:
Knowing the past and present of your company through data is a prerequisite nowadays; taming and acting on the future is where the competitive advantage lies. We strive to deliver robust ML models for a multitude of use cases (predictions, recommendations, etc.), using Vertex AI (Google Cloud's fully managed, serverless platform) as our cornerstone to develop, train and deploy those models.
- Reverse ETL:
Insights from transformed data are fed back into the upstream databases and applications. We use Grouparoo for open-source deployments and Census for enterprise-grade ones.
- API product:
Self-service access to datamart data via Hasura, which provides instant GraphQL on all your BigQuery data.
For a more granular approach to your API product, we accompany you with Apigee.
Data quality, observability, security and privacy are becoming must-haves for data ecosystems due to stricter regulations and business needs. Data governance tools are booming, which can lead to confusion when selecting one. Our team has experience with the mainstream enterprise-grade solutions as well as the major open-source tools. Our approach to data governance starts with a study of your data ecosystem to determine the best selection of tools. Below is a non-exhaustive list of top-notch solutions covering the spectrum of data governance needs:
- Data quality:
“Great Expectations” (GE) has emerged as the leading solution thanks to its approach of bringing software engineering testing best practices to the data world. We have advanced experience with GE and have it fully integrated with dbt and Airflow.
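To illustrate the idea behind expectation-based testing (this is a hand-rolled sketch in plain Python with hypothetical column names, not Great Expectations' actual API), a check over a batch of rows returns a result object with a success flag and the offending rows:

```python
def expect_column_values_to_not_be_null(rows, column):
    """Expectation-style check: the column must never be null."""
    failures = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {"success": not failures, "unexpected_rows": failures}


def expect_column_values_to_be_between(rows, column, min_value, max_value):
    """Expectation-style check: the column must fall inside [min_value, max_value]."""
    failures = [
        i for i, row in enumerate(rows)
        if row.get(column) is None or not (min_value <= row[column] <= max_value)
    ]
    return {"success": not failures, "unexpected_rows": failures}


if __name__ == "__main__":
    batch = [
        {"order_id": 1, "amount": 25.0},
        {"order_id": 2, "amount": None},  # fails both expectations
        {"order_id": 3, "amount": 990.0},
    ]
    result = expect_column_values_to_not_be_null(batch, "amount")
    print(result)
```

In a pipeline, a failed expectation of this kind would stop the corresponding Airflow task before bad data reaches the downstream marts.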
- Data security:
A lot can be done via Google Cloud IAM, BigQuery column-level security and AEAD encryption. For an enterprise-grade approach, Immuta comes to the rescue with a fine-grained, UI-friendly platform.
- Data Catalog:
The holy grail is achieved when you have a unique place where your data team and stakeholders can collaborate and find all the information they need about your data (Business Glossary, technical definitions, lineage, PII flags, etc.). For an open-source approach, we use Amundsen hosted on Kubernetes. For an enterprise-grade approach, we use Atlan.
- Data Observability:
You can’t control what you don’t observe. Every single one of your data pipelines needs to be monitored and plugged into proper notification tools. We offer a suite of observability tools so you can master your end-to-end data process.
The glue that binds the five pillars together. None of them can last on its own without automation all along the way. We use the following tools to automate this process:
- Terraform for everything related to Infrastructure as Code
- Gitlab CI for Continuous Integration (CI) pipelines
- Spinnaker for Continuous Delivery (CD) pipelines
- Kubeflow pipelines on Vertex AI for end-to-end MLOps
We are convinced, through many years of experience across the mainstream clouds, that Google Cloud is one step ahead as far as data analytics and Machine Learning are concerned. Each of our engineers is fully proficient in Google Cloud, with at least five Google Cloud certifications. While our expertise lies on the analytics side, we are equally comfortable with areas like networking and infrastructure, so they will never be lasting blockers while implementing your analytical solutions.
In order to realize your data journey efficiently, we host most of the open-source technologies we use on GKE (Google Kubernetes Engine), which lets us leverage its scalability, high availability and security features.
Knowledge transfer is a fundamental part of each of our projects. We do not deliver “one-shot” projects; we are committed to making our solutions last and evolve over time through your team. We do this by involving and training your data team at each step of the implementation. At the end of the project, we accompany your team with a mix of maintenance and custom training. The journey only ends once you can fly on your own with your modern data stack in place.
We do not sell a product; we sell a data journey built on Astrafy's pillars and values.
Our coding rules
- Thorough design before any implementation
- Documentation is as important as code
- Open-source over license
- Easy to maintain over fast and complex
- Python, Golang and Rust