In this lesson we will:
- Introduce dbt, the leading tool for data transformations within the Modern Data Stack;
- Look at an example of a dbt model;
- Explain some of the high level benefits of using dbt.
What Is dbt?
dbt is an open source tool that is increasingly being deployed within modern data stacks to carry out data transformations.
Data transformations involve taking some source data and changing it in some way, into some other format, structure or derived dataset. Example transformations that we might wish to carry out include:
- Cleaning the data - e.g. removing duplicated data or changing it into a standardised format;
- Summarising and aggregating the data - e.g. turning individual events into daily summaries and aggregations;
- Combining and joining data - e.g. building a joined up view from multiple source datasets;
- Carrying out rudimentary statistical analysis - e.g. to highlight trends and enable insights into the data.
Of course, reports, dashboards and other data consumers could be built on top of raw and unprocessed data. However, the art of Data Engineering is all about transforming data so that data analysts and other business users are presented with something much cleaner, better structured, more controlled and consistent.
Companies have been doing this type of Extract, Transform and Load (ETL) work around their data warehouse for many years, and it's a very mature field with established patterns and tooling. However, dbt's approach is really a huge step forward in modernising the process and making it much more efficient and robust.
A Brief Example
The best way to begin understanding dbt is with an example.
The central concept of dbt is the model, describing one entity, such as a table or a view, which we wish to build in our database.
In the example below, we have defined a single dbt model which interrogates our customers table, and builds a view of our highest value customers:
{{ config( materialized='view' ) }}
with high_value_customers as (image
select id, name, total_spend from customers where total_spend > 500
)
select * from high_value_customers;
As you can see, the model is defined using SQL, with a few simple directives specified in a templating language called Jinja.
Here, we are selecting from source source table customers, and building a view named high_value_customers. This object will be created when we perform a dbt run.
Benefits Of Using dbt
Of course, we can define our tables and views directly within our database, so the above may seem simple and of little benefit. Already though, consider the following benefits:
- The model is properly scripted in a source controlled file. We can then use it to build and rebuild our tables in a repeatable and controled way;
- This file can be bought into a modern development process, including source control, CI/CD, pull requests, code review, audit process etc;
- Models can be executed by dbt in a strict order where we have dependencies between models;
- We can use the Jinja directives to add logic and intelligence into how our models are built, for instance building up models incrementally;
- As dbt becomes a de-facto standard, any analytics engineer can immediately begin working with the project.
In the next lesson we will discuss these benefits in more detail.
dbt Core and dbt Cloud
dbt comes in two flavours.
dbt core is an open source software that is primarily.
dbt cloud is a cloud hosted managed service ran by dbt labs. Using dbt Cloud provides an integrated development environment and