In this lesson we will:
- Learn about the materialisation options for our dbt models;
- Explain the considerations when to use each one.
Materialisation Options
As we have demonstrated, dbt works by taking our model definitions and creating or materialising database objects.
When learning dbt, the natural way to think about this is to materialise tables, whereby each model produces one table.
We do however have options, and can instead materialise Views, Ephemeral or Incremental models in addition to tables.
In this lesson we will describe each of these options.
Tables
Tables are of course the heart a relational database, consisting of a set of typed columns containing multiple rows of data.
order_id | pizza_type | num_ordered
----------+------------+-------------
1 | Margharita | 1
2 | Hawawain | 5
3 | Margharita | 6
4 | Mushroom | 2
5 | Mushroom | 4
6 | Meat Feast | 1
7 | Margharita | 4
8 | Meat Feat | 2
9 | Meat Feat | 4
(9 rows)
When we materialise a table in dbt, we are typically taking some source tables and materialising some derived table. For instance, maybe we want to aggregate the above table to count the total number of pizzas of each type ordered:
pizza_type | num_ordered
------------+-------------
Margharita | 11
Hawawain | 5
Meat Feast | 7
Mushroom | 6
(4 rows)
This is the simplest type of transformation we could carry out in dbt.
Considerations When Materialising Tables
- The materialised table is stored on disk and requires storage;
- The materialised table is a snapshot in time, meaning that the data in it will go out of date and need to be periodically refreshed by dbt;
- This can potentially speed up query performance. If the query is slow or complex to calculate then we can avoid the need to do that repeatedly by pre-calculating results in this way.
Views
Views can be thought of as named queries, giving us a convenient and semantic way to issue a query.
In a typical relational database, we could create a view of the largest pizza orders like so:
create view large_orders as
select * from pizza_orders where num_ordered >= 4;
Then we can select directly from the view:
select * from large_orders
And only be presented with the large orders.
order_id | pizza_type | num_ordered
----------+------------+-------------
2 | Hawawain | 5
3 | Margharita | 6
5 | Mushroom | 4
7 | Margharita | 4
9 | Meat Feat | 4
(5 rows)
If another order with more than 4 orders is inserted into the pizza_orders table, this would subsequently be returned in the large_orders view the next time it is queried.
Views are simple convenience functions, giving us a mechanism for defining, naming and then accessing some view into the data.
Considerations When Materialising Views
- The actual data still remains in the underlying pizza_orders table and is not duplicated into the large_orders view. There is therefore no additional storage required.
- The view is always up to date as soon as data is modified in the underlying table;
- Views are slower to query than materialising tables as there is no pre-calculation occuring. This means that views are usually more appropriate when we have smaller datasets and less complex queries;
- Materialising views is faster than materialising tables, which will speed up your dbt development and testing cycles;
- By default, dbt will materialise to views if no option is explicitly set.
Ephemeral Views
In some instances, it is useful to model an entity in our dbt code, but not explicitly create a table or a view in the database.
To continue with the example above, perhaps the large_orders model is a useful concept as we develop our data pipelines, but is not something a data analyst or business user would ever need.
In this case, it may make sense to introduce an ephemeral dbt model.
The model is defined in the usual way, but marked with ethemeral:
{{ config(materialized='ephemeral') }}
select * from pizza_orders where num_ordered >= 4;
When we then reference it int he usual way,
select * from {{ref('large_pizza_orders')}}
The select statement which defines the view will be interpolated into the model at compile time.
The advantage of Ethemeral models is that it keeps the database clean, but the dbt code is much more organised and readable as a result of explicitly modelling the entity.
Considerations When Materialising Ethemeral
- Ethemeral models can be harder to debug as there is no view or table created in the database;
- Ethemeral models could be slow, as they are built up by simply interpolating SQL which could get hard for the database to optimise;
- Ethemeral views can be tested using dbts testing framework, making it.
Incremental Models
When we execute a dbt run, the usual mode is that the entire data pipeline is executed again.
If we create a series of tables, the new one will be created in full, the old one dropped, and the names switched.
Though this has some advantages, it falls down when we have very large volumes of data and need.
Incremental models allow us to build up the model in increments each time dbt runs.
The incremental option is set in the same way:
{{ config(materialized='incremental') }}
select * from pizza_orders where num_ordered >= 4;