In this lesson we will:
- Learn about dbts seed data feature;
- Explain how seed data should appropriately be used.
What Is dbt Seed Data?
Most of our work with dbt will involve taking some source data ingested from some application, and transforming it into our views and tables in our desired destination formats and structured.
As part of this process, we may require external static data or lookup tables to enable the transformation. For instance:
- To map short country codes to country names (e.g. US > United States);
- To map a list of product codes to longer descriptions (e.g. PZ04 > Pepperoni Pizza)
- Something specific to the business domain such as a hard coded list of staff or office locations.
dbts Seed Data feature allows us to store this data directly within the dbt project formatted as a CSV file where it can be properly source controlled, and then bought into the Data Warehouse as part of the dbt run.
The advantage of managing seed data in this way is that it becomes scripted, and can be bought into source control as part of the dbt project. When we have this seed static data being loaded into the warehouse in a reliable way, we can then build our transformations, safe in the knowledge that the static data and lookup tables we need will be present prior the transformation running.
Correct Use
This seed data feature is not designed as a production quality data loading tool to bring large data extracts into your warehouse. Rather, it is there to support ad-hoc loads of small static data tables required to support and enhance the transformations which you write.
Using Seed Data
Seed data is stored within the dbt project as a CSV file inside the data folder of your project. We can create an example like so:
cd data
touch country_codes.csv
We can then edit the file:
nano country_codes.csv
And paste the following CSV formatted data:
Code, Country_Name
US, United States
UK, United Kingdom
FRA, France
AUS, Austria
CH, China
After saving the file, we can now import the seed data file into the database with the following command:
dbt seed
If we log back into Postgres, we will see that we have a table created which is named was based on the CSV file.
postgres -TBC
use pizzastore_analytics
select * from country_codes
Outputs:
TBC
Note that the column types are automically inferred from the data when seeding data. If you wish to specify a different type for the columns, this can be overridden in dbt YAML files.