In this lesson we will:

  • Learn about dbts seed data feature;
  • Explain how seed data should appropriately be used.

What Is dbt Seed Data?

Most of our work with dbt will involve taking some source data ingested from some application, and transforming it into our views and tables in our desired destination formats and structured.

As part of this process, we may require external static data or lookup tables to enable the transformation. For instance:

  • To map short country codes to country names (e.g. US > United States);
  • To map a list of product codes to longer descriptions (e.g. PZ04 > Pepperoni Pizza)
  • Something specific to the business domain such as a hard coded list of staff or office locations.

dbts Seed Data feature allows us to store this data directly within the dbt project formatted as a CSV file where it can be properly source controlled, and then bought into the Data Warehouse as part of the dbt run.

The advantage of managing seed data in this way is that it becomes scripted, and can be bought into source control as part of the dbt project. When we have this seed static data being loaded into the warehouse in a reliable way, we can then build our transformations, safe in the knowledge that the static data and lookup tables we need will be present prior the transformation running.

Correct Use

This seed data feature is not designed as a production quality data loading tool to bring large data extracts into your warehouse. Rather, it is there to support ad-hoc loads of small static data tables required to support and enhance the transformations which you write.

Using Seed Data

Seed data is stored within the dbt project as a CSV file inside the data folder of your project. We can create an example like so:

cd data
touch country_codes.csv

We can then edit the file:

nano country_codes.csv

And paste the following CSV formatted data:

Code, Country_Name
US, United States
UK, United Kingdom
FRA, France
AUS, Austria
CH, China 

After saving the file, we can now import the seed data file into the database with the following command:

dbt seed

If we log back into Postgres, we will see that we have a table created which is named was based on the CSV file.

postgres -TBC
use pizzastore_analytics
select * from country_codes 

Outputs:

TBC

Note that the column types are automically inferred from the data when seeding data. If you wish to specify a different type for the columns, this can be overridden in dbt YAML files.

Next Lesson:
23

Seed Data

In this lesson we will use dbts seed data feature to reliably populate our database with static data for use as part of dbt transformations.

0h 15m




Work With The Experts In Real-Time Analytics & AI

we help enterprise organisations deploy powerful real-time Data, Analytics and AI solutions based on ClickHouse, the worlds fastest open-source database.

Join our mailing list for regular insights:

We help enterprise organisations deploy advanced data, analytics and AI enabled systems based on modern cloud-native technology.

© 2024 Ensemble. All Rights Reserved.