In this lesson we will:
- Create our first dbt project and show how it can be administered and validated at the command line.
Creating The Project
We can begin by creating a new project with the dbt init command.
dbt init pizzastore_analytics
The CLI will ask which database the project will connect to. The list presented will depend on which specific connectors you have installed.
Which database would you like to use?
[1] clickhouse
[2] postgres
For now, enter 2 to select postgres.
Creating the project should give successful output such as:
Your new dbt project "pizzastore_analytics" was created!
For more information on how to configure the profiles.yml file,
please consult the dbt documentation here:
https://docs.getdbt.com/docs/configure-your-profile
One more thing:
Need help? Don't hesitate to reach out to us via GitHub issues or on Slack:
https://community.getdbt.com/
Happy modeling!
As the output points out, one of our important next steps is to configure the dbt profile to describe which data warehouse to connect to and how.
In this instance, because we are using Postgres, we would need to set the connection details for our Postgres instance such as the host, port, username and password.
However, before doing that, let's first explore the project structure which has just been created:
cd pizzastore_analytics
ls -la
Which should give us something like this:
drwxr-sr-x 8 root root 4096 Nov 24 15:06 .
drwxr-xr-x 1 root root 4096 Dec 6 14:02 ..
-rw-r--r-- 1 root root 28 Nov 24 15:06 .gitignore
-rw-r--r-- 1 root root 571 Nov 24 15:06 README.md
drwxr-sr-x 2 root root 4096 Nov 24 15:06 analysis
drwxr-sr-x 2 root root 4096 Nov 24 15:06 data
-rw-r--r-- 1 root root 1339 Nov 24 15:06 dbt_project.yml
drwxr-sr-x 2 root root 4096 Nov 24 15:06 macros
drwxr-sr-x 3 root root 4096 Nov 24 15:06 models
drwxr-sr-x 2 root root 4096 Nov 24 15:06 snapshots
drwxr-sr-x 2 root root 4096 Nov 24 15:06 tests
The folders created here contain the following:
- Models - Models are your core transformations, which take source data and output destination tables or views;
- Analysis - Analyses are temporary models which we need as intermediate steps, but which are not persisted to the database;
- Data - Static and seed data which your transformations may need;
- Macros - Macros are reusable code blocks which can be shared by multiple transformations;
- Snapshots - Snapshots capture the state of the database at a point in time;
- Tests - Tests verify the correctness of your transformations.
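To make this concrete, a model in the models folder is simply a SQL file containing a SELECT statement, which dbt materialises as a table or view named after the file. The following is a hypothetical sketch (the raw_orders source table and column names are assumptions, not part of the generated project):

```sql
-- models/store_order_counts.sql (hypothetical example)
-- dbt will materialise this SELECT as a view or table named store_order_counts
select
    store_id,
    count(*) as order_count
from raw_orders
group by store_id
```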
We will learn more about all of these elements of dbt throughout the remainder of the course, but it is worth familiarising yourself with the project structure at this stage.
dbt_project.yml file
The dbt_project.yml is a configuration file in the root of the project which allows us to specify project-level details such as the paths where the various components are found.
Edit the file using the following command:
nano dbt_project.yml
And you will be presented with the following file:
name: 'my_new_project'
version: '1.0.0'
config-version: 2
profile: 'default'
source-paths: ["models"]
analysis-paths: ["analysis"]
test-paths: ["tests"]
data-paths: ["data"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
target-path: "target" # directory which will store compiled SQL files
clean-targets: # directories to be removed by `dbt clean`
  - "target"
  - "dbt_modules"
models:
  my_new_project:
    example:
      +materialized: view
It is worth replacing the project name my_new_project with our project name pizzastore_analytics, as dbt does not set this automatically.
The main reason we would usually touch this file is to set model-specific configuration in the models: section. However, there are other, often better, ways to configure your models, meaning that this file is not always used for that purpose.
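For example, rather than setting the materialisation for a model in dbt_project.yml, the same setting can be applied directly inside the model file itself using a config() block, which is a standard dbt feature:

```sql
-- at the top of a model .sql file; overrides the project-level setting
{{ config(materialized='table') }}

select ...
```

Configuration placed in the model file takes precedence over dbt_project.yml, which keeps the setting next to the code it affects.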
For now, we can exit the file.
dbt debug
dbt debug is a useful command to run at this stage to check that your project is configured correctly and that you have connectivity to the databases in the linked profile.
dbt debug
Output:
Running with dbt=0.21.0
dbt version: 0.21.0
python version: 3.8.10
python path: /usr/bin/python3
os info: Linux-5.10.47-linuxkit-x86_64-with-glibc2.29
Using profiles.yml file at /root/.dbt/profiles.yml
Using dbt_project.yml file at /pizzastore_analytics/dbt_project.yml
Configuration:
profiles.yml file [ERROR invalid]
dbt_project.yml file [OK found and valid]
Required dependencies:
- git [ERROR]
2 checks failed:
Profile loading failed for the following reason:
Runtime Error
Credentials in profile "default", target "dev" invalid: ['dbname'] is not of type 'string'
As you can see, this provided a useful summary of the project and attempted to connect to the database to validate the connectivity details. Because we haven't yet configured our database or profile, we see an error.
Populating profiles.yml
So that we can progress, let's enter some database credentials into the profiles.yml file.
nano ~/.dbt/profiles.yml
Delete the existing content and paste these details into the file:
default:
  outputs:
    dev:
      type: postgres
      threads: 1
      host: localhost
      port: 5432
      user: postgres
      pass: postgres
      dbname: pizzastore
      schema: dev_pizzastore
    prod:
      type: postgres
      threads: 1
      host: localhost
      port: 5432
      user: postgres
      pass: postgres
      dbname: pizzastore
      schema: prod_pizzastore
  target: dev
We will explain the contents of this file in more detail in the next lesson. For now, this step allows us to proceed with the lesson by having a well-formed profiles file.
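As an aside, credentials such as pass do not have to be hardcoded in this file. dbt supports the env_var() function in profiles.yml, which reads the value from an environment variable at runtime. A fragment illustrating this (the DBT_PASSWORD variable name is our own choice):

```yaml
# fragment of ~/.dbt/profiles.yml reading the password from the environment
pass: "{{ env_var('DBT_PASSWORD') }}"
```

This keeps secrets out of files that may end up in version control.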
dbt parse
dbt parse is a useful command to use during your development workflow.
It will verify that all of your source code is correctly structured, including your profiles.yml file that we just populated.
Integrating frequent calls to dbt parse into your workflow can speed up the development cycle, rather than waiting for a long run to complete only to find that you accidentally introduced a typo into one of your transformation files.
dbt parse
Output:
Running with dbt=0.21.0
14:28:13 | Start parsing.
14:28:13 | Dependencies loaded
14:28:13 | ManifestLoader created
14:28:14 | Manifest loaded
14:28:14 | Manifest checked
14:28:14 | Flat graph built
14:28:14 | Manifest loaded
14:28:14 | Performance info: target/perf_info.json
14:28:14 | Done.
As you can see, some timing information is also included, which may be useful if you have very large projects and wish to optimise the build time.
dbt ls
dbt ls lists all of the "resources" defined in your project, where resources are things like models, tests, analyses and seed data. This can be a good way to get an overview of the entire project and to ensure that your expected resources are correctly identified by dbt.
dbt ls
Output:
my_new_project.example.my_first_dbt_model
my_new_project.example.my_second_dbt_model
my_new_project.schema_test.not_null_my_first_dbt_model_id
my_new_project.schema_test.not_null_my_second_dbt_model_id
my_new_project.schema_test.unique_my_first_dbt_model_id
my_new_project.schema_test.unique_my_second_dbt_model_id
These resources are example transformations and tests which were created when we ran dbt init. They are useful to copy from as you begin to build out your own code, but you will likely delete them early in the process.
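The four schema tests in the listing above come from a schema.yml file which dbt init generates alongside the example models. It looks roughly like the following (descriptions elided), with the unique and not_null tests declared against the id column:

```yaml
# models/example/schema.yml (approximate shape of the generated file)
version: 2

models:
  - name: my_first_dbt_model
    columns:
      - name: id
        tests:
          - unique
          - not_null
  - name: my_second_dbt_model
    columns:
      - name: id
        tests:
          - unique
          - not_null
```

Each entry under tests becomes one of the schema_test resources that dbt ls reported.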