Using Multiple Databases For Your dbt Workflow

Why Do We Need Multiple Databases With dbt?

The most important database or data warehouse will be your production instance. This is the one that your live data will be loaded into and which your users access for their reports, dashboards and analytics. It is important therefore to protect the quality and accuracy of this instance.

One of the first steps to doing so is to introduce controls and processes around your production database. These might include the need to test before changes are rolled out, or locking down permissions such that only database administrators can drop or create objects.

Beyond this, one of the biggest steps a data team can take to protect their production instance is to introduce a test database. Ideally, production like data and code will be loaded into this database, and any changes can be tested there. This includes testing data loading code, transformations, performance and the end result reports and dashboards.

Because we want to keep our test database "production like", it might then make sense to introduce more databases for early stage testing or performance testing. We could also implement a model where each Data Engineer has their own database for their own work.

This sounds like an explosion of databases, and it is not without cost. However, the idea is that changes progressively get more tested as they flow from developer to test to production. This ultimately increases reliability and quality which is important for all businesses.

How Widespread Is This Practice?

To Software Engineers, the idea of multiple environments is very familiar and commonplace. In the data world, this is less common, and there are many teams that either don't do this at all or don't do it well.

For the teams that don't do it at all, they would essentially work directly in production and fixing any issues that arise on an ad-hoc basis. Though this can work for a long time, it harms the experience of your users and also presents risk of corrupting data irreperably.

There are a second class of teams who have implemented development and testing environments for their data, but use them sporadically. Sometimes people might bypass testing environments, or let the data in them become very stale. When the tables, views and data in the different environments drift out of line to such an extent it almost isn't worth doing any pre-production testing. This situation is very common, with data teams working on big, business critical systems having a flawed development workflow around their data environments.

With the move to Data Engineering and dbt, this situation should improve. For the first time, data teams are really focussed on applying these practices to their workflows.

How dbt Helps With Multiple Databases

dbt makes defining this pipeline easier than it otherwise would be:

Seperation Of Targets

dbt has the concept targets which are defined in the profiles.yml file, outside of your dbt project. When you run your project, you can select a target. This allows us to deploy the same code into multiple target databases.

As an example, we might define five profiles representing our path to production:

bens_dev:
dev:
test:
performance_test:
prod:

We can then dbt run against each of these profiles:

This seperation of project code and deployment targets, gives us the ability to deploy our code to different environments.

User local profiles.yml

When dbt is looking for connections details, it looks for targets in a profiles.yml file.

profiles.yml can be stored in multiple places, and it has an order of precedence whereby the one in your home directory overrides the one in the project.

This means that a developer can point at their own database or schema when working on their laptop, let a CI server deploy to test using a profiles.yml in the repo.

Where Ensemble CI Comes In

Though dbt has the powerful set of tools for building a developer workflow, it does not enforce. There is nothing ensure that I forget to deploy to the performance testing environment or deploy straight to production. We are then back to having a problem of environment drift.

Ensemble CI attempts to put a development process around the Data Engineering lifecycle by building around dbt Core. It means that all changes flow through a defined path to production, and there are controls in place to ensure that changes are properly testing.