How dbt Helps Data Engineers Work Like Software Engineers

Bringing Software Engineering Practices Into Data

As Software Engineering has matured, Developers have adopted practices that allow them to deliver faster without sacrificing quality. These practices include source code control, branching and merging, continuous integration, continuous delivery, automated testing and many more.

Data teams have historically been slower to adopt these practices. Though some of the reasons are down to skills and culture within the data world, one of the more significant reasons has been the heavyweight, proprietary, GUI-driven tools that engineers were forced to use.

Fortunately, with the evolution of Data Engineering and with lightweight, developer-friendly tools such as dbt, these practices are taking hold, and Data Engineering increasingly looks like Software Engineering: more rigorous, more software-defined, more automated and with a strong focus on quality.

In this article, we provide more colour on the practices of interest, and explain how dbt helps to achieve each of them.

Transformations As Code

Many ETL tools used by data teams are GUI based. The ETL logic is implemented within the tool, often by clicking and dragging connections between database tables.

dbt breaks out of these proprietary GUIs and turns transformations into readable code, primarily SQL with a sprinkle of Jinja. This code can be edited in any text editor or IDE, giving data engineers the same free choice of tooling that software developers enjoy.

There is also a parallel here with the IT infrastructure world. Where operations engineers historically had to use awkward GUI tools, they are now moving towards scripting and code-based declaration of their environments using tools such as [Terraform](https://www.terraform.io/).

The direction of travel is a world where applications, data and infrastructure are all defined as open, declarative code.

Source Control

Once we have our transformations implemented as code, they can then be placed into a source control system such as Git in the same way that developers manage their application code.

By doing this, we then get a record and audit trail of who changed what and when within the transformation logic. If a bug is introduced, we can revert to previous versions simply by pulling the code from source control.

Source control also allows data engineers to use practices such as branching and merging, which make the development process more efficient and let them work in parallel in a scalable and robust way.

Modularity

Traditional ETL scripts are known for being fragile and hard to maintain. They can often be full of interconnections and require knowledge about dependencies such as the order in which scripts should be run. It is also common to find duplicated code, meaning that if business requirements change, we often need to make changes in multiple locations which is very error prone.

dbt is designed with a much more modular structure than earlier tools. We define a transformation once, then point to it from other transformations using *references*. This keeps everything nicely encapsulated in a step-by-step pipeline that avoids repetition and lets the run order be worked out from the references themselves.
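To make the idea of references concrete, here is a minimal Python sketch of how a run order can be derived from the references alone. It is an illustration of the concept, not dbt's actual implementation: the model names and SQL are hypothetical, and the regular expression only handles the simple single-quoted `{{ ref('...') }}` form.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical models: name -> SQL text using dbt-style ref() calls.
MODELS = {
    "stg_orders": "select id, customer_id, amount from raw.orders",
    "stg_customers": "select id, name from raw.customers",
    "customer_orders": (
        "select c.name, sum(o.amount) as total "
        "from {{ ref('stg_orders') }} o "
        "join {{ ref('stg_customers') }} c on c.id = o.customer_id "
        "group by c.name"
    ),
}

# Matches {{ ref('model_name') }} and captures the model name.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*'([^']+)'\s*\)\s*\}\}")


def find_refs(sql: str) -> set[str]:
    """Return the names of all models referenced in a piece of SQL."""
    return set(REF_PATTERN.findall(sql))


def run_order(models: dict[str, str]) -> list[str]:
    """Order models so that each one runs after the models it references."""
    graph = {name: find_refs(sql) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


order = run_order(MODELS)
print(order)  # both staging models appear before customer_orders
```

Because the dependencies live in the code itself, nobody has to remember or document the order in which scripts should be run.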

dbt also gives us features such as macros, packages and reusable test suites, which sit outside of the core transformation code and allow for additional reuse.

Versioning

Developers will often version their code and use version numbers to tie changes to specific milestones. At all times, they will know which version of their code is running in which environment, and which specific changes an upgrade would bring, usually by consulting release notes.

This type of traceability is often present in application code, but is greatly lacking in traditional ETL setups, where environments drift out of line with each other and teams lose visibility of which changes are being deployed.

Because dbt uses source controlled assets, applying versions to our changes and better understanding the actual changes associated with each deployment is much more viable.

Automated Testing

Developers and QA Engineers often implement automated unit and integration testing to improve the quality of their code and identify issues earlier in the development lifecycle.

In the data world, there has been a much higher dependency on manual testing by developers and QA engineers. This can require a lot of resource, because not only is the code changing, but the data is continually evolving underneath them. Testing ends up being very ad-hoc, and production data quality inevitably suffers.

dbt allows developers to test their transformation logic in the same way, by running automated checks against the materialized data. For instance, we can check that the number of rows is as expected, that no NULLs are present, and that numbers are within expected ranges. By running these checks every time, immediately after the transformation runs, we can catch bugs in both code and data before they move to production.
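One common way to express such checks is as a query that selects the offending rows, so that a check passes when the query returns nothing. The sketch below shows that pattern in Python against an in-memory SQLite database. The `orders` table, the check names and the thresholds are all assumptions made for illustration, and this is not how dbt itself is invoked.

```python
import sqlite3

# Hypothetical materialized table with one deliberately bad row (NULL amount).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table orders (id integer, amount real);
    insert into orders values (1, 20.0), (2, 35.5), (3, null);
    """
)

# Each check is a query that returns the rows violating the expectation.
CHECKS = {
    "amount_not_null": "select * from orders where amount is null",
    "amount_in_range": "select * from orders where amount < 0 or amount > 1000",
    "expected_row_count": (
        "select n from (select count(*) as n from orders) where n <> 3"
    ),
}


def run_checks(connection, checks):
    """Run every check and return the number of failing rows per check."""
    return {
        name: len(connection.execute(sql).fetchall())
        for name, sql in checks.items()
    }


results = run_checks(conn, CHECKS)
failed = [name for name, failing_rows in results.items() if failing_rows > 0]
print(failed)  # the NULL amount is reported by amount_not_null
```

Running the same set of checks after every transformation means a failure points to either a code change or a change in the underlying data, rather than being discovered later by a business user.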

Continuous Integration & Continuous Delivery

Developers usually implement automatic processes to build, test and deploy their software without manual steps. High performing teams are even moving towards continuous delivery, where changes are pushed very frequently without compromising the stability of their application.

Whereas GUI tools were hard to automate as part of an SDLC, it is much more viable to integrate dbt into this process. This gives data engineers very fast feedback and gets their work into the hands of their users quickly once the automated testing steps have passed.

Our platform, Ensemble CI, is a CI/CD platform built specifically for this use case: supporting Data Engineers who use dbt.

Documentation

Software Engineers know that the ability to change and maintain code over time is critical, and is both more important and more challenging than the initial development.

dbt moves this forward in the data world by including a number of features to make the models self-documenting and include documentation inline alongside models. This documentation can then be extracted into an automated site which acts as a continually updated data dictionary.
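The principle of generating a data dictionary from descriptions kept next to the models can be sketched in a few lines of Python. The metadata structure and the plain-text output below are hypothetical and are not dbt's file format or its generated site; they only show why documentation that lives alongside the code stays up to date.

```python
# Hypothetical model metadata, kept next to the transformation code.
MODELS = [
    {
        "name": "customer_orders",
        "description": "One row per customer with their lifetime order total.",
        "columns": {
            "name": "Customer display name.",
            "total": "Sum of all order amounts for the customer.",
        },
    },
]


def render_data_dictionary(models):
    """Build a plain-text data dictionary from inline model descriptions."""
    lines = []
    for model in sorted(models, key=lambda m: m["name"]):
        lines.append(model["name"])
        lines.append(f"  {model['description']}")
        for column, description in model["columns"].items():
            lines.append(f"  - {column}: {description}")
    return "\n".join(lines)


dictionary = render_data_dictionary(MODELS)
print(dictionary)
```

Since the dictionary is rebuilt from the same files that define the models, a change to a model and a change to its documentation can be reviewed and released together.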

Why Does This Matter?

Ultimately, adopting these practices is about building more reliable, predictable, high-quality data transformation pipelines. This improves quality and builds confidence in the data that we are putting into the hands of our business users.

These practices also improve the experience and productivity of data teams. Instead of firefighting and battling quality issues, their role becomes more like "engineering", moving forward with quality and confidence.

Data Engineering is an incredibly important practice in helping businesses achieve more with their data, and dbt is the simple but powerful tool that makes this a reality.