Using Git is another example of a practice that is widely used by software engineers but less used by data teams.
As with many of these practices, this is fortunately changing as the Data Engineering approach takes hold and dbt becomes a de facto standard tool.
Git is a source code management system. Developers check their source code into a centralised Git repository, and then continually check in changes as their project evolves. At any time, new developers can check out the code and begin contributing to the same codebase in parallel. Git helps to merge the code of multiple developers and supports the development workflow by recording who made which change.
Branches allow us to take a copy of the code tree and work on it independently of the mainline. The developer makes their change on the branch, tests it and then merges it back into the mainline when it is ready. This keeps the mainline healthy and stable even whilst features are being developed.
A common pattern is to create a branch for each new feature that is developed, and then merge back into the mainline a feature at a time.
Pull requests can be used to facilitate code review, whereby the owner of the mainline branch can review the merge before accepting it.
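The feature-branch cycle described above can be sketched as a short sequence of Git commands. The Python snippet below is only an illustration: it builds the command lists rather than running them, and the feature name `customer-orders`, the mainline name `main` and the remote name `origin` are assumptions made for the example. In practice the final merge is often performed by the Git hosting platform when a pull request is accepted.

```python
from typing import List


def feature_branch_commands(feature: str, mainline: str = "main") -> List[List[str]]:
    """Return the Git commands for a simple feature-branch cycle.

    Nothing is executed here; each inner list could be passed to
    subprocess.run(...) from inside a real repository.
    """
    branch = f"feature/{feature}"  # hypothetical naming convention
    return [
        ["git", "checkout", mainline],      # start from the mainline
        ["git", "pull"],                    # pick up the latest changes
        ["git", "checkout", "-b", branch],  # create and switch to the feature branch
        # ... edit files here, then stage and commit the work ...
        ["git", "add", "--all"],
        ["git", "commit", "-m", f"Add {feature}"],
        ["git", "push", "--set-upstream", "origin", branch],  # publish for review
    ]


for command in feature_branch_commands("customer-orders"):
    print(" ".join(command))
```

Once the branch has been pushed, a pull request can be opened against the mainline so that the change is reviewed before it is merged.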
When data engineering teams use dbt, this practice can be extended directly into the Data Engineering Workflow.
Data Teams can have a mainline of dbt code which is deployed to production.
When a change or new feature is developed, Data Engineers can then branch the code, implement their change and test their fix against the same dataset.
If the code produces the correct data, it can then be merged back into the mainline, perhaps using a Pull Request, after which production is updated from the merged codebase.
There are a few things which we have to consider when using branches for Data Engineering.
dbt allows us to create different profiles and different targets that represent all of our databases. We need to develop a system whereby different branches are deployed into different targets in a predictable way. Without this, different branches could be overwriting each other and our database could end up in an unpredictable state.
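One way to make the branch-to-target mapping predictable is to derive the target and schema from the branch name with a fixed rule. The sketch below is a minimal example of that idea, not a prescribed dbt setup: it assumes a `profiles.yml` that defines targets named `prod` and `dev`, a production schema called `analytics`, and a `dev_` prefix for branch schemas, all of which are hypothetical names chosen for illustration.

```python
import re
from typing import List, Tuple


def schema_for_branch(branch: str) -> str:
    """Turn a Git branch name into a database-safe schema name."""
    # Lower-case, then replace runs of anything other than a-z, 0-9 or _ with "_".
    cleaned = re.sub(r"[^a-z0-9_]+", "_", branch.lower()).strip("_")
    return f"dev_{cleaned}"


def dbt_invocation(branch: str, mainline: str = "main") -> Tuple[str, str, List[str]]:
    """Return (target, schema, command) for a branch under the assumed convention."""
    if branch == mainline:
        target, schema = "prod", "analytics"  # hypothetical production names
    else:
        target, schema = "dev", schema_for_branch(branch)
    # `--target` selects which target from profiles.yml dbt connects to.
    command = ["dbt", "run", "--target", target]
    return target, schema, command


print(dbt_invocation("feature/Customer-Orders"))
print(dbt_invocation("main"))
```

Because the schema name is a pure function of the branch name, two different branches cannot end up writing to the same schema unless their names collide after cleaning. How the schema value reaches dbt (for example through an environment variable read in `profiles.yml`) is left to the team's own configuration.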
One of the main advantages of branches is that they allow developers to work concurrently. If databases are shared between developers, they could overwrite each other's changes and leave the database in an unpredictable state. Giving each developer their own database to work within helps to avoid this.
One of the challenges of branching in the Data World is that we may need to store and manipulate a lot of data.
The ideal situation is that each developer and each branch gets their own copy of the production data, so that they are working on production-realistic datasets.
The challenge is that we then have significant amounts of data to store and manage, and potentially some information security challenges if we allow the production data to proliferate.
Modern databases such as Snowflake can help to solve some of these challenges using Zero Copy Clones, which allow us to create multiple copies of the data without physically duplicating the underlying storage.
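As an illustration of how a clone might be provisioned for each developer and branch, the sketch below builds a Snowflake `CREATE DATABASE ... CLONE` statement. The source database name `ANALYTICS_PROD`, the `DEV_` naming convention and the helper function itself are assumptions for the example. The function only produces the SQL text, which would then be run through whichever Snowflake client the team already uses.

```python
import re


def clone_statement(developer: str, branch: str, source_db: str = "ANALYTICS_PROD") -> str:
    """Build a zero-copy clone statement for a developer's branch database."""
    # Keep only characters that are valid in an unquoted identifier.
    suffix = re.sub(r"[^A-Za-z0-9_]+", "_", f"{developer}_{branch}").strip("_").upper()
    clone_db = f"DEV_{suffix}"  # hypothetical naming convention
    return f"CREATE OR REPLACE DATABASE {clone_db} CLONE {source_db};"


print(clone_statement("alice", "feature/customer-orders"))
```

A clone still exposes the production data it was taken from, so the information security concerns mentioned above remain and access to the cloned databases needs to be controlled accordingly.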
Git branches and pull requests have a definite role to play in a modern development workflow for data teams.