In this lesson we will:
- Consider the common architectural and deployment patterns of Modern Data Platforms.
Architectural Tiers
Although Modern Data Stack deployments can be built from many different tools, they typically share the same architectural pattern:
- Extraction - Extracting data from data sources such as line of business applications, SaaS tools and operational databases;
- Transformation - Transforming the data such that it is cleaned, structured and processed in preparation for subsequent analysis;
- Storage - Storing the data in a persistent store such as a data warehouse or data lake;
- Consumption - Capabilities such as searching, reporting and dashboarding for use by Data Analysts, Data Scientists and business users.
Architecturally, these can be thought of as tiers or layers of the stack, each of which we can consider independently.
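To make the four tiers concrete, here is a minimal sketch in which each tier is a single Python function. It is illustrative only: the function names, the sample rows and the use of an in-memory SQLite database as a stand-in for a data warehouse are all assumptions, and the tiers run in the order they are listed above.

```python
import sqlite3


def extract():
    # Extraction tier: hypothetical rows standing in for a source system.
    return [
        {"order_id": 1, "customer": " Alice ", "amount": "10.50"},
        {"order_id": 2, "customer": "bob", "amount": "4.25"},
    ]


def transform(rows):
    # Transformation tier: tidy up names and convert amounts to numbers.
    return [
        {
            "order_id": r["order_id"],
            "customer": r["customer"].strip().title(),
            "amount": float(r["amount"]),
        }
        for r in rows
    ]


def store(conn, rows):
    # Storage tier: persist the rows (SQLite stands in for a warehouse).
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (:order_id, :customer, :amount)", rows
    )
    conn.commit()


def consume(conn):
    # Consumption tier: a query that a report or dashboard might issue.
    return conn.execute(
        "SELECT customer, amount FROM orders ORDER BY order_id"
    ).fetchall()


conn = sqlite3.connect(":memory:")
store(conn, transform(extract()))
report = consume(conn)
print(report)
```

Because each tier only depends on the output of the previous one, any single function could be swapped for a dedicated tool without changing the others, which is the sense in which the tiers can be considered independently.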
Extraction
A typical business needs to extract data from a variety of sources across the organisation. Common data sources include applications, SaaS tools, operational databases, and ad-hoc sources such as spreadsheets or data obtained through APIs.
Traditionally, this involved a lot of custom coding and scripting work to interact with APIs and databases. As part of the Modern Data Stack, however, we would likely turn to a tool such as Fivetran or Airbyte, which comes with a set of pre-packaged connectors for interacting with these data sources.
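As a rough illustration of the custom scripting described above, the sketch below hand-writes one small extractor per source. The sample data, the field names and the function names are hypothetical, and the inputs are inlined strings rather than a real file or HTTP call so that the example is self-contained.

```python
import csv
import io
import json

# Hypothetical inputs: a spreadsheet exported as CSV and a JSON API response.
SPREADSHEET_EXPORT = "id,name\n1,Acme\n2,Globex\n"
API_RESPONSE = '{"customers": [{"id": 3, "name": "Initech"}]}'


def extract_from_csv(text):
    # Each source needs its own parsing and type handling.
    reader = csv.DictReader(io.StringIO(text))
    return [
        {"id": int(row["id"]), "name": row["name"], "source": "spreadsheet"}
        for row in reader
    ]


def extract_from_api(payload):
    # A different source means a different format and a different extractor.
    customers = json.loads(payload)["customers"]
    return [
        {"id": c["id"], "name": c["name"], "source": "api"} for c in customers
    ]


records = extract_from_csv(SPREADSHEET_EXPORT) + extract_from_api(API_RESPONSE)
print(records)
```

Every new source adds another function like these to write and maintain, which is the effort that pre-packaged connectors aim to remove.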
Transformation
In this layer, we take the source data and cleanse, modify and prepare it to meet the requirements of the business and of downstream consumers such as Data Analysts and Data Scientists.
Historically, these transformations took place before data was loaded into the centralised Data Warehouse (Extract, Transform, Load). However, in the Modern Data Stack, transformation more typically happens after the load has taken place (Extract, Load, Transform). To achieve this, we would likely use a tool such as dbt.
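The Extract, Load, Transform ordering can be sketched as follows: the raw data is loaded untouched, and the cleaning then runs as SQL inside the database. This is a minimal illustration rather than how dbt itself works internally; the table names, columns and sample rows are hypothetical, and an in-memory SQLite database stands in for the Data Warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: land the source data as-is, including messy values.
conn.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, amount TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, " alice ", "10.50"), (2, "BOB", "4.25"), (3, "bob", None)],
)

# Transform: run the cleaning as SQL after the load has taken place.
conn.execute(
    """
    CREATE TABLE clean_orders AS
    SELECT order_id,
           LOWER(TRIM(customer)) AS customer,
           CAST(amount AS REAL)  AS amount
    FROM raw_orders
    WHERE amount IS NOT NULL
    """
)

clean_orders = conn.execute(
    "SELECT order_id, customer, amount FROM clean_orders ORDER BY order_id"
).fetchall()
print(clean_orders)
```

Keeping `raw_orders` in place means the transformation can be changed and re-run later without extracting the data from the source again.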
Storage
The next tier is all about storing the data and making it available for queries and consumption by your business.
Typically, this will include a Data Warehouse or Data Lake which acts as the long-term persistent store for your data.
Historically, this would have been an on-premises system, but nowadays we would likely use the cloud, potentially a service such as Snowflake or Redshift, or a data lake built on something like AWS S3.
Consumption
Once we have data in our target database or data lake, we then need to serve it to end users such as Data Analysts, Data Scientists and business users.
Some consumers will want low-level access to the data so that they can explore it in an ad-hoc way and build solutions on top of it. Other users will be comfortable interacting with data at a higher level, and may for instance make use of business intelligence tooling to build reports and dashboards.
The main evolution we are seeing in this area is a move to lightweight and open source business intelligence tools such as Metabase and Preset (a managed service built on the open source Apache Superset).
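The two styles of consumption described in this section can be sketched against the same stored data. The table, its columns and the sample rows below are hypothetical, and an in-memory SQLite database again stands in for the warehouse or data lake.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("north", 10.0), ("south", 5.0), ("north", 2.5)],
)

# Low-level access: an analyst exploring individual rows ad hoc.
large_orders = conn.execute(
    "SELECT region, amount FROM orders WHERE amount > 4 ORDER BY amount DESC"
).fetchall()

# Higher-level access: the kind of aggregate a dashboard tile would show.
revenue_by_region = dict(
    conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region"
    ).fetchall()
)

print(large_orders)
print(revenue_by_region)
```

A business intelligence tool typically generates queries like the second one on the user's behalf, so that business users do not need to write SQL themselves.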
Underlying Infrastructure
As discussed, a key feature of Modern Data Platform tools is that they are often cloud based or delivered as Software as a Service. This allows the tooling across the Modern Data Stack to benefit from the underlying characteristics of the cloud, such as its scalability and elasticity.
We discuss how the cloud enables the Modern Data Platform in more detail in the next lesson.