The New Architecture For Cloud Native Data

The architectural patterns for building and managing data warehouses in the cloud are changing dramatically. These changes are likely to be beneficial to end users, but very disruptive to the vendors of data tools and platforms.

The New Stack

If you are building a data solution in the cloud today, the following would be a good approach:

Storage

The obvious place to store your actual data would be an object store such as AWS S3 or Azure Blob Storage. Object storage is robust, cheap, scalable and performant, which makes it well suited to holding your underlying data files.

When we store data in this way, we are doing so independently of any particular vendor. Our data doesn't sit "within" Snowflake or BigQuery; it's simply at rest in a data lake, available for whatever purpose we need now or in the future.

Table Formats

Where previously we worked directly with data files in formats such as Parquet, we now have table formats such as Delta, Iceberg and Hudi, which add a metadata layer over those files and allow us to treat them as logical tables.

Table formats give us the option to insert, update and delete data in the underlying files in much the same way as we would in a relational table. They also support transactions, which allow multiple people to read and write the same table safely, and they offer the ability to roll back or time travel to earlier versions.

This already begins to feel like a database before a single tool is involved! We simply have our cloud environment and some library for interacting with "tables" and we have a good base to build upon.
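To make the idea concrete, here is a minimal sketch of how a table format can layer versions over plain files. It is a toy, not the actual layout used by Delta, Iceberg or Hudi: the ToyTable class, the JSON data files and the numbered log files are all assumptions made for illustration, and a local directory stands in for the object store.

```python
import json
import tempfile
import uuid
from pathlib import Path


class ToyTable:
    """Toy table format: immutable data files plus a numbered commit log.

    Hypothetical layout, for illustration only. Each log entry lists the
    data files that make up the table at that version.
    """

    def __init__(self, root):
        self.root = Path(root)
        (self.root / "data").mkdir(parents=True, exist_ok=True)
        (self.root / "log").mkdir(parents=True, exist_ok=True)

    def latest_version(self):
        versions = [int(p.stem) for p in (self.root / "log").glob("*.json")]
        return max(versions) if versions else -1

    def _files_at(self, version):
        if version < 0:
            return []  # the table is empty before the first commit
        entry = self.root / "log" / f"{version:08d}.json"
        return json.loads(entry.read_text())["files"]

    def _write_log(self, version, files):
        # Mode "x" fails if the entry already exists, so two writers
        # cannot both claim the same version number.
        with open(self.root / "log" / f"{version:08d}.json", "x") as f:
            json.dump({"files": files}, f)

    def commit(self, rows):
        """Append rows as a new data file and record a new table version."""
        version = self.latest_version() + 1
        data_file = f"part-{uuid.uuid4().hex}.json"
        (self.root / "data" / data_file).write_text(json.dumps(rows))
        self._write_log(version, self._files_at(version - 1) + [data_file])
        return version

    def rollback(self, version):
        """Create a new version that points at an older version's files."""
        new_version = self.latest_version() + 1
        self._write_log(new_version, self._files_at(version))
        return new_version

    def read(self, version=None):
        """Read the table as of a version (time travel); default is latest."""
        if version is None:
            version = self.latest_version()
        rows = []
        for name in self._files_at(version):
            rows.extend(json.loads((self.root / "data" / name).read_text()))
        return rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        table = ToyTable(tmp)
        table.commit([{"id": 1, "city": "Leeds"}])
        table.commit([{"id": 2, "city": "York"}])
        print(table.read())           # both rows
        print(table.read(version=0))  # only the first row
```

Data files are never modified in place. Every change writes a new log entry, which is why older versions remain readable and a rollback is simply a new entry pointing at old files.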

Query Engine

Next, we would typically access these tables through a query engine, which knows how to take a SQL statement and query or manipulate the underlying table formats efficiently.

This could be something open source such as Trino or ClickHouse, or maybe a vendor tool such as Snowflake or Databricks.

In all cases, the architecture is the same. Applications or BI tools would connect to the Query Engine, and the Query Engine would connect through to query the data stored in the object store.
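The flow from application to query engine to object store can be sketched in a few lines. This is only an illustration under loud assumptions: a local directory of CSV files stands in for the object store, Python's built-in sqlite3 module stands in for the SQL layer, and the ToyQueryEngine class, the orders table and its fixed schema are all hypothetical. A real engine would read table-format metadata and scan files in place rather than copy rows into memory.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def write_sample_files(store):
    """Write two CSV 'objects' under an orders/ prefix (sample data)."""
    prefix = Path(store) / "orders"
    prefix.mkdir(parents=True, exist_ok=True)
    (prefix / "part-0.csv").write_text(
        "id,region,amount\n1,EU,10.0\n2,US,25.5\n"
    )
    (prefix / "part-1.csv").write_text("id,region,amount\n3,EU,4.5\n")


class ToyQueryEngine:
    """Hypothetical engine: holds no data itself, only a pointer to storage."""

    def __init__(self, store):
        self.store = Path(store)

    def query(self, table, sql):
        """Load every file under the table's prefix, then run the SQL."""
        conn = sqlite3.connect(":memory:")
        try:
            # The schema is hard-coded for this sketch; a real engine would
            # take it from the table format's metadata.
            conn.execute(
                f'CREATE TABLE "{table}" (id INTEGER, region TEXT, amount REAL)'
            )
            for path in sorted((self.store / table).glob("*.csv")):
                with open(path, newline="") as f:
                    rows = [
                        (int(r["id"]), r["region"], float(r["amount"]))
                        for r in csv.DictReader(f)
                    ]
                conn.executemany(f'INSERT INTO "{table}" VALUES (?, ?, ?)', rows)
            return conn.execute(sql).fetchall()
        finally:
            conn.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as store:
        write_sample_files(store)
        engine = ToyQueryEngine(store)
        print(engine.query(
            "orders",
            "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region",
        ))
```

The application only ever talks to the engine, and the engine owns nothing but a reference to the storage location, so swapping the engine leaves the data files untouched.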

Benefits

This stack is very beneficial to end users as it reduces lock-in and increases interoperability.

It means that instead of being stored in proprietary formats, data is stored in open-source formats on industry-standard infrastructure provided by AWS or Azure.

If we wish to change query engines, that is not necessarily a big challenge when data is stored in an open, standardised format and people are querying using ANSI SQL.

Over time, almost all tools will tend towards being able to read and write data in this way, which also maximises interoperability as people align around the standard.

The clean separation of compute and storage in this architecture is also compelling. It means that compute can be scaled horizontally as necessary to provide the optimal cost and performance tradeoff.

And because cloud object stores can be accessed with a high degree of concurrency, processing can be divided across the compute cluster, which gives us very scalable performance.
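As a rough sketch of that divide-and-combine pattern, the example below scans several partition files concurrently and merges the partial results. The assumptions are deliberate simplifications: threads stand in for nodes in a compute cluster, local files stand in for objects in the store, and the function names are hypothetical.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_partitions(store, partitions=4, rows_per_partition=100):
    """Write one file per partition, each holding one integer per line."""
    paths = []
    for i in range(partitions):
        path = Path(store) / f"part-{i}.txt"
        start = i * rows_per_partition
        values = range(start, start + rows_per_partition)
        path.write_text("\n".join(str(v) for v in values) + "\n")
        paths.append(path)
    return paths


def partial_sum(path):
    """The work one compute node would do: scan a single partition."""
    with open(path) as f:
        return sum(int(line) for line in f if line.strip())


def parallel_total(paths, workers=4):
    """Scan partitions concurrently, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, paths))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as store:
        paths = write_partitions(store)
        print(parallel_total(paths))
```

Because each partition is read independently, adding workers changes how quickly the scan completes but not the result, which is the property that lets compute scale separately from storage.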

Data Lakehouse

This architecture is sometimes referred to as a Data Lakehouse, where we are building abstractions such as tables and a SQL API over a data lake.

This model gives the best of both worlds: we have the scalability and flexibility of the data lake, together with the governance, controls and SQL APIs associated with a data warehouse.

Though Data Lakehouse is a term strongly associated with Databricks, it is equally possible to build a Lakehouse with a relational data warehouse acting as the query engine and gateway into the data lake.

Implications For Vendors

This shift is very disruptive to vendors, who will find much of their stack commoditised and the market pushed towards open standards. They will therefore need to compete higher up the stack, and on how well they support this architecture.

A data warehouse will always give better performance when data is actually ingested and it fully owns the ability to manage the data. However, the market will motivate vendors to deliver the best performance and a fully featured experience when using this architecture. If a given warehouse does not support, for instance, highly performant transactions on Iceberg tables located on Azure Blob Storage, then it will eventually lose out to competitors.

All in all, this is a fundamental shift in the cloud data space, and it is likely to have implications for both end clients and vendors.