There are many situations in enterprise IT where we need to move, copy or integrate datasets: for example, populating a centralised data warehouse or data lake, integrating two systems such as an ecommerce platform and a CRM, or exchanging data between partner organisations, perhaps by a simple file transfer.
Moving data around in this manner is referred to as Extract, Transform and Load, or ETL. This describes the end-to-end process of extracting data from the source system, transforming it into the required format, and inserting or updating data in the destination.
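As a minimal sketch of those three steps, the Python below reads a few records, cleans them up, and upserts them into a destination table. Everything here is an assumption for illustration: the in-memory `SOURCE_ROWS` list stands in for a real source system, an SQLite database stands in for the destination, and the `customers` table and its columns are hypothetical.

```python
import sqlite3

# Hypothetical source records, standing in for rows read from a source system.
SOURCE_ROWS = [
    {"id": 1, "email": " Alice@Example.com ", "total": "19.99"},
    {"id": 2, "email": "bob@example.com", "total": "5.00"},
]


def extract():
    """Extract: read raw records from the source (here, an in-memory list)."""
    return list(SOURCE_ROWS)


def transform(rows):
    """Transform: normalise emails and convert totals to integer cents."""
    return [
        (row["id"], row["email"].strip().lower(), round(float(row["total"]) * 100))
        for row in rows
    ]


def load(conn, rows):
    """Load: insert new rows, or replace existing rows with the same id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(id INTEGER PRIMARY KEY, email TEXT, total_cents INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO customers (id, email, total_cents) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


def run_etl(conn):
    """Run the three steps end to end against the given destination."""
    load(conn, transform(extract()))
```

Calling `run_etl(sqlite3.connect(":memory:"))` runs the whole pipeline once. Because the load step replaces rows by primary key, re-running it does not create duplicates, which matters when a failed run has to be repeated.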
ETL is a very mature practice. Data engineers have been kept busy for years moving this data around: writing the scripts, managing the associated ETL tools, and dealing with data errors as they arise, all as a business-as-usual function.
Batch ETL
Historically, ETL has been built around exchanging data in batches. For instance, a set of files is extracted and uploaded to some target every hour or every day, containing all of the records which have been updated in the last window. This simple approach has served us well and will continue to serve us well for many use cases. However, there are a number of downsides to batch-based data integration:
- It's slow - The destination system could be waiting for hours or even days to receive the most recent data;
- It's fragile - There could be various errors processing records which usually require human interaction to investigate before the data can be re-pushed;
- It impacts the customer experience - If we have a synchronous or low-latency integration, we can inform the user immediately when the action has taken place. With a delayed batch, this isn't possible.
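To make the batch window and its latency concrete, here is a small Python sketch. The record layout and the `updated_at` field are assumptions for illustration, not part of any particular ETL tool.

```python
from datetime import datetime


def extract_batch(records, window_start, window_end):
    """Select records updated in the half-open window [window_start, window_end)."""
    return [r for r in records if window_start <= r["updated_at"] < window_end]


def delivery_lag(record, window_end):
    """How long a record waited between being updated and the batch being shipped."""
    return window_end - record["updated_at"]


# Hypothetical source records with their last-updated timestamps.
records = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, 9, 15)},
    {"id": 2, "updated_at": datetime(2024, 1, 1, 10, 5)},
    {"id": 3, "updated_at": datetime(2024, 1, 1, 10, 59)},
]

# The hourly batch covering 10:00 to 11:00 picks up records 2 and 3 only.
batch = extract_batch(records, datetime(2024, 1, 1, 10), datetime(2024, 1, 1, 11))
```

In this example, record 2 was updated at 10:05 but is not shipped until the window closes at 11:00, so the destination sees it 55 minutes late. With a daily batch the same record could wait most of a day.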
Streaming ETL
Because of the increased need for speed, attention has turned to streaming Extract, Transform and Load, where we perform the ETL process by capturing data at the source system as it is generated and pushing it straight to the destination for immediate and continuous processing. These events are typically sent over a message broker or streaming platform such as Kafka, or perhaps through a direct API call. This requires work on both sides: the source system must capture changes as they occur, and the destination system must be modified to process the continuous stream of events.
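The event flow described above can be sketched in a few lines of Python. This is an illustration only: a standard-library `queue.Queue` stands in for a real broker such as Kafka, a plain dictionary stands in for the destination system, and the function names and event shape are hypothetical.

```python
import queue
import threading


def capture_change(broker, record):
    """Source side: publish a change event as soon as the record changes."""
    broker.put({"key": record["id"], "value": record})


def transform_event(record):
    """Transform: reshape the record for the destination (upper-case the status)."""
    return {**record, "status": record["status"].upper()}


def consume(broker, destination):
    """Destination side: apply each event as it arrives, until a None sentinel."""
    while True:
        event = broker.get()
        if event is None:
            break
        destination[event["key"]] = transform_event(event["value"])


def run_demo():
    """Publish two changes to the same order and return the destination state."""
    broker = queue.Queue()  # stand-in for a message broker
    destination = {}        # stand-in for the destination system
    worker = threading.Thread(target=consume, args=(broker, destination))
    worker.start()
    capture_change(broker, {"id": 42, "status": "placed"})
    capture_change(broker, {"id": 42, "status": "amended"})
    broker.put(None)        # tell the consumer to stop
    worker.join()
    return destination
```

The consumer processes events one at a time in arrival order, so the destination ends up holding the latest state of order 42 without waiting for any batch window. A production setup would also need to handle retries, ordering across partitions, and consumer restarts, none of which this sketch attempts.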
The main benefit of this change is its impact on employee and customer experience. For instance, if a transaction is placed and the customer immediately calls the call centre to amend the order, the call centre agent will see the current state of the world and can give the customer the best possible service. This avoids the situation where the customer needs to call back tomorrow, or is told that their change should be reflected on the system in the next 30 minutes.
ETL isn't as exciting as some of the innovations happening in the data world right now. However, it is a foundational part of enterprise technology. Deploying streaming ETL to improve its timeliness and reliability can improve the customer experience and help businesses to operate more intelligently and efficiently.