In this lesson we will:
- Look at how data orchestration tools are commonly integrated with ClickHouse.
What Is Data Orchestration?
Data Orchestration is the process of moving and manipulating data in order to meet the analytical and operational needs of the modern business.
It includes:
- Extracting data from various source systems and databases;
- Cleaning, de-duplicating and enhancing data;
- Applying analytics to the data to provide insights to the business;
- Copying data files to the correct locations and downstream systems.
These activities are likely to involve multiple systems and tools, and to span many different data sources. For this reason, orchestration can be thought of as an integration process that coordinates many different systems into a coherent end-to-end solution.
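To make these activities concrete, here is a minimal sketch of an extract, clean, and load flow that writes into ClickHouse via the clickhouse-connect client. The source file, host, table name, and columns are hypothetical placeholders, not part of any real pipeline.

```python
import csv

import clickhouse_connect


def extract(path: str) -> list[dict]:
    # Extract: read raw rows from a delivered source file.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def clean(rows: list[dict]) -> list[dict]:
    # Clean: de-duplicate on a business key and enrich each row.
    seen, out = set(), []
    for row in rows:
        if row["event_id"] not in seen:
            seen.add(row["event_id"])
            row["source"] = "daily_export"  # example enrichment field
            out.append(row)
    return out


def load(rows: list[dict]) -> None:
    # Load: copy the cleaned rows into ClickHouse for analysis.
    client = clickhouse_connect.get_client(host="localhost")
    client.insert(
        "events",
        [[r["event_id"], r["source"]] for r in rows],
        column_names=["event_id", "source"],
    )


load(clean(extract("new_batch.csv")))
```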
Operations
Orchestration tasks typically need to run on a recurring schedule, for instance hourly or daily, in order to process new data.
This is the second major responsibility of the orchestrator: managing jobs so that data is produced in a robust and reliable way, whilst providing operational support to administrators.
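As an illustration, the sketch below places a job on an hourly cron schedule using Dagster's ScheduleDefinition, so the orchestrator runs it continuously without manual intervention. The job and op names are assumptions for the example.

```python
from dagster import Definitions, ScheduleDefinition, job, op


@op
def process_new_data():
    # Placeholder for the real processing logic.
    ...


@job
def hourly_ingest():
    process_new_data()


# Run the job at the top of every hour; failures surface in the
# Dagster UI, giving administrators the operational visibility
# described above.
defs = Definitions(
    jobs=[hourly_ingest],
    schedules=[ScheduleDefinition(job=hourly_ingest, cron_schedule="0 * * * *")],
)
```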
Data Pipelines
Jobs in a data orchestration environment are likely to have dependencies on other jobs.
For instance, when a new batch of data is delivered, perhaps we need to run a job to de-duplicate it and add new fields to the dataset. Next, we might calculate a number of different analytics suites. After that, we may need to copy the resulting files into a line-of-business data warehouse ready for consumption.
This gives rise to the concept of a data pipeline, where jobs are executed one after the other and the pipeline only proceeds if each step is successful.
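A pipeline like this might be expressed in Dagster roughly as follows. This is only a sketch: the op names mirror the hypothetical steps above, their real logic is elided, and each op runs only if the previous one succeeded.

```python
from dagster import job, op


@op
def deduplicate_batch() -> list:
    # De-duplicate the newly delivered batch and add derived fields.
    return []


@op
def calculate_analytics(batch: list) -> dict:
    # Compute the various analytics suites over the cleaned batch.
    return {}


@op
def copy_to_warehouse(analytics: dict) -> None:
    # Copy the resulting files into the downstream data warehouse.
    ...


@job
def nightly_pipeline():
    # Passing outputs as inputs makes the ordering explicit:
    # dedupe -> analytics -> copy, each step gated on success.
    copy_to_warehouse(calculate_analytics(deduplicate_batch()))
```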
Dependency Graph
As well as executing jobs in a sequential pipeline, we may have situations where some sections of the pipeline can run in parallel, or where the pipeline branches depending on what is found in the data.
This gives rise to a Dependency Graph, or what is technically referred to as a Directed Acyclic Graph (DAG) of jobs.
Executing these DAGs in an efficient and robust way is the key capability provided by Dagster and similar tools.
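For example, a small DAG with a parallel section might look like the following Dagster sketch, where two analytics ops run concurrently off the same cleaned batch and a final op fans their results back in. All names here are illustrative assumptions.

```python
from dagster import job, op


@op
def clean_batch() -> list:
    # Produce the cleaned batch that both branches depend on.
    return []


@op
def sales_analytics(batch: list) -> dict:
    return {}


@op
def churn_analytics(batch: list) -> dict:
    return {}


@op
def publish(sales: dict, churn: dict) -> None:
    # Fan-in: runs only after both parallel branches complete.
    ...


@job
def analytics_dag():
    # The two analytics ops share an input but not outputs, so
    # Dagster can schedule them in parallel.
    batch = clean_batch()
    publish(sales_analytics(batch), churn_analytics(batch))
```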