ensemble

Why You Need An Orchestrator In Your Data Stack

4 Jul 2023

Benjamin Wootton
  • Founder & CTO, Ensemble

Data Orchestration platforms such as Airflow, Dagster and Prefect can be used to execute and co-ordinate all of the data-related pipelines and jobs in your organisation.

Though most Data Engineers and people adjacent to the data space have a surface-level understanding of what these tools do, they don't always appreciate their value, and tend to view or deploy them as simple script runners: an upgraded alternative to Crontab.

In reality, this is a missed opportunity. These tools are powerful, and most data teams operating with any complexity would benefit from implementing an Orchestration platform and using it to its full potential.

Some of the key capabilities of these tools include:

  • Scheduling jobs - You can schedule your jobs to run periodically without dependency on Cron;

  • Triggers - Jobs can be triggered by events, such as the arrival of new data. This moves you on from scheduled batch runs to more frequent and dynamic pipeline runs;

  • Resilience - Rather than depending on one machine to run your jobs, schedulers are typically clustered for resilience and additional capacity. Configuration can also be stored in a backed-up and replicated database;

  • Source Code - Orchestrators encourage you to break your code out of proprietary tools and move it into real code which can be checked into source control, versioned, included in CI/CD pipelines etc;

  • Clean Code - Orchestration frameworks encourage you to break your workflows into separate, maintainable and reusable code units;

  • Environment Separation - You can separate your pipeline definitions from your environment details, making it easier to run the same code in dev/test/prod;

  • Operations - Orchestration tools support monitoring, alerting and re-running parts of a pipeline via their GUI, which improves operational quality;

  • Retry Logic - These tools can incorporate retry logic, such as retrying N times or retrying once an error condition is resolved;

  • Partitioning - Orchestrators support the idea of partitioning datasets such that data can be back-filled or reloaded in small batches.
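
To make a few of the capabilities above concrete, here is a minimal Python sketch of retry logic, daily partitions for back-filling, environment separation and dependency ordering, using only the standard library. It is an illustration under stated assumptions, not how Airflow, Dagster or Prefect implement these features: the function names `run_with_retries`, `daily_partitions`, `run_pipeline` and `warehouse_url`, the `PIPELINE_ENV` variable and the connection URLs are all hypothetical.

```python
import os
from datetime import date, timedelta
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Any, Callable, Dict, List, Set

# Environment separation: the pipeline code stays the same and only the
# environment details change. PIPELINE_ENV and these URLs are hypothetical.
WAREHOUSE_URLS: Dict[str, str] = {
    "dev": "sqlite:///dev.db",
    "prod": "postgresql://warehouse.example.internal/analytics",
}


def warehouse_url() -> str:
    """Pick the warehouse connection for the current environment."""
    return WAREHOUSE_URLS[os.environ.get("PIPELINE_ENV", "dev")]


def run_with_retries(task: Callable[[], Any], max_attempts: int = 3) -> Any:
    """Retry logic: call the task until it succeeds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # no attempts left, so surface the failure


def daily_partitions(start: date, end: date) -> List[date]:
    """Partitioning: one partition per day between start and end, inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def run_pipeline(
    tasks: Dict[str, Callable[[date], Any]],
    depends_on: Dict[str, Set[str]],
    partition: date,
) -> List[str]:
    """Run every task for one partition, upstream tasks first."""
    completed: List[str] = []
    for name in TopologicalSorter(depends_on).static_order():
        run_with_retries(lambda: tasks[name](partition))
        completed.append(name)
    return completed


if __name__ == "__main__":
    tasks = {
        "extract": lambda day: print(f"extract {day} from source"),
        "transform": lambda day: print(f"transform {day}"),
        "load": lambda day: print(f"load {day} into {warehouse_url()}"),
    }
    # Each task maps to the set of tasks that must finish before it.
    depends_on = {"transform": {"extract"}, "load": {"transform"}}

    # Back-fill three days, one small partition at a time.
    for day in daily_partitions(date(2023, 7, 1), date(2023, 7, 3)):
        run_pipeline(tasks, depends_on, day)
```

Even this toy version shows why a Crontab entry falls short: ordering, retries and partitioned re-runs all have to live somewhere. An Orchestration platform provides them alongside the scheduling, clustering and GUI described above.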

All told, a Data Orchestration platform is much more than a dumb script runner. It can improve the entire lifecycle, from how code is written through to reliable production operation.

© 2024 Ensemble AI. All rights reserved