The Modern Data Stack

Ingesting Data Into A Modern Data Stack

Lesson #4

In this lesson we will:

  • Learn how data is typically sourced and ingested into Modern Data Platforms.

Common Data Sources

Businesses will usually need to combine data from multiple sources in order to meet their Data and Analytics requirements.

This data is commonly found in systems such as:

  • Internal line of business applications such as CRM and ERP systems;
  • Websites and mobile applications;
  • Social media e.g. Twitter;
  • Online advertising platforms e.g. Facebook Ads;
  • Online analytics platforms e.g. Google Analytics.

In addition, data will be found in various data extracts and databases.

The task is to extract data from all of these sources and ingest it into the platform where it can then be cleaned up and used by data consumers such as Data Analysts and Data Scientists.

Connectors

Recognising that many businesses have this same challenge, vendors have stepped forward with tools that manage the process of extracting data from common data sources, and pushing it into a centralised repository such as a Data Warehouse.

Common tools in this space include Fivetran, Stitch, Airbyte and Meltano.
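To make the idea of a connector concrete, the following is a minimal sketch of the extract-and-load pattern these tools automate. It is not how any of the named products are implemented: `extract_crm_contacts`, the `raw_crm_contacts` table and the sample records are all hypothetical, and an in-memory SQLite database stands in for the Data Warehouse.

```python
import sqlite3
from typing import Iterable


def extract_crm_contacts() -> Iterable[dict]:
    """Hypothetical source: stands in for a call to a CRM system's API."""
    return [
        {"id": 1, "email": "ada@example.com"},
        {"id": 2, "email": "grace@example.com"},
    ]


def load(conn: sqlite3.Connection, rows: Iterable[dict]) -> None:
    """Land the extracted rows in a raw table, replacing rows that already exist."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_crm_contacts "
        "(id INTEGER PRIMARY KEY, email TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO raw_crm_contacts (id, email) VALUES (:id, :email)",
        list(rows),
    )
    conn.commit()


# An in-memory SQLite database is an assumption standing in for the warehouse.
warehouse = sqlite3.connect(":memory:")
load(warehouse, extract_crm_contacts())

row_count = warehouse.execute("SELECT COUNT(*) FROM raw_crm_contacts").fetchone()[0]
print(row_count)  # prints 2
```

A real connector would add concerns this sketch leaves out, such as authentication, schema changes and retries, which is much of the reason businesses adopt an off-the-shelf tool.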

Batch and Streaming Ingestion

Historically, data loads into centralised data platforms have occurred using a periodic batch process. Every hour or every day, a batch of recent updates would be extracted from the source and uploaded to the centralised location. The data would then be imported and processed when the new file arrived.
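One common way to pick out "recent updates" is to remember the timestamp of the latest row seen in the previous run, sometimes called a high-water mark, and extract only rows changed after it. The sketch below illustrates that idea under stated assumptions: the `updated_at` column, the sample rows and the function names are hypothetical, and a real job would persist the mark between runs.

```python
from datetime import datetime

# Hypothetical source table: each row records when it was last changed.
SOURCE_ROWS = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, 9, 0)},
    {"id": 2, "updated_at": datetime(2024, 1, 1, 10, 30)},
    {"id": 3, "updated_at": datetime(2024, 1, 1, 11, 15)},
]


def extract_batch(rows, since):
    """Return only the rows changed after the previous run's high-water mark."""
    return [row for row in rows if row["updated_at"] > since]


def run_batch(rows, last_mark):
    """Run one scheduled batch and return it with the new high-water mark."""
    batch = extract_batch(rows, last_mark)
    # If nothing changed, the mark stays where it was.
    new_mark = max((row["updated_at"] for row in batch), default=last_mark)
    return batch, new_mark


# First run: the previous batch finished at 10:00, so rows 2 and 3 are picked up.
batch, mark = run_batch(SOURCE_ROWS, datetime(2024, 1, 1, 10, 0))
print([row["id"] for row in batch])  # prints [2, 3]

# Second run with no new changes extracts nothing and leaves the mark unchanged.
next_batch, next_mark = run_batch(SOURCE_ROWS, mark)
print(next_batch)  # prints []
```

Note that a row changed just after a run completes is not seen until the next scheduled run, so the freshness of the data is bounded by the batch interval.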

The main problem with this is the delay. Because the data in your data warehouse is out of date, any reports, dashboards and applications built on top of it will also be seeing stale data. This could impact the decisions that your business takes and the customer experience.

In recent years, businesses have been moving from batch solutions to a more real-time streaming architecture. This involves publishing and processing updates as soon as the source data is captured.
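As a rough illustration of the difference, the sketch below publishes each update to a queue the moment it is captured, and a consumer processes it as soon as it arrives rather than waiting for a scheduled batch. The in-process `queue.Queue` is an assumption standing in for a message broker, and the order events and the `warehouse` list are hypothetical.

```python
import queue
import threading

events = queue.Queue()  # stands in for a message broker topic (assumption)
warehouse = []          # stands in for the destination table (assumption)


def consume():
    """Process each event as soon as it arrives, until a stop signal is seen."""
    while True:
        event = events.get()  # blocks until the next event is published
        if event is None:     # None is used here as the stop signal
            break
        warehouse.append(event)


worker = threading.Thread(target=consume)
worker.start()

# The source publishes each update immediately instead of collecting a batch.
for order_id in (101, 102, 103):
    events.put({"order_id": order_id})

events.put(None)  # tell the consumer to stop
worker.join()

print(len(warehouse))  # prints 3
```

Even this toy version needs a long-running consumer and a shutdown signal, which hints at the extra operational complexity of streaming compared with a scheduled batch job.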

The move to streaming means that data is fresher throughout the business, but streaming architectures are more complex to build and operate. That said, it is the direction of travel for the industry.






© 2024 Ensemble. All Rights Reserved.