In this lesson we will:
- Introduce Airbyte;
- Learn how to use Airbyte Cloud to ingest data from Google Analytics GA4 into ClickHouse Cloud.
What Is Airbyte?
Airbyte is an open-source platform that can help you move data from various sources to different destinations. Example sources include databases, APIs, SaaS applications and flat files and common destinations include data warehouses and data lakes.
Airbyte offers a growing library of pre-built connectors for all of these data sources and destinations. Airbyte then manages and schedules the integration runs and introduces operational tools for monitring them.
This type of ETL work is something which companies often need to do, but we can use tools such as Airbyte and other similar tools to manage this process for us.
All in all, this massively reduces the amount of bespoke work that businesses need to do to move data around.
Other Features Of Airbyte
Airbyte has some other notable features:
Data Transformation: Users can apply transformations to the data as it flows through the pipelines, allowing for data cleansing, enrichment, and mapping.
Orchestration: Airbyte provides tools for scheduling and orchestrating data extraction and synchronization tasks, ensuring that data is up-to-date and accurate.
Monitoring and Logging: It offers monitoring and logging capabilities to track the status and health of data pipelines.
Extensibility: Airbyte is highly extensible, allowing users to create custom connectors or adapt existing ones to suit their specific needs.
Security: It supports data encryption, authentication, and access control measures to ensure the security of sensitive data.
Community and Ecosystem: Being open-source, Airbyte benefits from an active community of contributors and users who continually improve and extend its capabilities.
All in all, it is a fairly compelling proposition for companies and data teams that need to do extensive ETL work.
Airbyte Cloud
Though Airbyte is an open source platform, it is also available as a managed cloud service on a commercial basis at airbyte.com.
Airbyte and ClickHouse
Airbyte has good out of the box support for ClickHouse both as a source and as a destination.
By combining Airbyte Cloud with, ClickHouse Cloud, you are able to deploy a fully managed and fully serverless stack as is our preference.
Walkthrough
We will now walk through the process of using Airbyte to integrate from a source into ClickHouse. For the purposes of this example we will connect to Google Analytics as our source.
Configuring A Source
After signing up at airbyte.com, the first thing we need to do is define our data source. To illustrate the concept, we will connect to our Google Analytics account.
Configuring A ClickHouse Cloud Destination
We will then setup a destination pointing to ClickHouse. If you are using ClickHouse cloud you will need to connect on port 8443.
Note that though it says the password is optional, this needs to be specified else you receive a misleading error message:
Configuring The Connection
The connection is the concept which connects the source to the destination. In this instance we will be connecting to our ClickHouse Cloud database:
The run took approximately 30 minutes for me:
The tables have now been created:
SHOW TABLES WHERE name LIKE '_airbyte%'
Query id: 1d49a23a-4bb2-4e1b-bcf3-49ebfa9200b1
┌─name──────────────────────────────────┐
│ _airbyte_raw_daily_active_users │
│ _airbyte_raw_devices │
│ _airbyte_raw_four_weekly_active_users │
│ _airbyte_raw_locations │
│ _airbyte_raw_pages │
│ _airbyte_raw_traffic_sources │
│ _airbyte_raw_website_overview │
│ _airbyte_raw_weekly_active_users │
└───────────────────────────────────────┘
8 rows in set. Elapsed: 0.002 sec.
Pricing Model
In this lesson we demonstrated Airbyte Cloud, a fully managed platform for carrying out this ETL work.
Airbyte Cloud has two pricing modes. When we are calling an API, as we are here with Google Analytics, you pay based on the number of rows transformed.
If you are loading data from a database such as MySQL, you pay by the GB transferred.
It is of course important to understand the aX§mount of data you will be transferring and only transform data that you will actually use in order to minimise costs.