In this lesson we will:
- Explain how to integrate ClickHouse with Apache Iceberg.
What Is Apache Iceberg?
Apache Iceberg is an open source table format. It typically stores its data as Apache Parquet files, but layers on features such as schema evolution, safe concurrent reads and writes, ACID transactions and time travel.
As a first step, ClickHouse now provides a read-only view over Iceberg tables, though over time this is expected to become bi-directional.
Worked Example
We will now walk through the process of creating a table in Iceberg format, uploading it to S3, and accessing it from ClickHouse.
Rather than using AWS for real, we will use localstack to simulate S3 locally.
Create An Iceberg Table
We will begin by creating a table in Iceberg format which we will later query from ClickHouse.
The easiest way to do this is using the open source version of Apache Spark. Begin by downloading Apache Spark from the Spark website.
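For reference, a recent release can be downloaded and unpacked from the Apache archive as follows. The version shown is an example; any Spark 3.x release with a matching Iceberg runtime should work:
# download and unpack Apache Spark (example version; adjust as needed)
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar -xzf spark-3.5.0-bin-hadoop3.tgz
cd spark-3.5.0-bin-hadoop3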
Having downloaded and unpacked Spark, the next step is to start a spark-sql session to create our Iceberg table using SQL. We need to specify the Iceberg runtime as a package, as it is not bundled with open source Spark, and enable the Iceberg extensions for the session catalog. Adjust the runtime version to match your Spark and Scala build:
./bin/spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2 --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" --conf "spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog" --conf "spark.sql.catalog.spark_catalog.type=hive"
Once the Spark SQL shell has loaded, we can create a table as normal, adding the using iceberg clause to indicate that we would like to serialize the table in Iceberg format:
create table my_iceberg_table( val integer ) using iceberg;
insert into my_iceberg_table values( 123 );
insert into my_iceberg_table values( 456 );
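As an aside, this is a good point to try the time travel features mentioned earlier. Iceberg exposes metadata tables alongside each table in Spark, so the following queries are a sketch of how we could inspect our two inserts; substitute a real snapshot_id from the first query into the second, as the one shown is a placeholder:
-- each insert above created a snapshot; list them with their commit times
select snapshot_id, committed_at from my_iceberg_table.snapshots;
-- query the table as it looked at an earlier snapshot (placeholder id shown)
select * from my_iceberg_table version as of 123456789;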
If you then close the shell and check the spark-warehouse folder within your Spark installation, you should see a folder named my_iceberg_table which contains a structure such as the following, with the Parquet data files under data and the Iceberg metadata under metadata. This is our Iceberg table which we will ultimately want to access from ClickHouse:
drwxr-xr-x@ 4 benjaminwootton staff 128 10 Oct 20:30 data
drwxr-xr-x@ 8 benjaminwootton staff 256 10 Oct 20:30 metadata
Uploading The Iceberg File To S3 (Localstack)
ClickHouse currently provides only read-only support for Iceberg tables, and only for tables stored on AWS S3 or an S3-compatible store. This situation should however evolve rapidly.
Rather than using S3 for this demo, it is easy to use localstack, which simulates AWS on your local machine.
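If you don't already have localstack installed, pip is one of several supported installation routes:
# install the localstack CLI (requires Python 3 and Docker)
python3 -m pip install localstack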
After downloading and installing localstack, it can be run in the following way. This instantiates localstack within a Docker container, and starts an S3 service which is listening and ready to have buckets created and files uploaded.
DEBUG=1 localstack start
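Before proceeding, we can confirm that the S3 service is up by querying localstack's health endpoint (the path below is correct for recent localstack versions):
curl http://localhost:4566/_localstack/health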
Next, we need to create a bucket on the localstack S3 to hold our Iceberg table:
aws s3api create-bucket --bucket my-iceberg-table --endpoint-url=http://localhost:4566 --region local --create-bucket-configuration LocationConstraint=local
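We can verify the bucket was created by listing all buckets against the localstack endpoint:
aws s3api list-buckets --endpoint-url=http://localhost:4566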
We can then push our local Iceberg files from the Spark folder to the S3 bucket:
cd spark-warehouse
aws s3 cp --recursive ./my_iceberg_table s3://my-iceberg-table --endpoint-url=http://localhost:4566
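Listing the bucket contents confirms that the data and metadata folders arrived intact:
aws s3 ls --recursive s3://my-iceberg-table --endpoint-url=http://localhost:4566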
Finally, we can test that this was successful by checking the following URL, which returns an XML listing of the bucket contents (response truncated below):
curl http://localhost:4566/my-iceberg-table/
<?xml version='1.0' encoding='utf-8'?>
Connecting ClickHouse To The Iceberg Table
We have now created an Iceberg table and uploaded it to the mocked S3 running on localstack on our local machine.
The next and final step is to connect to it from ClickHouse by creating a table using the Iceberg table engine.
create table iceberg ENGINE=Iceberg('http://localhost:4566/my-iceberg-table/')
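If your ClickHouse server has no AWS credentials configured, they can be supplied as additional engine arguments, mirroring the S3 engine's signature. localstack accepts arbitrary values, so the test/test credentials below are just placeholders:
create table iceberg ENGINE=Iceberg('http://localhost:4566/my-iceberg-table/', 'test', 'test')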
Once the table is created, we can select from it as normal:
select * from iceberg;
Which returns the following results:
Query id: d0d19eac-c1ec-4c2c-a0f0-23f06213d914
┌─val─┐
│ 123 │
└─────┘
┌─val─┐
│ 456 │
└─────┘
2 rows in set. Elapsed: 5.888 sec.
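ClickHouse also provides an iceberg table function, which is convenient for ad hoc queries against the same data without creating a table first. A minimal sketch, again using placeholder localstack credentials:
select * from iceberg('http://localhost:4566/my-iceberg-table/', 'test', 'test');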
We now have connectivity between ClickHouse and the Iceberg table.
As mentioned, we may need to do this for integration purposes as Iceberg becomes more popular, or perhaps it is a desirable end state whereby we use ClickHouse to query Iceberg tables stored in S3.