Administering and Operating ClickHouse

Clustering ClickHouse

Lesson #4

In this lesson we will:

  • Explain the benefits of clustering your ClickHouse deployment;
  • Briefly summarise the process for configuring a cluster;
  • Introduce the concept of replicated tables.

Clustering ClickHouse

It is possible to run ClickHouse in clustered mode, where two or more server instances work together as a co-ordinated unit.

There are two reasons why we would want to do this:

Scalability - One of the primary reasons for clustering ClickHouse is to improve scalability. In a clustered environment, ClickHouse can handle more data and more queries than a single-node setup. This is particularly useful for large-scale data analytics, where the volume of data is continuously growing. Clustering allows you to distribute the data across multiple nodes, enabling the database to manage larger datasets efficiently and maintain high performance levels even under heavy load.

High Availability and Fault Tolerance - Clustering enhances the reliability of your database. By distributing the data across multiple nodes, you ensure that there is no single point of failure. If one node goes down, the others can continue to handle queries, ensuring continuous availability of your data. This redundancy is crucial for critical applications where data availability and uptime are paramount. Additionally, clustering can facilitate easier and more efficient backup processes, further securing your data against loss.

Setting Up The Cluster

If you are not using ClickHouse Cloud, you will be responsible for configuring the cluster yourself through configuration files. We will now walk through the key steps at a high level:

Install ClickHouse on All Nodes

First, you need to install ClickHouse on each node that will be part of the cluster. Ensure that all nodes have the same version of ClickHouse for compatibility.
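As a sketch, on Debian or Ubuntu hosts the server and client packages can be installed from the official ClickHouse repository (the repository setup itself is covered in the ClickHouse installation docs):

# Assumes the official ClickHouse apt repository has already been added
sudo apt-get install -y clickhouse-server clickhouse-client

# Verify that the installed version matches on every node
clickhouse-server --version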

Configure Network Access

Ensure that the nodes in your cluster can communicate with each other. This usually involves configuring firewalls and network settings to allow traffic on the ports used by ClickHouse (by default, 9000 for the native TCP protocol, 8123 for HTTP, and 9009 for interserver replication).
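A quick way to confirm connectivity is to run a trivial query against each peer from every node. This sketch assumes the node1-hostname used in the cluster definition below:

# Should print 1 if the native TCP port on node1 is reachable
clickhouse-client --host node1-hostname --port 9000 --query "SELECT 1"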

Configure the Cluster in config.xml

On each server node, you will need to edit the ClickHouse configuration file, typically located at /etc/clickhouse-server/config.xml (or, preferably, a drop-in file under /etc/clickhouse-server/config.d/). You will define your cluster in the <remote_servers> section.

Here's an example configuration for a simple two-node cluster:

<remote_servers>
    <my_cluster>
        <shard>
            <internal_replication>false</internal_replication>
            <replica>
                <host>node1-hostname</host>
                <port>9000</port>
            </replica>
        </shard>
        <shard>
            <internal_replication>false</internal_replication>
            <replica>
                <host>node2-hostname</host>
                <port>9000</port>
            </replica>
        </shard>
    </my_cluster>
</remote_servers>

This configuration defines a cluster named my_cluster with two shards, each containing a single replica. Note that with internal_replication set to false, the Distributed engine writes to every replica in a shard itself; when the underlying tables are ReplicatedMergeTree tables, you would normally set it to true and let the replicas synchronise the data themselves.

The same configuration file should be present on every machine in the cluster.
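If you intend to use the {shard} and {replica} substitution macros in replicated table paths (as the example later in this lesson does), each node also needs a <macros> section with values unique to that node. A sketch for the first node might look like this:

<macros>
    <shard>01</shard>
    <replica>node1-hostname</replica>
</macros>

On the second node, the shard and replica values would change accordingly, since in this layout each node holds a different shard.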

Configure ZooKeeper or ClickHouse Keeper (for Replicated Tables)

If you plan to use replicated tables, you need a coordination service such as ZooKeeper. Install and configure ZooKeeper on separate servers (or use an existing ZooKeeper cluster if one is available). Then, add the ZooKeeper configuration to the ClickHouse config.xml file:

<zookeeper>
    <node>
        <host>zookeeper1-hostname</host>
        <port>2181</port>
    </node>
    <!-- Additional ZooKeeper nodes if available -->
</zookeeper>

ClickHouse also provides ClickHouse Keeper, a protocol-compatible alternative to ZooKeeper that is optimised for the ClickHouse use case. Today, ClickHouse Keeper is the recommended option over ZooKeeper, though either will work.
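If you opt for ClickHouse Keeper, the client-side configuration is unchanged: ClickHouse still reads the <zookeeper> section, pointed at the Keeper nodes instead (Keeper's default client port is 9181). A sketch, assuming a keeper1-hostname host:

<zookeeper>
    <node>
        <host>keeper1-hostname</host>
        <port>9181</port>
    </node>
</zookeeper>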

Restart ClickHouse Server

After updating the configuration, restart ClickHouse on each node to apply the changes:

sudo systemctl restart clickhouse-server
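Once the servers are back up, you can confirm that each node sees the cluster by querying the system.clusters table:

SELECT cluster, shard_num, replica_num, host_name
FROM system.clusters
WHERE cluster = 'my_cluster';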

Create Distributed and Replicated Tables

With the cluster set up, you can now create distributed and replicated tables. Here’s an example SQL statement that creates a replicated table:

CREATE DATABASE IF NOT EXISTS mydb ON CLUSTER my_cluster;

CREATE TABLE mydb.my_replicated_table ON CLUSTER my_cluster (
    id UInt32,
    data String
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/my_replicated_table', '{replica}')
ORDER BY id;

The ON CLUSTER clause runs the statement on every node in the cluster, and the {shard} and {replica} placeholders are filled in from the macros defined in each node's configuration, giving every replica a unique path in ZooKeeper/Keeper.

You can also create tables with the distributed table engine:

CREATE TABLE distributed_table_name AS local_table_name
ENGINE = Distributed(cluster_name, database_name, local_table_name, sharding_key);
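For example, a distributed table over the replicated table created above might look like the following, using a random sharding key (the table and database names are simply those from the earlier example):

CREATE TABLE mydb.my_distributed_table ON CLUSTER my_cluster
AS mydb.my_replicated_table
ENGINE = Distributed(my_cluster, mydb, my_replicated_table, rand());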

Test The Cluster

Finally, you should test the cluster by reading and writing to the replicated tables.
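As a minimal smoke test, assuming the tables created above, insert a row through the distributed table on one node and read it back from another:

-- Run on node1
INSERT INTO mydb.my_distributed_table (id, data) VALUES (1, 'hello');

-- Run on node2; the row should be visible via the distributed table
SELECT * FROM mydb.my_distributed_table WHERE id = 1;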

Next Lesson: Backups In ClickHouse

In the next lesson we will learn about the process for backing up your ClickHouse instance and the data stored within it.



