In this lesson we will:
- Learn about database indexes;
- Learn how database indexes are different in ClickHouse;
- Learn how to create and alter database indexes in ClickHouse;
- See how to monitor how database indexes are being used in our queries.
Introduction To Indexes
Indexes are database objects which can speed up queries by reducing the amount of data that has to be read from disk and processed.
They work in a similar way to indexes in a book. Instead of scanning the entire book (or database table) to find what you're looking for, you use the index to find the pages (or rows) where your data can be found and skip directly to them.
This can dramatically reduce the time taken to return query results and reduce the computational load on your database servers.
The Cost Of Indexes
Whilst speeding up our queries is of course desirable, there are some costs or downsides to using indexes.
Firstly, they require disk space as they are stored on disk as persistent objects. This can become significant if every row in multiple large tables is indexed.
Secondly, and more importantly, indexes slow down writes into your database because the indexees themslves have to be updated as rows are inserted or updated. As well as the extra wall-clock time for your writes, this also requires additional processing which could impact other concurrent users of the database.
Considering this, it is important to use indexes in a considered way and whilst being aware of their overhead. This is particularly true in analytical systems which typically contain large datasets with large batch updates.
Indexes In ClickHouse
ClickHouse has indexes, though they are slightly different to those used in other databases.
Sparse Indexes
ClickHouse indexes are sparse. This means that not every row in your database is indexed. Instead, it will index every granule, which is a fragment of your database that is typically 8192 rows long by default.
Sparse indexes are used over full indexes for two reasons.
Firstly, it means it keeps the index small enough to fit into memory. This makes lookups significant faster than needing to repeatedly visit the disk to traverse the index.
Secondly, the sparse index will need less frequent updates as data is inserted and as parts are merged.
Together, this means that the sparse index therefore strikes a balance between getting the benefits of an index but without the overhead.
No Secondary Indexes
Many database systems have the concept of a secondary or non-clustered index, which can be added to non-primary key fields to allow fast lookups.
This is not possible in ClickHouse due to it's column oriented nature. Instead, ClickHouse provides an alternate index structure known as the data skipping index.
Indexes In Relation To Parts and Granules
As we discuss in our course on ClickHouse internals, ClickHouse splits tables into parts. A ClickHouse part is a subset of
A ClickHouse part contains data for all of the columns in the table.
Each column is stored in a .bin file, whilst the corresponding index is stored in the .cmrk2 file. In the example below, we are showing the address column for a certain part of the uk_price_paid dataset.
-rw-r-----@ 1 benjaminwootton staff 13875573 13 Nov 16:23 addr1.bin
-rw-r-----@ 1 benjaminwootton staff 2568 13 Nov 16:23 addr1.cmrk2
Creating Indexes In ClickHouse
Indexes are created implicitly in ClickHouse when we create a table.
CREATE OR REPLACE TABLE people
(
id UInt64,
name String,
updated_at DateTime DEFAULT now()
)
ENGINE = MergeTree
PRIMARY KEY id;
Alternatively, if we don't specify a PRIMARY KEY clause, ClickHouse will use the ORDER BY clause:
CREATE OR REPLACE TABLE people
(
id UInt64,
name String,
updated_at DateTime DEFAULT now()
)
ENGINE = MergeTree
PRIMARY KEY id;
It is also possible to use a compound primary key or order by clause to order by mutliple fields in lexographical order.