In this lesson we will:
- Learn about the various mechanisms for ingesting data from local and remote static files.
Local Static Files
The simplest and most common type of ingestion we need to carry out is taking some local static file and loading the data contained within into our ClickHouse instance.
These static files could be in text-based formats such as CSV, JSON, or XML, or in binary formats such as Parquet or Apache Iceberg.
Regardless, our task is often to process these files and ingest the data into our ClickHouse instance.
ClickHouse Client
The simplest way to achieve this is to pipe data into the clickhouse-client command line interface in a very Unix-like way: we stream the file into the client and specify a query to handle the INSERT.
cat orders.json | clickhouse-client --query="INSERT INTO orders FORMAT JSONEachRow"
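The same pattern works for other text formats. As a sketch, assuming a hypothetical orders.csv whose header row matches the columns of the orders table:
# stream a CSV file, letting ClickHouse map columns by header name
cat orders.csv | clickhouse-client --query="INSERT INTO orders FORMAT CSVWithNames"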
HTTP API
As an alternative, data can also be pushed into ClickHouse over its HTTP interface (which listens on port 8123 by default) using a tool such as cURL:
echo '{"order_id":"342434"}' | curl 'http://localhost:8123/?query=INSERT%20INTO%20test%20FORMAT%20JSONEachRow' --data-binary @-
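If the data already lives in a file, cURL can stream it directly. As a sketch, assuming a hypothetical orders.json containing one JSON object per line and a matching orders table:
# POST the whole file as the body of the INSERT
curl 'http://localhost:8123/?query=INSERT%20INTO%20orders%20FORMAT%20JSONEachRow' --data-binary @orders.json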
SQL API with INFILE
Rather than piping data through the ClickHouse client, we can also ingest local files from within an interactive SQL session using the INFILE clause. This is arguably a cleaner way of achieving the same end result:
INSERT INTO hackernews FROM INFILE '/data/hackernews.csv' FORMAT CSVWithNames
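INFILE can also read compressed files. As a sketch, assuming a gzipped copy of the same file at /data/hackernews.csv.gz, the codec can be stated explicitly (ClickHouse can otherwise infer it from the file extension):
INSERT INTO hackernews FROM INFILE '/data/hackernews.csv.gz' COMPRESSION 'gzip' FORMAT CSVWithNames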
Binary Formats
With binary formats such as Parquet, a common approach is to express the import as SQL using the file() table function, specifying the file format:
CREATE TABLE imported_from_parquet
ENGINE = MergeTree
ORDER BY tuple() AS
SELECT *
FROM file('data.parquet', Parquet)
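The file() table function can also be queried directly, which is handy for checking the inferred schema or appending to an existing table. A minimal sketch, assuming a hypothetical second file new_data.parquet with the same columns:
-- inspect the schema ClickHouse derives from the Parquet metadata
DESCRIBE file('data.parquet', Parquet);
-- append rows to the table created above
INSERT INTO imported_from_parquet
SELECT * FROM file('new_data.parquet', Parquet);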
Remote Static Files
In the examples above we were working with files stored locally. It is also possible to ingest from a remote URL in a similar manner:
CREATE TABLE hackernews
ENGINE = MergeTree
ORDER BY tuple() AS
SELECT *
FROM url('https://mycompany.com/hackernews.csv.gz', 'CSVWithNames');
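If the target table already exists, the url() table function can likewise feed a plain INSERT. A sketch, assuming the hackernews table has already been created with a matching schema:
INSERT INTO hackernews
SELECT * FROM url('https://mycompany.com/hackernews.csv.gz', 'CSVWithNames');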