In this lesson we will:
- Look at the challenge of ingesting data into ClickHouse from external systems and datastores
- Explore how data formats, batch sizes, and streaming platforms such as Kafka affect ingestion
- Consider whether to build ingestion processes ourselves or use third-party ETL tools
Ingesting Data
The first thing we need to do with a new ClickHouse instance is load, or ingest, some data into it.
Of course, everyone is familiar with the trusty INSERT statement, which can be issued at the SQL prompt. In the real world, however, the main task is more likely to involve taking extracts from external systems and datastores and loading them into our ClickHouse instance.
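For reference, here is what that direct, SQL-prompt style of ingestion looks like. The `events` table and its rows are hypothetical, purely for illustration:

```sql
-- A hypothetical table to receive our data.
CREATE TABLE events
(
    event_time DateTime,
    user_id    UInt64,
    event_type String
)
ENGINE = MergeTree
ORDER BY event_time;

-- The trusty INSERT statement, issued directly at the SQL prompt.
INSERT INTO events (event_time, user_id, event_type) VALUES
    ('2024-01-15 10:00:00', 42, 'page_view'),
    ('2024-01-15 10:00:05', 43, 'click');
```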
Real-world ingestion can be a complex and messy exercise. Data might arrive in a variety of inconvenient formats that we have to parse before we can bring it into ClickHouse. It may need cleaning and transforming into more appropriate shapes, and it may contain errors that we need to handle.
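As a sketch of what file-based ingestion with light parsing and cleaning might look like, run via clickhouse-client (the file paths, column names, and error threshold below are all assumptions for illustration):

```sql
-- Tolerate up to 10 malformed rows rather than failing the whole load.
SET input_format_allow_errors_num = 10;

-- Load a local CSV extract directly, letting ClickHouse parse the format.
INSERT INTO events
FROM INFILE '/data/extracts/events.csv'
FORMAT CSVWithNames;

-- Alternatively, clean and transform on the way in with the server-side
-- file() table function (path relative to the server's user_files directory).
INSERT INTO events
SELECT
    parseDateTimeBestEffort(raw_time) AS event_time,
    toUInt64OrZero(user_id)           AS user_id,
    lower(event_type)                 AS event_type
FROM file('extracts/events.csv', 'CSVWithNames',
          'raw_time String, user_id String, event_type String');
```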
Data could arrive in infrequent, very large batches, or in very frequent small ones. Increasingly, it may also be streamed in real time over APIs and platforms such as Kafka.
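For the streaming case, pairing ClickHouse's Kafka table engine with a materialized view is a common pattern. The broker, topic, and consumer group names below are hypothetical:

```sql
-- A Kafka-engine table acts as a consumer of the topic; it holds no data
-- itself but exposes incoming messages as rows.
CREATE TABLE events_queue
(
    event_time DateTime,
    user_id    UInt64,
    event_type String
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list  = 'events',
         kafka_group_name  = 'clickhouse_events_consumer',
         kafka_format      = 'JSONEachRow';

-- A materialized view continuously drains the queue into the MergeTree table.
CREATE MATERIALIZED VIEW events_consumer TO events AS
SELECT event_time, user_id, event_type
FROM events_queue;
```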
These ingestion processes also usually have to run continually as new data is generated. This means they need to be automated and monitored for errors on an ongoing basis, and we have to handle situations such as retries and late-arriving data.
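One simple way to keep an eye on such pipelines from within ClickHouse itself is to watch for failed INSERTs in the system.query_log table. This assumes query logging is enabled on the server, and the one-day window is arbitrary:

```sql
-- Failed INSERT queries recorded over the last day.
SELECT
    event_time,
    query,
    exception
FROM system.query_log
WHERE type = 'ExceptionWhileProcessing'
  AND query_kind = 'Insert'
  AND event_time > now() - INTERVAL 1 DAY
ORDER BY event_time DESC;
```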
As ClickHouse developers, we have a choice: we can take this process on ourselves, or we can reach for third-party ETL tools to manage some of this complexity for us.