Publishing Data To Kafka

We recommend using a free ClickHouse Cloud Account when studying this lesson

Kafka Producers

Kafka Brokers have two types of clients - producers and consumers.

Kafka Producers are the processes which send data into Kafka topics, generating the source data streams that consumers will later process.

An example producer might be a process at a stock exchange publishing a price update each time a stock's value changes, or an e-commerce service sending a notification to other microservices each time a new order is placed.

A Kafka producer will typically be embedded in an application written in a language such as Java or JavaScript. This application will make use of a Kafka client library, which handles the connection and interaction with Kafka.

Though these client libraries are supplied with Kafka and tend to be used by Software Engineers, as Data Engineers we still need to understand what is happening in the producer process in order to optimise, use and support our data pipelines.

The kafka-console-producer script

The Kafka broker includes a script for producing ad-hoc messages and publishing them into Kafka topics from the terminal. This is a useful tool for debugging and testing purposes and will also aid our understanding here.

The script can be executed in the following way, specifying a bootstrap server (the hostname and port of the Kafka broker cluster), and a topic name to publish on:

cd ./bin/
./kafka-console-producer.sh --bootstrap-server localhost:9092 --topic new_pizza_orders

Once connected, we can then simply enter our messages directly into the console to publish them onto the topic:

ABC
123
{ "key" : "value" }

Keys

All messages in Kafka consist of a key and a value element. In the example above, we have specified a value, but not a key. The key on these messages will therefore be null.

To specify a key on our message, we need to tell the kafka-console-producer.sh script that we want to parse a key from our message, and which character to use to separate our keys and values. This can be achieved with the parse.key and key.separator properties.

cd ./bin/
./kafka-console-producer.sh --bootstrap-server localhost:9092 --topic new_pizza_orders --property "parse.key=true" --property "key.separator=:"

Once started, we can specify a key on the messages that we send.

1 : ABC
2 : 123
3 : { "key" : "value" }

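Based on our configuration, the producer treats everything before the first colon as the key and everything after it as the value. Only the first colon acts as the separator, which is why the colons inside the JSON of the third message do not cause a problem.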
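The splitting rule described above can be sketched in a few lines of Python. This is an illustration only, not the console producer's own code, and the function name split_key_value is hypothetical. The sketch does not trim whitespace, so the examples are entered without spaces around the separator.

```python
def split_key_value(line, separator=":"):
    """Split one console line into (key, value) at the FIRST separator."""
    key, found, value = line.partition(separator)
    if not found:
        # With key parsing switched on, a line without a separator has no key.
        raise ValueError(f"no key separator {separator!r} in line: {line!r}")
    return key, value


for line in ["1:ABC", "2:123", '3:{ "key" : "value" }']:
    print(split_key_value(line))
```

Because only the first separator is used, the JSON value in the third line survives intact even though it contains colons of its own.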

Partitions

As we introduced in our lesson on Kafka Concepts, Kafka topics can be split into partitions for additional throughput and better performance.

Application developers have a degree of control over the partitioning strategy, but when using the kafka-console-producer.sh script, a simple hash function is applied to the key to determine which partition a message is placed in.
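The idea of hashing a key to pick a partition can be sketched as follows. This is a minimal illustration under stated assumptions: the function name choose_partition is hypothetical, and CRC32 from the Python standard library is used as a stand-in hash (the Java client uses murmur2), so the partition numbers printed here will not match what Kafka itself would choose.

```python
import zlib


def choose_partition(key, num_partitions):
    """Map a message key to a partition index in [0, num_partitions)."""
    # Stand-in hash for illustration only; Kafka's Java client uses murmur2.
    key_hash = zlib.crc32(key.encode("utf-8"))
    return key_hash % num_partitions


for key in ["order-1", "order-2", "order-1"]:
    print(key, "->", choose_partition(key, num_partitions=3))
```

The property that matters is that the same key always lands on the same partition, which is what preserves ordering for all messages that share a key.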

Synchronous and Asynchronous Sends

The Kafka client libraries have two models for sending messages: synchronous and asynchronous.

In the synchronous model, each message is sent as soon as it is created. The source process is blocked whilst the client library interacts with the Kafka server to request the send.

In the asynchronous model, the Kafka client will take the message, potentially add it to a batch, and then send it later on. The source process is unblocked and able to continue its work without waiting for any interaction with the Kafka server.

Though the asynchronous model is faster and more efficient, it introduces a reliability concern: if the process ends before the buffer has been sent, the buffered messages are lost. Asynchronous code is also harder to develop, especially from the perspective of error handling.
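The difference between the two models can be sketched with a thread pool standing in for the client library. Everything here is hypothetical: deliver simulates the round trip to a broker with a short sleep, and the function names are not part of any Kafka API.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def deliver(message):
    """Hypothetical stand-in for the network round trip to a broker."""
    time.sleep(0.01)  # pretend network latency
    return f"ack:{message}"


def send_async(executor, message):
    """Hand the message over and return immediately with a Future."""
    return executor.submit(deliver, message)


def send_sync(executor, message):
    """Block the caller until the (simulated) broker has answered."""
    return send_async(executor, message).result()


def demo():
    with ThreadPoolExecutor(max_workers=1) as executor:
        sync_ack = send_sync(executor, "order-1")  # caller waits here
        pending = [send_async(executor, f"order-{i}") for i in range(2, 5)]
        # The caller is free to do other work before collecting the results.
        async_acks = [future.result() for future in pending]
    return sync_ack, async_acks


print(demo())
```

In this sketch a synchronous send is simply an asynchronous send whose result is awaited straight away. The reliability concern is also visible: if the program exited before the pending futures completed, those messages would never be delivered.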

The console producer provides settings for experimenting with asynchronous and batched sends:

cd ./bin/
./kafka-console-producer.sh --bootstrap-server localhost:9092 --topic new_pizza_orders --batch-size 100 --max-memory-bytes 1000 --max-partition-memory-bytes 1000

Batching

Rather than sending each message as soon as it is produced, we may wish to batch messages up in order to reduce the number of interactions with the Kafka server.

A maximum batch size can be set, along with parameters such as the maximum memory buffer, to ensure that batches do not grow in an unbounded way.

These can be specified when starting the console producer script:

cd ./bin/
./kafka-console-producer.sh --bootstrap-server localhost:9092 --topic new_pizza_orders --batch-size 100 --max-memory-bytes 1000 --max-partition-memory-bytes 1000

Note that batching is closely related to asynchronous sends. If we wish to batch our messages, we are implicitly allowing sends to be asynchronous.
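A bounded batching buffer of the kind described above might look like the following sketch. The class name BatchingBuffer and its parameters are hypothetical, and for simplicity the batch limit counts messages, whereas a real client may measure batches in bytes.

```python
class BatchingBuffer:
    """Collect messages and hand them on in batches (illustrative only)."""

    def __init__(self, send_batch, max_batch_size=3, max_buffer_bytes=1000):
        self.send_batch = send_batch              # callback that "sends" a batch
        self.max_batch_size = max_batch_size      # messages per batch
        self.max_buffer_bytes = max_buffer_bytes  # cap on buffered payload
        self.pending = []
        self.pending_bytes = 0

    def add(self, message):
        size = len(message.encode("utf-8"))
        # Flush first if this message would push the buffer over its byte cap.
        if self.pending and self.pending_bytes + size > self.max_buffer_bytes:
            self.flush()
        self.pending.append(message)
        self.pending_bytes += size
        if len(self.pending) >= self.max_batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.send_batch(list(self.pending))
            self.pending.clear()
            self.pending_bytes = 0


sent_batches = []
batcher = BatchingBuffer(send_batch=sent_batches.append, max_batch_size=3)
for i in range(7):
    batcher.add(f"order-{i}")
batcher.flush()  # send whatever is left before shutting down
print(sent_batches)
```

Seven messages result in three sends rather than seven. The final flush matters: without it, the last partial batch would still be sitting in the buffer when the process ends.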

Compression

Kafka producers also give us the option to compress data before it is sent. This is a tradeoff: the compression step takes more processing time at the producer, but the amount of data sent is smaller, so it takes less storage space and can be transported and consumed more efficiently.

By default, the client libraries and console producer script send messages uncompressed. Compression can be enabled with the --compression-codec parameter, which accepts codecs such as gzip, snappy, lz4 and zstd.

./kafka-console-producer.sh --bootstrap-server localhost:9092 --topic new_pizza_orders --compression-codec gzip
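To get a feel for the tradeoff, we can compress a batch of similar messages with Python's standard gzip module and compare sizes. The order records below are made-up sample data, and the exact byte counts will vary, so none are quoted here.

```python
import gzip
import json

# Hypothetical sample data: 100 similar pizza orders.
orders = [{"order_id": i, "pizza": "margherita", "size": "large"} for i in range(100)]
raw = "\n".join(json.dumps(order) for order in orders).encode("utf-8")

compressed = gzip.compress(raw)        # extra CPU work on the producer side
restored = gzip.decompress(compressed)  # extra CPU work on the consumer side

print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes")
print("round trip ok:", restored == raw)
```

Repetitive data such as JSON with recurring field names tends to compress well, and compressing a whole batch usually works better than compressing messages one at a time, which is one reason compression and batching are often tuned together.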

Retries

There may be situations where a message cannot be delivered, perhaps because of a temporary failure of the network or Kafka broker.

The Kafka producer gives us options to retry a number of times, and to back off by slowly increasing the time between attempts.

At some point, the producer would need to give up and assume permanent failure.

Again, these parameters can be experimented with using the kafka-console-producer.sh script.

./kafka-console-producer.sh --bootstrap-server localhost:9092 --topic new_pizza_orders --message-send-max-retries 5 --retry-backoff-ms 100
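The retry logic can be sketched as a small wrapper around a send function. All names here (TransientError, send_with_retries, flaky_send) are hypothetical, and doubling the wait after each failure is just one possible back-off policy, not necessarily the one a given Kafka client applies.

```python
import time


class TransientError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


def send_with_retries(send, message, max_retries=5, backoff_ms=100, sleep=time.sleep):
    """Call send(message); on a transient failure wait, then retry with a longer wait."""
    delay_ms = backoff_ms
    for attempt in range(max_retries + 1):
        try:
            return send(message)
        except TransientError:
            if attempt == max_retries:
                raise  # give up and treat the failure as permanent
            sleep(delay_ms / 1000)
            delay_ms *= 2  # back off: wait twice as long next time


# Demo with a fake send that fails twice and then succeeds.
attempts = []


def flaky_send(message):
    attempts.append(message)
    if len(attempts) < 3:
        raise TransientError("broker unavailable")
    return "ack"


waits = []  # record the waits instead of really sleeping
result = send_with_retries(flaky_send, "order-1", sleep=waits.append)
print(result, len(attempts), waits)
```

Passing the sleep function in as a parameter keeps the demo instant and makes the back-off schedule easy to inspect. Once max_retries is exhausted, the error is re-raised so the caller can decide what to do with the undelivered message.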

Acknowledgements

When the Kafka producer sends a message to the broker, it has a choice about the level of reliability.

We can simply send the message, then assume that it will work properly. Most of the time, this would be fine, but there is a risk that the broker could crash before the message is accepted. The message would then be lost without the producer ever knowing.

For more reliability, we can ask the producer to wait for an acknowledgement that the message has been fully accepted and committed.
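The two approaches can be contrasted with a toy simulation. FakeBroker and both send functions are hypothetical and exist only for this sketch; in a real producer the choice is made through the acks setting rather than by calling different functions.

```python
class FakeBroker:
    """Hypothetical in-memory broker that can be told to crash before storing."""

    def __init__(self):
        self.log = []            # messages that were safely stored
        self.crash_next = False  # set to True to lose the next message

    def receive(self, message):
        """Return True as an acknowledgement, or False if the message was lost."""
        if self.crash_next:
            self.crash_next = False
            return False
        self.log.append(message)
        return True


def send_fire_and_forget(broker, message):
    """Send and carry on; the outcome is never checked."""
    broker.receive(message)


def send_and_wait_for_ack(broker, message):
    """Send and raise if the broker did not acknowledge the message."""
    if not broker.receive(message):
        raise RuntimeError(f"no acknowledgement for {message!r}")


broker = FakeBroker()

broker.crash_next = True
send_fire_and_forget(broker, "order-1")       # lost, and nobody notices

broker.crash_next = True
try:
    send_and_wait_for_ack(broker, "order-2")  # the loss is detected...
except RuntimeError:
    send_and_wait_for_ack(broker, "order-2")  # ...so we can send it again

print(broker.log)
```

Only order-2 ends up in the broker's log. Waiting for the acknowledgement costs a round trip per send, but it is what gives the producer the chance to notice a failure and retry.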
