In this lesson we will:
- Discuss the challenges in building streaming solutions;
Why Is Working With Streaming Data Difficult?
As we discussed in previous lessons, moving from traditional batch approaches to real-time data streaming solutions is a challenging undertaking.
In this lesson, we will explain these challenges in more detail.
Scalability
Streaming platforms need to process and analyse high volumes of event data. A single stream can carry a high volume of events, and there are likely to be multiple streams generating data in parallel. An enterprise stream processing platform is therefore likely to need a very high degree of scalability to handle the volumes of data in flight and at rest.
Variance
The volume of events in a stream can rise and fall over time, and may spike during peak hours. Streaming platforms therefore need the capability to scale up and down dynamically to accommodate these changing workloads.
Latency
In streaming scenarios, businesses often benefit from responding to their event streams in real time. We therefore need to ingest, process and respond to the streams of events with low latency in order to extract maximum value from the data.
Exactly Once Processing
When working with event streams, it is important never to lose a message and never to send or process a message twice. We therefore need to build solutions that process messages with a high degree of reliability, even if a component in the stack were to fail.
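One common way to approximate this guarantee is to accept that a message may be delivered more than once and make the processor idempotent, so that a redelivered message has no further effect. The following is a minimal sketch of that idea, not a complete exactly-once implementation. The class name `IdempotentProcessor`, the message fields `id` and `amount`, and the in-memory set of seen IDs are all assumptions made for illustration; a real system would need to keep the seen IDs and the result in durable storage and update them atomically.

```python
class IdempotentProcessor:
    """Applies each message at most once, keyed by a unique message ID.

    Hypothetical example: state is held in memory only, so it would not
    survive a crash. Durable, atomic storage is assumed in a real system.
    """

    def __init__(self):
        self.seen_ids = set()  # IDs of messages already applied
        self.balance = 0       # the result we must not double count

    def handle(self, message):
        """Return True if the message was applied, False if it was a duplicate."""
        msg_id = message["id"]
        if msg_id in self.seen_ids:
            return False  # duplicate delivery: skip without side effects
        self.balance += message["amount"]
        self.seen_ids.add(msg_id)
        return True


processor = IdempotentProcessor()
deliveries = [
    {"id": "m1", "amount": 10},
    {"id": "m2", "amount": 5},
    {"id": "m1", "amount": 10},  # redelivered after a simulated failure
]
results = [processor.handle(m) for m in deliveries]
```

Here the third delivery is recognised as a repeat of `m1` and ignored, so the balance reflects each message exactly once even though one message arrived twice.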
Stateful Processing
It is relatively simple to develop stateless processors which filter, route, or add detail to events. However, the complexity grows when we want to look for historical patterns such as “3 failed credit card transactions in the last hour.” To do this, we need to process each event in the context of the events that came before it, which means maintaining state and adds significant complexity to the stack.
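To make the “3 failed credit card transactions in the last hour” example concrete, the sketch below keeps a sliding window of failure timestamps per card. It is illustrative only: the event fields `card`, `ts` (seconds) and `status`, and the class name `FailedTransactionDetector`, are assumptions, and the code assumes events arrive in timestamp order and that state fits in memory.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 3600  # "the last hour"
THRESHOLD = 3          # number of failures that triggers an alert


class FailedTransactionDetector:
    """Tracks recent failures per card (hypothetical, in-memory state)."""

    def __init__(self):
        # card -> timestamps of failed transactions still inside the window
        self.failures = defaultdict(deque)

    def process(self, event):
        """Return True if this event brings the card to the threshold."""
        if event["status"] != "failed":
            return False
        window = self.failures[event["card"]]
        window.append(event["ts"])
        # Evict failures that are an hour old or more relative to this event.
        while window and window[0] <= event["ts"] - WINDOW_SECONDS:
            window.popleft()
        return len(window) >= THRESHOLD


detector = FailedTransactionDetector()
events = [
    {"card": "A", "ts": 0, "status": "failed"},
    {"card": "A", "ts": 1000, "status": "failed"},
    {"card": "B", "ts": 1500, "status": "failed"},
    {"card": "A", "ts": 1800, "status": "ok"},
    {"card": "A", "ts": 2000, "status": "failed"},  # third failure within the hour
    {"card": "A", "ts": 5000, "status": "failed"},  # earlier failures have expired
]
alerts = [detector.process(e) for e in events]
```

Even this small example has to decide how state is keyed, when it expires, and what happens if events arrive out of order, which hints at why stateful processing is harder than stateless filtering.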
Time Semantics
The notion of time becomes complex in event processing. Do we care about the time the event happened, the time it was received by the processor, or the time it was stored in the database? In most scenarios, event time is the natural choice, but then we need correct semantics to ensure that we are using the state of the world at the time in question when we come to process the event.
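The difference between these notions of time shows up as soon as events are grouped into windows. The sketch below counts the same events in one-minute windows twice, once by the time each event happened and once by the time it reached the processor. The field names `event_time` and `processing_time` and the sample values are assumptions chosen for illustration.

```python
from collections import Counter

WINDOW_SECONDS = 60


def window_start(ts):
    """Return the start of the one-minute window that contains ts."""
    return ts - (ts % WINDOW_SECONDS)


# Hypothetical events; times are in seconds.
events = [
    {"id": "a", "event_time": 10, "processing_time": 12},
    {"id": "b", "event_time": 55, "processing_time": 58},
    {"id": "c", "event_time": 50, "processing_time": 65},  # delayed in transit
    {"id": "d", "event_time": 70, "processing_time": 71},
]

# Count events per window under each notion of time.
by_event_time = Counter(window_start(e["event_time"]) for e in events)
by_processing_time = Counter(window_start(e["processing_time"]) for e in events)
```

Event `c` happened in the first minute but arrived in the second, so the two groupings disagree about which window it belongs to. Choosing event time gives the answer that matches what happened, at the cost of having to handle late arrivals.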
Security
It is important to maintain complete security around personally identifiable and commercially sensitive data. We need to encrypt all data, both in flight and at rest, as it moves through the various message queues and processors. This repeated encryption and decryption increases latency and adds to the operational burden of managing the system.