AWS Part 3: Kinesis
Previously, we analyzed serverless functions and message queues and their use in modern-day applications. We will now look at message queues used specifically for collecting, processing, and analyzing real-time streaming data, which are often called message brokers. These are an integral component for applications such as high-traffic websites, stock exchanges, and web advertising services.
Why would we need to use message brokers for streaming data?
Similar to message queues, message brokers are very good at ingesting messages quickly. They act as a centralized store or processor for those messages, so that other applications or users can work with the messages at their own pace, without the risk of the ingesting service running out of memory.
For example, say you have an application connected to a WebSocket data stream that sends thousands of events per second. Stock market feeds and user-click events operate at this level of throughput. If you tried to write this data to disk, run heavy transformations on it, or send it over the network, performing those tasks would likely be slower than the rate of the incoming data.
If this data is kept in memory, your server will eventually exhaust available memory and crash. In this scenario, you may want to decouple time-intensive data processing from the handling of incoming data using a message broker. You need a middleman, like Kafka or Kinesis, to ingest the incoming events quickly and allow consumers to handle their own consumption of events at their own pace.
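The decoupling described above can be sketched in a few lines. This is a minimal, in-process model of the broker pattern (a real broker like Kafka or Kinesis plays the role of this buffer, but durable and distributed); the function names are illustrative, not any AWS API.

```python
from collections import deque

buffer = deque()

def handle_incoming(event):
    # Fast path: just enqueue. No disk writes, transforms, or network calls,
    # so this keeps up with a high-throughput stream.
    buffer.append(event)

def consume(batch_size):
    # Slow path: the consumer pulls a batch whenever it is ready.
    batch = []
    while buffer and len(batch) < batch_size:
        batch.append(buffer.popleft())
    return batch

for i in range(5):
    handle_incoming({"event_id": i})

print(consume(batch_size=3))  # the first three events, in order
print(len(buffer))            # 2 — two events still buffered
```

The key property is that `handle_incoming` never blocks on slow work; the consumer's pace only affects how much sits in the buffer.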
Your old architecture could look like this:
Each request from the monolith takes time and when aggregated might result in event handling times that are significantly slower than the speed of the events coming in. So using message brokers, you could move to an architecture like this:
With the monolith decoupled from the database, the data science application, and Elasticsearch, its only responsibility is to send messages to the message broker, which greatly reduces its risk of falling over.
With all of these problems in mind, let's take a look at AWS Kinesis.
AWS Kinesis. What is it?
Amazon Kinesis Data Streams is a data store that enables you to build applications that process or analyze streaming data for specialized needs.
But really it’s just another data store like SQS, with a few key differences. By default Kinesis provides:
- Ordering of records
- Potentially higher throughput than SQS by using multiple shards
- The ability for multiple consumer applications to read and replay ordered messages
- The ability to divide messages by category or consumer using partition keys
Kinesis tech terms
Here is some technical vocabulary that will be useful for the rest of this post.
- What is a stream? A Kinesis data stream is a set of shards. There can be multiple consumer applications for one stream, and each application can consume data independently and concurrently.
- What is a shard? A shard is a uniquely identified sequence of data records in a stream. A shard serves two purposes: It provides a certain amount of capacity or throughput, and it keeps messages ordered within the shard.
- Producers can send messages to a specific shard using a partition key: all messages with the same partition key go to the same shard, but a single shard can serve multiple partition keys.
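The routing above can be modeled in a few lines. Kinesis MD5-hashes the partition key to a 128-bit integer and routes the record to the shard whose hash key range contains it; this sketch splits the hash space evenly across shards (real shards carry explicit `StartingHashKey`/`EndingHashKey` ranges).

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4
HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit hash key

def shard_for(partition_key: str) -> int:
    # Simplified model: hash the key, then map it into one of
    # NUM_SHARDS equal slices of the 128-bit hash space.
    hash_key = int.from_bytes(hashlib.md5(partition_key.encode()).digest(), "big")
    return hash_key * NUM_SHARDS // HASH_SPACE

# The same key always routes to the same shard...
assert shard_for("user-42") == shard_for("user-42")

# ...but one shard typically serves many different keys.
counts = Counter(shard_for(f"user-{i}") for i in range(1000))
print(dict(counts))  # roughly even spread across the 4 shards
```

This is why per-key ordering holds: everything for `user-42` lands on one shard, and ordering is guaranteed within a shard.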
If you're familiar with Kafka, a stream is equivalent to a topic, and a shard is equivalent to a partition. In Kinesis, data is stored in shards. In Kafka, data is stored in partitions. However, Apache Kafka requires extra effort to set up, manage, and support. If your organization lacks Apache Kafka experts or human support, then choosing a fully managed AWS Kinesis service will let you focus on development rather than operations.
Why use Kinesis?
Kinesis is designed for large-scale data ingestion and processing, with the ability to maximize write throughput for large volumes of data.
Some use cases include:
- Log and event data collection
- Real-time data (stock exchanges, mobile apps, video games, etc.)
An application that has to handle real-time data streams can send the data to Kinesis, which acts as a data buffer for downstream consumers. This decoupling allows downstream consumers to process the data at a slower rate than the stream's throughput, without the risk of the stream-handling system running out of memory and crashing.
Why use Kinesis over SQS?
- Kinesis allows multiple consumers to read from the same shard concurrently. So a shard containing user sign-up events can serve different consumers that store the data, feed data science models, or fulfill any other purpose. How does it do this? Each Kinesis consumer maintains its own shard iterator, which allows it to pick up reading messages from where it left off without having to delete the message.
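A toy model makes this concrete: treat the shard as an ordered, append-only log, and give each consumer its own checkpoint into it. Reading never mutates the log, so any number of consumers can share it. The class and names here are illustrative, not the AWS API.

```python
shard = [f"event-{i}" for i in range(5)]  # ordered, immutable log

class Consumer:
    def __init__(self):
        self.checkpoint = 0  # this consumer's position in the shard

    def read(self, limit):
        # Return the next `limit` records and advance only our own checkpoint.
        records = shard[self.checkpoint : self.checkpoint + limit]
        self.checkpoint += len(records)
        return records

warehouse = Consumer()
analytics = Consumer()

print(warehouse.read(limit=5))  # reads everything
print(analytics.read(limit=2))  # independently reads the first two
print(analytics.read(limit=2))  # picks up exactly where it left off
print(shard)                    # the log itself is untouched
```

Replay falls out of the same design: resetting a consumer's checkpoint to 0 re-reads the whole shard, since nothing was ever deleted.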
- SQS, on the other hand, does not use iterators or any kind of checkpoint, so the only way to ensure that a consumer doesn’t process a message twice is by deleting it. This prevents multiple consumers from reading the same SQS queue; it just wasn’t built for that.
- The workaround would be to have a different SQS queue for each consumer. Having multiple queues presents a problem: it becomes more difficult to orchestrate the producers. Each producer must now understand the context of the message it is sending, so that it can submit the message to the correct queue or queues.
- Kinesis offers replay out of the box, even with multiple consumers. In Kinesis, consumers do not need to delete messages when they are done processing, so replay is possible.
- Because SQS consumers have to delete messages to complete the processing loop, SQS does not offer replay. If you need to support message replay, you will need to write messages to an alternate store as they are published, and have a mechanism to allow interested consumers to replay that history.
- For Kinesis, if your application must process all messages in order, then you must use only one shard. Think of it as a line at a bank — if there is one line, then everybody gets served in order. If you use multiple shards, ordering is guaranteed only within each shard. Alternatively, use an SQS FIFO queue.
- If you need higher write throughput, use Kinesis.
- From the SQS docs: “By default, SQS FIFO queues support up to 3,000 messages per second with batching, or up to 300 messages per second (300 send, receive, or delete operations per second) without batching.”
- From the Kinesis docs: One shard provides a capacity of 1 MB/sec data input and 2 MB/sec data output, and supports up to 1,000 PUT records per second. For example, a data stream with two shards has a throughput of 2 MB/sec data input and 4 MB/sec data output, and allows up to 2,000 PUT records per second.
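These limits translate directly into a back-of-the-envelope shard count: take the largest of the write-volume, record-count, and read-volume requirements. A quick sketch (the helper function is our own, not an AWS API):

```python
import math

def shards_needed(write_mb_per_sec, records_per_sec, read_mb_per_sec):
    # A stream needs enough shards to satisfy all three per-shard limits.
    return max(
        math.ceil(write_mb_per_sec / 1.0),   # 1 MB/sec write per shard
        math.ceil(records_per_sec / 1000),   # 1,000 PUT records/sec per shard
        math.ceil(read_mb_per_sec / 2.0),    # 2 MB/sec read per shard
    )

# e.g. 5,000 records/sec averaging 0.5 KB each, one consumer reading it all:
write_mb = 5000 * 0.5 / 1024  # about 2.44 MB/sec
print(shards_needed(write_mb, 5000, write_mb))  # 5 — the record-count limit dominates
```

Note that with many records that are individually small, the 1,000 records/sec limit often binds before the bandwidth limits do.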
- Kinesis solves the problem of mapping data in a typical map-reduce scenario for streaming data, while SQS does not. If you have streaming data that needs to be aggregated in some way, Kinesis makes sure that all the data associated with a specific key goes to a specific shard, and the shard can be consumed by a single consumer, making the aggregation on key easier compared to SQS.
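The aggregation win is easy to see in code: because every record with the same partition key lands on the same shard, the single consumer of that shard can aggregate per key in local memory, with no cross-consumer coordination. The records below are made up for illustration.

```python
from collections import defaultdict

# Records as one shard's consumer would see them: all records for a
# given user_id (the partition key) arrive on this shard, in order.
shard_records = [
    {"user_id": "alice", "clicks": 2},
    {"user_id": "bob", "clicks": 1},
    {"user_id": "alice", "clicks": 3},
]

totals = defaultdict(int)
for record in shard_records:
    totals[record["user_id"]] += record["clicks"]

print(dict(totals))  # {'alice': 5, 'bob': 1}
```

With SQS, records for "alice" could be delivered to any of several competing consumers, so the same aggregation would need a shared store or a shuffle step.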
Examples of when to use Kinesis or SQS:
- Kinesis: If you want to keep track of actions by many users, it makes sense to select the user ID as the partition key so that each user’s actions are on a single shard. (Note: A shard may still have actions of several users.)
- Kinesis: An e-commerce company wants to create a stream of orders grouped by destination, so it uses the destination as the partition key. All orders bound for the same destination land on the same shard and can be delivered together as a single optimized delivery.
- SQS: An e-commerce company wants a queue of orders, where each order is simply removed from the queue once it starts being processed.