Amazon Kinesis
This is not exhaustive documentation of all existing AWS services. These are summarized notes that I used for the AWS certifications.
To see the complete documentation, please go to: AWS documentation
Overview
It is a big data streaming tool that lets us stream application logs, metrics, IoT data, clickstreams, etc.
- Kinesis is a managed alternative to Apache Kafka.
- Compatible with many streaming frameworks (Spark, NiFi, etc.).
- Data is automatically replicated to 3 AZs.
Kinesis offers 3 types of products:
- Kinesis Streams: Low latency streaming ingest at scale
- Kinesis Analytics: Perform real-time analytics on streams using SQL
- Kinesis Firehose: Load streams into S3, Redshift, ElasticSearch
Kinesis Streams
Features:
- Streams are divided in ordered Shards/Partitions.
- For higher throughput we can increase the number of shards.
- Data retention is 1 day by default, can go up to 365 days.
- Kinesis Streams provides the ability to reprocess/replay the data.
- Multiple applications can consume the same stream, which enables real-time processing with scalable throughput.
- Kinesis is not a database: once data is inserted, it cannot be deleted. It only expires when the retention period ends.
Kinesis Stream Shards
- One stream is made of many different shards.
- 1 MB/s or 1,000 messages per second at write PER SHARD.
- 2 MB/s at read PER SHARD.
- Billing is done per provisioned shard; we can have as many shards as we want as long as we accept the cost.
- Ability to batch messages per call.
- The number of shards can evolve over time (reshard/merge).
- Records are ordered per shard!
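The per-shard limits above translate into a simple sizing calculation. The following is a minimal sketch, assuming the limits exactly as stated in these notes; the function name `shards_needed` and its parameters are made up for illustration and are not part of any AWS SDK.

```python
import math

# Per-shard limits as stated in the notes above.
WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0


def shards_needed(write_mb_per_s: float, records_per_s: float, read_mb_per_s: float) -> int:
    """Return the smallest shard count that satisfies all three per-shard limits."""
    by_write_bytes = math.ceil(write_mb_per_s / WRITE_MB_PER_SHARD)
    by_write_count = math.ceil(records_per_s / WRITE_RECORDS_PER_SHARD)
    by_read_bytes = math.ceil(read_mb_per_s / READ_MB_PER_SHARD)
    # A stream always has at least one shard; the tightest limit decides.
    return max(1, by_write_bytes, by_write_count, by_read_bytes)
```

For example, a workload writing 5 MB/s as 3,000 records per second and reading 8 MB/s is bound by write bandwidth, so `shards_needed(5, 3000, 8)` returns 5.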
Kinesis Firehose
- Fully managed service, no administration required, provides automatic scaling, it is basically serverless.
- Used to load data into Redshift, S3, ElasticSearch and Splunk.
- It is near real time: data is delivered in batches, with a minimum latency of 60 seconds for non-full batches, or as soon as a batch of at least 32 MB has accumulated.
- Supports many data formats, conversions, transformation and compression.
- Pay for the amount of data going through Firehose.
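The "near real time" behaviour comes from buffering: a batch is delivered when either a size threshold or a time threshold is reached, whichever comes first. The toy model below illustrates that rule only. It is not how Firehose is implemented; the class name `BufferSketch` and its parameters are hypothetical, and the time threshold is only checked when a record arrives (a real service would also flush on a timer).

```python
class BufferSketch:
    """Toy model of size-or-time buffering; not the Firehose implementation."""

    def __init__(self, max_bytes: int, max_age_s: float):
        self.max_bytes = max_bytes
        self.max_age_s = max_age_s
        self.records = []
        self.size = 0
        self.opened_at = None  # arrival time of the first record in the current batch

    def add(self, record: bytes, now: float):
        """Buffer a record; return the flushed batch if a threshold was hit, else None."""
        if self.opened_at is None:
            self.opened_at = now
        self.records.append(record)
        self.size += len(record)
        if self.size >= self.max_bytes or now - self.opened_at >= self.max_age_s:
            return self.flush()
        return None

    def flush(self):
        """Hand back the current batch and start a new, empty one."""
        batch, self.records = self.records, []
        self.size = 0
        self.opened_at = None
        return batch
```

With the figures from these notes, this would be constructed as `BufferSketch(max_bytes=32 * 1024 * 1024, max_age_s=60)`; the clock is passed in explicitly so the behaviour is easy to reason about.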
Kinesis Data Streams vs. Firehose
- Streams:
  - Requires writing custom code (producer/consumer).
  - Real time (around 200 ms).
  - Must manage scaling (shard splitting / merging).
  - Can store data in the stream; data can be retained from 1 to 365 days.
  - Data can be read by multiple consumers.
- Firehose:
  - Fully managed, sends data to S3, Redshift, Splunk, ElasticSearch.
  - Serverless, data transformation can be done with Lambda.
  - Near real time.
  - Scales automatically.
  - It provides no data storage.
Kinesis Data Analytics
- Can take data from Kinesis Data Streams and Kinesis Firehose and perform some queries on it.
- It can perform real-time analytics using SQL.
- Kinesis Data Analytics properties:
  - Automatically scales
  - Managed: no servers to provision
  - Continuous: analytics are done in real time
  - Pricing: pay per consumption rate
- It can create streams out of real-time queries.
AWS Kinesis API
Put Records
- Data must be sent with a partition key through the PutRecord/PutRecords API.
- Data with the same key goes to the same partition (helps with ordering for a specific key).
- Messages sent get a sequence number.
- The partition key must be highly distributed in order to avoid hot partitions.
- In order to reduce costs, we can use batching with the PutRecords API.
- If the limits are reached, we get a ProvisionedThroughputExceededException.
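To see why the choice of partition key matters, here is a small sketch of how a key ends up on a shard. Kinesis hashes the partition key with MD5 into a 128-bit value and assigns the record to the shard whose hash key range contains it. The sketch assumes the hash space is split into equal-sized ranges, one per shard, which may not hold after resharding; `shard_for_key` and `shard_load` are illustrative names, not SDK functions.

```python
import hashlib
from collections import Counter

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit value


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index, assuming equal-sized hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    # Scale the 128-bit hash down to an index in [0, shard_count).
    return hash_value * shard_count // HASH_SPACE


def shard_load(keys, shard_count: int) -> Counter:
    """Count how many records land on each shard for a sequence of keys."""
    return Counter(shard_for_key(key, shard_count) for key in keys)
```

Calling `shard_load(["user_123"] * 100, 4)` puts all 100 records on a single shard (a hot partition), while keys such as `user_0`, `user_1`, ... spread across the shards.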
Exceptions
- ProvisionedThroughputExceededException:
  - Happens when the data sent exceeds the throughput limit of a shard.
  - To avoid it, we have to make sure we don't have hot partitions.
  - Solutions:
    - Retry with back-off.
    - Increase shards (scaling).
    - Ensure the partition key is well distributed.
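The "retry with back-off" solution can be sketched as follows. This is a generic exponential back-off with jitter, not code from an AWS SDK: `ThroughputExceeded` is a hypothetical stand-in for the throttling error, and `send` is any caller-supplied function that writes one record.

```python
import random
import time


class ThroughputExceeded(Exception):
    """Hypothetical stand-in for the throughput-exceeded error raised by an SDK."""


def put_with_backoff(send, record, max_attempts=5, base_delay_s=0.1, sleep=time.sleep):
    """Call send(record), retrying with exponential back-off and jitter on throttling."""
    for attempt in range(max_attempts):
        try:
            return send(record)
        except ThroughputExceeded:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Wait up to base * 2^attempt seconds; jitter avoids synchronized retries.
            sleep(random.uniform(0, base_delay_s * 2 ** attempt))
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` applies. Back-off helps with short bursts, but sustained throttling still calls for more shards or a better partition key.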
Consumers
- Consumers can use the CLI or SDK, or the Kinesis Client Library (in Java, Node, Python, Ruby, .NET).
- Kinesis Client Library (KCL) uses DynamoDB to checkpoint offsets.
- KCL uses DynamoDB to track other workers and share work amongst shards.
AWS Kinesis CLI
Sending data:
aws kinesis help
aws kinesis list-streams help
aws kinesis list-streams
aws kinesis describe-stream help
aws kinesis describe-stream --stream-name <name>
## example
## (AWS CLI v2 treats --data as base64 by default; add --cli-binary-format raw-in-base64-out to send plain text)
aws kinesis put-record --stream-name eden-stream-1 --data "Hello world. Please sign up." --partition-key user_123
aws kinesis put-record --stream-name eden-stream-1 --data "User has signed up." --partition-key user_123
aws kinesis put-record --stream-name eden-stream-1 --data "Registration complete." --partition-key user_123
Retrieving records:
aws kinesis help
aws kinesis get-shard-iterator help
Retrieves a shard iterator for a specified shard:
aws kinesis get-shard-iterator --stream-name <name> --shard-id <shard id> --shard-iterator-type TRIM_HORIZON
Displays usage instructions, options, and examples:
aws kinesis get-records help
Security
- Control access/authorization using IAM policies.
- Encryption in flight using HTTPS endpoints.
- Encryption at rest using KMS.
- Possibility to encrypt/decrypt data client side.
- VPC Endpoints are available so that Kinesis can be accessed from within a VPC.
Ordering data into Kinesis
- Data with the same partition key goes to the same shard.
- Data is ordered per shard.