
Amazon Kinesis

Updated Jul 26, 2020
NOTES

This is not an exhaustive documentation of all the existing AWS Services. These are summarized notes that I used for the AWS Certifications.

To see the complete documentation, please go to: AWS documentation

Overview

Kinesis is a big data streaming tool that allows you to stream application logs, metrics, IoT data, clickstreams, etc.

  • Kinesis is a managed alternative to Apache Kafka

  • Compatible with many streaming frameworks (Spark, NiFi, etc.)

  • Data is automatically replicated across 3 Availability Zones (AZs)

Kinesis offers 3 types of products:

  • Kinesis Streams: Low latency streaming ingest at scale
  • Kinesis Analytics: Perform real-time analytics on streams using SQL
  • Kinesis Firehose: Load streams into S3, Redshift, Elasticsearch

Kinesis Streams

Features:

  • Streams are divided into ordered Shards/Partitions.
  • For higher throughput, we increase the number of shards.
  • Data retention is 1 day by default and can be extended up to 365 days.
  • Kinesis Streams provides the ability to reprocess/replay the data.
  • Multiple applications can consume the same stream, which enables real-time processing at scale.
  • Kinesis is not a database: once data is inserted, it cannot be deleted (it expires after the retention period).

Kinesis Stream Shards

  • One stream is made of many different shards.

  • 1 MB/s or 1,000 messages/s write PER SHARD.

  • 2 MB/s read PER SHARD (shared across all consumers).

  • Billing is done per provisioned shard; we can have as many shards as we want, as long as we accept the cost.

  • Ability to batch messages per call.

  • The number of shards can evolve over time (shard splitting / merging).

  • Records are ordered per shard!
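The per-shard limits above determine how many shards a stream needs. A minimal sizing sketch, assuming the limits listed above (the constants and the `shards_needed` helper are illustrative, not an AWS API):

```python
import math

# Per-shard limits from the notes above.
WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0

def shards_needed(write_mb_s: float, write_records_s: int, read_mb_s: float) -> int:
    """Return the minimum shard count that satisfies all three limits."""
    return max(
        math.ceil(write_mb_s / WRITE_MB_PER_SHARD),
        math.ceil(write_records_s / WRITE_RECORDS_PER_SHARD),
        math.ceil(read_mb_s / READ_MB_PER_SHARD),
    )

# e.g. 5 MB/s in, 3,000 records/s, 10 MB/s out -> max(5, 3, 5) = 5 shards
print(shards_needed(5, 3000, 10))
```

Since billing is per provisioned shard, this minimum is also the cost floor for a given throughput.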

Kinesis Firehose

  • Fully managed service: no administration required, automatic scaling; it is basically serverless.
  • Used to load data into Redshift, S3, Elasticsearch and Splunk.
  • It is Near Real Time: 60 seconds minimum latency for non-full batches, or a minimum of 32 MB of data at a time.
  • Supports many data formats, conversions, transformations and compression.
  • Pay for the amount of data going through Firehose.
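The "60 seconds or 32 MB" behavior above is a buffer that flushes on whichever threshold is hit first. A toy model of that buffering logic (the `Buffer` class is illustrative; the thresholds mirror the numbers in the notes):

```python
class Buffer:
    """Toy model of Firehose-style buffering: flush when either the
    size threshold or the time threshold is reached, whichever is first."""

    def __init__(self, max_bytes=32 * 1024 * 1024, max_seconds=60):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.bytes = 0      # bytes accumulated in the current batch
        self.age = 0.0      # seconds elapsed since the batch started

    def add(self, record_bytes: int, elapsed_seconds: float) -> bool:
        """Add a record; return True when the batch should be flushed."""
        self.bytes += record_bytes
        self.age += elapsed_seconds
        if self.bytes >= self.max_bytes or self.age >= self.max_seconds:
            self.bytes, self.age = 0, 0.0   # start a new batch
            return True
        return False
```

This is why Firehose is "near real time" rather than real time: a record can sit in the buffer until one of the two thresholds triggers delivery.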

Kinesis Data Streams vs. Firehose

  • Streams:

    • Requires writing custom code (producer/consumer).
    • Real time (around 200 ms).
    • Must manage scaling (shard splitting / merging).
    • Can store data in the stream; data is retained from 1 day up to 365 days.
    • Data can be read by multiple consumers.
  • Firehose:

    • Fully managed, sends data to S3, Redshift, Splunk, Elasticsearch.

    • Serverless, data transformation can be done with Lambda.

    • Near real time.

    • Scales automatically.

    • It provides no data storage.

Kinesis Data Analytics

  • Can take data from Kinesis Data Streams and Kinesis Firehose and perform some queries on it.
  • It can perform real-time analytics using SQL.
  • Kinesis Data Analytics properties:
    • Automatically scales
    • Managed: no servers to provision
    • Continuous: analytics are done in real time
  • Pricing: pay per consumption rate
  • It can create streams out of real-time queries.

AWS Kinesis API

Put Records

  • Data is sent via the PutRecord API along with a partition key.

  • Data with the same key goes to the same partition (helps with ordering for a specific key).

  • Messages sent get a sequence number.

  • Partition key must be highly distributed in order to avoid hot partitions.

  • In order to reduce costs, we can use batching with PutRecords API.

  • If the limits are exceeded, we get a ProvisionedThroughputExceededException.
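The key-to-shard routing above works by hashing: Kinesis applies MD5 to the partition key, producing a 128-bit hash key, and routes the record to the shard whose hash-key range contains it. A sketch of that routing, assuming the shards evenly split the key space (the default when a stream is created; `shard_for_key` is an illustrative helper, not an SDK call):

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index via MD5, as Kinesis does."""
    # MD5 digest interpreted as an integer in the 128-bit key space.
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    # Assume evenly split ranges; cap to handle rounding at the top range.
    range_size = 2 ** 128 // num_shards
    return min(hash_key // range_size, num_shards - 1)

# The same key always lands on the same shard, so per-key ordering holds.
assert shard_for_key("user_123", 4) == shard_for_key("user_123", 4)
```

This also illustrates the hot-partition warning: if many records share one partition key, they all hash to the same shard and can exceed that shard's limits while other shards sit idle.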

Exceptions

  • ProvisionedThroughputExceededException:
    • Happens when writes exceed the throughput limits of a shard.
    • To avoid it, we have to make sure we don't have hot partitions.
  • Solutions:
    • Retry with back-off.
    • Increase shards (scaling).
    • Ensure the partition key is well distributed.
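The "retry with back-off" solution can be sketched as follows (a minimal sketch: `ThroughputExceeded` and `put_with_backoff` are illustrative stand-ins, not the real SDK exception or API):

```python
import random
import time

class ThroughputExceeded(Exception):
    """Stand-in for the SDK's ProvisionedThroughputExceededException."""

def put_with_backoff(put_fn, record, max_retries=5, base_delay=0.05):
    """Call put_fn(record), retrying with exponential backoff on throttling."""
    for attempt in range(max_retries):
        try:
            return put_fn(record)
        except ThroughputExceeded:
            # Exponential backoff with jitter: base, 2x base, 4x base, ...
            time.sleep((2 ** attempt) * base_delay * random.uniform(0.5, 1.0))
    raise RuntimeError(f"gave up after {max_retries} retries")
```

The jitter spreads retries from many throttled producers over time, so they don't all hammer the shard again at the same instant.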

Consumers

  • Consumers can use the CLI or SDK, or the Kinesis Client Library (in Java, Node, Python, Ruby, .NET).

  • Kinesis Client Library (KCL) uses DynamoDB to checkpoint offsets.

  • KCL uses DynamoDB to track other workers and share work amongst shards.
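The checkpointing idea above can be sketched with a dict standing in for the DynamoDB table KCL uses (all names here are illustrative, not the real KCL API):

```python
# shard_id -> last processed sequence number (DynamoDB table in real KCL)
checkpoints = {}
processed = []   # application output, for demonstration

def handle(data):
    processed.append(data)           # application logic would go here

def process_records(shard_id, records):
    """records: list of (sequence_number, data) pairs read from one shard."""
    for seq, data in records:
        handle(data)
        checkpoints[shard_id] = seq  # checkpoint after each processed record

# On restart, a worker resumes each shard just after checkpoints[shard_id],
# so records that were already processed are not handled twice.
```

This is also how multiple KCL workers coordinate: each shard's lease and checkpoint live in the shared table, so work can be split and resumed across workers.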

AWS Kinesis CLI

Sending data:

aws kinesis help
aws kinesis list-streams help
aws kinesis list-streams
aws kinesis describe-stream help
aws kinesis describe-stream --stream-name <name>

## example
aws kinesis put-record --stream-name eden-stream-1 --data "Hello world. Please sign up." --partition-key user_123
aws kinesis put-record --stream-name eden-stream-1 --data "User has signed up." --partition-key user_123
aws kinesis put-record --stream-name eden-stream-1 --data "Registration complete." --partition-key user_123

Retrieving records:

aws kinesis help
aws kinesis get-shard-iterator help

Retrieves a shard iterator for a specified shard:

aws kinesis get-shard-iterator --stream-name <name> --shard-id <shard id> --shard-iterator-type TRIM_HORIZON 

Displays usage instructions, options, and examples:

aws kinesis get-records help 

Security

  • Control access/authorization using IAM policies.
  • Encryption in flight using HTTPS endpoints.
  • Encryption at rest using KMS.
  • Possibility to encrypt/decrypt data client side.
  • VPC Endpoints are available for Kinesis to be accessed within VPCs.

Ordering data into Kinesis

  • Data with the same partition key goes to the same shard.
  • Data is ordered per shard.

Ordering data into SQS