AWS Storage
This is not exhaustive documentation of all existing AWS services; these are summarized notes for the AWS certifications.
To see the complete documentation, please go to: AWS documentation
- Storage Comparison
- Amazon S3
- S3 Bucket Naming Convention
- S3 Objects
- S3 Versioning
- S3 Static Hosting
- S3 Cross Origin Resource Sharing CORS
- S3 Consistency Model
- S3 HTTP Codes
- S3 Access Logs
- S3 Replication
- S3 Pre-signed URLs
- S3 Lifecycle Policies
- S3 Lock Policies
- S3 Event Notifications
- S3 Security
- S3 Object Encryption
- S3 Storage Classes
- S3 Performance Optimization
- S3 Select and Glacier Select
- AWS Athena
- AWS Snowball Family
- AWS Storage Gateway
- Amazon FSx
- AWS DataSync
Storage Comparison
AWS Storage | Description
---|---
S3 | Object storage
Glacier | Object archival
EFS | Network file system for Linux instances, POSIX filesystem
FSx for Windows | Network file system for Windows servers
FSx for Lustre | High performance computing file system for Linux systems
EBS volumes | Network storage for one EC2 instance at a time
Instance Storage | Physical storage for an EC2 instance (high IOPS)
Storage Gateway | Hybrid cloud storage
Snowball / Snowmobile | Physical devices to move large amounts of data to the cloud
Amazon S3
Amazon Simple Storage Service (Amazon S3) is an object storage service built to store and retrieve any amount of data from anywhere.
- Data is stored as objects within “buckets”
- Buckets are defined at the region level
- Buckets must have a globally unique name
- Buckets are private by default
- Only the bucket owner has initial access
In reality, there is no concept of “directories” within buckets because S3 follows a flat structure. However, the console UI displays key prefixes as if they were nested directories.
S3 Bucket Naming Convention
- No uppercase letters
- No underscores
- 3-63 characters long
- Name must not be formatted as an IP address
- Must start with a lowercase letter or number
S3 Objects
Data is stored as objects within resources called “buckets”.
- An object can be up to 5 TB in size (max object size in S3 is 5TB)
- If an upload is bigger than 5GB, multi-part upload must be used
- Objects have a key as an identifier, which is essentially the full path of the file:
    - s3://my-bucket/my_file.txt
    - s3://my-bucket/my_folder/another_folder/my_file.txt
- The key is composed of the prefix + object name. For s3://my-bucket/my_folder/another_folder/my_file.txt the prefix is my_folder/another_folder/ and the object name is my_file.txt
- The object value is the content of the body
- Each object can have metadata: a list of text key/value pairs, system-defined or user-added
- Each object can have tags: Unicode key/value pairs, useful for security and lifecycle rules. An object can have up to 10 tags
- If versioning is enabled, each object has a version ID
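As a minimal sketch of these concepts (using Python with boto3; the bucket name, key, metadata, and tags below are hypothetical, and credentials are assumed to be configured):

```python
import boto3

s3 = boto3.client("s3")

# Upload an object: the key "my_folder/another_folder/my_file.txt" is prefix + object name
s3.put_object(
    Bucket="my-bucket",
    Key="my_folder/another_folder/my_file.txt",
    Body=b"hello from S3",                # the object value (body)
    Metadata={"owner": "team-a"},         # user-defined metadata
    Tagging="env=dev&project=notes",      # object tags (up to 10 per object)
)

# Read it back: the body is returned as a stream
response = s3.get_object(Bucket="my-bucket", Key="my_folder/another_folder/my_file.txt")
print(response["Body"].read())
```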
S3 Versioning
When versioning is enabled (at the bucket level), files can have multiple versions.
- If a file is uploaded with the same key (same name), a new version is created instead of overriding the existing file; both versions remain available
- It is best practice to version files, because:
    - Files are protected against unintended deletes
    - Files can be rolled back to previous versions
- Notes:
    - Any file that existed before versioning was enabled will have the version “null”
    - Suspending versioning does not delete the previous versions of a file
Without Versioning
- Each object is identified solely by the object key, its name
- If you modify an object, the original of that object is replaced
- The version ID attribute of the object is null
Deleting versioned files
- Deleting a versioned file adds a delete marker to the file; the file itself is not deleted
- The file can be restored by deleting the delete marker
- Deleting the delete marker together with the underlying versions is a permanent delete, meaning the file can no longer be restored
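A minimal boto3 sketch of these behaviours, assuming a bucket named my-bucket and an object key report.csv (both placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Enable versioning at the bucket level
s3.put_bucket_versioning(
    Bucket="my-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# A plain delete only adds a delete marker; the underlying versions remain
s3.delete_object(Bucket="my-bucket", Key="report.csv")

# Restore the object by deleting its delete marker
versions = s3.list_object_versions(Bucket="my-bucket", Prefix="report.csv")
marker_id = versions["DeleteMarkers"][0]["VersionId"]
s3.delete_object(Bucket="my-bucket", Key="report.csv", VersionId=marker_id)
```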
MFA Delete
- To use MFA-Delete we have to enable versioning on the selected bucket
- MFA-Delete can be enabled/disabled only by the owner of the bucket (root account)!
- MFA-Delete currently can only be enabled using the CLI
- MFA will be required when:
- We want to permanently delete an object version
- We want to suspend the versioning on the bucket
- MFA won’t be required when:
- We want to enable versioning
- We want to list deleted versions
- We want to add a delete marker to an object
S3 Static Hosting
S3 can host static websites and have them accessible from the internet
- The website URL will be one of the following formats (it depends on the region):
    - <bucket-name>.s3-website-<AWS-region>.amazonaws.com
    - <bucket-name>.s3-website.<AWS-region>.amazonaws.com
- In case of 403 (Forbidden) errors, we have to make sure that the bucket policy allows public reads
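A hedged boto3 sketch of enabling static hosting on a bucket (the bucket name and document names are placeholders; the bucket policy still has to allow public reads as noted above):

```python
import boto3

s3 = boto3.client("s3")

# Serve index.html as the site root and error.html for errors
s3.put_bucket_website(
    Bucket="my-site-bucket",
    WebsiteConfiguration={
        "IndexDocument": {"Suffix": "index.html"},
        "ErrorDocument": {"Key": "error.html"},
    },
)
```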
Use cases
- Storing web content
- Storing backups for Relational DB
- Storing logs for analytics
S3 Cross Origin Resource Sharing (CORS)
CORS is a web-browser-based mechanism that allows requests to other origins while visiting the main one
- An origin is the combination of scheme (protocol), host (domain) and port
- Same origin example: http://example.com/app1 and http://example.com/app2
- Different origins: http://example.com and http://otherexample.com
- The request won't be fulfilled unless the other origin allows the request, using CORS headers (example: Access-Control-Allow-Origin, Access-Control-Allow-Methods)
- If a client does a cross-origin request on an S3 bucket, the correct CORS headers need to be enabled in order for the request to succeed
- Requests can be allowed for a specific origin (by specifying the URL of the origin) or for all origins (by using *)
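A sketch of attaching a CORS configuration to a bucket with boto3 (bucket name and allowed origin are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Allow GET requests from a single web origin
s3.put_bucket_cors(
    Bucket="my-assets-bucket",
    CORSConfiguration={
        "CORSRules": [
            {
                "AllowedOrigins": ["https://www.example.com"],  # or ["*"] for all origins
                "AllowedMethods": ["GET"],
                "AllowedHeaders": ["*"],
                "MaxAgeSeconds": 3000,
            }
        ]
    },
)
```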
S3 Consistency Model
Amazon S3 follows two consistency models:
- Read-after-write consistency for PUTs of new objects (new key)
- Eventual consistency for overwrite PUTs and DELETEs of existing objects (existing key); a GET right after an overwrite may return the old version
S3 HTTP Codes
HTTP Code | Description |
---|---|
200 | Successful upload |
300 | Redirection |
400 | Client error |
500 | Server error |
S3 Access Logs
- For audit purposes we would want to log all access to S3 buckets
- Any request made to S3, from any account, authorized or denied, will be logged into another S3 bucket
- The data can be analyzed by some data analysis tools or Amazon Athena
- We should never set our logging bucket to be the monitored bucket! This may create a logging loop causing the bucket to grow exponentially
S3 Replication
To enable replication:
- Enable versioning on the source and destination buckets
- Only the new objects are replicated after the replication is activated (no retroactive replication)
- Buckets can be in separate accounts
- Copying between replica buckets happens asynchronously (it is very quick)
- In order to be able to replicate objects, the proper IAM permissions (a replication role) must be granted to S3
For DELETE operations:
- For deletion without a version ID, a delete marker is added to the object; the deletion is not replicated
- For deletion with a version ID, the object version is deleted in the source bucket; the deletion is not replicated
There is no chaining of replication:
- If bucket-A has replication into bucket-B, and bucket-B has replication into bucket-C
- Then objects created in bucket-A are not replicated to bucket-C
There are 2 types of replication:
- Cross Region Replication (CRR)
    - Buckets are in different regions
    - Used for: compliance, lower latency access, replication across accounts
- Same Region Replication (SRR)
    - Buckets are in the same region
    - Used for: log aggregation, live replication between production and test accounts
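A hedged boto3 sketch of a replication rule (bucket names and the IAM role ARN are placeholders; versioning must already be enabled on both buckets):

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_replication(
    Bucket="source-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
        "Rules": [
            {
                "ID": "replicate-everything",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},                                    # empty filter = all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": "arn:aws:s3:::destination-bucket"},
            }
        ],
    },
)
```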
S3 Pre-signed URLs
Reference: Sharing objects with presigned URLs
By default, all Amazon S3 objects are private, only the object owner has permission to access them. However, the object owner may share objects with others by creating a presigned URL. A presigned URL uses security credentials to grant time-limited permission to download objects.
- The URL can be entered in a browser or used by a program to download the object.
- The credentials used by the presigned URL are those of the AWS user who generated the URL.
- Users given a pre-signed URL will inherit the permissions of the person who generated the URL
- We can generate pre-signed URLs using the SDK and the CLI
- Pre-signed URLs have a default expiration of 3600 seconds. This can be changed with the --expires-in argument
For downloads, we can configure the signature version and generate a pre-signed URL through the CLI:
aws s3 presign help
aws configure set default.s3.signature_version s3v4
aws s3 presign s3://myfavoritefilm/Lionking.jpg --expires-in 120 --region ap-southeast-1
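The same can be done from the SDK; a minimal boto3 sketch (reusing the bucket and key from the CLI example above):

```python
import boto3

s3 = boto3.client("s3")

# URL valid for 120 seconds, signed with the credentials of whoever generates it
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "myfavoritefilm", "Key": "Lionking.jpg"},
    ExpiresIn=120,
)
print(url)
```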
S3 Lifecycle Policies
We can transition objects between storage classes in order to save money
- Infrequently accessed documents should be moved to STANDARD_IA
- Objects that don’t need real-time access should be moved to GLACIER or DEEP_ARCHIVE
- Moving objects can be done manually or can be done via a lifecycle configuration
- Rules can be applied for a certain prefix
- Rules can be created for certain object tags
Transition Actions
Transition actions define when objects should be transitioned from one storage class to another:
- Move objects to the Standard IA class 60 days after creation
- Move to Glacier for archiving after 6 months
Expiration Actions
Expiration actions delete objects after a given time:
- Can be used to delete old versions of files if versioning is enabled on the bucket
- Can be used to clean-up incomplete multi-part uploads
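A hedged boto3 sketch of a lifecycle configuration combining a transition action and an expiration action (bucket name, prefix and day counts are illustrative):

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},               # rule applies to this prefix only
                "Transitions": [
                    {"Days": 60, "StorageClass": "STANDARD_IA"},
                    {"Days": 180, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},                 # expiration action: delete after a year
            }
        ]
    },
)
```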
S3 Lock Policies
S3 Object Lock
Implements the WORM (Write Once Read Many) model, meaning it guarantees that a file is written once and cannot be deleted or overwritten until the lock is removed
Glacier Vault Lock
Implements the same WORM model; a locked file cannot be changed as long as the lock is active. Helpful for compliance and data retention
S3 Event Notifications
Amazon S3 Event Notifications can be used to receive notifications when certain events happen in your S3 bucket.
- To enable notifications, add a notification configuration that identifies the events that you want Amazon S3 to publish
- Make sure that it also identifies the destinations where you want Amazon S3 to send the notifications
- Store this configuration in the notification subresource that is associated with the bucket
Currently, Amazon S3 can publish notifications for the following events:
- New object created events
- Object removal events
- Restore object events
- Reduced Redundancy Storage (RRS) object lost events
- Replication events
- S3 Lifecycle expiration events
- S3 Lifecycle transition events
- S3 Intelligent-Tiering automatic archival events
- Object tagging events
- Object ACL PUT events
Amazon S3 can send event notification messages to the following destinations.
- Amazon Simple Notification Service (Amazon SNS) topics
- Amazon Simple Queue Service (Amazon SQS) queues
- AWS Lambda function
- Amazon EventBridge
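A sketch of wiring object-created events to an SQS queue with boto3 (the bucket and queue ARN are placeholders; the queue policy must also allow S3 to send messages):

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="my-bucket",
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": "arn:aws:sqs:eu-west-1:123456789012:s3-events-queue",
                "Events": ["s3:ObjectCreated:*"],   # all new object created events
            }
        ]
    },
)
```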
S3 Security
User based security
- Use IAM policies to specify which calls are allowed for a specified user from IAM console
Resource based security
- Use bucket policies which are bucket wide rules from the S3 console.
- These rules may allow cross account access to the bucket
- We also have Object Access Control Lists (ACLs) and Bucket Access Control Lists, which allow finer-grained control over the bucket
- An IAM principal can access an S3 object if:
    - The user's IAM permissions allow it or the resource policy allows it
    - There is no explicit deny
Bucket Policies
- Bucket policies are JSON based documents
- They can be applied to both buckets and objects in buckets
- The effect of a statement in the bucket policy can be either allow or deny
- The principal in the policy represents the account or user to which the policy applies
- Common use cases for S3 bucket policies:
    - Grant public access to the bucket
    - Force objects to be encrypted at upload
    - Grant access to another account (cross-account access)
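As an illustration of the public-read use case, a hedged sketch of applying a bucket policy with boto3 (the bucket name is a placeholder):

```python
import json
import boto3

s3 = boto3.client("s3")

# Allow anyone to GET any object in the bucket
public_read_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicRead",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::my-public-bucket/*",
        }
    ],
}

s3.put_bucket_policy(Bucket="my-public-bucket", Policy=json.dumps(public_read_policy))
```

Note that the Block Public Access settings described below can override such a policy.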
Bucket Settings for Block Public Access
- Relatively new settings, created to block public access to buckets and objects at the account or bucket level
- S3 provides 4 different kinds of block public access settings:
    - Block public access granted through new access control lists (ACLs)
    - Block public access granted through any access control lists (ACLs)
    - Block public access granted through new public bucket or access point policies
    - Block public and cross-account access to buckets and objects through any public bucket or access point policies
- These settings were created to prevent company data leaks
Other Security Features
Networking
- S3 supports VPC Endpoints (for instances in VPC without internet access)
Logging and Audit
- S3 Access Logs can be stored in other S3 buckets
- API calls can be logged in AWS CloudTrail
User Security
- MFA Delete can be required in versioned buckets in order to protect against accidental deletions
- Pre-Signed URLs: URLs that are valid only for a limited time
S3 Object Encryption
Encryption at rest
For protecting data at rest in Amazon S3, you have the following options:
SSE-S3
- Encrypts S3 objects using keys handled and managed by AWS
- The keys used for encryption are managed by Amazon S3
- Objects are encrypted on the server side
- It uses AES-256 encryption
- In order to have SSE-S3 encryption, clients must set the header "x-amz-server-side-encryption": "AES256"
SSE-KMS
- Encryption keys are handled and managed by KMS
- KMS lets us manage which keys are used for encryption and provides an audit trail showing who used the KMS key
- Objects are encrypted on the server side
- In order to have SSE-KMS encryption, clients must set the header "x-amz-server-side-encryption": "aws:kms"
SSE-C
- Server-side encryption using encryption keys provided by the user from outside of AWS
- Amazon S3 does not store the encryption key provided by the user
- Data must be transmitted through HTTPS, because a key is sent to AWS
- Encryption key must be provided in the header of the request for every request
- When retrieving the object, the same encryption key must be provided in the header
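A minimal boto3 sketch of the SSE-S3 and SSE-KMS cases described above (the bucket, keys and KMS key alias are placeholders; boto3 sets the x-amz-server-side-encryption header from the ServerSideEncryption parameter):

```python
import boto3

s3 = boto3.client("s3")

# SSE-S3: keys handled and managed by Amazon S3 (AES-256)
s3.put_object(
    Bucket="my-bucket",
    Key="sse-s3-object.txt",
    Body=b"encrypted with S3-managed keys",
    ServerSideEncryption="AES256",
)

# SSE-KMS: encrypted with a KMS key
s3.put_object(
    Bucket="my-bucket",
    Key="sse-kms-object.txt",
    Body=b"encrypted with a KMS key",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/my-app-key",
)
```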
Client Side Encryption
- Data must be encrypted before sending it to S3
- This is usually accomplished by using a third party encryption library, like Amazon S3 Encryption Client
- The user is solely responsible for encryption-decryption
- The keys and the encryption cycle are managed by the user
Encryption in transit
- Amazon S3 exposes:
- HTTP endpoint for non-encrypted data
- HTTPS endpoint for encryption in flight which relies on SSL/TLS
- Most clients for S3 will use HTTPS by default
- In case of SSE-C encryption, HTTPS is mandatory
S3 Storage Classes
- Amazon S3 Standard-General Purpose
- Amazon S3 Standard-Infrequent Access (IA)
- Amazon S3 One Zone-Infrequent Access
- Amazon S3 Intelligent Tiering
- Amazon Glacier
- Amazon Glacier Deep Archive
- Amazon S3 Reduced Redundancy Storage (deprecated)
S3 Standard - General Purpose
- High durability (11 nines: 99.999999999%) of objects across multiple AZs
- Durability: if we store 10 million objects in S3, we can expect to lose on average a single object once every 10,000 years
- 99.99% availability per year
- It can sustain 2 concurrent facility failures
S3 Standard - Infrequent Access
- Suitable for data that is less frequently accessed, but it should be retrieved quickly when it is needed
- Same durability as General Purpose, 99.9% availability
- Lower cost than General Purpose
S3 One Zone - Infrequent Access
- Same as Standard IA, but data is stored in a single AZ
- Same durability as Standard IA. Data can be lost if an AZ goes down
- 99.5% availability per year
- Lower cost than IA
S3 Intelligent-Tiering
- Automatically moves objects between two access tiers based on access patterns
- Has a small monthly monitoring fee
- Same durability as General Purpose, having availability of 99.9%
S3 Glacier
- Low cost object storage for archiving/backup data
- Data is retained for longer terms (10s of years)
- Alternative to on-premise magnetic tape
- Same durability as General Purpose
- Cost per storage per month is really low, but we pay for data retrieval as well
- Each item in Glacier is called an archive, archives are stored in Vaults
- Provides 3 retrieval options:
- Expedited (1 to 5 minutes): the most expensive retrieval option
- Standard (3 to 5 hours)
- Bulk (5 to 12 hours)
- Minimum storage duration for Glacier is 90 days
S3 Glacier Deep Archive
- For very long term storage - cheaper than S3 Glacier
- Retrieval options:
- Standard (12 hours)
- Bulk (48 hours)
- Minimum storage duration is 180 days
S3 Performance Optimization
Baseline Performance
Amazon S3 automatically scales to high request rates, with a latency of 100-200 ms to get the first byte out of S3
- 3500 PUT/COPY/POST/DELETE requests per second per prefix in a bucket
- 5500 GET/HEAD requests per second per prefix in a bucket
Prefix explained:
- Example of a file in a bucket: my-bucket/folder/subfolder/file
- The prefix in this case is: /folder/subfolder/
- Request performance will apply to each prefix separately
Single PUT Upload
- Objects uploaded to S3 are sent as a single stream by default.
- If the stream fails, the upload fails and requires a restart of the transfer.
- A single PUT upload is limited to 5GB
Multipart Upload
- Data is broken up into smaller parts and uploaded in parallel.
- Recommended to be used for files bigger than 100MB
- Mandatory for files bigger than 5GB
- Upload can be split into maximum of 10,000 parts.
- Each part can range between 5MB and 5GB.
- Last leftover part can be smaller than 5MB as needed.
- Parts can fail in isolation and restart in isolation.
- The risk of uploading large amounts of data is reduced.
- Improves the transfer rate: the overall rate is the combined speed of all parts.
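A hedged sketch using boto3's high-level transfer manager, which switches to multipart upload automatically above a threshold (the file name, bucket and thresholds are illustrative):

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,  # use multipart above 100 MB
    multipart_chunksize=100 * 1024 * 1024,  # size of each part
    max_concurrency=8,                      # number of parts uploaded in parallel
)

# Parts fail and retry in isolation; the overall rate is the sum of the parallel parts
s3.upload_file("backup.tar.gz", "my-bucket", "backups/backup.tar.gz", Config=config)
```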
S3 Accelerated Transfer
- Off by default.
- Uses AWS edge locations to speed up transfers.
- Compatible with multi-part upload.
- Bucket name cannot contain periods.
- Name must be DNS compatible.
- Benefits improve with the distance between the client and the bucket's region.
- The worse the baseline transfer, the bigger the performance benefit.
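A sketch of enabling Transfer Acceleration on a bucket with boto3 (the bucket name is a placeholder); once enabled, clients upload through the <bucket>.s3-accelerate.amazonaws.com endpoint:

```python
import boto3

s3 = boto3.client("s3")

# Enable Transfer Acceleration (the bucket name must be DNS compatible and contain no periods)
s3.put_bucket_accelerate_configuration(
    Bucket="my-bucket",
    AccelerateConfiguration={"Status": "Enabled"},
)
```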
S3 Byte-Range Fetches
- Can be used to speed up downloads by parallelizing GET requests.
- Can be used to retrieve only a part of the file
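A minimal boto3 sketch of a byte-range fetch (bucket, key and range are illustrative):

```python
import boto3

s3 = boto3.client("s3")

# Retrieve only the first megabyte of the object, e.g. to read a file header
response = s3.get_object(
    Bucket="my-bucket",
    Key="big-file.bin",
    Range="bytes=0-1048575",
)
first_chunk = response["Body"].read()
```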
KMS Limitation
- S3 Performance can be affected by KMS limits
- If encryption is enabled using SSE-KMS, we get additional requests to KMS which will apply to our KMS quota
- KMS could throttle performance, as of today we can not request a quota increase for it
S3 Select and Glacier Select
This provides a way to retrieve parts of objects rather than the entire object.
- If you retrieve a 5TB object, it takes time and consumes 5TB of data; filtering at the client side doesn't reduce this cost
- S3 Select and Glacier Select let you use SQL-like statements to select part of the object, which is returned in a filtered way
- The filtering happens at the S3 service itself, saving time and data
- We can filter by rows and columns; SQL statements should be simple, joins are not supported
- The purpose of S3 Select is to use less network traffic
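A hedged boto3 sketch of S3 Select against a CSV object (the bucket, key, column names and filter are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Only rows matching the WHERE clause leave S3
response = s3.select_object_content(
    Bucket="my-bucket",
    Key="data/orders.csv",
    ExpressionType="SQL",
    Expression="SELECT s.order_id, s.amount FROM s3object s WHERE CAST(s.amount AS FLOAT) > 100",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The response is an event stream; Records events carry the filtered bytes
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())
```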
AWS Athena
The source data is stored on S3 and Athena can read from this data. In Athena you define how to read the original data and how it should be presented for what you want to see.
- Serverless service to perform analytics directly against S3 files
- Uses schema-on-read, the original data is never changed and remains in its original form.
- The schema, which you define in advance, is applied to the data in flight when it is read.
- It provides a JDBC/ODBC driver
How Athena works
- Tables are defined in advance in a data catalog and data is projected when read.
- It allows SQL-like queries on data without transforming the data itself.
- Query output can be saved in the console or fed to other visualization tools.
- The original data set can be stored in optimized formats, which reduces the amount of space used for the data and the costs of querying that data.
Pricing
- We are charged per query, based on the amount of data scanned; we are billed only for what we use
Supported file formats
- csv, json, orc, Avro, Parquet.
- In the back-end it uses Presto query engine
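A hedged boto3 sketch of running a query (the database, table and result bucket are placeholders and must already exist in the data catalog):

```python
import time
import boto3

athena = boto3.client("athena")

execution = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS hits FROM access_logs GROUP BY status",
    QueryExecutionContext={"Database": "weblogs"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes; billing is based on the data scanned by this query
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows)
```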
Use cases
- Business intelligence
- Analytics
- Reporting
- Log analysis
AWS Snowball Family
Designed to move large amounts of data IN and OUT of AWS. Physical storage the size of a suitcase or truck. Ordered from AWS, use, then return.
Snowball
- Any data on Snowball uses KMS at rest encryption.
- 1 Gbps or 10 Gbps connection
- 50TB or 80TB Capacity.
- 10TB to 10PB of data is the economical range.
- Good for multiple locations
- No compute
Snowball Edge
- Includes both storage and compute
- Larger capacity vs snowball.
- Faster networking over Snowball
- 10 Gbps or up to 100 Gbps
Types of Snowball Edge
- Storage optimized
- 80TB storage, 24 vCPU, 32 GiB RAM
- (with EC2) includes 1TB SSD
- Compute optimized
- 100TB storage, 7.68 TB NVMe (fast PCI bus storage), 52 vCPU, 208 GiB RAM
- Compute with GPU
- Same as compute, but with GPU
Snowmobile
Portable data center within a shipping container on a truck. This is a special order and is not available in high volume. Ideal for single location where 10 PB+ is required. Max is 100 PB per snowmobile.
AWS Storage Gateway
Hybrid Cloud for Storage
- Hybrid cloud:
- Part of a company’s infrastructure is in the public cloud (like AWS)
- Part of a company’s infrastructure is on-premise
- S3 is a proprietary storage technology (unlike EFS/NFS), it can be exposed to on-premise servers through a storage gateway
Storage Gateway
Bridge between on-premise data and cloud data in S3
- Uses cases for storage gateway with S3: disaster recovery, backup and restore, tiered storage
- AWS provides 3 types of storage gateways:
- File Gateway
- Volume Gateway
- Tape Gateway
File Gateway
- Configured S3 buckets are accessible using NFS and SMB protocols
- Supports S3 Standard, S3 IA, One Zone IA
- Each bucket will have its own IAM role in order to be accessed by the file gateway
- Most recently used data is cached in the file gateway
- File Gateway can be mounted on many servers (because of the NFS protocol)
Volume Gateway
- Block storage using iSCSI protocol backed by S3
- EBS snapshots are created from time to time and stored in S3; these help us restore on-premise volumes
- Cached volumes: low latency access to the most recently used data
- Stored volumes: the entire dataset is on premise, scheduled backups are stored in S3
- Volumes are mounted using the iSCSI protocol; on-premise they look like local volumes
Tape Gateway
- Some companies have backup processes using physical tapes
- With tape gateway these companies can use the same process, but the data will be backed up into the cloud
- Virtual Tape Library (VTL) backed by Amazon S3 and Glacier
- Backup processes using iSCSI interface will work as well with tape gateway
File Gateway Hardware Appliance
- Using a file gateway requires virtualization on-premise; otherwise we can use a File Gateway Hardware Appliance
- It is actual hardware which can be bought from amazon.com
- Helpful for daily NFS backup in small data centers
Amazon FSx
Amazon FSx for Windows
- EFS is a shared POSIX file system for Linux, meaning it can not be used with Windows
- FSx for Windows is a fully managed Windows file system share drive
- Supports SMB protocol and Windows NTFS
- It has Active Directory integration, ACLs and user quotas
- It is built on top of SSDs; it can scale up to 10s of GB/s, millions of IOPS and 100s of PB of data
- It can be accessed from on-premise infrastructure
- It can be configured to be Multi-AZ (high availability)
- Data is backed-up daily to S3
Amazon FSx for Lustre
- Lustre is a type of parallel distributed file system for large-scale computing
- Lustre is derived from “Linux” and “cluster”
- FSx for Lustre is used with machine learning and High Performance Computing (HPC), example: video processing, financial modelling, electronic design automation
- Scales up to 100s GB/s, millions of IOPS, sub-ms latencies
- It has seamless integration with S3
- Can “read” S3 as a file system
- Can write the output of the computations back to S3
- It can be used with on-premise servers
AWS DataSync
- Data transfer service TO and FROM AWS.
- This is used for migrations or for large amounts of data processing transfers.
- Designed to work at huge scales. Each agent can handle 10 Gbps and each job can handle 50 million files.
- Transfers metadata and timestamps
- Each agent can move about 100 TB per day.
- Can use bandwidth limiters to avoid customer impact
- Supports incremental and scheduled transfer options
- Compression and encryption in transit is also supported
- Has built in data validation and automatic recovery from transit errors.
- AWS service integration with S3, EFS, FSx for Windows servers.
- Pay as you use product.
AWS DataSync Components
Task
- a job within DataSync
- defines what is being synced and how quickly
- defines the two locations involved in the job
Agent
- software used to read and write to on-premise stores such as NFS or SMB shares
- used to pull data off that store and move it into AWS, or vice versa
Location
- every task has two locations FROM and TO
- example locations:
- network file systems (NFS), common in Linux or Unix
- server message block (SMB), common in Windows environments
- AWS storage services (EFS, FSx, and S3)