Amazon S3 Best Practices

2021.07.07


Amazon S3 stands for Simple Storage Service and it provides object (file) storage through a web interface. It’s built to store, protect and retrieve data from “buckets” at any time from anywhere on any device. As AWS describes it, an S3 environment is a flat structure. A user creates a bucket, and the bucket stores objects in the cloud.

Organisations of any size in any industry can use this service. Use cases include websites, mobile apps, archiving, data backups and restorations, IoT devices, and enterprise apps to name just a few.

Here I have listed some features of Amazon S3 as well as best practices we should follow to get the most out of it.

Things which S3 Provides 

  • Scalability
  • Availability
  • Durability
  • Security
  • Manageability
  • Performance

S3 integrates with other services to support a range of applications, such as streaming data with Kinesis. 

There are 6 major storage classes 

  • S3 Standard
  • S3 Intelligent-Tiering
  • S3 Standard-IA
  • S3 One Zone-IA
  • S3 Glacier
  • S3 Glacier Deep Archive

If you access data once a month or less, Standard-IA is the best option.

Data which is not very important and is temporary in nature can be stored in One Zone-IA, which saves about 20% of the cost compared to Standard-IA.

In Glacier, we have expedited retrievals, which let us retrieve data within minutes instead of hours. If we need a large amount of data from Glacier, we can use bulk retrieval, which gives us the lowest price per GB.

In Intelligent-Tiering, data moves between the frequent access and infrequent access tiers on its own, and this is managed at the object level. The frequent access tier is priced the same as S3 Standard and the infrequent access tier the same as Standard-IA. 

There is a monitoring fee, which is charged at the object level. The fee depends on the number of objects, not on the size of each object. 

 

Lifecycle Policies

  • These are custom policies which can be used when data access patterns are reasonably predictable. 
  • Storage Class Analysis can be activated at the bucket, prefix or tag level. It compares how much data is retrieved relative to the amount of data being stored, generally over a period of about a month. For example, if the amount of data being accessed is high relative to the data being stored, it is treated as frequently accessed data, and vice versa. 
  • When data is a certain number of days old and the ratio of data being accessed to data being stored is very small, we can configure it to be moved to IA. 

Tags don't depend on the name of the object, so if a bucket is shared by multiple teams or users, it is easier to define lifecycle policies using tags. 

Tag-level policies override bucket-level policies. 
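
As a rough illustration, here is how a tag-scoped lifecycle rule could look with the Python SDK (boto3). The bucket name, tag and day counts are placeholder assumptions, not recommendations.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical rule: objects tagged team=analytics move to Standard-IA after
# 30 days and to Glacier Deep Archive after 365 days. All values are examples.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "analytics-archive-rule",
                "Status": "Enabled",
                "Filter": {"Tag": {"Key": "team", "Value": "analytics"}},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    },
)
```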

Best Practices

Using the latest SDK helps with performance gains from:

  • Automatic retries
  • Handling timeouts
  • Parallelised uploads and downloads with Transfer Manager
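
For example, in the Python SDK (boto3) this functionality is exposed through TransferConfig, which the high-level upload_file and download_file calls use to parallelise multipart transfers. The file paths, bucket name and tuning values below are placeholders.

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Multipart, parallelised transfer settings; the thresholds here are examples only.
config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,  # switch to multipart uploads above 8 MB
    max_concurrency=10,                   # parallel threads per transfer
)

s3.upload_file("backup.tar.gz", "example-bucket", "backups/backup.tar.gz", Config=config)
s3.download_file("example-bucket", "backups/backup.tar.gz", "restored.tar.gz", Config=config)
```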

If an application accesses the same data very frequently, then ElastiCache, CloudFront or AWS Elemental MediaStore will give you better latency and throughput, and also help reduce the number of S3 requests.

Security

 

  • Encrypt data by default at the bucket level (a sketch follows after this list).
  • Use the encryption status in the S3 inventory report for auditing purposes. 
  • Use bucket permission checks to review ACLs and any other kind of access in one place. 
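
A minimal sketch of turning on default encryption at the bucket level with boto3; the bucket name and KMS key alias are placeholders, and SSE-S3 (AES256) could be used instead of SSE-KMS.

```python
import boto3

s3 = boto3.client("s3")

# Every object written without an explicit encryption header will now be
# encrypted with this default. The key alias is a placeholder.
s3.put_bucket_encryption(
    Bucket="example-bucket",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "alias/example-key",
                }
            }
        ]
    },
)
```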

Policies are a better option than ACLs for controlling access to a bucket, since ACLs are much less flexible than policies. 

Object ACL 

  • Give permission to read an object or to modify the ACL.

Bucket ACL 

  • Give permission to read or write objects to a bucket.
  • Here, to modify access to an object we need to modify both the object ACL and the bucket ACL. With policies, on the other hand, we can set permissions by prefix, tag or bucket, and they are much easier to update because everything is in one place. 
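
As an illustration, a prefix-scoped bucket policy could be applied like this with boto3; the account ID, role name, bucket and prefix are all hypothetical.

```python
import json

import boto3

s3 = boto3.client("s3")

# Hypothetical policy: one team's role may read only objects under its own prefix.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "TeamAReadOwnPrefix",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:role/team-a"},
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-bucket/team-a/*",
        }
    ],
}

s3.put_bucket_policy(Bucket="example-bucket", Policy=json.dumps(policy))
```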

Blocking public access can be done at the account level, which applies this setting to all existing as well as newly created buckets in that account.
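
A sketch of setting this at the account level through the S3 Control API in boto3; the account ID is a placeholder.

```python
import boto3

s3control = boto3.client("s3control")

# Applies to every existing and future bucket in the account.
s3control.put_public_access_block(
    AccountId="111122223333",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```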

We can specify encryption in the PUT request, in which case the object follows that particular encryption setting. If encryption is not specified in the request, the bucket default is applied. 
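
For instance, a PUT request can carry its own encryption setting, which then takes precedence over the bucket default; the key, body and KMS alias below are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# This object is encrypted with the key named in the request, regardless of
# the bucket's default encryption configuration.
s3.put_object(
    Bucket="example-bucket",
    Key="reports/2021-07.csv",
    Body=b"example,data\n",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/example-key",
)
```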

If you want to encrypt your existing objects, you can run that as a batch operation.

Managing Objects at Scale

 

Inventory helps us list objects and their stats (storage class, creation date, encryption status, replication status, object size, Intelligent-Tiering access tier, etc.) when the number of objects is very high. The inventory report is delivered daily or weekly.
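
A sketch of enabling a daily inventory report with boto3; the bucket names, report prefix and optional fields are placeholder choices.

```python
import boto3

s3 = boto3.client("s3")

# Deliver a daily CSV inventory of example-bucket into a separate report bucket.
s3.put_bucket_inventory_configuration(
    Bucket="example-bucket",
    Id="daily-inventory",
    InventoryConfiguration={
        "Id": "daily-inventory",
        "IsEnabled": True,
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": "Daily"},
        "Destination": {
            "S3BucketDestination": {
                "Bucket": "arn:aws:s3:::example-inventory-reports",
                "Format": "CSV",
                "Prefix": "inventory",
            }
        },
        "OptionalFields": [
            "Size",
            "StorageClass",
            "EncryptionStatus",
            "ReplicationStatus",
            "IntelligentTieringAccessTier",
        ],
    },
)
```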

We can use the LIST API as well, but it lists objects 1000 at a time.
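
With boto3, a paginator hides that 1000-object page size; the bucket and prefix below are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Iterate over every object under a prefix, one 1000-object page at a time.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="example-bucket", Prefix="logs/"):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"], obj["StorageClass"])
```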

S3 Batch Operations can: replace object tag sets, change object ACLs, restore objects from Amazon S3 Glacier, copy objects, and run AWS Lambda functions. These jobs can use the inventory as input, since the inventory contains a list of objects, and a completion report describes what the batch operation did.

Coupling Batch Operations with Lambda functions lets us perform operations at the object level, using URL-encoded JSON to pass object-level parameters. Lambda functions can call ML services, so any kind of ML operation can be run over a large number of objects, and we can also run our own custom code.
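
A rough sketch of what such a Lambda handler could look like, following the general request/response shape S3 Batch Operations uses when invoking a function; the per-object work here (a self-copy) is purely illustrative, and the field handling should be checked against the current documentation.

```python
import urllib.parse

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    # S3 Batch Operations passes one task per invocation; object keys arrive
    # URL-encoded, so decode them before use.
    task = event["tasks"][0]
    bucket = task["s3BucketArn"].split(":::")[-1]
    key = urllib.parse.unquote_plus(task["s3Key"])

    # Illustrative per-object work: copy the object onto itself, which rewrites
    # it (for example, to pick up the bucket's default encryption).
    s3.copy_object(Bucket=bucket, Key=key, CopySource={"Bucket": bucket, "Key": key})

    # Report the result back to the batch job for its completion report.
    return {
        "invocationSchemaVersion": "1.0",
        "treatMissingKeysAs": "PermanentFailure",
        "invocationId": event["invocationId"],
        "results": [
            {"taskId": task["taskId"], "resultCode": "Succeeded", "resultString": "processed"}
        ],
    }
```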

Ideally a user should

  • Add metadata to objects.
  • Add lifecycle policies from day one.
  • Even if we don't follow these 2 suggestions, S3 has features which let us recover from the tech debt.

Data Protection Capabilities

Versioning and object lock work within a bucket.

Replication gives us the ability to replicate data to another bucket, in the same or a different region, along with some other controls over the data. You can choose:

  • What to replicate (tags, prefix or bucket)
  • Object ownership, which can be changed across accounts
  • The storage class of the replicated objects in the destination bucket

Replication Time Control (RTC) : 15-minute replication time backed by an AWS SLA
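
As a sketch, a replication rule with RTC enabled could be configured like this in boto3; the role ARN, bucket names and prefix are placeholders, and versioning must already be enabled on both buckets.

```python
import boto3

s3 = boto3.client("s3")

# Replicate one prefix to another bucket, with Replication Time Control (RTC)
# and replication metrics enabled. All names and ARNs are examples.
s3.put_bucket_replication(
    Bucket="example-source-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::111122223333:role/example-replication-role",
        "Rules": [
            {
                "ID": "replicate-team-a",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {"Prefix": "team-a/"},
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {
                    "Bucket": "arn:aws:s3:::example-destination-bucket",
                    "StorageClass": "STANDARD_IA",
                    "ReplicationTime": {"Status": "Enabled", "Time": {"Minutes": 15}},
                    "Metrics": {"Status": "Enabled", "EventThreshold": {"Minutes": 15}},
                },
            }
        ],
    },
)
```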

There are 3 metrics: replication latency, bytes pending replication and operations pending replication.

We can set alarms on these metrics. 
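
For example, an alarm on replication latency might look like the sketch below; the metric dimensions and the 15-minute (900-second) threshold are assumptions to verify against the S3 replication metrics documentation.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm if replication latency for one rule exceeds 15 minutes (900 seconds).
cloudwatch.put_metric_alarm(
    AlarmName="s3-replication-latency-team-a",
    Namespace="AWS/S3",
    MetricName="ReplicationLatency",
    Dimensions=[
        {"Name": "SourceBucket", "Value": "example-source-bucket"},
        {"Name": "DestinationBucket", "Value": "example-destination-bucket"},
        {"Name": "RuleId", "Value": "replicate-team-a"},
    ],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=900,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```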

If versioning is enabled, each version is indexed by its version ID, and a GET request automatically returns the most recent version. When we delete an object, the current version is not removed; instead, a delete marker is placed on top of it. 

Think of it as a stack where newer versions come on top of older ones. 

You can set lifecycle policies to delete older versions of an object.
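
A minimal sketch of such a rule in boto3; the 90-day window and bucket name are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Permanently remove noncurrent (older) versions 90 days after they are
# superseded; the current version is untouched.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-versions",
                "Status": "Enabled",
                "Filter": {},
                "NoncurrentVersionExpiration": {"NoncurrentDays": 90},
            }
        ]
    },
)
```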

Object Lock

  • Compliance mode : You can set a retention date, and the object is retained until that time expires; no user can shorten the retention period.
  • Governance mode : Data is stored in WORM format, but privileged users can modify the retention controls.
  • Legal hold : The object is protected for as long as the legal hold exists.
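
For illustration, per-object retention and a legal hold can be applied like this with boto3 (the bucket must have Object Lock enabled); the bucket, key, mode and dates are placeholders.

```python
import datetime

import boto3

s3 = boto3.client("s3")

# Retain this object in COMPLIANCE mode until the given date; the retention
# period cannot be shortened once it is set.
s3.put_object_retention(
    Bucket="example-locked-bucket",
    Key="records/contract.pdf",
    Retention={
        "Mode": "COMPLIANCE",
        "RetainUntilDate": datetime.datetime(2026, 1, 1, tzinfo=datetime.timezone.utc),
    },
)

# Additionally place a legal hold, which protects the object until it is
# explicitly released, independent of the retention date.
s3.put_object_legal_hold(
    Bucket="example-locked-bucket",
    Key="records/contract.pdf",
    LegalHold={"Status": "ON"},
)
```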