AWS Redshift 101

2021.03.30

この記事は公開されてから1年以上経過しています。情報が古い可能性がありますので、ご注意ください。

AWS Redshift enables you to have a real time lake house approach which combines traditional data sources in our data warehouse with event data stored in S3 as well as real time data stored in Amazon Aurora or RDS.

 

The goal is to have a single interface to run queries which get results from multiple types of data stores. Also have a set of tables which can map to where the data is stored. 

The unique thing is that we do not need to pool all the data we need to access into Redshift’s data store or any particular data store. End users can only see the data in terms of tables and not where it is coming from.

What does Redshift have to offer?

Data Lake integration : Integrated with Data Lake as well as other AWS services which helps in using these services together seamlessly.

Best performance :3x faster than other cloud data warehouses through a variety of use cases.

Lower costs : 75% lower costs than other cloud data warehouses and costs are predictable.

Highly scalable :Virtually unlimited elastic linear scaling.

Most secure and compliant : Ways to encrypt data, VPC and certifications such as ISO, HIPAA etc.

Easy to manage : The queries are based on SQL which makes anyone who is well versed in SQL to be able to query and manage Redshift.

Source : Getting started with Amazon Redshift

Redshift Architecture

The user’s BI tools ( such as Tableau, micro-strategy) or data integration products (Infromatica, Actian, Adeptia etc) or SQL clients can be the end point to access our Redshift Data Warehouse. 

Redshift works as a cluster of nodes, for each cluster there is 1 leader node and the rest are compute nodes. The user connects to the Leader node using JDBC/ODBC drivers. The compute nodes sit below the leader node and each cluster can have a maximum of 128 compute nodes however each cluster starts only with 2 compute nodes initially which can be increased later.

Leader Node 

  • Provides SQL endpoint.
  • Stores metadata.
  • Coordinates parallel SQL processing in the compute nodes.
  • Manages the compute nodes and assigns workloads.

Compute Node

  • Store data locally in a columnar fashion.
  • Executes queries in parallel.
  • Load, unload backup and restore from S3 buckets whenever necessary.

Spectrum data node

  • Run queries against data lake directly.

When a query is run in Redshift, all compute nodes of the cluster work in parallel to execute the query. This is known as shared nothing massively parallel processing architecture. The compute nodes also integrate with S3 for parallel processing of data. You can copy data from S3 to Redshift and vice versa, all of which happens in parallel. 

Elastic fleet of compute nodes known as Redshift spectrum which provide the ability to query data stored in S3 buckets present in any of the open formats (.csv, .json, .orc etc) .You can also join the data present in S3 with the data present in Redshift cluster.

Optimising workloads

Optimising workloads is quite easy in Redshift with only a few settings which need to be changed which are seemingly automated now. There are worker threads collecting data automatically and analysing the tables automatically. There are additional worker threads which are running in the background which continuously work to reclaim the deleted space, which is known as Automatic Vacuum Delete. These worker nodes automatically sort the tables as well. By using machine learning Redshift identifies optimal distribution keys and sort keys for the given data.

Amazon Redshift Advisor

Amazon Redshift Advisor is one of the most recently added features to Redshift, it provides operational statistics by leveraging the data which is being stored. It runs daily and provides recommendations tailored to an amazon user which can help them increase their cluster’s efficiency and understanding of the data which is stored. This significantly reduces the amount of effort which has to go into effectively managing the redshift cluster and the user can focus on the insights which are being produced by running queries against the data lake.

Redshift Data types

Redshift supports a huge variety of data types, majority of which are also supported by SQL based databases but most recently AWS has added support for geospatial data as well.

Materialised Views

When a user creates a view, every time a query has to be run against the view, it has to be created again first and then the query can be run, by using materialised views the result of the view query is persisted, which means for all further queries the response will be faster. Moreover, whenever any of the underlying tables of the materialised view are changing in terms of modification or addition of data, users have the choice of updating their materialised views at the same time, which saves a lot of compute power and time if the view needs to be in latest state whenever a query is being run against it. 

Stored Procedure

We can write DDL and DML operations and Redshift also supports the select query. If the user is using stored procedures in their on premise/other cloud service provider's DWH(Data Warehouse) then with the help of AWS schema conversion tool all of these procedures can be converted and stored onto the Redshift database in PL/PGSQL.

Redshift Federated Query

Federated Queries are a way in which Redshift can run queries on data which is not present inside Redshift but instead is present inside other data stores such as Aurora and RDS. It also provides the ability to join data from various data sources such as Aurora, S3, Redshift in itself etc. This is a performant way to run queries and also provides secure access data without a lot of data movement which saves network traffic and bandwidth occupation. 

Monitoring

Redshift console gives users an insight into the health of their clusters in their account within a region. It also provides you with a dashboard which shows the state of queries(under execution, completed or failed) which are executed on the db. This helps in understanding which kind of queries are the least efficient and how they can be made efficient for better performance. The query monitoring dashboard also has inbuilt filters which list queries according to the time taken (<10 mins, <1 min…).

Scaling

We can scale Redshift clusters horizontally and their nodes vertically. 

Vertical Scaling :We can scale out to multiple redshift clusters from the same endpoint. This is done by setting a policy, which allows the user to scale the cluster however they wish to. Per second billing for any additional clusters used. Free 1 hour usage per day. Allows unlimited users and queries.

Horizontal Scaling : Elastic resize allows the user to simply move from a number of nodes in a cluster to another number of nodes, this is a very good option in terms of costs as well. This elastic resize is fairly quick, sometimes the end user doesn't even notice when scaling took place. 

Security

  • End to end encryption.
  • IAM integration.
  • Amazon VPC for network isolation.
  • Database security model.
  • Certificates that include SOC 1/2/3 etc.

References