This article talks about analysing and executing SQL queries on data which is stored in amazon S3 using Amazon Redshift and Redshift spectrum.
What is Amazon Redshift?
Redshift:- It is an OLAP style, relational database. Fully managed, elastic scaling, fast, easy to use, and set up petabyte-scale cloud data warehouse solution for runningthe most complex analytical workloads in standard SQL.
Redshift supports SQL, ETL, BI tools. It optimizes queries, scales quickly and automatically for high traffic and concurrent queries, and encrypts data at rest and transit. It's mainly used for analysis, BI, Data mining, and huge queries with the ease of automating admin and maintenance tasks.
What is Amazon Redshift Spectrum?
Redshift Spectrum:- Storing data in Redshift over a long period of time becomes very expensive. That's where the Redshift spectrum saves our day. With Redshift Spectrum, we can query structured and unstructured data present in S3 without having to save the data in redshift tables. Redshift Spectrum allows to take advantage of low-cost affordable S3 storage and still scale-out to pull, filter, and sort data.
Redshift Spectrum lives on Redshift servers that are independent of the cluster, It is a serverless query processing engine, with nothing to provision or manage. Spectrum costs $5 per terabyte of data queried, and combined with S3 storage, it is a more cost-efficient solution than storing data in a Redshift cluster.
Note:- Redshift cluster and the Amazon S3 bucket must be in the same AWS Region, can't perform update or delete operations on external tables.
Working with Redshift Spectrum
- For the purpose of the article, JSON files are being stored in S3 by lambda every 5 mins.
- JSON file content and lambda code repository.
- Create a role for redshift, which gives access to S3. For role, permissions choose
AWSGlueConsoleFullAccess-> if you're using the AWS Glue Data Catalog. Or choose
AmazonAthenaFullAccess-> if you're using the Athena Data Catalog.
- Note:- Athena-managed data catalog is an old concept, for the current status quo amazon recommends to use external Data Catalog in AWS Glue ( see reasons to upgrade to Amazon Glue Data Catalog)
- Attach this role to redshift cluster.
- Connect to Redshift either using SQL client tool or native redshift query editor.
- This step and the following step is optional, applies only in the case when somebody wants to store their external table into an individual redshift database and corresponding table ( of course query performance of redshift >> redshift spectrum)
- Once we are connected, to the default database, we can create our individual database and work in it.
create DATABASE blogdb create table redshiftanalysistable ( colour varchar(4000), number int )
- External tables are created in the external schema. The external schema points out to a database in the external data catalog. The external database can be created in an Amazon Athena Data Catalog, AWS Glue Data Catalog, or an Apache Hive metastore, such as Amazon EMR. This article uses Amazon Glue Data Catalog.
- To create an external schema.
create external schema blogspectrum from data catalog database 'blogspectrumdb' iam_role 'arn:aws:iam::123456789012:role/mySpectrumRole' create external database if not exists;
- Now that we have an external schema, will create an external table and point it to the S3 bucket we wish to query in SQL. An important thing to remember redshift spectrum flawlessly supports querying nested data too.
CREATE EXTERNAL TABLE blogspectrum.extable( colour varchar, number int ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' stored as textfile LOCATION 's3://jatin-bucket-for-redshift-spectrum-blog/jatin-mehrotra/blog_demo/';
- Now you can query your table the same way you do it at any other table.
select a.colour FROM blogspectrum.extable a;
- If you want to save the external table in redshift without duplicating records we can do it in this way.
insert into redshiftanalysistable (colour,number) select a.colour,a.number FROM blogspectrum.extable a EXCEPT SELECT colour,number FROM redshiftanalysistable select * from redshiftanalysistable where number=12;
In this article, we saw how we can leverage the power of redshift for analyzing complex analytic queries, joins, and aggregations over big datasets, when redshift is combined with redshift spectrum and S3 it becomes a much better cost-saving approach for analyzing data and Business Intelligence.
Till then happy learning :)