In this blog we will create a Datalake using Amazon S3 and use AWS Glue to crawl the data from S3 to build a metadata store. Finally, query the data in Amazon Athena using standard SQL.
A Datalake is the most popular way to store and analyse the massive amount of data. A data lake allows organisations to store all their data—structured and unstructured—in one centralised repository. Amazon S3 provides the foundation of a data lake. AWS Glue is use to crawls the Amazon S3 data, identifies data formats, and then suggests schemas for use with other AWS analytic services.
Create Datalake Using Amazon S3
Upload the data to the S3 bucket
Create a s3 bucket by applying partitions to improve the query performance and upload the parquet file into s3 bucket.
- Parquet stores data in a columnar format. It provide features that store data efficiently by employing column-wise compression, different encoding, compression based on data type which helps to improve the query performance.
- Partitioning divides your table into parts and keeps the related data together based on column values such as date, country, and region. You can define them at table creation, and they can help reduce the amount of data scanned per query, thereby improving performance. You can restrict the amount of data scanned by a query by specifying filters based on the partition.
Create a bucket
The data is partitioned according to the source where it comes from and to the day when it is loaded. For example, files from a source named "source" loaded on September 20th, 2022 are stored in /source/2022/09/20. As shown in image below, we have created folder for year[example:2022]inside the source. Similarly create the folder for month [example:09] and day [example:20].
- Finally, on the last level, the data is stored using parquet files.
Create an IAM role
- Go to IAM console and click on "Roles" and then the "Create Role" button.
- Choose "Glue" and click on next.
- Give permission to S3.
- Give permission to Glue.
- Enter a role name (eg, AWSGlueServiceRole)and click on the "Create role" button and a new role is created.
Configure AWS Glue Crawler
Now we have the data stored in the Data Lake, we need to be able to query the data using SQL. We are going to use AWS Glue Data Catalog Databases and Crawlers for allowing us to run SQL queries on top of the Datalake.
- Go to the "AWS Glue" in the AWS console and click on"Databases".
Click on "Add Database" and Enter the name of the Database (eg, cm-datalake-db)
- Click on Crawlers and click on "Create Crawler".
- Enter the crawler name (eg, cm-datalake-crawler) and click on the "Next".
- Select the Data Source as "S3" and give the path of the data that we have stored in our S3 bucket and click on "Next".
- Select the IAM role which we have created before and click "Next".
- Add the "Target Database". Select the "On Demand" option and click "Next".
- In this step, we have to review the AWS Glue crawler configuration and click on "Create Crawler".
- Try running the Crawler. Go to the Crawler and Run the Crawler.
Running query in Amazon Athena
- Go to the Amazon Athena Console and Click on "Query editer".
- Select the DataSource and Database which we have created. We can see the table which has been created [CM_database_123] and Execute the query using standard MySQL.
SELECT *FROM "cm-datalake-db"."cm_datalake_123";
In this blog we have created a Datalake and configure AWS Athena for querying the data using standard MySQL. Athena is easy to use and integrate with Amazon S3. Using Athena, there is no need to create Extract, Transform and Load(ETL) jobs to prepare our data for analysis. Partitioning increases the query performance. Most of the queries delivered the result within seconds.