Query S3 Data with Athena using AWS Glue

2021.03.29

Athena

Athena is a serverless Interactive query Platform. It is used to query data stored in S3 directly. Standard SQL is used to query data. Athena uses Presto (a Distributed SQL Engine) to run queries. It supports variety of file formats – CSV, JSON, ORC, Parquet, Avor. With Athena you only Pay for querying the data.  You can get better performance and save on costs by partitioning data,  compressing data or converting data into columnar formats. Athena can be used to perform query on any kind of data – Unstructured, Semi-structured and Structured data. Ways to access Amazon Athena : AWS console, Athena API,  JDBC connection, Athena CLI. Athena can be integrated with QuickSight for data visualization, and with AWS Glue Data Catalog. Athena does not require Hadoop Clusters or Servers. Athena can be used for analysing Logs of CloudTrail, CloudFront, VPC, ELB, Ad-hoc.

AWS Glue

Glue is a serverless ETL service that can categorize, clean and move data between various data sources. AWS Glue Crawlers create databases and tables automatically from your data source and it stores the metadata in the AWS Glue Data Catalog. Athena uses Glue as central location for storing and retrieving metadata. Glue Data Catalog provides a unified repository of metadata. AWS Glue can be integrated with S3, Athena, RDS, Redshift, EMR. Glue provides both code and visual based interactions.

Querying S3 Data with Athena using Glue

Load Data to S3

Create a bucket and load the data file into the S3 bucket. Here I have taken a feed data in the JSON format.

Add a AWS Glue Crawler

AWS Glue Crawler is used to extract and create database and table from JSON file stored in S3 bucket automatically.

  • Select Crawler in AWS Glue console.  Click on Add Crawler in Crawler page. Enter Crawler name in name field and choose next. You can provide tags and description if needed(Optional).
  • Select the Crawler source type as Data stores and choose next.

  • In Add a data store page, select S3 as data store and specify the path of the S3 bucket. The S3 bucket path can be of your account or another account.
  • You can specify multiple data stores within a single Crawler. In this case, S3 is our only data store. So in Add another Data Store choose No and click next.

  • AWS Glue requires permissions to access S3 and create objects in Data Catalog. To add permissions create an IAM role or choose an existing role containing the required permissions. In this case, I have created a new role.
  • Choose Run on Demand and select Next. You can choose the schedule based on the requirements. The Schedule can hourly, daily, monthly or weekly basis. You can also choose specific days to run the Crawler.

  • In output page, click on Add database to create a new database. In the pop-up window, enter the name of the database and click create.
  • Review the choices and choose Finish to create the Crawler.

  • Select the Crawler you created and choose Run Crawler. When the Crawler completes, the database and the table are created.
Query the database using Athena

The query results and metadata information need to be stored in S3. In order to query the Database created, first we need to specify the location of output folder in S3. Create a folder in S3 and specify the location of this folder in Athena.

Run the SQL queries on the database. Select the database and then run the queries. You can also save the queries.

Conclusion

Amazon Athena allows you to analyse S3 data using SQL queries without the need of any infrastructures. Athena can be integrated with AWS Glue for creating database and tables automatically from the data source. Athena only charges for the amount of data scanned per query. This post shows how to use AWS Glue to create database and Athena to query data in Amazon S3.