Session Summary: How FINRA operates PB-scale analytics on data lakes with Amazon Athena #ANT308 #reinvent

This post is the session report about ANT308: How FINRA operates PB-scale analytics on data lakes with Amazon Athena at AWS re:Invent 2020.

日本語版はこちらから。

Abstract

FINRA, the largest independent regulator for all securities firms doing business in the US, regulates trading in equities, corporate bonds, security futures, and options. Security and SLAs are critical for FINRA to meet its regulatory requirements. In this session, hear how FINRA developed applications on Amazon Athena to enable 1,500+ analysts and business partners to securely query financial trading data with multiple terabytes in daily updates. Learn how FINRA operates large-scale analytics using Amazon S3–based data lakes, open-source technologies, and Amazon Athena, and hear about the techniques the company uses to improve performance, operational efficiency, and operating costs.

Speakers

  • Janak Agarwal
    • AWS Speaker
  • Alain Menguy
    • Systems Director - FINRA

How FINRA operates PB-scale analytics on data lakes with Amazon Athena - AWS re:Invent 2020

Content

  • Amazon Athena overview
  • Who is FINRA, and what is FINRA’s mission?
  • Application Requirements
  • Why did we want to try Amazon Athena?
  • How did we integrate Amazon Athena into our architecture?
  • What do we want to see improve?
  • Future – Universal Data Access

Amazon Athena overview

  • Amazon Athena
    • Amazon Athena is a serverless, open source compatible query service for your S3-based data lake
    • Query data in your Amazon S3 based data lake
    • Analyze infrastructure, operation, and application logs
    • Interactive analytics using popular BI tools
    • Self-service data exploration for data scientists
    • Embed analytics capabilities into your applications

  • Serveless: No infrastructure to manage
  • JDBC and ODBC drivers
  • Simple console to get started immediately
  • Pay only for the data you scan
  • Charge for compressed data scanned and achieve 30% to 90% of cost reduction
  • Query any single data source such as Redshift and RDS via Federated Query
  • Compatible with Apache Parquet, Avro and ORC
  • Querying data is encrypted at rest and in transit
  • AWS Private Link: without requiring an internet gateway

  • Released over 100 features in the last year and a half
    • Federated Query
    • Machine learning interface
    • Cost control using workgroups

Who is FINRA, and what is FINRA’s mission?

  • Mission
    • Regulate security market and protect investors from market manipulation fold and inside operating
  • Collect data from firms and exchanges

  • Overview of the business flow
    • Data providers store exchanges and send the files to FINRA using SFTP or HTTPS
    • FINRA receives 10 ~ 20 terabytes of data on a daily basis
    • Validate the incoming files to meet the specification and the requirements
    • Register the files into data catalogs
    • Run thousands of ETL jobs on a daily basis to prepare for the analytics
    • Run automated detection model for anomalies and generating alerts

  • Data lake architecture
    • Store all data in Amazon S3 and able to access any historical data
    • Develop own data catalogs, HERD which is OSS
    • Separate storage from compute
  • Protect the data
    • Encrypt at rest
      • Amazon KMS or SSE on Amazon S3
      • LUKS encryption on ephemeral or Amazon EBS with Amazon EMR security configuration
    • Encrypt in transit
      • HTTPS, SSL
    • S3 bucket versioning
      • Keep older versions of files if need to revert updates or deletes
    • Backup to another Region or account
      • In case primary AWS Region or account compromised
      • Immediately store in Amazon S3 Glacier for savings

  • Technology stacks
    • ECS, EC2: web applications and SFTP layer
    • S3: All the data here
    • HERD: All the data registered to data catalog
    • EMR: ETL process using Hive and Spark, Realtime analysis using HBase and Presto
    • Redshift, Athena: Analysis
    • Databricks, SageMaker: Machine learning workloads with R, Python and Scala

Application Requirements

  • CATFOLA Viewer
    • The order lifecycle assembly tool
    • Allows users to enter a CAT Lifecycle ID, or set of key fields
    • Quickly and dynamically return all the events that participate in the lifecycle
  • Ingest 150–250 billion records (daily)
  • Volume estimation = 10-20 TB (daily)
  • Retention: 6 years of data online
    • ~300 trillion records
    • ~25 PB Amazon S3 storage
  • Integration with HERD (Hive metastore)
  • 1500+ analysts (~300 concurrent users)
  • 90% of all request finishing under a minute (60 seconds)

Why did we want to try Amazon Athena?

  • EMR-HBase on Amazon S3
    • Operationality
      • Long-running EMR cluster
      • Maintenance, upgrade required lot of engineering and operational work
    • Complexity
      • Kerberos to enable inter-nodes encryption for HBase
    • Expertise
      • Difficult to find people with HBase knowledge
    • Pricing
      • Why pay for what you don't use?
  • Amazon DynamoDB
    • Performance at scale is excellent but price is high
    • Cost of storage is premium - 25 PB needed
    • Add the cost of a 2nd copy (for D/R) in another AWS Region
    • Data ingestion challenge: how to load 250 billion records (or 20TB) daily?
  • Amazon Redshift
    • Redshift + Spectrum: still need to spin a Redshift cluster to query data
  • Amazon Athena
    • No infrastructure provisioning
      • Ability to spend more time on business core mission than on infrastructure tasks?
    • Automatic Scaling
      • No need to build for peak
    • Highly available and secure
      • Internodes encryption, AWS KMS Encryption
    • Pay for Value
      • $5 per TB of data scanned

  • Athena pricing is very attractive
    • Pay only for the queries you run vs. cost of maintaining long running cluster
    • Athena charges $5 per TB of data scanned
    • For $5 we can execute:
      • 100,000 queries with an average of 10MB data scanning
      • 10,000 queries with an average of 100MB data scanning
      • 1,000 queries with an average of 1,000MB (1GB) data scanning
  • FINRA have loaded 250 billion records stored in 1 partition, which is 20 terabytes of data
    • But the query is scanning only 10 megabytes of data to retrieve all the necessary information

How did we integrate Amazon Athena into our architecture?

  • FINRA has its own data catalog (HERD) using Hive Metastore to store table definition and schema.
  • HERD is our Single Source of Truth
  • The Hive external metastore integration with Amazon Athena is one of the key components in our architecture
  • Fully serverless integration vs EMR- thrift solution
  • No performance difference between AWS Glue Catalog and HERD
    • → Ready to adapt Athena into the existing architecture

  • Performance at scale
    • Goal: Prove that Athena can support 100 concurrent requests with 90% of all request
      • Finishing under a minute (60 seconds)
      • Maximum latency of no more than 3 minutes (180 seconds)
    • Test flow
      • Loaded up 2 petabytes of data
      • 1-hour test and 30-min with 100 concurrent requests
    • Orange graph
      • Submit 50,000+ query and observe 17 failures only
      • Success ratio: 99.9%
      • → Athena is reliable even under loads
    • Query response time
      • Average: 17 seconds
      • The 95 percentile: 32 seconds meets the FINRA's requirement

  • Performance tuning tips
    • Query performance is strongly dependent on how data is organized in Amazon S3
    • Partitioning/ Bucketing
      • Choose Partition/Bucket with keys that align with common query filters used
    • High performance columnar storage
    • File size
      • Use of optimized file sizes to reduce Amazon S3 roundtrips (between 256MB to 1GB)
    • “S3 Pre-partitioning”
      • Collaborate with Amazon S3 team for optimal performance
  • Operational readiness
    • Workgroup for workload segregation
      • Data and logs are isolated between applications
    • AWS Glue Catalog only for transient data
      • Define time to leave and delete at some point of time
    • AWS CloudWatch monitoring
      • Query related metrics, custom dashboards, realtime notification
    • Application specific monitoring (Nagios)
      • Check on query queue time, execution time and planning time
    • JDBC & ODBC Drivers

What do we want to see improve?

  • Supports only Hive bucketing
    • Using Hive to generate Athena dataset is slow
    • → Want to use Spark to generate
  • Remove the 1-file-per-bucket restriction
    • Athena doesn't support tables in which the number of files doesn't match the number of buckets
    • This causes the data skew
  • Lack of SQL explain plan
  • In need of Presto engine upgrade
  • Lack of dynamic filtering
  • Lack of PII and PCI redaction

Future – Universal Data Access

  • Federated query, AWS Glue and more
    • FINRA's enterprise-wide philosophy
      • Give access to all structures data available with appropriate data access control
    • Access structured data from multiple storage platforms (Amazon S3, Amazon RDS) in one SQL-based interface or query tool
    • Control data access at table, column, and row level with centralized policies enforced in each storage platform
    • Filter and join data across multiple storage platforms with reasonable performance

AWS re:Invent 2020 is being held now!

You wanna watch the whole session? Let's jump in and sign up AWS re:Invent 2020!

AWS re:Invent | Amazon Web Services