Session Summary: Under the hood: How Amazon uses AWS for analytics at petabyte scale (ANT311) #reinvent

This post is a session report on ANT311: Under the hood: How Amazon uses AWS for analytics at petabyte scale, presented at AWS re:Invent 2020.

The Japanese version is available here.

Abstract

As Amazon’s consumer business continues to grow, so does the volume of data and complexity of analytics done to support the business. In this session, hear how Amazon.com uses AWS technologies to build a secure, compliant environment to process and analyze petabytes of data. This session examines how Amazon is evolving the world of data analytics and machine learning with a combination of data lakes and parallel, scalable compute engines such as Amazon EMR, AWS Lake Formation, and Amazon Redshift. You also learn about the challenges Amazon faced in building and migrating to a next-generation data lake of Amazon scale.

Speakers

  • Karthik Kumar Odapally
    • AWS Speaker
  • Huzefa Igatpuriwala
    • Solutions Architect in Amazon Business Data Technologies, Amazon.com

Under the hood: How Amazon uses AWS for analytics at petabyte scale - AWS re:Invent 2020

Content

  • History
  • Migration patterns
  • Current architecture
  • Evolution is key
  • Learn more

History

  • Business landscape at Amazon
    • Amazon is a globally distributed business
  • In the legacy days...
    • The system had one of the largest Oracle clusters in the world
    • Used by multiple teams: Prime Video, Alexa, Music, etc.

  • Running the business
    • Diverse user base and use cases
      • For SDEs, analysts, and machine learning scientists
      • A plethora of use cases
      • Varied access patterns for the underlying data
    • ~38K active datasets of curated business data
    • ~900K daily jobs, ~80K active users

  • Amazon legacy data warehouse
  • Source systems
    • The underlying data is stored in Amazon DynamoDB and Amazon Aurora
  • Legacy data warehouse
    • Multiple services leveraged a combination of Oracle and Redshift
    • Single fleet managed all of the data via the Oracle clusters
    • Contained an ELT environment wherein all the jobs were handled via SQL
  • Analytics
    • Software applications, reporting systems and users interacted with the data

  • Andes Data Lake
    • Owned by Amazon; one of the largest data lakes in the world
  • Goals of Project Andes: Data warehouse and data processing at scale
    • Provide an ecosystem that scales with the Amazon business
    • Open systems architecture; provide choice and options of analytics technologies
    • Leverage AWS technologies and improve these technologies for all Amazon customers

Migration patterns

  • Legacy data warehouse challenges
    • Coupled data storage and compute
      • Individual components could not be scaled independently
    • 100s of hours/month of maintenance and patching
    • Expensive licensing and costs of H/W scaled to peak
      • Kept paying for peak capacity for the entire year
  • Requirements
    • Business does not stop
      • Our customers expect Amazon services to be online 24/7, 365 days a year
    • Analytics solutions needed redesign
    • Compute systems management decentralized
      • Scale individual components when needed

  • Migration sequence
    • Took two years due to huge technical challenges
    • Ran both the legacy system and the new one at the same time
  • 1. Build metadata and governance services
  • 2. Build a data mover that can move the data from the Oracle system to S3
    • S3 = the primary data store and landing zone
    • Leverage new formats at the new destination versus the old
    • Load strategies
      • Ensure that the data was presented in two systems in parallel
      • Ensure data loss does not happen
    • Build a query conversion service for 900,000 jobs
  • 3. Disable the legacy ecosystem components one by one
  • 4. Finally had just one platform, the Andes Data Lake
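The parallel load strategy above implies verifying that the legacy system and the new S3-backed store held identical data before cutover. The talk does not describe the actual tooling; as an illustration, a minimal stdlib-only sketch (hypothetical function names) that compares row-level checksums between the two stores:

```python
import hashlib

def row_checksum(row):
    """Stable checksum of a row, keyed on sorted column names."""
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(payload.encode()).hexdigest()

def diff_tables(legacy_rows, lake_rows, key):
    """Return primary keys whose rows differ (or are missing) between stores."""
    legacy = {r[key]: row_checksum(r) for r in legacy_rows}
    lake = {r[key]: row_checksum(r) for r in lake_rows}
    return sorted(
        pk for pk in set(legacy) | set(lake)
        if legacy.get(pk) != lake.get(pk)
    )

# Toy data standing in for an Oracle extract and its S3 copy
legacy = [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}]
lake = [{"id": 1, "qty": 5}, {"id": 2, "qty": 9}]
print(diff_tables(legacy, lake, "id"))  # → [2]
```

An empty diff is the "no data loss" signal; any keys returned mark rows that still disagree between the two systems and need reloading.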

  • Key takeaways
    • Leadership support and organizational buy-in
    • Eliminate, automate and provide self-service tools
    • Plan to run both legacy system and new system during migration
    • Engineers and users need re-training
    • Communicate, over-communicate, and communicate often to all stakeholders

Current architecture

  • Center: Andes Data Lake
    • Emulate the behavior of the legacy data warehouses for accessing data backed by S3
    • Left boxes: inherited from the legacy system
      • Kept the workflow systems largely the same
      • Enhanced the workflows to read from S3 in addition to Oracle
    • Right boxes: Andes
      • A bunch of services to manage the data on S3
      • Consists of a UI at the top where users can search, find and modify datasets
      • Users can version and replicate datasets in a coherent way
      • Consistently propagate data to downstream targets
      • Allow the files in S3 to be used like tables
        • Able to define primary keys for the dataset
      • Have mod (modification) semantics in the datasets
    • Right box in the middle: Subscription Service
      • Managed by the orchestration service inherited from the legacy data warehouse
      • Propagate the data and the metadata out to the analytics environment across Amazon as changes occurred to datasets on Andes
  • AWS Glue
    • A centerpiece in the non-relational approach of accessing datasets
    • Help software engineering teams to access datasets via EMR
  • Right: Targets
    • Synchronizers for Glue
      • Propagate metadata to the Glue catalog for any updates
    • Redshift Spectrum + Glue
      • Synchronize datasets

After all migrations were completed, they disabled the legacy Oracle environment, and now everything runs on S3 as the backend store.
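Treating files in S3 like tables with primary keys and mod semantics, as described above, amounts to merge-on-primary-key logic: each update batch is applied against the current snapshot by key. The real Andes services are not public; this is a hedged stdlib-only sketch of the idea:

```python
def apply_mods(snapshot, mods, pk="id"):
    """Apply upsert/delete mods to a table snapshot keyed by primary key.

    snapshot: list of row dicts (the current contents of the dataset)
    mods:     list of ("upsert" | "delete", row) pairs, in the order a
              relational database's change log would express them
    """
    table = {row[pk]: row for row in snapshot}
    for op, row in mods:
        if op == "upsert":
            table[row[pk]] = row          # insert or overwrite by key
        elif op == "delete":
            table.pop(row[pk], None)      # idempotent delete
        else:
            raise ValueError(f"unknown mod operation: {op}")
    return [table[k] for k in sorted(table)]

snapshot = [{"id": 1, "sku": "A"}, {"id": 2, "sku": "B"}]
mods = [("upsert", {"id": 2, "sku": "B2"}), ("delete", {"id": 1})]
print(apply_mods(snapshot, mods))  # → [{'id': 2, 'sku': 'B2'}]
```

The point of the abstraction is that consumers see table-like update behavior even though the underlying S3 objects are immutable files.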

  • Heterogeneous sources
    • Amazon Kinesis
      • Kinesis Data Streams and Data Firehose
      • Real time and batch ingestion
    • Amazon DynamoDB Streams
      • Ingest data at the lowest grain of a minute
    • Amazon Redshift and Amazon RDS
      • Batch ingestion
    • AWS Glue + Amazon S3
      • Register AWS Glue datasets backed by S3
      • Data synchronized based on CloudWatch Logs events

  • Andes metadata service
    • Synchronization primitives
      • Write the data out to compute targets
      • Keep track of the data when it arrives
      • Integrate mod semantics that mimic the modification logic of a relational database
    • Completion service
      • Coordinate dataset updates from step to step and notify
      • Using Glue, Airflow or custom EMR orchestration
    • Manifests
      • Consist of the list of entries for the files being updated
      • Captures the metadata and mod semantics
    • Access management and governance
      • The abstraction layer on top of datasets
      • Control permissions on who had access to what datasets
      • Managing shared encryption keys
    • Track workflow analytics
      • Perform lineage and trace upstream and downstream workflows
      • Help in operational troubleshooting, workflow discovery and reducing engineering efforts
    • Data formats -> TSV and Parquet
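A manifest, per the bullets above, lists the files touched by one dataset update together with metadata and mod semantics. The actual manifest schema is internal to Andes; a hypothetical minimal shape (all field names are illustrative assumptions):

```python
import hashlib
import json

def build_manifest(dataset, version, files, mod_type):
    """Build a manifest describing one dataset update.

    files:    mapping of S3 key -> file bytes (in-memory stand-ins here)
    mod_type: how consumers should apply the files, e.g. "append" or "replace"
    """
    return {
        "dataset": dataset,
        "version": version,
        "mod_type": mod_type,
        "format": "parquet",  # the talk names TSV and Parquet as the formats
        "files": [
            {
                "key": key,
                "bytes": len(body),
                "sha256": hashlib.sha256(body).hexdigest(),
            }
            for key, body in sorted(files.items())
        ],
    }

manifest = build_manifest(
    "orders", 42,
    {"s3://bucket/orders/part-0.parquet": b"demo-bytes"},
    mod_type="append",
)
print(json.dumps(manifest, indent=2))
```

Capturing sizes and checksums per file is what lets downstream synchronizers verify an update arrived intact before propagating it to targets.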

  • Subscription Service
    • For analytics at scale
    • Contract
      • Metadata about governance, security and SLAs
      • Who and what kind of access to data is present
      • Declaration of how frequently datasets will be updated
      • What kind of support dataset owners had agreed to provide
    • Synchronization
      • Updates data and metadata
      • Schema change
  • Synchronizer targets
    • Amazon Redshift
      • Shared compute fleets
      • Also ETL fleets
      • High-performance, lower-skill bar
    • AWS Glue catalog
      • Amazon EMR–based clients, Amazon Athena, etc.
      • ETL fleets
      • Also shared EMR compute fleets -> Spark SQL and Scala
    • Amazon Redshift Spectrum
      • Open a wide range of possibilities where users can query on large datasets

  • Data processing architecture
    • Orchestrated DAG workflows using Spark on Amazon EMR
    • Heterogeneous data pipelines
    • Perform data aggregations and load results into Amazon Redshift
    • Make data accessible for analytics stack from AWS
    • Perform ML on Spark using Scala
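Orchestrating DAG workflows, as in the data processing architecture above, boils down to running each step only after all of its upstream dependencies have completed. A stdlib sketch of that ordering (Kahn's algorithm), with hypothetical pipeline step names; the real orchestration on EMR/Airflow is of course far richer:

```python
from collections import deque

def topo_order(deps):
    """Return an execution order for a DAG given step -> upstream deps."""
    indegree = {step: len(up) for step, up in deps.items()}
    downstream = {step: [] for step in deps}
    for step, up in deps.items():
        for u in up:
            downstream[u].append(step)
    ready = deque(sorted(s for s, d in indegree.items() if d == 0))
    order = []
    while ready:
        step = ready.popleft()
        order.append(step)
        for nxt in downstream[step]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(deps):
        raise ValueError("cycle detected: not a DAG")
    return order

# Illustrative pipeline: ingest -> transform -> aggregate -> load
pipeline = {
    "ingest": [],
    "transform": ["ingest"],
    "aggregate": ["transform"],
    "load_redshift": ["aggregate"],
}
print(topo_order(pipeline))  # → ['ingest', 'transform', 'aggregate', 'load_redshift']
```

The same dependency bookkeeping is what a completion service automates at scale: a step's downstream consumers are notified only once its indegree of pending updates reaches zero.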

  • Operational excellence
    • Reduced peak scaling effort by 10x and cut admin overhead with AWS managed services
  • Performance efficiency
    • Teams reported latency improvements of 40% at 2x–4x the load
  • Cost optimization
    • Teams reported 40–90% operational cost savings

Evolution is key

  • Serverless
    • Leverage more and more managed services
  • Optimization
    • Transition from a centralized Redshift cluster model to a "bring-your-own" Redshift cluster model
    • Optimize compute and storage
    • Provide cost accountability
  • Machine learning
    • Make Andes interoperable with more AWS ML services
  • Compliance
    • Automate as per different data regulation requirements
  • Service discovery framework
    • Provide service-to-service authentication

Learn more

AWS re:Invent 2020 is being held now!

Want to watch the whole session? Jump in and sign up for AWS re:Invent 2020!

AWS re:Invent | Amazon Web Services