The digest of all database services managed by AWS, including the latest updates #reinvent [DAT212-L]

This post is a session report on DAT212-L: Leadership session: Database and analytics at AWS re:Invent 2019.

The Japanese version is here.

Abstract

We’re witnessing an unprecedented growth in the amount of data collected and stored in the cloud. Generating insights from this data requires database and analytics services that scale and perform in ways not possible before. AWS offers a broad set of database and analytics services for processing, storing, managing, and analyzing all your data. In this session, we provide an overview of the database and analytics services at AWS, new services and features that we launched this year, how customers are using these services, and our vision for continued innovation in this space.

Speakers

  • Raju Gulabani
    • VP Databases, Analytics and ML, Amazon Web Services

In this session, Raju Gulabani gave a brief digest of AWS database services, including the latest updates announced at re:Invent 2019. The session is really helpful if you need the big picture of AWS database services in order to build a new application or migrate an on-premises application to AWS.

Introduction

The new realities

  1. Connected customers, employees & devices that are always on
  2. Explosion of data
  • Designed for the new reality
    • Cloud optimized
      • Architect services built from the ground up for the cloud and for the explosion of data
    • Purpose built
      • Offer a portfolio of purpose-built services, optimized for your workloads
    • Fully managed
      • Help you innovate faster through managed services

Now used by a very large number of customers for mission-critical applications

Our portfolio

Three types of projects

  1. Get more value out of your data quickly
  2. Build apps for the new scale of data
  3. Modernize your data infrastructure

1. Get more value out of your data quickly

Traditional DWH approaches don't scale

  • Data lake
    • Amazon S3
    • AWS Glue
    • Lake Formation

Amazon Redshift: Data warehousing

  • First and most popular cloud data warehouse
    • Data lake & AWS integration
      • Redshift Spectrum
    • Best performance, most scalable
      • Massively parallel processing (MPP) architecture, Shared nothing model
    • Most secure & compliant
    • Lowest cost
  • Amazon Redshift Federated Query
    • Query real-time data generated in an operational database from a Redshift cluster
  • Amazon Redshift on RA3 instances (GA)
    • Separate compute from storage using S3
    • Optimize your data warehouse by paying for compute and storage separately
  • AQUA - Advanced Query Accelerator (COMING IN 2020)
    • Redshift runs 10x faster than any other cloud data warehouse without increasing costs
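
The "massively parallel processing, shared nothing" idea mentioned above can be illustrated with a toy scatter-gather aggregation. This is only a minimal sketch of the general technique, not how Redshift is implemented: the function names, the toy hash, and the sample rows are all assumptions made for illustration.

```python
from collections import defaultdict

def hash_partition(rows, node_count):
    """Assign each row to exactly one node based on its distribution key."""
    nodes = [[] for _ in range(node_count)]
    for key, amount in rows:
        # Toy deterministic hash (hypothetical); real systems use stronger ones.
        node_id = sum(key.encode()) % node_count
        nodes[node_id].append((key, amount))
    return nodes

def local_sum(node_rows):
    """Each node aggregates only the rows it owns (shared nothing)."""
    totals = defaultdict(int)
    for key, amount in node_rows:
        totals[key] += amount
    return dict(totals)

def merge(partials):
    """A coordinator combines the per-node partial results."""
    combined = defaultdict(int)
    for partial in partials:
        for key, subtotal in partial.items():
            combined[key] += subtotal
    return dict(combined)

# Hypothetical sales rows: (city, amount)
rows = [("tokyo", 10), ("osaka", 5), ("tokyo", 7), ("nagoya", 3), ("osaka", 1)]
nodes = hash_partition(rows, node_count=3)
result = merge(local_sum(n) for n in nodes)
print(result)
```

Because no node needs another node's rows to compute its partial sums, adding nodes spreads the work without adding coordination between them.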

Amazon EMR: Big Data Processing

  • Easily run Spark, Hadoop, Hive, Presto, HBase, and more big data apps on AWS
    • Latest versions
    • Low cost
    • Use S3 storage
    • Easy, fully managed
  • Performance improvements for Spark in Amazon EMR
    • Performance-optimized runtime for Apache Spark, 2.6x faster performance at 1/10th the cost
  • Amazon EMR on AWS Outposts (GA)
    • Use your on-premises data center like another Availability Zone

Amazon Athena: Interactive query service

  • Query instantly
    • Ad hoc query
  • Pay per query
  • Use S3 storage
  • Easy, serverless

  • Amazon Athena Federated Query (PREVIEW)
    • Run SQL queries on data spanning multiple data stores
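
To give a feel for what "one SQL query spanning multiple data stores" means, here is a small local analogy using only Python's standard `sqlite3` module. It is not Athena and does not use any AWS API: the two attached in-memory databases (`orders_db`, `crm_db`) and their tables are hypothetical stand-ins for separate data stores.

```python
import sqlite3

# One connection, two separately attached in-memory databases that play
# the role of two independent data stores (names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS orders_db")
conn.execute("ATTACH DATABASE ':memory:' AS crm_db")

conn.execute("CREATE TABLE orders_db.orders (customer_id INTEGER, amount INTEGER)")
conn.execute("CREATE TABLE crm_db.customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO orders_db.orders VALUES (?, ?)",
                 [(1, 120), (2, 80), (1, 40)])
conn.executemany("INSERT INTO crm_db.customers VALUES (?, ?)",
                 [(1, "Alice"), (2, "Bob")])

# A single SQL statement joins data that lives in both stores.
rows = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM orders_db.orders AS o
    JOIN crm_db.customers AS c ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)
```

The point of the analogy is that the person writing the query sees one SQL surface, while the engine takes care of reaching each underlying store.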

Amazon Elasticsearch Service: Operational Analytics

  • Fully managed, scalable, secure, Elasticsearch service
    • Open source Elasticsearch APIs, Kibana, and Logstash
    • Fully managed
    • Scalable, secure, and compliant
    • Pay only for what you use
  • UltraWarm for Amazon Elasticsearch Service (PREVIEW)
    • A new warm storage tier for Amazon Elasticsearch Service
    • Cost is 90% lower than the hot tier

AWS Glue: ETL and data catalog

  • Simple, flexible, and cost-effective ETL
    • Less hassle
    • Serverless
    • More power, automatically generates the code
  • Point to (crawl) the data source, figure out the data format, generate the code to transform it
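
The crawl, infer, and generate flow described above can be sketched in a few lines. This is a simplified illustration of the idea under stated assumptions, not AWS Glue's actual crawler or code generator: the type names, the `crawl` and `build_transform` helpers, and the sample CSV are all made up for this example.

```python
import csv
import io

def infer_type(values):
    """Return the narrowest type name that fits every sample value."""
    for type_name, caster in (("int", int), ("double", float)):
        try:
            for value in values:
                caster(value)
            return type_name
        except ValueError:
            continue
    return "string"

def crawl(csv_text):
    """'Crawl' a CSV source: read the header and infer a type per column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return {col: infer_type([row[col] for row in rows])
            for col in reader.fieldnames}

def build_transform(schema):
    """'Generate' a transform that casts raw strings to the inferred types."""
    casters = {"int": int, "double": float, "string": str}
    return lambda row: {col: casters[t](row[col]) for col, t in schema.items()}

sample = "id,price,city\n1,9.5,Tokyo\n2,12,Osaka\n"
schema = crawl(sample)
transform = build_transform(schema)
print(schema)
print(transform({"id": "3", "price": "4.25", "city": "Nagoya"}))
```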

AWS Data Exchange: Data exchange (GA)

  • Easily find and subscribe to 3rd-party data in the cloud
    • Quickly find diverse data in one place
    • Easily analyze data
    • Efficiently access 3rd party data

Amazon QuickSight: Visualizations

  • First cloud-native serverless BI with pay-per-session pricing & ML insights for everyone
    • Elastic Scaling
    • Serverless, no infrastructure
    • Native AWS, built for the cloud
    • API Support (NEW)
  • Machine learning in Amazon QuickSight
    • Anomaly Detection
      • Discover unexpected trends and outliers against millions of business metrics
    • Forecasting
      • Machine learning forecasting with point and click simplicity
    • Auto Narratives
      • Summarize your business metrics in plain language
    • ML Predictions (PREVIEW)
      • Visualize and build predictive dashboards with SageMaker models
  • ML predictions in Amazon QuickSight (PREVIEW)
    • Build predictive dashboards in hours with point-and-click, no coding required
      • Connect to any data
        • Data lakes, SQL engines, 3rd party applications and on-premises databases
      • Select an ML model
        • Use models created with Amazon SageMaker Autopilot, existing custom models, and packaged models from AWS Marketplace
      • Visualize and share
        • Analyze results, create visualizations, build dashboards / email reports and share them with business stakeholders
  • Easily embed analytics in your own tools (NEW)
    • Powered by QuickSight APIs and flexible customization. Entirely serverless
  • Customers using Amazon QuickSight
    • Capital One
      • Distribute interactive dashboards and reports at scale to tens of thousands of users in their organization
    • Best Western
      • Deploy QuickSight to 40 thousand users across all their hotel franchises
      • Replace their legacy reporting system
    • Amazon.com
      • Standardized on QuickSight company-wide across large numbers of teams and employees to enable fast and easy access to data

The latest updates in AWS data analytics services

  1. Easiest to build data lakes at scale
    1. Amazon Redshift Data Lake Export
    2. Amazon Redshift Federated Query
    3. Federated Query for Amazon Athena
  2. Best performance at lowest cost
    1. AQUA for Amazon Redshift
    2. RA3 for Amazon Redshift
    3. Amazon Redshift Materialized Views
    4. UltraWarm for Amazon Elasticsearch Service
    5. Performance Improvements for Spark in Amazon EMR
  3. Most comprehensive and open
    1. AWS Data Exchange
    2. Amazon EMR on AWS Outposts
    3. Record-level Insert/updates for Amazon EMR
    4. ML in Amazon Athena
    5. ML in Amazon QuickSight
  4. Most Secure
    1. Amazon S3 Access Points

AWS Data Lake use case: Sysco

  • Sysco migrated its on-premises data warehouse to AWS
  • Data lake -> Amazon S3
  • Analytics -> Amazon Redshift (Spectrum), Amazon EMR, Amazon Athena

By bringing data into S3 from various sources, including niche data marts such as Lexington and EDW, Sysco unlocked the true potential of its data and allowed different users to work with the ecosystem in new ways.

2. Build apps for the new scale of data

This part is almost the same as my post on DAT209-L: Leadership session: AWS purpose-built databases. That post is more concrete, while the summary below is more conceptual.

Common data categories and use cases

Amazon ElastiCache

  • Managed, in-memory data store service
  • Redis or Memcached to power real-time apps with submillisecond latency
    • Extreme Performance
    • Secure and hardened
    • Easily scalable
    • Highly available & reliable

Amazon DynamoDB

  • Fast and flexible NoSQL database service for any scale
    • Performance at scale
      • A database with no scaling limits
      • If you're designing a new application without a legacy system, DynamoDB gives you the best performance
    • Serverless
      • The basic idea is very simple
    • Comprehensive security
    • Global database for global users and apps

Amazon DocumentDB (with MongoDB compatibility)

  • Fast, scalable, highly available MongoDB-compatible database service
    • MongoDB-compatible
    • Highly available
    • Performance at scale
    • Highly secure

Amazon Managed (Apache) Cassandra Service (PREVIEW)

  • Scalable, highly available, and managed Cassandra-compatible database service
  • A solution for easily managing many Cassandra servers and keeping them available
    • Apache Cassandra-compatible
    • No servers to manage
    • Single-digit millisecond performance at scale
    • Simple migration

Amazon Neptune

  • Fully managed graph database
    • Open
    • Fast & Scalable
    • Reliable
    • Easy
  • Graph use cases
    • Navigate (variably) connected structure
    • Filter or compute a result on the basis of the strength, weight, or quality of relationships

Amazon Quantum Ledger Database (QLDB)

  • Fully managed ledger database
  • Track and verify history of all changes made to your application's data
  • Have a central authority you trust that can run the infrastructure
    • Immutable
    • Cryptographically verifiable
    • Highly scalable
    • Easy to use
  • Use case: BMW Group
    • Challenge
      • Needed a trusted, verifiable ledger to track automotive data
    • Solution
      • Build a BMW Digital Vehicle Passport App
      • Provide a transparent and complete history of vehicle data using Amazon QLDB
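
"Immutable" and "cryptographically verifiable" can be made concrete with a tiny hash-chained log. This is a deliberately simplified sketch of the general idea and should not be read as QLDB's actual journal design or API: the `Ledger` class, the `entry_hash` helper, and the sample vehicle events are assumptions for illustration.

```python
import hashlib
import json

def entry_hash(previous_hash, payload):
    """Hash the payload together with the previous entry's hash."""
    body = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(previous_hash.encode() + body).hexdigest()

class Ledger:
    """Append-only log in which every entry commits to all earlier entries."""

    def __init__(self):
        self.entries = []  # list of (payload, hash) tuples

    def append(self, payload):
        previous = self.entries[-1][1] if self.entries else ""
        self.entries.append((payload, entry_hash(previous, payload)))

    def verify(self):
        """Recompute the chain; any edited payload breaks a stored hash."""
        previous = ""
        for payload, stored in self.entries:
            if entry_hash(previous, payload) != stored:
                return False
            previous = stored
        return True

ledger = Ledger()
# Hypothetical vehicle history events
ledger.append({"vin": "EXAMPLE-VIN", "event": "manufactured"})
ledger.append({"vin": "EXAMPLE-VIN", "event": "sold", "mileage_km": 12})
print(ledger.verify())
```

If someone later rewrites an old payload without being able to recompute every following hash, `verify()` returns `False`, which is the property a vehicle history needs.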

Amazon Managed Blockchain

  • Create and manage scalable blockchain networks
    • Choice of Hyperledger Fabric or Ethereum
    • Scalable and secure
    • Fully managed
    • Easily analyze blockchain activity
  • Use case: Nestle
    • Created transparent supply chain management on blockchain

3. Modernize your data infrastructure

Old-guard database providers

  • Very expensive
  • Proprietary
  • Lock-In
    • You date a hardware vendor but you marry a database vendor
    • Databases are tightly coupled to applications and hard to replace
  • Punitive licensing
  • You've got mail

Amazon Aurora

  • MySQL and PostgreSQL compatible relational database built for the cloud
  • Performance and availability of commercial-grade databases at 1/10th the cost
    • Performance and scalability
    • Availability and durability
    • Highly secure
    • Fully managed
  • Amazon Aurora - high performance
    • Scale out to millions of reads per second
      • Up to 15 read replicas across three AZs
      • Auto-scale new read replicas
      • Seamless recovery from read replica failures

ML in Amazon Aurora, Athena, and QuickSight (NEW)

  • Bringing machine learning to databases, analytics and BI
    • Incorporate ML into databases, analytics and BI
    • Integrated with Amazon SageMaker & Comprehend
    • ML predictions using standard SQL statements
    • No ML expertise required
    • Reduces the time it takes to get predictions out of models
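
The phrase "ML predictions using standard SQL statements" is easier to picture with a small local analogy. The sketch below registers a Python function as a SQL function in SQLite, so a query can call it per row. It is not the Aurora or Athena integration, and `predict_sentiment` is a hypothetical keyword rule standing in for a real model endpoint.

```python
import sqlite3

def predict_sentiment(text):
    """Stand-in for a real model endpoint: a toy keyword rule, not ML."""
    return "positive" if "great" in text.lower() else "negative"

conn = sqlite3.connect(":memory:")
# Expose the "model" to SQL under a function name of our choosing.
conn.create_function("predict_sentiment", 1, predict_sentiment)

conn.execute("CREATE TABLE reviews (id INTEGER, body TEXT)")
conn.executemany("INSERT INTO reviews VALUES (?, ?)",
                 [(1, "Great hotel"), (2, "Room was noisy")])

# The caller only writes SQL; the prediction happens inside the query.
rows = conn.execute(
    "SELECT id, predict_sentiment(body) FROM reviews ORDER BY id"
).fetchall()
print(rows)
```

From the analyst's point of view, the prediction is just another column expression, which is why no ML expertise is needed to consume it.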

AWS Database Migration Service

  • Migrate between on-premises and AWS
  • Migrate between databases
  • Automated schema and stored procedure conversion
  • Data replication for near-zero downtime migration

  • Use case: Amazon.com

    • Challenge
      • Complex and costly to scale
        • 75PB of data
        • 7,500 Oracle databases
      • Expensive and punitive Oracle licenses
    • Solution
      • Migrated to...
        • Amazon Aurora
        • Amazon RDS
        • Amazon DynamoDB
        • Amazon ElastiCache
        • Amazon Redshift
      • 60%: Reduction in database costs
      • 70%: Reduction in database administrative overhead
      • 40%: Increase in performance of most critical apps

Managing software on-premises is time consuming and complex

  • Hardware and software installation
  • Configuration, patching, and backups
  • Cluster setup and data replication for high availability
  • Capacity planning, and scaling clusters for compute and storage

Amazon RDS

  • Managed relational database service with choice of popular databases
  • Amazon Aurora, MySQL, PostgreSQL, MariaDB, Microsoft SQL Server, Oracle
    • Easy to administer
    • Performant & scalable
    • Available & durable
      • Multi-AZ: a standby database in a different AZ
    • Secure and compliant
  • Amazon RDS Proxy (PREVIEW)
    • Fully managed, highly available database proxy
    • Provides a serverless-style layer between the application and the database
      • Supports the new scale of serverless application connections
      • Pools and shares database connections
      • Preserves connections during database failovers, which reduces downtime to 10 seconds
      • Manages DB credentials with Secrets Manager and IAM
      • Fully managed - no provisioning, patching, or management
  • Amazon RDS on AWS Outposts
    • Launch RDS in your data centers with AWS Outposts
    • Deploy secure, managed, RDS in minutes
    • Store data without moving to cloud
    • Integrate with on-premises databases and applications
    • Automates provisioning, patching, backup, restoring, scaling, and failover
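
The "pools and shares database connections" point made for RDS Proxy above is the classic connection-pooling technique. Here is a minimal, self-contained sketch of that technique. It says nothing about how RDS Proxy itself is built: the `ConnectionPool` class is hypothetical, and SQLite stands in for a real database server.

```python
import queue
import sqlite3
from contextlib import contextmanager

class ConnectionPool:
    """Keep a fixed set of open connections and lend them out on demand."""

    def __init__(self, size):
        self.opened = 0
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(self._open())

    def _open(self):
        self.opened += 1
        # SQLite is only a stand-in for a real database connection.
        return sqlite3.connect(":memory:", check_same_thread=False)

    @contextmanager
    def connection(self):
        conn = self._idle.get()      # blocks while every connection is in use
        try:
            yield conn
        finally:
            self._idle.put(conn)     # hand the connection back for reuse

pool = ConnectionPool(size=2)
results = []
for _ in range(10):                  # ten "requests" share two connections
    with pool.connection() as conn:
        results.append(conn.execute("SELECT 1").fetchone()[0])

print(pool.opened, results)
```

Ten requests are served while only two connections are ever opened. This matters for serverless applications, where many short-lived function invocations would otherwise each open their own database connection.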

Customers want to move to fully managed services

When to use which services

Situation | Solution
Existing application on MySQL | Amazon Aurora, RDS for MySQL
Existing application on PostgreSQL | Amazon Aurora, RDS for PostgreSQL
Existing application on MariaDB | Amazon Aurora, RDS for MariaDB
Existing application on Oracle | Use SCT to determine complexity -> Amazon Aurora, RDS for Oracle
Existing application on SQL Server | Use SCT to determine complexity -> Amazon Aurora, RDS for SQL Server
Existing application on MongoDB | Amazon DocumentDB
Existing application on Cassandra | Amazon Managed Apache Cassandra Service
New application, if you avoid relational features | Amazon DynamoDB
New application, if you need relational features | Amazon Aurora
In-memory store/cache | Amazon ElastiCache
Time series data | Amazon Timestream
Track every application change, cryptographically verifiable, with a central trust authority | Amazon Quantum Ledger Database (QLDB)
Don't have a trusted central authority | Amazon Managed Blockchain
Data warehouse & BI | Amazon Redshift, Amazon Redshift Spectrum, and Amazon QuickSight
Ad hoc analysis of data in AWS or on-premises | Amazon Athena & Amazon QuickSight
Apache Spark, Hadoop, HBase (needle in a haystack type queries) | Amazon EMR
Log analytics, operational monitoring & search | Amazon Elasticsearch Service
Real-time analytics | Amazon Kinesis and Amazon Managed Streaming for Apache Kafka
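
The first part of the mapping above is simple enough to encode as a lookup, which can be handy as a checklist during migration planning. The sketch below just restates the rows from the session. The `recommend` helper and its parameter names are hypothetical, and the result is a starting point rather than a substitute for evaluating your workload.

```python
# Targets for existing applications, taken from the mapping above.
EXISTING_ENGINE_TARGETS = {
    "MySQL": ["Amazon Aurora", "RDS for MySQL"],
    "PostgreSQL": ["Amazon Aurora", "RDS for PostgreSQL"],
    "MariaDB": ["Amazon Aurora", "RDS for MariaDB"],
    "Oracle": ["Amazon Aurora", "RDS for Oracle"],
    "SQL Server": ["Amazon Aurora", "RDS for SQL Server"],
    "MongoDB": ["Amazon DocumentDB"],
    "Cassandra": ["Amazon Managed Apache Cassandra Service"],
}

# Engines for which the session suggests running SCT first.
NEEDS_SCT_ASSESSMENT = {"Oracle", "SQL Server"}

def recommend(existing_engine=None, needs_relational=True):
    """Return (candidate services, whether to assess complexity with SCT)."""
    if existing_engine is not None:
        targets = EXISTING_ENGINE_TARGETS[existing_engine]
        return targets, existing_engine in NEEDS_SCT_ASSESSMENT
    # New application: the choice hinges on whether relational features are needed.
    if needs_relational:
        return ["Amazon Aurora"], False
    return ["Amazon DynamoDB"], False

print(recommend(existing_engine="Oracle"))
print(recommend(needs_relational=False))
```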