Session Summary: How BMW Group uses AWS serverless analytics for a data-driven ecosystem #ANT310 #reinvent

This post is a session report on ANT310: How BMW Group uses AWS serverless analytics for a data-driven ecosystem at AWS re:Invent 2020.

Click here for the Japanese version.

Abstract

Data is the lifeblood fueling BMW Group’s digital transformation. It drives BMW’s personalized customer experiences, connected mobility solutions, and analytical insights. This session walks through the journey of building BMW Group’s Cloud Data Hub. BMW Group’s technical lead, Simon Kern, dives deep into how the company is leveraging AWS serverless capabilities for delivering ETL functions on big data in a modularized, accessible, and repeatable fashion and provides insight into the next steps of the journey. The services used in BMW Group’s AWS architecture are discussed, including AWS Glue, Amazon Athena, Amazon SageMaker, and more.

Speakers

  • Simon Kern
    • Lead DevOps Engineer - BMW Group

How BMW Group uses AWS serverless analytics for a data-driven ecosystem - AWS re:Invent 2020

Content

  • BMW Group IT: Brief intro
  • Cloud Data Hub: BMW Group’s central data lake
  • Orchestrating data
  • Ingesting and preparing data
  • Analyzing data
  • Outlook

BMW Group IT: Brief intro

BMW Group is a global mobility company operating in 29 countries with employees of 60 nationalities. It can also be called an IT company: 694 locations are connected through its global IT network, and it delivers over 230 software products. One of the most important services is BMW's ConnectedDrive backend, to which over 14 million vehicles are connected and which serves over 1 billion requests per day.

  • -> BMW Group produces a lot of data with these backend systems.
  • -> Ingest the data into a data lake to organize and analyze it together
  • -> Cloud Data Hub

Cloud Data Hub

  • Cloud native data lake that makes it easy to...
    • Ingest data
    • Have a scalable storage solution
    • Open up many possibilities to get value out of the data
  • Left Side
    • Over 500 software and data engineers
    • Build data ingests and data preparations to fuel a data marketplace
  • Right Side
    • Over 5,000 business analysts and data scientists
    • Build use cases, machine learning models and AI products
  • -> Data democratization: easy to access all the data in the BMW Group

You can work with the data seamlessly from the portal below.

The internal architecture consists of three pillars: the data providers, the data consumers, and the data portal and APIs. All of them are implemented on a multi-tiered AWS account setup.

Orchestrating data

  • Data Providers (Left Side)
    • Global IT units that provide central datasets
    • Local IT unit
  • Use Cases (Right Side)
    • Access to the data is controlled by the access management layer
    • Global uses Global datasets
    • US uses Global and Local datasets
  • Data portal and API (Center)
    • Important for customers to...
      • Explore and query the data
      • Manage metadata
      • Deploy infrastructure
    • Built on top of several APIs
    • Security, central compliance services
    • Single sign-on for all users
    • Gray boxes
      • Separate markets and legal entities into different hubs with their own storage accounts and processes
    • Unified seamless front end

  • Dataset
    • A combination of S3 buckets and Glue databases
    • Always lives inside the Hub
    • Assigned to a business object which categorizes the data into a separate unit
    • Three types of layers; every dataset is part of one of them
      • Source: a copy of the source system
      • Prepared: cleaned and harmonized data
      • Semantic: data enriched by aggregations or joins
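The dataset concept above (a combination of S3 buckets and Glue databases, living in a hub, assigned to a business object, and belonging to exactly one layer) can be sketched as a small naming helper. The bucket and database naming convention below is hypothetical, not BMW Group's actual scheme:

```python
# Sketch of the dataset/layer concept: every dataset lives inside a hub and
# belongs to exactly one layer (source, prepared, or semantic).
# The naming convention is hypothetical, for illustration only.
from dataclasses import dataclass

LAYERS = ("source", "prepared", "semantic")

@dataclass(frozen=True)
class Dataset:
    hub: str              # hubs separate markets/legal entities, e.g. "eu"
    business_object: str  # categorizes the data, e.g. "vehicle"
    name: str             # dataset name, e.g. "telemetry"
    layer: str            # one of LAYERS

    def __post_init__(self):
        if self.layer not in LAYERS:
            raise ValueError(f"layer must be one of {LAYERS}")

    def s3_prefix(self) -> str:
        """S3 location of the dataset inside its hub's storage."""
        return f"s3://cdh-{self.hub}-{self.layer}/{self.business_object}/{self.name}/"

    def glue_database(self) -> str:
        """Glue database that holds the dataset's tables."""
        return f"{self.hub}_{self.layer}_{self.business_object}"

ds = Dataset("eu", "vehicle", "telemetry", "prepared")
print(ds.s3_prefix())      # s3://cdh-eu-prepared/vehicle/telemetry/
print(ds.glue_database())  # eu_prepared_vehicle
```

Keeping the layer in both the bucket name and the Glue database name makes it obvious at a glance which stage of the source/prepared/semantic pipeline any table belongs to.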

Ingesting and preparing data

  • Key concepts of data ingestion
    • Ease of use
      • Ingestion kickstart via a UI-accessible, TypeScript-based CDK stack
      • Advanced features can be leveraged via Terraform modules
    • Flexibility
      • Specialized building blocks
      • Reusability via modularization
    • Maintainability
      • Community via internal open source
      • Bigger changes via forks

  • Ingestion from on-premises to the CDH Core account
    • All set up via Terraform
    • Glue ETL
      • Runs in a private VPC
      • Reads and pulls data from the on-premises network
      • Stores it in the central S3 bucket of the Cloud Data Hub
    • Secrets Manager
      • Stores database credentials
    • CloudWatch
      • Logging and triggering Glue jobs
    • Glue Data Catalog
      • A Lambda function syncs the catalogs between the Provider and CDH Core accounts
    • Independent security account
      • Stores KMS keys
    • PII API
      • Encrypts sensitive data
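The ingestion setup above (a PySpark Glue job in a private VPC, database credentials in Secrets Manager, KMS keys in a separate security account) could be expressed roughly as the following `glue.create_job` parameters. Every name, ARN, and account ID here is illustrative, and in the real setup these resources are created by Terraform modules rather than hand-written boto3 calls:

```python
# Illustrative parameters for an ingestion Glue job, mirroring the session's
# setup: the job runs in a private VPC (via a Glue connection), reads DB
# credentials from Secrets Manager, and writes to the hub's central S3 bucket.
# All names, ARNs, and account IDs are hypothetical.
def ingestion_job_params(system: str, hub: str) -> dict:
    return {
        "Name": f"ingest-{system}-to-{hub}",
        "Role": f"arn:aws:iam::111111111111:role/cdh-{hub}-ingest",
        "Command": {
            "Name": "glueetl",  # PySpark-based ETL job
            "ScriptLocation": f"s3://cdh-{hub}-scripts/ingest/{system}.py",
            "PythonVersion": "3",
        },
        # The Glue connection attaches the job to a private VPC with
        # on-premises connectivity, so it can pull from the source system.
        "Connections": {"Connections": [f"cdh-{hub}-onprem-vpc"]},
        "DefaultArguments": {
            # Secret holding the on-premises database credentials
            "--secret_id": f"cdh/{hub}/{system}/db-credentials",
            # Central source-layer bucket of the Cloud Data Hub
            "--target_bucket": f"cdh-{hub}-source",
            # The KMS key lives in the independent security account
            "--kms_key_arn": "arn:aws:kms:eu-central-1:222222222222:key/example-key-id",
        },
        "GlueVersion": "2.0",
        "NumberOfWorkers": 10,
        "WorkerType": "G.1X",
    }

params = ingestion_job_params("crm", "eu")
# The dict could then be submitted as boto3.client("glue").create_job(**params).
```

Packaging the job definition as a parameterized function is what makes it a reusable building block: every new source system is just another `(system, hub)` pair rather than a copy-pasted job.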

  • Data preparation
    • Set up via another Terraform module
    • Reads data from the central S3 bucket (source layer)
    • Writes it into prepared-layer datasets
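A preparation job takes raw source-layer records and writes cleaned, harmonized records into a prepared-layer dataset. The following is a minimal stdlib sketch of that clean/harmonize step; in the real setup this logic runs as a Glue/PySpark job, and the field names and rules here are hypothetical:

```python
# Minimal sketch of a "prepare" step: clean and harmonize source-layer
# records before writing them to the prepared layer. Plain Python stands in
# for the PySpark transformation; field names and rules are hypothetical.
from datetime import datetime, timezone
from typing import Optional

def prepare_record(raw: dict) -> Optional[dict]:
    """Clean and harmonize one source record; drop it if unusable."""
    vin = (raw.get("vin") or "").strip().upper()
    if not vin:
        return None  # unusable without a vehicle identifier
    # Harmonize timestamps: epoch seconds -> ISO 8601 UTC
    ts = datetime.fromtimestamp(int(raw["ts"]), tz=timezone.utc)
    return {
        "vin": vin,
        "event_time": ts.isoformat(),
        "mileage_km": round(float(raw.get("mileage_km", 0.0)), 1),
    }

source_rows = [
    {"vin": " wba1234 ", "ts": 1606780800, "mileage_km": "12345.67"},
    {"vin": "", "ts": 1606780801},  # dropped: no VIN
]
prepared = [r for r in (prepare_record(row) for row in source_rows) if r]
print(prepared)
```

The same record-level function could be mapped over a Spark DataFrame in the actual Glue job, which is what keeps the transformation reusable between exploration and production.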

  • Ingestion recap
    • Reusable building blocks for common tasks
    • Multi-account setup
      • Infrastructure isolation
      • Scale-out to the whole organization
      • Team empowerment
    • >150 systems ingested
    • >1 PB data volume total
      • ~ 100 TB data volume via PySpark-based ETL
      • Remaining data via stream-based ingests

Analyzing data

  • Analyses via
    • Amazon Athena
    • Amazon SageMaker (optionally with Amazon EMR or AWS Glue ETL development endpoints)
    • Amazon QuickSight
  • Challenge: Moving from exploration to production for non-experts
    • Easy integration into CI/CD for ingestion and transformation
    • Managed environment for creating new versions of data products
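Analyses via Athena from the portal ultimately come down to `start_query_execution` calls against the Glue Data Catalog. Here is a hedged sketch of building such a call; the database, workgroup, and bucket names are hypothetical:

```python
# Sketch of querying a prepared-layer dataset with Amazon Athena, as in the
# "Analyzing data" flow. Database, workgroup, and bucket names are
# hypothetical illustrations of the multi-hub naming.
def athena_query_params(hub: str, sql: str) -> dict:
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": f"{hub}_prepared_vehicle"},
        "WorkGroup": f"cdh-{hub}-analysts",
        "ResultConfiguration": {
            "OutputLocation": f"s3://cdh-{hub}-athena-results/",
        },
    }

sql = """
SELECT vin, max(event_time) AS last_seen
FROM telemetry
GROUP BY vin
"""
query = athena_query_params("eu", sql)
# boto3.client("athena").start_query_execution(**query) would submit it;
# status and results are then polled via get_query_execution /
# get_query_results.
```

Because Athena reads the same Glue catalog and S3 data as SageMaker and QuickSight, the same SQL verified here in the development stage can be promoted into the production pipeline unchanged.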

If you'd like to see the analyses demo, please check the actual session out.

  • Architecture
    • The central Data Portal and Bitbucket roll out different CodePipeline pipelines
      • Exploration: explore the data and build transformation code
      • Development: verify the transformation with Athena
      • Production: deploy after verification

  • Data analysis recap
    • Enable non-experts via a managed tool stack
    • Built-in best practices
    • Predefined path into production
    • Empower experts to utilize the full power of AWS services

Outlook

  • Integration of AWS Lake Formation for fine-grained access control
  • Fine-grained data lineage via Spark execution plans
  • Automated data monitoring (including frequent users, dataset updates, health, and statistics)
  • Query acceleration layer for better performance with established BI tools
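As a taste of the first outlook item, fine-grained access control in Lake Formation is expressed as permission grants on catalog resources. The sketch below builds illustrative `lakeformation.grant_permissions` parameters giving a use case's role column-level SELECT on one table; all names and the account ID are hypothetical:

```python
# Outlook item sketch: Lake Formation fine-grained access control.
# Builds illustrative boto3 lakeformation.grant_permissions parameters that
# grant a use-case role SELECT on only two columns of one table.
# Role ARN, database, table, and column names are hypothetical.
def column_grant_params(role_arn: str, database: str,
                        table: str, columns: list) -> dict:
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": columns,  # only these columns become visible
            }
        },
        "Permissions": ["SELECT"],
    }

grant = column_grant_params(
    "arn:aws:iam::333333333333:role/usecase-analytics",
    "eu_prepared_vehicle", "telemetry", ["vin", "event_time"],
)
# boto3.client("lakeformation").grant_permissions(**grant) would apply it.
```

Column-level grants like this would let the access management layer hide sensitive fields from a use case without duplicating the dataset.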

AWS re:Invent 2020 is being held now!

Want to watch the whole session? Jump in and sign up for AWS re:Invent 2020!

AWS re:Invent | Amazon Web Services