Session Summary: How Disney+ uses fast data ubiquity to improve the customer experience #ANT309 #reinvent

This post is a session report on ANT309: How Disney+ uses fast data ubiquity to improve the customer experience, presented at AWS re:Invent 2020.

The Japanese version is available here.

Abstract

Disney+ uses Amazon Kinesis to drive real-time actions like providing title recommendations for customers, sending events across microservices, and delivering logs for operational analytics to improve the customer experience. In this session, you learn how Disney+ built real-time data-driven capabilities on a unified streaming platform. This platform ingests billions of events per hour in Amazon Kinesis Data Streams, processes and analyzes that data in Amazon Kinesis Data Analytics for Apache Flink, and uses Amazon Kinesis Data Firehose to deliver data to destinations without servers or code. Hear how these services helped Disney+ scale its viewing experience to tens of millions of customers with the required quality and reliability.

Speakers

  • Martin Zapletal
    • Director, Software Engineering - Disney+

How Disney+ uses fast data ubiquity to improve the customer experience - AWS re:Invent 2020

Content

  • Streaming data and Disney+
  • Before – silos
  • Evolution – streaming silos
  • Now – data driven
  • Streaming Data Platform
  • Goals: Ubiquity, platform, culture

Streaming data and Disney+

  • Disney+ is using streaming and fast data for...
    • Personalized recommendations that update automatically in real time
    • Monitoring APIs
    • Applying machine learning models to detect potential fraud or misuse
    • Tracking playback progress in a video
  • Streaming data use cases
    • Variety of streaming use cases with varying needs
    • Tens of millions of users
    • Hundreds of Amazon Kinesis data streams
    • Thousands of shards
    • Multiple regions
    • Billions of events
    • Terabytes of data

Before – silos

  • Disney+ always stays at the cutting edge to innovate and improve the solutions over time
  • Before: Silos
    • Microservices with databases and data warehouses
    • Batch processing was available
    • But insights were slow and limited
    • Silos: teams did not have access to all the value they could extract from the data

Evolution – streaming silos

  • Streaming, event driven, asynchronous
  • Custom, unique integrations and data warehouses
  • Event driven -> decoupling of domain, deployment scalability, replayability, back pressure
  • Transporting data between database and data solutions

  • Each team used its own approach to data format, schema management, and data quality
    • -> The silos remained
  • Still unable to...
    • Scale the organization
    • Use the data to its fullest extent
  • -> Need to figure out how to scale that in the organization

Now – data driven

  • (Fast) data democracy
    • How to break the boundaries and the silos
    • Everyone can make decisions and actions, and run experiments
  • Real-time data, insights, ML
  • Experimentation
  • First-class consideration
  • Culture

“Data / insights they need available at the time they need it”

Streaming Data Platform

  • Fill the gap between the technology and use cases
  • Very easy and accessible platform to start leveraging data-driven features
  • Streaming is just a small subset of the overall data-driven ecosystem

  • Services, devices and other producers...
    • Make their data sets available in this standardized platform
    • Become consumers of other teams' data sets
    • Derive new data from the existing data, and apply the typical streaming solutions
  • Triangle in the middle
    • Integration between streaming data platform, analytics, machine learning and experimentation

  • Streaming Data Platform
    • Backbone: Amazon Kinesis Data Streams
    • The abstraction layer on AWS services and 3rd party technologies
    • Fill in the gap between technologies and use cases
  • Kinesis
    • Need a reliable, performant, cost-efficient event log
    • Other Options: Kafka, Pulsar, and others
    • Amazon Kinesis Data Streams
      • Replicated, partitioned, ordered, distributed log
      • Managed, Replication to 3 AZs
      • Integration with other AWS services
      • Near real time
      • Scalability, Elasticity
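
To make the "partitioned, ordered" property above more concrete, here is a minimal Python sketch of how a partition key can be mapped to a shard. Kinesis Data Streams maps partition keys to 128-bit integers with an MD5 hash and routes each record to the shard whose hash key range contains that value. The evenly split ranges, the byte order used to build the integer, and the helper names `shard_ranges` and `shard_for_key` are assumptions for illustration; none of this code was shown in the session.

```python
import hashlib

MAX_HASH = 2**128 - 1  # MD5 produces a 128-bit value


def shard_ranges(shard_count):
    """Split the 128-bit hash key space into contiguous, roughly equal ranges."""
    size = (MAX_HASH + 1) // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * size
        end = MAX_HASH if i == shard_count - 1 else (i + 1) * size - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key, ranges):
    """Return the index of the shard whose range contains the key's MD5 hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")  # byte order is an assumption
    for index, (start, end) in enumerate(ranges):
        if start <= hash_value <= end:
            return index
    raise ValueError("hash value outside the key space")


ranges = shard_ranges(4)
first = shard_for_key("user-123", ranges)
second = shard_for_key("user-123", ranges)
```

Because the same key always hashes to the same shard, all events for one key (for example one user) stay in order relative to each other, while different keys spread across shards.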

Goals: Ubiquity, platform, culture

Ubiquity

  • Provide universal availability of near-real-time data
  • Keep providing an appropriate framework for managing and accessing the data

  • Devices
    • ECS: Process data coming from the devices
    • Publish data to Kinesis Data Streams
  • Services
    • Apply business logic when a request is received
    • Publish information about what happened for downstream consumers
  • Merge or join two streams together
    • Enrich the information by validation, routing or filtering
    • Send the enriched information to the downstream consumers
  • Two downstream outputs
    • Data that passed validation
    • Data that failed validation
    • Both outputs are ingested into S3, which serves as the data platform
  • Two near-real-time use cases
    • Near-real-time operational analytics
      • Do aggregations using Spark
      • Store the aggregated data in Amazon Elasticsearch using Kinesis Data Firehose
    • Fraud detection and similar use cases
      • Kinesis Data Analytics for Apache Flink
      • Apply basic aggregations and machine learning models
      • Consumers of Kinesis Data Stream can leverage the data
  • Automated data management
    • Make the data usable across the organization
    • Manage whatever else comes with the data; schema, quality, governance
    • Unlike with API-driven services, producers don't know who the consumers are or how they use the data
    • Declarative way to
      • Define the schema and the properties
      • Enforce them throughout the ecosystem
  • -> Schema registry

  • Schema Registry
    • Offer the centralized view of the schema and the properties
    • Enforce checking throughout the ecosystem
    • Apply as much validation upstream as possible
    • Apply as early in the development life cycle as possible
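
The idea of declaring a schema once and enforcing it upstream, with validated and failed events sent to separate outputs, can be sketched roughly as follows. This is a minimal in-memory illustration, not the registry Disney+ built: the `SCHEMA_REGISTRY` dictionary, the `playback-events` stream name, the field names, and the `validate` and `route` helpers are all hypothetical.

```python
# Hypothetical in-memory registry: stream name -> required fields and their types.
SCHEMA_REGISTRY = {
    "playback-events": {"user_id": str, "title_id": str, "position_seconds": int},
}


def validate(stream, event):
    """Return a list of problems; an empty list means the event matches the schema."""
    schema = SCHEMA_REGISTRY[stream]
    problems = []
    for field, expected_type in schema.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems


def route(stream, events):
    """Split events into a validated output and a failed-validation output."""
    passed, failed = [], []
    for event in events:
        problems = validate(stream, event)
        if problems:
            failed.append({"event": event, "problems": problems})
        else:
            passed.append(event)
    return passed, failed


events = [
    {"user_id": "u1", "title_id": "t9", "position_seconds": 42},
    {"user_id": "u2", "position_seconds": "42"},
]
passed, failed = route("playback-events", events)
```

Keeping the failed events together with the reasons they failed, rather than dropping them, mirrors the point above that both outputs are retained for later inspection.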

Platform

  • The tools and the practices for...
    • Building streaming solutions, declarative pipeline and deployment
    • The control plane that focuses on the shared concerns
  • Teams should always have the ability to build their own custom solutions
  • -> Create some patterns and compose them into larger pipelines

  • Reliable domain events
    • Deliver data reliably so that views are consistent with the source of truth
    • Problem -> losing transactional guarantees, atomicity, and isolation
      • Downstream consumers may receive the data out of order or miss some events
    • DynamoDB Streams: acts as the commit log of the database, with some transformations applied
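
One common way to keep a downstream view consistent when events can arrive out of order or go missing is to attach a per-entity version number and apply only the next expected version. The sketch below illustrates that general technique under this assumption; the session did not describe the implementation at this level, and the `ViewBuilder` class and its event fields are hypothetical.

```python
class ViewBuilder:
    """Builds a key -> value view from versioned change events."""

    def __init__(self):
        self.view = {}      # key -> latest applied value
        self.versions = {}  # key -> latest applied version
        self.pending = {}   # key -> {version: value} waiting for earlier versions

    def apply(self, key, version, value):
        current = self.versions.get(key, 0)
        if version <= current:
            return  # duplicate or stale event: ignore
        buffered = self.pending.setdefault(key, {})
        buffered[version] = value
        # Apply buffered events only while the next expected version is present.
        while current + 1 in buffered:
            current += 1
            self.view[key] = buffered.pop(current)
            self.versions[key] = current

    def gaps(self):
        """Keys that have buffered events blocked by a missing earlier version."""
        return sorted(key for key, buffered in self.pending.items() if buffered)


builder = ViewBuilder()
builder.apply("profile-1", 1, "kids mode off")
builder.apply("profile-1", 3, "kids mode off again")  # arrives early
blocked = builder.gaps()
builder.apply("profile-1", 2, "kids mode on")         # fills the gap
```

A key that stays in `gaps()` for too long signals a missed event, which is the kind of situation where replaying from the database's change log becomes useful.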

  • Validation, routing, filtering
    • The streams are abstracted
    • Consumers just subscribe to the streams and get the information

  • Join, enrichment
    • Join two Kinesis Data Streams using ECS
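
As a rough illustration of what joining two streams involves, the following Python sketch buffers unmatched events from each side and emits a merged record once both sides for a key have arrived. The `StreamJoiner` class and the sample fields are hypothetical, and the sketch leaves out the time-bounded buffering that a production join would typically need.

```python
class StreamJoiner:
    """Joins events from a left and a right stream on a shared key."""

    def __init__(self):
        self.left = {}   # key -> unmatched left event
        self.right = {}  # key -> unmatched right event

    def on_left(self, key, event):
        return self._join(key, event, self.left, self.right, left_first=True)

    def on_right(self, key, event):
        return self._join(key, event, self.right, self.left, left_first=False)

    def _join(self, key, event, own, other, left_first):
        match = other.pop(key, None)
        if match is None:
            own[key] = event  # wait for the other side to arrive
            return None
        left, right = (event, match) if left_first else (match, event)
        return {**left, **right}  # enriched record for downstream consumers


joiner = StreamJoiner()
waiting = joiner.on_left("s1", {"session": "s1", "device": "tv"})
joined = joiner.on_right("s1", {"session": "s1", "title": "t9"})
```

The first call returns nothing because only one side is known; the second call returns the enriched record and clears the buffer for that key.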

  • Ingestion
    • Sync the data from Kinesis to S3
    • Use the information from the schema registry

  • Use Delta format in the data platform
  • The maturity of streaming applications is not always at the same level as applications with APIs
  • Streaming applications need...
    • Architecture patterns
    • Automated testing
    • Performance testing and management
    • Elasticity and auto scaling
    • Deployment
    • Observability, alerting
    • Reliability and resilience
    • Operations simplicity
    • Multi-region replication and failover
    • Data lineage
    • Self-healing
    • Distributed tracing
    • Cost efficiency
    • Discoverability
    • Traffic routing
    • Guarantees
    • Streaming as a service platform
    • -> Make sure applications are production-ready

  • Configurable trade-offs
    • Graph: CPU or memory usage of Kinesis Data Analytics for Apache Flink applications during a deployment
    • A bit of gap in the middle -> the application is simply not reporting metrics
    • Some use cases can tolerate the gap, but others, such as near-real-time processing, cannot
  • Latency management
    • Processing that takes even slightly longer than expected can cause a number of problems
  • Deployment patterns
    • Blue-green-like deployment

  • Stream elasticity
    • Able to scale Kinesis Data Streams up and down to follow peak traffic hours
  • Application elasticity
    • Adjust to the changed number of shards to use the stream's capacity to its fullest
  • Elasticity trade-offs
    • Scaling causes rebalancing of shards and their assignments
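
The arithmetic behind scaling a stream up and down can be sketched as below. In provisioned mode, each Kinesis Data Streams shard accepts up to about 1 MB per second or 1,000 records per second of writes, so the required shard count is driven by whichever limit is hit first. The `required_shards` helper, the headroom factor, and the traffic figures are assumptions for illustration, not numbers from the session.

```python
import math

WRITE_BYTES_PER_SHARD = 1_000_000  # about 1 MB/s per shard (provisioned mode)
WRITE_RECORDS_PER_SHARD = 1_000    # 1,000 records/s per shard


def required_shards(records_per_second, avg_record_bytes, headroom=0.25):
    """Estimate the shards needed for a write load, with spare capacity."""
    by_bytes = records_per_second * avg_record_bytes / WRITE_BYTES_PER_SHARD
    by_records = records_per_second / WRITE_RECORDS_PER_SHARD
    needed = max(by_bytes, by_records) * (1 + headroom)
    return max(1, math.ceil(needed))


# Hypothetical traffic: quiet hours versus the evening peak.
off_peak = required_shards(5_000, 500)
peak = required_shards(50_000, 800)
```

The difference between the two results is the capacity that elasticity can release outside peak hours, at the cost of the shard rebalancing noted above.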

  • Graph: the throughput of the stream in two different regions in AWS
    • At one point, traffic in one region drops to zero and the other region picks it up
    • Eventually stabilizes and continues as usual
    • Caused by scheduled chaos testing initiatives
  • Failure scenarios
    • Guarantee reliability or delivery semantics
    • Make the right choices for your particular use case and maintain them end to end
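
As one example of choosing and maintaining delivery semantics end to end: if the pipeline delivers events at least once, a consumer can still produce exactly-once results by making its processing idempotent. The sketch below shows that general technique with an in-memory set of processed event IDs; the `IdempotentConsumer` class and the event fields are hypothetical, and a real consumer would need durable storage for the processed IDs.

```python
class IdempotentConsumer:
    """Processes each event ID once, even if the stream redelivers it."""

    def __init__(self):
        self.seen = set()
        self.total_watch_seconds = 0

    def handle(self, event):
        if event["event_id"] in self.seen:
            return False  # redelivered duplicate: skip side effects
        self.total_watch_seconds += event["watch_seconds"]
        self.seen.add(event["event_id"])
        return True


deliveries = [
    {"event_id": "e1", "watch_seconds": 30},
    {"event_id": "e2", "watch_seconds": 45},
    {"event_id": "e1", "watch_seconds": 30},  # retry after a simulated failure
]
consumer = IdempotentConsumer()
results = [consumer.handle(event) for event in deliveries]
```

The retried event is recognized and skipped, so the aggregate reflects each event only once despite the duplicate delivery.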

Culture

  • Data centric thinking in the company's culture
  • Focusing on...
    • Documentation and best practices
    • Training and Collaboration
    • Tooling and Ease of use
    • Integrations

Conclusion

  • Data-driven organization and data democracy
  • Ubiquitous data
  • Streaming data platform
  • Culture and tools
  • Built on top of Amazon Kinesis

AWS re:Invent 2020 is being held now!

Want to watch the whole session? Sign up for AWS re:Invent 2020!

AWS re:Invent | Amazon Web Services