Session Summary: How Disney+ uses fast data ubiquity to improve the customer experience #ANT309 #reinvent

AWS re:Invent 2020

2021.01.06

This post is a session report on ANT309: How Disney+ uses fast data ubiquity to improve the customer experience, presented at AWS re:Invent 2020.

A Japanese version of this post is also available.

Abstract

Disney+ uses Amazon Kinesis to drive real-time actions like providing title recommendations for customers, sending events across microservices, and delivering logs for operational analytics to improve the customer experience. In this session, you learn how Disney+ built real-time data-driven capabilities on a unified streaming platform. This platform ingests billions of events per hour in Amazon Kinesis Data Streams, processes and analyzes that data in Amazon Kinesis Data Analytics for Apache Flink, and uses Amazon Kinesis Data Firehose to deliver data to destinations without servers or code. Hear how these services helped Disney+ scale its viewing experience to tens of millions of customers with the required quality and reliability.

Speakers

Martin Zapletal
- Director, Software Engineering - Disney+

How Disney+ uses fast data ubiquity to improve the customer experience - AWS re:Invent 2020

Content

Streaming data and Disney+
Before – silos
Evolution – streaming silos
Now – data driven
Streaming Data Platform
Goals: Ubiquity, platform, culture

Streaming data and Disney+

Disney+ uses streaming and fast data for...
- Personalized recommendations that are updated automatically in real time
- Monitoring APIs
- Applying machine learning models to detect potential fraud or misuse
- Tracking playback progress in a video
Streaming data use cases
- Variety of streaming use cases with varying needs
- Tens of millions of users
- Hundreds of Amazon Kinesis data streams
- Thousands of shards
- Multiple regions
- Billions of events
- Terabytes of data

Before – silos

Disney+ aims to stay at the cutting edge, innovating and improving its solutions over time
Before: Silos
- Microservices with databases and data warehouses
- Batch processing was available
- But insights were slow and limited
- Silos: teams could not access the full value that could be extracted from the data

Evolution – streaming silos

Streaming, event driven, asynchronous
Custom, unique integrations and data warehouses
Event driven -> decoupling of domain, deployment scalability, replayability, back pressure
Transporting data between database and data solutions

Each team used its own approach to data formats, schema management, and data quality
- -> The silos remained
Still unable to...
- Scale the organization
- Use the data to its fullest extent
-> Needed a way to scale this across the organization

Now – data driven

(Fast) data democracy
- How to break the boundaries and the silos
- Everyone can make decisions, take action, and run experiments
Real-time data, insights, ML
Experimentation
First-class consideration
Culture

“Data / insights they need available at the time they need it”

Streaming Data Platform

Fill the gap between the technology and use cases
A very easy, accessible platform for getting started with data-driven features
Streaming is just a small subset of the overall data-driven ecosystem

Services, devices and other producers...
- Make their data sets available in this standardized platform
- Become consumers of other teams' data sets
- Derive new data from the existing data, and apply the typical streaming solutions
Triangle in the middle
- Integration between streaming data platform, analytics, machine learning and experimentation

Streaming Data Platform
- Backbone: Amazon Kinesis Data Streams
- An abstraction layer over AWS services and third-party technologies
- Fill in the gap between technologies and use cases
Kinesis
- Need a reliable, performant, cost-efficient event log
- Other Options: Kafka, Pulsar, and others
- Amazon Kinesis Data Streams
  - Replicated, partitioned, ordered, distributed log
  - Managed, with replication across 3 AZs
  - Integration with other AWS services
  - Near real time
  - Scalability, Elasticity
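
To make "partitioned, ordered" a little more concrete: Kinesis Data Streams hashes each record's partition key with MD5 into a 128-bit integer, and each shard owns a range of that hash space, so records with the same key land on the same shard and keep their relative order. The sketch below is only an illustration, not anything shown in the session. It assumes the shards split the hash space evenly (real ranges come from the stream's shard listing), and the key names and the helper `shard_for` are hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def hash_key(partition_key: str) -> int:
    """Map a partition key to a 128-bit integer using MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_for(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard whose range contains the key.

    Assumption: the shards divide the hash space into equal ranges.
    """
    return hash_key(partition_key) * shard_count // HASH_SPACE


if __name__ == "__main__":
    # Hypothetical keys; the same key always maps to the same shard.
    for key in ["profile-1", "profile-2", "profile-1"]:
        print(key, "-> shard", shard_for(key, shard_count=4))
```

Choosing a partition key with many distinct values (for example a per-user identifier) spreads load across shards while preserving per-key ordering.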

Goals: Ubiquity, platform, culture

Ubiquity

Provide universal availability of near-real-time data
Provide an appropriate framework for managing and accessing the data

Devices
- ECS: Process data coming from the devices
- Publish data to Kinesis Data Streams
Services
- Apply business logic when a request is received
- Publish information about what happened for downstream consumers
Merge or join two streams together
- Enrich the information by validation, routing or filtering
- Send the enriched information to the downstream consumers
2 downstream outputs
- Data that passed validation
- Data that failed validation
- Both outputs are ingested into S3, which backs the data platform
2 near-real-time use cases
- Near-real-time operational analytics
  - Do aggregations using Spark
  - Store the aggregated data in Amazon Elasticsearch Service using Kinesis Data Firehose
- Fraud detection, etc.
  - Kinesis Data Analytics for Apache Flink
  - Apply basic aggregations and machine learning models
  - Consumers of the resulting Kinesis data stream can use the data
Automated data management
- Make the data usable across the organization
- Manage whatever else comes with the data: schema, quality, governance
- Unlike with API-driven services, the producers don't know who the consumers are or how they use the data
- Declarative way to
  - Define the schema and the properties
  - Enforce them throughout the ecosystem
-> Schema registry

Schema Registry
- Offer a centralized view of schemas and their properties
- Enforce checking throughout the ecosystem
- Apply as much validation upstream as possible
- Apply as early in the development life cycle as possible
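
As a rough illustration of declaring a schema once and enforcing it upstream, the sketch below keeps a tiny in-memory registry and checks an event against it before it would be published. It is not the registry described in the session. The schema name, the field names, and the `validate` helper are all hypothetical, and a real registry would also handle versioning and compatibility rules.

```python
from typing import Any

# Hypothetical registry: schema name -> required fields and their types.
REGISTRY: dict[str, dict[str, type]] = {
    "playback_progress": {
        "profile_id": str,
        "title_id": str,
        "position_seconds": int,
    },
}


def validate(schema_name: str, event: dict[str, Any]) -> list[str]:
    """Return a list of violations; an empty list means the event conforms."""
    schema = REGISTRY.get(schema_name)
    if schema is None:
        return [f"unknown schema: {schema_name}"]
    errors = []
    for field, expected in schema.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected):
            errors.append(f"wrong type for {field}: expected {expected.__name__}")
    return errors


if __name__ == "__main__":
    event = {"profile_id": "p1", "title_id": "t9", "position_seconds": "120"}
    print(validate("playback_progress", event))
```

Running the same check in producer unit tests and again at publish time is one way to apply validation both upstream and early in the development life cycle.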

Platform

The tools and the practices for...
- Building streaming solutions, declarative pipeline and deployment
- The control plane that focuses on the shared concerns
Teams should always have the ability to build their own custom solutions
-> Create some patterns and compose them into larger pipelines

Reliable domain events
- Deliver data reliably so that views are consistent with the source of truth
- Problem -> transactional guarantees such as atomicity and isolation are lost
  - Downstream consumers may receive the data out of order or miss some events
- DynamoDB Streams: acts as the commit log of the database, followed by some transformations
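
A minimal sketch of why a commit log with ordering information helps here: if every change record carries a per-key sequence number, a consumer can build a view that converges to the source of truth even when records arrive out of order or more than once. This is an illustration under stated assumptions, not the pipeline from the session. The `Change` record shape and the `apply_changes` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Change:
    key: str
    sequence: int          # assumed to increase with every commit for a key
    value: Optional[dict]  # None represents a delete


def apply_changes(changes: list[Change]) -> dict[str, dict]:
    """Build a view from change records, ignoring stale or duplicate ones."""
    view: dict[str, dict] = {}
    last_seen: dict[str, int] = {}
    for change in changes:
        if change.sequence <= last_seen.get(change.key, -1):
            continue  # older than (or same as) what was already applied
        last_seen[change.key] = change.sequence
        if change.value is None:
            view.pop(change.key, None)
        else:
            view[change.key] = change.value
    return view


if __name__ == "__main__":
    delivered = [
        Change("user-1", 2, {"plan": "annual"}),
        Change("user-1", 1, {"plan": "monthly"}),  # arrives late, is skipped
        Change("user-1", 2, {"plan": "annual"}),   # duplicate, is skipped
    ]
    print(apply_changes(delivered))
```

Note that this only tolerates reordering and duplicates; an event that is never delivered still has to be prevented by the delivery mechanism itself.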

Validation, routing, filtering
- The streams are abstracted
- Consumers just subscribe to the streams and get the information
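
The validation and routing pattern, with one output for events that pass validation and another for events that fail, can be sketched in a few lines. This is only an assumption-laden illustration: plain Python lists stand in for the two downstream streams, and `route_events` and the sample validity rule are hypothetical names, not part of the platform described in the session.

```python
from typing import Callable, Iterable


def route_events(
    events: Iterable[dict],
    is_valid: Callable[[dict], bool],
) -> tuple[list[dict], list[dict]]:
    """Split events into (passed, failed) according to a validation rule."""
    passed: list[dict] = []
    failed: list[dict] = []
    for event in events:
        # Each list stands in for a separate downstream stream.
        (passed if is_valid(event) else failed).append(event)
    return passed, failed


if __name__ == "__main__":
    sample = [{"title_id": "t1"}, {"title_id": ""}, {}]
    ok, rejected = route_events(sample, lambda e: bool(e.get("title_id")))
    print(len(ok), "passed,", len(rejected), "failed")
```

Keeping the failed events instead of dropping them makes it possible to inspect and replay them later.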

Join, enrichment
- Join two Kinesis Data Streams using ECS
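
As a hedged sketch of what a key-based join between two streams involves, the generator below buffers an event from one side until the matching key arrives on the other side, then emits the merged record. It is not the ECS implementation from the session. The `join_streams` name, the `(side, key, payload)` event shape, and the sample payloads are hypothetical, and the buffer here is unbounded, whereas a real join would expire unmatched entries after a time window.

```python
from typing import Iterable, Iterator


def join_streams(events: Iterable[tuple[str, str, dict]]) -> Iterator[dict]:
    """Join interleaved events from two streams on their key.

    Each event is (side, key, payload) with side "left" or "right",
    given in arrival order.
    """
    pending: dict[str, dict[str, dict]] = {"left": {}, "right": {}}
    for side, key, payload in events:
        other = "right" if side == "left" else "left"
        match = pending[other].pop(key, None)
        if match is None:
            pending[side][key] = payload  # wait for the other side
        else:
            left, right = (payload, match) if side == "left" else (match, payload)
            yield {"key": key, **left, **right}


if __name__ == "__main__":
    arrivals = [
        ("left", "u1", {"title": "t1"}),
        ("right", "u2", {"device": "tv"}),
        ("right", "u1", {"device": "phone"}),
    ]
    for joined in join_streams(arrivals):
        print(joined)
```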

Ingestion
- Sync the data from Kinesis to S3
- Use the information from the schema registry

Use Delta format in the data platform
The maturity of streaming applications is not always at the same level as that of applications with APIs
Streaming applications need...
- Architecture patterns
- Automated testing
- Performance testing and management
- Elasticity and auto scaling
- Deployment
- Observability, alerting
- Reliability and resilience
- Operations simplicity
- Multi-region replication and failover
- Data lineage
- Self-healing
- Distributed tracing
- Cost efficiency
- Discoverability
- Traffic routing
- Guarantees
- Streaming as a service platform
- -> Make sure applications are production ready

Configurable trade-offs
- Graph: CPU or memory usage of Kinesis Data Analytics for Apache Flink applications during a deployment
- A small gap in the middle -> the application is simply not reporting metrics
- Some use cases can tolerate the gap, but others, such as near-real-time processing, cannot
Latency management
- If processing takes even a little longer than expected, it causes a number of problems
Deployment patterns
- Blue-green-like deployment

Stream elasticity
- Able to scale Kinesis Data Streams up and down in line with peak traffic hours
Application elasticity
- Change the number of shards to leverage the attributes to its fullest
Elasticity trade-offs
- Scaling causes rebalancing of shards and their assignments

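Scaling a stream along with traffic comes down to estimating how many shards a given load needs. The sketch below is a back-of-the-envelope calculation, not a formula from the session. It assumes the per-shard write limits commonly documented for provisioned streams (1 MB per second and 1,000 records per second), which should be verified against current AWS documentation. The `required_shards` helper and the headroom factor are hypothetical.

```python
import math

# Assumed per-shard write limits for a provisioned stream; verify before use.
MAX_BYTES_PER_SHARD_PER_SECOND = 1_000_000
MAX_RECORDS_PER_SHARD_PER_SECOND = 1_000


def required_shards(
    records_per_second: float,
    avg_record_bytes: float,
    headroom: float = 0.25,
) -> int:
    """Estimate the shard count needed for a write load, plus spare capacity."""
    by_records = records_per_second / MAX_RECORDS_PER_SHARD_PER_SECOND
    by_bytes = records_per_second * avg_record_bytes / MAX_BYTES_PER_SHARD_PER_SECOND
    # The tighter of the two limits determines the shard count.
    return max(1, math.ceil(max(by_records, by_bytes) * (1 + headroom)))


if __name__ == "__main__":
    print("off-peak:", required_shards(2_000, 500))
    print("peak:", required_shards(20_000, 500))
```

Because each resize triggers rebalancing of shard assignments among consumers, resizing in a few larger steps around known peak hours may be preferable to reacting to every fluctuation.
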
Graph: the throughput of the stream in two different regions in AWS
- At one point, traffic in one region drops to zero and the other region picks it up
- Eventually stabilizes and continues as usual
- Caused by scheduled chaos testing initiatives
Failure scenarios
- Guarantee reliability or delivery semantics
- Make the right choices for your particular use case and maintain them end to end
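
One common way to keep delivery semantics consistent end to end is to accept at-least-once delivery from the stream and make the consumer idempotent, so that retries and failovers do not apply the same event twice. The sketch below illustrates that idea only. It is not taken from the session, the `IdempotentConsumer` class is hypothetical, and the in-memory set of seen IDs would have to be a durable store in any real deployment.

```python
from typing import Any, Callable


class IdempotentConsumer:
    """Apply each event at most once, even if it is delivered repeatedly."""

    def __init__(self, handler: Callable[[Any], None]) -> None:
        self._handler = handler
        self._seen: set[str] = set()  # assumption: in-memory only, for illustration

    def process(self, event_id: str, payload: Any) -> bool:
        """Handle the event unless its ID was already processed.

        Returns True if the handler ran, False for a duplicate.
        """
        if event_id in self._seen:
            return False
        self._handler(payload)
        # Record the ID only after the handler succeeded, so a failed
        # attempt can be retried.
        self._seen.add(event_id)
        return True


if __name__ == "__main__":
    applied: list[str] = []
    consumer = IdempotentConsumer(applied.append)
    for event_id, payload in [("e1", "play"), ("e2", "pause"), ("e1", "play")]:
        consumer.process(event_id, payload)
    print(applied)
```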

Culture

Data centric thinking in the company's culture
Focusing on...
- Documentation and best practices
- Training and Collaboration
- Tooling and Ease of use
- Integrations

Conclusion

Data-driven organization and data democracy
Ubiquitous data
Streaming data platform
Culture and tools
Built on top of Amazon Kinesis
