Migrating the live Pokémon database to Aurora PostgreSQL #reinvent [GAM304]


This post is a session report on GAM304: Migrating the live Pokémon database to Aurora PostgreSQL, presented at AWS re:Invent 2019.

The Japanese version of this article is available here.

Abstract

Learn how the Pokémon Company International stores and supports data for more than 300 million players with the help of AWS. In this session, Jeff Webb (development manager) and David Williams (senior DevOps engineer) discuss how they migrated to Amazon Aurora PostgreSQL and Amazon DynamoDB, resulting in zero downtime for operations. Hear about best practices for migration and get a glimpse into the services supporting Pokémon Go, including AWS Lambda, Amazon Kinesis, Amazon Redshift, and more.

Speakers

  • Chris Finch
    • Senior SA Game Tech Evangelist, Amazon Web Services
  • David Williams
    • Sr DevOps Engineer, Pokemon
  • Jeff Webb
    • Manager, Software Engineering, The Pokémon Company International

Pokémon GO had a huge impact not only on people's lives all over the world but also on the systems of The Pokémon Company International. They needed to redesign their architecture to stabilize their services and reduce downtime. Their approach to the problem, an agile and data-driven one, turned out to be cost-effective and fast, which was really interesting.

Content

Introduction

The Pokémon Company International (TPCI)

  • Subsidiary of The Pokémon Company (TPC)
  • Manages Pokémon property outside of Asia
    • 324 million video games
    • 1,000 anime episodes
    • 26 billion trading cards
    • Brand management
    • Localization
    • Trading card game
    • Marketing
    • Licensing
    • PokemonCenter.com

Before Pokémon GO (pre-2016)

The small tech team focused on Pokemon.com and the services behind it:

  • Pokemon.com
    • Marketing and user engagement
    • Pokémon TV (Animated episodes over the past 24 years)
  • Organized Play
    • Manage trading card game league/tournament
    • A platform for other software to manage the league/tournament
    • Collect championship points over a season
  • Pokémon Trainer Club
    • User registration
    • User profile management
    • Authentication
    • Millions of accounts
    • Used by Pokémon.com and a few smaller digital products

Pokémon Trainer Club Service (PTCS)

  • Purpose: User registration and login system, shaped by governance and regulatory requirements
    • COPPA
    • GDPR
    • A child cannot manage their own account
      • To keep children from getting access to open chats
      • To track children's payment history
  • Size: Into the hundreds of millions of accounts
  • Usage: Millions of logins a day

Preparing for Pokémon GO

  • Shift to a NoSQL solution
    • Super-fast seek times for key-based lookups
    • Scale the architecture to handle massive user growth
  • No side effects on Pokemon.com or Organized Play

Pokémon GO launched (July 2016)

Everything changed, but service and DB performance remained good

  • 10x growth of PTCS users in six months
  • 100% increase in PTCS users by end of 2017
  • Additional 50% increase in users by end of 2018

Service stability issues: 2017/2018

  • Service and DB availability was not good
    • Downtime: 137 hours (down or degraded) in a six-month period
  • Licensing and infrastructure costs increasing
    • 300 nodes required to support the service
  • Engineering time
    • Full-time support from 1-2 engineers

Business drivers for change

  • Instability of the DB platform was impacting customer and visitor experience
  • Recognized that there was a potential market for mobile properties on the Pokémon platform
  • Future project success required achieving new goals:
    1. Stabilize our services & infrastructure for reduced downtime & customer impact
    2. Reduce operational overhead of managing the DB platform
    3. Reduce costs associated with our DB platform

Infrastructure drivers for change

  • Indexes became extremely large and had to be kept wholly resident in memory
    • Also required scaling vertically to support that
  • Oversized EC2 instances
  • Duplicated infrastructure
    • Maintain multiple redundant copies of data and indexes
  • Operational overhead
    • Routine activities no longer routine
    • Unexpected behaviors: Amazon EC2 restarts, capacity, etc.
    • DB backups
    • Patching/upgrading became unmanageable

Data tier architecture

  • Hundreds of DB instances across several roles
  • All deployed with Amazon EC2 Auto Scaling groups
  • One datastore for all data

Design goals

  • Leverage managed services
    • Optimize resource utilization for our core business
  • Use appropriate datastore for data
    • Event data
      • Records of logins and profile maintenance for long-term analytics and investigations
    • User data
      • Support profile maintenance and authentication/authorization activities
    • Configuration/TTL data
      • Short-lived data describing the current user state and various activities
  • High availability, stability, performance
    • Encryption in transit and at rest
  • Reduce cost & right-size infrastructure
    • Get rid of the licensing cost
    • Only want to use and pay for what we need

Choosing Amazon Aurora PostgreSQL

  • The critical requirement was protecting the data of children and customers in general
  • Comparison of the candidate datastores:
    • Amazon DynamoDB
      • Pros: Tier 1 AWS service, Multi-region, Highly scalable, Easier lift from JSON?
      • Cons: Encryption (at the time), Internal expertise
    • Amazon Aurora MySQL
      • Pros: Internal expertise, Encryption, Feature rich
      • Cons: Nascent tech, Multi-region, JSON-to-relational ETL
    • Amazon Aurora PostgreSQL
      • Pros: Internal expertise, Encryption, DB internals, Native support for JSON
      • Cons: Feature lag, Multi-region
  • Summer 2016
    • Pokémon Go launched
    • Scaling and right sizing the architecture
  • Summer 2017
    • Ready to start looking for an alternative data store
    • DynamoDB, Aurora MySQL, or Aurora PostgreSQL
    • Staffed up an engineering team
  • Spring 2018
    • DynamoDB didn't support encryption at rest (at that time)
    • Aurora PostgreSQL natively supports JSON -> THE BEST OPTION
  • Summer 2018
    • Use PostgreSQL as the authentication database
    • Keep the data as JSON to avoid complex transformations (a minimal sketch of querying JSON in PostgreSQL follows this list)
    • Decided to use PostgreSQL as the destination itself, not as a landing pad on the way to MySQL
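
Native JSON support was the deciding factor, so it is worth sketching what that looks like in practice. The following is a minimal sketch, assuming a hypothetical trainer_profiles table with a JSONB profile column; it is not TPCI's actual schema.

```python
# A minimal sketch, not TPCI's actual schema: query a user document stored as
# JSONB in Aurora PostgreSQL. The table/column names (trainer_profiles, profile)
# and the connection details are assumptions for illustration.
import psycopg2

conn = psycopg2.connect(
    host="ptcs-aurora.cluster-xxxxxxxx.us-west-2.rds.amazonaws.com",  # placeholder
    dbname="ptcs",
    user="app_user",
    password="REDACTED",
)

with conn, conn.cursor() as cur:
    # JSONB lets the documents land largely as-is while still allowing
    # field-level lookups (and indexes) inside the document.
    cur.execute(
        """
        SELECT profile->>'screen_name', profile->>'email'
        FROM trainer_profiles
        WHERE profile->>'username' = %s
        """,
        ("some_trainer",),
    )
    print(cur.fetchone())
```

Because PostgreSQL can index expressions over JSONB fields, the documents can stay as JSON without giving up fast, relational-style lookups.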

Data-driven approach

  • Acceptance criteria
    • Authentication: 2k/sec
    • User signups: 60+/sec
      • A read-heavy application
    • Admin bulk queries
  • Testing
    • User generation: 200m
      • Users had a variety of statuses and conditions
    • Performance test suite: burst and soak testing (a minimal load-generation sketch follows this list)
  • Iterate
    • Admin bulk queries weren't working
    • Rework schema and relational fields
    • Rerun tests
  • Recommendation:
    • If you're going to migrate a database of 300 million users, don't do it based on a guess; do it with a data-driven approach
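
As a rough illustration of what those acceptance criteria imply for tooling, here is a minimal burst-test sketch that drives a target login rate and reports latency percentiles. The endpoint URL, payload, and use of aiohttp are assumptions; the actual test suite (200M generated users, burst and soak runs) was far more elaborate.

```python
# A minimal burst-test sketch, not the team's actual test suite: drive a target
# login rate against a test endpoint and report latency percentiles. The URL,
# payload, the 2,000 req/s target, and the use of aiohttp are assumptions.
import asyncio
import time

import aiohttp

AUTH_URL = "https://auth-test.example.com/login"  # hypothetical test endpoint
TARGET_RPS = 2000      # from the acceptance criteria above
DURATION_S = 60        # short burst; raise to hours for a soak test

async def one_login(session, latencies):
    start = time.perf_counter()
    async with session.post(AUTH_URL, json={"user": "loadtest", "pass": "x"}) as resp:
        await resp.read()
    latencies.append(time.perf_counter() - start)

async def main():
    latencies = []
    connector = aiohttp.TCPConnector(limit=0)  # don't cap concurrent connections
    async with aiohttp.ClientSession(connector=connector) as session:
        for _ in range(DURATION_S):
            tick = time.perf_counter()
            batch = [asyncio.create_task(one_login(session, latencies))
                     for _ in range(TARGET_RPS)]
            await asyncio.gather(*batch, return_exceptions=True)
            # Pace each batch to roughly one second to hold the target rate.
            await asyncio.sleep(max(0.0, 1.0 - (time.perf_counter() - tick)))
    latencies.sort()
    print(f"p50={latencies[len(latencies) // 2]:.3f}s  "
          f"p99={latencies[int(len(latencies) * 0.99)]:.3f}s")

if __name__ == "__main__":
    asyncio.run(main())
```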

The migration plan

  • An iterative approach
    • Improve cache stability
    • Migrate out TTL and configuration data
    • Stream event data
  • Each phase should independently deliver value

Migration phases

  • The architecture before the migration

  • Problem: Cache nodes
    • A cache node failure caused 15 minutes to 1 hour of degradation across the entire application
  • Solution:
    • Simply migrate the cache nodes to Amazon ElastiCache
    • No code change
    • A failure now causes only minor degradation, resolved within 60 seconds

  • Move all configuration and TTL data into DynamoDB (a minimal sketch follows below)
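
A minimal sketch of what this short-lived data can look like in DynamoDB is below. The table name, attribute names, and the 24-hour expiry are assumptions; the point is DynamoDB's native TTL, which deletes expired items without any cleanup jobs.

```python
# A minimal sketch of short-lived configuration/TTL data in DynamoDB. The table
# name, attribute names, and the 24-hour expiry are assumptions for illustration.
import time

import boto3

dynamodb = boto3.client("dynamodb")
TABLE = "ptcs-user-state"  # hypothetical table

# Enable DynamoDB's native TTL so expired items are removed automatically.
dynamodb.update_time_to_live(
    TableName=TABLE,
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)

# Write a short-lived user-state item; DynamoDB deletes it after expiry.
dynamodb.put_item(
    TableName=TABLE,
    Item={
        "user_id": {"S": "12345"},
        "state": {"S": "password_reset_pending"},
        "expires_at": {"N": str(int(time.time()) + 24 * 3600)},
    },
)
```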

  • Move authentication and profile-management records into S3 for aggregated analytics and individual investigations (a streaming sketch follows below)
  • This enabled us to reduce the size of the EC2 clusters.
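
The session abstract mentions Amazon Kinesis, so one plausible (but assumed) shape for this step is streaming each event through Kinesis Data Firehose into S3, roughly as sketched below.

```python
# A plausible but assumed shape for this step: push each login / profile-change
# event through Kinesis Data Firehose, which buffers and delivers it to S3.
# The delivery stream name and event fields are illustrative only.
import json
import time

import boto3

firehose = boto3.client("firehose")

event = {
    "type": "login",
    "user_id": "12345",
    "result": "success",
    "timestamp": int(time.time()),
}

firehose.put_record(
    DeliveryStreamName="ptcs-auth-events",  # hypothetical delivery stream
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
```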

  • Migrate user data into Aurora PostgreSQL

Planning for Aurora PostgreSQL

  • Contact AWS Professional Services to bridge the knowledge gap
    • Validate our schema design and provide feedback
    • Advice on how to tune DB parameter groups
    • Tools for planning, monitoring, and tuning

Aurora PostgreSQL cluster design

  • PTC instances:
    • Write new profiles directly to the cluster endpoint
  • Auth instances:
    • Use a custom endpoint
    • By default, the reader endpoint balances read-only requests across all nodes
    • Isolate all login traffic to the instances behind the custom endpoint instead

  • Failover
    • Allow one of the auth instances to be the failover target

  • Create a third custom endpoint
    • Handles bulk queries
    • Isolates heavy traffic from profile-management and authentication activities (a boto3 sketch follows this list)
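
Below is a minimal boto3 sketch of creating such custom endpoints. All identifiers are assumptions; the API call itself (create_db_cluster_endpoint with static members) is how Aurora pins traffic to specific reader instances.

```python
# A minimal sketch of the custom endpoints described above. The cluster,
# endpoint, and instance identifiers are assumptions; the mechanism (custom
# READER endpoints with static members) is what Aurora provides for pinning
# traffic to specific instances.
import boto3

rds = boto3.client("rds")

# Custom endpoint that routes login traffic only to the dedicated auth readers,
# instead of the default reader endpoint that balances across every node.
rds.create_db_cluster_endpoint(
    DBClusterIdentifier="ptcs-aurora-cluster",
    DBClusterEndpointIdentifier="ptcs-auth",
    EndpointType="READER",
    StaticMembers=["ptcs-auth-1", "ptcs-auth-2"],
)

# A separate custom endpoint that isolates admin bulk queries from
# authentication and profile-management activity.
rds.create_db_cluster_endpoint(
    DBClusterIdentifier="ptcs-aurora-cluster",
    DBClusterEndpointIdentifier="ptcs-bulk",
    EndpointType="READER",
    StaticMembers=["ptcs-reporting-1"],
)
```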

The migration: Extract-Transform-Load (ETL)

  • Transform & load
    • "Pretty easy"
    • Abandon users that had never activated
    • Minor data changes, nothing structural
  • Extract
    • Leverage NoSQL cluster architecture
    • Map-reduce to find users not extracted
    • Extraction process marks users as extracted in the backup cluster
    • Any user changes in production would overwrite the corresponding documents in the backup cluster (a simplified sketch of the extract loop follows this list)
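
Here is a heavily simplified, self-contained sketch of that extract loop. The in-memory stand-in for the backup cluster and the "extracted" marker are illustrative only; the real process relied on the NoSQL cluster's map-reduce capabilities.

```python
# A heavily simplified, self-contained sketch of the extract loop described
# above. The in-memory "cluster" and the "extracted" marker stand in for the
# real NoSQL backup cluster and its map-reduce views.

backup_cluster = {
    "user-1": {"username": "trainer_one", "extracted": False},
    "user-2": {"username": "trainer_two", "extracted": True},  # already migrated
}

def extract_unmarked_users(cluster, sink):
    for user_id, doc in cluster.items():
        if doc.get("extracted"):
            continue
        sink.append((user_id, doc))   # hand off to the transform/load stage
        doc["extracted"] = True       # mark in the backup cluster only
        # If production later overwrites this document in the backup cluster,
        # the marker is lost and the user is simply picked up on the next pass.

extracted_batch = []
extract_unmarked_users(backup_cluster, extracted_batch)
print(f"extracted {len(extracted_batch)} user(s)")
```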

ETL - Test and Iterate

  • Test
    • 11m user multi-cluster test setup
    • Dozens of test runs
    • Test cases - inactive users, user updates
  • ~2% of documents were not overwritten
  • And iterate
    • User profile change documents
    • Add third cluster to do the backup and the extraction

Migration Day

  • Entire architecture before the migration
  • To do:
    • Replace entire database platform
    • Process thousands of logins, profile changes and authorization requests
    • Opted for simplicity and a guarantee of consistent data between the two clusters

  • Disabled write activities for a period of time
  • Stopped NoSQL cluster after the final extraction finished

  • Shut off the extraction cluster

  • Repoint all authentication traffic to Aurora PostgreSQL
  • Largely, no one would have noticed this cutover

  • Repoint the PTCS application to use the Aurora PostgreSQL cluster for writes
  • Gave testers access to the cluster for a final sanity check

  • Let the traffic in
  • Put 300 EC2 instances into a terminated state

How did it go?

  • The good
    • No authentication downtime
    • 95% of users experienced no impact
    • Plan worked as expected
    • Performance was good and consistent
  • The bad
    • Patches
    • Some underperforming queries
  • The ugly
    • Nothing

Design goals revisited

  • Leverage managed services
  • Use the appropriate datastore for the data
  • High availability, stability, and performance
  • Reduce cost & right size infrastructure

Overall value

Old platform -> new platform (benefit):

  • Technology: 3rd-party NoSQL -> Aurora, DynamoDB, S3 (independent scaling / managed services)
  • Infrastructure: ~300 nodes -> ~10-20 nodes (~$3.5-4.5 million/year savings)
  • Licensing: costly -> ~$0 (~$3.5-4.5 million/year savings)
  • Dedicated resources: 1.5 dev/engineers -> none (1.5 dev/engineer savings)
  • Stability: 137 hours of downtime or degradation in 6 months -> 0 (customer experience: priceless)

Project retrospective

  • What went well
    • An agile approach to the project & the problems we encountered
    • Leveraging the experts: AWS DB Professional Services
    • Segmenting the data and how it was handled
  • Key learning
    • Deeper understanding of our data and services
    • Prior solution was more expensive (in money and people) than we realized

Tech org moving forward

  • Next phase
    • Monitor & optimize the platform and our services
    • Understand upcoming needs & changes to our data
    • Scope our advanced analytics requirements
  • New architectural tenets
    • Define and use common design patterns and components
    • Simplify toolset, preferring managed services
    • Documentation must exist, be well-defined
    • Data standards & practices
    • Use data to make decisions