Migrating the live Pokémon database to Aurora PostgreSQL #reinvent [GAM304]
This post is the session report about GAM304: Migrating the live Pokémon database to Aurora PostgreSQL at AWS re:Invent 2019.
The Japanese version of this post is available here.
Abstract
Learn how the Pokémon Company International stores and supports data for more than 300 million players with the help of AWS. In this session, Jeff Webb (development manager) and David Williams (senior DevOps engineer) discuss how they migrated to Amazon Aurora PostgreSQL and Amazon DynamoDB, resulting in zero downtime for operations. Hear about best practices for migration and get a glimpse into the services supporting Pokémon Go, including AWS Lambda, Amazon Kinesis, Amazon Redshift, and more.
Speakers
- Chris Finch
- Senior SA Game Tech Evangelist, Amazon Web Services
- David Williams
- Sr DevOps Engineer, Pokemon
- Jeff Webb
- Manager, Software Engineering, The Pokémon Company International
Pokémon GO had a huge impact not only on people's lives all over the world but also on the systems of The Pokémon Company International. The company needed to redesign its architecture to stabilize its services and reduce downtime. Its approach, agile and data-driven, turned out to be cost-effective and fast, which made the session really interesting.
Content
Introduction
The Pokémon Company International (TPCI)
- Subsidiary of The Pokémon Company (TPC)
- Manages Pokémon property outside of Asia
- 324 million video games
- 1,000 anime episodes
- 26 billion trading cards
- Brand management
- Localization
- Trading card game
- Marketing
- Licensing
- PokemonCenter.com
Before Pokémon GO (pre-2016)
Pokemon.com was the focus of the small tech team and included:
- Pokemon.com
- Marketing and user engagement
- Pokémon TV (Animated episodes over the past 24 years)
- Organized Play
- Manage trading card game league/tournament
- A platform for other software to manage the league/tournament
- Collect championship points over a season
- Pokémon Trainer Club
- User registration
- User profile management
- Authentication
- Millions of accounts
- Used by Pokémon.com and a few smaller digital products
Pokémon Trainer Club Service (PTCS)
- Purpose: User registration and login system for governance reasons and regulations
- COPPA
- GDPR
- A child cannot manage their own account
- To keep children from getting access to open chats
- To track children's payment history
- Size: Into the hundreds of millions
- Usage: Millions of logins a day
Preparing for Pokémon GO
- Shift to a NoSQL solution
- Super-fast seek times for key-based lookups
- Scale the architecture to handle a high amount of user growth
- No side effects on Pokemon.com or Organized Play
Pokémon GO launched (July 2016)
Everything changed, but service and DB performance remained good
- 10x growth of PTCS users in six months
- 100% increase in PTCS users by end of 2017
- Additional 50% increase in users by end of 2018
Service stability issues: 2017/2018
- Service and DB availability was not good
- Downtime: 137 hours (down or degraded) in six-month period
- Licensing and infrastructure costs increasing
- ~300 nodes required to support the platform
- Engineering time
- Full-time support from 1-2 resources
Business drivers for change
- Instability of the DB platform was impacting customer and visitor experience
- Recognition that there was a potential market for mobile properties on the Pokémon platform
- Future project success required achieving new goals
- Stabilize our services & infrastructure for reduced downtime & customer impact
- Reduce operational overhead of managing the DB platform
- Reduce costs associated with our DB platform
Infrastructure drivers for change
- Indexes became extremely large and had to be held entirely in memory
- Kept scaling vertically to support that
- Oversized EC2 instances
- Duplicated infrastructure
- Maintain multiple redundant copies of data and indexes
- Operational overhead
- Routine activities no longer routine
- Unexpected behaviors: Amazon EC2 restarts, capacity, etc.
- DB backups
- Patching/upgrading became unmanageable
Data tier architecture
- Hundreds of DB instances across several roles
- All deployed with Amazon EC2 Auto Scaling groups
- One datastore for all data
Design goals
- Leverage managed services
- Optimize resource utilization for our core business
- Use appropriate datastore for data
- Event data
- Records of logins and profile maintenance for long-term analytics and investigations
- User data
- Support profile maintenance and authentication/authorization activities
- Configuration/TTL data
- Short-lived data describing the current user state and various activities
- High availability, stability, performance
- Encryption in transit and at rest
- Reduce cost & right-size infrastructure
- Get rid of the licensing cost
- Only want to use and pay for what we need
Choosing Amazon Aurora PostgreSQL
- The critical requirement was protecting the data of children and of customers in general
| | Pros | Cons |
|---|---|---|
| Amazon DynamoDB | Tier 1 AWS service; multi-region; highly scalable; easier lift from JSON? | Encryption (at the time); internal expertise |
| Amazon Aurora MySQL | Internal expertise; encryption; feature rich | Nascent tech; multi-region; JSON-to-relational ETL |
| Amazon Aurora PostgreSQL | Internal expertise; encryption; DB internals; native support for JSON | Feature lag; multi-region |
- Summer 2016
- Pokémon Go launched
- Scaling and right sizing the architecture
- Summer 2017
- Ready to start looking for an alternative data store
- DynamoDB, Aurora MySQL, or Aurora PostgreSQL
- Hire up an engineering team
- Spring 2018
- DynamoDB didn't support encryption at rest (at that time)
- Aurora PostgreSQL's native support for JSON made it the best option
- Summer 2018
- Use PostgreSQL as the authentication database
- Keep the data as JSON to avoid a complex relational transformation (a minimal JSONB sketch follows this list)
- Decided to adopt PostgreSQL in its own right, not as a landing pad on the way to MySQL
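The native JSON support that tipped the decision can be illustrated with a small, hypothetical sketch: the NoSQL-shaped user document is kept intact in a JSONB column instead of being remodeled relationally. The connection string, table, and column names below are assumptions for illustration, not TPCI's actual schema.

```python
# Hypothetical sketch: keep the existing JSON user document in a JSONB column.
# Connection details, table, and columns are illustrative only.
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=ptcs host=aurora-cluster-endpoint user=app")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS user_profiles (
            user_id text PRIMARY KEY,
            profile jsonb NOT NULL
        )
    """)
    # The document that previously lived in the NoSQL store is stored as-is.
    doc = {"user_id": "u-123", "email": "ash@example.com", "status": "active"}
    cur.execute(
        "INSERT INTO user_profiles (user_id, profile) VALUES (%s, %s) "
        "ON CONFLICT (user_id) DO UPDATE SET profile = EXCLUDED.profile",
        (doc["user_id"], Json(doc)),
    )
    # JSONB operators allow key-based reads without a relational remodel.
    cur.execute(
        "SELECT profile->>'status' FROM user_profiles WHERE user_id = %s",
        ("u-123",),
    )
    print(cur.fetchone())
```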
Data-driven approach
- Acceptance criteria
- Authentication: 2k/sec
- User signups: 60+/sec
- A read-heavy application
- Admin bulk queries
- Testing
- User generation: 200m
- Users had a variety of statuses and conditions
- Performance test suite: Burst, soak testing
- Iterate
- Admin bulk queries weren't working
- Rework schema and relational fields
- Rerun tests
- Recommendation:
- If you're going to migrate a database with 300 million users, don't do it based on guesswork; validate the move with a data-driven approach (a minimal load-test sketch follows this list)
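As a rough illustration of what a burst test against those acceptance criteria could look like (this is not the team's actual test suite), here is a minimal asyncio sketch; the endpoint URL, payload, and user naming are hypothetical.

```python
# Minimal burst-test sketch against a hypothetical login endpoint.
# Measures achieved throughput against the 2k logins/sec acceptance target.
import asyncio
import time

import aiohttp

LOGIN_URL = "https://ptcs.example.com/login"  # hypothetical endpoint
TARGET_RPS = 2000
BURST_SIZE = 2000

async def login(session: aiohttp.ClientSession, i: int) -> int:
    payload = {"username": f"loadtest-user-{i}", "password": "not-a-real-secret"}
    async with session.post(LOGIN_URL, json=payload) as resp:
        return resp.status

async def burst() -> None:
    async with aiohttp.ClientSession() as session:
        start = time.monotonic()
        statuses = await asyncio.gather(*(login(session, i) for i in range(BURST_SIZE)))
        elapsed = time.monotonic() - start
    ok = sum(1 for s in statuses if s == 200)
    print(f"{ok}/{BURST_SIZE} succeeded, {BURST_SIZE / elapsed:.0f} req/s "
          f"(target {TARGET_RPS} req/s)")

if __name__ == "__main__":
    asyncio.run(burst())
```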
The migration plan
- An iterative approach
- Improve cache stability
- Migrate out TTL and configuration data
- Stream event data
- Each phase should independently deliver value
Migration phases
- The architecture before the migration
- Problem: Cache nodes
- A cache node failure caused 15 minutes to an hour of degradation across the entire application
- Solution:
- Simply migrating the cache nodes to ElastiCache
- No code change
- Only minor degradation, resolved within 60 seconds
- Move all configuration and TTL data into DynamoDB (a short sketch of this pattern follows this list)
- Move authentication and profile-management records into S3 for aggregated analytics and individual investigations
- This enabled the team to reduce the size of the EC2 clusters.
- Migrate user data into Aurora PostgreSQL
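A minimal sketch of the configuration/TTL-data move, assuming a hypothetical DynamoDB table named ptcs-user-state with an expires_at attribute: DynamoDB's TTL feature removes each item automatically once its timestamp passes.

```python
# Sketch only: short-lived user-state data in DynamoDB with automatic expiry.
# Table and attribute names are assumptions, not the production schema.
import time

import boto3

dynamodb = boto3.client("dynamodb")
TABLE = "ptcs-user-state"  # hypothetical table name

# Enable TTL on the table once, keyed on the "expires_at" attribute.
dynamodb.update_time_to_live(
    TableName=TABLE,
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)

# Write a short-lived record of the current user state; DynamoDB deletes it
# after the epoch timestamp in "expires_at" has passed.
dynamodb.put_item(
    TableName=TABLE,
    Item={
        "user_id": {"S": "u-123"},
        "state": {"S": "password-reset-pending"},
        "expires_at": {"N": str(int(time.time()) + 3600)},  # expires in 1 hour
    },
)
```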
Planning for Aurora PostgreSQL
- Contact AWS Professional Services to bridge the knowledge gap
- Validate our schema design and provide feedback
- Advice on how to tune DB parameter groups
- Tools for planning, monitoring, and tuning
Aurora PostgreSQL cluster design
- PTC instances:
- Write new profiles directly to the cluster endpoint
- Auth instances:
- Use the custom endpoint
- The default read-only endpoint balances read-only requests across all reader nodes
- A custom endpoint isolates all login traffic onto dedicated reader nodes (a sketch of creating such endpoints follows this list)
- Failover
- Allow one of the auth instances to act as the failover target
- Created a third custom endpoint
- Handles admin bulk queries
- Isolates that heavy traffic from profile-management and authentication activities
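The custom-endpoint layout described above can be sketched with the RDS API in boto3; the cluster, endpoint, and instance identifiers below are illustrative only.

```python
# Sketch: Aurora custom endpoints that pin login and bulk-query traffic to
# dedicated reader instances. All identifiers are hypothetical.
import boto3

rds = boto3.client("rds")

# Custom endpoint that routes only to the readers reserved for authentication.
rds.create_db_cluster_endpoint(
    DBClusterIdentifier="ptcs-aurora-cluster",
    DBClusterEndpointIdentifier="ptcs-auth",
    EndpointType="READER",
    StaticMembers=["ptcs-auth-reader-1", "ptcs-auth-reader-2"],
)

# A separate custom endpoint keeps heavy admin/bulk queries away from both
# the writer and the auth readers.
rds.create_db_cluster_endpoint(
    DBClusterIdentifier="ptcs-aurora-cluster",
    DBClusterEndpointIdentifier="ptcs-bulk",
    EndpointType="READER",
    StaticMembers=["ptcs-bulk-reader-1"],
)
```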
The migration: Extract-Transform-Load (ETL)
- Transform & load
- "Pretty easy"
- Abandon users that had never activated
- Minor data changes, nothing structural (a transform-and-load sketch follows this list)
- Extract
- Leverage NoSQL cluster architecture
- Map-reduce to find users not extracted
- Extraction process marks users as such in backup cluster
- Any user changes in production would overwrite changes in backup cluster
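A transform-and-load sketch under stated assumptions: extracted documents arrive as one JSON object per line, never-activated users are skipped, and the rest are upserted into the same hypothetical user_profiles table from the earlier JSONB sketch.

```python
# Sketch only: load extracted NoSQL documents into Aurora PostgreSQL.
# File layout, field names, and the target table are assumptions.
import json

import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=ptcs host=aurora-cluster-endpoint user=etl")
with conn, conn.cursor() as cur, open("extracted_users.jsonl") as f:
    for line in f:
        doc = json.loads(line)
        if not doc.get("activated"):
            continue  # abandon users that never activated their accounts
        doc["email"] = doc["email"].strip().lower()  # minor, non-structural change
        cur.execute(
            "INSERT INTO user_profiles (user_id, profile) VALUES (%s, %s) "
            "ON CONFLICT (user_id) DO UPDATE SET profile = EXCLUDED.profile",
            (doc["user_id"], Json(doc)),
        )
```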
ETL - Test and Iterate
- Test
- 11m user multi-cluster test setup
- Dozens of test runs
- Test cases - inactive users, user updates
- ~2% of documents were not overwritten
- And iterate
- User profile change documents
- Added a third cluster to handle the backup and the extraction
Migration Day
- Entire architecture before the migration
- To do:
- Replace entire database platform
- Process thousands of logins, profile changes and authorization requests
- Opted for simplicity and a guarantee of consistent data between the two clusters
- Disabled write activities for a period of time
- Stopped NoSQL cluster after the final extraction finished
- Shut off the extraction cluster
- Repointed all authentication traffic to Aurora PostgreSQL
- Largely, no one would have noticed this cutover
- Repointed the PTCS application to use the Aurora PostgreSQL cluster for writes
- Gave testers access to the cluster for a final sanity check
- Let the traffic in
- Put 300 EC2 instances into a terminated state
How did it go?
- The good
- No authentication downtime
- 95% of users experienced no impact
- Plan worked as expected
- Performance was good and consistent
- The bad
- Patches
- Some underperforming queries
- The ugly
- Nothing
Design goals revisited
- Leverage managed services
- Use the appropriate datastore for the data
- High availability, stability, and performance
- Reduce cost & right size infrastructure
Overall value
| | Old platform | New platform | Benefits |
|---|---|---|---|
| Technology | 3rd-party NoSQL | Aurora, DynamoDB, S3 | Independent scaling / managed |
| Infrastructure | ~300 instances | ~10-20 instances | ~$3.5-4.5 million/year savings |
| Licensing | Costly | ~$0 | ~$3.5-4.5 million/year savings |
| Dedicated resources | 1.5 dev/engineers | None | 1.5 dev/engineers savings |
| Stability | 137 hours down/degraded (6 months) | 0 hours | Customer experience: priceless |
Project retrospective
- What went well
- An agile approach to the project & the problems encountered
- Leveraging the experts: AWS DB Professional Services
- Segmenting the data and how it was handled
- Key learning
- Deeper understanding of our data and services
- Prior solution was more expensive (in money and people) than we realized
Tech org moving forward
- Next phase
- Monitor & optimize the platform and our services
- Understand upcoming needs & changes to our data
- Scope our advanced analytics requirements
- New architectural tenets
- Define and use common design patterns and components
- Simplify toolset, preferring managed services
- Documentation must exist and be well-defined
- Data standards & practices
- Use data to make decisions