How Woot.com built a serverless data lake with AWS analytics #reinvent [ANT333]
This post is a session report on ANT333: How Woot.com built a serverless data lake with AWS analytics at AWS re:Invent 2019.
The Japanese version of this post is available here.
Abstract
Woot.com designed and developed a data lake as a replacement for their legacy data warehouse to deliver powerful analytics capabilities across multiple business areas. In this session, learn how it used Amazon EMR, Amazon Athena, AWS Glue, and Amazon QuickSight to build an automated process to ingest and centralize data from external and internal sources for immediate analysis.
Speakers
- Karthik Kumar Odapally
- Sr Solutions Architect, Amazon Web Services
- Zhaolu (Yolanda) Meng
- Woot.com
- Theo Carpenter
- Sr. Systems Manager, Woot.com
Woot.com gained a modern data lake architecture that brought them many benefits. Let's look at why they migrated their data store to a data lake on AWS.
Content
What is Woot!?
- An Amazon-owned subsidiary company
- But everything runs outside of the Amazon ecosystem
- Has small engineering teams whose work impacts the entire organization
- "One day, one deal"
Problem statement
- Automation
- Add data in minutes and standardize the data solution
- Scaling
- A platform or a solution for scaling with the exponential growth
- Reporting
- User configurable
- How do we report on the actual data?
- How do we ensure that we're not looking at false positives?
- How do we ensure that the BI is easy to act on the data in near real-time?
- Performance
- Data now
- When we have peak events like Black Friday, we need to have access to the data at that time
- Accessibility
- Tools and log-on
- The ability to meet customer expectations
Legacy solution
- Single database (DB) instance
- Shared resource
- Complex custom ingestion
- The reports were not being generated during Black Friday
- 30,000 lines of code, which made bug fixing difficult
- Difficult to use
- Separate DB users
- Learning curve
- Took 3 days to fix Excel data formatting
Requirements
- Platform agnostic
- Any data, from any source, compatible in the cloud
- Resilient
- Separation of duties
- A platform where running a single query doesn't take hours
- Self-service
- Data democratization
- Have engineers focus on business problems rather than access management and database tuning
10,000-ft view
- Architected as Lego blocks
- Corporate data center
- Legacy SQL server
- NAV services
- Woot production VPC
- Using AWS managed services
- Data Lake
- Use S3 as a landing zone
- Raw data and transform data
- ETL
- Glue loads data from S3
- Adds the business logic and the magic sauce
- DWH
- Reporting
- Load into Amazon Redshift
- Access from third-party vendors
- Athena for engineers (a query sketch follows this list)
- QuickSight for instant connectivity
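As a concrete example of the engineer-facing access path, here is a minimal sketch of running an Athena query with boto3. The database, table, and result bucket names are hypothetical placeholders, not the actual Woot setup.

```python
import time
import boto3

# Minimal Athena query sketch; database, table, and output bucket
# are hypothetical placeholders.
athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT deal_id, COUNT(*) AS orders FROM orders GROUP BY deal_id",
    QueryExecutionContext={"Database": "woot_data_lake"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes, then fetch the result set.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```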
Migrate existing data
- Selected S3 as a data lake
- Set up AWS DMS and streamed existing data into S3
- Moved data between buckets in different accounts
- Brought compressed files over to the other side, which saved on costs
- Included column names for Glue (see the endpoint sketch below)
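The compression and column-name details map to settings on the DMS S3 target endpoint. The sketch below shows how they might be configured with boto3; the endpoint identifier, bucket, and role ARN are hypothetical placeholders.

```python
import boto3

# Sketch of a DMS target endpoint that lands migrated data in S3.
# Identifier, bucket, and role ARN are hypothetical placeholders.
dms = boto3.client("dms")

dms.create_endpoint(
    EndpointIdentifier="woot-s3-target",
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-access",
        "BucketName": "example-data-lake-raw",
        "CompressionType": "gzip",  # compressed files save on transfer and storage costs
        "AddColumnName": True,      # include column names so Glue can pick up the schema
    },
)
```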
Building data pipelines
- No dependency on any particular technology
- Scalability
- Use the AWS SDK for .NET with Kinesis Data Firehose
- Direct-put ingestion (a Python equivalent is sketched after this list)
- Point directly to S3
- S3 meets the requirements of availability, scalability and durability
- Separate accounts
- Had Lambda functions triggered by DynamoDB Streams to run ETL
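The session mentioned the AWS SDK for .NET; for consistency with the other sketches here, this is the equivalent direct-put call in Python with boto3. The delivery stream name and event shape are hypothetical placeholders.

```python
import json
import boto3

# Direct-put ingestion sketch. The delivery stream (hypothetical
# name) is configured to deliver straight to S3.
firehose = boto3.client("firehose")

event = {"event_type": "order_placed", "deal_id": "abc123", "qty": 2}

firehose.put_record(
    DeliveryStreamName="woot-events-to-s3",
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
```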
Building it all together
- Work with the data in S3
- AWS Glue handled the ETL scheduling
- Use Glue crawlers to collect the metadata
- Crawlers recognize new columns and new schemas
- Transform JSON-based data from Firehose into Parquet (see the job sketch below)
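The JSON-to-Parquet step can be expressed as a small Glue ETL job. The sketch below reads a crawler-cataloged table and writes Parquet; the database, table, and output path are hypothetical placeholders.

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Glue job sketch: convert JSON delivered by Firehose into Parquet.
# Database, table, and output path are hypothetical placeholders.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw JSON via the Data Catalog populated by a Glue crawler.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="woot_raw", table_name="firehose_events"
)

# Write columnar Parquet, which Athena queries faster and more cheaply than JSON.
glue_context.write_dynamic_frame.from_options(
    frame=raw,
    connection_type="s3",
    connection_options={"path": "s3://example-data-lake-transformed/events/"},
    format="parquet",
)
job.commit()
```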
The Woot data lake solution
- Amazon Kinesis Data Firehose for data ingestion
- Amazon S3 for data storage
- AWS Lambda and AWS Glue for data processing
- AWS Database Migration Service (AWS DMS) and AWS Glue for data migration
- AWS Glue for orchestration and metadata management
- Amazon Athena and Amazon QuickSight for querying and for visualizing data
- AWS Directory Service for user identity
GODS architecture
- GODS (Google's of Data Services; Good Old Data Services)
- Query data from any source
- An evaluation that calculates the forecast
- What is the sales price?
- How new is the item?
- Is it a tier 1 or smaller vendor?
Job orchestration
- A lot of different data sources, processes and ETLs
- Store all job statuses in DynamoDB with dependency tags
- Every time a table is updated, run a Lambda function to keep track of the job status (see the sketch below)
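The talk described the pattern but not the schema, so the sketch below is one plausible shape for that Lambda: it reads the status table's DynamoDB stream and starts any Glue job whose dependency tag names the job that just completed. The table name, attribute names, and dependency-tag format are all hypothetical.

```python
import boto3
from boto3.dynamodb.conditions import Attr

# Hypothetical job-status table and attribute names; the real schema
# wasn't shown in the session.
dynamodb = boto3.resource("dynamodb")
glue = boto3.client("glue")
status_table = dynamodb.Table("etl-job-status")

def handler(event, context):
    # Triggered by the status table's DynamoDB stream.
    for record in event["Records"]:
        new_image = record.get("dynamodb", {}).get("NewImage", {})
        if new_image.get("job_status", {}).get("S") != "COMPLETED":
            continue
        finished_job = new_image["job_id"]["S"]
        # Start every downstream Glue job whose dependency tag names
        # the job that just finished.
        dependents = status_table.scan(
            FilterExpression=Attr("depends_on").eq(finished_job)
        )
        for item in dependents["Items"]:
            glue.start_job_run(JobName=item["job_id"])
```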
What's next?
- Adopt AWS Lake Formation
- Multiple environments
- Keep consistency across multiple environments
- Configuration simplification
- Transactional data
- Incremental data loads
- ETL and view simplification
- More deal evaluation
- Models
- Historical tracking system
Lessons learned
- Aggregation
- Pre-aggregate raw data, such as clickstream data
- Preserve raw data
- Versioning
- Re-aggregate when a problem happens
- Reduce cost by moving old data into Glacier (a lifecycle sketch follows this list)
- Service limits
- Balance the number and size of files against query speed
- Data quality
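Moving old data into Glacier is a standard S3 lifecycle rule. A minimal sketch, assuming a hypothetical bucket, prefix, and a 365-day cutoff:

```python
import boto3

# Lifecycle rule sketch: transition aged raw data to Glacier.
# Bucket, prefix, and the 365-day threshold are hypothetical choices.
s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake-raw",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-raw-data",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 365, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```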
Pain points
- Poor visibility: when an Athena query fails, the failure isn't reported quickly
- QuickSight doesn't have a default admin role
- Autosave is on by default
- Due to a resource limitation, Glue retries on the same worker process, which also fails
- Glue can't read malformed datetime fields
Wins
- Magic features
- Performance
- AWS integrated
- Ease of use
- Flexibility
Data points
- 12 TB -> 60 TB
- 80% is new data
- 40 hours saved weekly
- Automate business reports
- 90% operating cost reduction
- Reduce licensing costs and cluster costs
- 1 account -> 8 AWS accounts sharing data
- Calculate 600 million rows in a week
- 0 screaming Woot monkeys harmed