How Woot.com built a serverless data lake with AWS analytics #reinvent [ANT333]
This post is a session report on ANT333: How Woot.com built a serverless data lake with AWS analytics at AWS re:Invent 2019.
The Japanese version of this post is available here.
Abstract
Woot.com designed and developed a data lake as a replacement for their legacy data warehouse to deliver powerful analytics capabilities across multiple business areas. In this session, learn how it used Amazon EMR, Amazon Athena, AWS Glue, and Amazon QuickSight to build an automated process to ingest and centralize data from external and internal sources for immediate analysis.
Speakers
- Karthik Kumar Odapally
- Sr Solutions Architect, Amazon Web Services
- Zhaolu (Yolanda) Meng
- Woot.com
- Theo Carpenter
- Sr. Systems Manager, Woot.com
Woot.com gained a modern data lake architecture that brought them many benefits. Let's look at why they migrated their data store to a data lake on AWS.
Content
What is Woot!?
- An Amazon-owned subsidiary company
- But everything runs outside of the Amazon ecosystem
- Has small engineering teams whose work impacts the entire organization
- "One day, one deal"
Problem statement
- Automation
- Add data in minutes and standardize the data solution
- Scaling
- A platform or a solution for scaling with the exponential growth
- Reporting
- User configurable
- How do we report on the actual data?
- How do we ensure that we're not looking at false positives?
- How do we ensure that the BI is easy to act on the data in near real-time?
- Performance
- Data now
- When we have peak events like Black Friday, we need to have access to the data at that time
- Accessibility
- Tools and log-on
- The ability to meet customer expectations
Legacy solution
- Single database (DB) instance
- Shared resource
- Complex custom ingestion
- The reports were not being generated during Black Friday
- 30,000 lines of code, which made bug fixing difficult
- Difficult to use
- Separate DB users
- Learning curve
- Took 3 days to fix Excel data formatting
Requirements
- Platform agnostic
- Any data, from any source, compatible in the cloud
- Resilient
- Separation of duties
- A platform where running a single query doesn't take hours
- Self-service
- Data democratization
- Have engineers focus on business problems rather than access management and database tuning
10,000-ft view
- Architected as Lego blocks
- Corporate data center
- Legacy SQL server
- NAV services
- Woot production VPC
- Using AWS managed services
- Data Lake
- Use S3 as a landing zone
- Raw data and transform data
- ETL
- Glue loads data from S3
- Adds the business logic and the magic sauce
- DWH
- Reporting
- Load into Amazon Redshift
- Access from third-party vendors
- Athena for engineers (a query sketch follows this list)
- QuickSight for instant connectivity
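As a concrete example of the engineer-facing access path, here is a minimal sketch of running an Athena query with boto3. The database, table, and result bucket names are hypothetical placeholders, not the actual Woot setup.

```python
import time
import boto3

# Minimal Athena query sketch; database, table, and output bucket
# are hypothetical placeholders.
athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT deal_id, COUNT(*) AS orders FROM orders GROUP BY deal_id",
    QueryExecutionContext={"Database": "woot_data_lake"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes, then fetch the result set.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```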
Migrate existing data
- Selected S3 as a data lake
- Set up AWS DMS and streamed existing data into S3
- Moved data between buckets in different accounts
- Brought compressed files over to the other side, which saved on costs
- Included column names for Glue (see the endpoint sketch below)
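The compression and column-name details map to settings on the DMS S3 target endpoint. The sketch below shows how they might be configured with boto3; the endpoint identifier, bucket, and role ARN are hypothetical placeholders.

```python
import boto3

# Sketch of a DMS target endpoint that lands migrated data in S3.
# Identifier, bucket, and role ARN are hypothetical placeholders.
dms = boto3.client("dms")

dms.create_endpoint(
    EndpointIdentifier="woot-s3-target",
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-access",
        "BucketName": "example-data-lake-raw",
        "CompressionType": "gzip",  # compressed files save on transfer and storage costs
        "AddColumnName": True,      # include column names so Glue can pick up the schema
    },
)
```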
Building data pipelines
- No dependency on any particular technology
- Scalability
- Use the AWS SDK for .NET with Kinesis Data Firehose
- Direct-put ingestion (a Python equivalent is sketched after this list)
- Point directly to S3
- S3 meets the requirements of availability, scalability and durability
- Separate accounts
- Had Lambda functions triggered by DynamoDB Streams to run ETL
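The session mentioned the AWS SDK for .NET; for consistency with the other sketches here, this is the equivalent direct-put call in Python with boto3. The delivery stream name and event shape are hypothetical placeholders.

```python
import json
import boto3

# Direct-put ingestion sketch. The delivery stream (hypothetical
# name) is configured to deliver straight to S3.
firehose = boto3.client("firehose")

event = {"event_type": "order_placed", "deal_id": "abc123", "qty": 2}

firehose.put_record(
    DeliveryStreamName="woot-events-to-s3",
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
```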
Building it all together
- Work with the data in S3
- AWS Glue handled the ETL scheduling
- Use Glue crawlers to collect the metadata
- Crawlers recognize new columns and new schemas
- Transform JSON-based data from Firehose into Parquet (see the job sketch below)
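The JSON-to-Parquet step can be expressed as a small Glue ETL job. The sketch below reads a crawler-cataloged table and writes Parquet; the database, table, and output path are hypothetical placeholders.

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Glue job sketch: convert JSON delivered by Firehose into Parquet.
# Database, table, and output path are hypothetical placeholders.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw JSON via the Data Catalog populated by a Glue crawler.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="woot_raw", table_name="firehose_events"
)

# Write columnar Parquet, which Athena queries faster and more cheaply than JSON.
glue_context.write_dynamic_frame.from_options(
    frame=raw,
    connection_type="s3",
    connection_options={"path": "s3://example-data-lake-transformed/events/"},
    format="parquet",
)
job.commit()
```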
The Woot data lake solution
- Amazon Kinesis Data Firehose for data ingestion
- Amazon S3 for data storage
- AWS Lambda and AWS Glue for data processing
- AWS Database Migration Service (AWS DMS) and AWS Glue for data migration
- AWS Glue for orchestration and metadata management
- Amazon Athena and Amazon QuickSight for querying and for visualizing data
- AWS Directory Service for user identity
GODS architecture
- GODS (Google's of Data Services; Good Old Data Services)
- Query data from any source
- An evaluation that calculates the forecast
- What is the sales price?
- How new is the item?
- Is it a tier 1 or smaller vendor?
Job orchestration
- A lot of different data sources, processes and ETLs
- Store all job statuses in DynamoDB with dependency tags
- Every time a table is updated, run a Lambda function to keep track of the job status (see the sketch below)
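The talk described the pattern but not the schema, so the sketch below is one plausible shape for that Lambda: it reads the status table's DynamoDB stream and starts any Glue job whose dependency tag names the job that just completed. The table name, attribute names, and dependency-tag format are all hypothetical.

```python
import boto3
from boto3.dynamodb.conditions import Attr

# Hypothetical job-status table and attribute names; the real schema
# wasn't shown in the session.
dynamodb = boto3.resource("dynamodb")
glue = boto3.client("glue")
status_table = dynamodb.Table("etl-job-status")

def handler(event, context):
    # Triggered by the status table's DynamoDB stream.
    for record in event["Records"]:
        new_image = record.get("dynamodb", {}).get("NewImage", {})
        if new_image.get("job_status", {}).get("S") != "COMPLETED":
            continue
        finished_job = new_image["job_id"]["S"]
        # Start every downstream Glue job whose dependency tag names
        # the job that just finished.
        dependents = status_table.scan(
            FilterExpression=Attr("depends_on").eq(finished_job)
        )
        for item in dependents["Items"]:
            glue.start_job_run(JobName=item["job_id"])
```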
What's next?
- Adopt AWS Lake Formation
- Multiple environments
- Keep consistency across multiple environments
- Configuration simplification
- Transactional data
- Incremental data loads
- ETL and view simplification
- More deal evaluation
- Models
- Historical tracking system
Lessons learned
- Aggregation
- Pre-aggregate raw data, such as clickstream data
- Preserve raw data
- Versioning
- Re-aggregate when a problem happens
- Reduce cost by moving old data into Glacier (a lifecycle sketch follows this list)
- Service limits
- Balance the number and size of files against query speed
- Data quality
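Moving old data into Glacier is a standard S3 lifecycle rule. A minimal sketch, assuming a hypothetical bucket, prefix, and a 365-day cutoff:

```python
import boto3

# Lifecycle rule sketch: transition aged raw data to Glacier.
# Bucket, prefix, and the 365-day threshold are hypothetical choices.
s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake-raw",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-raw-data",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 365, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```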
Pain points
- Poor visibility: when an Athena query fails, the failure isn't reported quickly
- QuickSight doesn't have a default admin role
- Autosave is on by default
- Due to a resource limitation, Glue retries on the same worker process, which also fails
- Glue can't read malformed datetime fields
Wins
- Magic features
- Performance
- AWS integrated
- Ease of use
- Flexibility
Data points
- 12 TB -> 60 TB
- 80% is new data
- 40 hours saved weekly
- Automate business reports
- 90% operating cost reduction
- Reduce licensing costs and cluster costs
- 1 account -> 8 AWS accounts sharing data
- Calculate 600 million rows in a week
- 0 screaming Woot monkeys harmed