[Session Report] How to build a platform that developers love (How Starbucks created their platform) #DOP219

This session report covers how Starbucks developed its platform for its tenants. It sheds light on the platform's early history and original idea, and on how the team learned from its mistakes to arrive at the current platform, which efficiently supports the application teams.
2023.11.30

This session explains how Starbucks built its platform to cater for its tenants, why it needed a platform in the first place, what worked for Starbucks, and the team's take on the recent shift from DevOps to platform engineering.

It also covers their best practices, how they measure success and failure, their idea of platform ownership, and the way they observe the platform using Datadog.

I want to write this session report in the form of an interview-style Q&A.

session info

Q: What is platform engineering for Starbucks?

platform

When the effort started, the team had to decide whether to stay on-premises or to use the public cloud offerings that were available, while a lot was happening at Starbucks at the same time. The result is a platform designed around a shared ownership model between the platform team and the application teams. It first went live in Japan and, after applying what was learned there, was rolled out for America.

history

The evolution of platform engineering at Starbucks has 3 phases:

phases

Phase 1: Origin

Learning how to manage infrastructure code and how to lay out a roadmap for reliability and scaling were the key areas of this phase, which was carried out through a handmade, manual approach.

Phase 2: Evolution

In this phase, upgrading the infrastructure was of the essence. The team moved from the handmade, manual approach to pipelines and tackled the scaling tooling carried over from Phase 1. Questions such as which tool is best for a given job were settled, and runbooks were defined.

Phase 3: Expansion (current day)

Now that the platform is developed, the next major business use case is bringing new tenants onto it. That in turn leads to new use cases, new opportunities, and redefined processes, with the ultimate aim that any automation that helps the platform team should help the application teams too.

Q: What are your guiding principles?

Have a solid foundation

A solid foundation allows the architecture to grow. Knowing what you want to achieve, and for whom, is critical.

Leading by example

If we want tenants to take ownership and agree on responsibilities, the platform team has to establish those habits first: in the way it maintains documentation, automates its processes, owns its mistakes, and so on.

Long term aspirations

The principles need to capture what the platform could become and what it can grow into, with clear expectations and aspirations, because the platform is like a product that has to evolve, be optimised, and grow over time.

Q: Why does your team want to expand and transition to the platform?

why to use platform

To achieve security by design, declaratively define scalable cloud environments, divide ownership between the application teams and the platform team, and establish shared responsibility.

ownership

Q: What was the plan as a platform team?

plan

  • Reduce platform pain so that developers can focus on application needs. The platform team will make the platform better.

    • Built a lot of automation
    • CI/CD: easy-to-use pipelines
    • The small stuff: automate small tasks, and automate tasks that take a lot of time
    • Processes: add processes, and define rails and guidelines for how the tenant-platform relationship operates
  • Onboarding
    • How tenants will be onboarded to the platform
    • The first part of the process for a new tenant
      • we define boundaries: what the tenant is responsible for and what the platform is responsible for
      • what support to expect and what the platform will offer
      • demo environment: a number of tutorials that give a feel for the platform
    • Second step: access to the demo environment, which shows how to use the platform.
    • Observability:
    • Template monitoring: for onboarding new tenants
      • build a templated GitHub repo
      • allows tenants to download the template and customise its values
    • How do tenants get help and support
      • Created an internal community forum on Slack called Slack Overflow. As a result, a culture has been established where tenants help each other.
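The session did not show the contents of the templated repo, so the following is only a minimal Python sketch of the template monitoring idea: a monitor definition with placeholders that a tenant fills in with their own values. The field names, the `render_monitor` helper, the default threshold, and the example tenant values are all hypothetical and do not reflect Starbucks' templates or Datadog's monitor schema.

```python
import json

# Hypothetical monitor template shipped by the platform team.
# Field names are illustrative only, not a real Datadog schema.
MONITOR_TEMPLATE = {
    "name": "[{tenant}] High error rate on {service}",
    "query": "error_rate{{service:{service}}} > {error_threshold}",
    "notify": "{slack_channel}",
}

# Platform-recommended defaults that a tenant may override.
DEFAULTS = {"error_threshold": 0.05}


def render_monitor(template, tenant_values):
    """Fill every placeholder in the template with tenant-specific values.

    Tenant values take precedence over the platform defaults. A missing
    value raises KeyError, so an incomplete customisation fails early.
    """
    values = {**DEFAULTS, **tenant_values}
    return {key: text.format(**values) for key, text in template.items()}


# Example: a made-up tenant customising the template for one service.
monitor = render_monitor(
    MONITOR_TEMPLATE,
    {
        "tenant": "orders-team",
        "service": "checkout",
        "slack_channel": "#orders-alerts",
    },
)
print(json.dumps(monitor, indent=2))
```

Keeping the defaults with the platform team and the overrides with the tenant mirrors the shared-ownership split described in the talk: the platform recommends, the tenant decides.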

demo env

monitoring template

slack overflow

Q: What is the end goal?

  • Setting tenants up for success!

Q: How do you measure success?

success

  • We have done things wrong and iterated on them. Have a feedback loop with tenants and act on it. The platform is never going to be perfect; it has to be treated like a product.

How to measure success/failure

  • Philosophy:
    • Blameless culture
    • there is always room for new ideas, and new engineers are also part of the planning process for new features.
  • Continuous approach
    • each iteration is an opportunity to improve
    • postmortems should be done after every iteration, not only when there is a failure.

Q: What are the Lessons learned?

lessons

  • How can the platform help
  • Have lots of documentation. Docs should exist for everything
  • Provide tutorials and templates
  • Identifying pain points
  • Being open to feedback

Q: What tooling are you happy with?

Observability

datadog

For the team, Datadog is the one-stop solution for all observability needs: logs, metrics, and traces. They call it one-stop because it integrates with everything they use.

Golden signals on Datadog

golden-signals
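The four golden signals, as described in Google's SRE book, are latency, traffic, errors, and saturation. The dashboard shown in the session is not reproduced here, so the sketch below is only an illustration in Python of how those four numbers could be derived from a window of request records. The `Request` type, the `golden_signals` function, and the use of CPU utilisation as the saturation value are assumptions, not what Starbucks or Datadog compute.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took
    status: int         # HTTP status code returned


def golden_signals(requests, window_seconds, cpu_utilisation):
    """Summarise one time window into the four golden signals.

    latency    -> 95th percentile duration (nearest-rank method)
    traffic    -> requests per second over the window
    errors     -> share of requests that returned a 5xx status
    saturation -> passed in by the caller (here: CPU utilisation, 0..1)
    """
    total = len(requests)
    if total == 0:
        return {
            "latency_p95_ms": 0.0,
            "traffic_rps": 0.0,
            "error_rate": 0.0,
            "saturation": cpu_utilisation,
        }

    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank percentile using integer arithmetic: ceil(0.95 * total).
    rank = (95 * total + 99) // 100
    server_errors = sum(1 for r in requests if r.status >= 500)

    return {
        "latency_p95_ms": durations[rank - 1],
        "traffic_rps": total / window_seconds,
        "error_rate": server_errors / total,
        "saturation": cpu_utilisation,
    }


# Example with made-up data: 20 requests in a 10-second window, 2 failures.
sample = [
    Request(duration_ms=10.0 * i, status=500 if i <= 2 else 200)
    for i in range(1, 21)
]
signals = golden_signals(sample, window_seconds=10, cpu_utilisation=0.6)
print(signals)
```

In practice an observability tool computes these from ingested metrics; the point of the sketch is only to show what each signal measures.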

What's next

There is a lot ahead for the platform team at Starbucks. Cost optimisation is part of ownership, and it is also critical that tenants keep an eye on their resource consumption. That is why the platform team provides recommended alerting as a template to help tenants. It lets tenants monitor their services and receive alerts for them.

cost optimization

alerting

Questions from the Audience

How do you start when nothing is in place and you lack the knowledge? How did you arrive at that first plan?

  • Security approval for the first tenant was the

Lessons learned on how to solicit pain points

  • ask tenants directly what is working and what is not
  • give people the confidence to raise issues
  • build a culture where they feel free to ask

On onboarding: what do you do for existing tenants when changes happen?

  • Updating tutorials, bringing existing tenants along with us
  • get active feedback while building it

Suggestion for a team who doesn't use Datadog?

  • Datadog was described as the most comprehensive tool, covering all the features, including monitoring as code and observability as code [NO PAID PROMOTION]

What is the composition of the platform team, and is there a formal process for gathering requirements through interviews? Do you have a PM?

In the past, engineers did this as part of their job, with a Slack tool for collecting feedback and asking questions. At present, there is a Tenant engagement solution team.

Q: What is challenging when it comes to culture, such as standardising documentation?

  • Separate the source of truth for documentation
    • Tenant-facing docs: Confluence, where all pages live.
    • GitHub repo for the platform team

Takeaway

devops is not dead

As a DevOps engineer, I feel DevOps is not at all DEAD. The DevOps approach, skills, and tools are still needed; they are brought under the umbrella of the platform to support the application teams (tenants). A platform cannot be built in a day. Proper planning is required to decide what the platform will serve as a foundation for. Before separating responsibilities, it is important to implement features as a platform team first and to take a long-term view on expanding and continuously improving the platform.

Observability cannot be ignored; it is as critical as documenting everything in order to support tenants. Automation is key, so proper tooling and well-defined processes are important for a successful platform. So is a continuous approach to improvement, because the platform is like a product that needs to be updated, improved, and made to cater to growing needs.