Spot Instances are servers that you can bid on; their supply comes from servers that have been paid for but are not in use. But you can’t have these servers forever: if the original customer wants them back, you get two minutes to return them. Can the business value of saving all that money be balanced against the operational risk of running on such (theoretically) unstable infrastructure? At Yelp, the answer to that question is "Yes!" Yelp uses AWS Spot Fleet, a mechanism to launch EC2 instances in Amazon Web Services at deep discounts (~80%).
Yelp was an early adopter of Mesos and Marathon, building PaaSTA, a PaaS that provides an easy way for developers to deploy their services and batches. As Yelp migrated more parts of its infrastructure to run on PaaSTA, it had to figure out how to maximize cluster utilization and minimize costs. In this talk from MesosCon EU 2017 in Prague, Rob Johnson discusses how Yelp autoscales both services and servers, shuffling tasks around its Mesos cluster to improve utilization while dealing with the extra volatility caused by running on AWS Spot Fleet. He tells stories of outages, covers strategies for improving resilience against AWS pulling the plug on instances with two minutes' warning and for gracefully migrating services that are actively serving traffic, and discusses how Yelp decides when to increase and decrease cluster capacity.
About the speaker:
Rob Johnson works as a Site Reliability Engineer on the Operations team at Yelp in London. Most of Rob's time is spent developing PaaSTA, Yelp’s internal platform-as-a-service, which runs nearly all of Yelp's production services. Rob has spoken at MesosCon previously about PaaSTA, and is keen to return to talk about how the platform has grown and developed.