Autopilot: A journey towards clairvoyant resource management in Google’s internal cloud

Speaker: Dr. hab. Krzysztof Rządca, prof. UW (University of Warsaw & Google)
Venue: Seminar room B1-7/8
Date and time: December 3, 2019, 12:30

Abstract: When working on scheduling theory, I used to start the description of my model with a magic phrase: “We assume jobs’ resource requirements are known (a clairvoyant model)”. In this talk I’ll describe how this assumption might be achieved in a real-world, large-scale infrastructure.

When submitting a job to a cloud, a user specifies limits on the job’s resource usage. However, humans are notoriously bad at predicting, especially when it comes to intangible resources such as CPU cores or GB of memory. Under-prediction has potentially disastrous consequences, in particular for user-facing jobs: a job exceeding its limits might be throttled or killed, resulting in end-user requests being delayed or dropped. Thus, human operators tend to err on the side of caution and over-allocate. Summed over the whole infrastructure, such widespread over-allocation and the resulting low utilization of hardware lead to significant costs from infrastructure over-expansion.

In my talk I will describe Autopilot, a production automation tool that Google uses in its internal cloud. Autopilot sets jobs’ resource limits using a combination of heuristics and machine learning. The context (a low-level production system that many higher-level services depend on) translates into ambitious reliability and efficiency requirements that are challenging for an ML application.