Leveraging workload knowledge to design data center networks

PhD thesis [ pdf ]

Author:
Vojislav Dukic

Data center networks are at the heart of cloud infrastructure. They allow cloud workloads to scale, remain flexible, and meet the needs of modern businesses. As demand for higher bandwidth, lower latency, and lower resource cost keeps growing, cloud providers must continuously improve the efficiency of their infrastructure. The key to achieving optimal performance of data center networks is understanding the communication needs and behavior of modern cloud workloads.

Intuitively, as cloud operators collect more knowledge about their tenants and applications, i.e., obtain workload specifications such as required bandwidth, tail latency targets, or time to first byte, they can use that knowledge to provision the physical network infrastructure precisely and deploy sophisticated control algorithms that maximize performance, increase utilization, and reduce the cost of cloud resources. However, obtaining and leveraging workload specifications is challenging in practice. On the one hand, users lack clear performance and cost incentives to invest the effort of specifying their workloads. On the other hand, cloud operators cannot provide those benefits without a substantial number of workload specifications. Thus, many proposed systems that depend on application-specific knowledge remain in the domain of academic research, despite their significant performance advantages.
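To make the notion of a workload specification concrete, the minimal Python sketch below shows the kind of per-workload record such a specification could be reduced to. The field names and values are illustrative assumptions, not the schema used in the thesis.

```python
# Hypothetical illustration of a workload specification record.
# Field names and example values are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class WorkloadSpec:
    tenant_id: str
    required_bandwidth_gbps: float   # sustained bandwidth the workload needs
    p99_latency_us: float            # tail (99th percentile) latency target
    time_to_first_byte_ms: float     # TTFB target for request/response flows
    burstiness: float                # peak-to-average traffic ratio

# Example: a latency-sensitive key-value store front end.
spec = WorkloadSpec(
    tenant_id="tenant-42",
    required_bandwidth_gbps=5.0,
    p99_latency_us=200.0,
    time_to_first_byte_ms=1.0,
    burstiness=4.0,
)
print(spec)
```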

To break this vicious circle, we propose a set of methods, theoretical results, and systems that improve the process of obtaining and using workload specifications in the data center network environment, and we demonstrate how to explore and exploit the space of possible specifications. We start with coarse-grained insights from past executions of cloud workloads and show how they can be used to reduce the cost of physical network infrastructure. Our system, Iris, leverages this historical knowledge to reduce the overall cost of one of the most expensive parts of cloud networks, the Data Center Interconnect (DCI), by an order of magnitude compared to equivalent workload-agnostic solutions.
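As a rough illustration of why historical knowledge helps, the sketch below provisions DCI capacity for a high percentile of past demand instead of the absolute peak. It is a toy example with synthetic traffic and an assumed 95th-percentile target, not a description of how Iris itself works.

```python
# A minimal sketch (not Iris itself) of how historical traffic knowledge can
# cut inter-data-center bandwidth cost: provision for a high percentile of
# observed demand rather than the absolute peak, absorbing the rare excess
# by delaying deferrable traffic. All numbers here are synthetic assumptions.
import random

random.seed(0)
# Synthetic history of aggregate DCI demand (Gb/s), sampled every 5 minutes for ~1 week.
history = [random.lognormvariate(3.0, 0.6) for _ in range(2016)]

def provisioned_capacity(samples, percentile):
    """Return the capacity needed to cover the given fraction of samples."""
    ranked = sorted(samples)
    idx = min(len(ranked) - 1, int(percentile * len(ranked)))
    return ranked[idx]

peak = max(history)
p95 = provisioned_capacity(history, 0.95)

print(f"peak-provisioned capacity : {peak:7.1f} Gb/s")
print(f"95th-percentile capacity  : {p95:7.1f} Gb/s")
print(f"capacity saved            : {100 * (1 - p95 / peak):5.1f} %")
```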

Furthermore, we analyze how to obtain fine-grained workload specifications that describe the future behavior of cloud applications and use them to improve network efficiency. Our system, Flux, automatically infers such advance specifications using machine learning. We then show how to leverage these specification estimates to deploy sophisticated network control and scheduling mechanisms that achieve an order-of-magnitude improvement in flow completion time and queue occupancy compared to the systems deployed in the cloud today.
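To illustrate how specification estimates can drive scheduling, the sketch below orders flows by their predicted size, a shortest-predicted-flow-first policy that tends to reduce mean flow completion time compared to FIFO service. It is a hypothetical stand-in, not Flux's actual control logic, and the flow names and sizes are invented.

```python
# A minimal sketch of one way predicted specifications can drive scheduling:
# serve flows in order of predicted remaining size. This is an illustrative
# stand-in, not Flux's algorithm; flows and sizes below are made up.
import heapq

def schedule_by_prediction(flows):
    """flows: list of (flow_id, predicted_bytes).
    Returns a service order that favors flows predicted to be short,
    which tends to reduce mean flow completion time compared to FIFO."""
    queue = [(predicted, fid) for fid, predicted in flows]
    heapq.heapify(queue)
    order = []
    while queue:
        _, fid = heapq.heappop(queue)
        order.append(fid)
    return order

arrivals = [("backup", 10_000_000), ("rpc-a", 2_000), ("rpc-b", 8_000)]
print(schedule_by_prediction(arrivals))   # ['rpc-a', 'rpc-b', 'backup']
```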

Finally, we provide a set of rules and guidelines that cloud providers need to satisfy in order to motivate tenants to collaborate in obtaining and utilizing workload specifications and, ultimately, to make these specification-dependent systems practical in the modern cloud environment.