www.micahlerner.com

Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer

2025-01-03T00:00:00-08:00


What is the research and why does it matter?

This paper shares Google’s techniques for operating a large cluster of machine learning resources (specifically, tensor processing units, aka “TPUs”) reliably and at scale.

A challenge the research discusses stems from the requirements of modern AI, which needs significant computing power to train and serve models. In systems this large, component failures lead to costly downtime in training and serving, a topic I previously touched on in Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints. The key insight from the paper is that dynamic network reconfiguration (e.g. routing network traffic from training models around faulty TPUs and to working ones) dramatically improves the availability and scalability of the system.

How does the system work?

There are five main technical components of the system:

TPU chips and their groupings into cubes (a “cube is a hardware unit with 64 TPU chips arranged in a 4x4x4 3D mesh”) and pods (64 cubes).
The inter-chip interconnect (ICI), that directly interconnects TPUs to allow device-to-device communication (the paper cites Remote Direct Memory Access, aka RDMA) without involving the CPUs. The ICI is like the “highway” that connects network traffic.
Optical circuit switches (OCSes) which contain mirrors that actually point the network traffic (in the form of light) on the right “highway” (provided by the ICI). This technology is discussed in more detail in Jupiter evolving: Transforming Google’s datacenter network via optical circuit switches and software-defined networking.
The Borg cluster manager (described in previous research) combined with a Pod Manager to handle TPU-specific considerations (e.g. ensuring Pod connectivity and health).
Code that allows TPUs to connect to one another (libtpunet) and evaluates hardware health (healthd).

The key insight of the paper is that through the combination of existing general technologies (e.g. the Borg cluster scheduler) and TPU-specific adaptations (e.g. dynamically reconfiguring the network if a TPU pod has problems), it is possible to make a more resilient system that recovers from hardware failure/degradation quickly.

The design of the system is influenced by previous versions where connections between TPUs were static (in contrast with the new system where connections can be easily reconfigured) - “in a static pod, all resources in a contiguous set of nodes must be simultaneously healthy to be assigned to a user, which becomes combinatorially less likely as the system scales”.
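
To make the "combinatorially less likely" point concrete, here is a minimal sketch comparing the two cases. It assumes each cube is independently healthy with the same probability, and it models a static pod as a single fixed set of cubes; both are simplifications of mine rather than anything stated in the paper, and the function names are hypothetical.

```python
from math import comb


def p_static_slice_healthy(p_cube: float, cubes_needed: int) -> float:
    """Static pod: one fixed set of cubes must all be healthy at once."""
    return p_cube ** cubes_needed


def p_reconfigurable_slice_healthy(
    p_cube: float, cubes_needed: int, total_cubes: int = 64
) -> float:
    """Reconfigurable pod: any `cubes_needed` healthy cubes in the pod will do.

    This is the binomial probability of at least `cubes_needed` healthy cubes.
    """
    return sum(
        comb(total_cubes, k) * p_cube**k * (1 - p_cube) ** (total_cubes - k)
        for k in range(cubes_needed, total_cubes + 1)
    )


# Hypothetical per-cube health probability of 99%.
for cubes in (1, 16, 48):
    print(
        cubes,
        p_static_slice_healthy(0.99, cubes),
        p_reconfigurable_slice_healthy(0.99, cubes),
    )
```

Under these assumptions the static probability shrinks geometrically with job size, while the reconfigurable one stays close to 1 until a job needs nearly every cube in the pod.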

Furthermore, static configurations posed three other challenges:

Maintenance: updating any part of the stack that connects TPUs mandated downtime.
Workload defragmentation: a job had to be scheduled across contiguous TPUs, which made it hard to schedule jobs at different priorities (e.g. training jobs vs smaller experiments). With dynamic network reconfiguration, jobs can shift resources and the connections to use them on demand.
Deployment lead time: a TPU pod wasn’t available until all of the underlying resources were ready and installed. Now, TPU pods are usable even when only partially deployed.

How does reconfigurability work?

A key part of reconfigurability is quickly changing connectivity between TPU cubes/pods in response to failure or degradation. Two components work in concert to provide connectivity - the inter-chip interconnect (ICI) connects TPUs within a machine (4 TPU chips) and cube (16 machines), while the optical circuit switches (OCSes) connect cubes (each containing 64 TPUs).

The ICI handles multiple levels of the networking stack, with technologies like RDMA (which supports collectives like “combine all of the data from other TPUs into a single result”) at the top-level.

The libtpunet software is responsible for setting up the inter-chip interconnect (ICI) according to a job’s needs, as well as discovering (using breadth first search, who said Leetcode isn’t useful!) and monitoring links (the latter of which becomes a factor in reconfiguration). The data generated by libtpunet also informs scheduling decisions.
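
As a rough illustration of breadth-first link discovery, here is a minimal sketch. The adjacency map and chip names are hypothetical, and this is not libtpunet's actual interface, which the post does not describe.

```python
from collections import deque


def discover_chips(start: str, links: dict[str, list[str]]) -> list[str]:
    """Return every chip reachable from `start`, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        chip = queue.popleft()
        order.append(chip)
        for neighbor in links.get(chip, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order


# Hypothetical 4-chip topology; "d" has no working link to the others.
links = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"], "d": []}
print(discover_chips("a", links))  # ['a', 'b', 'c']
```

A chip that never shows up in the traversal (like "d" here) is one candidate signal that a link is down.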

Connectivity changes at three points in the lifecycle of a TPU cube: on job start, on failure, and on migration/preemption.

On job start, after a user submits a job that requires multiple cubes, the Borg scheduler selects which cubes to use. Then, the Pod Manager configures the Optical circuit switches (OCSes) to connect these cubes together in the right pattern. If that happens successfully, the Pod Manager communicates this fact back to Borg, which handles deploying the binary that will be executed and running the job.

Second, connectivity changes on failure, as detected by two software components (libtpunet and healthd). Examples include a TPU machine breaking down, an optical circuit switch (OCS) needing maintenance, or a network link starting to show errors. The two aforementioned systems provide signals to Borg/Pod Manager, which update the network topology for running jobs to route around failures. Additionally, these two systems reflect current state in the system’s source of truth, ensuring that future jobs don’t run into the same issues.
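
The failure path can be sketched as swapping a failed cube for a healthy spare while recording the failure. This is an illustration only: the function, its arguments, and the "lowest-numbered spare" policy are all hypothetical, not the Pod Manager's real behavior.

```python
def reassign_on_failure(
    job_cubes: list[int],
    failed_cube: int,
    spare_cubes: set[int],
    unhealthy: set[int],
) -> list[int]:
    """Swap a failed cube for a healthy spare and record the failure."""
    # Update the source of truth so future jobs avoid the failed cube.
    unhealthy.add(failed_cube)
    candidates = sorted(spare_cubes - unhealthy)
    if not candidates:
        raise RuntimeError("no healthy spare cube available")
    replacement = candidates[0]
    spare_cubes.discard(replacement)
    # The job keeps its shape; only the failed cube is substituted.
    return [replacement if cube == failed_cube else cube for cube in job_cubes]


spares, unhealthy = {5, 7}, set()
print(reassign_on_failure([0, 1, 2], 1, spares, unhealthy))  # [0, 5, 2]
```

In the real system the OCSes would then be reprogrammed so the replacement cube takes the failed cube's place in the topology; that step is not modeled here.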

Lastly, connectivity changes on migration/preemption - this can happen when the cluster manager chooses to de-frag a job into fewer cubes, or relocate a job to make room for a larger one. In this case, the system needs to ensure that the new resources and networking links are configured correctly.

How is the research evaluated?

To evaluate the system, the paper includes data on the number of reconfigurations, ability to automate maintenance, and the impact of failure recovery mechanisms on performance.

To evaluate reconfiguration rate, the authors compare the number of xconnect actions with the number of jobs submitted. xconnect handles the low-level changes to networking connections, and varies with the number of jobs.

Additionally, the paper contains information on the failure rate of different system components:

In an average supercomputer, each day, 0.08% of the TPU machines, 0.005% of the ICI cables, and 0.04% of the OCS experience a failure. While these values are small, the number of jobs that are impacted by hardware outages is non-trivial because each supercomputer has a large number of machines, ICIs, and OCS. Machine and ICI outages are automatically tolerated by reconfiguring jobs to use spare healthy cubes.
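
To see why small daily rates still matter, here is a back-of-the-envelope sketch for TPU machines only. It uses the sizes given earlier in this post (64 cubes per pod, 16 machines per cube) and assumes failures are independent, which the paper does not claim.

```python
MACHINES_PER_CUBE = 16  # from this post: a cube is 16 machines of 4 chips each
CUBES_PER_POD = 64  # from this post: a pod is 64 cubes
DAILY_MACHINE_FAILURE_RATE = 0.0008  # 0.08% per day, from the quote above


def expected_machine_failures_per_day(machines: int, rate: float) -> float:
    """Expected number of machines that fail on a given day."""
    return machines * rate


def p_at_least_one_failure(machines: int, rate: float) -> float:
    """Probability of one or more failures in a day, assuming independence."""
    return 1 - (1 - rate) ** machines


machines = MACHINES_PER_CUBE * CUBES_PER_POD  # 1024 machines in a full pod
print(expected_machine_failures_per_day(machines, DAILY_MACHINE_FAILURE_RATE))
print(p_at_least_one_failure(machines, DAILY_MACHINE_FAILURE_RATE))
```

Under these assumptions a full pod sees roughly 0.8 machine failures per day, so a job spanning the whole pod is more likely than not to be affected on any given day.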

The claim about automatically reconfiguring the system (both in response to failures and to handle changing workloads) is supported by data shared at the beginning of the paper - availability of the system grows, even with significantly larger TPU pods. Interestingly, being able to reconfigure at all has a much higher impact on availability than support for fault tolerant routing.

Lastly, the paper notes that there is a non-negligible cost to setting up a job for fault-tolerance, although not all jobs take advantage of the technique.

Specifically, fault-tolerant routing can slow down jobs due to congested network links, most notably for collective operations. This is visible in results for training recommendation models which use “embeddings” to represent data (part of training these models is updating embeddings on other chips, which is performed by collectives), and all-to-all communication to access embeddings on different TPUs.

Conclusion

The infrastructure challenges this paper addresses have become even more critical with the growth in AI model sizes and associated computational demands (although there is active discussion on scaling walls). I’d be interested to know whether the approach and pod-structure should or would scale beyond the current implementation - for example, what happens with even larger TPU pods (or is increasing TPU pod size a non-goal)? Lastly, I was interested in the opportunity to reduce the impact of failures even further through the use of “hot-standbys” without the use of checkpointing (potentially migrating state from a problematic node), and I’m looking forward to future research on that front.

ServiceRouter: Hyperscale and Minimal Cost Service Mesh at Meta

2024-03-28T00:00:00-07:00


What is the research and why does it matter?

Many tech companies have distributed services deployed in the cloud in regions around the world. The systems often depend on each other, meaning that they need to determine which dependencies are where (service discovery), and route the requests across the network (often performed via a “service mesh”). Inter-system communication also needs to be highly reliable and load balanced.

This paper is about Meta’s infrastructure that implements these capabilities, called ServiceRouter.

While there are well known open source systems for routing traffic (e.g. Linkerd, Envoy, and Istio), there are a few interesting components of ServiceRouter:

ServiceRouter supports embedding inside Meta application code, significantly reducing cost from the common pattern of running separate “service mesh” infrastructure - the paper suggests that a separately deployed service mesh at Meta scale would need the equivalent of 1,750,000 AWS t4g.small VMs.
ServiceRouter is one of the first pieces of RPC routing infrastructure discussed in research deployed at hyperscale.
ServiceRouter is able to handle sharded stateful services, unlike open source alternatives.
The technology handles load balancing across regions using the novel idea of “locality rings”.

How does the system work?

Design

There are three main functions of ServiceRouter:

Gathering the data that informs how services talk to each other.
Distributing that data reliably around the network.
Routing a request from a service to another service.

To build a source of truth for routing decisions (which the paper calls the Routing Information Base (RIB)), ServiceRouter gathers information from the cluster manager about which services are running where. Importantly, ServiceRouter can also handle stateful services (discussed in my previous paper review on ShardManager) - for example, some services will store a specific subset of data on a specific server, so knowing the server alone is not enough.

As an input to the Routing Information Base, ServiceRouter also gathers information that allows it to make decisions about how services talk to each other across clusters (for example, monitoring the latency of traffic from North America to South America).

ServiceRouter then distributes the Routing Information Base across infrastructure in the network to allow services to make routing decisions.

To implement the part of the system responsible for making routing decisions, ServiceRouter supports three main types of deployments: SRLib, Remote Proxy, and Sidecar Proxy (the paper also mentions a fourth, SRLookaside which is now deprecated).

SRLib embeds the ServiceRouter functionality actually inside the application binary, deeply integrated with application source code. While this introduces some risk (e.g. if the embedded library had a bug or vulnerability, all applications would need to be re-released), it dramatically reduces hardware cost.

There are several situations in which SRLib performs suboptimally - for example, with traffic that goes across regions, it is preferable to have a smaller set of proxies that perform the RPC forwarding using long-held connections, lowering the overhead of sending RPCs.

ServiceRouter also supports codebases where it is difficult or impossible to embed SRLib directly. The paper cites one example of internal Erlang applications which didn’t have builtin support for the library, but still wanted to make use of Meta-internal systems.

Load Balancing

One of the most novel features of ServiceRouter is its approach to global load balancing traffic across regions.

The system implements this capability using the idea of locality rings for a service:

An RPC client uses cross-region RTTs to estimate its latency to different servers. Starting from ring1, if the client finds any RPC server whose latency is within the latency bound for ring i, it filters out all servers in ring i+1 and above, and randomly samples two servers from ring i. If the service has no servers in ring i, it considers servers in ring i+1, and so forth. SR’s default setting maps [ring1 | ring2 | ring3 | ring4] to [same region | neighboring regions | same continent | global].
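
The ring selection described in the quote can be sketched as follows. The latency bounds and server names are made up for illustration, and the function is a hypothetical stand-in rather than ServiceRouter's API.

```python
import random

# Hypothetical latency bounds (ms) for ring1..ring4; real values are tuned per service.
RING_LATENCY_BOUNDS_MS = [5.0, 30.0, 100.0, float("inf")]


def pick_candidates(
    server_latency_ms: dict[str, float], rng: random.Random
) -> list[str]:
    """Return up to two servers sampled from the closest non-empty ring."""
    for bound in RING_LATENCY_BOUNDS_MS:
        # Reaching this bound means all closer rings were empty, so every
        # server within the bound belongs to the current ring.
        ring = sorted(s for s, ms in server_latency_ms.items() if ms <= bound)
        if ring:
            return rng.sample(ring, k=min(2, len(ring)))
    return []


rng = random.Random(0)
print(pick_candidates({"a": 2.0, "b": 3.0, "c": 50.0}, rng))  # two of ring1: a and b
print(pick_candidates({"c": 50.0, "d": 200.0}, rng))  # ring1/ring2 empty: ['c']
```

The quoted passage stops at sampling two servers; how the client chooses between them (for example, by comparing load) is not covered by this sketch.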

The paper discusses several downsides to this approach, notably that latency alone doesn’t reflect how servers in a locality ring are being utilized. To solve this shortcoming, ServiceRouter integrates another input to the Routing Information Base - the load of a “locality ring”. This data allows ServiceRouter to support functionality like “route X% of traffic to this locality ring until the load of that locality ring exceeds X%, then send traffic to the next locality ring.” This is particularly useful during incidents, where traffic can spill across multiple regions.
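
A minimal sketch of load-aware spillover, assuming a single load threshold shared by every ring. The fallback to the least-loaded ring when all rings are over the threshold is my own assumption, not something the paper specifies.

```python
def choose_ring(ring_load: list[float], threshold: float) -> int:
    """Return the index of the closest ring whose load is below `threshold`.

    `ring_load` is ordered from the closest ring to the farthest and must be
    non-empty.
    """
    for index, load in enumerate(ring_load):
        if load < threshold:
            return index
    # Assumption: if every ring is over the threshold, use the least-loaded one.
    return min(range(len(ring_load)), key=ring_load.__getitem__)


print(choose_ring([0.50, 0.20], threshold=0.75))  # 0: the local ring has headroom
print(choose_ring([0.80, 0.20], threshold=0.75))  # 1: spill to the next ring
```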

The paper also discusses alternatives to the locality ring approach, including relying solely on RPC latency and feedback from a service about overload to decide when to send traffic to a different locality ring - the authors decided not to follow this approach as they argue that routing would change only under severe overload.

How is the research evaluated?

The paper evaluates ServiceRouter on four main aspects: its scalability, the cost-savings of an embedded routing library, performance of global load balancing, and ability to handle sharded services.

To assess scalability, the paper shares data on the number of servers used by services and the requests per second by service:

A small fraction of services are very large while most are very small. Specifically, while 90% of services each use less than 200 servers, 2% of services each use more than 2,000 servers and the largest service uses about 90K servers…Similarly, while most services have a low RPS, some hyperscale services process billions of RPS.

The paper also discusses several scalability challenges, specifically with the Routing Information Base, which must store data on Meta’s ever-changing services and production infrastructure. Interestingly, the authors say that the RIB is not currently a bottleneck, following their work to migrate off of Zookeeper and onto a custom datastore.

To evaluate hardware cost, the paper compares RPC latency and CPU overhead for Meta’s raw RPC library (called Thrift), embedded SRLib and the SRProxy - “across the RPC client and proxy, the SRProxy setup in total consumes more than twice the amount of CPU cycles as the SRLib setup”.

The paper also includes several production use cases of SRProxy. One example was for a sharded system that sends traffic cross-region. SRProxy was able to reduce cross-region latency because it reuses connections. Because ServiceRouter was able to effectively support cross-region load balancing, the system didn’t need to replicate all the shards to all the regions, significantly reducing capacity usage.

To evaluate load balancing, the paper considers the permutations of same-region and cross-region load balancing for both sharded and unsharded services. For same-region traffic of unsharded services, load balancing is quite good, represented with a low “coefficient of variation” for CPU usage and outstanding requests. The story for sharded services is more complicated due to inherent shard imbalance - “some shards are hot (receiving a lot of traffic) while others are cold (receiving little traffic), due to the nature of data stored in the shards.” In other words, even if ServiceRouter load balances perfectly, there will always be some variation of load between shards.
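
For reference, the coefficient of variation is the standard deviation divided by the mean, so lower values mean more even load. A small sketch with made-up CPU utilization numbers:

```python
from statistics import mean, pstdev


def coefficient_of_variation(samples: list[float]) -> float:
    """Population standard deviation divided by the mean."""
    return pstdev(samples) / mean(samples)


# Hypothetical per-server CPU utilization (%) for two services.
print(coefficient_of_variation([50.0, 50.0, 50.0]))  # 0.0: perfectly balanced
print(coefficient_of_variation([10.0, 90.0]))  # 0.8: heavily imbalanced
```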

To evaluate global load balancing with locality rings, the paper includes an example of an incident where traffic spilled cross-region, and ServiceRouter was able to balance load below the 75% locality threshold.

Lastly, the paper shows that traffic to sharded services makes up a significant portion of total traffic, highlighting the requirement that this nuance needs to be natively supported in Meta’s service mesh.

Conclusion

While service meshes aren’t necessarily novel, ServiceRouter’s deployment at scale, along with its implementation of global load balancing and support for sharded services, is unique.

Load balancing cross region at scale, in particular to handle reliability issues, is non-trivial. I’d be interested in hearing more about how teams formulate locality rings (as from the paper, it seems like some custom tuning is involved). Furthermore, the ideas behind locality rings seem ripe for further development - are latency and CPU usage the only factors that locality rings should be limited to? Relying only on those two metrics seems like a potential source of further instability during an incident (e.g. if a region was quickly returning many errors, its CPU utilization might appear low, meaning that ServiceRouter would send requests there, potentially exacerbating an outage with overload).

Lastly, embedding SRLib in an application’s code saves resources, but seems like it would introduce risk. For example, if SRLib had a fleet-wide security vulnerability or performance regression that couldn’t be turned off, what would the impact to services and developers be?

In future paper reviews, I’ll continue diving deeper on sharded services in hyperscale environments - for example, I’m planning on comparing ServiceRouter and its discussion of sharded services with Google’s paper from 2016 on Slicer.

A Cloud-Scale Characterization of Remote Procedure Calls

2024-03-03T00:00:00-08:00

This is one of several papers I’ll be reading from 2023’s Symposium on Operating Systems Principles (SOSP). If you’d like to receive regular updates as soon as they’re published, check out my newsletter or follow me on the site formerly known as Twitter. Enjoy!

“A Cloud-Scale Characterization of Remote Procedure Calls”

What is the research and why does it matter?

This paper is slightly different from others I’ve written about recently - rather than a novel system design, it contains a characterization of a production system, with the goals of sharing data with the research community.

Specifically, the research dives deep on contributors to RPC latency and performance in Google’s hyper-scale systems, including Google Search, Google Maps, Gmail, and YouTube. In some areas, this data matches existing research and points towards the benefits that further investment could provide. Other datapoints don’t match up with previous thinking, indicating the possibility of new research threads or focused interest in existing ideas!

Characteristics of RPCs at Hyperscale

The paper focuses on production traffic that uses Google’s internal RPC library, Stubby. This traffic powers first-party services and their accesses to other internal systems and databases. The authors use data from Monarch (the subject of a previous paper review!), Dapper (a library for distributed tracing) and Google Wide Profiling (which continuously gathers performance data from services deployed in Google datacenters).

An analysis of these datasets exposes five insights about Google-internal RPCs.

RPC performance is improving over time.
Some RPCs take microseconds, while many take milliseconds.
The RPC call graph mostly involves fan out, rather than deep call trees.
Many RPC request/response sizes are small, but some are quite large.
A significant portion of RPC traffic is associated with access to storage.

First, the authors measure RPC performance improvement over time using “RPCs per CPU cycle”. This effect is because of factors like optimizing the RPC library (which reduces the cost of sending RPCs, allowing more RPCs to be sent with fewer resources). In turn, these performance improvements are posing a greater load on other resources, like the network.

Second, “not all RPCs are the same” in terms of their latency - some take microseconds, while others take milliseconds. Furthermore, a small number of RPC calls make up a majority of traffic, meaning that optimizing them could have outsized impact. Other calls are infrequent, but take up significant computation - “the slowest 1000 RPC methods account for only 1.1% of all calls, but they take 89% of the total RPC time.” The authors share data on per-method RPC latency and frequency to demonstrate these trends.

Third, “RPCs are Wider than Deep” - RPCs have significant fan out into other systems in Google’s infrastructure, but don’t normally result in many services calling each other far down into the stack. The authors note this behavior matches with existing studies from Alibaba and Meta. The paper visualizes this insight with CDFs of “descendants” and “ancestors” in the RPC call graph - “looking at the number of descendants shows the scale of distributed computation performed by an RPC, and the number of ancestors provides insights into how the properties of RPCs change as they get deeper into the call graph of a root RPC.”

Fourth, there is an “elephant and mice distribution” of RPC sizes - “most RPCs are small with the smallest a single cache line (64 B)”. Others are significantly larger - “P99 requests and responses are 196 KB and 563 KB”. This data shows that projects like hardware accelerators would be able to optimize significant parts of the RPC workload, but would not be able to handle others (specifically, the authors reference “Zerializer: Towards zero-copy serialization”). The authors present this data using CDFs that show percentiles of request sizes and the ratio between response/request.

Lastly, a significant portion of RPC traffic is associated with accesses to storage - “these findings motivate application-specific optimizations, especially on storage systems, as storage is by far the largest distributed application in the fleet.”

RPC Latency

The paper dives into the sources of RPC latency in a client-server interaction - at a high level, the components boil down to client/server send and receive queues, server processing logic, the networking stack, and networking infrastructure.

To describe the cost of sending an RPC to an external service, minus server processing time, the paper uses the term RPC latency tax - the paper focuses on this because while “application-processing time dominates…[the] RPC tax can be significant.” This tax applies no matter how good a server gets at returning a response - for many RPC calls this tax makes up the bulk of their time.
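
As an illustration of the definition, here is a sketch that treats the latency tax as total latency minus server processing time. The breakdown into fields follows the components listed above, but the field names and numbers are hypothetical and not the paper's instrumentation.

```python
from dataclasses import dataclass


@dataclass
class RpcTiming:
    """Per-stage latency of one RPC, in microseconds (hypothetical breakdown)."""

    client_send_queue_us: float
    request_network_us: float
    server_recv_queue_us: float
    server_processing_us: float
    server_send_queue_us: float
    response_network_us: float
    client_recv_queue_us: float

    def total_us(self) -> float:
        # Every field is a latency component, so the total is their sum.
        return sum(vars(self).values())

    def latency_tax_us(self) -> float:
        # Everything except the application's own processing time.
        return self.total_us() - self.server_processing_us


timing = RpcTiming(10, 50, 5, 100, 5, 50, 10)
print(timing.total_us(), timing.latency_tax_us())  # 230 130
```

In this made-up example the tax (130 µs) is larger than the processing time (100 µs), which is the situation the paper flags for many RPCs.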

This tax also varies across different types of services. For example, RPCs to an SSD cache would benefit the most from reducing the time an RPC spends in the server send queue, while RPC calls to the F1 database would benefit the most from reducing time in the client recv queue.

The RPC latency tax also varies across clusters - in other words, a service can respond faster to RPCs when it is deployed in cluster A instead of cluster B. This happens because of characteristics of the cluster, like CPU utilization and memory bandwidth - the paper calls these exogenous variables.

Each application category reacts differently towards these exogenous variables. Bigtable is a server-processing-heavy workload, and its performance is highly dependent on CPU utilization, memory bandwidth, wake-up time, and cycles per instruction. Video Metadata is queuing heavy, which follows a similar trend.

Resource Utilization of RPCs

There is also another cost for RPCs, the CPU cost, which the paper calls the cycle tax. Multiple components of the RPC flow contribute, but compression dominates.

The paper also evaluates the CPU cycle usage from unsuccessful RPCs - the single largest contributor is cancelled requests (likely sent because of request hedging). Other types of potentially avoidable errors consume a surprising amount of CPU resources (e.g. “entity not found” response codes).

Conclusion

I enjoyed this paper because of its focus on providing data on the potential impact of several open areas of academic research - without this thorough characterization, it would be difficult to understand their expected value.

While many proposals are focused on Attack of the killer microseconds, these improvements aren’t required for many RPCs. The research also highlights challenges with solutions to known problems like tail latency - approaches like request hedging have their own downsides in wasted CPU resources. Rather than trying to globally optimize RPCs, focusing on specific operations is likely to have the highest impact - “the 10 most popular RPC methods account for 58% of all calls and the top-100 account for 91% of all calls.” On the hardware front, accelerators (some of which have already been discussed in research) could yield significant benefits - for example, previous papers evaluated a hardware accelerator for protocol buffers.

With the insights from this paper, I’m looking forward to seeing how the authors follow up with future improvements and to see how other groups respond in adjusting the direction of their research (or not).

Lastly, the research cited a number of other industry studies about microservices architectures and their costs that I’m hoping to dive into with future paper reviews - specifically ServiceRouter.

Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints

2024-01-30T00:00:00-08:00


What is the research and why does it matter?

Training AI models requires a large amount of compute resources, in particular GPUs. Many large companies purchase their own GPUs, leading to both up front costs of acquisition, as well as ongoing spend to power the clusters where the GPUs are hosted. Furthermore, at an organization with many projects requiring these resources, there is contention for compute time.

While the first two sources of cost are difficult to minimize, effective usage of compute time is a growing area of research. One way that teams are making gains is by improving the reliability of model training - if a machine involved in training fails, in-progress work may be lost. Motivated by data that existing solutions don’t handle this case well, the authors propose a framework for limiting the amount of wasted resources - “According to the report from OPT-175B training…about 178,000 GPU hours were wasted due to various training failures.”

The Gemini paper aims to solve this problem by providing a failure recovery system for training runs. Rather than solely relying on far remote storage which is costly to read/write to, Gemini builds a multi-level cache comprising GPU memory, local and remote CPU memory, with persistent storage as a last resort. This new design and implementation is able to successfully achieve 13x faster failure recovery without significantly impacting training.

Motivation

A key idea in Gemini is “checkpointing” progress so that when a failure happens, the system doesn’t need to recompute results.

The paper lays out three factors to “wasted time” in existing checkpoint-based systems:

Checkpoint time: how long a model takes to build an intermediate checkpoint.
Checkpoint frequency: how often a training system builds intermediate states.
Retrieval time: how long it takes to fetch a checkpoint.

These factors manifest in existing systems where checkpoints are resource-intensive to create, limiting how often the system produces them. In turn, this impacts the freshness of checkpoints and increases “wasted time” - a system resuming from an old checkpoint will redo more computation.
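
One simple way to combine the three factors is sketched below. It assumes a failure is equally likely at any point between two checkpoints, so on average half an interval of work is lost; that modeling choice and the function name are my assumptions, not necessarily the paper's exact formulation.

```python
def average_wasted_time_s(
    checkpoint_time_s: float,
    checkpoints_per_hour: float,
    retrieval_time_s: float,
) -> float:
    """Estimate average wasted time per failure, in seconds.

    - checkpoint_time_s: the newest usable checkpoint is at least this old,
      because it had to finish being written.
    - half an interval: expected progress lost since that checkpoint started.
    - retrieval_time_s: time spent fetching the checkpoint before resuming.
    """
    interval_s = 3600 / checkpoints_per_hour
    return checkpoint_time_s + interval_s / 2 + retrieval_time_s


# Hypothetical numbers: 60 s to checkpoint, 120 s to retrieve.
print(average_wasted_time_s(60, 1, 120))  # 1980.0 with hourly checkpoints
print(average_wasted_time_s(60, 12, 120))  # 330.0 with a checkpoint every 5 minutes
```

The example shows why frequency matters: checkpointing twelve times as often removes most of the wasted time, but only if checkpoints are cheap enough to create that often.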

Beyond creating the checkpoints, the system must store them and make them available with low-latency - to drive down wasted time, Gemini proposes a system that aims to “maximize the probability of failure recovery from checkpoints stored in CPU memory”.

How does the system work?

There are two parts of the Gemini system: the checkpoint creation module and the failure recovery module.

The checkpoint creation module creates checkpoints and figures out where to store them. It implements a distributed multi-level cache of checkpoints stored in CPU Memory, GPU Memory, and Remote Storage.

The failure recovery module is responsible for evaluating whether a component of the system has failed and needs replacement - it contains four components:

Gemini worker agents: each node participating in training has a process that reports on machine health and provides state updates.
Root agent: a worker agent that is promoted to leader via a distributed consensus algorithm (e.g. Raft) and issues commands to workers (e.g. recover from failure).
Distributed KV Store: stores state on the machines in the network and assists in electing new root agents on failure.
Cloud Operator: the root agent communicates with a hosting provider (e.g. a central scheduler) to perform actions like requesting more resources (e.g. more machines with GPUs).

Technical Challenges

The system aims to minimize “wasted time” by limiting the impact of machine failure. Key to the system’s approach is distributing checkpoints across machines in the network. The paper investigates different configurations of where to place the checkpoints, which it calls group, ring, and mixed. The paper also includes pseudocode of the group algorithm.
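
To give a feel for what the placement strategies might look like, here is a sketch of my reading of group and ring placement. It is not the paper's pseudocode; the function names are hypothetical, the group version assumes the machine count divides evenly by the replica count, and the mixed strategy is not shown.

```python
def group_placement(num_machines: int, replicas: int) -> dict[int, list[int]]:
    """Split machines into consecutive groups of size `replicas`.

    Each machine stores its checkpoint on every machine in its own group,
    itself included.
    """
    if num_machines % replicas != 0:
        raise ValueError("this sketch assumes num_machines is divisible by replicas")
    placement = {}
    for machine in range(num_machines):
        group_start = (machine // replicas) * replicas
        placement[machine] = list(range(group_start, group_start + replicas))
    return placement


def ring_placement(num_machines: int, replicas: int) -> dict[int, list[int]]:
    """Each machine stores its checkpoint on itself and its next neighbors."""
    return {
        machine: [(machine + offset) % num_machines for offset in range(replicas)]
        for machine in range(num_machines)
    }


print(group_placement(4, 2))  # {0: [0, 1], 1: [0, 1], 2: [2, 3], 3: [2, 3]}
print(ring_placement(4, 2))  # {0: [0, 1], 1: [1, 2], 2: [2, 3], 3: [3, 0]}
```

With two replicas on four machines, group placement loses a checkpoint only if both machines of the same group fail, while ring placement loses one if any two adjacent machines fail.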

Distributing checkpoints is not without its cost, particularly to networking resources - as model training also uses the network, competing traffic could impact performance. To address this the authors implement traffic interleaving, which attempts to send the checkpoints over the network in a way that doesn’t interfere with other network traffic associated with training.

After creating a checkpoint, Gemini transfers it to remote GPU memory on another machine in the network, then that machine transfers it to CPU memory (which is relatively cheaper and more abundant).

A simple implementation following this approach transfers the whole checkpoint at once across the network to remote GPU memory. As checkpoints can be quite large, this implementation requires the destination machine to set aside a large amount of GPU memory just for receiving checkpoints (or risk hitting out of memory errors). Instead, the paper proposes splitting up a checkpoint into partitions, then incrementally transferring them over the network.
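
The partitioning idea can be sketched with plain byte buffers. This only illustrates bounding the receive buffer: the names are hypothetical, and real transfers would go over the network between GPU and CPU memory rather than through Python objects.

```python
from typing import Iterable, Iterator


def partition_checkpoint(checkpoint: bytes, buffer_bytes: int) -> Iterator[bytes]:
    """Yield the checkpoint in chunks no larger than the receive buffer."""
    if buffer_bytes <= 0:
        raise ValueError("buffer_bytes must be positive")
    for start in range(0, len(checkpoint), buffer_bytes):
        yield checkpoint[start : start + buffer_bytes]


def receive(chunks: Iterable[bytes]) -> bytes:
    """Stand-in for the receiver: drain each chunk into CPU memory."""
    cpu_copy = bytearray()
    for chunk in chunks:
        # Only one chunk occupies the (small) staging buffer at a time.
        cpu_copy.extend(chunk)
    return bytes(cpu_copy)


data = bytes(range(10))
chunks = list(partition_checkpoint(data, buffer_bytes=4))
print([len(c) for c in chunks])  # [4, 4, 2]
print(receive(chunks) == data)  # True
```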

The authors also describe an approach to online profiling where the dynamics of network traffic are learned over time and then eventually feed into the decision making for sending traffic over the network. By combining this idea with checkpoint partitioning, Gemini is able to make decisions about when to send buffers over the network.

Resuming from failure

The authors describe two different failure types: software failure (e.g. the code running training has a bug) and hardware failure (e.g. the machine loses network connectivity or a hard drive breaks). Critically, Gemini treats the two types differently because of the impact they have on the in-memory data used to restore from checkpoint - a software failure can likely recover from a checkpoint stored in memory, while a hardware failure often requires a combination of machine replacement and fetching checkpoints from a different computer in the network.
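
The distinction between the two failure types can be sketched as a choice of where to read the checkpoint from. The enum, function, and return strings are hypothetical labels for the storage tiers mentioned in this post, not Gemini's actual interface.

```python
from enum import Enum


class Failure(Enum):
    SOFTWARE = "software"
    HARDWARE = "hardware"


def choose_checkpoint_source(failure: Failure, peer_has_replica: bool) -> str:
    """Pick the fastest tier that should still hold a usable checkpoint."""
    if failure is Failure.SOFTWARE:
        # The machine is still up, so its own in-memory checkpoint survives.
        return "local CPU memory"
    if peer_has_replica:
        # The machine was replaced; fetch the replica held by another machine.
        return "remote CPU memory"
    # No in-memory copy is left anywhere: fall back to the slowest tier.
    return "persistent storage"


print(choose_checkpoint_source(Failure.SOFTWARE, peer_has_replica=True))
print(choose_checkpoint_source(Failure.HARDWARE, peer_has_replica=True))
print(choose_checkpoint_source(Failure.HARDWARE, peer_has_replica=False))
```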

How is the research evaluated?

The research evaluates Gemini’s impact on training efficiency, and effectiveness in traffic interleaving. Additionally, the authors make projections around the system’s scalability and impact it could have in training large language models.

For training efficiency, the paper measures whether Gemini changes training time, wasted time, and checkpoint time. When training three large models, the paper finds that Gemini doesn’t increase iteration time (time where the model is doing work before it must pause to communicate) while significantly reducing wasted time in the presence of machine failures. Checkpoint time also goes down when compared to existing checkpoint-based training solutions.

The paper also measures the effectiveness of traffic interleaving (specifically by tracking iteration time), comparing the Gemini approach against existing baselines and other approaches (e.g. the naive implementation without checkpoint partitioning) - the Gemini solution doesn’t result in out of memory issues while keeping the iteration time the same and being able to recover from failure.

Lastly, the research contains projections about Gemini’s ability to reduce wasted time if the system were applied to training a large language model - while the results of this projection seem promising, more work is needed to measure Gemini’s effectiveness at scale.

Conclusion

Gemini is a system that could potentially dramatically reduce wasted time in training AI models - as models continue to grow and use more resources in a distributed setting, recovering from failure will become even more of a concern than it already is.

One of my main takeaways from the Gemini paper is around the application of systems ideas to AI models and their training. For example, Gemini takes advantage of common patterns like reliance on a distributed key-value store, leader election, and a multi-tier memory system. The idea that adopting well-known patterns could lead to dramatic performance and reliability improvements in this new type of serving system is quite exciting - it means there is a lot of low hanging fruit!

I’m looking forward to further developments in this space, and hope to see a followup paper from the authors soon with more data on training a model at scale (or alternatively a reference to using Gemini-like techniques from other organizations).

XFaaS: Hyperscale and Low Cost Serverless Functions at Meta

2024-01-23T00:00:00-08:00

“XFaaS: Hyperscale and Low Cost Serverless Functions at Meta”

Background

Function-as-a-Service systems (a.k.a. FaaS) allow engineers to run code without setting aside servers to a specific function. Instead, users of FaaS systems run their code on generalized infrastructure (like AWS Lambda, Azure Functions, and GCP’s Cloud Functions), and only pay for the time that they use.

Key Takeaways

This paper describes Meta’s internal system for serverless, called XFaaS, which runs “trillions of function calls per day on more than 100,000 servers”.

Besides characterization of this unique at-scale serverless system, the paper dives deeper on several challenges that the authors addressed before reaching the current state of the infrastructure:

Handling load spikes from Meta-internal systems scheduling large numbers of function executions.
Ensuring fast function startup and execution, as slow startup can hurt the developer experience and decrease resource utilization.
Global load balancing across Meta’s distributed private cloud, avoiding datacenter overload.
Ensuring high-utilization of resources to limit cost increases from running the system.
Preventing overload of downstream services, as functions often access or update data via RPC requests when performing computation.

How does the system work?

Architecture

The multi-region infrastructure of XFaaS contains five main components: Submitter, load balancers, DurableQ, Scheduler, and Worker Pool.

Clients of the system schedule function execution by communicating with the Submitter. Functions can take one of three types:

(1) queue-triggered functions, which are submitted via a queue service; (2) event-triggered functions, which are activated by data-change events in our data warehouse and data-stream systems; and (3) timer-triggered functions, which automatically fire based on a pre-set timing.

The submitter is an interesting design choice because it serves as an entry point to downstream parts of the system. Before the pattern was introduced, clients interfaced with downstream components of the system directly, allowing badly behaved services to overload XFaaS - now, clients receive default quota, and the system throttles those that exceed this quota (although there is a process for negotiating higher quota as needed).
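A rough sketch of quota-based throttling at an entry point like the Submitter follows. The class, the per-window counting scheme, and the quota numbers are assumptions of mine for illustration - the paper does not describe the Submitter's implementation at this level.

```python
from collections import defaultdict
from typing import Dict


class Submitter:
    """Hypothetical entry point that throttles clients exceeding their quota."""

    def __init__(self, default_quota: int) -> None:
        self.default_quota = default_quota         # calls allowed per window
        self.negotiated: Dict[str, int] = {}       # clients with a higher quota
        self.used: Dict[str, int] = defaultdict(int)

    def set_quota(self, client: str, quota: int) -> None:
        self.negotiated[client] = quota

    def submit(self, client: str) -> bool:
        """Return True if the call is accepted, False if it is throttled."""
        quota = self.negotiated.get(client, self.default_quota)
        if self.used[client] >= quota:
            return False
        self.used[client] += 1
        return True

    def start_new_window(self) -> None:
        self.used.clear()


submitter = Submitter(default_quota=2)
submitter.set_quota("batch_pipeline", 4)           # negotiated higher quota
default_results = [submitter.submit("new_client") for _ in range(3)]
negotiated_results = [submitter.submit("batch_pipeline") for _ in range(3)]
```

Because every client passes through the same check, a badly behaved caller is rejected at the edge instead of pushing load onto the components further down the request flow.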

The next stage in the request flow is forwarding the initial function execution request to a load balancer (the Queue Load Balancer, or QueueLB) sitting in front of durable storage (called DurableQ) that contains metadata about the function. The QueueLB is one of the places where XFaaS uses load balancers, and it ensures effective utilization of distributed system resources while preventing overload.

Once the information about a function is stored in a DurableQ, a scheduler will eventually attempt to run it - given that there are many clients of XFaaS, the scheduler “determine(s) the order of function calls based on their criticality, execution deadline, and capacity quota”. This ordering is represented with in-memory data structures called the FuncBuffer and the RunQ - “the inputs to the scheduler are multiple FuncBuffers (function buffers), one for each function, and the output is a single ordered RunQ (run queue) of function calls that will be dispatched for execution.”
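The shape of that computation (many per-function buffers in, one ordered queue out) can be sketched as below. The field encodings, the quota semantics, and the tie-breaking rule are assumptions on my part - the paper's actual ordering policy is more involved.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FunctionCall:
    function: str
    criticality: int   # assumption: larger number = more critical
    deadline: float    # assumption: smaller number = must run sooner


def build_run_queue(
    func_buffers: Dict[str, List[FunctionCall]],
    quota: Dict[str, int],
) -> List[FunctionCall]:
    eligible: List[FunctionCall] = []
    for name, calls in func_buffers.items():
        # Capacity quota: each function contributes at most quota[name] calls,
        # taking its most urgent calls first.
        by_deadline = sorted(calls, key=lambda call: call.deadline)
        eligible.extend(by_deadline[: quota.get(name, 0)])
    # One ordered queue: most critical first, ties broken by earliest deadline.
    return sorted(eligible, key=lambda call: (-call.criticality, call.deadline))


func_buffers = {
    "thumbnail": [FunctionCall("thumbnail", 1, 50.0), FunctionCall("thumbnail", 1, 20.0)],
    "notify": [FunctionCall("notify", 3, 90.0)],
    "cleanup": [FunctionCall("cleanup", 1, 10.0), FunctionCall("cleanup", 1, 30.0)],
}
quota = {"thumbnail": 2, "notify": 1, "cleanup": 1}
run_queue = build_run_queue(func_buffers, quota)
order = [(call.function, call.deadline) for call in run_queue]
```

In this toy run the critical "notify" call goes first despite its late deadline, and the second "cleanup" call is held back because that function has already used its quota.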

To assist with load-balancing computation, a scheduler can also choose to run functions from a different region if there aren’t enough functions to run in the local-region - this decision is based on a “traffic matrix” that XFaaS computes to represent how much load a region should externally source (e.g. Region A should source functions from Regions B, C, and D because they’re under relatively higher load).

Once the scheduler determines that there is sufficient capacity to run more functions, it assigns the execution to a WorkerPool using a load-balancer approach similar to the QueueLB mentioned earlier.

Given the large numbers of different functions in the system, one challenge with reaching high worker utilization is reducing the memory and CPU resources that workers spend on loading function data and code. XFaaS addresses this constraint by implementing Locality Groups that limit a function’s execution to a subset of the larger pool.

Performance Optimizations

The paper mentions two other optimizations to increase worker utilization: time-shifted computing and cooperative JIT compilation.

Time-shifted computing introduces flexibility to when a function executes - for example, rather than specifying “this function must execute immediately”, XFaaS can delay the computation to a time when other functions aren’t executing, smoothing resource utilization. Importantly, users of the system are incentivized to take advantage of this flexibility as functions have two different quotas, reserved and opportunistic (mapping to more or less rigid timing where opportunistic quota is internally treated as “cheaper”).

Additionally, the code in Meta’s infrastructure takes advantage of profile-guided optimization, a technique that can dramatically improve performance. XFaaS ensures that these performance optimizations computed on one worker benefit other workers in the fleet by shipping the optimized code across the network.

Preventing Overload

It is critical that accessing downstream services doesn’t cause or worsen overload - an idea very similar to what was discussed in a previous paper review on Metastable Failures in the Wild. XFaaS implements this by borrowing the idea of backpressure from TCP (specifically Additive increase/multiplicative decrease) and other distributed systems.
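For readers unfamiliar with additive increase/multiplicative decrease (AIMD), here is a minimal sketch of the control loop applied to a concurrency limit. The class name, parameters, and the "healthy" signal are assumptions for illustration - the paper does not specify XFaaS's constants or what exactly it measures.

```python
class AimdLimiter:
    """Hypothetical limit on concurrent calls to one downstream service."""

    def __init__(
        self,
        initial: float = 10.0,
        minimum: float = 1.0,
        maximum: float = 100.0,
        increase: float = 1.0,
        decrease_factor: float = 0.5,
    ) -> None:
        self.limit = initial
        self.minimum = minimum
        self.maximum = maximum
        self.increase = increase
        self.decrease_factor = decrease_factor

    def on_result(self, downstream_healthy: bool) -> float:
        if downstream_healthy:
            # Additive increase: probe for more capacity slowly.
            self.limit = min(self.maximum, self.limit + self.increase)
        else:
            # Multiplicative decrease: back off quickly on signs of overload.
            self.limit = max(self.minimum, self.limit * self.decrease_factor)
        return self.limit


limiter = AimdLimiter(initial=10.0)
history = [
    limiter.on_result(True),    # 10.0 + 1.0
    limiter.on_result(False),   # 11.0 * 0.5
    limiter.on_result(False),   # 5.5 * 0.5
]
```

The asymmetry is the point: load ramps up gently while the downstream service is healthy, and is cut sharply as soon as it shows distress.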

How is the research evaluated?

The paper evaluates the system’s ability to achieve high utilization, efficiently execute functions while taking advantage of performance improvements, and prevent overload of downstream services.

To evaluate XFaaS’s ability to maintain high utilization and smooth load, the authors compare the rate of incoming requests to the load of the system - “the peak-to-trough ratio of CPU utilization is only 1.4x, which is a significant improvement over the peak-to-trough ratio of 4.3x depicted…for the Received curve.”

One reason for consistently high load is the incentive for users to allow flexibility in when their functions execute, highlighted by usage of the two quota types described by the paper.

To determine the effectiveness of assigning a subset of functions to a worker using Locality Groups, the authors share time series data on the number of functions executed by workers and memory utilization across the fleet, finding that both stay relatively constant.

Furthermore, XFaaS’ performance optimizations allow it to maintain a relatively high throughput, visible from contrasting requests per-second with and without profile-guided optimizations in place.

Lastly, the paper presents how XFaaS execution behaves in response to issues with downstream systems (specifically, not exacerbating outages). For example, when there was an outage in Meta’s graph database (TAO, the subject of a previous paper review), or infrastructure related to it, XFaaS reduced the execution of functions accessing these services.

Conclusion

The XFaaS paper is unique in characterizing a serverless system running at immense scale. While previous research has touched on this topic, none have provided specific numbers of utilization, likely omitted due to privacy or business concerns (although Serverless in the Wild comes close).

At the same time, the data on XFaaS comes with caveats, as the system is able to make design choices under a different set of constraints than serverless platforms from public cloud providers. For example, public clouds must guarantee isolation between customers and prioritize security considerations. While XFaaS doesn’t wholly neglect these concerns (e.g. some jobs must run on separate machines and there are some levels of isolation between jobs with these considerations), it otherwise relaxes this constraint. Furthermore, XFaaS explicitly does not handle functions on the path of a user-interaction (even though the paper discusses executing latency-sensitive functions) - this is in contrast with services like Lambda which use Serverless functions to respond to HTTP requests.

While XFaaS is a fascinating system, the paper left me with several questions including whether many of the functions the system executes would actually be better served with a batch job. Furthermore, the authors allude to XFaaS utilization being significantly higher based on “anecdotal knowledge” - while this might be true, it would be useful to know the source of this data to judge whether any differences are in fact meaningful.

Efficient Memory Management for Large Language Model Serving with PagedAttention

2024-01-11T00:00:00-08:00

Efficient Memory Management for Large Language Model Serving with PagedAttention

What is the research?

Large language models (like OpenAI’s ChatGPT, Google’s Bard, Meta’s Llama, and Mistral’s Mixtral) take in a user prompt and respond with generated text (note: for the purposes of this paper, the authors don’t include multi-modal response). Based on public reports, supporting this functionality is expensive, and given the relatively new nature of LLMs deployed at scale, there are opportunities for improving performance.

To that end, this paper focuses on increasing the queries per second (a.k.a throughput) large language models (LLMs) can serve through two innovations, PagedAttention, and vLLM, discussed in detail later in this paper review. Improving throughput can significantly decrease the cost of large language model serving by responding to more requests with the same number of GPU resources. The evaluations from the paper show that, “vLLM improves the LLM serving throughput by 2-4× compared to the state-of-the-art systems…without affecting the model accuracy at all.”

Based on the observation that large language model serving is memory bound, the authors identify several areas of improvement for GPU memory allocation, then design a system that addresses these shortcomings. One of the foremost problems they address is static allocation of memory. Existing LLM serving systems (or at least publicly released ones) set aside fixed, contiguous memory to store the data needed to generate a response. If the response to the user is shorter than this fixed size, the resources are inaccessible to use for serving other requests until the original request is complete. Requiring contiguous memory blocks adds additional resource waste by “stranding” memory between the contiguously allocated areas of memory, causing it to become unusable for serving other requests.

Borrowing a page from virtual memory, the authors propose a solution, PagedAttention, that can dynamically grow the memory used in LLM serving (in addition to incorporating other optimizations). The paper also describes how PagedAttention is implemented in a new GPU serving library via the open source vLLM project.

How does the system work?

Large language models take in a prompt from a user, then generate a text response. The paper focuses specifically on improving the performance of serving for transformers, a technology used by nearly all implementations of large language models to generate the next word in a sequence - for more background, I recommend The Illustrated Transformer and Understand how transformers work by demystifying all the math behind them.

Generating these sequences requires information on the user’s prompt, and about previous tokens in the response - this knowledge takes the form of vectors stored in memory in a data structure the authors call the Key Value Cache (aka KV cache). Because the limiting step in the execution of an LLM depends on reading and writing data to/from memory, an LLM process is “memory bound” - as a result, improving memory utilization (specifically, of the KV Cache) can increase performance of the system.

The authors identify three main types of waste in the KV Cache:

reserved slots for future tokens, internal fragmentation due to over-provisioning for potential maximum sequence lengths, and external fragmentation from the memory allocator.

PagedAttention

One of the paper’s key insights is that allowing a model to dynamically scale up its usage of non-contiguous memory can drastically improve memory utilization. The authors propose PagedAttention, which introduces the idea of logical and physical memory blocks for storing data in the KV Cache. This distinction is similar to virtual memory which provides the abstraction of contiguous RAM to a program, even though the data is physically stored in separate areas of RAM.

Blocks contain entries for more than one token, and blocks are allocated on demand based on how the LLM responds to a user query - for example, the prompt “Four score and seven years ago our fathers brought forth” contains ten tokens, causing the allocation of three blocks each with the space for four entries (the last block allocated because of the prompt is partially filled). Gradually allocating blocks primarily addresses internal fragmentation and reserved memory.

As the large language model generates tokens, it references data on previous tokens using a block table storing the mapping between logical blocks for a query and physical GPU DRAM. Critically, this approach allows for the GPU to serve multiple requests at the same time while using non-contiguous memory, addressing concerns like external fragmentation.
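The on-demand allocation and block table described above can be sketched with a toy allocator. This is only an illustration of the bookkeeping under my own assumptions (the class names are hypothetical, and plain integers stand in for blocks of GPU memory) - it is not vLLM's implementation.

```python
from collections import deque
from typing import Deque, List

BLOCK_SIZE = 4  # tokens per block, matching the example in the text


class BlockAllocator:
    """Hands out physical block ids from a shared free list."""

    def __init__(self, num_physical_blocks: int) -> None:
        self.free: Deque[int] = deque(range(num_physical_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free physical blocks")
        return self.free.popleft()


class Request:
    """Tracks one request's logical-to-physical block mapping."""

    def __init__(self, allocator: BlockAllocator) -> None:
        self.allocator = allocator
        self.block_table: List[int] = []  # index = logical block, value = physical block
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new block only when the current one is full (or none exists).
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1


allocator = BlockAllocator(num_physical_blocks=8)
request_a = Request(allocator)
request_b = Request(allocator)
for _ in range(4):
    request_a.append_token()   # first four tokens of request A
for _ in range(4):
    request_b.append_token()   # request B arrives in between
for _ in range(6):
    request_a.append_token()   # request A reaches ten tokens
unused_slots = len(request_a.block_table) * BLOCK_SIZE - request_a.num_tokens
```

Request A ends up with three blocks for its ten tokens, and those blocks are not adjacent because request B claimed one in between. The only waste is the two empty slots in A's last block.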

The paper also describes how the PagedAttention approach is able to reduce memory usage in three other large language model serving request patterns - parallel sampling, beam search, and shared prefix prompting.

Parallel sampling involves generating multiple results for a single prompt - this can occur by having the LLM choose a different token, leading to a different branch of response. The implementation follows a “copy-on-write” pattern that reuses the same data in GPU memory until the branch in output occurs (at which point, the block with the difference is copied to a new location in memory, and execution completes independently for the different branches).
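Here is a small sketch of the copy-on-write idea using reference counts. The BlockPool class, the fork and append_token helpers, and the use of strings as tokens are all assumptions for illustration - real blocks hold key and value vectors, and vLLM's code is organized differently.

```python
from typing import Dict, List, Optional

BLOCK_SIZE = 4  # tokens per block


class BlockPool:
    """Hypothetical pool of blocks with reference counts."""

    def __init__(self) -> None:
        self.blocks: Dict[int, List[str]] = {}
        self.ref_count: Dict[int, int] = {}
        self.next_id = 0

    def new_block(self, tokens: Optional[List[str]] = None) -> int:
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = list(tokens or [])
        self.ref_count[block_id] = 1
        return block_id

    def writable(self, block_id: int) -> int:
        """Return a block that is safe to write to, copying it if shared."""
        if self.ref_count[block_id] == 1:
            return block_id
        self.ref_count[block_id] -= 1
        return self.new_block(self.blocks[block_id])


def fork(pool: BlockPool, block_table: List[int]) -> List[int]:
    # A new branch shares every existing block instead of copying it.
    for block_id in block_table:
        pool.ref_count[block_id] += 1
    return list(block_table)


def append_token(pool: BlockPool, block_table: List[int], token: str) -> None:
    if not block_table or len(pool.blocks[block_table[-1]]) == BLOCK_SIZE:
        block_table.append(pool.new_block())
    else:
        block_table[-1] = pool.writable(block_table[-1])
    pool.blocks[block_table[-1]].append(token)


pool = BlockPool()
parent: List[int] = []
for word in "Four score and seven years ago".split():
    append_token(pool, parent, word)
child = fork(pool, parent)
append_token(pool, parent, "our")   # parent copies the shared last block
append_token(pool, child, "we")     # child is now sole owner, writes in place
```

After the two branches diverge, only the partially filled last block has been duplicated, while the full first block is still shared by both.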

The paper also describes PagedAttention in the context of beam search, an algorithm for generating possible next states and choosing a “top-K” subset to continue with - the paper cites Sequence to Sequence Learning with Neural Networks when referencing beam search, but I think this explanation gets the gist across better. A beam search implemented with PagedAttention can reuse blocks across multiple search paths, meaning that the process has less memory overhead.

Lastly, the paper discusses PagedAttention’s impact on prompts with a shared prefix - in many situations, a user of an LLM will provide a separate “system” prompt that applies, no matter the details of the task (this is also discussed in OpenAI’s documentation on prompt engineering). One example system prompt is, “you are a helpful agent that only speaks JSON”. PagedAttention allows the blocks allocated for this part of the prompt to be reused across multiple tasks, reducing memory usage.

vLLM

To deploy PagedAttention in a distributed environment, the paper proposes the vLLM system, containing a scheduler (which chooses which work to run where), the KV Cache Manager, Workers (computers containing GPU hardware), and Block Allocators. I elide the details of this section given that vLLM is an open source project, and the details of the infrastructure are likely to change.

That said, there were a few interesting design choices that stuck out to me:

vLLM adopts patterns from Megatron-LM, which details how to run transformers at scale across many GPUs while minimizing communication.
vLLM implements the OpenAI API interface, simplifying developer adoption.
vLLM supports higher-level abstractions (via fork, append, and free commands) used to implement approaches like beam search, parallel sampling, and shared prefix - luckily the code is open source which allows for a deeper dive!

How is the research evaluated?

The paper compares performance of models served with vLLM against other serving systems (e.g. a custom implementation of Orca, an LLM-serving system described in research from OSDI 2022), emulating workloads based on open source datasets (ShareGPT and Stanford Alpaca).

The paper compares three different types of tasks - basic sampling (e.g. normal LLM usage), search-based techniques like parallel sampling and beam search, and chatbot-like uses of LLMs (which have longer prompts, along with back and forth between the user and the LLM).

For basic sampling, parallel sampling, beam search, and chatbot-like workloads, vLLM is able to achieve significantly higher request rates.

Additionally, vLLM and PagedAttention are able to save significant amounts of memory on tasks where it is possible to re-use blocks (e.g. parallel sampling and beam search) - these graphs show average amount of memory saving as a percent, but it would be interesting to know in absolute terms.

Conclusion

PagedAttention and vLLM are at the cutting edge of systems research and its application to AI - something that is becoming more of a topic in research and in practice (e.g. Charles Frye’s post) now that LLMs are beginning to operate at scale. I’m looking forward to following along on the progress of the vLLM open source project, and from digging into the project, I discovered it is compatible with SkyPilot (an open source project for deploying infrastructure cross-cloud, discussed in research from NSDI 2023). As I tinker on LLM-based side-projects, I’m looking forward to experimenting with and learning from these promising new tools.

Blueprint: A Toolchain for Highly-Reconfigurable Microservice Applications

2024-01-02T00:00:00-08:00

Blueprint: A Toolchain for Highly-Reconfigurable Microservice Applications

What is the research?

The Blueprint paper talks about a new open source framework for configuring, building, and deploying application code. This framework aims to simplify iteration on system design, application development, and configuration.

The authors argue that these tasks are currently difficult to accomplish because many services have tight coupling between application code, framework-level components (like RPC libraries and their behavior), and the actual deployment of the service (e.g. with Docker, Kubernetes, or other systems like Ansible).

By explicitly separating concerns of an application, and explicitly defining their interactions in a programmatic configuration, the authors are able to test out new configurations of a system - for example, quickly reconfiguring a set of independently deployed microservices into a single monolithic binary, then measuring the performance impact of the change.

How does the system work?

Blueprint’s approach divides a system into three types of components:

Application level workflows: business logic that a developer writes to perform a specific function.
Scaffolding: underlying framework-level components like RPC functionality, distributed tracing libraries, and storage backends (like caches and databases).
Instantiations: specific configuration for framework-level components (e.g. using a specific RPC library with deadlines set or with novel functionality like circuit-breakers enabled).

A system is described in a programmatic configuration called a workflow spec which contains application logic and its external interface.

Next, a user of Blueprint creates a wiring spec that encodes the relationship between pieces of application code and framework-level components. In one example, the authors recreate a simple microservice for posting on a social network, including connection to external caches and databases.

Blueprint then uses the wiring spec to compile an intermediate representation (an idea common to many compilers) of the system. The intermediate representation is effectively a graph with nodes describing code and edges describing dependencies (e.g. service A calls service B).
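To illustrate the graph-shaped intermediate representation, here is a toy version using Python's standard library. The component names and the dictionary format are hypothetical - they are loosely modeled on the social network example and are not Blueprint's actual wiring spec syntax.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical wiring: each component maps to the components it depends on.
wiring: Dict[str, Set[str]] = {
    "post_service": {"post_cache", "post_db", "rpc_server"},
    "rpc_server": {"tracer"},
    "post_cache": set(),
    "post_db": set(),
    "tracer": set(),
}

# Edges of the graph, e.g. ("post_service", "post_db") means
# "post_service depends on post_db".
edges: List[tuple] = sorted(
    (node, dependency) for node, deps in wiring.items() for dependency in deps
)

# Dependencies come before their dependents, which gives a valid order in
# which artifacts for each node could be generated.
build_order: List[str] = list(TopologicalSorter(wiring).static_order())
```

Once the system is a graph, changing the design becomes an edit to the graph (for example, dropping the "tracer" node and its edge) rather than a change to application code.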

Lastly, the intermediate representation is used to build concrete artifacts representing the components of the system - for example, the build system can compile the code for a service written in Go and wrap it with a Docker image, enabling later deployments to production.

How is the research evaluated?

The authors evaluate several research claims about the implementation, but three themes stood out to me:

Does Blueprint make it easier for developers to try new configurations of a system’s existing components and libraries?
Can Blueprint be used to create system configurations that reproduce reliability issues?
What are the costs of the abstractions that Blueprint provides?

To evaluate the first question of whether Blueprint makes it easier to try new configurations for a system’s existing components, the authors considered the lines of code required to enable/disable tracing and to convert a microservice deployment into a monolith.

They were able to perform the first task of making changes to tracing with 5 lines of code. Similarly, by changing ~10 lines of code in the Blueprint configuration, they were able to generate a monolithic version of an application previously deployed as microservices, then quantify the performance impact of this change.

The authors also used Blueprint to reproduce or create reliability issues in a service - in particular they focused on Metastable failures described in a previous paper review. While creating specific configurations of a system to enable reliability testing is not necessarily a unique feature of Blueprint (e.g. Metastable Failures in the Wild discusses replicating metastability), the ease with which the authors performed this analysis was intriguing.

Lastly, the paper analyzes how long it takes for Blueprint to generate systems of different sizes. While many of the examples are based on prototype systems, the authors also ran Blueprint on a system derived from a microservice dataset published by Alibaba.

Conclusion

Blueprint’s idea of separating the concerns involved in an application seems like a promising approach to dramatically increasing the velocity of the software development lifecycle (at least for microservices). One area that seems particularly exciting about Blueprint is the ability to simplify testing different service configurations across infrastructure - for example, rather than rewriting a large body of application code to test out a new tracing library, a developer can simply swap out the code in the Blueprint definition.

From reading the paper, there are several areas of further research for Blueprint, particularly around production readiness. For example, Blueprint’s compilation of a system of ~3000 microservices, described in a paper from Alibaba, ran for 12 minutes. For organizations that would have many components in their configuration, the cost to run Blueprint would certainly be non-negligible. To speed up compilation, perhaps Blueprint could recompute only the parts of the system touched by a developer’s changes.

Furthermore, adoption via onboarding new systems to Blueprint also seems like a challenge, as developers would need to perform some implementation in order to create the definition of their system - perhaps the team behind Blueprint will expand on tooling that automates this process by reading metadata sourced from a running system (e.g. traces).

2023 and looking forward to 2024

2023-12-27T00:00:00-08:00

A tradition of mine is to write a year end reflection, regardless of whether it makes it up onto the blog or not.

2023 was a great year, and I’d go so far as to say the best of my life so far.

In my personal life, I celebrated an amazing first year of marriage, and bought a house in San Francisco. While the narrative of SF is that it is in a “doomloop”, the Bay Area has remained the center of gravity for technology innovation - nearly all of the innovative AI companies have their headquarters in “Cerebral Valley”. There is also a budding community of builders. I’m optimistic that on the medium to longer term timescale, San Francisco’s challenges will be resolved.

Professionally, I also had amazing opportunities for growth, being a tech lead for a 20+ person team of Googley engineers helping to design and scale innovative new products in Maps (e.g. immersive route preview, which uses Neural Radiance Fields). In 2024 I’ll be presenting about product reliability at industry conferences (e.g. SRECon), which I’m quite excited about. While the internet likes to critique Google and its culture, it is clear that the company 1) continuously deploys serious innovation at scale and 2) users around the world continue to love the company’s products.

Balancing my rewarding personal and professional lives impacted my bandwidth for producing content, and I explicitly prioritized the other areas of my life - tradeoffs I embraced enthusiastically. While I wrote fewer paper reviews, I also tried learning-in-public via streaming my process. These tweaks seemed to resonate, and the number of subscribers grew quickly.

I enjoy the process of learning, distilling knowledge, and sharing it with others - wholesale abandoning these activities isn’t something that I intend to do. That said, given my rewarding personal life and demanding job, I’ve rethought how I invest my time in my creative pursuits.

In 2024 and beyond, I’ll be more intentional about the technical areas that I invest in learning about. I think that some topics in systems research are beginning to provide diminishing returns for me personally - for example, while I’m far from an expert, I’m roughly familiar with flavors of distributed databases. As a result, I am not as excited about developments there, meaning I’m less likely to write a deep dive on a new paper. Furthermore, I don’t have interest in producing content solely for the goal of maximizing engagement (e.g. “going full influencer”).

Instead, I want to make a concentrated bet by pivoting my focus towards AI and the systems that power these new technologies - AI is here to stay, and will become a larger part of my life. Even without achieving artificial general intelligence, more everyday tools will begin to take advantage of recent rapid innovations. As a software engineer by trade, I can see AI’s exciting ability to impact my work on the horizon (e.g. with the rise of competent large language models capable of answering technical questions and writing code). In the past, going deep on the fundamentals has been not only interesting, but has also benefited me and my career. I’m very much looking forward to applying this approach, and seeing how this bet pays off down the road.

Looking forward to a great 2024 and thank you for joining me on this journey!

Defcon: Preventing Overload with Graceful Feature Degradation

2023-07-23T00:00:00-07:00

Defcon: Preventing Overload with Graceful Feature Degradation

This is one in a series of papers I’m reading from OSDI and Usenix ATC. These paper reviews can be delivered weekly to your inbox, or you can subscribe to the Atom feed. As always, feel free to reach out on Twitter with feedback or suggestions!

What is the research?

Severe outages can occur due to system overload, impacting users who rely on a product, and potentially damaging underlying hardware. It can also be difficult to recover from outages involving overloaded systems due to the additional problems this type of outage causes - in particular, cascading failures. There are many potential root causes of a system entering an overloaded state, including seasonal traffic spikes, performance regressions consuming excess capacity, or subtle software bugs. As such, limiting the damage caused by overload conditions is a complicated problem.

To prevent overload from impacting its products, Meta developed a system called Defcon. Defcon provides a set of abstractions that allows incident responders to increase available capacity by turning off features, an idea called graceful feature degradation. By dividing product features into different levels of business criticality, Defcon also allows oncallers to take a variety of actions depending on the severity of an ongoing incident.

The Defcon paper describes Meta’s design, implementation, and experience deploying this system at scale across many products (including Facebook, Messenger, Instagram, and WhatsApp) along with lessons from usage during production incidents.

Background and Motivation

The authors of Defcon describe several alternatives they considered when deciding how to mitigate the risk of system overload. Each of the options is evaluated on the amount of additional resources that the approach would consume during an incident, the amount of engineering effort required to implement, and the potential impact to users.

Given that serious overload events happen on a recurring basis (at least once a year), the authors decided to invest engineering resources in an engineering-intensive effort capable of limiting user impact.

How does the system work?

The core abstraction in Defcon is the knob, which represents for each feature: a unique name, whether a feature is turned on or not, the oncall rotation responsible, and a “level” corresponding to business-criticality.

After a feature is defined using this configuration, servers or applications (for example, in Web or iOS devices) import the knob into code and implement code paths that handle cases when the knob is turned off - for example, short-circuiting expensive logic.

During testing and incident response, operators change a knob’s state via a command-line or user interface, and Defcon handles replicating this state to impacted consumers (like servers and mobile applications). Knob state is also stored in a database.
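The knob abstraction and the degraded code path can be sketched as follows. Everything here is an assumption for illustration - the field names, the knob names, the in-memory registry, and the convention that level 1 is the most business-critical are mine, not Defcon's real API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Knob:
    name: str
    oncall: str           # rotation responsible for the feature
    level: int            # assumption: 1 = most business-critical
    enabled: bool = True


class KnobRegistry:
    """Hypothetical in-memory stand-in for replicated knob state."""

    def __init__(self) -> None:
        self.knobs: Dict[str, Knob] = {}

    def register(self, knob: Knob) -> None:
        self.knobs[knob.name] = knob

    def is_enabled(self, name: str) -> bool:
        return self.knobs[name].enabled

    def degrade_to(self, level: int) -> List[str]:
        """Turn off every enabled knob at this level or less critical."""
        turned_off = []
        for knob in self.knobs.values():
            if knob.enabled and knob.level >= level:
                knob.enabled = False
                turned_off.append(knob.name)
        return sorted(turned_off)


def render_feed(registry: KnobRegistry) -> List[str]:
    items = ["posts"]
    # Short-circuit the expensive logic when the knob is turned off.
    if registry.is_enabled("feed_video_autoplay"):
        items.append("autoplay_previews")
    return items


registry = KnobRegistry()
registry.register(Knob("feed_video_autoplay", oncall="feed-oncall", level=4))
registry.register(Knob("search_suggestions", oncall="search-oncall", level=3))
registry.register(Knob("messaging_send", oncall="messaging-oncall", level=1))
normal_feed = render_feed(registry)
turned_off = registry.degrade_to(3)
degraded_feed = render_feed(registry)
```

The application code only asks whether a knob is enabled, so an operator can shed load during an incident without deploying new code.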

Defcon’s Knob Actuator Service propagates state changes for two types of knobs: server-side knobs and client-side knobs:

Server-side knobs are implemented in binaries running on the servers in data centers. The advantage of server-side knobs is that we can adjust the knobs’ state in seconds without any propagation delays.

Client-side knobs are implemented in client code running on phones, tablets, wearables, and so on. The advantage of client-side knobs is that they have the capability to reduce network load by stopping requests sent to the server along side reducing server load due to the request.

Client-side knobs (like those in an iOS application) are slightly more complex to update. Under normal conditions, they change via a push (called Silent Push Notification (SPN)) or routine pull (Mobile Configuration Pull) mechanism. To handle exceptional circumstances (like needing a lower-latency response to a severe outage), Defcon can also instruct clients to pull a broader set of configuration stored in a specific server location, using a process called Emergency Mobile Configuration.

Knobs are “grouped into three categories: (1) By service name, (2) by product name, and (3) by feature name (such as “search,” “video,” “feed,” and so on)” to simplify testing during development and post-release. Testing occurs through small scale A/B tests (where one “experiment arm” of users experiences feature degradation, and the “control” arm does not) and during larger exercises that ensure the Defcon system is working (described later in the paper). These tests also have the side effect of generating data on what capacity a feature or product is using, which serves as an input to capacity planning.

During incidents, oncallers can also use the output of these tests to understand what the potential implications are of turning off different knobs.

How is the research evaluated?

The paper uses three main types of datasets to quantify Defcon’s changes:

Real-time Monitoring System (RMS) and Resource Utilization Metric (RUM), which aim to measure utilization of Meta infrastructure. Which one to use depends on the experiment, as discussed below.
Transitive Resource Utilization (TRU), which aims to measure the downstream utilization that a service has of shared Meta systems (like its graph infrastructure described in my previous paper review on TAO: Facebook’s Distributed Data Store for the Social Graph).
User Behavior Measurement (UBM), which tracks how changing a knob’s state impacts business metrics like “Video Watch Time”.

The first evaluation of Defcon’s impact is at the Product-level. By turning off progressively more business-critical functionality, the system has a greater impact on Meta’s resource usage. Entirely turning off critical features (aka “Defcon Level 1”) saves a large amount of capacity, but also significantly impacts critical business metrics.
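One way to picture this level-based escalation is the sketch below. It assumes, purely for illustration, that levels are integers where a lower number means more business-critical, and that declaring level N turns off every knob whose level is N or higher. The knob names and the `knobs_to_disable` helper are hypothetical, and the paper may define the mapping differently.

```python
def knobs_to_disable(knob_levels: dict[str, int], declared_level: int) -> list[str]:
    """Return knob names to turn off, assuming level >= declared_level means 'disable'."""
    return sorted(name for name, level in knob_levels.items() if level >= declared_level)


# Hypothetical knobs: level 4 is the least critical, level 1 the most critical.
knob_levels = {
    "video.autoplay": 4,
    "feed.related_videos": 3,
    "search.typeahead": 2,
    "feed.ranking": 1,
}
```

Under these assumptions, `knobs_to_disable(knob_levels, 4)` selects only the least critical knob, while `knobs_to_disable(knob_levels, 1)` selects every knob, matching the intuition that Level 1 saves the most capacity at the greatest cost to business metrics.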

Defcon is next evaluated for its ability to temporarily decrease capacity required of shared infrastructure. As discussed in a previous paper review of Scaling Memcache at Facebook, Meta uses Memcache extensively. By turning off optional features, oncallers are able to decrease load on this type of core system.

Next, the research describes how Meta can decrease capacity requirements by turning off knobs in upstream systems with dependencies on other Meta products. For example, turning off Instagram-level knobs decreases load on Facebook, which ultimately depends on TAO, Meta’s graph service. Testing knobs outside of incident response surfaces resource requirements from these interdependencies.

The Defcon paper describes a protocol for forcing Meta systems into overload conditions, and testing the impact of turning progressively more business-critical features off. By ramping user traffic to a datacenter, these experiments place increasing load on infrastructure - turning knobs off then alleviates load.

Conclusion

The Defcon paper describes a framework deployed at scale in Meta for disabling features in order to mitigate overload conditions. To reach this state, the authors needed to solve technical challenges of building the system and to collaborate with product teams to define feature criticality - in some ways, the latter seems even more difficult. The paper also mentions issues with maintainability of knobs. On this front, it seems like future work could automate the process of ensuring that knobs cover features inside of deployed code. Lastly, I’m looking forward to learning more about Defcon’s integration with other recently published Meta research, like the company’s capacity management system.

Towards an Adaptable Systems Architecture for Memory Tiering at Warehouse-Scale

2023-06-29T00:00:00-07:00

This is one in a series of papers I’m reading from ASPLOS. These paper reviews can be delivered weekly to your inbox, or you can subscribe to the Atom feed. As always, feel free to reach out on Twitter with feedback or suggestions!

Towards an Adaptable Systems Architecture for Memory Tiering at Warehouse-Scale

Applications running in datacenter environments require resources to operate, like dynamic random access memory (DRAM). DRAM is both expensive and in high demand. As alternatives to DRAM emerge, new approaches to trading off performance for cost become available to datacenter applications - specifically, the paper mentions Intel Optane, which offers significant cost savings.

To use this new resource type while limiting the impact to application performance, the authors proposed, built, and deployed at scale a new system called Transparent Memory Tiering System (TMTS). The approach interacts with Google’s Borg scheduler to adaptively move an application’s in-memory data from high performance memory to lower cost mediums, an approach it calls memory tiering - based on usage, the system “promotes” in-use memory pages into high performance memory, and “demotes” infrequently-used memory to lower-performance mediums.

When deployed at scale, TMTS replaced 25% of DRAM with lower cost solutions, while incurring little performance impact to the vast majority of applications.

What are the paper’s contributions?

The paper makes three main categories of contributions:

Design and implementation of a memory tiering system.
A testing methodology for evaluating changes to the system at scale.
Lessons and evaluation from running the implementation in production.

How does the system work?

System metrics

The paper discusses the key tradeoff the system needs to make between potential cost savings and performance degradation. For example, applications on the critical path of user requests (which the paper calls high importance latency sensitive (HILS)) are more sensitive to latency and performance impact than batch workloads.

Furthermore, for a tiered memory system to deliver on its promise of cost savings, it needs to strike the right balance in using lower performance hardware - if the cluster scheduler doesn’t run jobs on the lower performance hardware, the resources intended to save cost will sit unutilized (meaning the potential cost savings will not materialize, even though the hardware has already been paid for). On the other hand, if the scheduler assigns latency-sensitive applications to lower performance hardware, performance will suffer.

Using these two considerations, the paper defines two metrics to measure the success of memory tiering:

Secondary Tier Residency Ratio (STRR) represents the “fraction of allocated memory residing in tier2 [lower performance memory]”.
Secondary Tier Access Ratio (STAR) “is the fraction of all memory accesses of an application directed towards pages resident in tier2”. This is a proxy for application performance impact because an application accessing lower tier memory will likely incur higher latency.

In summary, the goal of the system is to maximize usage of cheaper/lower performance memory (represented via STRR) while minimizing negative impact to application performance (via STAR).
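To make the two metrics concrete, here is a minimal sketch that computes them from per-application counters. The function and parameter names are hypothetical, and the sketch assumes only two tiers, so "all memory" is simply tier1 plus tier2.

```python
def strr(tier1_bytes: int, tier2_bytes: int) -> float:
    """Secondary Tier Residency Ratio: fraction of allocated memory resident in tier2."""
    total = tier1_bytes + tier2_bytes
    return tier2_bytes / total if total else 0.0


def star(tier1_accesses: int, tier2_accesses: int) -> float:
    """Secondary Tier Access Ratio: fraction of memory accesses that hit tier2 pages."""
    total = tier1_accesses + tier2_accesses
    return tier2_accesses / total if total else 0.0
```

For instance, an application with 75 units of memory in tier1 and 25 in tier2 has an STRR of 0.25; if 10 of its 1,000 sampled accesses land in tier2, its STAR is 0.01. A high STRR paired with a low STAR is the outcome the system is aiming for.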

System Architecture

The memory tiering system is divided into four levels of abstraction: hardware, kernel, userspace, and the cluster scheduler.

At the bottom of the stack is the underlying hardware, made up of several types of memory devices with different performance and cost profiles.

Immediately above the hardware is the kernel, which abstracts the hardware into different tiers (tier1 for higher performance, tier2 for lower performance) and operates on hardware abstractions like memory pages. Inside the kernel, the system uses daemons to monitor memory accesses, building the dataset that will inform the promotion/demotion process.

Above the kernel is user space, where a management daemon (ufard) makes demotion and promotion policies for memory between tier1 and tier2, then conveys changes in policies to the kernel. The promotion/demotion policy can change over time based on information that the kernel provides to this userspace daemon - for example, information on how many pages were not accessed recently. Other components of the system also run in user space, including a scheduler component and the applications themselves.

At the top layer, the cluster scheduler makes decisions about where to run applications based on their memory needs and performance of the system. The paper describes how the scheduler consumes information about which tiers of memory are available on which machines to make placement decisions.

Hot page promotion and cold page demotion

A key component of the memory system is demoting cold pages to low-cost memory, and promoting hot pages to higher performance resources.

A page is classified as “cold with threshold t if it has not been accessed in the prior t seconds”, but the policy about when to demote pages to cold memory is dependent on the needs of the application.

An application’s policy can also be adaptive, for example:

“the kernel provides the userspace daemon a cold age histogram - the frequency distribution of inter-access interval duration. It answers questions such as how many pages were not accessed for at least 2 minutes. The policy engine uses this to identify application access patterns and adjust parameter values.”
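The kind of question the cold age histogram answers can be sketched as follows. This is a simplification: it assumes we already have, for each page, the number of seconds since its last access, and it reports cumulative counts per threshold rather than the kernel’s actual histogram format. All names and sample values are hypothetical.

```python
def pages_cold_for(idle_seconds: list[float], threshold_s: float) -> int:
    """Count pages that have not been accessed in the prior threshold_s seconds."""
    return sum(1 for idle in idle_seconds if idle >= threshold_s)


def cold_age_histogram(idle_seconds: list[float],
                       thresholds_s: list[float]) -> dict[float, int]:
    """Map each threshold t to the number of pages that are cold with threshold t."""
    return {t: pages_cold_for(idle_seconds, t) for t in thresholds_s}


# Hypothetical per-page idle times, in seconds since the last access.
idle = [5, 40, 130, 300, 900]
hist = cold_age_histogram(idle, [30, 120, 600])
```

Here `hist[120]` answers "how many pages were not accessed for at least 2 minutes"; a policy engine could compare counts across thresholds to decide how aggressively to demote.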

To promote pages from tier2 to tier1, the tiered memory system relies on two approaches: proactive promotion and periodic scanning.

Proactive promotion aims to move pages from tier2 to tier1 as soon as they are likely to receive more accesses, rather than waiting until a surge of access occurs (which would introduce application latency). This proactive process is informed by signals from hardware, in particular the Performance Monitoring Unit (PMU) - for example, sampling last level cache miss events provides insights into which data is actively being used.

Periodic scanning complements the sampling-based approach by scanning pages over repeating periods and promoting them based on how many consecutive “scan periods” the page has been accessed in. This approach is more accurate, but has higher overhead. The system also aims to limit thrashing - if a page is potentially going to be demoted, but was recently promoted, the demotion process waits for a longer time period before taking action.
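The scan-based promotion and the thrashing guard can be sketched as a small per-page state machine. This is an illustration under stated assumptions, not the kernel implementation: the `PageState` fields, the `scan` function, and every threshold value (`promote_after`, `demote_after`, `thrash_penalty`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PageState:
    tier: int = 2                    # 1 = high performance, 2 = lower cost
    consecutive_hot_scans: int = 0   # scan periods in a row with an access
    idle_scans: int = 0              # scan periods in a row without an access
    recently_promoted: bool = False  # set on promotion, cleared on demotion


def scan(page: PageState, accessed: bool, promote_after: int = 2,
         demote_after: int = 4, thrash_penalty: int = 2) -> None:
    """Apply one scan period's observation to a page (all thresholds hypothetical)."""
    if accessed:
        page.consecutive_hot_scans += 1
        page.idle_scans = 0
    else:
        page.consecutive_hot_scans = 0
        page.idle_scans += 1

    if page.tier == 2 and page.consecutive_hot_scans >= promote_after:
        page.tier, page.recently_promoted = 1, True
    elif page.tier == 1:
        # A recently promoted page must stay idle longer before it is demoted.
        limit = demote_after * thrash_penalty if page.recently_promoted else demote_after
        if page.idle_scans >= limit:
            page.tier, page.recently_promoted = 2, False
```

With these defaults, a tier2 page accessed in two consecutive scan periods is promoted, and it then needs eight idle scan periods (rather than four) before it is demoted again.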

These monitoring processes use a combination of perf_event_open and Berkeley Packet Filter (BPF) in the kernel which “optimize[s] the collection of tier2 hot page ages and their page addresses from the in-kernel page.”

How is the research evaluated?

System Evaluation

Memory tiering is deployed in production and is constantly evolving to perform more effectively. To evaluate the system, the paper considers four areas: memory utilization / task capacity, residency ratios, access ratios / bandwidth, and overall performance impact.

Memory utilization / task capacity represents the impact that the system has on individual applications - if an application is performing poorly (for example, serving requests with high user-facing latency), the scheduler will either schedule more tasks for the application (increasing task capacity) or put fewer tasks on the impacted machines (leading to lower utilization, as there will be machines with fewer tasks). The paper presents data showing that memory utilization and task capacity aren’t significantly impacted by the memory tiering system.

Residency ratios gauge how successful the system is at storing infrequently used pages in tier2 memory. First, the paper shows that the Secondary Tier Residency Ratio (STRR) is close to the percentage of deployed tier2 hardware, demonstrating effective use of tier2 memory. Additionally, the paper includes data on the ratio of cold memory stored in tier2, which is between 50 and 75% across all clusters - the paper compares this to swap based solutions which reach 10-25% memory coverage.

Access ratios / bandwidth are used to understand if the pages in tier2 are accessed frequently (which would impact performance), and whether accesses result in promotions/demotions - “about 80% of tier2 bandwidth is due to applications accessing pages resident in that tier, promotion being about 1/3 of the remaining and demotion 2/3”. The paper argues, “This suggests the system is effective in selecting pages for demotion while avoiding thrashing/ping-pong effects.”
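Working through the quoted split makes the proportions easier to see. The snippet below treats "about 80%" as exactly 80% for the sake of the arithmetic; the variable names are just for illustration.

```python
from fractions import Fraction

app_share = Fraction(80, 100)                 # "about 80%" from the quote above
remaining = 1 - app_share                     # bandwidth spent on page migration
promotion_share = remaining * Fraction(1, 3)  # promotion is about 1/3 of the rest
demotion_share = remaining * Fraction(2, 3)   # demotion is about 2/3 of the rest

print(f"promotion: {float(promotion_share):.1%}")  # prints "promotion: 6.7%"
print(f"demotion: {float(demotion_share):.1%}")    # prints "demotion: 13.3%"
```

In other words, roughly 7% of tier2 bandwidth goes to promotions and roughly 13% to demotions, so pages are moved down about twice as often as they are pulled back up.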

Overall performance impact is core to the tradeoffs that the tiered memory system is making, and the paper uses instructions per cycle (IPC) to measure it. The authors were targeting a performance impact of 5%, but TMTS impacted a subset of applications more severely.

Digging deeper into the performance impact of tier2 memory, one example discussed by the paper is hugepages. Hugepages can take up large amounts of memory, but accesses to a small part of a hugepage can cause it to be promoted. Demoting hugepages is also difficult because, while the system was capable of breaking up hugepages into smaller components when demoting, a “mostly cold” hugepage wouldn’t be demoted at all. Because many hugepages weren’t demoted, they occupied space in tier1 memory, lowering tier2 usage. The authors describe two solutions: “migrating hugepages intact, without breaking them apart into 4KB pages on demotion” and compacting hugepages to produce more entirely cold pages (which can then be migrated to tier2).

Policy Evaluation

Beyond the performance of the system itself, the paper also considers the impact that different policies can have on its northstar metrics (STRR and STAR).

Demotion policies are capable of changing the amount of cold memory in tier2 by trading off performance, for example by executing policies more frequently (leading to cold pages moving to tier2 faster). The paper describes tweaking demotion policies according to whether an application serves high importance latency sensitive (HILS) traffic. Lengthening the time that pages used by HILS applications take to demote to tier2 had minimal impact on the percent of tier2 used (STRR), but a significant performance impact (represented via STAR, the fraction of accesses that go to pages in tier2).

The paper also discusses promotion policies, and argues that applications are actually more sensitive to situations when a page is not yet promoted to tier1, but is frequently accessed. The paper considers three policies to address this concern: 60s promotion (two 30-second scans), 30s promotion (one 30-second scan), and a combination of 60s promotion with PMU-based sampling. Effectively all policies have the same outcome with respect to memory ending up in tier1, but the combined approach (described earlier in the paper) is able to promote pages faster because the PMU-based sampling data source provides faster information on accesses to tier2 memory.

Conclusion

I found the tiered memory paper interesting because it illustrates the tradeoffs between performance and cost for hardware resources deployed at scale. The research also builds on previous work, but uniquely includes many lessons from production - for example, evaluating policies based on their impact to north star metrics gathered from the wild. Lastly, the system described by the paper is enabled by integrating with a robust, extensible scheduler capable of making informed decisions about job placement. This abstraction allowed successful deployment at scale without involving individual application developers, dramatically decreasing the time to deployment.