How Cruise Migrated Monorepo CI to Kubernetes
I have been working on Cruise’s CI infrastructure for the past several years. Along with witnessing the business’s ups and downs, I helped drive the Cruise CI platform’s evolution through multiple iterations, and it has been a great learning experience. Working in the CI space, we came across many interesting and challenging problems that I believe are common across the industry, and I have always wanted to write them down and share them. If you work in the same field, are looking for solutions to similar problems, or have already solved the same problems in a different way, I would be very interested to hear about your experience.
Cruise V2 CI Infrastructure Primer
Back in 2020, Cruise built a custom continuous integration system known internally as V2, running on top of Buildkite. Before that, we were using a self-hosted CircleCI Enterprise instance. But as our engineering organization grew, so did the cost, reliability, and security pressures. We needed more control and flexibility than an off-the-shelf system could offer. Buildkite gave us that control and allowed us to optimize for our unique scale.
Why We Needed V2
Cruise operates at two extremes:
A massive Monorepo (called cruise/cruise) holding our core autonomous vehicle and AI software, plus supporting frameworks such as simulation, model training, and testing. About 500 engineers commit to it, merging hundreds of pull requests and triggering tens of thousands of CI jobs each day. All of this goes through a single merge gate, so if CI breaks, hundreds of people are blocked.
Thousands of smaller repositories supporting cloud services, internal tools, OS and mobile development. These don’t directly power the AV stack but keep the business running smoothly.
The V2 system gave us a way to serve both worlds while balancing fast growth and infrastructure stability.
If you are interested in more details of the early story, feel free to leave comments and let me know; I will find some time to dig into it in a separate post. This post focuses on the whys and hows of migrating the V2 system to the Kubernetes-based V3 system. I will walk through the challenges we came across with the V2 system, how we improved on them in V3, the data we collected, and the topics we plan to explore next.
The V2 CI System
One of our key design choices was to run each CI job on a fresh ephemeral virtual machine (VM). Every job gets a new VM, and the VM is destroyed after the job finishes. This simplified the architecture and eliminated a major pain point from the old system:
Dirty disks and noisy neighbors. Monorepo builds can generate hundreds of gigabytes of data. Over time, leftover data would fill disks, causing “out-of-disk” errors and failed builds. With ephemeral VMs, each job starts from a clean state and avoids these failures.
Fig 1 - High Level Diagram for using ephemeral VMs
Monorepo Cache in V2 System
Ephemeral VMs solved the “dirty disk” problem but introduced another one: no local cache between jobs.
Our monorepo is now ~300GB, mostly large file storage (LFS) objects. A full git clone can take 45+ minutes because of LFS smudging.
We also use Bazel as our build tool, which keeps its own cache under ~/.cache/bazel. In ephemeral VMs, these caches are lost after each job.
To speed things up, we built a persistent disk cache that stores and refreshes both git/LFS data and Bazel artifacts daily. This significantly cut down start-up time for new jobs.
Challenges
After the V2 CI system launched, we migrated all workloads onto it, and its availability increased from <80% to 95% over the following two years. At the same time, several new problems surfaced:
Queue time is long - for each job, the V2 system creates a new VM, and VM creation is slow in a cloud environment: in Google Cloud, it typically takes 3-5 minutes. As a result, whenever there is a spike of jobs or there are few idle agents, new jobs queue up inside the system waiting for VMs to be created. This is a poor customer experience; no customer wants to wait several minutes for their job to start.
Availability is capped at 95% - in our experience, the availability of the GCE control plane (VM provisioning) is around 95%, much lower than the data plane’s availability for already-running VMs (99.99%). Because the V2 design made the CI data plane depend on the GCE control plane, CI availability was limited by the control plane’s availability. This showed up over the past years: all of our SEV-1 incidents were caused by GCE outages.
Useful artifacts can’t be cached - the V2 design served its purpose of keeping the CI execution environment clean, with no dirty data carried over from previous jobs, but it also threw away good data: common Docker images, Bazel caches, LFS objects, and so on. If we could keep that data, it would create more opportunities to optimize CI build latency through caching.
Constraints
On top of the above challenges, there are constraints we have to satisfy for the business. First, thousands of Buildkite pipelines are already onboarded to the V2 platform, and we want to keep those pipelines working as-is; asking all of our customers to modify their existing pipeline definitions would be cost-prohibitive. This ruled out Buildkite Agent Stack K8s, which requires a new Buildkite Kubernetes plugin that defines CI jobs as Pod specs. Second, for the same reason, one of the most commonly used Buildkite plugins in our pipelines is docker-buildkite-plugin, which lets a CI job run inside a customer-defined Docker container image. It is a very useful plugin, so we have to keep it working.
The V3 System
Containerize the CI Buildkite Agent
Containers have become the gold standard for running cloud-native applications, and more and more product and infrastructure services have been migrated (or are being migrated) to Kubernetes. Running Cruise CI on Kubernetes became an appealing approach that we started to consider seriously. What we liked about containers: 1) they let us replace recycling VMs with recycling containers, and creating a container is much faster than creating a VM; 2) they let us keep the VMs (Kubernetes nodes) around much longer after a CI job finishes, so during GCE control plane outages when VMs cannot be provisioned, we can still run jobs on our existing pool of Kubernetes nodes; 3) they create an abstraction layer that lets us choose which data to clean up (as part of container deletion) and which data to keep (for caching).
Sysbox as Container Runtime
By default, GKE uses containerd with the runc runtime. However, because CI jobs need to run Docker, this creates a Docker-in-Docker problem, and running a Docker daemon inside another Docker container is largely discouraged. To tackle this, we use Sysbox as the container runtime.
Sysbox is an advanced container runtime that allows containers to run system-level software like Docker without needing privileged mode. It enhances security by providing better isolation and user-namespace support, making containers act more like lightweight VMs. Sysbox integrates well with Kubernetes, enabling efficient and secure execution of complex CI/CD workflows. By using the RuntimeClass, we’re able to register Sysbox as a custom runtime handler, and launch pods backed by Sysbox runtime using the “runtimeClassName” pod spec field.
To configure Sysbox as the container runtime, you create a RuntimeClass resource in the cluster that names Sysbox as the runtime handler, then set the runtimeClassName field in your pod specification to the name of that RuntimeClass. This ensures that pods are scheduled on nodes where Sysbox is installed and used as the container runtime, enabling features like running Docker inside containers without privileged mode, better isolation, and enhanced security. This setup is particularly useful for our case, where we need to run docker-buildkite-plugin in a backward-compatible way.
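As a concrete sketch, the RuntimeClass looks roughly like the following. The handler name and node label here follow the public Sysbox-on-Kubernetes installation; treat them as assumptions if your install differs:

```yaml
# Sketch: registers Sysbox as a runtime handler. Assumes the public
# sysbox-deploy-k8s installation, which labels Sysbox-enabled nodes with
# sysbox-install=yes and configures the "sysbox-runc" handler in containerd.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: sysbox-runc
handler: sysbox-runc
scheduling:
  nodeSelector:
    sysbox-install: "yes"
```

A pod then opts in by setting runtimeClassName: sysbox-runc in its spec, which also steers scheduling onto the Sysbox-enabled nodes via the nodeSelector above.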
V3 Monorepo Cache
In Kubernetes, we leveraged PersistentVolumes (PV), PersistentVolumeClaims (PVC), and VolumeSnapshots to implement the V2 persistent disk cache for the monorepo. A scheduled job pulls down the repository and builds the Bazel cache daily onto a PV (backed by a GCE Persistent Disk). After the cache is prepared, our automation creates a VolumeSnapshot on top of the PVC, to be consumed later by the Buildkite Agent pods.
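As an illustrative sketch (the snapshot and PVC names are hypothetical, not our actual resource names), the automation creates a VolumeSnapshot that points at the cache PVC:

```yaml
# Hypothetical example: snapshot the daily-refreshed monorepo cache PVC so
# agent pods can restore from it. Names are illustrative only.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: pd-snapshot-2025-10-20   # hypothetical; one per daily refresh
  namespace: ci
spec:
  volumeSnapshotClassName: pd-snapshotclass   # the class shown later in this section
  source:
    persistentVolumeClaimName: monorepo-cache # hypothetical cache PVC name
```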
During PoC testing, we found it slow to restore PVs from a VolumeSnapshot backed by a GCE Persistent Disk snapshot. Snapshot-restore latency is critical because it affects how fast a new Buildkite agent comes online. One way to optimize it is to use GCE VM images as the backing store for the VolumeSnapshot; to achieve that, we set snapshot-type to images. The following is an example of the VolumeSnapshotClass we use:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
name: pd-snapshotclass
namespace: ci
driver: pd.csi.storage.gke.io
deletionPolicy: Delete
parameters:
# Since we care more about fast restore, use image snapshots type instead of the default disk snapshot
snapshot-type: images
image-family: preloaded-data
Once the snapshots are taken, you will find them with the following command:
$ kubectl get volumesnapshot
NAME READYTOUSE SOURCEPVC SOURCESNAPSHOTCONTENT RESTORESIZE SNAPSHOTCLASS SNAPSHOTCONTENT CREATIONTIME AGE
snapshot-1 true pvc-1 snapcontent-1 1024Gi pd-snapshotclass snapcontent-1 5m 5m
snapshot-2 true pvc-2 snapcontent-2 1024Gi pd-snapshotclass snapcontent-2 10m 10m
snapshot-3 true pvc-3 snapcontent-3 1024Gi pd-snapshotclass snapcontent-3 15m 15m
Buildkite Agent Operator
We built a custom Kubernetes operator to orchestrate the Buildkite Agent containers. While a built-in Kubernetes workload type (e.g., a Deployment) meets most of our needs for running the Buildkite Agent container in non-monorepo cases, it falls short when we need the agent container to automatically pick up an up-to-date monorepo cache whenever one is available. Within a pod spec, the volume data source has to be a fixed value, but the monorepo cache is regenerated daily, and we don’t want to pin the monorepo cache to a fixed volume snapshot; otherwise CI job latency would get slower day by day.
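The snapshot-lookup step can be sketched as follows. This is a minimal illustration, not Cruise’s actual operator code; the field names mirror what `kubectl get volumesnapshot` reports (READYTOUSE, CREATIONTIME):

```python
# Minimal sketch: pick the newest ready-to-use VolumeSnapshot so that newly
# created agent pods always mount the freshest monorepo cache. Illustrative
# only; field names mirror the VolumeSnapshot status (readyToUse, creationTime).
def latest_ready_snapshot(snapshots):
    """Return the name of the newest snapshot that is ready to use, or None."""
    ready = [s for s in snapshots if s.get("readyToUse")]
    if not ready:
        return None  # fall back: schedule the pod without a cache volume
    # RFC 3339 timestamps sort lexicographically, so max() finds the newest.
    newest = max(ready, key=lambda s: s["creationTime"])
    return newest["name"]
```

The operator plugs the returned name into the pod’s volume data source, so each day’s pods automatically pick up that day’s cache.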
At a high level, the Buildkite agent operator keeps polling the number of pending and running jobs from the Buildkite metrics endpoint, calculates the desired number of Buildkite agent pods, creates the pods, and cleans up finished pods.
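The scaling calculation can be sketched roughly like this. It is illustrative only: the real operator reads job counts from Buildkite’s agent metrics endpoint, and the min/max bounds are hypothetical per-queue knobs:

```python
# Rough sketch of the operator's desired-pod math (not Cruise's actual code).
# Each agent runs a single job and then exits, so we want roughly one pod per
# pending or running job, clamped to hypothetical per-queue bounds.
def desired_agent_pods(scheduled_jobs, running_jobs, min_pods=0, max_pods=500):
    desired = scheduled_jobs + running_jobs
    return max(min_pods, min(max_pods, desired))

def pods_to_create(desired, existing_unfinished_pods):
    # Only top up the pool; busy pods are never deleted, and finished pods
    # are garbage-collected separately.
    return max(0, desired - existing_unfinished_pods)
```

Finished pods are then garbage-collected in a separate pass, so the pool naturally shrinks when the queue drains.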
AgentSet
To configure the BK agent controller with settings such as queue, resources (CPU/memory/storage), and environment variables, we created an AgentSet CRD. The AgentSet Custom Resource Definition (CRD) is created and configured at the cluster level, and each agent queue has a corresponding AgentSet custom resource. The following shows the AgentSet for the prod-uci-v3-medium queue:
apiVersion: buildkite.uci.robot.car/v1alpha1
kind: AgentSet
metadata:
name: prod-uci-v3-medium # name the AgentSet per queue
spec:
resources: # map to pod’s resources
requests:
cpu: 1
memory: 2Gi
ephemeral-storage: 500Gi
limits:
cpu: 1
memory: 2Gi
ephemeral-storage: 500Gi
queue: prod-uci-v3-medium
image: gcr.io/cruise-gcr-dev/uci/bk-agent-cc:{{ version }}
command: # maps to the Pods’ command
- /usr/local/bin/entrypoint.sh
- /usr/bin/buildkite-agent
- start
- --tags
- queue=prod-uci-v3-medium,version={{ version }},worker_env={{ env }}
- --name
- "%hostname"
- --disconnect-after-job
snapshot: # enable the PD snapshot (cruise/cruise only)
labels:
uci.robot.car/repoName: cruise
env: # maps to the Pod’s environment variables
- name: BUILDKITE_AGENT_TOKEN
valueFrom:
secretKeyRef:
key: token
name: bk-agent-token
The controller is responsible for 1) polling and calculating autoscaling metrics, 2) creating the corresponding number of BK agent Pods, 3) garbage-collecting finished BK agent Pods, and 4) optionally, for monorepo queues, looking up the latest volume snapshot and configuring the data source to use that snapshot as the monorepo cache.
The controller will create a Pod with the following Pod spec:
apiVersion: v1
kind: Pod
metadata:
namespace: ci
spec:
# use sysbox-runc as the container runtime, also works as node selector that
# guarantees the correct nodes with sysbox installed
runtimeClassName: "sysbox-runc"
containers:
- name: bk-agent
# container image for the BK agent, we need one image for c/c
# and one image for non-c/c
image: gcr.io/cruise-gcr-dev/uci/bk-agent-cc:3c67177-dirty
resources:
requests:
cpu: 16 # defines the CPU/MEM/Local SSD storage requirements, mapped
memory: 128Gi # from current queue spec
ephemeral-storage: 300Gi
limits:
cpu: 16
memory: 128Gi
ephemeral-storage: 300Gi
command:
- /usr/local/bin/entrypoint.sh
- /usr/bin/buildkite-agent
- start
- --tags
- queue={{ .Values.queue }},version={{ .Values.version }}
- --name
- "%hostname"
env:
- name: BUILDKITE_AGENT_TOKEN
valueFrom:
secretKeyRef:
key: token
name: bk-agent-token
volumeMounts:
- name: cc-pd-cache # (optional) using the PD cache (cruise/cruise only)
mountPath: /var/lib/buildkite-agent
volumes:
- name: cc-pd-cache
ephemeral:
volumeClaimTemplate:
spec:
dataSource:
apiGroup: snapshot.storage.k8s.io
kind: VolumeSnapshot
name: pd-snapshot-2025-10-xx-yy # (optional) the snapshot to use; this field changes from day to day
accessModes: # (cruise/cruise only)
- ReadWriteOnce
resources:
requests:
storage: 1Ti
storageClassName: "premium-rwo"
volumeMode: Filesystem
serviceAccountName: cc-bk-agent-sa # for CI worker identity, need one for c/c and one for non-c/c
tolerations: # the BK agent container needs to run on the dedicated node pool
- key: "dedicated" # which has sysbox-runc installed
  operator: "Equal"
  value: "uci"
  effect: "NoSchedule"
We use the public manifest to install Sysbox on the node pool with toleration dedicated=uci. In the above pod spec, runtimeClassName is set to sysbox-runc, which matches the runtime class name defined by the public Sysbox installation. Also note that we use an ephemeral volume for the monorepo cache; this is handy because it avoids managing the volume lifecycle separately. When the pod is deleted, the corresponding PV and PVC are deleted by Kubernetes automatically.
Putting It All Together
Putting it all together, the following is a high-level architecture diagram:
Fig 2 - High Level Diagram for using Kubernetes
The top half shows the mechanics of how the Buildkite agent operator works, the bottom shows how the snapshotting process works.
Migration
To mitigate the risk of the project and create leeway for unanticipated use cases, we created several new queues for the Buildkite agents that run on Kubernetes. This strategy allowed us to reduce the risk of breaking unexpected use cases and provided a smoother transition experience.
Limitations
While the V3 design supports almost all of our use cases, several edge cases are not yet covered, such as GPU, KVM, and FUSE. There is a known GitHub issue reported upstream asking for GPU device support within Sysbox containers. KVM is an interesting situation: it requires host-level root privilege to access the /dev/kvm device, but because of user-namespace mapping inside a Sysbox container, the "root" process in the container is not real root on the host, so the container hits permission issues when accessing the KVM device. For such use cases, we kept a small pool of V2 queues.
Measuring the improvements
Queue time latency is one of the key metrics we use to validate the outcome of the new design, and we observed a big improvement. A few typical queues are shown in the following graphs; the left side is the V2 system’s latency, and the right side is the V3 system’s.
Fig 3 non-cruise/cruise queue time latency (not use persistent disk cache)
Fig 4 for cruise/cruise queue time latency (with persistence disk cache)
In both cases the max latencies dropped from ~3 minutes to <30 seconds. Notably, there are still some large spikes in the V3 system; those are problems to be tackled as next steps.
Summary
In summary, we walked through the evolution of Cruise’s CI infrastructure from a V2 system using ephemeral VMs to a V3 system leveraging Kubernetes. The V2 system, built on Buildkite, addressed issues like dirty disks and monorepo cache latency but faced challenges with queue times, availability, and caching useful data. The V3 system introduces containerization with Kubernetes, using Sysbox as a container runtime to solve Docker-in-Docker issues, and a new Buildkite operator to orchestrate Buildkite agent containers. The V3 system is able to reduce the queue time latency, improve platform availability and it opens the door for future improvements. The team is excited about this new platform and the potential it brings in the future.
One of the ideas we are most excited about is caching artifacts in a managed fashion. With Kubernetes as the abstraction layer, the platform can now choose which artifacts to cache and which to clean up after each CI job. As a next step, the team is already working on caching a variety of artifacts such as LFS objects and Bazel caches.
Acknowledgements
Thanks to Tianshi Chen, who contributed to the project and made it happen, and to Cruise leadership for supporting this work. Special thanks to Zhimin Xiang for early support and ideation.