From Kernel-Bound Networking To Programmable Transport: Making Linux Ready for AI-Scale Infrastructure

Linux is at the heart of every data center. We built the layer that makes its networking programmable without making it fragile. 

Controlling how data moves across tens of millions of servers used to require months of kernel updates. Billions of people depend on that path, so the margin for error is minimal. We changed that by building a system with rollback and observability built in.

By Prankur Gupta, Software Engineer, Meta

Meta runs its major products in its own data centers, not on public cloud infrastructure. Every request, message and ad transaction depends on the host-level networking path. This path is where network protocols and behavior get tuned for specific hardware, topology and services – congestion control algorithms, window sizes, pacing, MTU and more.

In one regional disable test, turning the transport-tuning system off for 75 minutes reduced the rate of machine-learning inference queries by 29% and increased packet drops at top-of-rack switches by 363%. This infrastructure sits directly in the production path for billions of users.

We run a shared fleet serving services with fundamentally different traffic profiles. One generic transport behavior cannot serve all of them well. That was the problem we built NetEdit around.

Most of my work has been in the host networking stack: Linux networking, eBPF, congestion control, transport protocols and production infrastructure for AI workloads. NetEdit was not a single-layer problem. It required understanding what the kernel could safely expose, what eBPF could safely run and what transport behavior actually needed to change.

Congestion control is decades old, but the networks it runs on are not

Congestion control keeps a data center network usable. It decides how fast servers push data and when they need to back off. In Linux, this logic lives inside the kernel. Changing it across millions of machines is slow, risky and expensive.

AI-scale workloads made the gap impossible to ignore: they exposed limits in the existing transport stack that we could no longer work around with static tuning or slow kernel-rollout cycles. New services generate traffic in bursts and volumes that the existing stack was not designed for.

The process for changing the transport stack had not changed much: propose a kernel patch, wait for review, test it across the fleet and roll it out in stages. That takes months, sometimes longer. We needed weeks.

Why existing tools weren’t enough

From my work across the host networking stack, eBPF seemed like the right primitive. It lets you run safe programs inside the kernel without rebuilding it. But eBPF did not give us lifecycle management, service-placement integration, upgrade safety or fleet-wide rollback.

Most production eBPF use cases are single-purpose, low-velocity and on-demand: observability, security, packet filtering. NetEdit operates in a different regime – multi-program, high-velocity, always-on and transport-critical.

We evaluated existing open-source eBPF managers, including Cilium, bpfd, l3af and others. None of them supported what we needed: BPF-to-BPF decoupling, lifecycle integration with service placement or the coordination required to manage transport behavior across a fleet of this size.

The harder problem was coordinating across hook points, avoiding disruption during upgrades, adapting configuration dynamically and keeping pace with service placement. We had to do that across millions of servers and multiple kernel versions, while keeping backward and forward kernel compatibility intact. There was no existing system we could build on.

Making transport behavior programmable

NetEdit is the orchestration layer between network policy and the Linux kernel. It lets us deploy, test and centrally manage transport-tuning and congestion-control changes instead of treating every network improvement as a kernel-release project. The design and operational findings are documented in a paper at ACM SIGCOMM 2024, a major computer-networking conference.

A key abstraction is the tuningFeature: a collection of eBPF programs, often spanning multiple hook points such as sockops, struct_ops, TC and sockopt, that together implement one logical network function. Because one feature usually spans several programs, 180 programs does not mean 180 features. Managing them requires more than deploying eBPF code.
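To make the grouping concrete, here is a minimal Python sketch of how a feature could be modeled as a set of programs across hook points. It is an illustration only, not NetEdit's actual data model: the class names, the `example_pacing` feature and its program names are all hypothetical. Only the four hook types come from the description above.

```python
from dataclasses import dataclass, field
from enum import Enum


class HookPoint(Enum):
    # The four hook types named in the text.
    SOCKOPS = "sockops"
    STRUCT_OPS = "struct_ops"
    TC = "tc"
    SOCKOPT = "sockopt"


@dataclass(frozen=True)
class BpfProgram:
    name: str        # hypothetical program name
    hook: HookPoint  # where the program attaches


@dataclass
class TuningFeature:
    """One logical network function made of several eBPF programs."""
    name: str
    programs: list = field(default_factory=list)

    def hooks(self) -> set:
        # Distinct hook points this feature touches.
        return {p.hook for p in self.programs}


def count_programs(features: list) -> int:
    # Total programs across all features; usually larger than len(features).
    return sum(len(f.programs) for f in features)


# Hypothetical feature: one logical function, three programs, three hooks.
pacing = TuningFeature(
    name="example_pacing",
    programs=[
        BpfProgram("pacing_sockops", HookPoint.SOCKOPS),
        BpfProgram("pacing_cc", HookPoint.STRUCT_OPS),
        BpfProgram("pacing_tc", HookPoint.TC),
    ],
)
features = [pacing]
program_count = count_programs(features)
```

Here one feature accounts for three programs, which is why program counts and feature counts diverge. It is also why deployment and rollback make more sense at the feature level than at the program level.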

With this model, new features can be developed, deployed and rolled back without manually reconfiguring individual servers or waiting for a kernel rollout. Feature deployment time dropped from months to weeks because transport-tuning configuration – including congestion-control behavior and other networking tunables – was decoupled from the kernel release cycle.

The work behind the platform 

The Linux kernel was missing some of the connection points our model depended on. We built them, got them reviewed and accepted by the kernel community and then built the surrounding infrastructure that made them safe to use at fleet scale. Those interfaces are now part of the upstream Linux kernel rather than a Meta-only patch set.

The full-stack view mattered because the failures did not stay inside one layer. Some problems looked like eBPF problems but were really kernel-interface problems. Others looked like transport problems but depended on how programs were attached, upgraded, observed and rolled back across the fleet. If we had treated it as only an eBPF problem, we would have built the wrong abstraction.

Getting the platform safe enough to run at this scale meant building observability, auditing, staged rollout and rollback into the core from the start. Warm reboot was one of the critical pieces: it keeps eBPF objects attached during user-space restarts, so live connections are not disrupted. Without it, every NetEdit upgrade would have been a production risk. We also needed guarantees that a single bad deployment could not propagate silently across tens of millions of servers.
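The idea of a rollout that cannot propagate silently can be sketched in a few lines of Python. This is a simplified simulation under stated assumptions, not the production mechanism: the stage fractions, the `apply`, `revert` and `healthy` callbacks and the host names are hypothetical stand-ins for real deployment and health-check machinery.

```python
def staged_rollout(hosts, stages, apply, revert, healthy) -> bool:
    """Apply a change stage by stage; revert every touched host on the
    first unhealthy stage. Returns True only if all stages succeed."""
    done = []
    start = 0
    for fraction in stages:
        # Cumulative fraction of the fleet that should have the change.
        end = max(start, int(len(hosts) * fraction))
        for host in hosts[start:end]:
            apply(host)
            done.append(host)
        if not all(healthy(h) for h in done):
            # Roll back in reverse order and stop before the next stage.
            for host in reversed(done):
                revert(host)
            return False
        start = end
    return True


# Simulated fleet: host3 misbehaves once it runs the new version.
hosts = [f"host{i}" for i in range(10)]
versions = {h: "old" for h in hosts}
touched = []


def apply(host):
    versions[host] = "new"
    touched.append(host)


def revert(host):
    versions[host] = "old"


def healthy(host):
    return not (host == "host3" and versions[host] == "new")


ok = staged_rollout(hosts, [0.1, 0.5, 1.0], apply, revert, healthy)
```

In this run the first stage passes, the second stage reaches the unhealthy host, and the change is reverted before the remaining half of the simulated fleet is touched.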

Safety before features

Keeping policy and enforcement separate is what made the system maintainable. Policy is the decision about what the network should do, for which traffic and under which conditions. Enforcement is the mechanism that carries it out. When those two things are bound together inside kernel code, changing one always risks disturbing the other.

We built the safety infrastructure before we built the features. Observability, auditing and rollback are not optional in a system that changes transport behavior across millions of servers. A change that works correctly but cannot be measured or reversed still creates risk.

Lazy loading attaches a BPF program only when there is an active policy for a service on that host. It detaches the program when the policy or service disappears. This reduces CPU consumption by around 86% on average and 73% at the 99th percentile compared with attaching the same programs eagerly.
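The attach and detach rule above amounts to a reconciliation loop. The Python sketch below simulates it for a single host. It assumes a hypothetical policy shape (a mapping from service name to program name) and only tracks names in a set, where a real agent would load and attach BPF programs.

```python
from dataclasses import dataclass, field


@dataclass
class HostAttachments:
    """Tracks which programs are attached on one host (simulation only)."""
    attached: set = field(default_factory=set)

    def reconcile(self, running_services: set, policies: dict) -> None:
        # policies maps service name -> program name (hypothetical shape).
        # A program is wanted only if its service is running on this host.
        wanted = {
            prog for svc, prog in policies.items() if svc in running_services
        }
        for prog in wanted - self.attached:
            self.attached.add(prog)      # stand-in for a real attach
        for prog in self.attached - wanted:
            self.attached.discard(prog)  # stand-in for a real detach


# Hypothetical policies for two services; only "web" runs on this host.
policies = {"web": "cc_tuning", "ml": "pacing"}
host = HostAttachments()

host.reconcile({"web"}, policies)
after_placement = set(host.attached)

# The service leaves the host, so its program is detached again.
host.reconcile(set(), policies)
after_removal = set(host.attached)
```

The program for the service that never ran on this host is never attached, which is where the savings come from in a design like this.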

Shared maps address a different problem. Instead of letting each feature independently compute the scope of a connection, we compute it once, store the result in a shared map and reuse it across features. As the platform expands across regions, consistency matters. Independent computations can drift, and at this scale drift is hard to debug.
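A rough Python sketch of the compute-once pattern follows. A plain dict stands in for the shared BPF map. The connection key, the `classify_by_region` rule and the host naming scheme are assumptions made for illustration, not how scope is actually derived.

```python
from typing import Callable, Dict, Tuple

Connection = Tuple[str, str]  # (source host, destination host), hypothetical key


class SharedScopeMap:
    """Computes the scope of a connection once and serves it to every feature."""

    def __init__(self, classify: Callable[[Connection], str]) -> None:
        self._classify = classify
        self._scopes: Dict[Connection, str] = {}
        self.computations = 0  # counts how often the classifier actually ran

    def scope_for(self, conn: Connection) -> str:
        if conn not in self._scopes:
            self._scopes[conn] = self._classify(conn)
            self.computations += 1
        return self._scopes[conn]


def classify_by_region(conn: Connection) -> str:
    # Hypothetical rule: hosts are named "<region>-<id>".
    src_region = conn[0].split("-")[0]
    dst_region = conn[1].split("-")[0]
    return "intra_region" if src_region == dst_region else "cross_region"


scopes = SharedScopeMap(classify_by_region)
conn = ("east-101", "west-207")

# Two hypothetical features consult the same map for the same connection.
scope_seen_by_pacing = scopes.scope_for(conn)
scope_seen_by_cc = scopes.scope_for(conn)
```

Both features read the same stored answer, so they cannot disagree about a connection's scope, and the classification runs only once.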

The upstreaming decision was deliberate. Many large organizations fork the kernel and maintain custom changes indefinitely. We chose not to because the maintenance cost of diverging from the mainline accumulates over years. Upstreaming required working with the kernel community rather than solving our own problem in isolation. It took longer. It also means the interfaces are reviewed and maintained outside our own deployment.

Programmability is not enough

The constraint we hit was simple: transport behavior needed to change faster than kernel release cycles allowed.

NetEdit closed that gap for our host networking stack. Network changes can be developed, tested, rolled back and improved without rebuilding the kernel or shipping a new kernel version across the fleet. An orchestration layer that skips lifecycle management, compatibility, observability or rollback is not production-ready – regardless of how well the underlying programs work.

The same idea – programmable, safely orchestrated in-kernel tuning – is beginning to spread beyond networking, into storage data paths and AI-serving infrastructure. At this layer, programmability without rollout safety, observability and rollback is not enough.
