2.4 Every Boring Problem Found in eBPF

 3 Nov 2022

~~[every Boring Problem Found in eBPF] by @FridayOrtiz~~

About six months ago I started a new job and dove into adding Berkeley Packet Filters (BPF) as a telemetry source for our Linux endpoint security agent. This work culminated in the release of three open source libraries ([0], [1], [2]). This isn't about that, though. This is about the issues we ran into while implementing BPF as defenders, and how defenders can use BPF in their environments (although attackers should find useful tips in here too). I went through every PR, Jira ticket, and message from the past six months to put together this list of BPF gotchas and their solutions. I hope it helps defenders, developers, and researchers ramp up with BPF faster than I did.

Note: I'm going to use BPF to mean extended BPF (eBPF), since that's the official name. Not to be confused with the old classic BPF (cBPF). I'm also going to assume you're *loosely* familiar with BPF, enough to be considering whether or not to deploy it in your environment.

Why even use BPF?

You may be wondering, "if you're filling an article with caveats about BPF, why should I even bother trying to use it?" Great question, straw man. There are a number of things BPF is really, really, good at that you should consider.

  • You can get visibility (almost) anywhere you want. If there's a specific code path within the kernel (or userspace) that you know will be executed during an attack, you can put yourself there. If there's a payload value in a packet you need to see before it hits your `iptables` rules, you can do that. Want to modify or block syscall args? You can do that too.
  • You can reinstrument dynamically. Change your mind about what you want to inspect? Change it while it's running. You can either swap out the entire program (although that might not be possible in the future with signed BPF programs[4]) or modify behavior by updating BPF maps from userspace.
  • It's safe! You can do all these things with a Linux Kernel Module (LKM), sure, but the BPF Virtual Machine (BPF-VM) and verifier ensure (or at least try really hard to ensure[17]) that you can't panic or break the kernel.
  • It's container aware, or at least it can be. Instrumentation alternatives like `auditd` tend to struggle in containerized environments, returning values that only make sense in certain namespaces, or losing track of things entirely. BPF, on the other hand, can give you information in whatever context you want (as long as you program it that way).
  • It's fast. Do as much work as possible in the BPF program before sending information up to userspace, and you can avoid expensive context switching or race-prone data enrichment.
  • It's atomic (sort of). BPF programs generally aren't preemptable (there are exceptions[5]). This applies to tail calls as well, so you can set up some fairly complex logic in your instrumentation without worrying too much about reentrancy.

The problem with BPF.

From this writer's perspective, there are two main problems with BPF: 1) it's now being used in ways it was never designed for (i.e., it's evolving naturally over time) and 2) there's a large overlap in maintainers of the Linux kernel's BPF subsystem and userspace BPF tooling.

The following are concrete issues stemming from (1):

  • BPF isn't really CO-RE (Compile Once - Run Everywhere), it's more CE-RO (Compile Everywhere - Run Once). A lot of userspace tooling "achieves" CO-RE in practice by compiling on the host machine. New true CO-RE features are... new. Chances are you need to support a kernel that doesn't have them. End-of-life doesn't mean much when the host is running a business-critical function and the suits see too much risk in upgrading it. And loading a full toolchain to compile BPF programs on the host is often a no-go too.
  • The toolchains and libraries are designed around `bpftrace`-like use cases. That is, one-off tooling for diagnosing specific problems. Brendan Gregg's book[7] is a great resource for this. Now that BPF for long-running daemons is gaining popularity, the maintainers are working hard adding features to support this (like the aforementioned true CO-RE). Unfortunately, again, these features probably won't exist on the kernels you need to support.
  • There are many different types of BPF programs (which we'll cover), that all have varying load and run semantics. Depending on whether you want to run a kprobe or a TC classifier, you'll have to use entirely different methods to do so. And while you're writing them, the helpers available to you can vary wildly. And the documentation is incomplete, scattered, and often out of date, because...

And here are some specific examples of issues stemming from (2):

  • Because of the overlap, documentation of the pure BPF interface(s) (there's a plethora, we'll cover that) is lacking. The people that maintain it write the userspace tooling, so they don't need in-depth documentation. Seriously, go check out the BPF manpage for whatever distro you're on. Chances are it's missing a ton of helpers and there's more than one "TODO: fill this out" that's been sitting there for years. Why not use their userspace tooling? Well...
  • Their userspace tooling is a magic labyrinth. In order to get close to CO-RE in a backwards compatible way, it's filled with kludges you probably don't need. Ideally, you'd interface directly with the underlying syscalls and use only what you need. But doing that is undocumented. And, because of the documentation issues, there's really no community drive to simplify these libraries. Because these libraries cover (until recently, see (1)) the majority of historical use cases, there's no drive to improve the documentation. Even if you did, you'd have to backport and patch your documentation to cover all the little idiosyncrasies across kernel versions, and boy are there a lot of those.

Implementation Issues

While working with BPF we ran into a number of implementation-specific problems that led to us building and publishing those three ([0], [1], [2]) BPF tools.
If you're a defender, or work in security, and you're considering getting started with BPF, here's a list of things you'll probably want to know.
Presented in somewhat logical order.

The verifier sucks, but the alternatives are worse

-...- Problem -...-

You will run into lots of problems with the verifier. For example, what's the difference in the following two code snippets?

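/* Snippet 1: dereference the pointer directly. */
u32 *p = (u32 *)1;
u32 i = *p;

/* Snippet 2: copy through the pointer instead. */
u32 *p = (u32 *)1;
u32 i = 0;
__builtin_memcpy(&i, p, sizeof(u32));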

I'll tell you: the first one fails the verifier, the second one does not. But only sometimes, except when it works. Which depends on the kernel version. Maybe. I mean, they should compile to the same thing, right? Apparently not, and subtle differences can completely throw the verifier off.

The real problem with the verifier is that it's getting better all the time. As BPF use cases settle out, the maintainers are changing the BPF verifier to better support them. That means on older kernel versions, without these patches, you'll have to perform strange workarounds to get your code working with older verifiers.

A few more verifier problems you'll likely encounter supporting a wide range of kernel versions:

  • The verifier hates looping. But, sometimes, it also hates loop unrolling. If `clang` generates enough jumps and gotos, even if you tell it to unroll everything, the verifier might (depending on version) fail it anyway. The verifier needs to be able to keep track of all branches and ensure a maximum depth limit. If it can't (whether it's because you're looping or because the verifier can't keep up) your program will fail to verify.
  • In older kernel versions (but not newer ones) variable reads and writes are a big no-no. All offsets must be known at compile time. That means you can't do things like set `some_array[variable_index] = some_value`. This, plus the aversion to loops, makes it nearly impossible to read strings from memory on kernels without the `read_str` family of helpers. The kernel's own `qstr` involves variable memory access, and good luck finding (or setting) the null terminator on your own.
  • Everything that might be a pointer must be null checked. If you don't, the verifier will refuse to load your program even if it's safe. This makes it hard to work with programs that might expect a null value. The convention that most pleases the verifier is to return immediately after a failed null check, and getting around this is tricky and involves trial and error.
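Taken together, those three constraints tend to push string handling toward one shape: a loop with a compile-time bound, an index that is provably in range, and an immediate return on a failed null check. Here's a minimal sketch of that shape, written as ordinary C so it compiles and runs outside the kernel. `MAX_STR` and `copy_str_bounded` are made-up names, and a real BPF program would read the source through a helper like `bpf_probe_read` instead of dereferencing it. Whether any given verifier version accepts the BPF equivalent is still something you have to test.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_STR 64 /* hypothetical bound; a power of two so the mask works */

static_assert((MAX_STR & (MAX_STR - 1)) == 0, "MAX_STR must be a power of two");

/* Copies at most MAX_STR - 1 bytes from src and always null-terminates dst.
 * Returns the number of bytes copied, or -1 on a null pointer. */
int copy_str_bounded(char dst[MAX_STR], const char *src)
{
    int len = 0;

    /* Null check first, and return immediately if it fails. */
    if (!dst || !src)
        return -1;

    /* Constant trip count, so the loop can be fully unrolled. */
    for (int i = 0; i < MAX_STR - 1; i++) {
        char c = src[i];
        if (c == '\0')
            break;
        /* Masking keeps the index inside dst no matter what i is. */
        dst[i & (MAX_STR - 1)] = c;
        len = i + 1;
    }
    dst[len & (MAX_STR - 1)] = '\0';
    return len;
}
```

Anything longer than `MAX_STR - 1` bytes is silently truncated, which is usually the trade you end up making on kernels without the string helpers.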

There are alternatives to the kernel verifier, such as PREVAIL[8], but they have their own set of issues. For what it's worth, PREVAIL is an impressive project and Microsoft will be basing their Windows BPF verifier off of it. But, unfortunately, it doesn't match the expected behavior of the kernel verifier. Just because something passes PREVAIL doesn't mean it will pass the kernel verifier. Just because something fails PREVAIL doesn't mean it will fail the kernel verifier (even though it probably should).

-...- Solution -...-

Run early, run often, run everywhere. Your development environment should make it as easy as possible to test your code on all the kernels you need to support (or as close to a representative sample as you can get). The only way to know if the kernel verifier will accept your program is to run it through the kernel verifier, the real kernel verifier, on the specific kernel you're targeting. Note that this means the distro-specific kernel, with all their modifications and backports. For example, the older Enterprise Linux (Red Hat, CentOS, and so on) kernels (2.x and 3.x) have backported BPF features that might surprise you, since they don't line up with mainline kernel version numbers. The only way to know what's supported is to try it.

Enable logging. This one comes with a caveat. You need to provide the verifier with a large buffer of memory to write its verification logs into. If you don't give it enough space it will fail verification, even if the program would otherwise pass. If your programs are complex, then make sure your buffer is large enough (but not too large, or loading will take forever) and be sure to turn off verifier logs in production to avoid issues with programs failing to load when you know they should.
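For reference, here's roughly what asking for verifier logs looks like if you talk to the `bpf` syscall directly. This is a minimal sketch, not what any particular library does: `load_prog_with_log` is a made-up name, the license is hardcoded, and it ignores things some program types need on older kernels (kprobes, for example, also want `kern_version` set). The `log_buf`, `log_size`, and `log_level` fields are the ones from `union bpf_attr` in `linux/bpf.h`.

```c
#include <assert.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Tries to load a program, asking the verifier to write its log into
 * log_buf. Pass log_buf == NULL to load without logging (production).
 * Returns a program fd, or -1 with errno set. */
int load_prog_with_log(enum bpf_prog_type type,
                       const struct bpf_insn *insns, uint32_t insn_cnt,
                       char *log_buf, uint32_t log_size)
{
    union bpf_attr attr;

    assert(insns && insn_cnt > 0);
    memset(&attr, 0, sizeof(attr));
    attr.prog_type = type;
    attr.insns = (uint64_t)(uintptr_t)insns;
    attr.insn_cnt = insn_cnt;
    attr.license = (uint64_t)(uintptr_t)"GPL";

    if (log_buf && log_size > 0) {
        log_buf[0] = '\0';
        attr.log_buf = (uint64_t)(uintptr_t)log_buf;
        attr.log_size = log_size;
        attr.log_level = 1; /* 0 disables the log entirely */
    }
    return (int)syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}
```

One pattern that fits the advice above is to load with no log first and only retry with a buffer when the load fails, so a too-small buffer can never be the reason a good program is rejected.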

The error messages you get will seem cryptic at first. The BPF verifier uses a lot of terminology and has a lot of restrictions that are undocumented (of course) that you'll learn with time. If you get stuck, the BPF Compiler Collection (BCC) GitHub repo's issue tracker[9] is a great resource. You can probably find a Brendan Gregg ticket that goes over at least the broad class of error you're getting.

BPF doesn't really exist

-...- Problem -...-

BPF is really just an instruction set, for which the Linux kernel provides a VM, verifier, and some helper functions. You run your programs inside this execution context, and call the helper functions to extend the VM's capabilities. When you write a BPF program, what you're really writing is a kprobe, or a uprobe, or an eXpress Data Path (XDP) classifier, or a Traffic Control (TC) classifier, or one of the many other types of kernelspace programs that have been offloaded to the BPF subsystem. There's a ton of BPF program types and more are being added all the time, for a variety of use cases. It turns out being able to safely execute code in the kernel enables a ton of interesting and helpful functionality. Unfortunately, every program type has its own way to load, run, and clean up after it, most of which is entirely undocumented.

On certain distros, the tools you'll need to load these programs might not be enabled by default. For example, some distros don't automatically mount `debugfs`, which you'll need to load kprobes on older kernels.

When you do figure out how to load your program, the ABI for defining programs is entirely based on undocumented, implicit, convention. For example, you'll see a lot of `SEC("kprobe/my_kprobe")` to tell the loader that you're loading a kprobe.

/* helper macro to place programs, maps, license in
 * different sections in elf_bpf file. Section names
 * are interpreted by elf_bpf loader
 */
#define SEC(NAME) __attribute__((section(NAME), used))

This is actually entirely unnecessary, on the syscall level, and is merely a common convention. As you can see in the above snippet, it's just a macro to set the section name in the compiled ELF executable. There's nothing BPF-specific about it. So you not only have to know the requirements of the program type you're trying to load, but also the conventions used by the tools that load and run it.
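To make the "it's just a section name" point concrete, here's a small sketch that uses the macro on an ordinary function and on a license string. It's deliberately not a real probe: `my_kprobe` does nothing, and the file builds with a normal host compiler on Linux as well as with `clang -target bpf`. The `license` section name is another loader convention, not something the compiler knows about.

```c
#include <assert.h>

/* Same definition as the kernel sample header quoted above. */
#define SEC(NAME) __attribute__((section(NAME), used))

/* Loaders that follow the convention read the program type ("kprobe")
 * and the attach target ("my_kprobe") out of this section name. */
SEC("kprobe/my_kprobe")
int my_kprobe(void *ctx)
{
    (void)ctx;
    return 0;
}

/* By convention the license string lives in a section named "license". */
char _license[] SEC("license") = "GPL";

static_assert(sizeof(_license) == 4, "\"GPL\" plus the null terminator");
```

Dumping the section headers of the resulting object (with `readelf -S`, for example) should show `kprobe/my_kprobe` and `license` sitting next to `.text` and `.data`. What a loader does with those names is entirely up to the loader.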

-...- Solution -...-

Figure out what you want your program to do first. Do you want visibility into the kernel? Then you'll probably want a kprobe or tracepoint. Do you want to drop inbound packets? You probably want XDP. Do you want to build detections on outbound traffic? You might want TC, or you might want a kprobe in the kernel network stack. Figure this out, then figure out what you're going to need to run it the way you want to run it (one off? daemon?). When we made `oxidebpf` we had to optimize for stability in the features we needed most (e.g., kprobes) over coverage of all the different BPF program types.

If you can't find libraries to suit your needs for your chosen program type, you'll probably have to write it yourself (or contribute it to an open source project). Because everything is poorly documented, you'll have to dig through a lot of source code to put together the real set of necessary functionality. The official-ish libraries like libbpf and libbcc tend to work the best, but there's issues there (that we'll get to).

I highly recommend using `bpftool` for debugging while developing. It provides the easiest-to-use view into what programs and maps are loaded and where. It lets you visualize data in maps, dump programs, and more. The only problem with `bpftool` is that it's never in the same package. Some distros and repos let you install it with a `yum install bpftool`. Others require you `apt-get install linux-oem-tools`. Sometimes you need to `apt-get install linux-oem-tools-$(uname -r)`. It depends. Whatever you're running, though, you'll probably want this tool installed.

I hope you find constraints fun

-...- Problem -...-

Are you one of the dozen or so people unreasonably upset that `0x10c`[10] was never released? Me too! I find working within constraints challenging and enjoyable. And let me tell you, BPF programs have a lot of constraints.

You get 512 bytes of stack space for your program, half a kilobyte. This doesn't appear to be something that has changed or ever will. It's also unclear if this applies to tail calls. Some documentation implies that tail calls use the same stack space, so you're limited to 512 bytes total, but in practice it seems to be 512 bytes per program. And `clang` probably won't be able to help you. BPF programs, for whatever reason, don't like to reclaim stack space. Your variables will get hoisted and instantiated at the start of execution. If you want to do things like dump syscall arguments or `pt_regs`, especially when working with strings, you'll find yourself running out of stack space very quickly.

There's a practical instruction limit of about 4096 instructions. The instruction limit in the past was set (as far as I can tell) based on what the verifier could verify before declaring "this has gone on too long, I can't verify this won't halt, so I'm failing it." You can get more instructions by manipulating the verifier, and doing other tricks you'll find in the mailing lists, if you really want to put in the effort. Newer verifiers let you get upwards of 1,000,000 instructions, but that'll only help you if you're supporting newer kernels.

-...- Solution -...-

To work around the instruction limit, you can use tail calls. Tail calls are the closest BPF has to a true function call. You transfer flow over to whichever program you call into. You can chain tail calls like this together up to 33 times. There are some caveats, which we'll get to later.

There are a few tricks you can use to work around the stack limit. One trick is to explicitly reuse stack space. For example, reusing variables or instantiating a struct of bytes to act as your scratch space and manually reusing offsets within it. Another trick is to build your own stack with maps. On some kernel versions you can request a struct from a map and get a pointer to it. If the requested struct doesn't exist (e.g., the array map at the requested index was empty) you'll still get back a pointer to an empty struct that you can manipulate. Other kernel versions require this map-struct to be copied to the stack before being modified, so your mileage may vary.

With all that said, I want to offer some practical advice. If the information you're retrieving is too big to ever fit on the stack, you should just send it out as you read it. Create a messaging type and pipeline for chunking and rebuilding data in userspace, copy as much as you can to the stack, and then send it up through a map. This will run on a wider range of kernel versions, and you won't have to worry about if your host kernel allows directly manipulating and emitting map memory. You can reconstruct it in userspace at will. This is what companies like Google are doing for their BPF telemetry.
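Here's a minimal sketch of what such a chunking scheme could look like. The wire format (`struct chunk_msg`), the payload size, and both function names are made up for illustration and aren't taken from any of the tools mentioned here. The producer half stands in for the BPF side, where each filled message would be emitted through a perf or ring buffer map, and the consumer half is the userspace reassembly.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_PAYLOAD 128 /* hypothetical; small enough for a 512-byte stack */

/* Hypothetical wire format for one piece of a larger value. */
struct chunk_msg {
    uint64_t id;   /* ties the chunks of one value together */
    uint16_t seq;  /* position of this chunk within the value */
    uint16_t len;  /* bytes used in payload */
    uint8_t last;  /* non-zero on the final chunk */
    uint8_t payload[CHUNK_PAYLOAD];
};

/* Producer: fill `out` with chunk number `seq` of `data`.
 * Returns 1 if a chunk was produced, 0 if seq is past the end. */
int make_chunk(struct chunk_msg *out, uint64_t id, uint16_t seq,
               const uint8_t *data, size_t total)
{
    size_t off = (size_t)seq * CHUNK_PAYLOAD;
    size_t n;

    assert(out && data);
    if (off >= total)
        return 0;
    n = total - off;
    if (n > CHUNK_PAYLOAD)
        n = CHUNK_PAYLOAD;
    out->id = id;
    out->seq = seq;
    out->len = (uint16_t)n;
    out->last = (off + n == total);
    memcpy(out->payload, data + off, n);
    return 1;
}

/* Consumer: copy one chunk into its slot in the reassembly buffer.
 * Returns the bytes written, or -1 if the chunk is malformed or too big. */
int absorb_chunk(uint8_t *buf, size_t cap, const struct chunk_msg *m)
{
    size_t off = (size_t)m->seq * CHUNK_PAYLOAD;

    if (m->len > CHUNK_PAYLOAD || off + m->len > cap)
        return -1;
    memcpy(buf + off, m->payload, m->len);
    return m->len;
}
```

Because every chunk carries its own `seq`, userspace can drop chunks into place in whatever order they arrive, and a missing `last` tells you the value was cut short.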

The good stuff is GPL

-...- Problem -...-

All the useful helper functions (like `perf_event_output`[6]) are exported as GPL-only. If you want your program to do anything useful, you're going to have to license it under GPL. That makes it hard to make proprietary programs based on BPF. If your program is only internal, and never distributed, you're fine. But if you start distributing your programs (to customers, friends, wherever) you need to publish it under GPL.

-...- Solution -...-

Short answer: Make the world a better place, release your BPF code and tools.

Long answer: BPF is still a niche and complex discipline, so open sourcing your tooling doesn't reduce competitive effectiveness for a business. From an individual perspective, open sourcing your tooling gets your name out there and makes you more valuable as an employee. From an employer perspective, the more accessible BPF becomes the easier it will be to hire people to build and maintain it. From the community perspective, we can all learn from each other by working in the open. Perhaps you have an interesting use case that the maintainers of other libraries would want to know about, or could offer advice on. Everybody wins.

By default, you get the default

Alternatively, BPF is only good with containers if you tell it to be.

-...- Problem -...-

Be careful with the assumptions you make about the information you retrieve from a BPF program. If you grab the retcode of a `fork` call, it's going to give you the retcode of the `fork` call: the pid in the namespace of the calling process. Maybe this is what you wanted, or maybe you really wanted the pid of the child process un-namespaced. Maybe you ask the BPF program to gather the pid (with the `bpf_get_current_pid_tgid` helper). You take the lower 32 bits, corresponding to the pid, but nothing lines up. Well, you're executing in kernelspace, which means the `pid` you probably want is actually the `tgid` (the upper 32 bits), and what you got was a `tid`. Unless you wanted a `tid`, in which case you should get the `pid`. The kernelspace understanding of a `pid` is not the same as the userspace understanding of a `pid`. If you want to identify a file, you probably want the inode number and device number; a file descriptor won't be as useful.
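Since the pid/tgid naming trips up almost everyone, here's the split spelled out. The helper (full name `bpf_get_current_pid_tgid`) returns a single 64-bit value with the kernel's `tgid` in the upper 32 bits and the kernel's `pid` in the lower 32. The function names below are my own, picked to say what userspace would call each half.

```c
#include <assert.h>
#include <stdint.h>

/* Upper half: the kernel's tgid, which is what userspace calls the PID. */
static inline uint32_t userspace_pid(uint64_t pid_tgid)
{
    return (uint32_t)(pid_tgid >> 32);
}

/* Lower half: the kernel's pid, which is what userspace calls the TID. */
static inline uint32_t userspace_tid(uint64_t pid_tgid)
{
    return (uint32_t)pid_tgid;
}

/* Worked example: thread 1003 of process 1000. */
void pid_tgid_example(void)
{
    uint64_t raw = ((uint64_t)1000 << 32) | 1003;

    assert(userspace_pid(raw) == 1000);
    assert(userspace_tid(raw) == 1003);
}
```

For a single-threaded process the two halves are equal, which is exactly why this kind of bug tends to survive testing.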

-...- Solution -...-

If you want your program to retrieve information, think about what information you need to retrieve. Make sure you know where that information exists (what structs, where they live in memory, and how to get there) and then find a place (assuming you're launching a kprobe) in the kernel you can attach your program as close to that information as possible. For example, if you really need the root namespace pid of the child process of fork, you probably want to hook somewhere in the path of the new child process so you can grab the `pid` from the `task_struct`.

Be aware that this location might change between kernel versions, or the information may take a different form. You may have to choose a less optimal probe point that is available on more systems. Or you may have to change the information you're gathering to something else that exists on all the kernels you support. That leads us to the next two issues.

CO-RE (probably) won't help you

-...- Problem -...-

The maintainers are constantly adding features to help BPF developers compile their BPF programs once and run them everywhere. Unfortunately, you'll likely find yourself trying to target kernel versions that don't have these features. Or, if they do, since these features are added piecemeal, they may not have all the CO-RE features you expect.

For example, the BTF feature makes it possible to reference struct members directly, even if they've been compiled in a randomized layout, across different kernel versions, and without recompiling. This feature was added in April of 2018[11]. You will probably need to write code for kernels from before April 2018. This means something like `current->real_parent->pid` is not guaranteed to work without recompiling for (or on) the host.

-...- Solution -...-

There's really no way around this one. It's what we're doing, Microsoft is doing it too for their Linux machines, and I'm sure there are others. First, you determine the offsets of struct members for your desired kernel version and then you load them dynamically into your BPF program at runtime. For example, this code snippet from [12] shows how we read struct offsets from a map and use that in our `bpf_probe_read` to retrieve values.

static __always_inline int read_value(
    void *base, u64 offset, void *dest, size_t dest_size)
{
    /* null check the base pointer first */
    if (!base)
        return -1;

    /* look up the runtime offset for this struct member */
    u64 _offset = (u64)bpf_map_lookup_elem(&offsets, &offset);
    if (_offset)
        return bpf_probe_read(dest, dest_size, base + *(u32 *)_offset);
    return -1;
}

To actually find these offset values in the first place, we built the `linux-kernel-component-cloud-builder`, or `LKCCB`, which builds hundreds if not thousands of kernel modules with debug enabled for all our target kernel versions and extracts structure offset information from the LKM's `DWARF` debug info[1].
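To show how the pieces fit, here's a userspace model of the whole idea: a table of offsets loaded at runtime, and a pointer chase (`current->real_parent->pid`) done one offset-driven read at a time. Everything in it is a stand-in. `struct fake_task` is not the kernel's `task_struct`, the `offsets` array plays the role of the BPF map, `memcpy` plays the role of `bpf_probe_read`, and the offsets come from `offsetof` here where a real deployment would load values extracted from debug info.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for a kernel struct; real layouts differ per kernel build. */
struct fake_task {
    struct fake_task *real_parent;
    int pid;
};

enum offset_key { TASK_REAL_PARENT, TASK_PID, OFFSET_KEY_MAX };

/* Stand-in for the BPF map that userspace fills before attaching. */
struct offset_entry { uint32_t off; int loaded; };
struct offset_entry offsets[OFFSET_KEY_MAX];

int read_value(const void *base, enum offset_key key, void *dest, size_t size)
{
    assert(dest);
    if (!base || (unsigned)key >= (unsigned)OFFSET_KEY_MAX || !offsets[key].loaded)
        return -1;
    /* A BPF program would call bpf_probe_read() here. */
    memcpy(dest, (const char *)base + offsets[key].off, size);
    return 0;
}

/* current->real_parent->pid, without naming a struct member anywhere. */
int parent_pid(const struct fake_task *current, int *pid)
{
    const struct fake_task *parent = NULL;

    if (read_value(current, TASK_REAL_PARENT, &parent, sizeof(parent)) < 0)
        return -1;
    return read_value(parent, TASK_PID, pid, sizeof(*pid));
}

/* Userspace side: load the offsets for the running kernel. */
void load_offsets(void)
{
    offsets[TASK_REAL_PARENT] =
        (struct offset_entry){ offsetof(struct fake_task, real_parent), 1 };
    offsets[TASK_PID] =
        (struct offset_entry){ offsetof(struct fake_task, pid), 1 };
}
```

The useful property is that a missing offset is a clean runtime failure (`-1`) rather than a read from the wrong place, so an unsupported kernel degrades into missing fields instead of garbage.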

Stable interfaces aren't

-...- Problem -...-

You'll often find, when working with the kernel, that there aren't as many stable interfaces as you thought there'd be. Even syscalls, which are supposed to be a big part of the stable user interface, aren't necessarily stable.

For example, if you somehow traveled back in time and wanted to monitor process forks, you'd probe the `fork` syscall. That'd work fine for a bit, until `clone` was introduced. If you stopped paying attention, you'd lose your data altogether when glibc (and everything with it) switched `fork()` to be a wrapper around `clone()`.

Maybe you want to get the arguments of a syscall. Should be easy, you're given `pt_regs`, just access the registers that hold the arguments! Except if you're on x86_64, and on kernel 4.17 or newer, you'll probably be given the `pt_regs` of the syscall wrapper function, that in turn calls the real syscall function. And of course, it all shuffles around if you need to add aarch64 support, which has its own set of calling conventions.

Sometimes a symbol that's marked as being exported can't be attached to, almost inexplicably. Usually this is due to GCC inlining the function, and the symbol being renamed to something like `symbol_name.part.1213`. Trying to bind `symbol_name` won't work.

-...- Solution -...-

For different architectures you can probably get away with macros that conditionally compile depending on what architecture you're targeting, and then building one copy per architecture. For the syscall wrappers, you can do something similar but build targeting different kernel versions. In practice, you may find you need many variants and copies of a single program, all with slight differences, to support different kernels and architectures.
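As a rough illustration of the syscall wrapper case, here's a userspace model of the extra hop. `struct regs` is a cut-down stand-in for x86_64 `pt_regs` with only the first three argument registers, and `has_wrapper` is a made-up flag for whatever build-time or runtime switch you use to tell wrapped kernels apart. In a real probe the inner registers would be copied out with `bpf_probe_read` and not dereferenced directly.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for x86_64 struct pt_regs. */
struct regs {
    unsigned long di, si, dx;
};

/* Returns the first syscall argument as the caller passed it. */
unsigned long syscall_arg0(const struct regs *ctx, int has_wrapper)
{
    assert(ctx);
    if (has_wrapper) {
        /* The probed wrapper's first argument is itself a pointer to
         * the caller's registers; the real arguments live in there. */
        const struct regs *inner = (const struct regs *)ctx->di;
        return inner ? inner->di : 0;
    }
    /* No wrapper: the probe context already holds the arguments. */
    return ctx->di;
}
```

Get the flag wrong and you don't crash, you just report a kernel pointer as if it were a file descriptor or a path, which is a fun one to debug.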

For the symbol name problem, it comes back to run early and run often. It's often worth spinning up a VM of a few of the kernels you're targeting and double checking that the locations you're hooking are indeed in `/proc/kallsyms`. Sometimes you'll find the functions you were looking at don't exist in different versions, or have been renamed and relocated. I recommend getting comfortable with Bootlin's Elixir cross referencer (but you still need to run and see, because distros do their own backports which won't match what's in the mainline cross referencer).
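The `/proc/kallsyms` check is easy to automate as part of a test run. Below is a small sketch that looks for a symbol and distinguishes "present" from "only present with a compiler-generated suffix". The function names and return codes are my own, and it assumes the usual `address type name` line layout.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Classifies one kallsyms line against `want`.
 * Returns 1 for an exact match, 2 for a suffixed variant such as
 * "want.part.12", and 0 otherwise. */
int match_kallsyms_line(const char *line, const char *want)
{
    char name[256], type;
    unsigned long long addr;
    size_t wlen;

    assert(line && want);
    if (sscanf(line, "%llx %c %255s", &addr, &type, name) != 3)
        return 0;
    if (strcmp(name, want) == 0)
        return 1;
    wlen = strlen(want);
    if (strncmp(name, want, wlen) == 0 && name[wlen] == '.')
        return 2;
    return 0;
}

/* Scans a kallsyms-style file. Returns the best match found (1, 2, or
 * 0), or -1 if the file can't be opened. */
int find_symbol(const char *path, const char *want)
{
    char line[512];
    int best = 0;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        int m = match_kallsyms_line(line, want);
        if (m == 1) {
            best = 1;
            break;
        }
        if (m == 2)
            best = 2;
    }
    fclose(f);
    return best;
}
```

Calling `find_symbol("/proc/kallsyms", "tcp_close")` on each target VM and failing the build on anything other than `1` catches most renamed or split hook points before a customer does.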

Running BPF programs as intended involves magic

-...- Problem -...-

If you use libbcc, libbpf, bpftrace, or any other high level BPF tools you'll quickly notice that they do a lot of magic for you. BCC (the Python interface) will more or less rewrite your programs for you so they work on your host system. You'll end up getting error messages on code that the library wrote for you. They also help you get around CO-RE limitations by compiling on the host, and using different tricks and kludges to get the same program code to operate in many environments as cleanly as possible. This doesn't help a ton when you need to build and distribute actual raw BPF programs in their own ELF file.

These libraries are also pretty convoluted. There's a lot of overlap in maintainers between these libraries and the people working on BPF in the kernel, so documenting the interactions isn't a priority. But you don't actually need all the stuff these libraries are doing. After a while, especially with the rewriting, you'll find yourself wanting to write and load pure BPF C code. Here's a snippet of a map I put together when trying to figure out what syscalls were actually being made when libbpf loaded a program.

|-> bpf_attach_kprobe()
    |-> bpf_attach_probe()
        |-> bpf_try_perf_event_open_with_probe()
            |-> bpf_find_probe_type()
            |-> bpf_get_retprobe_bit()
            |-> syscall(__NR_perf_event_open)
        |-> create_probe_event()
            |-> enter_mount_ns()
                |-> setns()
            |-> exit_mount_ns()
                |-> setns()
        |-> bpf_attach_tracing_event()
        |-> bpf_close_perf_event_fd()

These libraries are also GPL, which means your userspace program would end up being licensed under GPL and not just your BPF programs. As great as this is for users, if you work for a company that likes to make money you might not be allowed to touch GPL. It might even be in your contract.

-...- Solution -...-

If you're writing complex BPF programs for security, you'll probably want to write it in C without the "help" of something like BCC. You'll also want a bit more control and transparency when loading and attaching your programs. In my experience, libbpf wasn't great at cleaning up after itself and it got frustrating.

Use a clean, simple, library built from the ground up for loading BPF in your language of choice. For Rust, `aya`[16] is a good one, and I worked on `oxidebpf`[0]. Golang also has some good options. One common theme of these projects is the amount of effort that went into reverse engineering the undocumented program loading logic and reimplementing it. Take advantage of that work and use it to load your programs. They're also permissively licensed!

Speed kills

-...- Problem -...-

After getting into BPF, you may benchmark a few of your programs and be surprised at how much faster BPF is than what you've been using before, like audit. This makes it very tempting to trace and probe more than you probably should. For example, if you want to trace socket closes you may be tempted to put a kprobe on the `close` syscall. This syscall is called all the time, and probing it will slow your system down unnecessarily. Most of the messages will be discarded since you only care about sockets. There are plenty of other interesting areas that can't be reasonably instrumented due to the performance impact.

-...- Solution -...-

Trace only what you need, and scope it down as much as possible. Going back to the `close` example, you'd be better off probing somewhere downstream where the individual `tcp_close` or `udp_lib_close` functions are called.

struct proto tcp_prot = {
    // ...
    .close = tcp_close,
    // ...
};

struct proto udp_prot = {
    // ...
    .close = udp_lib_close,
    // ...
};

Brendan Gregg's book[7], again, has a great table that shows the overall performance impact of tracing different points in the kernel. You could also just reason intuitively about how often you think each area you want to probe is exercised. The more commonly a code path is executed, the more expensive it will be to probe it.

Even after doing your best to scope down and optimize your BPF programs, you'll probably want to run benchmarks as you tweak things to see what performs best in your target environment. Flamegraphs[13] are a great way to see where most of your overhead is coming from, especially if combined with a benchmarker like UnixBench[14]. The results may surprise you.

I'd also recommend processing BPF events in batches. You'll probably be sending out a lot of information through maps that needs to be read from userspace. If you're getting information that's too big for the stack, the information will be sent in chunks that need to be reconstructed in userspace. It's definitely possible to loop a blocking poll+read on the perfmap or BPF ring buffer, but doing so will result in significant overhead. You're much better off letting the buffers fill a bit, and processing them in batches (batch process, don't stream process). Doing that netted me significant performance gains in benchmarks for the BPF programs I write at work.
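If you're on perf buffers, one way to get batching without writing your own timer logic is to have the kernel wake your reader only once a certain number of bytes is queued. The sketch below fills in a `perf_event_attr` for a buffer that receives `perf_event_output` data. The fields are the real ones from `linux/perf_event.h`, but the function name and the idea of passing the threshold as a parameter are mine, and you'd still need to call `perf_event_open` per CPU and store the resulting fds in your perf event array map.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Prepares `attr` for a per-CPU buffer fed by perf_event_output.
 * watermark_bytes == 0: wake the reader on every event (streaming).
 * watermark_bytes  > 0: wake the reader once that many bytes are queued. */
void init_bpf_output_attr(struct perf_event_attr *attr, uint32_t watermark_bytes)
{
    assert(attr);
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_SOFTWARE;
    attr->config = PERF_COUNT_SW_BPF_OUTPUT;
    attr->sample_type = PERF_SAMPLE_RAW;
    attr->sample_period = 1;

    if (watermark_bytes) {
        attr->watermark = 1;
        attr->wakeup_watermark = watermark_bytes;
    } else {
        attr->wakeup_events = 1;
    }
}
```

A watermark alone means a quiet system can sit on a few events indefinitely, so it's worth pairing it with a poll timeout and draining whatever is there when the timeout fires.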

Don't Panic

-...- Problem -...-

BPF programs will generally live as long as something holds a file descriptor that points to them. However, sometimes you need to manually clean up after them (such as when using `debugfs`). If your userspace program crashes or panics things may not get cleaned up properly. This can lead to all sorts of problems when you restart your probes, such as receiving duplicate events.

If you're building short lived, one off, tools this is less of a concern. But if you're managing several probes as part of a long-lived monitoring daemon then this is something you need to be careful with.

-...- Solution -...-

Make sure you design the userspace component of the BPF program to keep your programs alive for as long as you'll need them. Gracefully handle all errors in the thread that keeps your BPF programs alive and make sure you clean up after yourself in the event of failure or shutdown. Keep in mind that many older BPF tools are built around short-lived programs, meant for things like `bpftrace` or production debugging.
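For the `debugfs` case specifically, cleanup means writing a removal line to the `kprobe_events` file for every event you registered. Here's a minimal sketch, assuming the events were created under your own group name so stale ones are easy to recognize at startup. The path is the common `debugfs` mount point and may differ on your systems, and both function names are made up.

```c
#include <assert.h>
#include <stdio.h>

/* Common location; adjust if tracefs is mounted somewhere else. */
#define KPROBE_EVENTS "/sys/kernel/debug/tracing/kprobe_events"

/* Builds the "-:group/event" line that removes a registered kprobe event.
 * Returns 0 on success, -1 if the line would not fit in buf. */
int format_kprobe_removal(char *buf, size_t cap,
                          const char *group, const char *event)
{
    int n;

    assert(buf && group && event);
    n = snprintf(buf, cap, "-:%s/%s\n", group, event);
    return (n > 0 && (size_t)n < cap) ? 0 : -1;
}

/* Appends the removal line to kprobe_events. Intended to be called on
 * every exit path, and at startup for leftovers from a previous crash. */
int remove_kprobe_event(const char *group, const char *event)
{
    char line[256];
    FILE *f;
    int rc;

    if (format_kprobe_removal(line, sizeof(line), group, event) < 0)
        return -1;
    f = fopen(KPROBE_EVENTS, "a");
    if (!f)
        return -1;
    rc = (fputs(line, f) < 0) ? -1 : 0;
    if (fclose(f) != 0)
        rc = -1;
    return rc;
}
```

Doing the same sweep at startup as at shutdown is what actually protects you from duplicate events, since a crash by definition skips the shutdown path.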

Know your limits

-...- Problem -...-

Your program will have instructions and will probably use maps. These take up space, which the BPF syscall will handily memlock for you. On many distros, this is fine. On others, however, the default memlock ulimit is quite low[15]. See the following output of `ulimit -l` on various distributions.

vagrant@ubuntu2004:~$ ulimit -l
[vagrant@centos7 ~]$ ulimit -l
vagrant@opensuse15:~> ulimit -l

If you can't memlock enough memory to fit your instructions and maps, you'll get rejected with cryptic verifier error messages.

-...- Solution -...-

Calculate the amount of memory your programs and maps will need, and check the memlock limits on your target systems. You may be fine, or you may need to raise it first. Some libraries (like the one we wrote![0]) can try to take care of this for you.
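Checking and raising the limit from your loader is only a few lines with the standard `getrlimit` and `setrlimit` calls. A minimal sketch follows; `ensure_memlock` is a made-up name and the policy (only raise, never lower) is just one reasonable choice.

```c
#include <assert.h>
#include <sys/resource.h>

/* Makes sure the soft RLIMIT_MEMLOCK is at least `needed` bytes.
 * Returns 0 if the limit is (now) high enough, -1 otherwise. */
int ensure_memlock(rlim_t needed)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0)
        return -1;

    /* Already enough room: leave the limit alone. */
    if (lim.rlim_cur == RLIM_INFINITY || lim.rlim_cur >= needed)
        return 0;

    lim.rlim_cur = needed;
    /* Raising the hard limit as well requires privileges. */
    if (lim.rlim_max != RLIM_INFINITY && lim.rlim_max < needed)
        lim.rlim_max = needed;
    return setrlimit(RLIMIT_MEMLOCK, &lim);
}
```

Call it before creating any maps or loading any programs, and treat a `-1` as a clear, loggable reason for the failure you'd otherwise get later.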

Tail calls aren't guaranteed

-...- Problem -...-

Think of tail calls like the BPF equivalent of `execve`, except less powerful. It'll start running a new probe, with the original context argument, and replace everything you were previously doing. You can't provide it with custom arguments, and the tail call needs to pass the verifier independently. This means if you want to communicate between tail calls you'll need to use maps. You're also limited to chaining 33 tail calls in a single execution; after that, the tail call will simply fall through.

You can't call into another program with a tail call directly, either. You need to reference an index in a tail call program map (a type of BPF map) which needs to be set from userspace. For example, if you want to tail call from `prog_a()` into `prog_b()`, you'll need to load `prog_a()` and `prog_b()` first. At this point if `prog_a()` fires, the tail call into `prog_b()` will fizzle. Then, from userspace, you need to update a map to say "`prog_b()` is at index 5, if anyone tries to tail call into index 5, send them to `prog_b()`." Tracking and maintaining all these indexes can be cumbersome.

And there's not always a guarantee that the tail call will fire. You could reach an execution limit, or a memory limit, or some other weird verifier edge case that prevents the tail call from firing. Your programs need to handle this gracefully.

-...- Solution -...-

First, you'll have to write your tail calls as though they were independent programs. Think of designing each one to grab a different bit of information you're looking for. If you find yourself re-calculating the same things in each program or otherwise need to communicate across calls, store and retrieve information from a map.

For managing indexes, use an enum for your tail calls and reference that from your userspace application. For example, have an `enum tail_calls { PROBE_A, PROBE_B };` and then reference it from inside your programs and when loading the program map from userspace. The file descriptor for `probe_a()` goes at index `PROBE_A`, and so on. If you want to call into `probe_a()`, you get there by asking for `PROBE_A` with `bpf_tail_call(ctx, &tail_call_table, PROBE_A);`.

Your program should also have a plan for what happens if the tail call doesn't go off. Do you want to send up an error? Ignore it? Send up a message that execution was completed? Something else? For example, if you're using recursive tail calls to read a string value you may want to return a message that says you hit your tail call limit before you finished reading the string.
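To make the fall-through behavior concrete, here is a small model in plain userspace C. To be clear, this is NOT BPF code and it won't load into a kernel: `model_tail_call()` is a hypothetical stand-in for `bpf_tail_call()`, `tail_call_table` stands in for the tail call program map, and `struct event_state` stands in for whatever map you use to share state. It only illustrates the three points above: a shared enum for indexes, state passed through a "map" instead of arguments, and an explicit plan for when the call doesn't fire.

```c
#include <stddef.h>

#define MAX_TAIL_CALLS 33 /* chain limit quoted in the text */

/* Shared by the probes and the loader so indexes never drift apart. */
enum tail_calls { PROBE_A, PROBE_B, TAIL_CALL_SLOTS };

enum chain_status { CHAIN_PENDING, CHAIN_COMPLETE, CHAIN_TAIL_CALL_MISSED };

/* Stand-in for per-event state kept in a BPF map. */
struct event_state {
    int stages_run;
    int tail_calls_used;
    enum chain_status status;
};

typedef int (*prog_fn)(struct event_state *);

/* Model of the tail call program map: empty until "userspace" fills it. */
prog_fn tail_call_table[TAIL_CALL_SLOTS];

/* Model of bpf_tail_call(): returns 1 if the callee ran in place of the
 * caller, 0 if the call fell through (empty slot or chain limit reached). */
int model_tail_call(struct event_state *st, enum tail_calls idx)
{
    if ((unsigned)idx >= TAIL_CALL_SLOTS || tail_call_table[idx] == NULL)
        return 0;
    if (st->tail_calls_used >= MAX_TAIL_CALLS)
        return 0;
    st->tail_calls_used++;
    tail_call_table[idx](st);
    return 1;
}

int probe_b(struct event_state *st)
{
    st->stages_run++;
    st->status = CHAIN_COMPLETE;
    return 0;
}

int probe_a(struct event_state *st)
{
    st->stages_run++;
    if (model_tail_call(st, PROBE_B))
        return 0; /* in real BPF, nothing after a successful call runs */

    /* Fall-through path: record that the chain was cut short. */
    st->status = CHAIN_TAIL_CALL_MISSED;
    return 0;
}

/* Fire the chain once and hand back the state the probes left behind. */
struct event_state run_chain(void)
{
    struct event_state st = { 0, 0, CHAIN_PENDING };

    probe_a(&st);
    return st;
}
```

Before the loader sets `tail_call_table[PROBE_B] = probe_b;`, `run_chain()` ends with `CHAIN_TAIL_CALL_MISSED`, which mirrors the "fizzle" described earlier. Afterwards it ends with `CHAIN_COMPLETE`. In a real probe, the fall-through branch is where you'd send your "I didn't finish" message.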

You can't just return what you want

Alternatively, you're on your own with error handling.

-...- Problem -...-

This one was a problem that we didn't even realize we had because it was so subtle. In C it's pretty customary to return `0` on success and `-1` (or some other negative error code) in the event of a failure. The actual returned value is usually written to a buffer or some other pointer given as a function parameter. You check the return code for success or failure and take actions appropriately (in theory). While writing BPF programs in C, especially kprobes, you might be tempted to follow this pattern. After all, it makes sense. The actual value you return is sent out through perf or written into a map so userspace can grab it, so the return value of the probe itself should be `0` to indicate success or `-1` to indicate failure, right? Just like every other C program? Wrong! For program types other than kprobes (remember, BPF is just an execution environment) it's more obvious that the return codes have special meaning. For example, XDP programs have explicit return codes to drop, pass, or re-process packets.

For kprobes, `return 0;` means "I'm done with this kprobe, you can move on." You indicate that you want to keep the probe hanging around with _literally any other return code_ (including `return -1;`). That's probably not what you want. Take a look at this function from the kprobe handler in [18]:

/* Kprobe profile handler */
static int
kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
{
    // ...
    if (bpf_prog_array_valid(call)) {
        // ...
        ret = trace_call_bpf(call, regs);
        // ...
        if (!ret)
            return 0;
    }

    head = this_cpu_ptr(call->perf_events);
    if (hlist_empty(head))
        return 0;

    dsize = __get_data_size(&tk->tp, regs);
    __size = sizeof(*entry) + tk->tp.size + dsize;
    size = ALIGN(__size + sizeof(u32), sizeof(u64));
    size -= sizeof(u32);

    entry = perf_trace_buf_alloc(size, NULL, &rctx);
    if (!entry)
        return 0;

    entry->ip = (unsigned long)tk->rp.kp.addr;
    memset(&entry[1], 0, dsize);
    store_trace_args(&entry[1], &tk->tp, regs, sizeof(*entry), dsize);
    perf_trace_buf_submit(entry, size, rctx, call->event.type, 1, regs,
                          head, NULL);
    return 0;
}

There are two things you should notice in that snippet: the line `ret = trace_call_bpf(call, regs);` and the check `if (!ret) return 0;`. Together they mean that if `trace_call_bpf()` returns _anything but `0`_ (including `-1`), we go through the remainder of the function: buffer allocation, trace argument storage, and so on. We can grab the internals of that function at [19]:

/**
 * trace_call_bpf - invoke BPF program
 * @call: tracepoint event
 * @ctx: opaque context pointer
 *
 * kprobe handlers execute BPF programs via this helper.
 * Can be used from static tracepoints in the future.
 *
 * Return: BPF programs always return an integer which is interpreted by
 * kprobe handler as:
 * 0 - return from kprobe (event is filtered out)
 * 1 - store kprobe event into ring buffer
 * Other values are reserved and currently alias to 1
 */
unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
{
    unsigned int ret;

    // ...

    /*
     * ...
     */
    ret = BPF_PROG_RUN_ARRAY(call->prog_array, ctx, bpf_prog_run);

    // ...

    return ret;
}

As you can see, this is the function that actually invokes the BPF program attached to the kprobe. It gets `ret`, which it returns, from `BPF_PROG_RUN_ARRAY()`, which, as you might expect, runs the BPF program. The documentation on this function is also pretty explicit, which is nice. When we return `0`, we've returned from the kprobe and don't need to keep any details about it hanging around. When we return `1` (which anything besides `0` aliases to), we store information about the kprobe event in a ring buffer for later.

-...- Solution -...-

The solution here is to always `return 0;` in your kprobes, unless you have an explicit need to `return 1;`. If you want to know if your kprobe failed or is in some incomplete state, you'll need to architect your message-passing to handle that case. For example, you might want to include a success code flag in the struct(s) you pass through a perfmap which you can check for failure. Or you might want to build your system around a best-effort event reconstruction for more complicated returns involving multiple messages. In any case, you'll have to engineer your error checking and handling independently of the BPF VM system. Those return codes are reserved, you gotta make your own.
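Here's one way the "success code flag" idea could look on the userspace side, as a minimal sketch. Everything named here is hypothetical: `struct probe_event` is an assumed message layout (your probe would fill it in, send it up, and then `return 0;` regardless of the outcome), and `classify_event()` is a made-up helper you'd call from whatever callback your library gives you for incoming messages.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical status codes the probe writes into every message. */
enum event_status { EVENT_OK = 0, EVENT_TRUNCATED = 1, EVENT_READ_FAILED = 2 };

/* Hypothetical message layout shared by the probe and userspace. */
struct probe_event {
    uint32_t pid;
    uint32_t status; /* one of enum event_status */
    char path[64];
};

enum verdict { VERDICT_ACCEPT, VERDICT_PARTIAL, VERDICT_DROP };

/* Decide what to do with one raw message. The probe's own return value never
 * reaches us, so the status field is the only error channel we have. */
enum verdict classify_event(const void *data, size_t size,
                            struct probe_event *out)
{
    if (data == NULL || size < sizeof(*out))
        return VERDICT_DROP; /* short or missing message */

    memcpy(out, data, sizeof(*out));
    out->path[sizeof(out->path) - 1] = '\0'; /* never trust termination */

    switch (out->status) {
    case EVENT_OK:
        return VERDICT_ACCEPT;
    case EVENT_TRUNCATED:
        return VERDICT_PARTIAL; /* usable, but flag it downstream */
    default:
        return VERDICT_DROP;
    }
}
```

The design choice worth noting is that failure travels inside the payload. A `VERDICT_PARTIAL` message is still useful for best-effort reconstruction, while anything unknown gets dropped rather than guessed at.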

Wow, that looks hard. Can you summarize it for me?

Certainly! BPF is really good at getting visibility (almost) anywhere in the entire system, it allows dynamic reinstrumentation, can be made container aware, is faster than alternatives, and is (usually) safe to run as long as you can load it. Consider using BPF if any of the following apply to you:

  • You have the in-house resources and expertise to build and maintain a long-lived BPF telemetry source.
  • You only want to use BPF for live debugging or other short-lived, one-off use cases such as bpftrace.
  • You're lucky enough to only have to support a single kernel version.
  • You don't really care if the project succeeds, you just want to get experience with BPF (this might legitimately apply in R&D orgs).

If you don't have the resources and need to support a wide range of kernels, you might be better off looking for an alternative (there are many free and open source options thanks to GPL), or paying someone else to build it for you.

Long running BPF programs for security are a relatively new use case. The tooling around this use case is getting better all the time, but there's still a lot to consider before diving in.


