CVE-2020-8835: Linux Kernel Privilege Escalation via Improper eBPF Program Verification
April 16, 2020 | Guest BloggerDuring the recent Pwn2Own 2020 competition, Manfred Paul (@_manfp) of RedRocket CTF used an improper input validation bug in the Linux kernel to go from a standard user to root. Manfred used this bug during the contest to win $30,000 in the Privilege Escalation category. He has graciously put together this write-up of his research describing the bug and the exploit used during the contest.
This blog explains the technical details of an exploit using the Linux eBPF feature to achieve local privilege escalation. Due to the nature of the bug(s), which are rather subtle misbehaviors of a safety-critical feature called the “verifier”, some explanations about the inner workings of eBPF need to be provided first. It may be helpful to have the kernel source (kernel/bpf/verifier.c in particular) as a reference alongside these explanations. This bug has been assigned CVE-2020-8835 and was patched on March 30, 2020. Here’s a quick video demonstration of the exploit in action:
How eBPF works
eBPF
Since version 3.15, the Linux kernel supports a general tracing feature called “extended Berkeley Packet Filters”, or eBPF for short. This feature allows users to run eBPF programs, which are written in an assembly-like instruction set, directly in kernel space and can be used to trace certain kernel functionalities. This feature can also be used to filter network packets.
All BPF features are accessed through the BPF syscall, which supports various commands. The man page for BPF(2) states:
In the current implementation, all bpf() commands require the caller to have the CAP_SYS_ADMIN capability.
This is incorrect. Since Linux 4.4, any user can load eBPF programs and run them by attaching them to a socket they own.
eBPF Programs
eBPF uses an instruction set that is quite similar to a (very) limited subset of standard x86 assembly. There are 10 registers (plus one stack pointer), and we can execute all basic copy and arithmetic operations on them, including bitwise operations and shifts. For example:
BPF_MOV64_IMM(BPF_REG_3, 1)
sets register 3 to value 1, and
BPF_ALU64_REG(BPF_ARSH, BPF_REG_7, BPF_REG_5)
arithmetically shifts register 7 right by the contents of register 5. Note that the suffix _REG
denotes a register as second operand, while _IMM
-instructions take an immediate value.
There are also branching jump instructions:
BPF_JMP_IMM(BPF_JNE, BPF_REG_3, 0, 3)
jumps over the next three instructions, if register 3 is not equal to 0.
For each mov
, alu
and jmp
instruction, there is also a corresponding 32-bit version that operates just on the lower 32 bits of the registers (results are zero-extended where needed).
Finally, there are memory loads and stores:
BPF_LDX_MEM(BPF_DW, BPF_REG_3, BPF_REG_9, 24)
loads a 64-bit value from [reg9+24]
into reg3
, and
BPF_STX_MEM(BPF_DW, BPF_REG_8, BPF_REG_6, 0)
stores the contents of register 6 in [reg8+0]
.
To execute an eBPF program, it first has to be loaded via the BPF_PROG_LOAD
command. This returns a file descriptor corresponding to the program. This file descriptor can then be attached to a socket. The program is then executed for every packet that comes through this socket.
eBPF Maps
To store data or communicate with user space programs (or each other), eBPF programs can use a feature called maps. All maps represent key-value-mappings of some kind. There are different map types, for example, queues and stacks, however for this exploit, the only one that is used is the arraymap
. As the name implies, this type of map is just a contiguous array of entries. It is defined by three parameters:
-- key_size
is the size in bytes of each index used to access an element. The exploit always uses key_size=sizeof(int)=4
.
-- value_size
is the size of each array element. This size can be arbitrary, within reasonable limits.
-- max_entries
is the length of the array. We will always just set value_size
as high as we need, and set max_entries
to 1.
One advantage to setting max_entries
to 1, rather than having a small value_size
but more entries, is that we can then get a pointer to the single value inside the eBPF program, so that the arraymap
is in effect just a single memory chunk. This will be very convenient later on.
Similar to programs, maps are created by the BPF_MAP_CREATE
command of the bpf
syscall and are identified by a file descriptor.
The exploit will use three maps:
-- inmap
is a small map containing all parameters the exploit needs to run (e.g. the offset of an out-of-bounds read to perform). Note that although there will be multiple parameters, they will all be stored inside a single larger array entry.
-- outmap
is a very small map containing any output from the exploit program (for Out-of-Bounds (OOB) reads, the read value).
-- explmap
is a larger map that will be used for the exploit itself.
JIT Compiler
For performance reasons, all eBPF programs are JIT compiled into native machine code when they are loaded (except when CONFIG_BPF_JIT_ALWAYS_ON
is disabled).
The JIT compilation process is pretty much straightforward, as it is very easy to find x86 instructions corresponding to most eBPF instructions. The compiled program runs in kernel space with no additional sandboxing.
Verifier
Obviously, running arbitrary JIT-compiled eBPF instructions would trivially allow arbitrary memory access, because the load and store instructions are translated into indirect mov
s. Therefore, the kernel runs a verifier on each program to make sure no OOB memory access can be performed, and also that no kernel pointers can be leaked. The verifier enforces roughly the following (some of these apply only to programs loaded by non-privileged processes):
-- No pointer arithmetic or comparisons can be performed, except for the addition or subtraction of a pointer and a scalar value (a scalar value is any value that is not derived from a pointer value).
-- No pointer arithmetic that leaves the bounds of known-safe memory regions (i.e. maps) can be performed.
-- No pointer values can be returned from maps, nor can they be stored in maps, where they would be readable from user space.
-- No instruction is reachable from itself, meaning that the program may not contain any loops.
To do this, the verifier must track – for each program instruction – which registers contain pointers and which contain scalar values. In addition, the verifier must perform range computations to ensure that pointers can never leave their appropriate memory regions. It must also perform range tracking for scalar values, because without knowing lower and upper bounds it would be impossible to tell if adding a register containing a scalar value to one containing a pointer would result in an out-of-bounds pointer.
For tracking the range of possible values of each register, the verifier keeps track of three separate bounds:
1- umin
and umax
track the minimum and maximum value that the register can contain when interpreted as an unsigned integer.
2- smin
and smax
track the minimum and maximum value that the register can contain when interpreted as a signed integer.
3- var_off
contains information about certain bits that are known to be 0 or 1. The type of var_off
is a structure known as tnum
, which is short for “tracked number” or “tristate number”. A tnum
has two fields. One field, named value
, has all bits set that are known to be 1 in the register under consideration. The other field, named mask
, has all bits set where the corresponding bit in the register is unknown. For example, if value
is binary 010
and mask
is binary 100
, then the register could contain either binary 010
or binary 110
.
To see why 1 and 2 are both necessary, consider a register whose signed bounds are -1 and 0. Interpreted as an unsigned integer, those values can range from 0 to 2ˆ64-1, which has the same representation as signed -1. On the other hand, the unsigned range from 2ˆ63-1 to 2ˆ63 includes both the smallest and largest possible signed values.
All of these bounds are regularly used to update each other. For example, if umax
is below 2ˆ63, then smin
is set to 0 (if it was negative before), as all such numbers are positive. Similarly, if var_off
indicates that all except the last three bits are 0, then umax
can be safely set to 7.
The verifier checks each possible execution path, meaning that at each branch, both outcomes are checked separately. For the two branches of a conditional jump, some additional information can be learned. For example, consider:
BPF_JMP_IMM(BPF_JGE, BPF_REG_5, 8, 3)
which takes the branch if register 5 is greater or equal than 8 in an unsigned comparison. While analyzing the false-branch, the verifier is able to set umax
of register 5 to 7, because any higher unsigned value would have resulted in taking the other branch.
ALU Sanitation
In response to a large number of security vulnerabilities that were due to bugs in the verifier, a feature called “ALU sanitation” was introduced. The idea is to supplement the verifier’s static range checks with runtime checks on the actual values being processed by the program. Recall that the only permitted calculations with pointers are to add or subtract a scalar. For every arithmetic operation that involves a pointer and a scalar register (as opposed to an immediate), an alu_limit
is determined as the maximal absolute value that can safely be added to or subtracted from to the pointer without exceeding the allowable range.
Pointer arithmetic where the sign of the scalar operand is unknown is disallowed. For the rest of this subsection, assume that each scalar is treated as positive; the negative case is analogous.
Before each arithmetic instruction that has an alu_limit
, the following sequence of instructions is added. Note that off_reg
is the register containing the scalar value, and BPF_REG_AX
is an auxiliary register.
BPF_MOV32_IMM(BPF_REG_AX, aux->alu_limit - 1)
BPF_ALU64_REG(BPF_SUB, BPF_REG_AX, off_reg)
BPF_ALU64_REG(BPF_OR, BPF_REG_AX, off_reg)
BPF_ALU64_IMM(BPF_NEG, BPF_REG_AX, 0)
BPF_ALU64_IMM(BPF_ARSH, BPF_REG_AX, 63)
BPF_ALU64_REG(BPF_AND, off_reg, BPF_REG_AX)
If the scalar exceeds alu_limit
, then the first subtraction will be negative, so that the leftmost bit of BPF_REG_AX
will be set. Similarly, if the scalar that is supposed to be positive is in fact negative, the BPF_OR
instruction will set the leftmost bit of BPF_REG_AX
. The negation followed by the arithmetic shift will then fill BPF_REG_AX
with all 0s, so that the BPF_AND
will force off_reg
to zero, replacing the offending scalar. On the other hand, if the scalar falls within the appropriate range 0 <= off_reg <= alu_limit
, the arithmetic shift will fill BPF_REG_AX
with all 1s, so that the BPF_AND
will leave the scalar register unchanged.
To make sure that setting the register to 0 does not itself introduce a new out-of-bounds condition, the verifier will track an additional “speculative” branch where the operand is treated as 0.
Exploiting the Verifier
Breaking Range Tracking
As mentioned, jump conditionals also exist in a 32-bit variant. However, because all bounds are tracked exclusively for the full 64-bit-wide registers, there is no straightforward way to use 32-bit branches to update register bounds, as can be done for 64-bit branches as detailed above. Because this resulted in improperly rejected programs, for every 32-bit branching jump an additional function call was added, in an attempt to maximize the bounds information that can be gleaned. The idea is to use umin
and umax
to narrow down the last 32 bits in var_off
. The code is as follows, found in kernel/bpf/verifier.c
:
To understand this code, we must first explain the purpose of several functions used here.
-- tnum_range
is a function that generates a tnum
corresponding to the possible values in a given range of unsigned integers.
-- tnum_cast
creates a new tnum
based on the lowermost bits of an existing tnum
. Here, it is used to return the lower 32 bits of reg->var_off
.
-- tnum_lshift
/ tnum_rshift
perform shifts on tnum
s. Here, they are used together to clear the lower 32 bits of a tnum
.
-- tnum_intersect
takes two tnum
arguments both pertaining to a single value, and returns a single tnum
that synthesizes all knowledge conveyed by the arguments.
-- tnum_or
returns a tnum
indicating our knowledge of the bitwise OR of two operands, where the two arguments indicate our knowledge of the two operands.
Now consider a register with umin_value = 1
and umax_value = 2ˆ32+1
, and var_off
otherwise unconstrained. Then reg->umin_value & mask
and reg->umax_value & mask
will both be 1. Thus, the resulting tnum_range
will indicate that the lowest bit is known to be 1, and that all other bits are known to be 0. tnum_intersect
will preserve this information and mark all lower 32 bits as known in its output. Finally, tnum_or
will reintroduce uncertainty in the upper 32 bits, but because hi32
indicates that the lower 32 bits are known zeros, the lower 32 bits from the tnum_intersect
will be preserved. This means that after this function finishes, the lower half of var_off
will be marked as known binary 00...01
.
However, this conclusion is not justified. Just because umin
and umax
both end in binary 00...01
doesn’t mean that each value in between also does. For example, the register’s true value could easily have been 2. As we will see, this initial incorrect assumption made by the verifier can be used to break further assumptions.
Triggering the Bug
To actually trigger the described bug, we need to have a register that fulfills the following conditions:
-- During execution, the actual value in the register is 2.
-- The register’s umin_val
is set to 1, and umax_val
is set to 2ˆ32+1.
-- A conditional jump with a 32-bit comparison on this register is executed.
We cannot directly load the value 2 into the register with a mov
, as the verifier would then know that umin_val=umax_val=2
. However, there is an easy workaround. If we load the register from a map (we’ll use our input map inmap
), then the verifier will have no information about its value, as map values can be changed during runtime.
To set umin_val
and umax_val
we can use the verifier’s jump branch logic:
BPF_JMP_IMM(BPF_JGE, BPF_REG_2, 1, 1)
BPF_RAW_INSN(BPF_JMP | BPF_EXIT, 0, 0, 0, 0)
The conditional jump will result in two branches. In the one where the branch is taken, the verifier knows that BPF_REG_2 >= 1
, while the other branch ends with the exit instruction and is discarded. Thus, for all further instructions the umin_val
of register 2 will be 1.
Analogously, another conditional jump can be used to set umax_val
to 2ˆ32+1
. However, here we need to compare against a register, because only 32-bit immediate values are supported. After this, we have umin_val
and umax_val
set as we wanted.
Now, any conditional 32-bit jump can be used to trigger the bug:
BPF_JMP32_IMM(BPF_JNE, BPF_REG_2, 5, 1),
BPF_RAW_INSN(BPF_JMP | BPF_EXIT, 0, 0, 0, 0),
The verifier now thinks that the last 32 bits of register 2 are binary 00...01
when in reality they are binary 00...10
. After two additional instructions:
BPF_ALU64_IMM(BPF_AND, BPF_REG_2, 2),
BPF_ALU64_IMM(BPF_RSH, BPF_REG_2, 1),
The verifier now assumes that register 2 has to be 0 because the AND
instruction necessarily has to result in 0 if the second-last bit of the register 2 was 0, but in reality it is (2&2)>>1 = 1
. This is a very useful primitive, as we can now multiply register 2 with any number to create an arbitrary value that the verifier believes to be 0.
Bypassing ALU Sanitation
While rechecking all pointer arithmetic in runtime seems like a good idea, the implementation is severely lacking. The main problem seems to be that the values aren’t actually limited to the range that the static portion of the verifier assumes them to be in, but rather to a larger limit that seems safe at that moment but isn’t necessarily so with respect to the future. To understand the distinction, consider the following example:
-- Register 1 contains a pointer p to the beginning of 4000 bytes of safe memory. Assume that the type of *p
is just one byte in size. This means that adding 3999 to p
is perfectly fine but adding 4000 would leave the bounds.
-- Register 2 contains a scalar a
, which the static verifier believes to be 1000.
-- Now, add register 2 to register 1. The static verifier has no problem with this, as p+1000
is still within the bounds of the memory region. Because this is arithmetic involving a pointer, the ALU sanitizer will set an alu_limit
. However, even though the verifier believes that a
is no larger than 1000, the limit is not set to 1000, but rather to 3999, because this is the largest value still believed to be safe.
-- Add register 2 to register 1 again. This time, the static part of the verifier believes that p
points to position 1000 in the 4000 byte memory block. The alu_limit
will now be set to 2999, because this is the largest legal value that can still be added.
Now consider what happens if we have used a bug in the static analysis to trick the verifier into believing that a = 1000
, while in reality we set a = 2500
. The value will still pass both runtime checks without being forced to 0, because 2500 is less than both 3999 and 2999. However, in total we have added 5000 to the pointer p
, resulting in an address well outside the safe memory range. While each alu_limit
is sufficiently low individually, in combination they are insufficient. The second limit calculation only catches errors that the static analysis makes about the second addition, while assuming that it was completely correct about the first one.
Achieving OOB Read & Write
Now achieving OOB memory access is just a matter of combining the pieces. First, we use the JMP32
bug to set a register to 1, while the verifier believes it has to be 0. Assume that we have a pointer p
to a map of size 8000 bytes (the exploit uses slightly different values). For the purpose of the exploit, it’s most convenient to get OOB access before the beginning of the map, not after the end, so the first step is to add a large value like 6000 to p
so that we can “mirror” the ALU sanitation bypass described above.
Next, we manufacture a register that has an actual value of 6000 while the verifier thinks it contains 0. We do this by multiplying our original “fake” register with 6000. Now we can subtract that from p
. This works because the alu_limit
will be set to 6000, the largest legal amount we can subtract. However, the static part of the verifier will still assume that p
points to position 6000 in our map. Thus, we can now subtract any value up to 6000 from p
, while p
in reality points to the start of the map.
We now have a pointer up to 6000 bytes before the start of the map’s contents. Because the verifier still believes this pointer to be inside the map’s boundaries, it now allows us to read from and write to this address. For convenience, we can even make the operation and offset (and for writes, the value written) depend on parameters in our input map inmap
. For an OOB read, the read value is written to outmap
. As maps can be read and modified from user space through the bpf
syscall, we can now load this program just once and then repeatedly trigger it with different parameters for an arbitrary OOB read/write gadget.
Escalating Privileges
Leaking KASLR
Conveniently for us, the map’s contents are not stored in a heap chunk by itself, but rather they are placed at the end of a larger struct called struct bpf_map
(defined in include/linux/bpf.h
). This struct contains some useful members:
Particularly useful is the pointer ops
which points to a vtable of functions, depending on the map type. In this case, as the map is an arraymap
, it points to array_map_ops
. As this is a constant struct at a fixed location in rodata
(it’s even an exported symbol), reading this can be used to directly bypass KASLR.
Arbitrary Read
Also useful in the struct is the pointer struct btf *btf
, which points to an additional struct containing debug info. This pointer is normally unused (and set to NULL), which makes it a good target to overwrite without messing things up too much.
It turns out that this pointer can be conveniently accessed through the bpf_map_get_info_by_fd-function
, which in turn is called by the BPF_OBJ_GET_INFO_BY_FD
command of the bpf
syscall if the file descriptor of the map is provided. This function effectively performs the following (the full function is found in kernel/bpf/syscall.c
):
This means that if we use our OOB-write primitive to set map->btf
to someaddr - offsetof(struct btf, id)
, then BPF_OBJ_GET_INFO_BY_FD
returns *someaddr
in info.btf_id
. Since the id field of struct btf
is a u32, this primitive can be used to read 4 bytes at a time from an arbitrary address.
Finding process structs
To find the cred
and files
structs of the process, we first search for init_pid_ns
, the default process namespace (if running in a container, we would need to find the corresponding namespace first). This might not be the quickest way, but it works.
There are two ways to find init_pid_ns
:
-- If we know the offset between array_map_ops
and init_pid_ns
, we can simply add it to the address of array_map_ops
, which we already know. This offset does not depend on KASLR, but it isn’t stable across kernel updates.
-- Instead, we can find init_pid_ns
in the ksymtab
and kstrtab
segments. To do this, we first find the beginning of kstrtab
by iterative search beginning at array_map_ops
. Then, we find the null-terminated string “init_pid_ns” in kstrtab
, again by a simple iterative search. A final iteration over ksymtab
finds the symbol entry that references the kstrtab
entry, and also contains the relative address of init_pid_ns
.
Once init_pid_ns
is found, its radix tree can be iterated to find a pointer into the task
struct that corresponds to the pid of the exploit process. In fact, this is exactly the mechanism the kernel itself uses to find a task
by its pid.
Using the tasks struct, both the cred
struct containing the user privileges and an array of open file descriptors can be found. The latter can be indexed by the file descriptor of the loaded explmap
to obtain the corresponding struct file of the map. In turn, the private_data
of this structure points to the struct bpf_map
. This means now we also know the address of the contents of explmap
.
Arbitrary Write
Note that by overwriting ops
using OOB-write we can control RIP to any function to which we have a pointer. However, the first argument in RDI will always be set to the bpf_map-struct
, which rules out a lot of existing functions. Thus, it seems natural to overwrite certain elements of array_map_ops
to different map operations, which can at least handle the first argument correctly.
To do this, we first load a complete copy of the array_map_ops
table into the data of explmap
using the arbitrary read primitive as well as the bpf
syscall with BPF_MAP_UPDATE_ELEM
. The only modification to this copy is that the map_push_elem
member, normally unused in arraymaps
, is overwritten with the map_get_next_key
operation.
The arraymap
implementation of map_get_next_key
is as follows (from kernel/bpf/arraymap.c
):
If we control next_key
and key
, then *next = index + 1;
can be used as an arbitrary write primitive under the condition that index < array->map.max_entries
. If map->max_entries
can be set to 0xffffffff
, this check will always pass (except for index=0xffffffff
, which is actually what we are going to use, but this is fine as *next
is still set to 0=index+1
).
Because we already obtained a pointer to the data of explmap
, we can now overwrite explmap->array_map_ops
to point to our modified operations table.
Note that the signature of map_push_elem
, which we have overwritten with map_get_next_key
, is:
However, map_push_elem
is called by the BPF_MAP_UPDATE_ELEM
command only if the map has type BPF_MAP_TYPE_STACK
or BPF_MAP_TYPE_QUEUE
. But if this is the case, we can directly control the flags
argument, which will be interpreted as next_key
, as well as the value
argument.
To trigger the arbitrary write gadget, we have to perform the following OOB writes in sequence:
-- Set ops
to our fake vtable inside the explmap
buffer.
-- Set explmap->spin_lock_off
to 0 to pass some additional checks.
-- Set explmap->max_entries
to 0xffffffff
to pass the check in array_map_get_next_key
.
-- Set explmap->map_type
to BPF_MAP_TYPE_STACK
to be able to reach map_push_elem
.
The arbitrary 32-bit write can now be triggered by passing appropriate arguments to BPF_MAP_UPDATE_ELEM
. After all writes are performed, the fields should be reset to their original values to prevent crashes on cleanup.
Getting root privileges
Getting root is now trivial: We just need to set the uid
to 0 in the cred
struct of the process. Now we can execute arbitrary privileged commands or drop into an interactive shell.
Thanks again to Manfred for such a thorough write-up of this local privilege escalation. This was Manfred’s first time at Pwn2Own, and we hope to see more of his work in the future. Until then, follow the team for the latest in exploit techniques and security patches.