now or never exec
Some early followup from efforts to improve browser security with more details about possible refinements to W^X.
The first obvious improvement would be to simply enforce W^X in the kernel. Userland isn’t ready, not nearly ready, for this change, though of course making such a change would go a long way towards assessing success. How do we know we’re done until we know there cannot be any W|X mappings? (Referring here to ports and the extended userland. OpenBSD userland is ready.)
By itself, this is a trivial two line change to mmap.
if ((prot & (PROT_WRITE | PROT_EXEC)) == (PROT_WRITE | PROT_EXEC))
	return EPERM;
And the same for mprotect. Probably EINVAL is the better error, but EPERM shows up nicer in ktrace for now.
This isn’t too interesting on its own. It’s obviously not something that couldn’t have been done ten years ago, if it weren’t for a concern for collateral damage, but it’s both the start and end point for our little story. Right now, if one runs
sudo procmap -p `pgrep firefox` | grep write/exec
the output looks something like this.
00001068947EE000 64K read/write/exec [ anon ]
000010689FA8A000 64K read/write/exec [ anon ]
00001068B5883000 64K read/write/exec [ anon ]
00001068FB43E000 64K read/write/exec [ anon ]
0000106912BBF000 64K read/write/exec [ anon ]
0000106913B06000 64K read/write/exec [ anon ]
0000106922BCF000 64K read/write/exec [ anon ]
0000106925ED3000 64K read/write/exec [ anon ]
000010692E8E3000 64K read/write/exec [ anon ]
0000106965C60000 64K read/write/exec [ anon ]
0000106975C3B000 64K read/write/exec [ anon ]
000010698D84E000 64K read/write/exec [ anon ]
That’s not pretty enough for me.
A vaguely related change I made to luajit a little while ago is worth revisiting. On i386 in particular, the initial page protections for an allocation determine where in the address space it goes. That’s because i386 does not have per page execute protections, but instead restricts executable permissions by setting the code segment to only extend so far upwards. Just far enough to cover the program’s text (and libraries), but not the heap or stack. As executable mappings are added to a process, the segment must be extended to cover them. However, the segment must also cover all the mappings in between.
Usually, this isn’t such a problem. i386 binaries and libraries are built so that text and data are virtually separated by a gap, and UVM knows, based on the initial mmap protections, where in the address space to place allocations. However, a JIT confuses things. The JIT engine writes out some new code, mprotects it, and suddenly some parts of the heap become executable. Except this grows the segment, and now most or even all of the heap has become executable. If only UVM had some way of knowing that the JIT would like to exec this memory...
Oh, that’s easy. Have the initial mmap specify PROT_EXEC. Then it gets properly hinted into the executable part of the address space. Such an allocation isn’t much use, though, without an immediate mprotect to make it writable so that the cool JIT codes can be written, requiring a kind of two step dance. The good news is it’s harmless on systems other than OpenBSD, and probably not that costly in the big picture. So no kernel changes here, and just a short diff for userland.
At the current time, i386 benefits the most from mmap protection hints, but one can imagine amd64 and other architectures adjusting their ASLR policy based on protection. Maybe we want to move the code farther away from other heap structures, or hide it all below the heap, so that it’s harder to hit.
If we combine the two above ideas, we get what I’m tentatively calling now or never exec, with the awesomely ambiguous codename nonexec. UVM also has a concept of max protection, although this feature isn’t utilized much. For example, if you mmap a file opened readonly, maxprot excludes PROT_WRITE, so that the mapping can’t be made writable later. We can apply a similar principle to mmap with yet another two line diff. Roughly.
-maxprot = PROT_READ | PROT_WRITE | PROT_EXEC;
+maxprot = PROT_READ | PROT_WRITE | (prot & PROT_EXEC);
In effect, the advisory protection hint from before is now mandatory. It’s the only way to allocate memory that will ever become executable. No longer can arbitrary regions be upgraded to exec, even if one drops write permissions at the same time. This is probably the strictest protection policy one can reasonably enforce. People like their JITs, and JITs like their executable allocations. Hopefully most JIT engines are structured so that they know in advance which allocations will become executable.
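To make the policy concrete, here is a toy userland model of it. This is not UVM code: nonexec_maxprot and mprotect_allowed are hypothetical names, and the real kernel tracks maxprot per map entry rather than passing it around as an argument.

```c
#include <sys/mman.h>

/*
 * Hypothetical model: compute the max protection a mapping gets at
 * mmap time. Exec is only ever possible if it was requested up front.
 */
int
nonexec_maxprot(int prot)
{
	return PROT_READ | PROT_WRITE | (prot & PROT_EXEC);
}

/*
 * Hypothetical model: a later mprotect is allowed only if it asks for
 * nothing outside maxprot. Returns 1 if allowed, 0 if not.
 */
int
mprotect_allowed(int maxprot, int newprot)
{
	return (newprot & ~maxprot) == 0;
}
```

A mapping created read/write can never gain exec under this model, while one created read/exec can move between writable and executable. Note that maxprot by itself would still permit both at once on such a mapping. Refusing that is the job of the separate W|X check.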
What does this accomplish? Over W^X? Honestly, perhaps not much. I can try to imagine an exploit that tricks a JIT into mprotecting the wrong part of the heap and then jumps to its shellcode (which this policy blocks), but such an exploit would likely have more tricks up its sleeve. Nevertheless, being conservative in what one accepts has complicated many an exploit by removing convenient stepping stones. The implementation cost is practically nil. The runtime cost is very low. The conversion cost, assuming working W^X, is low. And as a side benefit, it will help identify code that doesn’t provide exec hints.
There is a diff, but it’s secret. :)
The nonexec policy is certainly helpful as a development aid, but that doesn’t mean it’s the best policy for OpenBSD. There’s a lot of software out there to fix. Fix may even be the wrong word. W^X is the right thing to do in general, but until just this instant nobody was told there would also be rules about X first, W later. We could provide a sysctl, and in fact vm.nonexec is exactly what I have. Unfortunately, such an option doesn’t have much value to users.
Maybe all your software has been fixed and you run with nonexec=1. But by virtue of all your software being fixed, it doesn’t need the kernel to deny its W|X mappings; it doesn’t request any W|X mappings.
Or you depend on a program that hasn’t been fixed and you run with nonexec=0. Your program does request W|X mappings, but the kernel has been told that’s ok.
This sounds like a knob where the intersection between the desired setting and the necessary setting may be rather empty.
The luajit patch above was only for 32 bit platforms. At the time, I was only concerned with i386 segments, meaning luajit didn’t work with nonexec on amd64 out of the gate, but a similarly short diff restored everything to working order.
While getting up to speed with node, I ran into a minor build issue requiring a two line patch. Good news! An important part of this project is getting changes pushed upstream. We could for a short while keep a pile of patches in the OpenBSD ports tree, but such patches inevitably decay and cease applying after their expiration date. Getting to submit a small fix and gauge upstream’s response is probably better than unloading a more intrusive W^X on them. Alas, I was not the first to discover this bug, but actually the fifth. I’ll be watching this bug and a duplicate. One question answered, anyway. (Fixed at last!)