Sandboxing: Conclusion

In total I’ve written five methods for sandboxing code. These are certainly not the only methods but they’re mostly simple to use, and they’re what I’ve personally used.

A large part of this sandboxing was only possible because I built the code to work this way. I split everything into privileged and unprivileged groups, and I determined my attack surface. By moving the sandboxing after the privileged code and before the attack surface I minimized risk of exploitation. Considering security before you write any code will make a very big difference.

One caveat here is that SyslogParse can no longer write files. What if, after creating rules for iptables and apparmor, I want to write them to files? It seems like I have to undo all of my sandboxing. But I don’t – there is a simple way to do this. All I need is to have SyslogParse spawned by another privileged process, and have that process get the output from SyslogParse, validate it, and then write that to a file.

One benefit of this “broker” process architecture is that you can actually move all of the privileged code out of SyslogParse. You can launch it in another user, in a chroot environment, and pass it a file descriptor or buffer from the privileged parent.

The downside is that the parent must remain root the entire time, and flaws in the parent could lead to it being exploited – attacks like this should be difficult as the broker could would be very small.

Hopefully others can read these articles and apply it to their own programs. If you build a program with what I’ve written in mind it’s very easy to write sandboxed software, especially with a broker architecture. You’ll make an attacker miserable if you can make use of all of this – their only real course of action is to attack the kernel, and thanks to seccomp you’ve made that a pain too.

Before you write your next project, think about how you can lock it down before you start writing code.

If you have anything to add to what I’ve written – suggestions, corrections, random thoughts – I’d be happy to read comments about it and update the articles.

Here’s a link to all of the articles:

Seccomp Filters:

Linux Capabilities:

Chroot Sandbox:


And here’s a link to the GitHub for SyslogParse:

Sandboxing: Seccomp Filters

This is the first installment on a series of various sandboxing techniques that I’ve used in my own code to restrict an applications capabilities. You can find a shorter overview of these techniques here. This article will be discussing seccomp filters.

What is Seccomp? An Introduction:

System calls are your way of asking the kernel to do something for you. You send a message saying “Hey, open a file for me” and it’ll probably do it for you, barring permission errors or some other issue. But, if you can talk to the kernel, you can exploit the kernel. Many vulnerabilities are found in kernel system calls, leading to full root privileges – bypassing sandboxing techniques like SELinux, Apparmor, namespaces, chroots, you name it. So, how do we deal with this without patching the kernel, as a developer? Seccomp filters.

Seccomp is a way for a program to register a set of rules with the kernel. These rules deal with the system calls a program can make, and which parameters it can send with them.

When you create your rules you get a nice overview of your kernel attack surface. Those calls are the ways your attacker can attack the kernel. On top of that ,you’ve just reduced kernel attack surface – if an attacker requires system call A and you’ve only allowed system calls B through D, they can’t attack with system call A.

Another nice benefit is the ability to restrict capabilities. If your program never writes a file, don’t give it access to the write() system call. Now you’ve reduced the kernel attack surface, but you’ve also stopped the program from writing files.

The Code:

Seccomp code is fairly simple to use, though I haven’t found any really good documentation. Here is the seccomp code used in my program, SyslogParse, to restrict its system calls.

scmp_filter_ctx ctx;
ctx = seccomp_init(SCMP_ACT_KILL);

seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(rt_sigreturn), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(tgkill), 0);

seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(access), 0);

seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 2,

seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(fstat), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(open), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(close), 0);

seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(brk), 0);

seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mprotect), 1,

seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mmap), 2,

seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(munmap), 2,

seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(madvise), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(futex), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(execve), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(clone), 0);

seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(getrlimit), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(rt_sigaction), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(rt_sigprocmask), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(set_robust_list), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(set_tid_address), 0);

if(seccomp_load(ctx) != 0) //activate filter
err(0, “seccomp_load failed”);

I’ll go through this bit by bit.

scmp_filter_ctx ctx;
ctx = seccomp_init(SCMP_ACT_KILL);

This should be fairly simple to understand if you’ve written basically any code. This instantiates the seccomp filter, “ctx”, and then initializes it to kill on rule violations. Simple.

seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(futex), 0);

This line is a rule for the “futex” system call. The first parameter, “ctx”, is our instantiated filter. “SCMP_ACT_ALLOW”, the second parameter, is saying to allow when a condition is met. The third is a macro for the futex() system call, as that’s the call we want to allow through the filter. The last parameter, “0”, is how many rules we want to add to this that deal with parameters.

Simple. So this rule will allow any futex system call regardless of parameters.

I chose futex in this example to demonstrate that seccomp can not protect you from every attack. Despite the heavy amount of sandboxing I’ve done in this program, this filter will do nothing to stop attacks that use the futex system call. Recently, one vulnerability was found that could do just that – a call to futex would lead to control over the kernel. Seccomp just isn’t all powerful, but it’s a big improvement.

Note: I found all of these syscalls by repeatedly running strace on SyslogParse with different parameters. Strace will list all of the system calls as well as their arguments and makes creating rules very easy.

if(seccomp_load(ctx) != 0) //activate filter
err(0, "seccomp_load failed");

seccomp_load(ctx) will load up the filter and from this point on it is enforced. In this case I’ve wrapped it to ensure that it either loads properly or the program won’t run.

And that’s it. That’s all the code it takes. If the program makes a call to any other system call it crashes with “Bad System Call”.

Seccomp is quite easy to use and is the first thing I’d make use of if you are considering sandboxing. All sandboxing relies on a strong keernel, but as a developer you can only change your own program, and seccomp is a good way to reduce kernel attack surface and make all other sandboxes more effective.

Linux has something like 200 system calls (can’t find a good source, anyone know a more definitive number?), and SyslogParse has dropped that down to about 22. That’s a nice drop in privileges and attack surface.

Next Up: Linux Capabilities

Explaining Chrome’s Linux Sandbox

Note: The documentation for Chrome’s Linux sandbox is lacking. This is my attempt to make sense of it and clarify how it works for users who may not want to sift through multiple docs on the subject. If I have misinterpreted, let me know, some of the docs are out of date and I may not have been informed.

Chrome is well known for its sandbox, which has held up incredibly well over the years – not a single in the wild attack against it. But on the Linux side of things it’s even more impressive, Chrome’s sandbox is immensely more powerful than on Windows. Though the architecture is similar, the mechanism is fairly different.

Chrome’s architecture is made up of multiple parts – on Linux there is a broker, your SetUID Sandbox process, and your tabs, renderer, plugins, and extensions (the Zygote processes).

The Chrome-Sandbox SUID Helper Binary launches when Chrome does, and sets up the sandbox environment. The sandbox environment is meant to be restrictive to the file system and other processes, attempting to isolate various Chrome parts (such as the renderer) from the system.

A sandboxed process is put inside a Chroot, a sort of a virtual file system (chroot = change root, it’s a new root). It basically gets its own file system to work with, an din this case, it’s not given any write access to the system. The limitations imposed on the process prevent it from escaping the chroot.

The sandboxed process is also provided a PID namespace (a way for a process to look like it’s standalone on the machine,  or among a subset of processes), denying it the ability to use ptrace() or kill() other processes. ptrace() in particular is dangerous as it allows processes to read or manipulate data in other processes. Sandboxed processes are unable to ptrace() each other as well (set to undumpable).

A network namespace is used as well in order to prevent sandboxed processes from connecting out – not much documentation on this.

The Broker process, which remains unrestricted by SUID, is what handles decisions about downloading files, writing to the disk, etc. It handles the dangerous stuff, and is unrestricted, but it is separated from the areas of the program that are most open to attack. Using an Apparmor profile will allow restriction even of the broker process. Otherwise it remains confined purely by DAC.

The next layer of restriction is provided by the Seccomp-BPF sandbox. Seccomp filters are something I’ve written about before. Their goal isn’t to protect the system from damage, like the SUID sandbox does, but to protect the system from further exploitation.

Seccomp-BPF works by restricting the system calls that programs can make. The implications of this are covered in this post. A quick summary is that a sandbox, or any form of access control, is only as powerful as the kernel. It is very often the case that, rather than trying to find issues with the sandbox itself, an attacker can simply go after the big buggy kernel running underneath it. An attack on the kernel allows for a full bypass of the sandbox.*

Seccomp works by restricting access to the kernel by filtering the ‘calls’ that can be made to it. The fewer calls a program can make the fewer ways it can exploit the system. Suddenly the kernel isn’t this massive glob of attack surface, it’s a much smaller are, with monitored interaction between it and the program.

Chrome on Windows had its sandbox broken at Pwn2Own by MWRLabs. It was, in fact, a local kernel vulnerable that allowed them to bypass the sandbox once they’d gained access to the renderer. Such an attack would be far more difficult on a Linux system with Seccomp enabled.

Overall the sandbox works by reducing the potential for damage and reducing the potential for local exploitation. Chrome is, as always, pouring work into their security. Their sandbox is very impressive, and I would love to see some research into breaking it.

There was a ‘partial reward’ for PinkiePie exploiting ChromeOS, but it was unreliable. No details have been released yet, quite unfortunately.

*Plug for Grsecurity here. See my guide for setting up a hardened Grsec kernel. Seccomp limits kernel attack surface, Grsecurity makes the entire kernel more difficult to exploit.

Chrome Seccomp-BPF Sandbox

Chrome://sandbox has gotten an update reflecting the newly implemented Mode 2 Seccomp Filters implemented through the Berkley Packet Filter (BPF). To learn more about Syscall and Seccomp Filtering you can read this post and learn about how Chrome’s new sandbox on Linux.

Chrome’s seccomp sandbox is a powerful restriction on how Chrome can interact with the system’s kernel. This limitation is an effective way to prevent kernel exploitation, which is a wonderful reinforcement to Chrome’s SUID sandbox. The seccomp sandbox is ideal for a program like chrome, programs that already implement some form of sandboxing. The best way to escape from a sandbox, outside of a sandbox design issue, is to exploit the kernel – doing so allows you to bypass almost any security implemented, and the seccomp sandbox attempts to mitigate this threat.

Check to make sure that you’re adequately sandboxed by going to chrome://sandbox.


Why I Sandbox Chrome With AppArmor

Google Chrome is a browser designed with least privilege in mind. The Chrome multiprocess architecture sandboxes each tab, the renderer,  the GPU, and extensions and has them use IPC to talk to the ‘browser’ process, which runs with higher rights. The idea is that all untrusted code (websites) is dealt with on the lowest possible level (the renderer has virtually no rights) and then the renderer deals with the trusted browser process. It’s very effective and there hasn’t been a single Chrome exploit in the wild.

On Linux the Chrome sandbox makes use of a Chroot, seccomp mode 2 filters, SUID, and a few other techniques. On the outside this seems really secure, the problem is that the documentation is outdated and not nearly as clear as the Windows documentation.

To use Chroot you need root, so for the browser process to Chroot the other processes it needs root. Chrome seems to find a way around this using SUID where it runs as root under a separate name, I don’t really know, again the documentation doesn’t cover this at all.

Basically, it sounds really strong but if I don’t understand something I can’t consider it secure.

That’s why I apparmor Chrome. I know how AppArmor works, I know it’s track record, I know what my profile allows and what it doesn’t allow. And I know that even if Chrome is running at root my apparmor profile will limit it.

I would post my AppArmor profile for Chrome up here but it’s fairly specific to my needs. For those of you looking to sandbox Chrome make sure you use a separate profile for the sandbox, chrome itself, and the native client bootstrap.

Seccomp Mode 2 Filters

Just a short post to bring attention to seccomp mode 2 filters. There is not enough hype about this, probably because it’s not in the vanilla kernel yet (that I know of.)

Seccomp filters let programs whitelist calls that they can make to the kernel. Whitelisting syscalls reduces kernel attack surface, which will prevent privilege escalation exploits. Seccomp is already built into Chrome/ Chromium to reinforce the Chrome Linux sandbox, OpenSSL 6.0 supports it as well as vsftpd. I’d really like to see it in cupsd and various other services (actually I’d like to see a lot compiled with it.)

The central idea of Seccomp filters is to limit interaction with the Linux Kernel. If you can’t access code you’re gonna have a hell of a hard time exploiting it – limiting interaction limits attack surface. Support for verifying syscall parameters is still in the works but the sandbox is very powerful. Any system using the Linux 3.5 kernel has support for Seccomp.