On 12th February 2019, the runC maintainers announced a patch for a critical vulnerability (CVE-2019-5736) reported by Adam Iwaniuk and Borys Popławski. According to Aleksa Sarai, one of the runC maintainers, the vulnerability allows a malicious container to gain root-level code execution on the host by overwriting the host runC binary.
The vulnerability was rated high severity because a malicious container could overwrite the host binary with minimal user interaction. The only interaction required is running a command as root inside a container in either of the following contexts:
• Creating a new container using an attacker-controlled image.
• Attaching (docker exec) into an existing container which the attacker had previous write access to.
What is runC?
runC is a container runtime that was originally created as part of Docker. It was later extracted and released as a standalone open source tool and library. runC is a “low-level” container runtime, and it is mostly used by “high-level” container runtimes such as Docker to spawn containers.
High-level container runtimes such as Docker provide functions like image creation and management. To actually run containers, they delegate to runC for tasks such as creating a container and attaching a process to an existing container.
An attacker can launch an attack either when runC is attaching to a running container or when starting a container running a specially crafted image.
To fully exploit the vulnerability, an attacker with root access inside the container uses /proc/[runc-pid]/exe to obtain a reference to the runC binary and then overwrites it.
For example, when runC attaches to a container the attacker can trick
it into executing itself. This could be done by replacing the target
binary inside the container with a custom binary pointing back at the
runC binary itself. As an example, if the target binary was /bin/bash,
this could be replaced with an executable script specifying the
interpreter path #!/proc/self/exe (/proc/self/exe is a symbolic link
created by the kernel for every process which points to the binary
that was executed for that process).
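To make the role of /proc/self/exe concrete, the following minimal Python sketch resolves the link for the current process. The function name current_executable is a hypothetical helper chosen for illustration, and the snippet assumes a Linux system with /proc mounted.

```python
import os


def current_executable() -> str:
    """Return the path of the binary the calling process is executing."""
    # The kernel maintains /proc/self/exe as a symbolic link to the
    # executable of whichever process resolves it.
    return os.readlink("/proc/self/exe")


if __name__ == "__main__":
    # Run under Python, this prints the interpreter binary, not the script.
    print(current_executable())
```

Run as a script, this prints the path of the Python interpreter rather than the script file. The same indirection is what lets a #!/proc/self/exe script resolve to runC when runC is the process performing the exec.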
Suppose runC is attaching a process to a running container and the target binary is /bin/bash. If the attacker has replaced that target with a script whose interpreter line is #!/proc/self/exe, the kernel runs /proc/self/exe as the interpreter, and inside the runC process that path points to the host runC binary. The attacker can then try to overwrite the runC binary, but the kernel blocks the write for as long as runC is executing.
To get around this restriction, the attacker opens a file descriptor to /proc/self/exe with the O_PATH flag, then reopens the binary for writing through /proc/self/fd/<nr> and retries the write in a loop.
The write succeeds once the runC process exits. The host binary is then compromised and can be used to launch attacks on other containers.
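The descriptor mechanics behind this step can be observed safely on a scratch file. The sketch below is not the exploit. It only shows that a handle opened with O_PATH, which cannot itself be written to, can be reopened with write access through /proc/self/fd. The helper name overwrite_via_proc_fd is hypothetical, and a Linux system with /proc mounted is assumed.

```python
import os
import tempfile


def overwrite_via_proc_fd(path: str, payload: bytes) -> None:
    """Overwrite `path` by reopening an O_PATH handle through /proc/self/fd."""
    # An O_PATH descriptor only names the file; it cannot be read or written.
    path_fd = os.open(path, os.O_PATH)
    try:
        # Opening /proc/self/fd/<nr> follows the handle back to the same file
        # and yields a new descriptor with the access mode requested here.
        write_fd = os.open(f"/proc/self/fd/{path_fd}", os.O_WRONLY | os.O_TRUNC)
        try:
            os.write(write_fd, payload)
        finally:
            os.close(write_fd)
    finally:
        os.close(path_fd)


if __name__ == "__main__":
    # Demonstrate on a throwaway temporary file only.
    with tempfile.NamedTemporaryFile(delete=False) as scratch:
        scratch.write(b"original contents\n")
    overwrite_via_proc_fd(scratch.name, b"replaced\n")
    with open(scratch.name, "rb") as f:
        print(f.read())
    os.unlink(scratch.name)
```

On an ordinary file the reopen succeeds immediately. Against a binary that is currently executing, the write-mode open fails with ETXTBSY ("Text file busy"), which is why the attack has to keep retrying until the process exits.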
LXC is also affected by this vulnerability. However, no CVE number has been assigned for it, because the LXC project considers privileged containers unsafe:
"As privileged containers are considered unsafe, we typically will not
consider new container escape exploits to be security issues worthy of
a CVE and quick fix. We will however try to mitigate those issues so
that accidental damage to the host is prevented."
To prevent attacks emerging as a result of the runC vulnerability, LXC has been patched.
To prevent this attack, LXC has been patched to create a temporary
copy of the calling binary itself when it starts or attaches to
containers. To do this LXC creates an anonymous, in-memory file using
the memfd_create() system call and copies itself into the temporary
in-memory file, which is then sealed to prevent further modifications.
LXC then executes this sealed, in-memory file instead of the original
on-disk binary. Any compromising write operations from a privileged
container to the host LXC binary will then write to the temporary
in-memory binary and not to the host binary on-disk, preserving the
integrity of the host LXC binary. Also as the temporary, in-memory LXC
binary is sealed, writes to this will also fail.
Note: memfd_create() was added to the Linux kernel in the 3.17 release.
With this strategy, when LXC starts or attaches to a container, it re-executes itself from a temporary copy. The /proc/[pid]/exe link of the running process then refers to the temporary file, which makes it difficult to reach the on-disk binary from within the container. To do that, LXC uses the memfd_create() system call to create an anonymous in-memory file, copies itself into it, and seals it against further modification.
LXC then runs the sealed in-memory file instead of the on-disk binary. If an attacker uses root privileges inside the container to write to the host LXC binary, the write targets the temporary in-memory copy rather than the original on disk, which preserves the integrity of the host LXC binary.
Because the temporary copy is sealed, writes to it fail as well, which makes it difficult for an attacker to exploit the vulnerability.
PoC for CVE-2019-5736
This is a Go implementation of CVE-2019-5736, a container escape for Docker. The exploit works by overwriting and executing the host systems runc binary from within the container.
How does the exploit work?
There are 2 use cases for the exploit. The first (which is what this repo is), is essentially a trap. An attacker would need to get command execution inside a container and start a malicious binary which would listen. When someone (attacker or victim) uses docker exec to get into the container, this will trigger the exploit which will allow code execution as root.
Fedora is one of the affected Linux distributions: under its default SELinux policy, container processes run as container_runtime_t, which does not block the attack. Any operating system relying on the default AppArmor policy is also vulnerable.
However, Red Hat Enterprise Linux and CentOS are not vulnerable, because they ship updated SELinux policies rather than the default policy.
The vulnerability is exploitable only if a process running inside the container is hostile or untrusted. If every process running inside the container is trusted, the system is not exposed.
There are two primary strategies for dealing with the runC vulnerability. One is to mitigate it while still running the affected version, and the other is to upgrade runC to a patched version.
Run as a non-0 user
For the runC exploit to work, the process in the container must run as root (UID 0). One mitigation is therefore to ensure that all containers run as a non-0 user.
The user can be set through the pod specification or the container image:
    apiVersion: v1
    kind: Pod
    metadata:
      name: run-as-uid-1000
    spec:
      securityContext:
        runAsUser: 1000
      # ...
The non-root requirement can also be enforced using a PodSecurityPolicy:
    apiVersion: policy/v1beta1
    kind: PodSecurityPolicy
    metadata:
      name: non-root
    spec:
      privileged: false
      allowPrivilegeEscalation: false
      runAsUser:
        # Require the container to run without root privileges.
        rule: 'MustRunAsNonRoot'
Setting up a policy that prevents container processes from running as root is crucial in mitigating the runC vulnerability.
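As a complement to cluster-side enforcement, manifests can be checked before they are deployed. The following is a minimal sketch, assuming the pod manifest has already been parsed into a Python dictionary (for example by a YAML library). The function name containers_that_may_run_as_root is hypothetical, and the check only considers the runAsUser and runAsNonRoot fields, with container-level settings overriding pod-level ones.

```python
from typing import Any, Dict, List


def containers_that_may_run_as_root(pod: Dict[str, Any]) -> List[str]:
    """List containers whose securityContext does not rule out UID 0."""
    spec = pod.get("spec") or {}
    pod_ctx = spec.get("securityContext") or {}
    containers = (spec.get("initContainers") or []) + (spec.get("containers") or [])
    flagged = []
    for container in containers:
        ctx = container.get("securityContext") or {}
        # Container-level values take precedence over pod-level values.
        uid = ctx.get("runAsUser", pod_ctx.get("runAsUser"))
        non_root = ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot"))
        # Safe if an explicit non-zero UID is set, or if no UID is set but
        # the runtime is told to refuse root (runAsNonRoot: true).
        pinned_non_root = (uid is not None and uid != 0) or (
            uid is None and non_root is True
        )
        if not pinned_non_root:
            flagged.append(container.get("name", "<unnamed>"))
    return flagged
```

For a pod that sets runAsUser: 1000 at the pod level, the function returns an empty list unless an individual container overrides the value with 0. A pod with no securityContext at all is flagged, because the user is then decided by the container image.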
Vetting the container images
Another mitigation strategy is to vet all container images to ensure they are trustworthy. This can be achieved either by building all the images yourself or by vetting the image contents and then pinning the image version to a hash.
(image: external/[email protected]:314234123412dfeeasd)
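Whether an image reference is actually pinned can be checked mechanically. The sketch below assumes SHA-256 digests written in the name@sha256:<64 lowercase hex characters> form. The function name is_digest_pinned and the image name external/someimage used in the usage note are hypothetical.

```python
import re

# A reference counts as pinned when it ends in a full SHA-256 digest.
_DIGEST_PINNED = re.compile(r"[^@\s]+@sha256:[0-9a-f]{64}")


def is_digest_pinned(image: str) -> bool:
    """Return True if `image` refers to content by digest rather than by tag."""
    return _DIGEST_PINNED.fullmatch(image) is not None
```

A reference such as external/someimage:latest is rejected because a tag can be moved to different content later, whereas a digest identifies one specific image.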
A further mitigation is to upgrade the runC package, or to upgrade the operating system image if you are using immutable images.
For LXC, the fix is delivered as a patch, so LXC users should install the patched version to protect against this vulnerability.