All-Things-Docker-and-Kubernetes

Seccomp

Seccomp in Docker

Seccomp (Secure Computing Mode) is a Linux kernel feature that allows you to restrict the system calls available to a process. It provides a way to create a secure execution environment by limiting the set of allowed system calls, reducing the attack surface of a program.

Simply put, we can limit the syscalls that programs can use.

To check if seccomp is supported by the Kernel, check boot config file.

Built-in Seccomp Filters on Docker

Docker containers have a builtin Seccomp fitler that is used when a container is created, provided that the host Kernel has Seccomp enabled.

Below is a snippet of the full Seccomp profile. The complete profile includes a more extensive list of allowed syscalls and actions.

{
  "defaultAction": "SCMP_ACT_ALLOW",
  "architectures": [
    "amd64",
    "x86_64"
  ],
  "syscalls": [
    {
      "name": "accept4",
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "name": "access",
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "name": "adjtimex",
      "action": "SCMP_ACT_ALLOW"
    },
    // ... additional syscalls ...
  ]
}

Seccomp Modes

Seccomp (Secure Computing Mode) operates in these modes:

Filter Mode

In filter mode, Seccomp allows or denies system calls based on a predefined filter set by the process.

Example of a Seccomp Filter in JSON format:

{
  "defaultAction": "SCMP_ACT_ALLOW",
  "architectures": [ "amd64" ],
  "syscalls": [
    { "name": "read" },
    { "name": "write" },
    { "name": "exit" }
  ]
}

User Notification Mode

User notification mode allows a process to receive a notification (signal) when a specified system call is about to be executed.

Example of Using User Notification Mode in C:

#include <linux/seccomp.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/prctl.h>

int main() {
    prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, NULL);

    // Specify a system call for notification
    prctl(PR_SET_SECCOMP, SECCOMP_MODE_NOTIFY, SECCOMP_RET_TRAP);

    // ... Your application code ...

    return 0;
}

Seccomp Profiles

Seccomp profiles define the set of system calls allowed for a process.

A Seccomp Profile consist of objects:

Here’s an example of a simple seccomp profile in JSON format that allows only a few basic syscalls:

{
  "defaultAction": "SCMP_ACT_ALLOW",
  "architectures": [ "amd64" ],
  "syscalls": [
    { "name": "read" },
    { "name": "write" },
    { "name": "exit" },
    { "name": "exit_group" }
  ]
}

Note that there are two types of profiles:

Below is an example:

Specifying a Custom Seccomp Profile

We can create a custom Seccomp profile and use it when running containers:

## custom.json 
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "architectures": ["amd64"],
  "syscalls": [
    { "name": "read", "action": "SCMP_ACT_ALLOW" },
    { "name": "write", "action": "SCMP_ACT_ALLOW" },
    { "name": "exit", "action": "SCMP_ACT_ALLOW" },
    { "name": "exit_group", "action": "SCMP_ACT_ALLOW" },
    { "name": "open", "action": "SCMP_ACT_ALLOW" },
    { "name": "close", "action": "SCMP_ACT_ALLOW" },
    { "name": "fstat", "action": "SCMP_ACT_ALLOW" },
    { "name": "arch_prctl", "action": "SCMP_ACT_ALLOW" },
    { "name": "brk", "action": "SCMP_ACT_ALLOW" },
    { "name": "munmap", "action": "SCMP_ACT_ALLOW" },
    { "name": "mmap", "action": "SCMP_ACT_ALLOW" }
    // Add more syscalls as needed
  ]
}

To use the seccomp profile, pass it when running th container.

docker run \
--security-opt seccomp=/path/to/custom.json \
-it ubuntu:latest

Disable Seccomp when Running Container

We can also tell the Docker container to completely ignore any seccomp profile completely:

docker run \
--security-opt seccomp=unconfined   \
-it ubuntu:latest

By doing this, the container should be able to use all avaiable syscalls from within the container.

This is NOT RECOMMENDED.

Seccomp in Kubernetes

Below is an example of a Docker container. This container displays the runtime used and the list of blocked syscalls.

Now, if we try to run a pod using the image, we’ll see a different output.

From the pod logs above, we see that there’s lesser blocked syscalls, and the Seccomp is set to disabled. This is because Kubernetes doesn’t implement Seccomp by default.

To implement Seccomp in the Pod, specify it as a Security Context in the Pod definition file.

### pod.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: seccomp-pod
spec:
  securityContext:
    seccompProfile:
      tye: RuntimeDefault
  containers:
  - name: my-container
    image: nginx:latest
    securityContext:
      allowPrivilegeEscalation: false
  # Add more containers or configurations if needed

Using Custom Seccomp Profile

If we want to use a custom Seccomp profile, we could also specify it in the Pod definition file.

But first, ensure that the /var/lib/kubelet/seccomp/profiles/ directory is created. Inside this directory, create the custom.json file.

Create the custom json file.

## audit.json 
{
    "defaultAction": "SCMP_ACT_LOG"
}
apiVersion: v1
kind: Pod
metadata:
  name: seccomp-pod
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/audit.json
  containers:
  - name: my-container
    image: nginx:latest
    securityContext:
      allowPrivilegeEscalation: false    

  # Add more containers or configurations if needed

Once pod is created, syslog calls made by the container in the pod will be logged in the /var/log/syslog file.

From the syslog output above, we could see the syslog call number made by the container in the pod. Note that this number are mapped to specific syscall names, which we can check in the /usr/include/asm/unistd_64.h

Below are just some of the syscall numbers and their corresponding syscall names.

Reject all syscalls made by the container

We can create another custom profile which will reject all syscalls made by the container. Create the violation.json.

## /var/lib/kubelet/seccomp/profiles/violation.json 
{
    "defaultAction": "SCMP_ACt_ERRNO"
}

We can then specify this profile in the Pod definition file.

apiVersion: v1
kind: Pod
metadata:
  name: test-violation
spec:
  restartPolicy: Never
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/violation.json
  containers:
  - name: my-container
    image: nginx:latest
    securityContext:
      allowPrivilegeEscalation: false    

Once we apply the manifest and check the pod, we’ll see that the pod has a “ContainerCannotRun” status.


Back to first page