An undocumented container escape method

Author: Nitro @360GearTeam

Background

Recently, a previously undisclosed container escape method was discovered. When a container shares the host PID namespace and runs as uid 0 (with user namespaces not enabled and no additional capabilities added), it can escape by using the /proc/[pid]/root symbolic links of certain processes.

Analysis

/proc/[pid]/root introduction

According to the proc(5) manual, the /proc/[pid]/root symbolic link gives access to the rootfs of any process, regardless of whether the current process and the target process belong to the same mount namespace. The manual also describes a permission requirement for accessing this symbolic link:

Permission to dereference or read (readlink(2)) this
symbolic link is governed by a ptrace access mode
PTRACE_MODE_READ_FSCREDS check; see ptrace(2).

In other words, accessing this symbolic link requires passing a ptrace(2)-related permission check. This description is rather vague: why would accessing a symbolic link require a check related to the ptrace system call?
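The effect of this check can be observed from user space. The following is a minimal sketch, not part of the original article: the helper name probe_proc_root is hypothetical, and it only classifies the result of readlink(2) on the link.

```python
import os


def probe_proc_root(pid: int) -> str:
    """Try to dereference /proc/<pid>/root and classify the outcome."""
    link = f"/proc/{pid}/root"
    try:
        # readlink(2) on this link is subject to the ptrace access mode check.
        os.readlink(link)
    except FileNotFoundError:
        return "missing"   # no such process, or procfs is not mounted
    except PermissionError:
        return "denied"    # the kernel refused the access
    except OSError:
        return "error"     # any other failure
    return "ok"


if __name__ == "__main__":
    # The current process can always inspect itself; PID 1 depends on privileges.
    for pid in (os.getpid(), 1):
        print(pid, probe_proc_root(pid))
```

Run as an unprivileged user, the probe for PID 1 would typically be expected to report "denied", while the probe for the current process reports "ok".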

The ptrace(2) manual has a section that covers the PTRACE_MODE_READ_FSCREDS flag:

Ptrace access mode checking
       Various parts of the kernel-user-space API (not just ptrace()
       operations), require so-called "ptrace access mode" checks, whose
       outcome determines whether an operation is permitted (or, in a
       few cases, causes a "read" operation to return sanitized data).
       These checks are performed in cases where one process can inspect
       sensitive information about, or in some cases modify the state
       of, another process.  The checks are based on factors such as the
       credentials and capabilities of the two processes, whether or not
       the "target" process is dumpable, and the results of checks
       performed by any enabled Linux Security Module (LSM)—for example,
       SELinux, Yama, or Smack—and by the commoncap LSM (which is always
       invoked).

       Prior to Linux 2.6.27, all access checks were of a single type.
       Since Linux 2.6.27, two access mode levels are distinguished:

       PTRACE_MODE_READ
              For "read" operations or other operations that are less
              dangerous, such as: get_robust_list(2); kcmp(2); reading
              /proc/[pid]/auxv, /proc/[pid]/environ, or
              /proc/[pid]/stat; or readlink(2) of a /proc/[pid]/ns/*
              file.

       PTRACE_MODE_ATTACH
              For "write" operations, or other operations that are more
              dangerous, such as: ptrace attaching (PTRACE_ATTACH) to
              another process or calling process_vm_writev(2).
              (PTRACE_MODE_ATTACH was effectively the default before
              Linux 2.6.27.)

       Since Linux 4.5, the above access mode checks are combined (ORed)
       with one of the following modifiers:

       PTRACE_MODE_FSCREDS
              Use the caller's filesystem UID and GID (see
              credentials(7)) or effective capabilities for LSM checks.

       PTRACE_MODE_REALCREDS
              Use the caller's real UID and GID or permitted
              capabilities for LSM checks.  This was effectively the
              default before Linux 4.5.

       Because combining one of the credential modifiers with one of the
       aforementioned access modes is typical, some macros are defined
       in the kernel sources for the combinations:

       PTRACE_MODE_READ_FSCREDS
              Defined as PTRACE_MODE_READ | PTRACE_MODE_FSCREDS.

       PTRACE_MODE_READ_REALCREDS
              Defined as PTRACE_MODE_READ | PTRACE_MODE_REALCREDS.

       PTRACE_MODE_ATTACH_FSCREDS
              Defined as PTRACE_MODE_ATTACH | PTRACE_MODE_FSCREDS.

       PTRACE_MODE_ATTACH_REALCREDS
              Defined as PTRACE_MODE_ATTACH | PTRACE_MODE_REALCREDS.

This description explains why accessing the /proc/[pid]/root symbolic link requires a ptrace(2)-related permission check: accessing the rootfs of a target process through /proc/[pid]/root is similar to tracing a process with the ptrace system call, in that one process accesses another process's data.

PTRACE_MODE_READ_FSCREDS is the combination of the PTRACE_MODE_READ and PTRACE_MODE_FSCREDS flags, so this is a read-level check evaluated against the calling process's filesystem UID and GID or its effective capabilities. The calling process must pass it in order to access the rootfs of the target process through the /proc/[pid]/root symbolic link.
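How the flags combine can be sketched in Python. The numeric values below are the ones I recall from include/linux/ptrace.h in recent kernels and should be treated as illustrative assumptions; the helper has_exactly_one_cred_modifier is a hypothetical name that mirrors the sanity check at the top of __ptrace_may_access(), quoted later in this article.

```python
# Assumed values (include/linux/ptrace.h); verify against your kernel tree.
PTRACE_MODE_READ = 0x01
PTRACE_MODE_ATTACH = 0x02
PTRACE_MODE_FSCREDS = 0x08
PTRACE_MODE_REALCREDS = 0x10

# The combinations listed in ptrace(2).
PTRACE_MODE_READ_FSCREDS = PTRACE_MODE_READ | PTRACE_MODE_FSCREDS
PTRACE_MODE_READ_REALCREDS = PTRACE_MODE_READ | PTRACE_MODE_REALCREDS
PTRACE_MODE_ATTACH_FSCREDS = PTRACE_MODE_ATTACH | PTRACE_MODE_FSCREDS
PTRACE_MODE_ATTACH_REALCREDS = PTRACE_MODE_ATTACH | PTRACE_MODE_REALCREDS


def has_exactly_one_cred_modifier(mode: int) -> bool:
    """A mode is valid only if it carries FSCREDS or REALCREDS, not both or neither."""
    fs = bool(mode & PTRACE_MODE_FSCREDS)
    real = bool(mode & PTRACE_MODE_REALCREDS)
    return fs != real
```

A mode such as PTRACE_MODE_READ on its own fails this test, which corresponds to the kernel returning -EPERM with a warning.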

**How exactly does the kernel perform the permission check?** To answer this question, we need to analyze the relevant kernel source code.

The relevant function call graph is as follows:
[Figure: function call graph]

Most files in the proc file system are implemented in /fs/proc/base.c. When a symbolic link is accessed, the kernel obtains the actual target path through the .get_link method of the link's inode. For /proc/[pid]/root, the .get_link method is the proc_pid_get_link() function.

proc_pid_get_link() calls proc_fd_access_allowed(), defined in the same file, to check whether the calling process has sufficient permissions.

proc_fd_access_allowed() obtains the task_struct instance of the target process from the inode of the symbolic link, and then calls ptrace_may_access() to check the permissions. The second argument passed to that function is PTRACE_MODE_READ_FSCREDS.

ptrace_may_access() is defined in /kernel/ptrace.c and delegates the actual work to __ptrace_may_access():

/* Returns 0 on success, -errno on denial. */
static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
{
    const struct cred *cred = current_cred(), *tcred;
    struct mm_struct *mm;
    kuid_t caller_uid;
    kgid_t caller_gid;

    if (!(mode & PTRACE_MODE_FSCREDS) == !(mode & PTRACE_MODE_REALCREDS)) {
        WARN(1, "denying ptrace access check without PTRACE_MODE_*CREDS\n");
        return -EPERM;
    }

    /* May we inspect the given task?
     * This check is used both for attaching with ptrace
     * and for allowing access to sensitive information in /proc.
     *
     * ptrace_attach denies several cases that /proc allows
     * because setting up the necessary parent/child relationship
     * or halting the specified task is impossible.
     */

    /* Don't let security modules deny introspection */
    if (same_thread_group(task, current))
        return 0;
    rcu_read_lock();
    if (mode & PTRACE_MODE_FSCREDS) {
        caller_uid = cred->fsuid;
        caller_gid = cred->fsgid;
    } else {
        /*
         * Using the euid would make more sense here, but something
         * in userland might rely on the old behavior, and this
         * shouldn't be a security problem since
         * PTRACE_MODE_REALCREDS implies that the caller explicitly
         * used a syscall that requests access to another process
         * (and not a filesystem syscall to procfs).
         */
        caller_uid = cred->uid;
        caller_gid = cred->gid;
    }
    tcred = __task_cred(task);
    if (uid_eq(caller_uid, tcred->euid) &&
        uid_eq(caller_uid, tcred->suid) &&
        uid_eq(caller_uid, tcred->uid)  &&
        gid_eq(caller_gid, tcred->egid) &&
        gid_eq(caller_gid, tcred->sgid) &&
        gid_eq(caller_gid, tcred->gid))
        goto ok;
    if (ptrace_has_cap(tcred->user_ns, mode))
        goto ok;
    rcu_read_unlock();
    return -EPERM;
ok:
    rcu_read_unlock();
    /*
     * If a task drops privileges and becomes nondumpable (through a syscall
     * like setresuid()) while we are trying to access it, we must ensure
     * that the dumpability is read after the credentials; otherwise,
     * we may be able to attach to a task that we shouldn't be able to
     * attach to (as if the task had dropped privileges without becoming
     * nondumpable).
     * Pairs with a write barrier in commit_creds().
     */
    smp_rmb();
    mm = task->mm;
    if (mm &&
        ((get_dumpable(mm) != SUID_DUMP_USER) &&
         !ptrace_has_cap(mm->user_ns, mode)))
        return -EPERM;

    return security_ptrace_access_check(task, mode);
}

As the code shows, if the current process and the target process are in the same thread group, access is granted immediately.

Next, because mode contains the PTRACE_MODE_FSCREDS flag, the kernel checks whether the fsuid and fsgid of the calling process match the euid, suid, uid and the egid, sgid, gid of the target process. If they do not match, ptrace_has_cap() is called to check whether the calling process has CAP_SYS_PTRACE in the user namespace of the target process. If not, access is denied.

After that, if the target process is nondumpable and the calling process does not have CAP_SYS_PTRACE in the target process's user namespace, access is denied.
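The checks described so far can be modelled in a few lines of Python. This is a simplified sketch under stated assumptions, not the kernel's implementation: the names Cred, same_ids and credential_stage are hypothetical, a single boolean stands in for both ptrace_has_cap() calls (the kernel evaluates them against two user namespace pointers), and the LSM stage that follows is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cred:
    uid: int
    euid: int
    suid: int
    fsuid: int
    gid: int
    egid: int
    sgid: int
    fsgid: int


def same_ids(uid: int, gid: int) -> Cred:
    """Credentials of a process whose UIDs are all `uid` and GIDs all `gid`."""
    return Cred(uid, uid, uid, uid, gid, gid, gid, gid)


def credential_stage(caller: Cred, target: Cred, *, fscreds: bool = True,
                     same_thread_group: bool = False,
                     target_dumpable: bool = True,
                     has_cap_sys_ptrace: bool = False) -> bool:
    """Return True if the pre-LSM part of the check would let the caller through."""
    if same_thread_group:
        return True  # introspection is always allowed
    # FSCREDS selects fsuid/fsgid, REALCREDS selects the real uid/gid.
    caller_uid = caller.fsuid if fscreds else caller.uid
    caller_gid = caller.fsgid if fscreds else caller.gid
    ids_match = (caller_uid == target.euid == target.suid == target.uid
                 and caller_gid == target.egid == target.sgid == target.gid)
    if not ids_match and not has_cap_sys_ptrace:
        return False
    if not target_dumpable and not has_cap_sys_ptrace:
        return False
    return True  # the LSM check (not modelled here) would run next
```

For example, credential_stage(same_ids(0, 0), same_ids(1000, 1000)) is False: a root process in the container does not match a host process owned by uid 1000, and without CAP_SYS_PTRACE there is no fallback.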

Finally, security_ptrace_access_check() is called to perform the last check. This function is the entry point into the LSM framework. Here we focus only on the commoncap implementation and not on other implementations such as Yama or AppArmor.

For commoncap, security_ptrace_access_check() ends up calling cap_ptrace_access_check() in /security/commoncap.c:

/**
 * cap_ptrace_access_check - Determine whether the current process may access
 *               another
 * @child: The process to be accessed
 * @mode: The mode of attachment.
 *
 * If we are in the same or an ancestor user_ns and have all the target
 * task's capabilities, then ptrace access is allowed.
 * If we have the ptrace capability to the target user_ns, then ptrace
 * access is allowed.
 * Else denied.
 *
 * Determine whether a process may access another, returning 0 if permission
 * granted, -ve if denied.
 */
int cap_ptrace_access_check(struct task_struct *child, unsigned int mode)
{
    int ret = 0;
    const struct cred *cred, *child_cred;
    const kernel_cap_t *caller_caps;

    rcu_read_lock();
    cred = current_cred();
    child_cred = __task_cred(child);
    if (mode & PTRACE_MODE_FSCREDS)
        caller_caps = &cred->cap_effective;
    else
        caller_caps = &cred->cap_permitted;
    if (cred->user_ns == child_cred->user_ns &&
        cap_issubset(child_cred->cap_permitted, *caller_caps))
        goto out;
    if (ns_capable(child_cred->user_ns, CAP_SYS_PTRACE))
        goto out;
    ret = -EPERM;
out:
    rcu_read_unlock();
    return ret;
}

First, depending on whether PTRACE_MODE_FSCREDS is set in mode, the function decides whether to use the effective capability set or the permitted capability set of the calling process for the check. Then, if the calling process and the target process belong to the same user namespace, and the permitted capability set of the target process is a subset of the selected capability set of the calling process, the check passes. Otherwise, the function checks whether the calling process has CAP_SYS_PTRACE in the user namespace of the target process. If so, the check passes; if not, access is denied.
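The subset rule is easy to misread, so here is a small model of it. It is a sketch under assumptions: both processes are taken to be in the same user namespace (as in this article's scenario), only the PTRACE_MODE_FSCREDS case is modelled, capabilities are represented as sets of capability numbers, and commoncap_allows and the example sets are hypothetical.

```python
from typing import AbstractSet

CAP_SYS_PTRACE = 19  # capability number from linux/capability.h


def commoncap_allows(caller_effective: AbstractSet[int],
                     target_permitted: AbstractSet[int]) -> bool:
    """Model of cap_ptrace_access_check() for one shared user namespace."""
    # Rule 1: the target's permitted set must be a subset of the caller's set.
    if target_permitted <= caller_effective:
        return True
    # Rule 2: otherwise the caller needs CAP_SYS_PTRACE.
    return CAP_SYS_PTRACE in caller_effective


# Hypothetical capability sets, for illustration only.
container_caps = frozenset({0, 1, 5})   # a restricted container process
host_root_caps = frozenset(range(41))   # a host root process holding every capability
host_user_caps = frozenset()            # an ordinary host user holds none
```

With these sets, commoncap_allows(container_caps, host_root_caps) is False, while commoncap_allows(container_caps, host_user_caps) is True because the empty set is a subset of any set. A non-root host process is therefore not protected by this stage; it is the UID and GID comparison in __ptrace_may_access() that blocks the caller.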

At this point we can explain the behavior described in the background section. Under the default configuration, all container processes have the same capability set, so a container that shares the host PID namespace can access the rootfs of processes in other containers, provided they also run as uid 0. The rootfs of a host process running as a non-root user cannot be accessed, because the fsuid and fsgid of the calling process do not match the euid, suid, uid and the egid, sgid, gid of the target process. The rootfs of a host process running as root cannot be accessed either: such a process holds all capabilities, so its permitted capability set is not a subset of the capability set of the calling process.

The final summary is as follows:

  • Access is allowed if the calling process and the target process belong to the same thread group.
  • If the access mode specifies PTRACE_MODE_FSCREDS, the filesystem UID (fsuid) and filesystem GID (fsgid) of the calling process are used in the subsequent credential comparison. If it specifies PTRACE_MODE_REALCREDS, the real UID (uid) and real GID (gid) of the calling process are used instead.
  • Access is denied unless at least one of the following conditions is met:
    • The fsuid and fsgid of the calling process match the euid, suid, uid and the egid, sgid, gid of the target process, respectively.
    • The calling process has CAP_SYS_PTRACE in the target process's user namespace.
  • If the target process is nondumpable and the calling process does not have CAP_SYS_PTRACE in the target process's user namespace, access is denied.
  • The kernel's commoncap LSM denies access unless at least one of the following conditions is met:
    • The calling process and the target process belong to the same user namespace, and the selected capability set of the calling process (the effective set under PTRACE_MODE_FSCREDS) is a superset of the permitted capability set of the target process.
    • The calling process has CAP_SYS_PTRACE in the target process's user namespace.

Exploitation ideas

The analysis above suggests a new approach to container escape.

As shown above, the container cannot access the rootfs of a host process running as a non-root user because the fsuid and fsgid of the container process do not match the euid, suid, uid and the egid, sgid, gid of the target process. So how can we make them match? It is actually very simple. After finding a process running as a non-root user on the host, we create a user in the container with the same UID and GID as the target process, and then use the su command to switch to that user. We then have permission to access the target process's /proc/[pid]/root. Note that the target process must be dumpable.

In the following example, a Pod that shares the host PID namespace is created first, and then the command sleep 36000 is run on the host as an ordinary user. Finally, the host file system is accessed from inside the Pod through this process:
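The UID and GID that a user would need in order to match a target process can be read from /proc/[pid]/status, whose Uid and Gid lines list the real, effective, saved and filesystem IDs in that order. The sketch below is illustrative only; parse_status_ids and matchable_ids are hypothetical helper names, and it does not check dumpability.

```python
from typing import Dict, Optional, Tuple

_FIELDS = ("real", "effective", "saved", "fs")


def parse_status_ids(status_text: str) -> Dict[str, Dict[str, int]]:
    """Extract the Uid and Gid lines from the text of /proc/<pid>/status."""
    ids: Dict[str, Dict[str, int]] = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("Uid", "Gid"):
            ids[key.lower()] = dict(zip(_FIELDS, map(int, value.split())))
    return ids


def matchable_ids(ids: Dict[str, Dict[str, int]]) -> Optional[Tuple[int, int]]:
    """Return (uid, gid) a caller must hold, or None if no single pair can match.

    The kernel compares the caller's IDs with the target's real, effective
    and saved IDs, so those three must agree for a match to be possible.
    """
    uid, gid = ids["uid"], ids["gid"]
    if uid["real"] == uid["effective"] == uid["saved"] and \
       gid["real"] == gid["effective"] == gid["saved"]:
        return uid["real"], gid["real"]
    return None


def read_status_ids(pid: int) -> Dict[str, Dict[str, int]]:
    """Read and parse the IDs of a live process (requires procfs)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_status_ids(f.read())
```

For a process started by an ordinary user with uid and gid 1000, matchable_ids(read_status_ids(pid)) would return (1000, 1000).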

[Figure: accessing the host file system from the Pod through the sleep process]

Access can also be extended inside the container by creating and joining supplementary groups. The following example adds the current user to the docker group in order to reach the host's Docker engine from the container and escape.

In the example, a container sharing the host PID namespace is first started through Docker. The uid and gid of an ordinary host user are then found with the ps command, and a corresponding user is created in the container. Next, the host's /run directory is accessed through the /proc/[pid]/root of any ordinary process, which shows that /run/docker.sock, the socket file used to communicate with the Docker engine, is accessible to users with gid 969. A group with gid 969 is then created in the container, and the previously created user is added to it. Finally, the Docker client is installed and used to access the host's Docker engine.

[Figure: accessing the host's Docker engine from the container]
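Whether a socket such as /run/docker.sock is reachable through group membership can be read from its owner group and mode bits. The following sketch shows one way to inspect them; group_access is a hypothetical helper, and any path passed to it is up to the reader.

```python
import os
import stat
from typing import Dict, Union


def group_access(path: str) -> Dict[str, Union[int, bool]]:
    """Report the owning GID of `path` and what its group is allowed to do."""
    st = os.stat(path)
    return {
        "gid": st.st_gid,                               # group that owns the file
        "group_read": bool(st.st_mode & stat.S_IRGRP),  # group may read
        "group_write": bool(st.st_mode & stat.S_IWGRP), # group may write
        "is_socket": stat.S_ISSOCK(st.st_mode),         # True for a UNIX socket
    }
```

In the article's example, the result for the Docker socket would show gid 969 with group read and write access, which is why joining a group with that gid is enough.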

NOTE: When a container that shares the host PID namespace also has the CAP_SYS_PTRACE capability, escaping through the /proc/[pid]/root symbolic link is even easier.

In the following example, a Pod that shares the host PID namespace and has the CAP_SYS_PTRACE capability added is created in the cluster.

Then, inside the Pod, the host file system can be accessed through /proc/1/root.

[Figure: Pod with the host PID namespace and CAP_SYS_PTRACE accessing /proc/1/root]
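Whether a process actually holds CAP_SYS_PTRACE can be checked from the CapEff line of /proc/[pid]/status, which is a hexadecimal bitmask indexed by capability number. A minimal sketch follows; has_capability and effective_caps_of are hypothetical helper names.

```python
from typing import Optional, Union

CAP_SYS_PTRACE = 19  # capability number from linux/capability.h


def has_capability(cap_mask_hex: str, cap: int) -> bool:
    """Test one bit of a hexadecimal capability mask such as the CapEff value."""
    return bool((int(cap_mask_hex, 16) >> cap) & 1)


def effective_caps_of(pid: Union[int, str] = "self") -> Optional[str]:
    """Return the CapEff mask of a process as a hex string, or None if absent."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("CapEff:"):
                return line.split()[1]
    return None
```

For example, has_capability("0000000000080000", CAP_SYS_PTRACE) is True, because bit 19 corresponds to 0x80000.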

Defense and detection

From the perspective of security hardening, in order to avoid the above problems, the following needs to be done in the Kubernetes cluster:

  • Containers should be run as a non-root user.
  • Do not share the host PID namespace unless it is strictly necessary.
  • Do not grant CAP_SYS_PTRACE to containers unless it is strictly necessary.
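These three recommendations can be checked mechanically against a Pod spec. The sketch below assumes the spec has already been loaded into a Python dict with the usual Kubernetes field names (hostPID, securityContext.runAsNonRoot, securityContext.capabilities.add); audit_pod_spec is a hypothetical helper and not a replacement for an admission controller.

```python
from typing import Any, Dict, List

_RISKY_CAPS = {"SYS_PTRACE", "CAP_SYS_PTRACE", "ALL"}


def audit_pod_spec(spec: Dict[str, Any]) -> List[str]:
    """Return findings for the three hardening rules listed above."""
    findings: List[str] = []
    if spec.get("hostPID"):
        findings.append("pod shares the host PID namespace (hostPID: true)")
    pod_sc = spec.get("securityContext") or {}
    for container in spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        sc = container.get("securityContext") or {}
        # The container-level setting takes precedence over the pod-level one.
        non_root = sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot", False))
        if not non_root:
            findings.append(f"{name}: runAsNonRoot is not enforced")
        added = (sc.get("capabilities") or {}).get("add") or []
        if any(str(cap).upper() in _RISKY_CAPS for cap in added):
            findings.append(f"{name}: CAP_SYS_PTRACE is granted")
    return findings
```

A spec with hostPID enabled and a single container that adds SYS_PTRACE without runAsNonRoot yields three findings, while a container that sets runAsNonRoot and adds no capabilities yields none.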

As for detection, consider monitoring, in a real-time runtime threat detection system, file access operations inside containers whose path has the prefix /proc/<pid>/root.
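A path filter for such a rule could look like the following sketch. It is only an illustration of the prefix match: is_proc_root_access is a hypothetical helper, the per-thread variant /proc/<pid>/task/<tid>/root is included as an assumption about what should be covered, and /proc/self/root is deliberately not flagged because it refers to the caller's own root.

```python
import posixpath
import re

# Matches /proc/<pid>/root and /proc/<pid>/task/<tid>/root, with or without a suffix.
_PROC_ROOT = re.compile(r"^/proc/\d+(?:/task/\d+)?/root(?:/|$)")


def is_proc_root_access(path: str) -> bool:
    """Return True if `path` goes through another process's root symlink."""
    # Normalise first so that "../" segments cannot hide the prefix.
    return bool(_PROC_ROOT.match(posixpath.normpath(path)))
```

A real detection rule would feed this function the paths reported by the monitoring system for open and similar operations performed inside containers.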

References

  • https://man7.org/linux/man-pages/man2/ptrace.2.html
  • https://man7.org/linux/man-pages/man5/proc.5.html

Original link: https://www.anquanke.com/post/id/290540

Origin blog.csdn.net/LSW1737554365/article/details/132824589