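A while ago we discussed how Tetragon implements real-time blocking. Do you know why it did not choose eBPF LSM?
The kernel version requirement is the biggest constraint: eBPF LSM needs Linux 5.7 or later. For a security product, however, blocking a single function call is far less disruptive than killing a process. bpf_send_signal works at the granularity of a process, whereas eBPF LSM works at the granularity of a hooked function, which is more precise.
The scope of control differs as well: the return value of an LSM hook changes what happens on the call path it guards, so a vulnerable operation can be refused while the process keeps running. The business use case is hot-patching kernel vulnerabilities.
This article walks through a simple eBPF LSM implementation, and its core topic is how to pin down the right hook point. How do you find the hook? What does it cost in performance once attached? How do you weigh the trade-off? Let's take a look.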
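Preface
Linux Security Modules (LSM) is a hook-based framework for implementing security policies and Mandatory Access Control in the Linux kernel. Until recently, there were only two ways to enforce a security policy: configure an existing LSM module (such as AppArmor or SELinux), or write a custom kernel module.
Linux 5.7 introduced a third way: LSM extended Berkeley Packet Filters (eBPF), or LSM BPF for short. LSM BPF lets developers write custom policies without configuring or loading a kernel module. LSM BPF programs are verified at load time and then executed whenever an LSM hook is reached on a call path.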
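Learning by doing
Namespaces
Modern operating systems provide facilities for partitioning kernel resources. FreeBSD has jails and Solaris has zones. Linux is different: it provides a set of seemingly independent facilities, each of which isolates a specific resource. These are called namespaces, and years of iteration on them produced tools such as Docker, lxc, and firejail. Most namespaces are uncontroversial, such as the UTS namespace, which lets the host system hide its hostname and domain name. Others are simple in purpose but complex in practice: the NET and NS (mount) namespaces are notoriously hard to understand. Finally, there is the very special and very interesting USER namespace.
The USER namespace is special because it lets its owner operate as root inside it. How it works is beyond the scope of this article, but it is the foundation that lets tools like Docker avoid running as true root, in other words rootless containers.
Because of this, letting unprivileged users access the USER namespace has always carried a significant security risk. The biggest of these risks is privilege escalation.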
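How privilege escalation works
Privilege escalation is a common attack surface for operating systems. One way a user can gain privileges is to map their namespace to the root namespace through the unshare syscall with the CLONE_NEWUSER flag. This tells unshare to create a new user namespace with full permissions and to map the new user and group IDs to those of the previous namespace. You can use the unshare(1) program to map root to the original user: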
<code class="language-shell hljs">$ id
uid=1000(fred) gid=1000(fred) groups=1000(fred) …
$ unshare -rU
# id
uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
# cat /proc/self/uid_map
0 1000 1</code>
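Each line of uid_map has three fields: the first ID inside the namespace, the first ID outside it, and the length of the mapped range. As a minimal sketch, assuming that three-field format, the following C helper parses one such line. The names `uid_map_entry` and `parse_uid_map_line` are made up for illustration and are not part of any library.

```c
#include <stdio.h>

/* One line of /proc/<pid>/uid_map: "<inside> <outside> <length>". */
struct uid_map_entry {
    unsigned int inside;   /* first ID as seen inside the namespace */
    unsigned int outside;  /* first ID as seen in the parent namespace */
    unsigned int length;   /* number of consecutive IDs mapped */
};

/* Hypothetical helper: returns 0 on success, -1 if the line does not
 * contain three unsigned numbers. */
int parse_uid_map_line(const char *line, struct uid_map_entry *out)
{
    if (sscanf(line, "%u %u %u", &out->inside, &out->outside, &out->length) != 3)
        return -1;
    return 0;
}
```

For the output shown above, "0 1000 1" means that UID 0 inside the new namespace corresponds to UID 1000 (fred) outside it, and that the range covers a single ID.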
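Note: the prompt changes once unshare has run. It is $ before and # after, which shows that the shell started by unshare is running as root inside the new user namespace.
In most cases unshare is harmless and is meant to run with lower privileges. However, it has been used for privilege escalation, for example in CVE-2022-0492, so this article uses that scenario as its running example.
The clone and clone3 syscalls are also worth studying, since both accept CLONE_NEWUSER. In this article, however, we focus on unshare.
Debian addressed the problem with the patch "add sysctl to disallow unprivileged CLONE_NEWUSER by default", but it was never merged into mainline. A similar patch, "sysctl: allow CLONE_NEWUSER to be disabled", was submitted to mainline and met with push back. One criticism was that the feature cannot be toggled for specific applications. In the article "Controlling access to user namespaces", the author wrote:
… the current patches do not appear to have an easy path into the mainline.
As we can see, the patches were ultimately not included in the vanilla kernel.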
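Our solution: LSM BPF
Since upstreaming code that restricts the USER namespace does not seem to be an option, we decided to use LSM BPF to work around these problems. It requires no changes to the kernel, and it lets us define custom detection and defense rules.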
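Finding a suitable hook candidate
First, let's track down the syscall we are targeting. We can find its prototype in include/linux/syscalls.h.
<code class="language-c hljs cpp">/* kernel/fork.c */</code>
The comment tells us where to look next: kernel/fork.c. There, a call to ksys_unshare() is made. Digging into that function, we find a call to unshare_userns(), which looks promising.
We have now identified the syscall implementation. The next questions are which hooks are available and how to choose the right one.
From the man pages we know that unshare changes a task, so we focus on the task-related hooks in include/linux/lsm_hooks.h. Inside unshare_userns() there is a call to prepare_creds(), which looks like a good match for the cred_prepare hook. To check that reading of prepare_creds(), we follow it to security_prepare_creds(), which ultimately calls the hook:
<code class="language-c hljs cpp">… rc = call_int_hook(cred_prepare, 0, new, old, gfp); …</code>
Without going further down this rabbit hole, we can confirm that this hook is a good fit: prepare_creds() is called right before create_user_ns() in unshare_userns(), and creating the user namespace is the operation we are trying to block.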
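LSM BPF solution
We will compile the code with the eBPF Compile Once, Run Everywhere (CO-RE) approach. This is especially useful in data centers that run several kernel versions. (That said, most internet companies, at home and abroad, that have been around for five to ten years still run many kernels older than 5.0.) The demonstration in this article is verified only on x86_64. At the time of writing, LSM BPF for ARM64 was still in development; you can subscribe to the BPF mailing list to follow the progress.
This solution was tested on kernel versions >= 5.15 configured with the following:
<code>BPF_EVENTS
BPF_JIT
BPF_JIT_ALWAYS_ON
BPF_LSM
BPF_SYSCALL
BPF_UNPRIV_DEFAULT_OFF
DEBUG_INFO_BTF
DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT
DYNAMIC_FTRACE
FUNCTION_TRACER
HAVE_DYNAMIC_FTRACE</code>
If the CONFIG_LSM list does not include bpf, you need either to rebuild the kernel with bpf added to that list or to enable it with the boot option lsm=bpf.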
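Whether bpf is among the active LSMs can usually be checked at runtime: on systems with securityfs mounted, the comma-separated list is typically exposed at /sys/kernel/security/lsm. As a rough sketch, the helper below checks such a list for a whole-entry match. The function name `lsm_list_contains` is hypothetical, and reading the file is left to the caller.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if `name` appears as a whole entry in the
 * comma-separated list `list` (a trailing newline is tolerated), else 0. */
int lsm_list_contains(const char *list, const char *name)
{
    size_t name_len = strlen(name);
    const char *p = list;

    while (*p) {
        /* Length of the current entry, up to the next comma, newline or end. */
        size_t entry_len = strcspn(p, ",\n");

        if (entry_len == name_len && strncmp(p, name, name_len) == 0)
            return 1;

        p += entry_len;
        if (*p)          /* skip the separator */
            p++;
    }
    return 0;
}
```

Matching whole entries matters here: a plain substring search for "bpf" would also match an unrelated entry that merely starts with those letters.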
Kernel-space code
We start with the kernel-space code, deny_unshare.bpf.c:
<code class="language-c hljs cpp">#include <linux/bpf.h>
#include <linux/capability.h>
#include <linux/errno.h>
#include <linux/sched.h>
#include <linux/types.h>

#include <bpf/bpf_tracing.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

#define X86_64_UNSHARE_SYSCALL 272
#define UNSHARE_SYSCALL X86_64_UNSHARE_SYSCALL</code>
CO-RE
Next, we set up the structures needed for CO-RE relocation as follows:
deny_unshare.bpf.c:
<code class="language-c hljs cpp">…

typedef unsigned int gfp_t;

struct pt_regs {
	long unsigned int di;
	long unsigned int orig_ax;
} __attribute__((preserve_access_index));

typedef struct kernel_cap_struct {
	__u32 cap[_LINUX_CAPABILITY_U32S_3];
} __attribute__((preserve_access_index)) kernel_cap_t;

struct cred {
	kernel_cap_t cap_effective;
} __attribute__((preserve_access_index));

struct task_struct {
	unsigned int flags;
	const struct cred *cred;
} __attribute__((preserve_access_index));

char LICENSE[] SEC("license") = "GPL";

…</code>
User space
Loading the program and attaching it to the target hook is the job of user space. There are several ways to do this:
- the Cilium ebpf project
- Rust bindings
- other libraries listed on the ebpf.io project landscape page
Here we use native libbpf.
<code class="language-c hljs cpp">#include <stdarg.h>
#include <stdio.h>
#include <unistd.h>
#include <bpf/libbpf.h>
#include "deny_unshare.skel.h"

static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
{
    return vfprintf(stderr, format, args);
}

int main(int argc, char *argv[])
{
    struct deny_unshare_bpf *skel;
    int err = 1; // Stays non-zero if loading fails before attach sets it

    libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
    libbpf_set_print(libbpf_print_fn);

    // Loads and verifies the BPF program
    skel = deny_unshare_bpf__open_and_load();
    if (!skel) {
        fprintf(stderr, "failed to load and verify BPF skeleton\n");
        goto cleanup;
    }

    // Attaches the loaded BPF program to the LSM hook
    err = deny_unshare_bpf__attach(skel);
    if (err) {
        fprintf(stderr, "failed to attach BPF skeleton\n");
        goto cleanup;
    }

    printf("LSM loaded! ctrl+c to exit.\n");

    // The BPF link is not pinned, therefore exiting will remove program
    for (;;) {
        fprintf(stderr, ".");
        sleep(1);
    }

cleanup:
    deny_unshare_bpf__destroy(skel);
    return err;
}</code>
Makefile
Finally, we build everything with a Makefile:
<code class="language-c hljs cpp">CLANG ?= clang-13
LLVM_STRIP ?= llvm-strip-13
ARCH := x86
INCLUDES := -I/usr/include -I/usr/include/x86_64-linux-gnu
LIBS_DIR := -L/usr/lib/lib64 -L/usr/lib/x86_64-linux-gnu
LIBS := -lbpf -lelf

.PHONY: all clean run

all: deny_unshare.skel.h deny_unshare.bpf.o deny_unshare

run: all
	sudo ./deny_unshare

clean:
	rm -f *.o
	rm -f deny_unshare.skel.h

#
# BPF is kernel code. We need to pass -D__KERNEL__ to refer to fields present
# in the kernel version of pt_regs struct. uAPI version of pt_regs (from ptrace)
# has different field naming.
# See: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fd56e0058412fb542db0e9556f425747cf3f8366
#
deny_unshare.bpf.o: deny_unshare.bpf.c
	$(CLANG) -g -O2 -Wall -target bpf -D__KERNEL__ -D__TARGET_ARCH_$(ARCH) $(INCLUDES) -c $< -o $@
	$(LLVM_STRIP) -g $@ # Removes debug information

deny_unshare.skel.h: deny_unshare.bpf.o
	sudo bpftool gen skeleton $< > $@

deny_unshare: deny_unshare.c deny_unshare.skel.h
	$(CC) -g -Wall -c $< -o $@.o
	$(CC) -g -o $@ $(LIBS_DIR) $@.o $(LIBS)

.DELETE_ON_ERROR:</code>
Results
Open a new terminal and run:
<code class="language-shell hljs">$ make run
…
LSM loaded! ctrl+c to exit.</code>
In another terminal, we can see that the call is blocked:
<code class="language-shell hljs">$ unshare -rU
unshare: unshare failed: Cannot allocate memory
$ id
uid=1000(fred) gid=1000(fred) groups=1000(fred) …</code>
The policy has one more property: privileged callers are still allowed through.
<code class="language-shell hljs">$ sudo unshare -rU
# id
uid=0(root) gid=0(root) groups=0(root)</code>
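The same check can be made from a program instead of the unshare(1) tool. The sketch below is only an illustration: it calls the unshare syscall with the CLONE_NEWUSER value in a forked child, so the caller's own namespaces stay untouched, and reports the resulting errno. The function name `try_unshare_userns` and the `EXAMPLE_CLONE_NEWUSER` constant are our own, and the raw syscall() wrapper is used so that the example does not depend on feature-test macros.

```c
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Same value as CLONE_NEWUSER; defined locally to keep the example standalone. */
#define EXAMPLE_CLONE_NEWUSER 0x10000000

/* Hypothetical helper: tries to create a user namespace in a child process.
 * Returns 0 if unshare succeeded, the errno value if it failed,
 * or -1 if the child could not be created or reaped. */
int try_unshare_userns(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: pass the result back through the exit status. */
        if (syscall(SYS_unshare, EXAMPLE_CLONE_NEWUSER) == 0)
            _exit(0);
        _exit(errno & 0xff);
    }

    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

With the policy loaded, an unprivileged caller would be expected to get ENOMEM back, which matches the "Cannot allocate memory" message above. Without the policy, the result depends on how the system is configured.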
In the unprivileged case the syscall aborts early. What is the performance impact in the privileged case?
Performance comparison
We will measure with a one-line unshare command that maps the user namespace and runs a command inside it:
<code class="language-shell hljs">unshare -frU --kill-child -- bash -c "exit 0"</code>
Using the CPU cycle interval between the enter and exit of the unshare syscall, we will measure the following as root:
- the command run without the policy
- the command run with the policy
We record the measurements with ftrace:
<code class="language-shell hljs">$ sudo su
# cd /sys/kernel/debug/tracing
# echo 1 > events/syscalls/sys_enter_unshare/enable ; echo 1 > events/syscalls/sys_exit_unshare/enable</code>
At this point, tracing of syscall enter/exit is enabled specifically for unshare. Now we set the trace clock so that the enter/exit timestamps are counted in CPU cycles:
<code class="language-shell hljs"># echo 'x86-tsc' > trace_clock</code>
Next, we take the first measurement, without the policy:
<code class="language-shell hljs"># unshare -frU --kill-child -- bash -c "exit 0" &
[1] 92014</code>
Then we start the policy in a new terminal and run the same syscall again:
<code class="language-shell hljs"># unshare -frU --kill-child -- bash -c "exit 0" &
[2] 92019</code>
Now we have the two calls to compare:
<code class="language-shell hljs"># cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 4/4   #P:8
#
#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
         unshare-92014   [002] ..... 762950852559027: sys_unshare(unshare_flags: 10000000)
         unshare-92014   [002] ..... 762950852622321: sys_unshare -> 0x0
         unshare-92019   [007] ..... 762975980681895: sys_unshare(unshare_flags: 10000000)
         unshare-92019   [007] ..... 762975980752033: sys_unshare -> 0x0</code>
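The results are:
- unshare-92014 used 63294 cycles.
- unshare-92019 used 70138 cycles.
The two measurements differ by 6,844 cycles (~10%), which is not bad.
These numbers are for a single syscall, and the cost adds up the more often the code is called. unshare is typically called at task creation and is not called repeatedly during the normal execution of a program. For your own use case, this needs careful evaluation.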
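The arithmetic behind these figures is simple enough to write down. The sketch below subtracts the sys_enter timestamp from the sys_exit timestamp for each run and expresses the difference as a percentage of the baseline. The function names are ours, chosen only for this illustration.

```c
#include <stdint.h>

/* Cycles elapsed between the sys_enter and sys_exit timestamps of one call. */
uint64_t cycles_between(uint64_t enter_tsc, uint64_t exit_tsc)
{
    return exit_tsc - enter_tsc;
}

/* Overhead of `with_policy` relative to `baseline`, in percent.
 * The subtraction is done in floating point so a faster policy run
 * would simply give a negative result. */
double overhead_percent(uint64_t baseline, uint64_t with_policy)
{
    return 100.0 * ((double)with_policy - (double)baseline) / (double)baseline;
}
```

With the timestamps from the trace, `cycles_between(762950852559027, 762950852622321)` gives 63294, `cycles_between(762975980681895, 762975980752033)` gives 70138, and the overhead comes out at roughly 10.8%. A single pair of runs is a small sample, so the figure is best read as an order of magnitude.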
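Conclusion
We have seen what LSM BPF is, how unshare can be used to map a user to root, and how to solve a real-world problem by implementing a program in eBPF. Tracking down the right hook is not easy; it takes solid coding experience and solid kernel knowledge. The policy code is written in C, so it can be tailored to each problem: with small changes it can be extended quickly, for example by adding other hook points. Finally, we compared the performance impact of this LSM program. The trade-off between performance and security is something you need to weigh for yourself.
"Cannot allocate memory" is not the most accurate description of a permission denial. We have proposed a patch that propagates the error code from the cred_prepare hook up the call stack.
In short, our conclusion is that eBPF LSM hooks are well suited to fixing Linux kernel vulnerabilities in real time. Will you give it a try?
Source: https://www.cnxct.com/linux-kernel-hotfix-with-ebpf-lsm/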
Live-patching security vulnerabilities inside the Linux kernel with eBPF Linux Security Module
Source: https://blog.cloudflare.com/live-patch-security-vulnerabilities-with-ebpf-lsm/
Linux Security Modules (LSM) is a hook-based framework for implementing security policies and Mandatory Access Control in the Linux kernel. Until recently users looking to implement a security policy had just two options. Configure an existing LSM module such as AppArmor or SELinux, or write a custom kernel module.
Linux 5.7 introduced a third way: LSM extended Berkeley Packet Filters (eBPF) (LSM BPF for short). LSM BPF allows developers to write granular policies without configuration or loading a kernel module. LSM BPF programs are verified on load, and then executed when an LSM hook is reached in a call path.
Let’s solve a real-world problem
Modern operating systems provide facilities allowing “partitioning” of kernel resources. For example FreeBSD has “jails”, Solaris has “zones”. Linux is different: it provides a set of seemingly independent facilities, each allowing isolation of a specific resource. These are called “namespaces” and have been growing in the kernel for years. They are the base of popular tools like Docker, lxc or firejail. Many of the namespaces are uncontroversial, like the UTS namespace which allows the host system to hide its hostname and domain name. Others are simple in purpose but complex in practice: NET and NS (mount) namespaces are known to be hard to wrap your head around. Finally, there is this very special, very curious USER namespace.
USER namespace is special, since it allows the owner to operate as “root” inside it. How it works is beyond the scope of this blog post; suffice it to say that it is the foundation that lets tools like Docker avoid operating as true root, and that makes things like rootless containers possible.
Due to its nature, allowing unprivileged users access to USER namespace has always carried a great security risk. One such risk is privilege escalation.
Privilege escalation is a common attack surface for operating systems. One way users may gain privilege is by mapping their namespace to the root namespace via the unshare syscall and specifying the CLONE_NEWUSER flag. This tells unshare to create a new user namespace with full permissions, and maps the new user and group ID to the previous namespace. You can use the unshare(1) program to map root to our original namespace:
<code class=" language-sh">$ id
uid=1000(fred) gid=1000(fred) groups=1000(fred) …
$ unshare -rU
# id
uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
# cat /proc/self/uid_map
0 1000 1
</code>
In most cases using unshare is harmless, and is intended to run with lower privileges. However, this syscall has been known to be used to escalate privileges.
Syscalls clone and clone3 are worth looking into as they also have the ability to CLONE_NEWUSER. However, for this post we’re going to focus on unshare.
Debian solved this problem with this “add sysctl to disallow unprivileged CLONE_NEWUSER by default” patch, but it was not mainlined. Another similar patch “sysctl: allow CLONE_NEWUSER to be disabled” attempted to mainline, and was met with push back. A critique is the inability to toggle this feature for specific applications. In the article “Controlling access to user namespaces” the author wrote: “… the current patches do not appear to have an easy path into the mainline.” And as we can see, the patches were ultimately not included in the vanilla kernel.
Our solution – LSM BPF
Since upstreaming code that restricts USER namespace does not seem to be an option, we decided to use LSM BPF to circumvent these issues. This requires no modifications to the kernel and allows us to express complex rules guarding the access.
Track down an appropriate hook candidate
First, let us track down the syscall we’re targeting. We can find the prototype in the include/linux/syscalls.h file. From there, it’s not as obvious to track down, but the line:
<code class=" language-c">/* kernel/fork.c */
</code>
Gives us a clue of where to look next in kernel/fork.c. There a call to ksys_unshare() is made. Digging through that function, we find a call to unshare_userns(). This looks promising.
Up to this point, we’ve identified the syscall implementation, but the next question to ask is what hooks are available for us to use? Because we know from the man-pages that unshare is used to mutate tasks, we look at the task-based hooks in include/linux/lsm_hooks.h. Back in the function unshare_userns() we saw a call to prepare_creds(). This looks very similar to the cred_prepare hook. To verify that prepare_creds() gives us our match, we follow it to the security hook security_prepare_creds(), which ultimately calls the hook:
<code class=" language-c">…
rc = call_int_hook(cred_prepare, 0, new, old, gfp);
…
</code>
Without going much further down this rabbithole, we know this is a good hook to use because prepare_creds() is called right before create_user_ns() in unshare_userns() which is the operation we’re trying to block.
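To make the role of call_int_hook() a little more concrete, here is a simplified user-space model of the pattern as we understand it from kernels of this era: the registered hooks run in order, and the first non-zero return value stops the chain and becomes the result. This is not kernel code. The types and names (`hook_fn`, `call_hooks`, `allow_hook`, `deny_hook`) are invented for the sketch.

```c
#include <stddef.h>

/* A hook returns 0 to allow the operation or a non-zero value to deny it. */
typedef int (*hook_fn)(void *arg);

/* Simplified model of the call_int_hook() pattern: start from `default_rc`,
 * call each hook in order, and stop at the first non-zero result. */
int call_hooks(const hook_fn *hooks, size_t count, int default_rc, void *arg)
{
    int rc = default_rc;

    for (size_t i = 0; i < count; i++) {
        rc = hooks[i](arg);
        if (rc != 0)
            break;
    }
    return rc;
}

/* Two hypothetical hooks used to exercise the model. */
int allow_hook(void *arg) { (void)arg; return 0; }
int deny_hook(void *arg)  { (void)arg; return -1; }
```

In this model a chain of `allow_hook`, `deny_hook`, `allow_hook` returns -1 and the third hook never runs, which is why a single denying program on cred_prepare is enough to stop the operation.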
LSM BPF solution
We’re going to compile with the eBPF Compile Once, Run Everywhere (CO-RE) approach. This allows us to compile once and load on kernels whose data structure layouts differ from those of the build machine. But we’re going to target x86_64 specifically. At the time of writing, LSM BPF for ARM64 is still in development, and the following code will not run on that architecture. Watch the BPF mailing list to follow the progress.
This solution was tested on kernel versions >= 5.15 configured with the following:
<code>BPF_EVENTS
BPF_JIT
BPF_JIT_ALWAYS_ON
BPF_LSM
BPF_SYSCALL
BPF_UNPRIV_DEFAULT_OFF
DEBUG_INFO_BTF
DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT
DYNAMIC_FTRACE
FUNCTION_TRACER
HAVE_DYNAMIC_FTRACE
</code>
A boot option lsm=bpf may be necessary if CONFIG_LSM does not contain “bpf” in the list.
Let’s start with our preamble:
deny_unshare.bpf.c:
<code class=" language-c">#include <linux/bpf.h>
#include <linux/capability.h>
#include <linux/errno.h>
#include <linux/sched.h>
#include <linux/types.h>

#include <bpf/bpf_tracing.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

#define X86_64_UNSHARE_SYSCALL 272
#define UNSHARE_SYSCALL X86_64_UNSHARE_SYSCALL
</code>
Next we set up our necessary structures for CO-RE relocation in the following way:
deny_unshare.bpf.c:
<code class=" language-c">…

typedef unsigned int gfp_t;

struct pt_regs {
	long unsigned int di;
	long unsigned int orig_ax;
} __attribute__((preserve_access_index));

typedef struct kernel_cap_struct {
	__u32 cap[_LINUX_CAPABILITY_U32S_3];
} __attribute__((preserve_access_index)) kernel_cap_t;

struct cred {
	kernel_cap_t cap_effective;
} __attribute__((preserve_access_index));

struct task_struct {
	unsigned int flags;
	const struct cred *cred;
} __attribute__((preserve_access_index));

char LICENSE[] SEC("license") = "GPL";

…
</code>
We don’t need to fully-flesh out the structs; we just need the absolute minimum information a program needs to function. CO-RE will do whatever is necessary to perform the relocations for your kernel. This makes writing the LSM BPF programs easy!
deny_unshare.bpf.c:
<code class=" language-c">SEC("lsm/cred_prepare")
int BPF_PROG(handle_cred_prepare, struct cred *new, const struct cred *old,
             gfp_t gfp, int ret)
{
    struct pt_regs *regs;
    struct task_struct *task;
    kernel_cap_t caps;
    int syscall;
    unsigned long flags;

    // If previous hooks already denied, go ahead and deny this one
    if (ret) {
        return ret;
    }

    task = bpf_get_current_task_btf();
    regs = (struct pt_regs *) bpf_task_pt_regs(task);
    // On x86_64, orig_ax holds the number of the syscall being executed
    syscall = regs->orig_ax;
    caps = task->cred->cap_effective;

    // Only process UNSHARE syscall, ignore all others
    if (syscall != UNSHARE_SYSCALL) {
        return 0;
    }

    // PT_REGS_PARM1_CORE pulls the first parameter passed into the unshare syscall
    flags = PT_REGS_PARM1_CORE(regs);

    // Ignore any unshare that does not have CLONE_NEWUSER
    if (!(flags & CLONE_NEWUSER)) {
        return 0;
    }

    // Allow tasks with CAP_SYS_ADMIN to unshare (already root)
    if (caps.cap[CAP_TO_INDEX(CAP_SYS_ADMIN)] & CAP_TO_MASK(CAP_SYS_ADMIN)) {
        return 0;
    }

    return -EPERM;
}
</code>
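The decision order in handle_cred_prepare() can be exercised outside the kernel. The sketch below is a plain C model of that logic under a few stated assumptions: the `EXAMPLE_` constants are local copies of the values of the x86_64 unshare syscall number, CLONE_NEWUSER and CAP_SYS_ADMIN, the capability set is modeled as two 32-bit words as in the struct above, and `deny_unshare_decision`, `cap_index` and `cap_mask` are names made up for this illustration.

```c
#include <errno.h>
#include <stdint.h>

#define EXAMPLE_UNSHARE_SYSCALL 272          /* x86_64 syscall number used above */
#define EXAMPLE_CLONE_NEWUSER   0x10000000UL /* value of CLONE_NEWUSER */
#define EXAMPLE_CAP_SYS_ADMIN   21           /* value of CAP_SYS_ADMIN */

/* Same arithmetic as CAP_TO_INDEX / CAP_TO_MASK: 32 capability bits per word. */
static unsigned int cap_index(unsigned int cap) { return cap >> 5; }
static uint32_t cap_mask(unsigned int cap) { return 1u << (cap & 31); }

/* Model of the policy: returns 0 to allow, a negative errno to deny.
 * `caps` is the effective capability set as two 32-bit words. */
int deny_unshare_decision(int prev_ret, long syscall_nr, unsigned long flags,
                          const uint32_t caps[2])
{
    if (prev_ret)                              /* an earlier hook already denied */
        return prev_ret;
    if (syscall_nr != EXAMPLE_UNSHARE_SYSCALL) /* not unshare: ignore */
        return 0;
    if (!(flags & EXAMPLE_CLONE_NEWUSER))      /* no new user namespace requested */
        return 0;
    if (caps[cap_index(EXAMPLE_CAP_SYS_ADMIN)] & cap_mask(EXAMPLE_CAP_SYS_ADMIN))
        return 0;                              /* privileged caller: allow */
    return -EPERM;
}
```

Only an unshare call that requests CLONE_NEWUSER from a task without CAP_SYS_ADMIN reaches the final -EPERM. Every other path returns early, which keeps the common case cheap.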
Creating the program is the first step, the second is loading and attaching the program to our desired hook. There are several ways to do this: Cilium ebpf project, Rust bindings, and several others on the ebpf.io project landscape page. We’re going to use native libbpf.
deny_unshare.c:
<code class=" language-c">#include <stdio.h>
#include <unistd.h>
#include <bpf/libbpf.h>
#include "deny_unshare.skel.h"

// Forward libbpf's log messages to stderr
static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
{
    return vfprintf(stderr, format, args);
}

int main(int argc, char *argv[])
{
    struct deny_unshare_bpf *skel;
    int err = 0;

    libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
    libbpf_set_print(libbpf_print_fn);

    // Loads and verifies the BPF program
    skel = deny_unshare_bpf__open_and_load();
    if (!skel) {
        fprintf(stderr, "failed to load and verify BPF skeleton\n");
        err = 1; // report the failure to the caller
        goto cleanup;
    }

    // Attaches the loaded BPF program to the LSM hook
    err = deny_unshare_bpf__attach(skel);
    if (err) {
        fprintf(stderr, "failed to attach BPF skeleton\n");
        goto cleanup;
    }

    printf("LSM loaded! ctrl+c to exit.\n");

    // The BPF link is not pinned, therefore exiting will remove program
    for (;;) {
        fprintf(stderr, ".");
        sleep(1);
    }

cleanup:
    deny_unshare_bpf__destroy(skel);
    return err;
}
</code>
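The loader above loops forever, so on ctrl+c the process is killed before it reaches the `cleanup` label and the kernel removes the unpinned link on exit. If an orderly shutdown is preferred, a minimal sketch could look like the following. The names `stop_requested`, `on_stop_signal` and `install_stop_handlers` are hypothetical and not part of the original loader.

```c
#include <signal.h>

/* Written by the signal handler, read by the main loop. */
volatile sig_atomic_t stop_requested = 0;

static void on_stop_signal(int signo)
{
	(void)signo;
	stop_requested = 1; /* only async-signal-safe work belongs here */
}

/* Hypothetical helper: route SIGINT and SIGTERM to the flag above.
 * Returns 0 on success, -1 if a handler could not be installed. */
int install_stop_handlers(void)
{
	if (signal(SIGINT, on_stop_signal) == SIG_ERR)
		return -1;
	if (signal(SIGTERM, on_stop_signal) == SIG_ERR)
		return -1;
	return 0;
}
```

With this in place, `main` could call `install_stop_handlers()` before attaching and replace `for (;;)` with `while (!stop_requested)`, so that control falls through to `cleanup` and `deny_unshare_bpf__destroy(skel)` runs explicitly.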
Lastly, to compile, we use the following Makefile:
Makefile:
<code class=" language-makefile">CLANG ?= clang-13
LLVM_STRIP ?= llvm-strip-13
ARCH := x86
INCLUDES := -I/usr/include -I/usr/include/x86_64-linux-gnu
LIBS_DIR := -L/usr/lib/lib64 -L/usr/lib/x86_64-linux-gnu
LIBS := -lbpf -lelf

.PHONY: all clean run

all: deny_unshare.skel.h deny_unshare.bpf.o deny_unshare

run: all
	sudo ./deny_unshare

clean:
	rm -f *.o
	rm -f deny_unshare.skel.h

#
# BPF is kernel code. We need to pass -D__KERNEL__ to refer to fields present
# in the kernel version of pt_regs struct. uAPI version of pt_regs (from ptrace)
# has different field naming.
# See: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fd56e0058412fb542db0e9556f425747cf3f8366
#
deny_unshare.bpf.o: deny_unshare.bpf.c
	$(CLANG) -g -O2 -Wall -target bpf -D__KERNEL__ -D__TARGET_ARCH_$(ARCH) $(INCLUDES) -c $< -o $@
	$(LLVM_STRIP) -g $@ # Removes debug information

deny_unshare.skel.h: deny_unshare.bpf.o
	sudo bpftool gen skeleton $< > $@

deny_unshare: deny_unshare.c deny_unshare.skel.h
	$(CC) -g -Wall -c $< -o $@.o
	$(CC) -g -o $@ $(LIBS_DIR) $@.o $(LIBS)

.DELETE_ON_ERROR:
</code>
Result
In a new terminal window run:
<code class=" language-sh">$ make run
…
LSM loaded! ctrl+c to exit.
</code>
In another terminal window, we’re successfully blocked!
<code class=" language-sh">$ unshare -rU
unshare: unshare failed: Cannot allocate memory
$ id
uid=1000(fred) gid=1000(fred) groups=1000(fred) …
</code>
The policy also always lets privileged callers pass through:
<code class=" language-sh">$ sudo unshare -rU
# id
uid=0(root) gid=0(root) groups=0(root)
</code>
In the unprivileged case the syscall aborts early. What is the performance impact in the privileged case?
Measure performance
We’re going to use a one-line unshare that’ll map the user namespace, and execute a command within for the measurements:
<code class=" language-sh">$ unshare -frU --kill-child -- bash -c "exit 0" </code>
Measuring at CPU-cycle resolution between the enter and exit of the unshare syscall, we’ll compare the following as the root user:
- Command run without the policy
- Command run with the policy
We’ll record the measurements with ftrace:
<code class=" language-sh">$ sudo su
# cd /sys/kernel/debug/tracing
# echo 1 > events/syscalls/sys_enter_unshare/enable ; echo 1 > events/syscalls/sys_exit_unshare/enable
</code>
At this point, we’re enabling tracing for the syscall enter and exit for unshare specifically. Now we set the time-resolution of our enter/exit calls to count CPU cycles:
<code class=" language-sh"># echo 'x86-tsc' > trace_clock </code>
Next we begin our measurements:
<code class=" language-sh"># unshare -frU --kill-child -- bash -c "exit 0" &
[1] 92014
</code>
Run the policy in a new terminal window, and then run our next syscall:
<code class=" language-sh"># unshare -frU --kill-child -- bash -c "exit 0" &
[2] 92019
</code>
Now we have our two calls for comparison:
<code class=" language-sh"># cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 4/4   #P:8
#
#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
         unshare-92014   [002] ..... 762950852559027: sys_unshare(unshare_flags: 10000000)
         unshare-92014   [002] ..... 762950852622321: sys_unshare -> 0x0
         unshare-92019   [007] ..... 762975980681895: sys_unshare(unshare_flags: 10000000)
         unshare-92019   [007] ..... 762975980752033: sys_unshare -> 0x0
</code>
unshare-92014 used 63294 cycles.
unshare-92019 used 70138 cycles.
That is a penalty of 6,844 cycles (~10.8%) between the two measurements. Not bad!
These numbers are for a single syscall, and the cost adds up the more frequently the code is called. Unshare is typically called at task creation, and not repeatedly during normal execution of a program. Careful consideration and measurement are needed for your use case.
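The arithmetic behind these figures can be reproduced with a small helper. This is only a sketch: the function names `syscall_cycles`, `overhead_percent` and `print_overhead_report` are made up for illustration, and the timestamps are the four x86-tsc values from the trace above.

```c
#include <stdint.h>
#include <stdio.h>

/* Cycles elapsed between a sys_enter and a sys_exit timestamp (x86-tsc clock). */
uint64_t syscall_cycles(uint64_t enter_tsc, uint64_t exit_tsc)
{
	return exit_tsc - enter_tsc;
}

/* Relative cost of the run with the policy over the baseline run, in percent. */
double overhead_percent(uint64_t baseline, uint64_t with_policy)
{
	return 100.0 * ((double)with_policy - (double)baseline) / (double)baseline;
}

/* Prints both deltas and the relative overhead for the trace shown above. */
void print_overhead_report(void)
{
	uint64_t base = syscall_cycles(762950852559027ULL, 762950852622321ULL);
	uint64_t policy = syscall_cycles(762975980681895ULL, 762975980752033ULL);

	printf("without policy: %llu cycles\n", (unsigned long long)base);
	printf("with policy:    %llu cycles\n", (unsigned long long)policy);
	printf("overhead:       %.1f%%\n", overhead_percent(base, policy));
}
```

A single pair of samples like this is noisy (the two calls even ran on different CPUs), so averaging over many runs would give a more trustworthy number.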
Outro
We learned a bit about what LSM BPF is, how unshare is used to map a user to root, and how to solve a real-world problem by implementing a solution in eBPF. Tracking down the appropriate hook is not an easy task, and requires a bit of experimentation and reading a lot of kernel code. Fortunately, that’s the hard part. Because a policy is written in C, we can tweak it at a fine granularity to fit our problem. This means one may extend this policy with an allow-list so that certain programs or users can continue to use an unprivileged unshare. Finally, we looked at the performance impact of this program, and saw that the overhead is a worthwhile price for blocking the attack vector.
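To make the allow-list idea concrete, here is a userspace model of the decision logic only, not the BPF program itself. It assumes the policy denies `CLONE_NEWUSER` for unprivileged callers by returning `-EPERM`; the names `unshare_policy_decision`, `uid_is_allowed` and `MODEL_CLONE_NEWUSER` are hypothetical. In a real LSM BPF program the list would more likely live in a BPF map that userspace can update.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Value of CLONE_NEWUSER in the Linux uapi headers, repeated here so the
 * model builds without kernel headers. */
#define MODEL_CLONE_NEWUSER 0x10000000UL

/* Hypothetical helper: linear scan of a small allow-list of user IDs. */
static bool uid_is_allowed(unsigned int uid, const unsigned int *allowed, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (allowed[i] == uid)
			return true;
	}
	return false;
}

/* Returns 0 to allow the call, -EPERM to deny it (assumed return code). */
int unshare_policy_decision(unsigned long flags, bool is_privileged,
			    unsigned int uid,
			    const unsigned int *allowed, size_t n)
{
	if (!(flags & MODEL_CLONE_NEWUSER))
		return 0;	/* not creating a user namespace */
	if (is_privileged)
		return 0;	/* privileged callers pass through */
	if (uid_is_allowed(uid, allowed, n))
		return 0;	/* explicitly allow-listed user */
	return -EPERM;
}
```

Keeping the allow-list check last means privileged callers and unrelated unshare calls never pay for the lookup.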
“Cannot allocate memory” is not a clear error message for denying permissions. We proposed a patch to propagate error codes from the cred_prepare hook up the call stack. Ultimately we came to the conclusion that a new hook is better suited to this problem. Stay tuned!
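Until error codes are propagated, a caller that sees `ENOMEM` from unshare cannot tell a policy denial from genuine memory pressure. A small sketch of how a tool might flag that ambiguity in its own error output follows; `unshare_errno_is_ambiguous` and `report_unshare_failure` are hypothetical names, and the hint text is illustrative.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* ENOMEM from unshare may mean real memory pressure or, as in this post,
 * a denial coming from the cred_prepare hook. */
bool unshare_errno_is_ambiguous(int err)
{
	return err == ENOMEM;
}

/* Prints the usual strerror text plus a hint when the code is ambiguous. */
void report_unshare_failure(int err)
{
	fprintf(stderr, "unshare failed: %s%s\n", strerror(err),
		unshare_errno_is_ambiguous(err)
			? " (possibly denied by an LSM policy)"
			: "");
}
```

A caller would pass `errno` to `report_unshare_failure` right after a failed `unshare(2)` call.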
Please credit the source when reposting: jinglingshu的博客 » Live-patching security vulnerabilities in the Linux kernel with the eBPF Linux Security Module