Kconfig | A | A+ | F | U | U+ |
---|---|---|---|---|---|
VMAP_STACK | Y | Y | Y | Y | Y |
THREAD_INFO_IN_TASK | Y | Y | N | Y | Y |
STACKPROTECTOR_STRONG | Y | Y | Y | Y | N |
A: Arch 5.1.8, A+: Arch Hardened 5.1.11, F: Fedora 5.1.8, U: Ubuntu 5.0.0, U+: Ubuntu LTS 4.15.0
Mitigating Stack Overflows
When a stack is overflowed, an attacker has a chance to overwrite a
security-critical data structure located in another task's stack. For
example, thread_info traditionally resides at the bottom of the stack,
so even a single long-sized overflow can reach it. By overwriting
thread_info, an attacker can escalate privileges, ultimately by
overwriting the cred structure.
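For reference, before THREAD_INFO_IN_TASK (discussed later), thread_info was reached simply by masking the stack pointer, which is exactly why it sits within overwriting distance of a stack overflow. A simplified sketch of the old x86 lookup (the masking trick assumes THREAD_SIZE-aligned stacks):
// a simplified sketch of the pre-THREAD_INFO_IN_TASK lookup on x86
static inline struct thread_info *current_thread_info(void)
{
        // NB. the stack is THREAD_SIZE-aligned, and thread_info lives at
        // its lowest address, right where an overflow lands
        return (struct thread_info *)
                (current_stack_pointer & ~(THREAD_SIZE - 1));
}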
Preventing stack clash
Using virtually-mapped (vmap) stacks has two benefits: 1) flexibility in
allocating larger, non-contiguous pages, and 2) preventing a potential
stack overflow by placing a guard page between vmapped regions. Such
a guard page can mitigate a stack clash in which one stack is overwritten
by an adjacently located task. Since it is only virtually mapped, the
guard page does not waste any real page, unlike a kmalloc()-based stack.
// @kernel/fork.c
alloc_thread_stack_node()
-> __vmalloc_node_range()
-> __get_vm_area_node()
// by default, vmap_area includes one more page as a guard
if (!(flags & VM_NO_GUARD))
size += PAGE_SIZE;
The way the kernel handles a stack overflow is pretty interesting. When kernel code touches a guard page, it generates a page fault, and the fault handler specifically checks for this condition, as shown below.
// @arch/x86/mm/fault.c
// do_page_fault()
// -> __do_page_fault()
// -> do_kern_addr_fault()
// -> bad_area_nosemaphore()
// -> __bad_area_nosemaphore()
static noinline void
no_context(struct pt_regs *regs, unsigned long error_code,
unsigned long address, int signal, int si_code)
{
...
if (is_vmalloc_addr((void *)address) &&
(((unsigned long)tsk->stack - 1 - address < PAGE_SIZE) ||
address - ((unsigned long)tsk->stack + THREAD_SIZE) < PAGE_SIZE)) {
// NB. as the stack is likely out-of-space, use the stack for double-fault
unsigned long stack = __this_cpu_ist_top_va(DF) - sizeof(void *);
// NB. invoke handle_stack_overflow() to inform an oops.
asm volatile ("movq %[stack], %%rsp\n\t"
"call handle_stack_overflow\n\t"
"1: jmp 1b"
: ASM_CALL_CONSTRAINT
: "D" ("kernel stack overflow (page fault)"),
"S" (regs), "d" (address),
[stack] "rm" (stack));
unreachable();
}
...
}
The page fault handler itself is likely to touch the guard page again, since we are running out of stack space, which generates a double fault.
// @arch/x86/kernel/traps.c
void do_double_fault(struct pt_regs *regs, long error_code)
{
...
cr2 = read_cr2();
if ((unsigned long)task_stack_page(tsk) - 1 - cr2 < PAGE_SIZE)
handle_stack_overflow("kernel stack overflow (double-fault)", regs, cr2);
...
}
One difference is that the page fault handler checks one page
before and after the stack, whereas the double-fault handler checks only
the downward overflow. This could misdiagnose the condition under
STACK_GROWSUP, which is rarely used in practice.
Related CVEs. Numerous CVEs (e.g., CVE-2016-10153, CVE-2017-8068,
CVE-2017-8070, etc.) relevant to VMAP_STACK have recently been assigned
due to its implication of DoS or potential memory corruption (unlikely
to be controllable). The basic idea is that when a DMA engine iterates
over a scatter-gather list, a stack-allocated, vmapped buffer is
unlikely to be physically contiguous across a page boundary, potentially
overwriting an irrelevant page. In practice the buffer is rarely large
enough to cross the page boundary; otherwise a developer would have
allocated a DMA-able region in the first place. One caveat, however,
is that under certain conditions (e.g., DEBUG_VIRTUAL), __phys_addr()
can trigger BUG() when the provided address is in the vmalloc() region,
resulting in a DoS.
Performance implication
Benchmark | VMAP=n | VMAP=y, #C=0 | #C=2 | #C=3 | #C=5 |
---|---|---|---|---|---|
kbuild(seconds) | 27.648798683 | - | 27.545090630 | - | - |
iterations, cpu 0 | 106343 | 99673 | 101130 | 100568 | 100025 |
iterations, cpu 2 | 118526 | 93372 | 119380 | 117901 | 116726 |
iterations, cpu 7 | 117700 | 94010 | 117651 | 117719 | 115385 |
The table above shows thread performance results measured with microbenchmarks. A higher iteration count means better performance, and #C denotes the number of cache entries.
Allocating a stack from the vmalloc area makes creating a process with
clone() take about 1.5µs longer.[1] To fix this problem, caching two
thread stacks per cpu was introduced.[4] Thread performance is slower
with virtually mapped stacks, and it is affected by the number of cache
entries: currently the number of cache entries is two, and increasing it
beyond two makes performance slightly worse. In the kbuild benchmark, a
kernel built with CONFIG_VMAP_STACK is about 0.1 seconds faster than one
built without it. So it is reasonable to enable CONFIG_VMAP_STACK, and
the two-entry per-cpu cache compensates for the clone() overhead.
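A condensed sketch of that per-cpu cache in kernel/fork.c (details elided; this follows the change described in [4]):
// @kernel/fork.c (condensed sketch)
#define NR_CACHED_STACKS 2
static DEFINE_PER_CPU(struct vm_struct *, cached_stacks[NR_CACHED_STACKS]);

static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
{
        int i;
        // NB. reuse a recently freed vmapped stack cached on this cpu,
        // avoiding the vmalloc cost on every clone()
        for (i = 0; i < NR_CACHED_STACKS; i++) {
                struct vm_struct *s = this_cpu_xchg(cached_stacks[i], NULL);
                if (!s)
                        continue;
                tsk->stack_vm_area = s;
                tsk->stack = s->addr;
                return s->addr;
        }
        // NB. otherwise fall back to __vmalloc_node_range() as shown earlier
        ...
}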
References
- LWN: Virtually mapped kernel stacks
- CVE-2016-1583: Exploiting Recursion in the Linux Kernel
- Mailing: Can someone explain all the CONFIG_VMAP_STACK CVEs lately?
- fork: Cache two thread stacks per cpu if CONFIG_VMAP_STACK is set
Protecting thread_info
Hijacking thread_info or task_struct is a straightforward way to achieve
a privilege escalation: overwrite its uid with root's, zero. As they
used to be located at the bottom of the stack (e.g., task_struct in <2.6
or thread_info in later versions), bugs such as a stack clash, a stack
overflow, or an arbitrary write after a stack pointer leak can launch an
exploit against them.
An easy mitigation is to completely remove them from the stack:
THREAD_INFO_IN_TASK, as its name implies, embeds thread_info into
task_struct. Since the current task can be accessed through a per-cpu
data structure, thread_info can be reached with one additional memory
access. Note that thread_info is supposed to contain architecture-specific
information while task_struct holds architecture-neutral data; the current
effort on x86 virtually migrates all information into task_struct.
// @include/linux/sched.h
struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK
/*
* For reasons of header soup (see current_thread_info()), this
* must be the first element of task_struct.
*/
struct thread_info thread_info;
#endif
...
}
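With this layout, looking up thread_info is just a per-cpu read of current followed by a cast; a simplified sketch of the relevant definitions:
// @include/linux/thread_info.h (simplified)
// NB. thread_info is the first member of task_struct, so the cast is valid
#define current_thread_info() ((struct thread_info *)current)

// @arch/x86/include/asm/current.h (simplified)
// NB. 'current' itself comes from a per-cpu variable, not from the stack
DECLARE_PER_CPU(struct task_struct *, current_task);
#define current get_current()   // get_current() reads current_task per-cpu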
If THREAD_INFO_IN_TASK is not set, the bottom of the stack contains
thread_info, which is protected by a magic value, STACK_END_MAGIC, that
should not be considered a security enhancement or mechanism.
end_of_stack() simply returns the end of the usable stack region and
handles both situations seamlessly.
// @include/linux/sched/task_stack.h
void set_task_stack_end_magic(struct task_struct *tsk)
{
unsigned long *stackend;
stackend = end_of_stack(tsk);
// NB. indicating that current stack is overwritten by an overflow
*stackend = STACK_END_MAGIC;
}
#ifdef CONFIG_THREAD_INFO_IN_TASK
unsigned long *end_of_stack(const struct task_struct *task)
{
return task->stack;
}
// NB. thread_info will be copied as part of task_struct
#define setup_thread_stack(new,old) do { } while(0)
#else
unsigned long *end_of_stack(struct task_struct *p)
{
return (unsigned long *)(task_thread_info(p) + 1);
}
void setup_thread_stack(struct task_struct *p, struct task_struct *org)
{
// NB. copied to the stack end (top)
*task_thread_info(p) = *task_thread_info(org);
task_thread_info(p)->task = p;
}
#endif
Performance implication
- don't expect much
Stack canary (SSP)
FIX. config in 4.15 is named differently
Performance implication
Option | Size (KB) | Protected functions |
---|---|---|
None | 53,519 KB | 0/48,606 |
STACKPROTECTOR | 53,490 KB (-0.05%) | 1,390/48,607 (+2.86%) |
STACKPROTECTOR_STRONG | 55,312 KB (+3.35%) | 9,922/48,608 (+20.41%) |
STACKPROTECTOR_STRONG inserts a canary into about 20% of all functions in
the kernel, whereas STACKPROTECTOR protects about 3%, resulting in roughly
a 3% increase in binary size. For example, bstat() is newly protected
under STACKPROTECTOR_STRONG because it has a struct kstat as a local
variable.
What is interesting is the binary size of STACKPROTECTOR compared with the
unprotected binary: inserting the canary actually reduces the binary size.
According to our analysis, checking the canary in the epilogue tends to
encourage the reuse of common gadgets (e.g., pop or ret) on exit paths,
resulting in better utilization of instructions.
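As a rough illustration of the coverage difference (a hypothetical userspace example, not the kernel's bstat()): plain -fstack-protector instruments only functions with character-array locals, while -fstack-protector-strong also instruments functions with aggregate or address-taken locals, like the one below.
/* compile with: gcc -S -fstack-protector-strong example.c
 * the prologue loads the canary (from %gs/%fs on x86), stores it below the
 * saved registers, and the epilogue re-checks it, calling
 * __stack_chk_fail() on a mismatch */
struct kstat_like { long size; long blocks; };

long fill_stat(long size)
{
        /* an aggregate local: protected by -strong, not by plain
         * -fstack-protector */
        struct kstat_like st = { .size = size, .blocks = 0 };
        return st.size + st.blocks;
}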
References
Kconfig | A | A+ | F | U | U+ |
---|---|---|---|---|---|
DEBUG_LIST | N | Y | Y | N | N |
SLAB_FREELIST_HARDENED | Y | Y | Y | Y | Y |
SLAB_FREELIST_RANDOM | Y | Y | Y | Y | Y |
A: Arch 5.1.8, A+: Arch Hardened 5.1.11, F: Fedora 5.1.8, U: Ubuntu 5.0.0, U+: Ubuntu LTS 4.15.0
Mitigating Heap Overflows
Protecting the integrity of heap metadata
Manipulating a list entry, if crafted, enables a traditional, well-known attack vector known as unsafe unlink. Simply put, an attacker can launch a semi-arbitrary memory write by overwriting the list entry (e.g., via a heap overflow), as illustrated below:
# TODO. draw a real figure
prev cur next
+------+ +------+ +------+
| |------>| PPPP-|------>| |
| |<------|-VVVV |<------| |
+------+ +------+ +------+
*(PPPP+8) = VVVV
*VVVV = PPPP (restricted)
so to perform *dst = val, an attacker can overwrite cur's list entry with [dst-8, val].
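A hypothetical sketch of the resulting primitive, assuming the unlink runs without the integrity check described next; the layout follows the kernel's struct list_head (next at offset 0, prev at offset 8), and __list_del_unchecked is a made-up name for illustration:
// NB. hypothetical illustration: the attacker has overwritten
// cur->next = dst - 8 and cur->prev = val before the unlink of cur
struct list_head { struct list_head *next, *prev; };

static void __list_del_unchecked(struct list_head *prev, struct list_head *next)
{
        next->prev = prev;      // *((dst - 8) + 8) = val  =>  *dst = val
        prev->next = next;      // *val = dst - 8          (the restricted write)
}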
Before deleting a list entry, the kernel first performs an integrity check,
__list_del_entry_valid(), and after the deletion, it poisons the list
entries to assist debugging---this also tends to prevent data pointer
leaks when there is a dangling pointer to the freed object.
// @include/linux/list.h
void list_del(struct list_head *entry)
{
__list_del_entry(entry);
entry->next = LIST_POISON1; // 0xdead000000000100
entry->prev = LIST_POISON2; // 0xdead000000000200
}
void __list_del_entry(struct list_head *entry)
{
if (!__list_del_entry_valid(entry))
return;
__list_del(entry->prev, entry->next);
}
void __list_del(struct list_head * prev, struct list_head * next)
{
next->prev = prev;
prev->next = next;
}
Two conditions are checked:
- whether the deletion is attempted on an already-deleted list entry (e.g., a double delete);
- whether the indicated previous entry's next pointer points back to the entry being deleted, and vice versa for the indicated next entry.
For 1), the LIST_POISON1/2 values inserted during the deletion process help to recognize the violation. Similar checks are performed for list addition.
bool __list_del_entry_valid(struct list_head *entry)
{
struct list_head *prev, *next;
prev = entry->prev;
next = entry->next;
// NB. first check if we are attempting to delete
// previous deleted entry
if (CHECK_DATA_CORRUPTION(next == LIST_POISON1,
"list_del corruption, %px->next is LIST_POISON1 (%px)\n",
entry, LIST_POISON1) ||
CHECK_DATA_CORRUPTION(prev == LIST_POISON2,
"list_del corruption, %px->prev is LIST_POISON2 (%px)\n",
entry, LIST_POISON2) ||
// NB. check the integrity of the link chains; prev's next and
// next's prev correctly point to me
CHECK_DATA_CORRUPTION(prev->next != entry,
"list_del corruption. prev->next should be %px, but was %px\n",
entry, prev->next) ||
CHECK_DATA_CORRUPTION(next->prev != entry,
"list_del corruption. next->prev should be %px, but was %px\n",
entry, next->prev))
return false;
return true;
}
SLAB_FREELIST_RANDOM.
The determinism of the allocator (i.e., the deterministic order of
allocated chunks) helps an attacker (a bit) in controlling the overflow
target. A simple way to disturb this determinism is to randomize the
allocation order, which is done by randomizing the free chunks when the
kmem_cache structure is initialized. The Fisher-Yates shuffle implemented
in freelist_randomize() guarantees that each slot has an equal likelihood
of ending up in any position.
// @mm/slab_common.c
// init_freelist_randomization()
// -> init_cache_random_seq()
// -> cache_random_seq_create()
void freelist_randomize(struct rnd_state *state, unsigned int *list,
unsigned int count)
{
unsigned int rand;
unsigned int i;
for (i = 0; i < count; i++)
list[i] = i;
/* Fisher-Yates shuffle */
for (i = count - 1; i > 0; i--) {
rand = prandom_u32_state(state);
rand %= (i + 1);
swap(list[i], list[rand]);
}
}
CONFIG_SLAB_FREELIST_HARDENED. When a heap object is overflowed, there exist two classes of overflow targets (i.e., the nearby object located right after it): 1) a free object, and 2) an allocated object of the same type. In terms of exploits, one approach is to abuse specific semantics of the target object (e.g., crafting a function pointer in the struct), while another is to develop the overflow into more preferable primitives (e.g., an arbitrary write) for exploitation. In the case of a free object, there exists a generic approach: the metadata of the heap structures is abused for further exploitation. For example, the link structure, called freelist, that chains all free objects in the cache can be overwritten in a way that creates dangling pointers (e.g., making kmalloc() return an arbitrary object pointer).
// TODO. redraw a real figure
---> freelist that link all free chunks
head ---+
V
+------+ +------+ +------+
| |<------|- | | |
+------+ ptr +------+ +------+
| (ptr_addr) ^
+-----------------------------+
SLAB_FREELIST_HARDENED is proposed to prevent this direct modification of
the freelist structure. The basic approach is to mangle (xor) the pointer
with a random canary value (s->random) created at the initialization of
the cache. One interesting decision is to add ptr_addr to the mangled
pointer. Its implication is subtle, but worth mentioning here. If
s->random is leaked via another channel, an attacker can place an
arbitrary value (i.e., the value xor-ed with the canary), making the
aforementioned exploitation techniques possible again. The proposed
solution is to mangle the value once more with another secret value, the
randomized address of the chunk itself (ptr_addr). To bypass this
protection, the attacker should be able to locate the overflowing chunk
precisely. However, one potential concern is that an attacker can reuse a
leaked value with simple arithmetic: e.g., adding the size of the heap
object, say 0x100, to the leaked data would likely lead to a controllable
situation, such as two freed objects or one allocated object now appearing
in the freelist.
/*
* Returns freelist pointer (ptr). With hardening, this is obfuscated
* with an XOR of the address where the pointer is held and a per-cache
* random number.
*/
void *freelist_ptr(const struct kmem_cache *s, void *ptr,
                   unsigned long ptr_addr) {
        return (void *)((unsigned long)ptr ^ s->random ^ ptr_addr);
}
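For completeness, a condensed sketch of the matching decode path in mm/slub.c: the same xor transformation is applied when the freelist pointer is read back, so a value that was not encoded with the correct s->random and location decodes to garbage.
// @mm/slub.c (condensed sketch)
static inline void *freelist_dereference(const struct kmem_cache *s,
                                         void *ptr_addr)
{
        // NB. apply the same mangling to decode the stored pointer
        return freelist_ptr(s, (void *)*(unsigned long *)(ptr_addr),
                            (unsigned long)ptr_addr);
}

static inline void *get_freepointer(struct kmem_cache *s, void *object)
{
        return freelist_dereference(s, object + s->offset);
}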
Performance implication
TODO. set a target benchmarks in /bench
References
Kernel Address Space Layout Randomization (KASLR)
Kernel Address Space Layout Randomization (KASLR) randomizes the location of the kernel itself in order to mitigate known exploits that rely on predictable kernel addresses, such as return-oriented programming. The KASLR implementation for x86-64 randomizes three main memory regions: the physical mapping, vmalloc, and vmemmap.
//@arch/x86/mm/kaslr.c
/*
* Memory regions randomized by KASLR (except modules that use a separate logic
* earlier during boot). The list is ordered based on virtual addresses. This
* order is kept after randomization.
*/
static __initdata struct kaslr_memory_region {
unsigned long *base;
unsigned long size_tb;
} kaslr_regions[] = {
{ &page_offset_base, 0 },
{ &vmalloc_base, 0 },
{ &vmemmap_base, 0 },
};
/* Get size in bytes used by the memory region */
static inline unsigned long get_padding(struct kaslr_memory_region *region)
{
return (region->size_tb << TB_SHIFT);
}
....
void __init kernel_randomize_memory(void)
{
....
....
kaslr_regions[0].size_tb = 1 << (MAX_PHYSMEM_BITS - TB_SHIFT);
kaslr_regions[1].size_tb = VMALLOC_SIZE_TB;
/*
* Update Physical memory mapping to available and
* add padding if needed (especially for memory hotplug support).
*/
BUG_ON(kaslr_regions[0].base != &page_offset_base);
memory_tb = DIV_ROUND_UP(max_pfn << PAGE_SHIFT, 1UL << TB_SHIFT) +
CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING;
/* Adapt phyiscal memory region size based on available memory */
if (memory_tb < kaslr_regions[0].size_tb)
kaslr_regions[0].size_tb = memory_tb;
/*
* Calculate the vmemmap region size in TBs, aligned to a TB
* boundary.
*/
vmemmap_size = (kaslr_regions[0].size_tb << (TB_SHIFT - PAGE_SHIFT)) *
sizeof(struct page);
kaslr_regions[2].size_tb = DIV_ROUND_UP(vmemmap_size, 1UL << TB_SHIFT);
The code above calculates the size of each memory region in terabytes for the physical mapping, vmalloc, and vmemmap. Those region sizes are used to calculate remain_entropy below.
/* Calculate entropy available between regions */
remain_entropy = vaddr_end - vaddr_start;
for (i = 0; i < ARRAY_SIZE(kaslr_regions); i++)
remain_entropy -= get_padding(&kaslr_regions[i]);
prandom_seed_state(&rand_state, kaslr_get_random_long("Memory"));
for (i = 0; i < ARRAY_SIZE(kaslr_regions); i++) {
unsigned long entropy;
/*
* Select a random virtual address using the extra entropy
* available.
*/
entropy = remain_entropy / (ARRAY_SIZE(kaslr_regions) - i);
prandom_bytes_state(&rand_state, &rand, sizeof(rand));
entropy = (rand % (entropy + 1)) & PUD_MASK;
vaddr += entropy;
*kaslr_regions[i].base = vaddr;
/*
* Jump the region and add a minimum padding based on
* randomization alignment.
*/
vaddr += get_padding(&kaslr_regions[i]);
vaddr = round_up(vaddr + 1, PUD_SIZE);
remain_entropy -= entropy;
}
}
In the last part of kernel_randomize_memory(), remain_entropy is initialized to the available virtual memory space. The actual randomization is done inside the for loop: entropy is 'distributed' to each region and applied to its base address. Note that dividing remain_entropy by the number of remaining regions prevents one region from monopolizing the entropy, and remain_entropy is updated on each iteration for the next region.
References
Kconfig | A | A+ | F | U | U+ |
---|---|---|---|---|---|
REFCOUNT_FULL | N | Y | N | N | N |
A: Arch 5.1.8, A+: Arch Hardened 5.1.11, F: Fedora 5.1.8, U: Ubuntu 5.0.0, U+: Ubuntu LTS 4.15.0
Mitigating Integer Overflows
Preventing refcount overflows
A reference counter (i.e., refcount) can be overflowed if incorrectly
managed, resulting in dangling pointers. Such a dangling pointer is
relatively easy to exploit by turning it into a use-after-free bug
(e.g., inserting a fake object into the dangled object via msgsnd(),
see CVE-2016-0728).
In CVE-2016-0728, a keyring is not correctly freed when joining a session
keyring, and the function table entry pointing to revoke() in the dangled
object can be hijacked, resulting in a privilege escalation.
If REFCOUNT_FULL is enabled, every refcount_inc() is replaced with the
call below. It checks two conditions: 1) if the counter is saturated, it
stays topped at UINT_MAX and the object keeps being used (i.e., is
leaked), and 2) if the object has already been freed, it is not used
again. Similarly, refcount_sub_and_test_checked() checks for an underflow
condition.
// @lib/refcount.c
bool refcount_inc_not_zero_checked(refcount_t *r) {
unsigned int new, val = atomic_read(&r->refs);
do {
new = val + 1;
if (!val) // NB. refcount is already freed
return false;
if (unlikely(!new)) // NB. refcount is overflowed
return true;
} while (!atomic_try_cmpxchg_relaxed(&r->refs, &val, new));
return true;
}
Optimization. PAX_REFCOUNT and [2] propose a potential optimization that
trades half of the refcount range, using the sign bit to indicate an
overflowed condition. However, the current implementation just uses
cmpxchg() with an explicit overflow check and uses the full refcount
range up to UINT_MAX.
lock incl -0xc(%rbp)
js overflowed ; NB. unlikely to be taken
overflowed:
lea -0xc(%rbp),%rcx ; NB. restored to an old refcount
<UD0>
Performance implication
TODO.
References
Tools to prevent integer overflows
Developers have traditionally detected integer overflows as follows:
x + y < x // for addition
x - y > x // for subtraction
x != 0 && y > c/x // for multiplication
There are a few problems with these techniques. In the case of signed integers, they cannot guarantee complete checking because they rely on undefined behavior.
Therefore, GCC 5 introduced built-in functions to check for integer overflows without undefined behavior, and the kernel's include/linux/overflow.h provides check_*_overflow() helpers. For example, overflows in signed addition are detected as below.
// @include/linux/overflow.h
#define check_add_overflow(a, b, d) \
__builtin_choose_expr(is_signed_type(typeof(a)), \
__signed_add_overflow(a, b, d), \
__unsigned_add_overflow(a, b, d))
/* Checking for unsigned overflow is relatively easy without causing UB. */
#define __unsigned_add_overflow(a, b, d) ({ \
typeof(a) __a = (a); \
typeof(b) __b = (b); \
typeof(d) __d = (d); \
(void) (&__a == &__b); \
(void) (&__a == __d); \
*__d = __a + __b; \
*__d < __a; \
})
/*
* Adding two signed integers can overflow only if they have the same
* sign, and overflow has happened iff the result has the opposite
* sign.
*/
#define __signed_add_overflow(a, b, d) ({ \
typeof(a) __a = (a); \
typeof(b) __b = (b); \
typeof(d) __d = (d); \
(void) (&__a == &__b); \
(void) (&__a == __d); \
*__d = (u64)__a + (u64)__b; \
(((~(__a ^ __b)) & (*__d ^ __a)) \
& type_min(typeof(__a))) != 0; \
})
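A hypothetical usage sketch (the variable names are illustrative, not from a specific kernel call site): the macro evaluates to true when the operation wrapped, so a caller can bail out before using the result, e.g., as an allocation size.
// NB. hypothetical example of guarding a size computation
size_t bytes;

if (check_add_overflow(hdr_len, payload_len, &bytes))
        return -EOVERFLOW;      // hdr_len + payload_len wrapped around

buf = kmalloc(bytes, GFP_KERNEL);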
Performance implication
References
Kconfig | A | A+ | F | U | U+ |
---|---|---|---|---|---|
HARDENED_USERCOPY | Y | Y | Y | Y | Y |
SECURITY_DMESG_RESTRICT | N | Y | N | N | N |
A: Arch 5.1.8, A+: Arch Hardened 5.1.11, F: Fedora 5.1.8, U: Ubuntu 5.0.0, U+: Ubuntu LTS 4.15.0
Preventing information leaks
Hardening usercopy()
To prevent leaks from kernel objects, the slab interface provides a way
to specify the region of each object that is allowed for usercopy.
For example, to protect task_struct from unintended leakage
(e.g., an out-of-bound read beyond the usercopy region of the heap object),
a tuple of useroffset and usersize should be provided via
kmem_cache_create_usercopy().
void __init fork_init(void) {
...
// NB. calculate the offset and size of the fpu states
task_struct_whitelist(&useroffset, &usersize);
task_struct_cachep = kmem_cache_create_usercopy("task_struct",
arch_task_struct_size, align,
SLAB_PANIC|SLAB_ACCOUNT,
useroffset, usersize, NULL);
}
In task_struct, the fpu state is allowed for usercopy (see below); it is
copied from/to user space via the ptrace() syscall.
// @x86_64
// task_struct_cachep->useroffset = 2624 :&fxregs_state
// task_struct_cachep->usersize = 960 :fpu_kernel_xstate_size
//
// arch_ptrace
// -> copy_regset_from_user
// -> xfpregs_get()
// -> user_regset_copyout()
// -> copy_to_user()
int xfpregs_get(struct task_struct *target, const struct user_regset *regset,
unsigned int pos, unsigned int count,
void *kbuf, void __user *ubuf)
{
struct fpu *fpu = &target->thread.fpu;
...
return user_regset_copyout(&pos, &count, &kbuf, &ubuf,
&fpu->state.fxsave, 0, -1);
}
When HARDENED_USERCOPY is enabled, copy_from/to_user() performs various
sanity checks, including a check against useroffset and usersize.
// copy_from/to_user()
// -> check_object_size()
// -> check_heap_object()
// -> __check_heap_object()
void __check_heap_object(const void *ptr, unsigned long n, struct page *page,
bool to_user)
{
struct kmem_cache *s;
unsigned int offset;
size_t object_size;
/* NB. Fetch kmem_cache to find the object size/redzone. */
s = page->slab_cache;
/* NB. Reject if ptr is not possible to point to the page,
* but the page is directly converted from ptr of its caller,
* this path won't be taken in the current implementation. */
if (ptr < page_address(page))
usercopy_abort("SLUB object not in SLUB page?!", NULL,
to_user, 0, n);
/* Find offset within object. */
offset = (ptr - page_address(page)) % s->size;
/* Adjust for redzone and reject if within the redzone. */
if (kmem_cache_debug(s) && s->flags & SLAB_RED_ZONE) {
if (offset < s->red_left_pad)
usercopy_abort("SLUB object in left red zone",
s->name, to_user, offset, n);
offset -= s->red_left_pad;
}
/* NB. Allow address range falling entirely within usercopy region.
useroffset + +-- offset (from ptr)
| v
v +--n-->|
[ [ ] ]
|<--usersize--->|
*/
if (offset >= s->useroffset &&
offset - s->useroffset <= s->usersize &&
n <= s->useroffset - offset + s->usersize)
return;
/*
* If the copy is still within the allocated object, produce
* a warning instead of rejecting the copy. This is intended
* to be a temporary method to find any missing usercopy
* whitelists.
*/
object_size = slab_ksize(s);
if (usercopy_fallback &&
offset <= object_size && n <= object_size - offset) {
usercopy_warn("SLUB object", s->name, to_user, offset, n);
return;
}
usercopy_abort("SLUB object", s->name, to_user, offset, n);
}
There are a few other similar mitigation schemes to avoid such mistakes
when performing a copy_to/from_user(). For example, if an object is
stack allocated, the kernel checks that the object is properly located in
the stack, and also within the proper stack frame, if the architecture
provides a simple way to walk the stack frames (e.g., a frame pointer).
// __check_object_size
// -> check_stack_object
// -> arch_within_stack_frames
static inline
int arch_within_stack_frames(const void * const stack,
const void * const stackend,
const void *obj, unsigned long len)
{
const void *frame = NULL;
const void *oldframe;
// NB. return address of the caller
oldframe = __builtin_frame_address(1);
if (oldframe)
// NB. return address of the caller's caller
frame = __builtin_frame_address(2);
/*
* low ----------------------------------------------> high
* [saved bp][saved ip][args][local vars][saved bp][saved ip]
* ^----------------^
* allow copies only within here
*/
while (stack <= frame && frame < stackend) {
/*
* If obj + len extends past the last frame, this
* check won't pass and the next frame will be 0,
* causing us to bail out and correctly report
* the copy as invalid.
*/
if (obj + len <= frame)
// NB. 2 * sizeof(void*): frame pointer + return address
return obj >= oldframe + 2 * sizeof(void *) ?
GOOD_FRAME : BAD_STACK;
oldframe = frame;
frame = *(const void * const *)frame;
}
return BAD_STACK;
}
Restricting kernel pointers
The dmesg command prints debug messages from the kernel log buffer. However, the kernel message buffer sometimes contains sensitive information, such as register values, that should not be exposed to users; this makes it much easier for an attacker to build an exploit, as in CVE-2018-17182. DMESG_RESTRICT therefore prevents unprivileged users from viewing those messages with dmesg: when DMESG_RESTRICT is enabled, only users with system administration privileges are allowed to see the messages; when it is not enabled, every user can see them.
KPTR_RESTRICT works similarly to DMESG_RESTRICT. Kernel pointers are another kind of sensitive information that malicious users may abuse. When KPTR_RESTRICT is set to 1, the %pK format specifier hides kernel pointers from unprivileged users by printing zeros. When KPTR_RESTRICT is set to 0, %pK behaves the same as %p, meaning there is no restriction on printing pointers. When KPTR_RESTRICT is set to 2, %pK hides pointers regardless of privileges.
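A simplified sketch of the resulting %pK policy (an illustration of the rules above, not the actual vsprintf code):
// NB. simplified policy sketch; the real check lives in lib/vsprintf.c
static bool kptr_visible(int kptr_restrict, bool privileged)
{
        switch (kptr_restrict) {
        case 0:  return true;        // %pK behaves like %p
        case 1:  return privileged;  // only privileged readers see the pointer
        default: return false;       // 2: always print zeros
        }
}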
References
- STACKLEAK: A Long Way to the Linux Kernel Mainline
- A pair of GCC plugins
- CVE-2018-17182: A Cache Invalidation Bug in Linux Memory Management
Kconfig | A | A+ | F | U | U+ |
---|---|---|---|---|---|
PAGE_TABLE_ISOLATION | Y | Y | Y | Y | Y |
RETPOLINE | Y | Y | Y | Y | Y |
A: Arch 5.1.8, A+: Arch Hardened 5.1.11, F: Fedora 5.1.8, U: Ubuntu 5.0.0, U+: Ubuntu LTS 4.15.0
Mitigating Microarchitecture Side-channel
Mitigating KASLR and Meltdown
Mapping both user space and kernel space in one page table brings many advantages to the Linux kernel, but it also has a negative impact on security: researchers have exploited it via a variety of side channels. It allows an attacker to bypass KASLR and, even worse, to read kernel memory from user space, which is known as Meltdown. The Linux kernel therefore ended up isolating the page table based on the execution mode, either user or kernel, which is dubbed page table isolation, or PTI. The implementation of PTI is straightforward: allocate two page tables for a process, one for user mode and one for kernel mode; the kernel-mode table maps the entire memory space, while the user-mode table is restricted from mapping kernel space. This effectively closes both KASLR attacks and Meltdown.
Allocating two page tables in a contiguous area.
// arch/x86/include/asm/pgalloc.h
// #ifdef CONFIG_PAGE_TABLE_ISOLATION
// #define PGD_ALLOCATION_ORDER 1
// #else
// #define PGD_ALLOCATION_ORDER 0
// #endif
....
....
// arch/x86/mm/pgtable.c
static inline pgd_t *_pgd_alloc(void)
{
return (pgd_t *)__get_free_pages(PGALLOC_GFP, PGD_ALLOCATION_ORDER);
// NB. CONFIG_PAGE_TABLE_ISOLATION --> 8kb, two page tables.
// !CONFIG_PAGE_TABLE_ISOLATION --> 4kb, one page table.
}
Switching the page table at a syscall entry, either user-to-kernel or kernel-to-user.
ENTRY(entry_SYSCALL_64)
UNWIND_HINT_EMPTY
swapgs
// NB. percpu-access before SWITCH_TO_KERNEL_CR3
// the percpu-area should be mapped in user page table.
movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
// NB. locating the page table, user-to-kernel.
SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
As shown in the code above, the kernel entry code accesses per-cpu areas before switching the page table, so the per-cpu areas need to be mapped in the user-mode page table as well.
static void __init setup_cpu_entry_area(unsigned int cpu)
{
....
....
// NB. create a mapping for per-cpu areas
cea_set_pte(&cea->gdt, get_cpu_gdt_paddr(cpu), gdt_prot);
cea_map_percpu_pages(&cea->entry_stack_page,
per_cpu_ptr(&entry_stack_storage, cpu), 1,
PAGE_KERNEL);
....
cea_map_percpu_pages(&cea->tss, &per_cpu(cpu_tss_rw, cpu),
sizeof(struct tss_struct) / PAGE_SIZE, tss_prot);
}
....
....
void cea_set_pte(void *cea_vaddr, phys_addr_t pa, pgprot_t flags)
{
unsigned long va = (unsigned long) cea_vaddr;
pte_t pte = pfn_pte(pa >> PAGE_SHIFT, flags);
// NB. _PAGE_GLOBAL indicates a mapping for all page tables
// including both user and kernel.
if (boot_cpu_has(X86_FEATURE_PGE) &&
(pgprot_val(flags) & _PAGE_PRESENT))
pte = pte_set_flags(pte, _PAGE_GLOBAL);
Lastly, PTI leverages PCID (Process Context IDentifier) to avoid flushing the TLB whenever it switches between the user and kernel page tables, by assigning a different PCID to each execution mode. With this hardware support from Intel, the performance impact of PTI is kept small.
.macro ADJUST_KERNEL_CR3 reg:req
ALTERNATIVE "", "SET_NOFLUSH_BIT \reg", X86_FEATURE_PCID
// NB. CR3 register contains an address of page table.
// Since the lowest 12 bits are unused, (page-aligned)
// they are used to represent a PCID.
andq $(~PTI_USER_PGTABLE_AND_PCID_MASK), \reg
.endm
.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
mov %cr3, \scratch_reg
ADJUST_KERNEL_CR3 \scratch_reg
mov \scratch_reg, %cr3
.Lend_\@:
.endm
References
- KASLR is Dead: Long Live KASLR
- Meltdown: Reading Kernel Memory from User Space
- Side Channel Attacks on Linux Kernel
- Deep dive into Page Table Isolation
- Deep dive into Defense against BTB attacks
Mitigating Spectre
Modern CPUs have a branch predictor to optimize performance. It works by referencing the Branch Target Buffer (BTB), a storage for key (PC) - value (target PC) pairs. Its limited size causes BTB collisions, which lead to a new side-channel attack. The root cause of the attack is that the BTB stores only some bits of the PC, not the full PC. Using this primitive, an attacker is able to inject an indirect branch target into the BTB and consequently run some code in a speculative context, which can leak sensitive data across boundaries (e.g., between VMs or processes). This attack is called Spectre Variant 2, and Retpoline has been introduced to stop it. The concept of Retpoline is straightforward, but the implementation is a little tricky: Retpoline aims to eliminate all speculative behaviors that can be controlled by an attacker.
Indirect jump replacement with Retpoline. Take a look at how to implement an indirect branch with no speculation.
// Before Retpoline
jmp *%rax
// After Retpoline
(1) call load_label
capture_ret_spec:
(2) pause; LFENCE
(3) jmp capture_ret_spec
load_label:
(4) mov %rax, (%rsp)
(5) ret
// NB. Let's follow a Retpoline gadget.
// two executions are performing in parallel.
// o: original execution
// s: speculative execution
// (1-o) Direct branch to (4).
// Push (2) onto the stack as a return.
// (4-o) Overwrite the return with the real target.
// (5-o) Load the real target from the stack memory.
// (5-s) If speculating here by RSB (Return Stack Buffer),
// consumes RSB entry created in (1-o),
// jumps to (2)
// (2-s) relaxing cpu for the spin-wait loop.
// (3-s) jumps to (2) again.
// it forces the speculative execution
// not be outside of (2)-(3).
// (5-o) Jump to the real target.
// --> There are no speculation!
The Linux kernel supports Retpoline as an alternative section, which means that an admin can enable or disable Retpoline at boot time via a kernel command-line option.
// NB. A retpoline gadget replacing an indirect branch.
.macro RETPOLINE_JMP reg:req
call .Ldo_rop_\@
.Lspec_trap_\@:
pause
lfence
jmp .Lspec_trap_\@
.Ldo_rop_\@:
mov \reg, (%_ASM_SP)
ret
.endm
.macro JMP_NOSPEC reg:req
#ifdef CONFIG_RETPOLINE
// NB. register an indirect branch as an alternative insn.
ANNOTATE_NOSPEC_ALTERNATIVE
ALTERNATIVE_2 __stringify(ANNOTATE_RETPOLINE_SAFE; jmp *\reg), \
__stringify(RETPOLINE_JMP \reg), X86_FEATURE_RETPOLINE, \
__stringify(lfence; ANNOTATE_RETPOLINE_SAFE; jmp *\reg), X86_FEATURE_RETPOLINE_AMD
#else
jmp *\reg
#endif
.endm
void __init alternative_instructions(void)
{
...
// NB. runtime patching to apply retpoline gadgets.
// (__alt_instructions ~ __alt_instructions_end) includes
// a lot of sites for indirect branches.
apply_alternatives(__alt_instructions, __alt_instructions_end);
...
}
MDS (Microarchitectural Data Sampling).
When performing stores, loads, and L1 refills, processors write data into various temporary buffers defined by the microarchitecture, such as the Load Buffer, Store Buffer, and Line Fill Buffer. The data in these buffers can be forwarded to load operations as an optimization. Unfortunately, this kind of forwarding can cross privilege boundaries, meaning kernel data can be forwarded to a load operation inside user space. If an attacker can feed the data into a leak gadget in user space, it eventually leaks kernel memory. The mitigation against this attack is very straightforward: clear the CPU buffers when returning to user space.
static inline void mds_clear_cpu_buffers(void)
{
static const u16 ds = __KERNEL_DS;
....
asm volatile("verw %[ds]" : : [ds] "m" (ds) : "cc");
}
static inline void mds_user_clear_cpu_buffers(void)
{
if (static_branch_likely(&mds_user_clear))
mds_clear_cpu_buffers();
}
__visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
{
....
// NB. When returning from kernel to user,
// it clears cpu buffers that contain in-flight data.
mds_user_clear_cpu_buffers();
}
L1TF - L1 Terminal Fault. L1TF is a hardware vulnerability which allows unprivileged speculative access to data in the Level 1 data cache. Its root cause is that the physical address in a PTE (page table entry) can be speculatively accessed even though the PTE is invalid when performing the page table walk.
The Linux kernel applies PTE inversion to the code paths relevant to page table maintenance: it modifies the PTE so that the physical address in an invalid PTE always points to invalid physical memory. This is an unconditional defense.
// PTE inversion
static inline bool __pte_needs_invert(u64 val)
{
// NB. the PTE exists, but is invalid.
return val && !(val & _PAGE_PRESENT);
}
static inline u64 protnone_mask(u64 val)
{
// NB. If PTE inversion needed,
// return the mask for PTE to point to invalid memory.
return __pte_needs_invert(val) ? ~0ull : 0;
}
static inline unsigned long pte_pfn(pte_t pte)
{
phys_addr_t pfn = pte_val(pte);
// NB. Masking PFN (physical address)
// after masking, it's pointing to invalid memory.
pfn ^= protnone_mask(pfn);
return (pfn & PTE_PFN_MASK) >> PAGE_SHIFT;
}
References
- More details about mitigations for the CPU Speculative Execution issue
- Retpoline: a software construct for preventing branch-target-injection
- Spectre Returns! Speculation Attacks using the Return Stack Buffer
- MDS - Microarchitectural Data Sampling
- L1TF - L1 Terminal Fault
- Meltdown strikes back: the L1 terminal fault vulnerability
Kconfig | A | A+ | F | U | U+ |
---|---|---|---|---|---|
BPF | Y | Y | N | Y | Y |
BPF_SYSCALL | Y | Y | Y | Y | Y |
BPF_JIT | Y | Y | Y | Y | Y |
BPF_JIT_ALWAYS_ON | Y | Y | Y | Y | Y |
A: Arch 5.1.8, A+: Arch Hardened 5.1.11, F: Fedora 5.1.8, U: Ubuntu 5.0.0, U+: Ubuntu LTS 4.15.0
Hardening Hostile Code in eBPF
Constant blinding in JIT
The JIT-ed memory is a common target for an attacker to place an
arbitrary gadget (e.g., a syscall instruction in userspace). One popular
technique is to encode a desirable sequence of instructions as part of
immediate values, as x86-like CISC architectures can encode long byte
sequences in a single instruction.
Constant blinding is a technique that breaks such immediate values,
avoiding the use of attacker-chosen constants in the executable region.
It's worth noting that there are numerous other techniques (e.g.,
controlling the constant offset of direct branches), but most well-known
attacks in user space might not be as effective in the kernel, as BPF
provides only a smaller region with a smaller set of instructions
available. The implementation of constant blinding is straightforward:
xor the chosen immediate value with a random constant and, before using
it, xor the mangled value again with the same random constant.
int bpf_jit_blind_insn(const struct bpf_insn *from,
const struct bpf_insn *aux,
struct bpf_insn *to_buff)
{
u32 imm_rnd = get_random_int();
switch (from->code) {
case BPF_ALU | BPF_ADD | BPF_K:
case BPF_ALU | BPF_SUB | BPF_K:
case BPF_ALU | BPF_AND | BPF_K:
case BPF_ALU | BPF_OR | BPF_K:
case BPF_ALU | BPF_XOR | BPF_K:
case BPF_ALU | BPF_MUL | BPF_K:
case BPF_ALU | BPF_MOV | BPF_K:
case BPF_ALU | BPF_DIV | BPF_K:
case BPF_ALU | BPF_MOD | BPF_K:
// NB. no more attack controllable instructions inserted
// in the jitted, executable space (e.g., jump in the middle
// of the immediate value)
//
// MOV _, 0xdeadbeef
// => MOV AX, [imm_rnd ^ 0xdeadbeef]
// XOR AX, imm_rnd
// MOV _, AX
*to++ = BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm);
*to++ = BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
*to++ = BPF_ALU32_REG(from->code, from->dst_reg, BPF_REG_AX);
break;
...
}
...
}
Preventing Spectre
For non-privileged BPF programs, the JIT engine applies mitigation schemes against microarchitectural side-channel attacks, such as Spectre.
Variant 1 (Bounds Check Bypass). To prevent speculative execution from performing an out-of-bound array access, the index used on arrays accessible by an unprivileged user is masked. The Linux kernel places such a check in both its BPF interpreter and JIT-ed code.
// NB. in JIT-ed code:
// array[index] -> array[index & mask]
u32 array_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf) {
...
if (map->unpriv_array) {
*insn++ = BPF_JMP_IMM(BPF_JGE, ret, map->max_entries, 4);
*insn++ = BPF_ALU32_IMM(BPF_AND, ret, array->index_mask);
} else {
*insn++ = BPF_JMP_IMM(BPF_JGE, ret, map->max_entries, 3);
}
...
}
// NB. in a function called from an eBPF program
static void *percpu_array_map_lookup_elem(struct bpf_map *map, void *key)
{
struct bpf_array *array = container_of(map, struct bpf_array, map);
u32 index = *(u32 *)key;
if (unlikely(index >= array->map.max_entries))
return NULL;
// NB. even after a speculator reaches here, it won't access
// beyond the region of array->pptrs
return this_cpu_ptr(array->pptrs[index & array->index_mask]);
}
Recently, a more sophisticated mitigation that thwarts generic gadgets for V1 was introduced; it simulates the behavior of the speculative execution and detects potential out-of-bound memory accesses. Please refer to [2] for an in-depth explanation.
Variant 2 (Branch Target Injection).
For indirect jumps introduced during the jitting,
BPF applies the Retpoline mitigation,
like the Linux kernel code.
For example, when the BPF_JMP
instruction is a tail call,
it invokes the same bpf program again,
which is commonly implemented with an indirect jump
(jumping right after the prologue).
RETPOLINE_RAX_BPF_JIT is introduced to produce a retpoline-enabled jump
gadget that replaces an indirect call through rax.
// do_jit() {
// ...
// case BPF_JMP | BPF_TAIL_CALL:
// emit_bpf_tail_call(&prog);
// break;
// }
void emit_bpf_tail_call(u8 **pprog) {
...
/*
* Wow we're ready to jump into next BPF program
* rdi == ctx (1st arg)
* rax == prog->bpf_func + prologue_size
*/
RETPOLINE_RAX_BPF_JIT();
..
}
# define RETPOLINE_RAX_BPF_JIT() \
do { \
EMIT1_off32(0xE8, 7); /* callq do_rop */ \
/* spec_trap: */ \
EMIT2(0xF3, 0x90); /* pause */ \
EMIT3(0x0F, 0xAE, 0xE8); /* lfence */ \
EMIT2(0xEB, 0xF9); /* jmp spec_trap */ \
/* do_rop: */ \
EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */ \
EMIT1(0xC3); /* retq */ \
} while (0)
Variant 4 (Speculative Store Bypass). To prevent speculative memory disambiguation from enabling an arbitrary kernel memory read, the BPF verifier detects the malicious patterns that could trigger such speculation at the time a BPF program is loaded, and sanitizes them.
// NB: Safe execution flow by sanitizing a pattern
// Detect a case of reusing stack slot, and sanitize it.
// (1) r8 = *(u64 *)(r7 +0) // slow read
// (2) *(u64 *)(r10 -72) = 0 // instruction for sanitizing
// - this store becomes fast due to no dependency on (1)
// (3) *(u64 *)(r8 +0) = r3 // this store becomes slow due to r8
// ---- at this time, (2) is likely to be completed before (3),
// ---- so it can perfectly eliminate an arbitrary unsafe address.
// (4) r1 = *(u64 *)(r6 +0) // loads from either sanitized or safe address
// (5) r2 = *(u8 *)(r1 +0) // no leak happens
struct bpf_insn_aux_data {
....
int sanitize_stack_off; /* stack slot to be cleared */
....
}
static int check_stack_write(struct bpf_verifier_env *env, ....
{
....
for (i = 0; i < BPF_REG_SIZE; i++) {
if (state->stack[spi].slot_type[i] == STACK_MISC &&
!env->allow_ptr_leaks) {
int *poff = &env->insn_aux_data[insn_idx].sanitize_stack_off;
int soff = (-spi - 1) * BPF_REG_SIZE;
....
....
// NB: examine a store instruction writing to a stack slot.
// record this offset for detecting reused stack slot.
*poff = soff;
}
state->stack[spi].slot_type[i] = STACK_SPILL;
}
....
}
static int convert_ctx_accesses(struct bpf_verifier_env *env)
{
....
// NB: Is it a reused stack slot?
if (type == BPF_WRITE &&
env->insn_aux_data[i + delta].sanitize_stack_off) {
struct bpf_insn patch[] = {
....
// NB: Sanitize it with 0.
BPF_ST_MEM(BPF_DW, BPF_REG_FP,
env->insn_aux_data[i + delta].sanitize_stack_off,
0),
....
};
}
}
References
- bpf: prevent out-of-bounds speculation
- bpf: prevent out of bounds speculation on pointer arithmetic
- bpf: Prevent memory disambiguation attack
- Speculose: Analyzing the Security Implications of Speculative Execution in CPUs
- A Systematic Evaluation of Transient Execution Attacks and Defenses
Preventing Code Reuse Attack
System programming languages such as C and C++ give programmers the
freedom to optimize and control their resources. Requiring the programmer
to manually manage memory and observe typing rules leads to security
vulnerabilities, and memory corruptions are routinely exploited by
attackers. The following defenses were introduced to mitigate such attacks:
- Address Space Layout Randomization (ASLR)
- Stack canaries
- Data Execution Prevention (DEP)
They protect against code injection attacks, but do not fully prevent code reuse attacks, e.g., ROP.
Return Oriented Programming (ROP)
In a ROP attack, the attacker does not inject new code; instead, the malicious computation is performed by chaining together existing sequences of instructions (called gadgets).
Stack Instructions
+-----------------+
| | -----------> pop eax
|-----------------| /---- ret
| v1 | /
|-----------------| <----/
| | -----------> pop ebx
|-----------------| /---- ret
| v2 | /
|-----------------| <----/
| | -----------> add eax, ebx
| | /----- ret
|-----------------| <----/
| | -----------> pop ecx
|-----------------| /---- ret
| v3 | /
|-----------------| <----/
| | -----------> mov [ecx], eax
+-----------------+ ret
-> Result: mem[v3] = v1 + v2
The attacker finds gadgets within the original program text and causes them
to be executed in sequence to perform a task other than what was intended.
Common objectives of such malicious payloads include arbitrary code execution,
privilege escalation, and exfiltration of sensitive information.
Many ROP attacks use unintended instruction sequences. CFI mitigates such
attacks by guaranteeing that the program stays on its intended execution flow.
Control Flow Integrity (CFI)
CFI restricts the set of possible control-flow transfers to those that are
strictly required for correct program execution. This prevents code-reuse
techniques such as ROP from working, because they would cause the program
to execute control-flow transfers that are illegal under CFI.
Most CFI mechanisms follow a two-phase process:
- An analysis phase constructs the Control-Flow Graph (CFG) which approximates the set of legitimate control-flow transfers
- The CFG is used at runtime by an enforcement component to ensure that all executed branches correspond to edges in the CFG
<Control Flow Graph>
func1()
/ \
/ \
v v - function call 2 from 1 is allowed
func2() func3() - function call 4 from 3 is forbidden
|
|
V
func4()
However, it is hard to construct a fine-grained CFG because indirect branch targets cannot be fully determined by static analysis, so most CFGs contain approximations. RAP, for instance, implements a type-based approximation of the CFG.
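A hypothetical illustration of that type-based approximation: both handlers below share the type int (*)(struct request *), so a type-hash scheme such as RAP's accepts either one at the indirect call site, even if only one of them is the intended target (struct request and the handler names are made up).
// NB. hypothetical example of type-based ICFG approximation
struct request;
typedef int (*handler_t)(struct request *);

int handle_read(struct request *r)  { return 0; }
int handle_write(struct request *r) { return 1; }

int dispatch(handler_t h, struct request *r)
{
        // NB. under type-based CFI, the check compares the hash of h's type
        // against the hash emitted at the target; both handlers match it
        return h(r);
}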
PaX Reuse Attack Protector (RAP)
RAP is a defense mechanism against code reuse attacks. It is a CFI
technology developed by PaX. RAP is included in the grsecurity patch for
Linux kernel security, but only the commercial version provides its full
functionality.
RAP is implemented as a GCC compiler plugin, and it consists of two components:
- A deterministic defense limiting function call and return location
- A probabilistic defense to help ensure that a function can return to the location where the function was called
Indirect Control Transfer Protection
RAP implements CFI based on a type-based indirect control flow graph
(ICFG). It builds on the idea that categorizing ICFG vertices lets the
ICFG approximation emerge automatically, which means the analysis can be
conducted at the function level without knowledge of the entire program.
It categorizes functions by type: return type, function name, and function
parameters. The type information extracted from each function and function
pointer is used to verify the match between a function and a function
pointer dereference (indirect call, function return, etc.). Type matching
uses a hash value calculated from the appropriate type parts of each function.
A different set of type parts is used for the type hash depending on the kind of function.
Usable parts in type hash | Return | Name | ’this’ | Parameters |
---|---|---|---|---|
non-class or static member function/ptr | Y | N | N/A | Y |
non-virtual method/ptr | Y | N | N | Y |
virtual method/ptr | N | N | N | Y |
ancestor method/virtual method call | Y | Y | Y | Y |
Table: (RAP: RIP ROP) Type Hash Parts
Plugin code for function pointer protection:
// @rap_plugin/rap_fptr_pass.c
static unsigned int rap_fptr_execute(void)
{
...
// ... through a function pointer
fntype = TREE_TYPE(fntype);
gcc_assert(TREE_CODE(fntype) == FUNCTION_TYPE || TREE_CODE(fntype) ==
METHOD_TYPE);
if (enable_type_call) {
rap_instrument_fptr(&gsi);
bb = gsi_bb(gsi);
gcc_assert(call_stmt == gsi_stmt(gsi));
}
if (enable_type_ret) {
hash = rap_hash_function_type(fntype, imprecise_rap_hash_flags);
computed_hash = build_int_cst_type(rap_hash_type_node, -hash.hash);
rap_mark_retloc(&gsi, computed_hash);
}
}
...
// check the function hash of the target of the fptr
static void rap_instrument_fptr(gimple_stmt_iterator *gsi)
{
...
if (TREE_CODE(fntype) == FUNCTION_TYPE) {
computed_hash = build_rap_hash(call_stmt, fntype);
} else {
debug_tree(fntype);
gcc_unreachable();
}
...
target_hash = get_rap_hash(&stmts, loc, fptr, -rap_hash_offset);
gsi_insert_seq_before(gsi, stmts, GSI_SAME_STMT);
// compare target_hash against computed function hash
// bail out on mismatch
check_hash = gimple_build_cond(NE_EXPR, target_hash, computed_hash, NULL_TREE,
NULL_TREE);
gimple_set_location(check_hash, loc);
gsi_insert_before(gsi, check_hash, GSI_NEW_STMT);
...
Plugin code for return location protection:
// @rap_plugin/rap_ret_pass.c
/*
* insert the equivalent of
* if (*(long *)((void *)retaddr+N) != (long)-function_hash) abort();
*/
static void check_retaddr(gimple_stmt_iterator *gsi, tree new_retaddr)
{
...
#ifdef TARGET_386
if (TARGET_64BIT)
target_hash = get_rap_hash(&stmts, loc, new_retaddr, -16);
else
target_hash = get_rap_hash(&stmts, loc, new_retaddr, -10);
#else
...
hash = rap_hash_function_type(TREE_TYPE(current_function_decl),
imprecise_rap_hash_flags);
computed_hash = build_int_cst_type(rap_hash_type_node, -hash.hash);
stmt = gimple_build_cond(NE_EXPR, target_hash, computed_hash, NULL_TREE,
NULL_TREE);
gimple_set_location(stmt, loc);
gsi_insert_after(gsi, stmt, GSI_CONTINUE_LINKING);
...
Return Address Protection
Return Address Protection is another defense mechanism of RAP. It is conceptually based on the XOR canary approach: RAP encrypts the return address with a key stored in a reserved register (r12 on amd64). The key is highly resistant to leaking, as it should never be stored or spilled into memory. In addition, grsecurity states that the RAP cookie (the encryption key) does not stay constant but changes per task, per system call, and per iteration in some infinite loops.
The following is an example of RAP's return address protection; its full implementation is not revealed in the public PaX test patch.
// RAP example
push %rbx
mov 8(%rsp),%rbx
xor %r12,%rbx
...
xor %r12,%rbx
cmp %rbx,8(%rsp)
jnz .error
pop %rbx
retn
.error:
ud2
References
- Control-Flow Integrity: Precision, Security, and Performance
- ROP is Still Dangerous: Breaking Modern Defenses
- On the Effectiveness of Type-based Control Flow Integrity (TROP)
- RAP: RIP ROP
- PAX RAP FAQ
- PAX linux patch: test version
ARM's Memory Tagging Extensions (MTE)
XXX. write here
References
Kconfig | A | A+ | F | U | U+ |
---|---|---|---|---|---|
GCC_PLUGIN_STACKLEAK | N | Y | N | N | N |
GCC_PLUGIN_STRUCTLEAK_USER | N | N | N | N | N |
GCC_PLUGIN_STRUCTLEAK_BYREF | N | N | N | N | N |
GCC_PLUGIN_STRUCTLEAK_BYREF_ALL | N | Y | N | N | N |
INIT_STACK_ALL | N | N | N | N | N |
GCC_PLUGIN_RANDSTRUCT | N | N | N | N | N |
GCC_PLUGIN_RANDSTRUCT_PERFORMANCE | N | N | N | N | N |
A: Arch 5.1.8, A+: Arch Hardened 5.1.11, F: Fedora 5.1.8, U: Ubuntu 5.0.0, U+: Ubuntu LTS 4.15.0
Compiler-based Security Checkers
Preventing information leaks
Stackleak Plugin.
There are three kinds of vulnerabilities that the STACKLEAK kernel security feature defends against: 1) information disclosure from leaving data on the stack that can be exfiltrated to user space, 2) targeting an uninitialized variable on the kernel stack to get a stack location, and 3) stack clash.
The first two vulnerabilities are closely related: attacks can be chained,
e.g., obtaining information from leftover values on the stack and then
targeting that address. These two vulnerabilities are blocked by the
feature called stack poisoning. It overwrites the used portion of the
stack with STACKLEAK_POISON (-0xBEEF) at the end of a syscall, before
returning to the caller. Below is the implementation of stack poisoning:
the code searches for the already-poisoned part of the kernel stack and
then fills the STACKLEAK_POISON value from the lowest boundary of the
currently used stack.
asmlinkage void stackleak_erase(void)
{
unsigned long kstack_ptr = current -> lowest_stack;
unsigned long boundary = (unsigned long)end_of_stack(current);
unsigned int poison_count = 0;
const unsigned int depth = STACKLEAK_SEARCH_DEPTH / sizeof(unsigned long);
if (skip_erasing())
return;
while (kstack_ptr > boundary && poison_count <= depth) {
if (*(unsigned long *)kstack_ptr == STACKLEAK_POISON)
poison_count++;
else
poison_count = 0;
kstack_ptr -= sizeof(unsigned long);
}
if (kstack_ptr == boundary)
kstack_ptr += sizeof(unsigned long);
...
if (on_thread_stack())
boundary = current_stack_pointer;
else
boundary = current_top_of_stack();
while (kstack_ptr < boundary) {
*(unsigned long *)kstack_ptr = STACKLEAK_POISON;
kstack_ptr += sizeof(unsigned long);
}
current->lowest_stack = current_top_of_stack() - THREAD_SIZE/64;
}
Stack poisoning prevents leftover stack data from being exposed, but it only works against multi-system-call attacks; it cannot protect against attacks that complete during a single system call, since the poisoning is done at the end of the system call. Another caveat is that the poison value could itself be a valid stack pointer, because user-space addresses occupy the low end of the address space (0x0000...) while kernel-space addresses, where the poison value also falls, occupy the high end (0xffff...).
The third vulnerability is the stack clash. It includes clashing the stack with another memory region, jumping over the stack guard page, and smashing the stack (overwriting it with another memory region). When the stack is full, it is automatically extended via a page fault. However, if the end of the current stack lands on an already-allocated page, no page fault happens, the kernel cannot notice that the stack's end has been reached, and a stack clash can occur.
Usually, variable-length arrays or alloca() calls are what consume the stack's allocated space, so the STACKLEAK plugin tried to prevent stack clashes by instrumenting all alloca() calls with checks using panic() and BUG_ON(). Stack poisoning is now included in the Linux kernel mainline, but the alloca() checking has been dropped, since all VLAs are believed to have been removed instead.
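For context, the poisoning above relies on current->lowest_stack being kept up to date: the gcc plugin instruments functions with large stack frames to call a tracking helper. A condensed sketch of that helper in kernel/stackleak.c:
// @kernel/stackleak.c (condensed sketch)
void notrace stackleak_track_stack(void)
{
        unsigned long sp = current_stack_pointer;

        sp &= ~(unsigned long)(sizeof(unsigned long) - 1);
        // NB. remember the deepest stack use so that stackleak_erase()
        // knows how far up it has to poison
        if (sp < current->lowest_stack &&
            sp >= (unsigned long)task_stack_page(current) + sizeof(unsigned long))
                current->lowest_stack = sp;
}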
Structleak Plugin.
There are many structures that are not initialized in kernel code; they
may contain interesting kernel values when copied to user space without
initialization. One example arose in CVE-2013-2141. According to the
report, the do_tkill function in kernel/signal.c before kernel 3.8.9 did
not initialize the data structure variable siginfo. do_tkill is called by
the tkill and tgkill system calls, which can be invoked by user-level
processes, so when handling signals delivered from tkill, kernel memory
becomes visible to user space.
The Structleak plugin resolves this issue by initializing otherwise
uninitialized structures in the kernel. The plugin is invoked after gcc
finishes type parsing. It currently has three modes: Disabled, BYREF, and
BYREF_ALL. With BYREF, the plugin initializes structures that may be
passed by reference and have not been initialized. With BYREF_ALL, it
initializes any stack variable passed by reference.
First, the PLUGIN_FINISH_TYPE callback is called after parsing the types
in the code. The function finish_type() sets TYPE_USERSPACE on structure
variables of interest, i.e., those with the __user attribute on their
declaration.
static bool is_userspace_type(tree type)
{
tree field;
for (field = TYPE_FIELDS(type); field; field = TREE_CHAIN(field)) {
tree fieldtype = get_field_type(field);
enum tree_code code = TREE_CODE(fieldtype);
if (code == RECORD_TYPE || code == UNION_TYPE)
if (is_userspace_type(fieldtype))
return true;
if (lookup_attribute("user", DECL_ATTRIBUTES(field)))
return true;
}
return false;
}
After the declarations of interest are marked, structleak_execute() runs.
The execution function iterates over all local variables and initializes
the targets. It only considers auto variables (local variables stored in
the stack region), and only record or union types unless BYREF_ALL is set.
If the local declaration is of a type of interest (user annotated), or is
an addressable variable with BYREF set, the plugin calls the initialize
function.
static unsigned int structleak_execute(void)
{
...
/* enumerate all local variables and forcibly initialize our targets */
FOR_EACH_LOCAL_DECL(cfun, i, var) {
tree type = TREE_TYPE(var);
gcc_assert(DECL_P(var));
if (!auto_var_in_fn_p(var, current_function_decl))
continue;
if (byref != BYREF_ALL && TREE_CODE(type) != RECORD_TYPE && TREE_CODE(type) != UNION_TYPE)
continue;
if (TYPE_USERSPACE(type) ||
(byref && TREE_ADDRESSABLE(var)))
initialize(var);
}
return ret;
}
However, the plugin has a false positive problem, because the __user
attribute exists for kernel static analysis tools such as Sparse and is
not a true indication of whether the pointer will be copied to user space.
Conversely, there may be other pointers without the __user attribute that
are nevertheless copied to user space.
The PaX team, who originally proposed the plugin, is aware of the false
positive problem and suggests a better solution of analyzing calls to
copy_to_user(), but it seems they no longer pay attention to this problem
since the original issue, CVE-2013-2141, was solved.
Since the plugin only initializes structures passed by reference when BYREF is set, as shown in structleak_execute() above, it is highly recommended to set BYREF or BYREF_ALL when using this plugin so that it works as expected.
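To make the effect concrete, here is a rough illustration (hand-written, not actual plugin output; the structure and helper functions are hypothetical) of the zero-initialization the plugin effectively inserts at function entry when BYREF/BYREF_ALL are enabled:
struct req { int op; char payload[60]; };          /* hypothetical structure */

void fill_request(struct req *r, long *scratch);   /* hypothetical helpers */
void submit_request(struct req *r);

void ioctl_handler(void)
{
        struct req r = { };    /* BYREF: addressable struct, forcibly cleared */
        long buf[4] = { };     /* BYREF_ALL: any by-reference local, cleared  */

        fill_request(&r, buf);
        submit_request(&r);
}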
Randomizing kernel data structures
Kernel structures (struct or union) contain many juicy targets for attackers, for example function pointers, stack pointers, process credentials, and important flags. Attackers usually try to trick the kernel into executing their exploit code by overwriting such members.
Randstruct plugin.
In order to mitigate such attacks, the Randstruct plugin randomizes structure layout at compile time. Once the layout is randomized, it becomes much harder for attackers to overwrite specific members, since they no longer know where those members live inside the structure.
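For example (hypothetical structure and field names), opting a structure into randomization only takes the attribute below. Note that once the layout is shuffled, designated initializers are the only safe way to initialize such a structure, since positional initializers would no longer match the declaration order:
struct handler_ops {
        int  (*open)(void *priv);
        void (*close)(void *priv);
        long flags;
} __randomize_layout;                  /* field order is shuffled at build time */

static const struct handler_ops my_ops = {
        .open  = NULL,                 /* designated initializers still work, */
        .close = NULL,                 /* whatever order the fields end up in */
        .flags = 0,
};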
Randstruct plugin works in three steps:
- Detect structure to randomize.
- Randomize layout of the structure.
- Find bad casts from/to randomized structure pointers and report them.
Note: Step 3 runs only after steps 1 and 2 have been done for all structures.
Let's see what it does at each step with code.
1. Detection.
When detecting target structures, the plugin picks any structure whose declaration is marked with the "__randomize_layout" attribute, and it automatically picks structures that contain only function pointers.
static int is_pure_ops_struct(const_tree node)
{
...
// Returns 1 if the structure contains only function pointers.
// There was a bug here that could cause a false negative, and we patched it
// (see "Check member structs in is_pure_ops_struct()" in the references below).
...
}
static void randomize_type(tree type)
{
...
if (lookup_attribute("randomize_layout", TYPE_ATTRIBUTES(TYPE_MAIN_VARIANT(type))) || is_pure_ops_struct(type))
relayout_struct(type);
...
}
2. Randomization.
Once the target structure is picked, the plugin randomizes the order of its fields with the classic in-place Fisher-Yates shuffle. If the target structure ends with a flexible array member, however, that field is excluded from the shuffle.
static int relayout_struct(tree type)
{
...
/*
* enforce that we don't randomize the layout of the last
* element of a struct if it's a 0 or 1-length array
* or a proper flexible array
*/
if (is_flexible_array(newtree[num_fields - 1])) {
has_flexarray = true;
shuffle_length--;
}
shuffle(type, (tree *)newtree, shuffle_length);
...
}
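For reference, the shuffle itself boils down to the textbook in-place Fisher-Yates loop sketched below (standalone C; the real shuffle() in the plugin draws its randomness from the build-time randstruct seed rather than rand()):
#include <stdlib.h>

/* shuffle an array of field pointers in place */
static void shuffle_fields(void **fields, size_t n)
{
        if (n < 2)
                return;
        for (size_t i = n - 1; i > 0; i--) {
                size_t j = (size_t)rand() % (i + 1);   /* pick 0 <= j <= i */
                void *tmp = fields[i];                 /* swap entries i and j */
                fields[i] = fields[j];
                fields[j] = tmp;
        }
}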
3. Bad casts notification.
\textbf{\textcolor{red}{TODO}}
References
- Check member structs in is_pure_ops_struct()
- Randomizing structure layout
- Introduce struct layout randomization plugin
Kconfig | A | A+ | F | U | U+ |
---|---|---|---|---|---|
FORTIFY_SOURCE | Y | Y | Y | Y | Y |
LIVEPATCH | N | N | N | Y | Y |
A: Arch 5.1.8,A+: Arch Harden 5.1.11,F: Fedora 5.1.8,U: Ubuntu 5.0.0,U+: Ubuntu LTE 4.15.0
Miscellaneous Topics
Eliminating Variable-length Arrays (VLA)
VLAs allow programmers to specify the length of an array at runtime, e.g., using a variable instead of a constant for the array size. This makes it easier to write certain kinds of logic such as packet/buffer handling or string manipulation, but it has two critical problems: security and performance.
void test_vla(int i) {
long buf[i];
// => 30 instructions w/ div/mul
// char *buf;
// buf -= ((i << 0x3) + 0xf) / 0x10 * 0x10
}
In terms of security, such a code pattern makes it hard to estimate stack usage, which can lead to a stack clash in kernel space. Also, although often not emphasized enough, it makes uninitialized-use vulnerabilities easier to exploit, e.g., by letting an attacker place arbitrary data at a chosen offset of the kernel stack.
In terms of performance, this benign-looking code is translated into roughly 30 native instructions that compute the proper size and enforce stack alignment (which many architectures require on pain of an exception). The translated instructions include a few computationally intensive ones such as div and imul, and so impose unwanted overhead on a common path.
Since v4.20 [1], the compilation warning on any use of VLA (i.e., -Wvla) has been globally turned on; a VLA now prevents the kernel from compiling, thereby guiding developers to properly remove it.
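A typical fix looks like the sketch below: bound the length by a known maximum and use a fixed-size array (or switch to a heap allocation when no reasonable bound exists). The names and the bound are illustrative:
#define MAX_ENTRIES 64

int process(const long *in, int n)
{
        long buf[MAX_ENTRIES];          /* was: long buf[n];  (VLA) */

        if (n < 0 || n > MAX_ENTRIES)
                return -1;              /* reject instead of growing the stack */

        for (int i = 0; i < n; i++)
                buf[i] = in[i] * 2;
        /* ... */
        return 0;
}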
Preventing mistakes in switch/case
The usage of switch/case in C is rather error-prone: an optional break statement ends each case block, and without it execution simply falls through into the next case block. As both patterns are prevalent, it is hard to tell whether a missing break is the intended control flow or a mistake.
The most recent break mistake (04/2019 at the time of writing) was in sock_ioctl(), a widely used and heavily audited function!
// @net/socket.c
long sock_ioctl(struct file *file, unsigned cmd, unsigned long arg) {
...
case SIOCGSTAMP_OLD:
case SIOCGSTAMPNS_OLD:
if (!sock->ops->gettstamp) {
err = -ENOIOCTLCMD;
break;
}
err = sock->ops->gettstamp(sock, argp,
cmd == SIOCGSTAMP_OLD,
!IS_ENABLED(CONFIG_64BIT));
+ break;
case SIOCGSTAMP_NEW:
case SIOCGSTAMPNS_NEW:
...
break;
To address this error-prone situation, GCC introduced a compilation warning on implicit case fall-through (i.e., -Wimplicit-fallthrough). To silence the warning, developers must explicitly express the intention, either as a comment (/* fall through */) or as an attribute (__attribute__((fallthrough))).
+++ b/kernel/compat.c
@@ -346,8 +346,11 @@ get_compat_sigset(...)
return -EFAULT;
switch (_NSIG_WORDS) {
case 4: set->sig[3] = v.sig[6] | (((long)v.sig[7]) << 32 );
+ /* fall through */
case 3: set->sig[2] = v.sig[4] | (((long)v.sig[5]) << 32 );
+ /* fall through */
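The attribute form looks like the sketch below (illustrative enum and variables; recent kernels additionally wrap the attribute in a fallthrough pseudo-keyword macro):
enum cmd { CMD_RESET, CMD_ARM };

static int counter;
static int armed;

static void handle(enum cmd c)
{
        switch (c) {
        case CMD_RESET:
                counter = 0;
                __attribute__((fallthrough));   /* deliberate: reset also re-arms */
        case CMD_ARM:
                armed = 1;
                break;
        }
}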
Fortify
FORTIFY_SOURCE was originally a userspace hardening feature built around gcc, and was later adopted by the Linux kernel. The option detects buffer overflows in various string and memory functions. It cannot detect every kind of buffer overflow (as discussed below), but it is useful because it adds an extra level of validation at low performance overhead. FORTIFY_SOURCE checks for buffer overflows in the following functions:
memcpy, mempcpy, memmove, memset, strcpy, stpcpy, strncpy, strcat,
strncat, sprintf, vsprintf, snprintf, vsnprintf, gets
Let's dive into two of these functions: strcpy() and memcpy().
First, strcpy() checks the object sizes via __builtin_object_size(), which returns the object size as determined at compile time. If an object's size can only be known at run time, e.g., when the object was allocated via kmalloc(), __builtin_object_size() simply returns -1. If both object sizes are only known at run time, strcpy() skips the overflow tests and hands the objects to __builtin_strcpy(). Otherwise, it calls memcpy(), which is also fortified, so the actual buffer-overflow checks happen there. As you can imagine, the fortified strcpy() cannot detect overflows when both object sizes are run-time values, i.e., in the __builtin_strcpy() case.
/* defined after fortified strlen and memcpy to reuse them */
__FORTIFY_INLINE char *strcpy(char *p, const char *q)
{
size_t p_size = __builtin_object_size(p, 0);
size_t q_size = __builtin_object_size(q, 0);
if (p_size == (size_t)-1 && q_size == (size_t)-1)
return __builtin_strcpy(p, q);
memcpy(p, q, strlen(q) + 1);
return p;
}
memcpy() also checks the object sizes via __builtin_object_size(). Both read-overflow and write-overflow checks are performed here; if no overflow is detected, the copy is assumed safe and __builtin_memcpy() runs.
__FORTIFY_INLINE void *memcpy(void *p, const void *q, __kernel_size_t size)
{
size_t p_size = __builtin_object_size(p, 0);
size_t q_size = __builtin_object_size(q, 0);
if (__builtin_constant_p(size)) {
if (p_size < size)
__write_overflow();
if (q_size < size)
__read_overflow2();
}
if (p_size < size || q_size < size)
fortify_panic(__func__);
return __builtin_memcpy(p, q, size);
}
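To see what these checks buy us, consider the illustration below (an illustrative function; headers omitted as in the snippets above). The first call is caught at compile time because the destination size and the length are both known constants; the second slips through because __builtin_object_size() cannot say anything about either pointer and returns -1:
void fortify_demo(char *dst_unknown, const char *src, size_t len)
{
        char onstack[8];

        memcpy(onstack, src, 16);       /* caught: p_size (8) < size (16) and
                                           size is constant, so __write_overflow()
                                           fires at compile time */
        memcpy(dst_unknown, src, len);  /* missed: both object sizes are
                                           (size_t)-1, so no check can trigger */
}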
References
Livepatch
Livepatch is a feature that applies kernel patches without rebooting the system. In many situations a system has to stay up because downtime is extremely costly; at Facebook, for example, rebooting a single machine reportedly takes more than 20 minutes. At the same time, kernel bugs should be patched as soon as they are found. To satisfy both requirements, livepatch provides a way to redirect buggy code to the fixed code while the system keeps running.
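As a concrete starting point, a livepatch module is surprisingly small. The sketch below is modeled after samples/livepatch/livepatch-sample.c: the patched function foo_check() and its replacement are hypothetical, while the klp_* structures and klp_enable_patch() are the upstream livepatch API (registration details vary slightly across kernel versions):
#include <linux/module.h>
#include <linux/livepatch.h>

/* new implementation that replaces the buggy foo_check() in vmlinux */
static int livepatch_foo_check(int arg)
{
        return arg > 0;                         /* hypothetical corrected logic */
}

static struct klp_func funcs[] = {
        {
                .old_name = "foo_check",        /* symbol to redirect */
                .new_func = livepatch_foo_check,
        }, { }
};

static struct klp_object objs[] = {
        {
                /* .name == NULL means the target lives in vmlinux */
                .funcs = funcs,
        }, { }
};

static struct klp_patch patch = {
        .mod  = THIS_MODULE,
        .objs = objs,
};

static int livepatch_init(void)
{
        return klp_enable_patch(&patch);        /* start redirecting callers */
}

static void livepatch_exit(void)
{
}

module_init(livepatch_init);
module_exit(livepatch_exit);
MODULE_LICENSE("GPL");
MODULE_INFO(livepatch, "Y");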
Consistency model
\textbf{\textcolor{red}{Assigned to Sungbae Yoo}}
Design pattern for modules
\textbf{\textcolor{red}{Assigned to Sungbae Yoo}}
How it works
\textbf{\textcolor{red}{Assigned to Sungbae Yoo}}
Shadow data
\textbf{\textcolor{red}{Assigned to Sungbae Yoo}}
Userspace tool (kpatch)
Kpatch is a live kernel patching tool for Linux developed by Red Hat.
kpatch-build is the kpatch component that converts a patch file into a kernel patch module.
+---------+ +---------------------+ +--------------+
| patch | | kpatch-build | | patch module |
+---------+ => | ============ | => +--------------+
| *.patch | | Create patch module | | *.ko |
+---------+ +---------------------+ +--------------+
How to make the kernel module
1. Download and unpack the kernel source matching the patch's distribution.
2. Test the patch file with the dry-run option.
3. Read the special section data with the command (readelf -wi "$VMLINUX")
   - alt_instr, bug_entry size, jump_entry size ...
4. Build the original source with the compile options "-ffunction-sections and -fdata-sections".
5. Build the patched source with the same compile options.
6. Extract the new and modified ELF sections
   - Compare the outputs of #4 and #5 at a section level.
   - Result: an ELF object containing {.kpatch.strings, .kpatch.symbols, .kpatch.relocations}.
7. Build the patch module from the output of #6.
Core data structure: kpatch-elf
kpatch-build uses its own data structure, which adds special data structures on top of the ELF format; these hold the sections that differ between the original object and the patched object. The intermediate objects of kpatch-build are handled in this kpatch-elf form.
struct kpatch_elf {
Elf *elf;
struct list_head sections;
struct list_head symbols;
struct list_head strings;
int fd;
};
Core module: create-diff-object.c
This file contains the heart of the ELF object differencing engine.
- The tool takes two ELF objects from two versions of the same source file.
- a "base" object and a "patched" object
- These objects need to have been compiled with the GCC options
- -ffunction-sections and -fdata-sections
- The tool compares the objects at a section level to determine what sections have changed.
- Once a list of changed sections has been generated, various rules are applied.
References
- Kernel Live Patching: Consistency Model
- kpatch - live kernel patching
- Anatomy of kpatch
- An overview of kpatch-build
- Taesoo Kim, taesoo@gatech.edu
- Jinbum Park, jinb.park@samsung.com
- Jaemin Ryu, jm77.ryu@samsung.com
- Jonghyuk Song, jonghyk.song@gmail.com
- Heeill Wang, showx0123@gmail.com
- Yujeong Lee, yujeong.lee.14@gmail.com
- Joonwon Kang, kjw1627@gmail.com
- Yeonju Ro, <yeonju.ro@samsung.com, yeonju.samsung@gmail.com>
- Sungbae Yoo, ysbnim@gmail.com
- Sangwan Kwon, bitboom9@gmail.com
- Yeji Kim, yeji01.kim@samsung.com
- Sunmin Lee, sunm.lee@samsung.com
- Daehwan Oh, returnwellbeing@gmail.com