Kconfig A A+ F U U+
VMAP_STACK Y Y Y Y Y
THREAD_INFO_IN_TASK Y Y N Y Y
STACKPROTECTOR_STRONG Y Y Y Y N

A: Arch 5.1.8, A+: Arch Hardened 5.1.11, F: Fedora 5.1.8, U: Ubuntu 5.0.0, U+: Ubuntu LTS 4.15.0


Mitigating Stack Overflows

When a stack overflows, an attacker has a chance to overwrite a security-critical data structure located in another task's stack. For example, thread_info traditionally resides at the bottom of the stack, so even a single long-sized overflow can reach it. By overwriting thread_info, an attacker can escalate privileges, e.g., by ultimately overwriting the cred structure.
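
Before THREAD_INFO_IN_TASK (discussed later), thread_info lived at the lowest addresses of the stack allocation and was derived by masking the stack pointer. The snippet below is a simplified sketch modeled after the older x86 code (not a verbatim copy), just to show why a downward overflow lands on it:

// layout (pre-THREAD_INFO_IN_TASK):
//   high |  ... stack grows down ...  |
//   low  |  struct thread_info        |  <- one overflow away
static inline struct thread_info *legacy_current_thread_info(void)
{
  return (struct thread_info *)
    (current_stack_pointer & ~(THREAD_SIZE - 1));
}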

Preventing stack clash

Using virtually-mapped (vmap) stacks has two benefits: 1) flexibility in allocating larger, physically non-contiguous pages, and 2) preventing a potential stack overflow from spreading, by placing a guard page between vmapped regions. Such a guard page mitigates a stack clash, where one task's stack is overwritten by another task's stack if they happen to be located consecutively. Since the stack is virtually mapped, the guard page does not waste a real page, unlike a kmalloc()-based stack.

// @kernel/fork.c
alloc_thread_stack_node()
  -> __vmalloc_node_range()
     -> __get_vm_area_node()

    // by default, vmap_area includes one more page as a guard
    if (!(flags & VM_NO_GUARD))
      size += PAGE_SIZE;

The way the kernel handles a stack overflow is pretty interesting. When kernel code touches a guard page, a page fault is generated, and the fault handler specifically checks for this condition, as shown below.

// @arch/x86/mm/fault.c
// do_page_fault()
//   -> __do_page_fault()
//      -> do_kern_addr_fault()
//         -> bad_area_nosemaphore()
//            -> __bad_area_nosemaphore()
static noinline void
no_context(struct pt_regs *regs, unsigned long error_code,
	   unsigned long address, int signal, int si_code)
{
...
  if (is_vmalloc_addr((void *)address) &&
       (((unsigned long)tsk->stack - 1 - address < PAGE_SIZE) ||
         address - ((unsigned long)tsk->stack + THREAD_SIZE) < PAGE_SIZE)) {

    // NB. as the stack is likely out of space, switch to the double-fault (IST) stack
    unsigned long stack = __this_cpu_ist_top_va(DF) - sizeof(void *);

    // NB. invoke handle_stack_overflow() to inform an oops.
    asm volatile ("movq %[stack], %%rsp\n\t"
                   "call handle_stack_overflow\n\t"
                   "1: jmp 1b"
                   : ASM_CALL_CONSTRAINT
                   : "D" ("kernel stack overflow (page fault)"),
                     "S" (regs), "d" (address),
                   [stack] "rm" (stack));
     unreachable();
  }
...
}

Since we are already running out of stack space, the page fault handler itself is likely to touch the guard page again, which generates a double fault.

// @arch/x86/kernel/traps.c
void do_double_fault(struct pt_regs *regs, long error_code)
{
...
  cr2 = read_cr2();
  if ((unsigned long)task_stack_page(tsk) - 1 - cr2 < PAGE_SIZE)
    handle_stack_overflow("kernel stack overflow (double-fault)", regs, cr2);
...
}

One difference is that the page fault handler checks one page before and after the stack, whereas the double-fault handler checks only the downward-overflow case. This could misdiagnose the condition under STACK_GROWSUP, but that configuration is rarely used in practice.

Related CVEs. Numerous CVEs relevant to VMAP_STACK (e.g., CVE-2016-10153, CVE-2017-8068, CVE-2017-8070) have recently been assigned, with implications ranging from DoS to potential memory corruption (unlikely to be controllable). The basic idea is that when a DMA engine iterates over a scatter-gather list, a stack-allocated, vmapped buffer is unlikely to be physically contiguous across a page boundary, potentially causing writes to an unrelated page. In practice such a buffer is rarely large enough to cross a page boundary; otherwise the developer would have allocated a DMA-able region in the first place. One caveat, however, was that under certain configurations (e.g., DEBUG_VIRTUAL), __phys_addr() triggers BUG() when the provided address lies in the vmalloc() region, resulting in a DoS.
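
A contrived sketch of this bug class is shown below; it is not any specific driver's code (do_transfer() is a hypothetical helper), but it shows why an on-stack buffer handed to the scatter-gather API breaks once the stack is vmapped:

static int send_cmd(struct device *dev)
{
  u8 cmd[8];              // with VMAP_STACK, this lives in vmalloc space
  struct scatterlist sg;

  // sg_init_one() relies on virt_to_page(), which is valid only for the
  // linear (kmalloc) mapping; on a vmapped stack it resolves the wrong
  // page, or with DEBUG_VIRTUAL hits BUG() in __phys_addr() (DoS).
  sg_init_one(&sg, cmd, sizeof(cmd));

  return do_transfer(dev, &sg);   // hypothetical helper issuing the DMA
}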

Performance implication

                   VMAP=n        VMAP=y
                                 #C=0     #C=2          #C=3     #C=5
kbuild (seconds)   27.648798683  -        27.545090630  -        -
iterations, cpu 0  106343        99673    101130        100568   100025
iterations, cpu 2  118526        93372    119380        117901   116726
iterations, cpu 7  117700        94010    117651        117719   115385

The table above shows thread performance measured with a microbenchmark; a higher iteration count means better performance. #C denotes the number of cached stacks per cpu.

Allocating a stack from the vmalloc area makes creating a process with clone() take about 1.5µs longer.[1] To fix this problem, a per-cpu cache of two thread stacks was introduced.[4]
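
The idea of [4] is sketched below in an abbreviated form modeled after kernel/fork.c (not verbatim): keep up to two recently freed vmapped stacks per cpu and reuse them instead of calling __vmalloc_node_range() again.

#define NR_CACHED_STACKS 2
static DEFINE_PER_CPU(struct vm_struct *, cached_stacks[NR_CACHED_STACKS]);

static unsigned long *try_cached_stack(struct task_struct *tsk)
{
  int i;

  for (i = 0; i < NR_CACHED_STACKS; i++) {
    struct vm_struct *s = this_cpu_xchg(cached_stacks[i], NULL);

    if (!s)
      continue;
    tsk->stack_vm_area = s;   // reuse without a fresh vmalloc
    return s->addr;
  }
  return NULL;                // fall back to __vmalloc_node_range()
}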

Thread performance is slower with virtually mapped stacks, and it is affected by the number of cache entries. The current default is two entries; increasing it beyond two makes performance slightly worse. On the other hand, the kbuild test with CONFIG_VMAP_STACK was about 0.1 seconds faster than without it. Overall, it is better to use CONFIG_VMAP_STACK, and the two-entry per-cpu stack cache compensates for the thread-creation overhead.

References

  1. LWN: Virtually mapped kernel stacks
  2. CVE-2016-1583: Exploiting Recursion in the Linux Kernel
  3. Mailing: Can someone explain all the CONFIG_VMAP_STACK CVEs lately?
  4. fork: Cache two thread stacks per cpu if CONFIG_VMAP_STACK is set

Protecting thread_info

Hijacking thread_info or task_struct is a straightforward way to achieve privilege escalation: overwrite the uid with root's, zero. As they used to be located at the bottom of the stack (e.g., task_struct before 2.6, thread_info in later versions), bugs such as a stack clash, a stack overflow, or an arbitrary write after a stack pointer leak can be turned into an exploit against them.

An easy mitigation is to remove them from the stack entirely: THREAD_INFO_IN_TASK, as its name implies, embeds thread_info into task_struct. Since the current task is reachable via a per-cpu data structure, thread_info can be accessed with one additional memory access. Note that thread_info is supposed to contain the architecture-specific information while task_struct holds the architecture-neutral data; the current effort on x86 migrates virtually all information into task_struct.

// @include/linux/sched.h
struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK
  /*
   * For reasons of header soup (see current_thread_info()), this
   * must be the first element of task_struct.
   */
   struct thread_info thread_info;
#endif
...
}
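
With THREAD_INFO_IN_TASK, looking up thread_info no longer touches the stack at all; current_thread_info() simply refers to the member embedded in the current task (as defined in include/linux/thread_info.h):

#ifdef CONFIG_THREAD_INFO_IN_TASK
#define current_thread_info() ((struct thread_info *)current)
#endif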

When THREAD_INFO_IN_TASK is not set, the bottom of the stack contains thread_info, guarded by a magic value, STACK_END_MAGIC, which should not be considered a security mechanism. end_of_stack() simply returns the end of the usable stack region and handles both cases seamlessly.

// @include/linux/sched/task_stack.h
void set_task_stack_end_magic(struct task_struct *tsk)
{
  unsigned long *stackend;
  stackend = end_of_stack(tsk);

  // NB. if this magic value gets overwritten, the stack has overflowed
  *stackend = STACK_END_MAGIC;
}

#ifdef CONFIG_THREAD_INFO_IN_TASK
unsigned long *end_of_stack(const struct task_struct *task)
{
  return task->stack;
}
// NB. thread_info will be copied as part of task_struct
#define setup_thread_stack(new,old) do { } while(0)

#else
unsigned long *end_of_stack(struct task_struct *p)
{
  return (unsigned long *)(task_thread_info(p) + 1);
}
void setup_thread_stack(struct task_struct *p, struct task_struct *org)
{
  // NB. copied to the stack end (top)
  *task_thread_info(p) = *task_thread_info(org);
  task_thread_info(p)->task = p;
}
#endif

Performance implication

  • don't expect much

Stack canary (SSP)

FIX. the config in 4.15 is named differently (CC_STACKPROTECTOR_STRONG before the rename in 4.18)

Performance implication


Option                 Size (KB)            Protected functions
None                   53,519 KB            0/48,606
STACKPROTECTOR         53,490 KB (-0.05%)   1,390/48,607 (+2.86%)
STACKPROTECTOR_STRONG  55,312 KB (+3.35%)   9,922/48,608 (+20.41%)

STACKPROTECTOR_STRONG inserts a canary into about 20% of all kernel functions, whereas STACKPROTECTOR protects only about 3%, resulting in roughly a 3% increase in binary size. For example, bstat() is newly protected under STACKPROTECTOR_STRONG because it has a struct kstat as a local variable.
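
The difference can be observed with a small user-space experiment; the sketch below is illustrative only (struct kstat_like and fill_stat() are made-up stand-ins), and the point is to compile it with -fstack-protector versus -fstack-protector-strong and look for __stack_chk_fail calls in each function:

#include <string.h>

struct kstat_like { unsigned long ino, size, blocks; };
void fill_stat(struct kstat_like *st);   // defined elsewhere in the experiment

unsigned long with_char_array(const char *src)
{
  char buf[64];                    // char array: canary under both options
  strncpy(buf, src, sizeof(buf) - 1);
  buf[sizeof(buf) - 1] = '\0';
  return strlen(buf);
}

unsigned long with_struct_local(void)
{
  struct kstat_like st;            // address-taken local aggregate, like struct
                                   // kstat in bstat(): canary only under -strong
  fill_stat(&st);
  return st.ino + st.size;
}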

What's interesting is the binary size of STACKPROTECTOR compared with the unprotected binary: inserting the canary actually reduces the binary size slightly. According to our analysis, checking the canary in the epilogue tends to encourage the compiler to share common exit sequences (e.g., pop/ret), resulting in better reuse of instructions.

References


Kconfig A A+ F U U+
DEBUG_LIST N Y Y N N
SLAB_FREELIST_HARDENED Y Y Y Y Y
SLAB_FREELIST_RANDOM Y Y Y Y Y

A: Arch 5.1.8, A+: Arch Hardened 5.1.11, F: Fedora 5.1.8, U: Ubuntu 5.0.0, U+: Ubuntu LTS 4.15.0


Mitigating Heap Overflows

Protecting the integrity of heap metadata

Manipulating a crafted list entry introduces a traditional, well-known attack vector known as unsafe unlink. Simply put, an attacker can launch a semi-arbitrary memory write by overwriting a list entry (e.g., via a heap overflow), as illustrated below:

# TODO. draw a real figure

  prev           cur            next
 +------+       +------+       +------+
 |      |------>| PPPP-|------>|      |
 |      |<------|-VVVV |<------|      |
 +------+       +------+       +------+

 *(PPPP+8) = VVVV
 *VVVV = PPPP (restricted)

 so to do *dst = val, an attacker can overwrite cur's list by [dst-8, val].

Before deleting a list entry, the kernel first performs an integrity check, __list_del_entry_valid() (effective when DEBUG_LIST is enabled), and after the deletion it poisons the entry's pointers to assist debugging; the poisoning also tends to prevent data-pointer leaks when a dangling pointer to the freed object remains.

// @include/linux/list.h
void list_del(struct list_head *entry)
{
  __list_del_entry(entry);

  entry->next = LIST_POISON1; // 0xdead000000000100 
  entry->prev = LIST_POISON2; // 0xdead000000000200 
}

void __list_del_entry(struct list_head *entry)
{
  if (!__list_del_entry_valid(entry))
    return;
  __list_del(entry->prev, entry->next);
}

void __list_del(struct list_head * prev, struct list_head * next)
{
  next->prev = prev;
  prev->next = next;
}

Two conditions are checked:

  1. whether the deletion targets an already-deleted entry (e.g., a double delete), which is recognized by the LIST_POISON1/2 values planted by a previous deletion;
  2. whether the chain is intact, i.e., the previous entry's next pointer and the next entry's prev pointer both point back to the entry being deleted.

Similar checks are performed for list addition.

bool __list_del_entry_valid(struct list_head *entry)
{
  struct list_head *prev, *next;

  prev = entry->prev;
  next = entry->next;

  // NB. first check if we are attempting to delete
  // previous deleted entry
  if (CHECK_DATA_CORRUPTION(next == LIST_POISON1,
       "list_del corruption, %px->next is LIST_POISON1 (%px)\n",
       entry, LIST_POISON1) ||
      CHECK_DATA_CORRUPTION(prev == LIST_POISON2,
       "list_del corruption, %px->prev is LIST_POISON2 (%px)\n",
       entry, LIST_POISON2) ||

  // NB. check the integrity of the link chains; prev's next and
  // next's prev correctly point to me

      CHECK_DATA_CORRUPTION(prev->next != entry,
       "list_del corruption. prev->next should be %px, but was %px\n",
       entry, prev->next) ||
      CHECK_DATA_CORRUPTION(next->prev != entry,
       "list_del corruption. next->prev should be %px, but was %px\n",
       entry, next->prev))
    return false;
  return true;
}

SLAB_FREELIST_RANDOM. The allocator's determinism (i.e., the predictable order of allocated chunks) helps an attacker (a bit) in controlling the overflow target. A simple way to disturb this determinism is to randomize the allocation order, which is done by shuffling the list of free chunks when the kmem_cache structure is initialized. The Fisher-Yates shuffle implemented in freelist_randomize() guarantees that every permutation is equally likely.

// @mm/slab_common.c
// init_freelist_randomization()
//   -> init_cache_random_seq()
//     -> cache_random_seq_create()
void freelist_randomize(struct rnd_state *state, unsigned int *list,
                        unsigned int count)
{
  unsigned int rand;
  unsigned int i;

  for (i = 0; i < count; i++)
    list[i] = i;

  /* Fisher-Yates shuffle */
  for (i = count - 1; i > 0; i--) {
    rand = prandom_u32_state(state);
    rand %= (i + 1);
    swap(list[i], list[rand]);
  }
}

CONFIG_SLAB_FREELIST_HARDENED. When a heap object is overflowed, there are two classes of overflow targets (i.e., the nearby object located right after it): 1) a free object, or 2) an allocated object of the same type. In terms of exploitation, one approach is to abuse specific semantics of the target object (e.g., crafting a function pointer in the struct), while another is to develop the overflow into a more preferable primitive (e.g., an arbitrary write). In the case of a free object, there exists a generic approach: abusing the metadata of the heap structures themselves. For example, the link structure called the freelist, which chains all free objects in the cache, can be overwritten in a way that creates dangling pointers (e.g., making kmalloc() return a pointer to an arbitrary, attacker-chosen location).
// TODO. redraw a real figure

 ---> freelist that link all free chunks
 
        head ---+
                V
 +------+       +------+       +------+
 |      |<------|-     |       |      |
 +------+  ptr  +------+       +------+
 |              (ptr_addr)     ^
 +-----------------------------+ 

SLAB_FREELIST_HARDENED is proposed to prevent this direct modification of the freelist structure. The basic approach is to mangle (xor) the pointer with a random canary value (s->random) created at the initialization of the cache. One interesting decision is to also mix in ptr_addr. Its implication is subtle, but worth mentioning here. If s->random is leaked via another channel, an attacker could place an arbitrary value (i.e., the value xor-ed with the canary), making the aforementioned exploitation techniques possible again. The solution is to mangle the value once more with another secret, the randomized address of the chunk itself (ptr_addr); to bypass this protection, the attacker has to locate the overflowing chunk precisely. One remaining concern is that an attacker could reuse a leaked value with simple arithmetic: e.g., adding the size of the heap object, say 0x100, to the leaked data would likely lead to a controllable situation, such as two freed objects or one allocated object appearing in the freelist.

/*
 * Returns freelist pointer (ptr). With hardening, this is obfuscated
 * with an XOR of the address where the pointer is held and a per-cache
 * random number.
 */
void *freelist_ptr(const struct kmem_cache *s, void *ptr,
                   unsigned long ptr_addr) {
  return (void *)((unsigned long)ptr ^ s->random ^ ptr_addr);
}
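
The same transformation is applied when the pointer is read back, so the mangling is its own inverse. Below is a simplified sketch of the decode side, modeled after get_freepointer() in mm/slub.c (not verbatim):

static inline void *get_freepointer(struct kmem_cache *s, void *object)
{
  unsigned long ptr_addr = (unsigned long)object + s->offset;

  // xor-ing again with s->random and ptr_addr recovers the raw pointer
  return freelist_ptr(s, *(void **)ptr_addr, ptr_addr);
}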

Performance implication

TODO. set a target benchmarks in /bench

References

  1. Slab allocators in the Linux Kernel:SLAB, SLOB, SLUB

  2. How does the SLUB allocator work

  3. The Slab Allocator:An Object-Caching Kernel Memory Allocator

  4. mm: SLUB freelist randomization

  5. mm: Add SLUB free list pointer obfuscation

  6. CVE-2016-6187: Exploiting Linux kernel heap off-by-one

  7. Linux Kernel CAN SLUB Overflow

  8. Attacking the Core : Kernel Exploiting Notes

Kernel Address Space Layout Randomization (KASLR)

Kernel Address Space Layout Randomization (KASLR) is a feature that randomizes the location of the kernel itself in order to mitigate exploits that rely on predictable kernel addresses, such as return-oriented programming. The KASLR implementation for x86-64 randomizes three main memory regions: the physical mapping, vmalloc, and vmemmap.

//@arch/x86/mm/kaslr.c

/*
 * Memory regions randomized by KASLR (except modules that use a separate logic
 * earlier during boot). The list is ordered based on virtual addresses. This
 * order is kept after randomization.
 */
static __initdata struct kaslr_memory_region {
	unsigned long *base;
	unsigned long size_tb;
} kaslr_regions[] = {
	{ &page_offset_base, 0 },
	{ &vmalloc_base, 0 },
	{ &vmemmap_base, 0 },
};

/* Get size in bytes used by the memory region */
static inline unsigned long get_padding(struct kaslr_memory_region *region)
{
	return (region->size_tb << TB_SHIFT);
}

....

void __init kernel_randomize_memory(void)
{
....
....
kaslr_regions[0].size_tb = 1 << (MAX_PHYSMEM_BITS - TB_SHIFT);
	kaslr_regions[1].size_tb = VMALLOC_SIZE_TB;

	/*
	 * Update Physical memory mapping to available and
	 * add padding if needed (especially for memory hotplug support).
	 */
	BUG_ON(kaslr_regions[0].base != &page_offset_base);
	memory_tb = DIV_ROUND_UP(max_pfn << PAGE_SHIFT, 1UL << TB_SHIFT) +
		CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING;

	/* Adapt phyiscal memory region size based on available memory */
	if (memory_tb < kaslr_regions[0].size_tb)
		kaslr_regions[0].size_tb = memory_tb;

	/*
	 * Calculate the vmemmap region size in TBs, aligned to a TB
	 * boundary.
	 */
	vmemmap_size = (kaslr_regions[0].size_tb << (TB_SHIFT - PAGE_SHIFT)) *
			sizeof(struct page);
	kaslr_regions[2].size_tb = DIV_ROUND_UP(vmemmap_size, 1UL << TB_SHIFT);

The code above calculates the size of each memory region in terabytes, for the physical mapping, vmalloc, and vmemmap. These sizes are used to calculate remain_entropy below.

	/* Calculate entropy available between regions */
	remain_entropy = vaddr_end - vaddr_start;
	for (i = 0; i < ARRAY_SIZE(kaslr_regions); i++)
		remain_entropy -= get_padding(&kaslr_regions[i]);

	prandom_seed_state(&rand_state, kaslr_get_random_long("Memory"));

	for (i = 0; i < ARRAY_SIZE(kaslr_regions); i++) {
		unsigned long entropy;

		/*
		 * Select a random virtual address using the extra entropy
		 * available.
		 */
		entropy = remain_entropy / (ARRAY_SIZE(kaslr_regions) - i);
		prandom_bytes_state(&rand_state, &rand, sizeof(rand));
		entropy = (rand % (entropy + 1)) & PUD_MASK;
		vaddr += entropy;
		*kaslr_regions[i].base = vaddr;

		/*
		 * Jump the region and add a minimum padding based on
		 * randomization alignment.
		 */
		vaddr += get_padding(&kaslr_regions[i]);
		vaddr = round_up(vaddr + 1, PUD_SIZE);
		remain_entropy -= entropy;
	}
}

In the last part of kernel_randomize_memory(), remain_entropy is initialized to the available virtual address space (the total space minus the regions' sizes). The actual randomization is done inside the for loop: entropy is 'distributed' across the regions and applied to each region's base. Note that dividing remain_entropy by the number of remaining regions prevents any single region from monopolizing the entropy, and remain_entropy is updated on each iteration for the next region.
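
The distribution logic can be reproduced with a standalone user-space sketch; the program below is illustrative only (the region sizes, paddings, and start address are made up) and mirrors the loop above:

#include <stdio.h>
#include <stdlib.h>

#define PUD_SIZE   (1UL << 30)              /* 1 GB, as on x86-64 */
#define PUD_MASK   (~(PUD_SIZE - 1))
#define NR_REGIONS 3

static unsigned long rnd64(void)
{
  return ((unsigned long)rand() << 32) ^ (unsigned long)rand();
}

int main(void)
{
  const char *name[NR_REGIONS] = { "physmap", "vmalloc", "vmemmap" };
  unsigned long padding[NR_REGIONS] = { 16UL << 40, 32UL << 40, 1UL << 40 };
  unsigned long vaddr = 0xffff880000000000UL;
  unsigned long remain_entropy = 60UL << 40;   /* slack left after the paddings */
  int i;

  for (i = 0; i < NR_REGIONS; i++) {
    /* give this region an equal share of whatever entropy is left */
    unsigned long entropy = remain_entropy / (NR_REGIONS - i);

    entropy = (rnd64() % (entropy + 1)) & PUD_MASK;
    vaddr += entropy;
    printf("%-8s base = %#lx\n", name[i], vaddr);

    /* skip over the region itself and keep PUD alignment */
    vaddr += padding[i];
    vaddr = (vaddr + PUD_SIZE) & PUD_MASK;   /* == round_up(vaddr + 1, PUD_SIZE) */
    remain_entropy -= entropy;
  }
  return 0;
}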

References


Kconfig A A+ F U U+
REFCOUNT_FULL N Y N N N

A: Arch 5.1.8, A+: Arch Hardened 5.1.11, F: Fedora 5.1.8, U: Ubuntu 5.0.0, U+: Ubuntu LTS 4.15.0


Mitigating Integer Overflows

Preventing refcount overflows

A reference counter (i.e., refcount) can be overflowed if incorrectly managed, resulting in dangling pointers. Such a dangling pointer is relatively easy to exploit by turning it into a use-after-free bug (e.g., placing a fake object in the freed slot via msgsnd(), see CVE-2016-0728).

In CVE-2016-0728, a keyring is not correctly freed when joining a session keyring, and the function table containing revoke() in the dangled object can be hijacked, resulting in a privilege escalation.
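
The bug class behind such CVEs is an unbalanced reference count. The sketch below is contrived (it is not the actual CVE-2016-0728 code): an attacker-reachable error path that forgets to drop a reference lets the counter be pumped until it wraps.

struct obj {
  atomic_t refs;
  // ... payload ...
};

int buggy_lookup(struct obj *o, int flag)
{
  atomic_inc(&o->refs);        // take a reference
  if (flag)                    // attacker-reachable early return
    return -EINVAL;            // BUG: missing atomic_dec()
  // ... use o ...
  atomic_dec(&o->refs);
  return 0;
}

// Calling buggy_lookup(o, 1) 2^32 times wraps o->refs back to zero; a later
// put then frees the object while references still exist (use-after-free).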

If REFCOUNT_FULL is enabled, all refcount_inc() calls are replaced with the call below. It checks two conditions: 1) if the counter is saturated, it stays pinned at the top (i.e., UINT_MAX) and the object continues to be used (i.e., it leaks), and 2) if the counter has already dropped to zero (the object is freed), the object is not used. Similarly, refcount_sub_and_test_checked() checks for an underflow condition.

// @lib/refcount.c
bool refcount_inc_not_zero_checked(refcount_t *r) {
  unsigned int new, val = atomic_read(&r->refs);
  do {
    new = val + 1;
    if (!val)		// NB. refcount is already freed
      return false;
    if (unlikely(!new)) // NB. refcount is overflowed
      return true;
  } while (!atomic_try_cmpxchg_relaxed(&r->refs, &val, new));
  return true;
}

Optimization. PAX_REFCOUNT and [2] propose a potential optimization that trades half of the refcount range for speed, using the sign bit to indicate the overflowed condition. However, the current implementation just uses cmpxchg() with an explicit overflow check and uses the refcount range up to UINT_MAX.

lock incl -0xc(%rbp)
js overflowed ; NB. unlikely to be taken

overflowed:
lea -0xc(%rbp),%rcx ; NB. restored to an old refcount
<UD0>

Performance implication

TODO.

References

  1. CVE-2016-0728: PoC exploit
  2. Implement fast refcount overflow protection

Tools to prevent integer overflows

Developers have traditionally detected integer overflows with checks like the following:

x + y < x //for addition
x - y > x //for subtraction
x != 0 && y > c/x //for multiplication

There are a few problems with the above techniques. For signed integers, they cannot guarantee complete checking because signed overflow is undefined behavior, which the compiler is free to optimize away.

Therefore, GCC 5 introduced builtin functions to check for integer overflows without relying on undefined behavior, and the kernel exposes them through helpers such as check_add_overflow(). For example, overflows in signed additions are detected as shown below.

// @include/linux/overflow.h
#define check_add_overflow(a, b, d)					\
	__builtin_choose_expr(is_signed_type(typeof(a)),		\
			__signed_add_overflow(a, b, d),			\
			__unsigned_add_overflow(a, b, d))


/* Checking for unsigned overflow is relatively easy without causing UB. */
#define __unsigned_add_overflow(a, b, d) ({	\
	typeof(a) __a = (a);			\
	typeof(b) __b = (b);			\
	typeof(d) __d = (d);			\
	(void) (&__a == &__b);			\
	(void) (&__a == __d);			\
	*__d = __a + __b;			\
	*__d < __a;				\
})


/*
 * Adding two signed integers can overflow only if they have the same
 * sign, and overflow has happened iff the result has the opposite
 * sign.
 */
#define __signed_add_overflow(a, b, d) ({	\
	typeof(a) __a = (a);			\
	typeof(b) __b = (b);			\
	typeof(d) __d = (d);			\
	(void) (&__a == &__b);			\
	(void) (&__a == __d);			\
	*__d = (u64)__a + (u64)__b;		\
	(((~(__a ^ __b)) & (*__d ^ __a))	\
		& type_min(typeof(__a))) != 0;	\
})
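
A usage sketch is shown below; the allocation path is hypothetical (not taken from the kernel), but it shows the calling convention: check_add_overflow() stores the wrapped sum in the third argument and evaluates to true if the addition overflowed.

static void *alloc_packet(size_t header_len, size_t payload_len)
{
  size_t bytes;

  if (check_add_overflow(header_len, payload_len, &bytes))
    return NULL;                    // reject instead of under-allocating

  return kmalloc(bytes, GFP_KERNEL);
}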

Performance implication

References

  1. compiler: use compiler to detect integer overflows


Kconfig A A+ F U U+
HARDENED_USERCOPY Y Y Y Y Y
SECURITY_DMESG_RESTRICT N Y N N N

A: Arch 5.1.8, A+: Arch Hardened 5.1.11, F: Fedora 5.1.8, U: Ubuntu 5.0.0, U+: Ubuntu LTS 4.15.0


Preventing information leaks

Hardening usercopy()

To prevent leaks from kernel objects, the slab interface provides a way to specify the region of each object that is allowed for usercopy. For example, to protect task_struct from unintended leakage (e.g., an out-of-bounds read beyond the usercopy region of the heap object), a tuple of useroffset and usersize should be provided via kmem_cache_create_usercopy().

void __init fork_init(void) {
  ...
  // NB. calculate the offset and size of the fpu states
  task_struct_whitelist(&useroffset, &usersize);
  task_struct_cachep = kmem_cache_create_usercopy("task_struct",
                         arch_task_struct_size, align,
                         SLAB_PANIC|SLAB_ACCOUNT,
                         useroffset, usersize, NULL);
}

In task_struct, the fpu state is whitelisted for usercopy (see below), as it is read and written via the ptrace() syscall.

// @x86_64
//   task_struct_cachep->useroffset = 2624 :&fxregs_state
//   task_struct_cachep->usersize   =  960 :fpu_kernel_xstate_size

//
// arch_ptrace
// -> copy_regset_from_user
//   -> xfpregs_get()
//      -> user_regset_copyout()
//         -> copy_to_user()
int xfpregs_get(struct task_struct *target, const struct user_regset *regset,
                unsigned int pos, unsigned int count,
                void *kbuf, void __user *ubuf)
{
  struct fpu *fpu = &target->thread.fpu;
  ...
  return user_regset_copyout(&pos, &count, &kbuf, &ubuf,
                             &fpu->state.fxsave, 0, -1);
}

When HARDENED_USERCOPY is enabled, copy_from/to_user() performs various sanity checks, including a check against the useroffset and usersize of the slab cache.

// copy_from/to_user()
//   -> check_object_size()
//     -> check_heap_object()
//       -> __check_heap_object()
void __check_heap_object(const void *ptr, unsigned long n, struct page *page,
       bool to_user)
{
  struct kmem_cache *s;
  unsigned int offset;
  size_t object_size;

  /* NB. Fetch kmem_cache to find the object size/redzone. */
  s = page->slab_cache;

  /* NB. Reject if ptr cannot possibly point into the page; since the
   * page is derived directly from ptr by the caller, this path is not
   * taken in the current implementation. */
  if (ptr < page_address(page))
    usercopy_abort("SLUB object not in SLUB page?!", NULL,
             to_user, 0, n);

  /* Find offset within object. */
  offset = (ptr - page_address(page)) % s->size;

  /* Adjust for redzone and reject if within the redzone. */
  if (kmem_cache_debug(s) && s->flags & SLAB_RED_ZONE) {
    if (offset < s->red_left_pad)
      usercopy_abort("SLUB object in left red zone",
               s->name, to_user, offset, n);
    offset -= s->red_left_pad;
  }

  /* NB. Allow address range falling entirely within usercopy region.
  
   useroffset +   +-- offset (from ptr)
              |   v
              v   +--n-->|
              [   [      ]   ]
              |<--usersize--->|
   */
  if (offset >= s->useroffset &&
      offset - s->useroffset <= s->usersize &&
      n <= s->useroffset - offset + s->usersize)
    return;

  /*
   * If the copy is still within the allocated object, produce
   * a warning instead of rejecting the copy. This is intended
   * to be a temporary method to find any missing usercopy
   * whitelists.
   */
  object_size = slab_ksize(s);
  if (usercopy_fallback &&
      offset <= object_size && n <= object_size - offset) {
    usercopy_warn("SLUB object", s->name, to_user, offset, n);
    return;
  }

  usercopy_abort("SLUB object", s->name, to_user, offset, n);
}

There are a few other similar mitigation schemes to avoid such mistakes when performing copy_to/from_user(). For example, if an object is stack allocated, the kernel checks that the object properly lies within the stack, and also within a single stack frame if the architecture provides a simple way to walk the stack frames (e.g., a frame pointer).

// __check_object_size
//  -> check_stack_object
//    -> arch_within_stack_frames
static inline
int arch_within_stack_frames(const void * const stack,
                             const void * const stackend,
                             const void *obj, unsigned long len)
{
  const void *frame = NULL;
  const void *oldframe;

  // NB. return address of the caller
  oldframe = __builtin_frame_address(1);
  if (oldframe)
    // NB. return address of the caller's caller
    frame = __builtin_frame_address(2);

  /*
   * low ----------------------------------------------> high
   * [saved bp][saved ip][args][local vars][saved bp][saved ip]
   *                     ^----------------^
   *               allow copies only within here
   */
  while (stack <= frame && frame < stackend) {
    /*
     * If obj + len extends past the last frame, this
     * check won't pass and the next frame will be 0,
     * causing us to bail out and correctly report
     * the copy as invalid.
     */
    if (obj + len <= frame)
      // NB. 2 * sizeof(void*): frame pointer + return address
      return obj >= oldframe + 2 * sizeof(void *) ?
        GOOD_FRAME : BAD_STACK;
    oldframe = frame;
    frame = *(const void * const *)frame;
  }
  return BAD_STACK;
}

Restricting kernel pointers

The dmesg command prints debug messages from the kernel log buffer. However, the buffer sometimes contains sensitive information, such as register values, that should not be exposed to users; this makes it much easier for an attacker to build an exploit, as in CVE-2018-17182. DMESG_RESTRICT prevents unprivileged users from viewing those messages with dmesg: when it is enabled, only users with system administration privileges (CAP_SYSLOG) are allowed to see them; when it is not enabled, every user can see them.

KPTR_RESTRICT works similarly to DMESG_RESTRICT. Kernel pointers are another kind of sensitive information that malicious users may take advantage of. When the kptr_restrict sysctl is set to 1, the %pK format specifier hides kernel pointers from unprivileged users by printing zeros; when set to 0, %pK works the same as %p, i.e., there is no restriction on printing pointers; when set to 2, %pK hides pointers regardless of privileges.
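
A condensed sketch of how %pK is resolved is shown below, modeled after restricted_pointer() in lib/vsprintf.c; it is simplified, as the real code also handles interrupt context and credential changes:

switch (kptr_restrict) {
case 0:                        // no restriction: behave like %p
  break;
case 1:                        // hide unless the reader has CAP_SYSLOG
  if (!has_capability_noaudit(current, CAP_SYSLOG))
    ptr = NULL;                // printed as zeros
  break;
case 2:                        // always hide, regardless of privileges
default:
  ptr = NULL;
  break;
}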

References


Kconfig A A+ F U U+
PAGE_TABLE_ISOLATION Y Y Y Y Y
RETPOLINE Y Y Y Y Y

A: Arch 5.1.8, A+: Arch Hardened 5.1.11, F: Fedora 5.1.8, U: Ubuntu 5.0.0, U+: Ubuntu LTS 4.15.0


Mitigating Microarchitecture Side-channel

Mitigating KASLR and Meltdown

Mapping both user space and kernel space in one page table brings a lot of advantages to the Linux kernel, but it also has a negative impact on security. Many researchers have exploited it via a variety of side channels: it allows an attacker to bypass KASLR and, even worse, to read kernel memory from user space, which is known as Meltdown. So the Linux kernel ended up isolating the page table based on the execution mode, either user or kernel, which is dubbed page table isolation, or PTI. The implementation of PTI is straightforward: allocate two page tables for a process, one for user mode and one for kernel mode; kernel mode can see the entire memory space, while user mode cannot see kernel space. This effectively closes both the KASLR bypasses and Meltdown.

Allocating two page tables in a contiguous area.

// arch/x86/include/asm/pgalloc.h
// #ifdef CONFIG_PAGE_TABLE_ISOLATION
// #define PGD_ALLOCATION_ORDER 1
// #else
// #define PGD_ALLOCATION_ORDER 0
// #endif
....
....
// arch/x86/mm/pgtable.c
static inline pgd_t *_pgd_alloc(void)
{
  return (pgd_t *)__get_free_pages(PGALLOC_GFP, PGD_ALLOCATION_ORDER);
  // NB. CONFIG_PAGE_TABLE_ISOLATION --> 8kb, two page tables.
  // !CONFIG_PAGE_TABLE_ISOLATION --> 4kb, one page table.
}

Switching the page table at a syscall entry, either user-to-kernel or kernel-to-user.

ENTRY(entry_SYSCALL_64)
    UNWIND_HINT_EMPTY

    swapgs
    // NB. percpu-access before SWITCH_TO_KERNEL_CR3
    // the percpu-area should be mapped in user page table.
    movq    %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
    // NB. switching the page table, user-to-kernel.
    SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
    movq    PER_CPU_VAR(cpu_current_top_of_stack), %rsp

As shown in the code above, the kernel entry code accesses per-cpu areas before switching the page table, so those per-cpu areas need to be mapped in the user-mode page table as well.

static void __init setup_cpu_entry_area(unsigned int cpu)
{
....
....
// NB. create a mapping for per-cpu areas
 cea_set_pte(&cea->gdt, get_cpu_gdt_paddr(cpu), gdt_prot);

    cea_map_percpu_pages(&cea->entry_stack_page,
                 per_cpu_ptr(&entry_stack_storage, cpu), 1,
                 PAGE_KERNEL);
....
 cea_map_percpu_pages(&cea->tss, &per_cpu(cpu_tss_rw, cpu),
                 sizeof(struct tss_struct) / PAGE_SIZE, tss_prot);
}
....
....
void cea_set_pte(void *cea_vaddr, phys_addr_t pa, pgprot_t flags)
{
    unsigned long va = (unsigned long) cea_vaddr;
    pte_t pte = pfn_pte(pa >> PAGE_SHIFT, flags);

    // NB. _PAGE_GLOBAL indicates a mapping for all page tables
    // including both user and kernel.
    if (boot_cpu_has(X86_FEATURE_PGE) &&
        (pgprot_val(flags) & _PAGE_PRESENT))
        pte = pte_set_flags(pte, _PAGE_GLOBAL);

Lastly, PTI leverages PCID (Process Context IDentifier) to avoid flushing the TLB when switching between user space and kernel space; it works by assigning a different PCID to each execution mode. With this hardware support from Intel, PTI has a negligible performance impact.

.macro ADJUST_KERNEL_CR3 reg:req
    ALTERNATIVE "", "SET_NOFLUSH_BIT \reg", X86_FEATURE_PCID
    // NB. CR3 register contains an address of page table.
    // Since the lowest 12 bits are unused, (page-aligned)
    // they are used to represent a PCID.
    andq    $(~PTI_USER_PGTABLE_AND_PCID_MASK), \reg
.endm

.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
    ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
    mov %cr3, \scratch_reg
    ADJUST_KERNEL_CR3 \scratch_reg
    mov \scratch_reg, %cr3
.Lend_\@:
.endm

References

  1. KASLR is Dead: Long Live KASLR
  2. Meltdown: Reading Kernel Memory from User Space
  3. Side Channel Attacks on Linux Kernel
  4. Deep dive into Page Table Isolation
  5. Deep dive into Defense against BTB attacks

Mitigating Spectre

Modern CPUs have a branch predictor to optimize their performance. It works by referencing the Branch Target Buffer (BTB), a storage of key (PC) and value (target PC) pairs. But its limited size causes BTB collisions, which lead to a new side-channel attack: the root cause is that the BTB stores only some of the bits of the PC, not the full PC. Using this primitive, an attacker is able to inject an indirect branch target into the BTB and consequently run code of their choosing in a speculative context, which can leak sensitive data across boundaries (e.g., between VMs or processes). This attack is called Spectre Variant 2, and retpoline has been introduced to stop it. The concept of retpoline is straightforward, but the implementation is a little tricky: retpoline aims to eliminate all speculative behavior that can be controlled by an attacker.

Indirect jump replacement with Retpoline. Let's take a look at how an indirect branch is implemented without speculation.

// Before Retpoline
jmp *%rax

// After Retpoline
(1) call load_label
    capture_ret_spec:
(2) pause; LFENCE
(3) jmp capture_ret_spec
    load_label:
(4) mov %rax, (%rsp)
(5) ret

// NB. Let's follow a Retpoline gadget.
// two executions proceed in parallel:
// o: original execution
// s: speculative execution
// (1-o) Direct branch to (4).
//     Push (2) onto the stack as a return.
// (4-o) Overwrite the return with the real target.
// (5-o) Load the real target from the stack memory.
// (5-s) If speculating here by RSB (Return Stack Buffer),
//     consumes RSB entry created in (1-o),
//     jumps to (2)
// (2-s) relaxing cpu for the spin-wait loop.
// (3-s) jumps to (2) again.
//     it forces the speculative execution
//     to stay inside (2)-(3).
// (5-o) Jump to the real target.
// --> There are no speculation!

The Linux kernel applies retpoline through alternative sections, which means an admin can enable or disable retpoline at boot time via the kernel command line.

// NB. A retpoline gadget replacing an indirect branch.
.macro RETPOLINE_JMP reg:req
 call    .Ldo_rop_\@
.Lspec_trap_\@:
 pause
 lfence
 jmp .Lspec_trap_\@
.Ldo_rop_\@:
 mov \reg, (%_ASM_SP)
 ret
.endm

.macro JMP_NOSPEC reg:req
#ifdef CONFIG_RETPOLINE
// NB. register an indirect branch as an alternative insn.
    ANNOTATE_NOSPEC_ALTERNATIVE
    ALTERNATIVE_2 __stringify(ANNOTATE_RETPOLINE_SAFE; jmp *\reg),  \
        __stringify(RETPOLINE_JMP \reg), X86_FEATURE_RETPOLINE, \
        __stringify(lfence; ANNOTATE_RETPOLINE_SAFE; jmp *\reg), X86_FEATURE_RETPOLINE_AMD
#else
    jmp *\reg
#endif
.endm

void __init alternative_instructions(void)
{
  ...
  // NB. runtime patching to apply retpoline gadgets.
  // (__alt_instructions ~ __alt_instructions_end) includes
  // a lot of sites for indirect branches.
  apply_alternatives(__alt_instructions, __alt_instructions_end);
  ...
}

MDS (Microarchitectural Data Sampling).

When performing stores, loads, and L1 refills, processors write data into a variety of temporary microarchitectural buffers, such as the load buffer, store buffer, and line fill buffer. The data in these buffers can be forwarded to load operations as an optimization. Unfortunately, this kind of forwarding can cross privilege boundaries, meaning kernel data can be forwarded to a load operation inside user space. If an attacker can feed that data into a leak gadget in user space, it eventually leaks kernel memory. The mitigation against this attack is very straightforward: clear the CPU buffers when returning to user space.

static inline void mds_clear_cpu_buffers(void)
{
    static const u16 ds = __KERNEL_DS;
    ....
    asm volatile("verw %[ds]" : : [ds] "m" (ds) : "cc");
}

static inline void mds_user_clear_cpu_buffers(void)
{
    if (static_branch_likely(&mds_user_clear))
        mds_clear_cpu_buffers();
}

__visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
{
    ....
    // NB. When returning from kernel to user,
    // it clears cpu buffers that contain in-flight data.
    mds_user_clear_cpu_buffers();
}

L1TF - L1 Terminal Fault. L1TF is a hardware vulnerability which allows unprivileged speculative access to data in the level 1 data cache. The root cause is that the physical address in a PTE (page table entry) can be speculatively accessed during the page table walk even though the PTE is not valid.

The Linux kernel applies PTE inversion in the code paths that maintain page tables. It modifies non-present PTEs so that the physical address they contain always points to invalid physical memory. This defense is applied unconditionally.

// PTE inversion

static inline bool __pte_needs_invert(u64 val)
{
    // NB. PTE is non-zero, but not present (invalid).
    return val && !(val & _PAGE_PRESENT);
}

static inline u64 protnone_mask(u64 val)
{
    // NB. If PTE inversion needed,
    // return the mask for PTE to point to invalid memory.
    return __pte_needs_invert(val) ?  ~0ull : 0;
}

static inline unsigned long pte_pfn(pte_t pte)
{
    phys_addr_t pfn = pte_val(pte);
    // NB. Masking PFN (physical address)
    // after masking, it's pointing to invalid memory.
    pfn ^= protnone_mask(pfn);
    return (pfn & PTE_PFN_MASK) >> PAGE_SHIFT;
}

References

  1. More details about mitigations for the CPU Speculative Execution issue
  2. Retpoline: a software construct for preventing branch-target-injection
  3. Spectre Returns! Speculation Attacks using the Return Stack Buffer
  4. MDS - Microarchitectural Data Sampling
  5. L1TF - L1 Terminal Fault
  6. Meltdown strikes back: the L1 terminal fault vulnerability


Kconfig A A+ F U U+
BPF Y Y N Y Y
BPF_SYSCALL Y Y Y Y Y
BPF_JIT Y Y Y Y Y
BPF_JIT_ALWAYS_ON Y Y Y Y Y

A: Arch 5.1.8, A+: Arch Hardened 5.1.11, F: Fedora 5.1.8, U: Ubuntu 5.0.0, U+: Ubuntu LTS 4.15.0


Hardening Hostile Code in eBPF

Constant blinding in JIT

The JIT-ed memory is a common target for an attacker to place an arbitrary gadget (e.g., a syscall instruction in user space). One popular technique is to encode a desirable sequence of instructions as part of immediate values, as x86-like CISC architectures can encode long byte sequences inside one instruction. Constant blinding is a technique that breaks such immediate values apart, avoiding the use of attacker-chosen constants in the executable region. It's worth noting that there are numerous other techniques (e.g., controlling the constant offset of direct branches), but most well-known user-space attacks are not very effective in the kernel, as BPF provides only a small region with a limited set of instructions. The implementation of constant blinding is straightforward: xor the chosen immediate value with a random constant at JIT time and, right before using it, xor the mangled value with the same random constant again.

int bpf_jit_blind_insn(const struct bpf_insn *from,
                       const struct bpf_insn *aux,
                       struct bpf_insn *to_buff)
{
  struct bpf_insn *to = to_buff;  // output cursor used by *to++ below
  u32 imm_rnd = get_random_int();

  switch (from->code) {
  case BPF_ALU | BPF_ADD | BPF_K:
  case BPF_ALU | BPF_SUB | BPF_K:
  case BPF_ALU | BPF_AND | BPF_K:
  case BPF_ALU | BPF_OR  | BPF_K:
  case BPF_ALU | BPF_XOR | BPF_K:
  case BPF_ALU | BPF_MUL | BPF_K:
  case BPF_ALU | BPF_MOV | BPF_K:
  case BPF_ALU | BPF_DIV | BPF_K:
  case BPF_ALU | BPF_MOD | BPF_K:
    // NB. no more attack controllable instructions inserted
    // in the jitted, executable space (e.g., jump in the middle
    // of the immediate value)
    //
    //    MOV _, 0xdeadbeef
    // => MOV AX, [imm_rnd ^ 0xdeadbeef]
    //    XOR AX, imm_rnd
    //    MOV _, AX
    *to++ = BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm);
    *to++ = BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
    *to++ = BPF_ALU32_REG(from->code, from->dst_reg, BPF_REG_AX);
    break;
  ...
  }
  ...
}

Preventing Spectre

For non-privileged BPF programs, the JIT engine applies mitigation schemes against microarchitectural side-channel attacks, such as Spectre.

Variant 1 (Bounds Check Bypass). To prevent speculative execution from performing an out-of-bounds array access, the kernel restricts the use of indices into arrays that are accessible by an unprivileged user. Such a check is placed in both the BPF interpreter and the JIT-ed code.

// NB. in JIT-ed code:
//   array[index] -> array[index & mask]
u32 array_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf) {
 ...
  if (map->unpriv_array) {
    *insn++ = BPF_JMP_IMM(BPF_JGE, ret, map->max_entries, 4);
    *insn++ = BPF_ALU32_IMM(BPF_AND, ret, array->index_mask);
  } else {
    *insn++ = BPF_JMP_IMM(BPF_JGE, ret, map->max_entries, 3);
  }
...
}

// NB. in a function called from an eBPF program
static void *percpu_array_map_lookup_elem(struct bpf_map *map, void *key)
{
  struct bpf_array *array = container_of(map, struct bpf_array, map);
  u32 index = *(u32 *)key;

  if (unlikely(index >= array->map.max_entries))
    return NULL;

  // NB. even after a speculator reaches here, it won't access
  // beyond the region of array->pptrs
  return this_cpu_ptr(array->pptrs[index & array->index_mask]);
}

Recently, a more sophisticated mitigation that thwarts generic V1 gadgets was introduced: the verifier simulates the behavior of speculative execution and detects potential out-of-bounds memory accesses. Please refer to [2] for an in-depth explanation.

Variant 2 (Branch Target Injection). For indirect jumps introduced during jitting, BPF applies the retpoline mitigation, just like the rest of the kernel. For example, when a BPF_JMP instruction is a tail call, it invokes another BPF program, which is commonly implemented with an indirect jump (jumping right after the prologue of the target program). RETPOLINE_RAX_BPF_JIT() is introduced to produce a retpoline gadget that replaces the indirect jump through rax.

// do_jit() {
//   ...
//   case BPF_JMP | BPF_TAIL_CALL:
//     emit_bpf_tail_call(&prog);
//     break;
// }
void emit_bpf_tail_call(u8 **pprog) {
   ...
  /*
   * Wow we're ready to jump into next BPF program
   * rdi == ctx (1st arg)
   * rax == prog->bpf_func + prologue_size
   */
  RETPOLINE_RAX_BPF_JIT();
  ..
} 
  
#  define RETPOLINE_RAX_BPF_JIT()                       \
do {                                                    \
  EMIT1_off32(0xE8, 7);    /* callq do_rop */           \
  /* spec_trap: */                                      \
  EMIT2(0xF3, 0x90);       /* pause */                  \
  EMIT3(0x0F, 0xAE, 0xE8); /* lfence */                 \
  EMIT2(0xEB, 0xF9);       /* jmp spec_trap */          \
  /* do_rop: */                                         \
  EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */  \
  EMIT1(0xC3);             /* retq */                   \
} while (0)

Variant 4 (Speculative Store Bypass). To prevent speculative store bypass from enabling an arbitrary kernel memory read, the BPF verifier detects the malicious patterns that could trigger such speculation at program load time and sanitizes them.

// NB: Safe execution flow by sanitizing a pattern
// Detect a case of reusing stack slot, and sanitize it.
// (1) r8 = *(u64 *)(r7 +0)   // slow read
// (2) *(u64 *)(r10 -72) = 0  // instruction for sanitizing
//     - this store becomes fast due to no dependency on (1)
// (3) *(u64 *)(r8 +0) = r3   // this store becomes slow due to r8
// ---- at this time, (2) is likely to be completed before (3),
// ---- so it can perfectly eliminate an arbitrary unsafe address.
// (4) r1 = *(u64 *)(r6 +0)   // loads from either sanitized or safe address
// (5) r2 = *(u8 *)(r1 +0)    // no leak happens

struct bpf_insn_aux_data {
  ....
  int sanitize_stack_off; /* stack slot to be cleared */
  ....
}

static int check_stack_write(struct bpf_verifier_env *env, ....
{
  ....
  for (i = 0; i < BPF_REG_SIZE; i++) {
    if (state->stack[spi].slot_type[i] == STACK_MISC &&
        !env->allow_ptr_leaks) {
        int *poff = &env->insn_aux_data[insn_idx].sanitize_stack_off;
        int soff = (-spi - 1) * BPF_REG_SIZE;
        ....
        ....
        // NB: examine a store instruction writing to a stack slot.
        //     record this offset for detecting reused stack slot. 
        *poff = soff;
    }
    state->stack[spi].slot_type[i] = STACK_SPILL;
  }
  ....
}

static int convert_ctx_accesses(struct bpf_verifier_env *env)
{
  ....
  // NB: Is it a reused stack slot?
  if (type == BPF_WRITE &&
    env->insn_aux_data[i + delta].sanitize_stack_off) {
    struct bpf_insn patch[] = {
      ....
      // NB: Sanitize it with 0.
      BPF_ST_MEM(BPF_DW, BPF_REG_FP,
        env->insn_aux_data[i + delta].sanitize_stack_off,
        0),
      ....
    };
  }
}

References

Preventing Code Reuse Attack

System programming languages such as C and C++ give programmers the freedom to optimize and control their resources, but requiring manual memory management and adherence to typing rules leads to security vulnerabilities.
Memory corruption is routinely exploited by attackers, and the following defenses have been introduced to mitigate such attacks:

  • Address Space Layout Randomization (ASLR)
  • Stack canaries
  • Data Execution Prevention (DEP)

They protect against code injection attacks, but they do not fully prevent code reuse attacks, e.g., ROP.

Return Oriented Programming (ROP)

In a ROP attack, the attacker does not inject new code; instead, the malicious computation is performed by chaining together existing sequences of instructions (called gadgets).

        Stack                     Instructions
 +-----------------+
 |                 | ----------->  pop eax
 |-----------------|        /----  ret
 |       v1        |       /
 |-----------------| <----/
 |                 | ----------->  pop ebx
 |-----------------|        /----  ret
 |       v2        |       /
 |-----------------| <----/
 |                 | ----------->  add eax, ebx
 |                 |       /-----  ret
 |-----------------| <----/
 |                 | ----------->  pop ecx
 |-----------------|        /----  ret
 |       v3        |       /
 |-----------------| <----/
 |                 | ----------->  mov [ecx], eax
 +-----------------+               ret

 -> Result: mem[v3] = v1 + v2

The attacker finds gadgets within the original program text and causes them to be executed in sequence to perform a task other than what was intended. Common objectives of such malicious payloads include arbitrary code execution, privilege escalation, and exfiltration of sensitive information.
Many ROP attacks use unintended instruction sequences. CFI mitigates such attacks by guaranteeing that the program stays on its intended execution flow.

Control Flow Integrity (CFI)

CFI restricts the set of possible control-flow transfers to those that are strictly required for correct program execution. This prevents code-reuse techniques such as ROP from working, because they would cause the program to execute control-flow transfers which are illegal under CFI.
Most CFI mechanisms follow a two-phase process:

  1. An analysis phase constructs the Control-Flow Graph (CFG) which approximates the set of legitimate control-flow transfers
  2. The CFG is used at runtime by an enforcement component to ensure that all executed branches correspond to edges in the CFG
      <Control Flow Graph>

           func1()
           /    \
          /      \
         v        v             - function call 2 from 1 is allowed
      func2()   func3()         - function call 4 from 3 is forbidden
         |
         |
         V
      func4()

However, it is hard to construct a fine-grained CFG because indirect branch targets cannot be fully determined by static analysis, so most CFGs involve approximation. In the case of RAP, it implements a type-based approximation of the CFG.

PaX Reuse Attack Protector (RAP)

RAP is a defense mechanism against code reuse attacks. It is a CFI technology developed by PaX. RAP is included in the grsecurity patch for Linux kernel security, but only the commercial version provides its full functionality.
RAP is implemented as a GCC compiler plugin, and it consists of two components:

  1. A deterministic defense limiting function call and return location
  2. A probabilistic defense to help ensure that a function can return to the location where the function was called

Indirect Control Transfer Protection

RAP implements CFI based on a type-based indirect control flow graph (ICFG). It is based on the idea that categorizing ICFG vertices by type lets the ICFG approximation emerge automatically, which means the analysis can be conducted at the function level without knowledge of the entire program.
It categorizes functions by type: return type, function name, and function parameters. The type information extracted from each function and function pointer is used to verify the match between a function and a function-pointer dereference (indirect call, function return, etc.). Type matching uses a hash value calculated from the appropriate type parts of each function.
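
The following illustrative example (not RAP output) shows the intuition: an indirect call through a function pointer of a given type may only reach functions whose type hash matches that pointer's type.

typedef int (*handler_t)(const char *buf, unsigned long len);

static int read_handler(const char *buf, unsigned long len)  { return 0; }
static int write_handler(const char *buf, unsigned long len) { return 1; }
static void helper(void) { }

static int dispatch(handler_t h, const char *buf, unsigned long len)
{
  // RAP emits a check here comparing the hash stored next to the call
  // target against the hash of handler_t: read_handler/write_handler
  // match, while redirecting h to helper() (a different type) would trap.
  return h(buf, len);
}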

A different set of type parts is used for the type hash depending on the kind of function.

Usable parts in type hash                 Return  Name  'this'  Parameters
non-class or static member function/ptr   Y       N     N/A     Y
non-virtual method/ptr                    Y       N     N       Y
virtual method/ptr                        N       N     N       Y
ancestor method/virtual method call       Y       Y     Y       Y

Table: (RAP: RIP ROP) Type Hash Parts

Plugin code for function pointer protection:

// @rap_plugin/rap_fptr_pass.c
static unsigned int rap_fptr_execute(void)
{
  ...
  // ... through a function pointer
  fntype = TREE_TYPE(fntype);
  gcc_assert(TREE_CODE(fntype) == FUNCTION_TYPE || TREE_CODE(fntype) ==
   METHOD_TYPE);

  if (enable_type_call) {
    rap_instrument_fptr(&gsi);
    bb = gsi_bb(gsi);
    gcc_assert(call_stmt == gsi_stmt(gsi));
  }

  if (enable_type_ret) {
    hash = rap_hash_function_type(fntype, imprecise_rap_hash_flags);
    computed_hash = build_int_cst_type(rap_hash_type_node, -hash.hash);
    rap_mark_retloc(&gsi, computed_hash);
  }
}

...

// check the function hash of the target of the fptr
static void rap_instrument_fptr(gimple_stmt_iterator *gsi)
{
  ...
  if (TREE_CODE(fntype) == FUNCTION_TYPE) {
    computed_hash = build_rap_hash(call_stmt, fntype);
  } else {
    debug_tree(fntype);
    gcc_unreachable();
  }
  ...
  target_hash = get_rap_hash(&stmts, loc, fptr, -rap_hash_offset);
  gsi_insert_seq_before(gsi, stmts, GSI_SAME_STMT);

  // compare target_hash against computed function hash
  // bail out on mismatch
  check_hash = gimple_build_cond(NE_EXPR, target_hash, computed_hash, NULL_TREE,
    NULL_TREE);
  gimple_set_location(check_hash, loc);
  gsi_insert_before(gsi, check_hash, GSI_NEW_STMT);
  ...

Plugin code for return location protection:

// @rap_plugin/rap_ret_pass.c
/*
 * insert the equivalent of
 * if (*(long *)((void *)retaddr+N) != (long)-function_hash) abort();
 */
static void check_retaddr(gimple_stmt_iterator *gsi, tree new_retaddr)
{
  ...
#ifdef TARGET_386
	if (TARGET_64BIT)
		target_hash = get_rap_hash(&stmts, loc, new_retaddr, -16);
	else
		target_hash = get_rap_hash(&stmts, loc, new_retaddr, -10);
#else
  ...
  hash = rap_hash_function_type(TREE_TYPE(current_function_decl),
    imprecise_rap_hash_flags);
  computed_hash = build_int_cst_type(rap_hash_type_node, -hash.hash);

  stmt = gimple_build_cond(NE_EXPR, target_hash, computed_hash, NULL_TREE,
    NULL_TREE);
  gimple_set_location(stmt, loc);
  gsi_insert_after(gsi, stmt, GSI_CONTINUE_LINKING);
  ...

Return Address Protection

Return Address Protection is another defense mechanism of RAP. It is conceptually based on the XOR canary approach: RAP encrypts the return address with a key which is stored in a reserved register (r12 on amd64). This key is highly resistant to leaking as it should never be stored or spilled into memory. In addition, grsecurity states that the RAP cookie (the encryption key) is not static but changes per task, per system call, and per iteration in some infinite loops.

The following is an example of RAP return address protection; its full implementation is not revealed in the public PaX test patch.

// RAP example
push %rbx
mov 8(%rsp),%rbx
xor %r12,%rbx
...
xor %r12,%rbx
cmp %rbx,8(%rsp)
jnz .error
pop %rbx
retn
.error:
ud2

References

ARM's Memory Tagging Extensions (MTE)

XXX. write here

References


Kconfig A A+ F U U+
GCC_PLUGIN_STACKLEAK N Y N N N
GCC_PLUGIN_STRUCTLEAK_USER N N N N N
GCC_PLUGIN_STRUCTLEAK_BYREF N N N N N
GCC_PLUGIN_STRUCTLEAK_BYREF_ALL N Y N N N
INIT_STACK_ALL N N N N N
GCC_PLUGIN_RANDSTRUCT N N N N N
GCC_PLUGIN_RANDSTRUCT_PERFORMANCE N N N N N

A: Arch 5.1.8, A+: Arch Hardened 5.1.11, F: Fedora 5.1.8, U: Ubuntu 5.0.0, U+: Ubuntu LTS 4.15.0


Compiler-based Security Checkers

Preventing information leaks

Stackleak Plugin.

There are three kinds of vulnerabilities that the STACKLEAK kernel security feature defends against: 1) information disclosure from leaving data on the stack that can be exfiltrated to user space, 2) targeting uninitialized variables on the kernel stack to obtain the stack location, and 3) stack clash.

The first two vulnerabilities are closely related, and attacks can chain them, e.g., harvesting leftover values on the stack and then targeting the disclosed addresses. Both are blocked by a feature called stack poisoning, which overwrites the used portion of the stack with STACKLEAK_POISON (-0xBEEF) at the end of each syscall, before returning to the caller. Below is the implementation of stack poisoning: the code first scans for the already-poisoned region to find the lowest boundary of the currently used stack, then fills everything from that boundary up to the current stack pointer with STACKLEAK_POISON.

asmlinkage void stackleak_erase(void)
{
    unsigned long kstack_ptr = current->lowest_stack;
    unsigned long boundary = (unsigned long)end_of_stack(current);
    unsigned int poison_count = 0;
    const unsigned int depth = STACKLEAK_SEARCH_DEPTH / sizeof(unsigned long);

    if (skip_erasing())
        return;

    while (kstack_ptr > boundary && poison_count <= depth) { 
        if (*(unsigned long *)kstack_ptr == STACKLEAK_POISON)
            poison_count++;
        else
            poison_count = 0;

        kstack_ptr -= sizeof(unsigned long);
    }

    if (kstack_ptr == boundary)
        kstack_ptr += sizeof(unsigned long);

    ...
    if (on_thread_stack())
        boundary = current_stack_pointer;
    else
        boundary = current_top_of_stack();

    while (kstack_ptr < boundary) {
        *(unsigned long *)kstack_ptr = STACKLEAK_POISON;
        kstack_ptr += sizeof(unsigned long);
    }

    current->lowest_stack = current_top_of_stack() - THREAD_SIZE/64;
}

Stack poisoning keeps the stack from leaving stale data exposed, but it only helps against multi-syscall attacks; it cannot protect against attacks that complete within a single system call, since the poisoning is done at the end of the syscall. One more caveat is that the poison value itself can look like a valid kernel pointer: with user-space addresses occupying the low range (0x0000...) and kernel-space addresses the high range (0xffff...), -0xBEEF falls within the kernel address range.

The third vulnerability, stack clash, concerns collisions between memory regions: clashing the stack with another memory region, jumping over the stack guard page, and smashing the stack (overwriting the stack with another memory region). When the stack is full, it is automatically extended via a page fault. However, if the end of the current stack lands on an already-allocated page, the page fault does not happen, the kernel cannot notice that the stack's end has been reached, and a stack clash can occur.

Usually, a variable-length array or an alloca() call is used to consume the stack's allocated space, so the STACKLEAK plugin tried to prevent stack clash by instrumenting all alloca() calls with checks that end in panic() or BUG_ON(). Stack poisoning is now included in the mainline kernel, but the alloca() checking has been dropped, since all VLAs have been removed from the kernel instead.
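
For reference, the dropped check looked roughly like the sketch below, modeled after the out-of-tree STACKLEAK patches (simplified, not the exact code); the plugin inserted a call to it before every alloca()/VLA allocation:

void __used stackleak_check_alloca(unsigned long size)
{
  unsigned long sp = current_stack_pointer;
  unsigned long stack_left = sp - (unsigned long)end_of_stack(current);

  // refuse allocations that would step over the stack boundary
  if (size >= stack_left)
    panic("alloca() over the kernel stack boundary\n");
}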

Structleak Plugin.

Many structures in kernel code are left uninitialized; when copied to user space without initialization, they may carry interesting leftover kernel values. One example arose in CVE-2013-2141: according to the report, the do_tkill function in kernel/signal.c before kernel 3.8.9 did not initialize its siginfo structure. do_tkill is called by the tkill and tgkill system calls, which can be invoked by user-level processes, so kernel memory became visible when handling signals delivered from tkill.

The Structleak plugin addresses this issue by force-initializing such structures. The plugin is invoked after gcc finishes type parsing, and it currently has three modes: disabled, BYREF, and BYREF_ALL. With BYREF, the plugin initializes structures that may be passed by reference and have not been initialized; with BYREF_ALL, it initializes any stack variable passed by reference.

First, the PLUGIN_FINISH_TYPE callback is invoked after the type parsing of the code is finished. Its handler, finish_type(), sets TYPE_USERSPACE on the structure types of interest, i.e., those containing a field declared with the __user attribute.

static bool is_userspace_type(tree type)
{
	tree field;

	for (field = TYPE_FIELDS(type); field; field = TREE_CHAIN(field)) {
		tree fieldtype = get_field_type(field);
		enum tree_code code = TREE_CODE(fieldtype);

		if (code == RECORD_TYPE || code == UNION_TYPE)
			if (is_userspace_type(fieldtype))
				return true;

		if (lookup_attribute("user", DECL_ATTRIBUTES(field)))
			return true;
	}
	return false;
}

After the types of interest are marked, structleak_execute() runs. It iterates over all local variables and forcibly initializes the targets. Only automatic variables (locals stored in the stack region) are considered, and anything that is not a record or union type is skipped unless BYREF_ALL is set. If a local declaration is of a type of interest (__user-annotated), or is addressable and BYREF is set, the plugin emits an initialization for it (an illustration of the BYREF case follows the code below).

static unsigned int structleak_execute(void)
{
    ...

	/* enumerate all local variables and forcibly initialize our targets */
	FOR_EACH_LOCAL_DECL(cfun, i, var) {
		tree type = TREE_TYPE(var);

		gcc_assert(DECL_P(var));
		if (!auto_var_in_fn_p(var, current_function_decl))
			continue;

		if (byref != BYREF_ALL && TREE_CODE(type) != RECORD_TYPE && TREE_CODE(type) != UNION_TYPE)
			continue;

		if (TYPE_USERSPACE(type) ||
		    (byref && TREE_ADDRESSABLE(var)))
			initialize(var);
	}

	return ret;
}
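
As an illustration of the BYREF case, consider the hypothetical code below: req is an automatic variable whose address escapes (TREE_ADDRESSABLE), so with BYREF the plugin force-initializes it even though it has no __user-annotated field.

struct request {                  /* hypothetical structure */
    int type;
    int flags;
};

extern void fill_request(struct request *r);
extern int do_send(struct request *r);

int send_request(void)
{
    struct request req;           /* addressable: passed by reference below */

    fill_request(&req);           /* may leave some fields untouched */
    return do_send(&req);         /* without the plugin, stale stack bytes
                                     in req could end up observable later */
}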

However, the plugin suffers from false positives, because the __user attribute exists for kernel static-analysis tools such as Sparse; it is not a true indication of whether a pointer's contents will be copied to user space. Conversely, there can be pointers without the __user attribute whose contents are copied to user space, which the plugin misses.

The PaX team, who originally proposed the plugin, is aware of these problems and has suggested a better approach: analyzing the actual calls to copy_to_user(). It seems, however, that little attention has been paid to this since the original motivating bug, CVE-2013-2141, was fixed.

Since the plugin only initializes structures passed by reference when BYREF is set, as seen in structleak_execute(), it is highly recommended to enable BYREF or BYREF_ALL so that the plugin works as expected.
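
In terms of configuration, these modes correspond to the structleak Kconfig options; a typical hardened .config fragment would look like the following sketch (option availability, e.g. a separate BYREF-only option, depends on the kernel version):

CONFIG_GCC_PLUGINS=y
CONFIG_GCC_PLUGIN_STRUCTLEAK=y
CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL=y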

Randomizing kernel data structures

Kernel structures (struct or union) contain many juicy targets for attackers, for example function pointers, stack pointers, process credentials, and important flags. Attackers commonly try to trick the kernel into executing their exploit code by overwriting such members.

Randstruct plugin. To mitigate such attacks, the Randstruct plugin randomizes structure layout at compile time. Once the layout is randomized, it becomes much harder for an attacker to overwrite a specific member, since the offset of that member is no longer known.

Randstruct plugin works in three steps:

  1. Detect structure to randomize.
  2. Randomize layout of the structure.
  3. Find bad casts from/to randomized structure pointer and notify it.

Note: Step 3 works after step 1 and 2 are done for all structures.


Let's see what it does at each step with code.

1. Detection.

When detecting target structures, the plugin picks structures explicitly marked with the __randomize_layout attribute on their declaration, and it automatically picks structures that contain only function pointers (an example of both kinds follows the code below).

static int is_pure_ops_struct(const_tree node)
{
...
  // Returns 1 if the structure contains only function pointers.
  // (There was a bug here that could cause a false negative, and we patched it;
  //  see `Bug patch` below.)
...
}

static void randomize_type(tree type)
{
...
  if (lookup_attribute("randomize_layout", TYPE_ATTRIBUTES(TYPE_MAIN_VARIANT(type))) || is_pure_ops_struct(type))
    relayout_struct(type);
...
}
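
As a concrete illustration of both detection rules, consider the hypothetical declarations below: the first structure is picked automatically because it holds nothing but function pointers (like the kernel's own *_ops structures), while the second opts in explicitly via the __randomize_layout attribute.

struct my_ops {                       /* pure ops struct: randomized automatically */
    int  (*open)(void);
    void (*close)(void);
};

struct my_state {
    unsigned long  flags;
    struct my_ops *ops;
    char           name[16];
} __randomize_layout;                 /* explicit opt-in to layout randomization */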

2. Randomization.

Once the target structure is picked, the plugin randomizes the order of its fields with a modern in-place Fisher-Yates shuffle. If the target structure ends with a flexible array member, however, that member is excluded from the shuffle so that it stays last.

static int relayout_struct(tree type)
{
...
  /*
   * enforce that we don't randomize the layout of the last
   * element of a struct if it's a 0 or 1-length array
   * or a proper flexible array
   */
  if (is_flexible_array(newtree[num_fields - 1])) {
    has_flexarray = true;
    shuffle_length--;
  }

  shuffle(type, (tree *)newtree, shuffle_length);
...
}
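
For reference, the in-place Fisher-Yates shuffle performed over the field array looks roughly like the sketch below (simplified; the real plugin shuffles an array of GCC tree nodes and draws its random numbers from a generator seeded with a build-time randstruct seed):

/* Simplified sketch: shuffle the first n entries of fields[] in place. */
static void shuffle_fields(unsigned long *fields, size_t n,
                           size_t (*rand_below)(size_t bound))
{
    size_t i;

    for (i = n - 1; i > 0; i--) {
        size_t j = rand_below(i + 1);     /* uniform index in [0, i] */
        unsigned long tmp = fields[i];

        fields[i] = fields[j];
        fields[j] = tmp;
    }
}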

3. Bad casts notification.

\textbf{\textcolor{red}{TODO}}

References


Kconfig A A+ F U U+
FORTIFY_SOURCE Y Y Y Y Y
LIVEPATCH N N N Y Y

A: Arch 5.1.8, A+: Arch Harden 5.1.11, F: Fedora 5.1.8, U: Ubuntu 5.0.0, U+: Ubuntu LTS 4.15.0


Miscellaneous Topics

Eliminating Variable-length Arrays (VLA)

VLAs allow programmers to specify the length of an array at runtime, e.g., using a variable instead of a constant for the array size. This makes it easier to write certain kinds of logic, such as packet/buffer handling or string manipulation, but it has two critical problems: security and performance.

void test_vla(int i) {
  long buf[i];

  // => 30 instructions w/ div/mul
  // char *buf;
  // buf -= ((i << 0x3) + 0xf) / 0x10 * 0x10
}

In terms of security, such a code pattern makes it hard to estimate stack usage, which can lead to a stack clash in kernel space. Also, although often not emphasized enough, this pattern makes it easier to exploit uninitialized-use vulnerabilities, e.g., by placing arbitrary data at the proper offset of the kernel stack. In terms of performance, this benign-looking code is translated into roughly 30 native instructions that calculate the proper size and enforce stack alignment (which, if violated, raises an exception on many architectures). The translated instructions include computationally expensive ones such as div and imul, imposing unwanted overhead on a common path. Since v4.20 [1], the compiler warning on the use of VLAs (i.e., -Wvla) has been turned on globally for the kernel build, so any newly introduced VLA is flagged at build time, guiding developers to remove it; a typical fix pattern is sketched below.
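
A typical fix replaces the VLA with a bounded, fixed-size buffer (or a heap allocation when the upper bound is large); a minimal sketch of the pattern, with a hypothetical MAX_LEN bound, is:

#define MAX_LEN 64                /* hypothetical upper bound */

int test_fixed(int i) {
  long buf[MAX_LEN];              /* constant-size frame: no div/imul, and the
                                     worst-case stack usage is known statically */

  if (i < 0 || i > MAX_LEN)
    return -EINVAL;

  /* ... use buf[0 .. i-1] ... */
  return 0;
}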

Preventing mistakes in switch/case

The usage of switch/case in C is rather error-prone: an optional break statement ends each case block, and without it execution simply falls through into the next case block. Since both patterns are prevalent, it is hard to tell which control flow was intended. The most recent missing-break mistake (04/2019 at the time of writing) was in sock_ioctl(), a widely used and heavily audited function!

// @net/socket.c
long sock_ioctl(struct file *file, unsigned cmd, unsigned long arg) {
...
    case SIOCGSTAMP_OLD:
    case SIOCGSTAMPNS_OLD:
      if (!sock->ops->gettstamp) {
        err = -ENOIOCTLCMD;
        break;
      }
      err = sock->ops->gettstamp(sock, argp,
               cmd == SIOCGSTAMP_OLD,
               !IS_ENABLED(CONFIG_64BIT));
+    break;
    case SIOCGSTAMP_NEW:
    case SIOCGSTAMPNS_NEW:
      ...
      break;

To address this error-prone situation, GCC introduced a compilation warning on implicit case fall-through (i.e., -Wimplicit-fallthrough): to avoid the warning, developers must express the intention explicitly, either with a comment (/* fall through */) or with an attribute (__attribute__((fallthrough))).

+++ b/kernel/compat.c
@@ -346,8 +346,11 @@ get_compat_sigset(...)
                return -EFAULT;
        switch (_NSIG_WORDS) {
        case 4: set->sig[3] = v.sig[6] | (((long)v.sig[7]) << 32 );
+               /* fall through */
        case 3: set->sig[2] = v.sig[4] | (((long)v.sig[5]) << 32 );
+               /* fall through */
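
Where a comment is not desired, the attribute form serves the same purpose; a minimal sketch (do_one() and do_two() are hypothetical helpers):

extern void do_one(void);
extern void do_two(void);

void handle(int cmd)
{
  switch (cmd) {
  case 1:
    do_one();
    __attribute__((fallthrough));   /* explicit, warning-free fall through */
  case 2:
    do_two();
    break;
  default:
    break;
  }
}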

Fortify

FORTIFY_SOURCE originated as a userspace hardening feature built on GCC and the C library, and was later adopted by the Linux kernel. It provides support for detecting buffer overflows in various string and memory functions. Unfortunately, it cannot detect every kind of buffer overflow (as discussed below), but it is still useful since it adds an extra level of validation at low performance overhead. FORTIFY_SOURCE checks for buffer overflows in the functions below:

memcpy, mempcpy, memmove, memset, strcpy, stpcpy, strncpy, strcat, 
strncat, sprintf, vsprintf, snprintf, vsnprintf, gets

Let's dive into two of these functions: strcpy() and memcpy().

First, strcpy() checks the object sizes via __builtin_object_size(), which returns the object size determined at compile time. If the size can only be known at run time, e.g., the object was allocated via kmalloc(), __builtin_object_size() simply returns -1. If both object sizes are run-time only, strcpy() skips the overflow tests and hands the objects to __builtin_strcpy(). Otherwise, it routes the copy through memcpy(), which is also fortified, so the actual buffer-overflow check happens in memcpy(). As you can imagine, the fortified strcpy() cannot detect overflows when both object sizes are only known at run time, i.e., the case where it falls back to __builtin_strcpy().

/* defined after fortified strlen and memcpy to reuse them */
__FORTIFY_INLINE char *strcpy(char *p, const char *q)
{
	size_t p_size = __builtin_object_size(p, 0);
	size_t q_size = __builtin_object_size(q, 0);
	if (p_size == (size_t)-1 && q_size == (size_t)-1)
		return __builtin_strcpy(p, q);
	memcpy(p, q, strlen(q) + 1);
	return p;
}

memcpy() likewise checks the object sizes via __builtin_object_size(). Both read-overflow and write-overflow checks are performed here; if no overflow is detected, the copy is assumed safe and is carried out by __builtin_memcpy().

__FORTIFY_INLINE void *memcpy(void *p, const void *q, __kernel_size_t size)
{
	size_t p_size = __builtin_object_size(p, 0);
	size_t q_size = __builtin_object_size(q, 0);
	if (__builtin_constant_p(size)) {
		if (p_size < size)
			__write_overflow();
		if (q_size < size)
			__read_overflow2();
	}
	if (p_size < size || q_size < size)
		fortify_panic(__func__);
	return __builtin_memcpy(p, q, size);
}
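
Putting the two together, the sketch below (hypothetical code, kernel context assumed) shows what the fortified routines do and do not catch: a constant oversized copy is rejected at compile time, a runtime-sized copy into a fixed buffer is caught by fortify_panic() when it actually overflows, and a copy into a kmalloc()'d buffer is not checked at all because its size is unknown to __builtin_object_size().

static char dst[8];

void fortify_example(const char *src, size_t n)
{
	char *heap = kmalloc(16, GFP_KERNEL);

	/* memcpy(dst, src, 16) would be rejected at compile time:
	 * the constant size 16 exceeds p_size == 8 -> __write_overflow(). */

	/* Checked at run time: if n > 8, fortify_panic() fires. */
	memcpy(dst, src, n);

	/* Not checked: __builtin_object_size(heap, 0) == -1, so the
	 * fortified memcpy() has nothing to compare against. */
	if (heap) {
		memcpy(heap, src, n);
		kfree(heap);
	}
}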

References

  1. VLA removal for v4.20-rc1

Livepatch

Livepatch is a feature that applies kernel patches without rebooting the system. In many situations a system has to stay up and running because downtime carries a huge cost; at Facebook, for example, rebooting a single machine can reportedly take more than 20 minutes. At the same time, it is undesirable to leave known kernel bugs unpatched for long. Livepatch meets both requirements by redirecting execution from the buggy code to the new code while the system keeps running; a minimal example of such a patch module is sketched below.
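
The sketch below is a minimal livepatch module modeled on the kernel's samples/livepatch/livepatch-sample.c; the patched function name buggy_function is hypothetical, error handling is trimmed, and the single-call klp_enable_patch() flow assumes a v5.1+ kernel (older kernels also required klp_register_patch()).

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/livepatch.h>

/* Replacement for the buggy function; it must match the original's signature. */
static int livepatch_fixed_function(void)
{
	return 0;	/* patched behaviour */
}

static struct klp_func funcs[] = {
	{
		.old_name = "buggy_function",	/* hypothetical target in vmlinux */
		.new_func = livepatch_fixed_function,
	}, { }
};

static struct klp_object objs[] = {
	{
		/* name == NULL means the target function lives in vmlinux */
		.funcs = funcs,
	}, { }
};

static struct klp_patch patch = {
	.mod = THIS_MODULE,
	.objs = objs,
};

static int livepatch_init(void)
{
	return klp_enable_patch(&patch);
}

static void livepatch_exit(void)
{
}

module_init(livepatch_init);
module_exit(livepatch_exit);
MODULE_LICENSE("GPL");
MODULE_INFO(livepatch, "Y");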

Consistency model

\textbf{\textcolor{red}{Assigned to Sungbae Yoo}}

Design pattern for modules

\textbf{\textcolor{red}{Assigned to Sungbae Yoo}}

How it works

\textbf{\textcolor{red}{Assigned to Sungbae Yoo}}

Shadow data

\textbf{\textcolor{red}{Assigned to Sungbae Yoo}}

Userspace tool (kpatch)

Kpatch is a live-patching implementation for the Linux kernel developed by Red Hat.
kpatch-build is the kpatch component that converts a patch file into a kernel module.

+---------+    +---------------------+    +--------------+
| patch   |    | kpatch-build        |    | patch module |
+---------+ => | ============        | => +--------------+
| *.patch |    | Create patch module |    | *.ko         |
+---------+    +---------------------+    +--------------+

How to make kernel module

  1. Download and unpack the kernel source matching the patch's distro.
  2. Test the patch file with a dry run.
  3. Read the special section data with the command (readelf -wi "$VMLINUX")
    • alt_instr, bug_entry size, jump_entry size ...
  4. Build the original source with the compile options "-ffunction-sections and -fdata-sections".
  5. Build the patched source with the compile options "-ffunction-sections and -fdata-sections".
  6. Extract the new and modified ELF sections
    • Compare #4's output and #5's output at a section level
    • Result: an ELF object containing {.kpatch.strings, .kpatch.symbols, .kpatch.relocations}
  7. Build the patch module with #6's output

Core data structure: kpatch-elf

kpatch-build uses its own data structure, which adds extra bookkeeping structures on top of the ELF format. These extra structures describe the sections that differ between the original object and the patched object.
All intermediate objects of kpatch-build are handled in this kpatch-elf form.

struct kpatch_elf {
  Elf *elf;
  struct list_head sections;
  struct list_head symbols;
  struct list_head strings;
  int fd;
};

Core module: create-diff-object.c

This file contains the heart of the ELF object differencing engine.

  • The tool takes two ELF objects from two versions of the same source file.
    • a "base" object and a "patched" object
  • These objects need to have been compiled with the GCC options.
    • -ffunction-sections and -fdata-sections
  • The tool compares the objects at a section level to determine what sections have changed.
  • Once a list of changed sections has been generated, various rules are applied.

References

  1. Kernel Live Patching: Consistency Model
  2. kpatch - live kernel patching
  3. Anatomy of kpatch
  4. An overview of kpatch-build