kmem_guard_t in iOS 16 / macOS 13

📅 [ Archival Date ]
Nov 21, 2022 10:38 PM
🏷️ [ Tags ]
xnu, macOS, iOS
✍️ [ Author ]

Saar Amar

Intro

In iOS 16 / macOS 13 Apple added “guards” for certain types of allocations. This is an interesting change, and since I haven’t seen any technical writeup on the subject, I would like to share some details about it.

Funnily enough, this isn’t a reversing blogpost. I intended to post this a few months ago after reversing this mechanism, but I didn’t find the time. And thanks to the new macOS 13 / iOS 16 OSS drop (tweet, ref), we have source code to look at (tweet). I’m very happy Apple open-sources all of this, it’s super useful, thank you!

Credit to Proteas who bindiffed this change right away back in the day (tweet)!

Guarded allocations

Let me state the obvious right away: we are not talking about guard pages. Guard pages have existed for ~1000 years now; we are talking about a modern change Apple added to the kmem_* subsystem.

Because we are talking about a security mitigation, we can expect to panic on failures in specific validations in the kernel. This means we can identify the relevant call to panic and make our way up the callstack. This is useful in reversing and makes everything easy. However, as I said, we have the XNU sources for this functionality, so in this blogpost we will go through the source code.

The panic flow

The relevant panic happens in the function __kmem_entry_validate_panic (osfmk/vm/vm_kern.c):

__abortlike
static void
__kmem_entry_validate_panic(
	vm_map_t        map,
	vm_map_entry_t  entry,
	vm_offset_t     addr,
	vm_size_t       size,
	uint32_t        flags,
	kmem_guard_t    guard)
{
	const char *what = "???";

	if (entry->vme_atomic != guard.kmg_atomic) {
		what = "atomicity";
	} else if (entry->is_sub_map != guard.kmg_submap) {
		what = "objectness";
	} else if (addr != entry->vme_start) {
		what = "left bound";
	} else if ((flags & KMF_GUESS_SIZE) == 0 && addr + size != entry->vme_end) {
		what = "right bound";
#if __LP64__
	} else if (guard.kmg_context != entry->vme_context) {
		what = "guard";
#endif
	}

	panic("kmem(map=%p, addr=%p, size=%zd, flags=0x%x): "
	    "entry:%p %s mismatch guard(0x%08x)",
	    map, (void *)addr, size, flags, entry,
	    what, guard.kmg_context);
}

As you can see, this function is called when the decision to panic has already been made, and it just neatly wraps the call to panic. It “resolves” all the details for the panic string, including the reason for panicking (the what variable) and the “guard”.

This function is called from two callsites, after a call to __kmem_entry_validate_guard returns false:

  1. kmem_entry_validate_guard
  2. kmem_size_guard

Let’s see which flows reach these two functions:

  • The functions that call kmem_entry_validate_guard are: vm_map_delete, kmem_realloc_guard.
  • The functions that call kmem_size_guard are: kfree_large, kern_os_realloc_external.

Ok, makes sense - we can see the new security checks (which we will elaborate on in a minute) are done before operations that interact with the allocation (free/delete mapping/reallocation/etc.).

Just to give a better picture, let’s look at the two functions that actually check and panic. They are very simple and straightforward:

void
kmem_entry_validate_guard(
	vm_map_t        map,
	vm_map_entry_t  entry,
	vm_offset_t     addr,
	vm_size_t       size,
	kmem_guard_t    guard)
{
	if (!__kmem_entry_validate_guard(entry, addr, size, KMEM_NONE, guard)) {
		__kmem_entry_validate_panic(map, entry, addr, size, KMEM_NONE, guard);
	}
}
vm_size_t
kmem_size_guard(
	vm_map_t        map,
	vm_offset_t     addr,
	kmem_guard_t    guard)
{
	kmem_flags_t flags = KMEM_GUESS_SIZE;
	vm_map_entry_t entry;
	vm_size_t size;

	vm_map_lock_read(map);

	if (!vm_map_lookup_entry(map, addr, &entry)) {
		__kmem_entry_not_found_panic(map, addr);
	}

	if (!__kmem_entry_validate_guard(entry, addr, 0, flags, guard)) {
		__kmem_entry_validate_panic(map, entry, addr, 0, flags, guard);
	}

	size = (vm_size_t)(entry->vme_end - entry->vme_start);

	vm_map_unlock_read(map);

	return size;
}

Ok, makes sense. Pretty much what one would expect to see. Now let’s see exactly which new information/security properties are enforced here.

Allocation validation

The kmem_* subsystem knows something went wrong by comparing two structures - the mapping entry associated with the allocation (vm_map_entry_t) and the new guard structure (kmem_guard_t).

To see what exactly is checked we can simply look at __kmem_entry_validate_guard:

static bool
__kmem_entry_validate_guard(
	vm_map_entry_t  entry,
	vm_offset_t     addr,
	vm_size_t       size,
	kmem_flags_t    flags,
	kmem_guard_t    guard)
{
	if (entry->vme_atomic != guard.kmg_atomic) {
		return false;
	}

	if (!guard.kmg_atomic) {
		return true;
	}

	if (entry->is_sub_map != guard.kmg_submap) {
		return false;
	}

	if (addr != entry->vme_start) {
		return false;
	}

	if ((flags & KMEM_GUESS_SIZE) == 0 && addr + size != entry->vme_end) {
		return false;
	}

#if __LP64__
	if (!guard.kmg_submap && guard.kmg_context != entry->vme_context) {
		return false;
	}
#endif

	return true;
}

We would also like to see kmem_guard_t, along with its great documentation:

/*!
 * @typedef kmem_guard_t
 *
 * @brief
 * KMEM guards are used by the kmem_* subsystem to secure atomic allocations.
 *
 * @discussion
 * This parameter is used to transmit the tag for the allocation.
 *
 * If @c kmg_atomic is set, then the other fields are also taken into account
 * and will affect the allocation behavior for this allocation.
 *
 * @field kmg_tag               The VM_KERN_MEMORY_* tag for this entry.
 * @field kmg_type_hash         Some hash related to the type of the allocation.
 * @field kmg_atomic            Whether the entry is atomic.
 * @field kmg_submap            Whether the entry is for a submap.
 * @field kmg_context           A use defined 30 bits that will be stored
 *                              on the entry on allocation and checked
 *                              on other operations.
 */
typedef struct {
	uint16_t                kmg_tag;
	uint16_t                kmg_type_hash;
	uint32_t                kmg_atomic : 1;
	uint32_t                kmg_submap : 1;
	uint32_t                kmg_context : 30;
} kmem_guard_t;

We have it all. This structure is used to describe allocations in the kmem_* subsystem. As the documentation suggests, this new functionality exists to secure atomic allocations:

"KMEM guards are used by the kmem_* subsystem to secure atomic allocations."

However, we don’t need (or want) to rely on documentation. We have code, and code is the most reliable thing we can ever have. To be fair, I mean binary code, not source code. However, I’ll paste the source code here, and you can trust me (and the compiler) that this is what happens :P Indeed, you can see __kmem_entry_validate_guard returns true if guard.kmg_atomic is 0:

	if (!guard.kmg_atomic) {
		return true;
	}

As we can see from __kmem_entry_validate_panic the panic happens if there is any inconsistency between the guard and the mapping entry, which could be:

  • atomicity (one is atomic, the other is not).
  • “objectness” (one is sub_map, the other is not).
  • bounds - the addr and size arguments do not match the information in vm_map_entry_t.
  • the “context”s are different.

Now, what is this “*_context”?

The context

In this blogpost I would like to focus on the last check:

	if (!guard.kmg_submap && guard.kmg_context != entry->vme_context) {
		return false;
	}

This check compares the vme_context in the vm_map_entry_t structure associated with the allocation and the kmg_context in the kmem_guard_t. If they are different, the function returns false, and the caller will panic on “guard mismatch”.

This context is the “guard” we are talking about. It’s a 30-bit value which XNU stores in the vme_context field of the vm_map_entry_t structure and checks on different operations. XNU gets this value from the kmg_context field in kmem_guard_t instances, which are created on the fly by callers of kmem_* functionality.

Setting vme_context

As we saw, the mapping entry is the structure that actually holds and keeps track of this context (in the vme_context field). There is only one place in XNU that sets this field directly - VME_OBJECT_SET. Of course, many callsites call this function, but let’s start with VME_OBJECT_SET:

static inline void
VME_OBJECT_SET(
	vm_map_entry_t entry,
	vm_object_t    object,
	bool           atomic,
	uint32_t       context)
{
	__builtin_assume(((vm_offset_t)object & 3) == 0);

	entry->vme_atomic = atomic;
	entry->is_sub_map = false;
#if __LP64__
	if (atomic) {
		entry->vme_context = context;
	} else {
		entry->vme_context = 0;
	}
#else
	(void)context;
#endif
...

Not that we need more evidence that the context exists only to protect atomic allocations, but we can see that if the allocation is not atomic, vme_context is set to 0. And if the allocation is atomic - we use the context argument.

For example, you can see that in kmem_realloc_guard, the context argument is kmg_context from the guard:

...
VME_OBJECT_SET(newentry, object, guard.kmg_atomic, guard.kmg_context);
VME_ALIAS_SET(newentry, guard.kmg_tag);
...

Now let’s see how XNU generates kmg_context.

Context computation

All the functions that set kmg_context in the guards (and actually calculate the hash instead of setting it to 0) use os_hash_kernel_pointer. This function computes a hash over a pointer, and the hash is the actual guard:

/*!
 * @function os_hash_kernel_pointer
 *
 * @brief
 * Hashes a pointer from a zone.
 *
 * @discussion
 * This is a really cheap and fast hash that will behave well for pointers
 * allocated by the kernel.
 *
 * This should be not used for untrusted pointer values from userspace,
 * or cases when the pointer is somehow under the control of userspace.
 *
 * This hash function utilizes knowledge about the span of the kernel
 * address space and inherent alignment of zalloc/kalloc.
 *
 * @param pointer
 * The pointer to hash.
 *
 * @returns
 * The hash for this pointer.
 */
static inline uint32_t
os_hash_kernel_pointer(const void *pointer)
{
	uintptr_t key = (uintptr_t)pointer >> 4;
	key *= 0x5052acdb;
	return (uint32_t)key ^ __builtin_bswap32((uint32_t)key);
}

Please note that this function is always inlined and it’s fast.
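
To get a feel for what these guard values look like, here is a small userspace reproduction of the hash (a sketch for illustration only - this is obviously not kernel code, and the owner_slots array is just a stand-in for the real storage of an allocation’s pointer):

#include <stdint.h>
#include <stdio.h>

/* Userspace copy of os_hash_kernel_pointer, for illustration only. */
static uint32_t
hash_kernel_pointer(const void *pointer)
{
	uintptr_t key = (uintptr_t)pointer >> 4;
	key *= 0x5052acdb;
	return (uint32_t)key ^ __builtin_bswap32((uint32_t)key);
}

/* Two "owner" slots, kept at least 16 bytes apart (in the kernel this
 * spacing comes from zalloc/kalloc alignment, hence the >> 4). */
static void *owner_slots[4];

int
main(void)
{
	char buf[16];

	/* Both slots hold the same pointer value... */
	owner_slots[0] = buf;
	owner_slots[2] = buf;

	/* ...but the hash depends on the address of the slot, not on the
	 * pointer value stored in it, so the two contexts (almost surely)
	 * differ. Only the low 30 bits end up in kmg_context/vme_context. */
	printf("context A = 0x%08x\n", hash_kernel_pointer(&owner_slots[0]) & 0x3fffffff);
	printf("context B = 0x%08x\n", hash_kernel_pointer(&owner_slots[2]) & 0x3fffffff);
	return 0;
}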

Bookkeeping

The vm_map_entry_ts are used just as before (you can see the lookup is done by calling vm_map_lookup_entry, RB trees, classic stuff, etc.). This is part of XNU MM 101, and I’m not going to cover this here because it’s not in the scope of this blogpost.

Let’s consider the additional metadata we need to keep track of. The context is 30 bits wide - which means the bookkeeping requires an additional 30 bits per allocation. Apple stores it in the mapping entry - the vme_context field in vm_map_entry_t. This is an important change to a key structure.

Below is the diff in vm_map_entry (/osfmk/vm/vm_map.h):

 struct vm_map_entry {
-       struct vm_map_links     links;          /* links to other entries */
+       struct vm_map_links     links;                      /* links to other entries */
 #define vme_prev                links.prev
 #define vme_next                links.next
 #define vme_start               links.start
 #define vme_end                 links.end

        struct vm_map_store     store;
-       union vm_map_object     vme_object;     /* object I point to */
-       vm_object_offset_t      vme_offset;     /* offset into object */

-       unsigned int
-       /* boolean_t */ is_shared:1,    /* region is shared */
-       /* boolean_t */ is_sub_map:1,   /* Is "object" a submap? */
-       /* boolean_t */ in_transition:1, /* Entry being changed */
-       /* boolean_t */ needs_wakeup:1, /* Waiters on in_transition */
-       /* vm_behavior_t */ behavior:2, /* user paging behavior hint */
+       union {
+               vm_offset_t     vme_object_value;
+               struct {
+                       vm_offset_t vme_atomic:1;           /* entry cannot be split/coalesced */
+                       vm_offset_t is_sub_map:1;           /* Is "object" a submap? */
+                       vm_offset_t vme_submap:VME_SUBMAP_BITS;
+               };
+#if __LP64__
+               struct {
+                       uint32_t    vme_ctx_atomic : 1;
+                       uint32_t    vme_ctx_is_sub_map : 1;
+                       uint32_t    vme_context : 30;
+                       vm_page_object_t vme_object;
+               };
+#endif
+       };

And just as a reminder:

/*
 *	Types defined:
 *
 *	vm_map_t		the high-level address map data structure.
 *	vm_map_entry_t		an entry in an address map.
 *	vm_map_version_t	a timestamp of a map, for use with vm_map_lookup
 *	vm_map_copy_t		represents memory copied from an address map,
 *				 used for inter-map copy operations
 */
typedef struct vm_map_entry     *vm_map_entry_t;
#define VM_MAP_ENTRY_NULL       ((vm_map_entry_t) NULL)

Please note that Apple does not need to store a kmem_guard_t instance per allocation. The only relevant value to keep track of is the context (the 30-bit hash), and it is already stored in the mapping entry - there isn’t a reason to store it twice. And since the guard is derived from the owner of the allocation and can be built on the fly, there isn’t any reason to keep a whole kmem_guard_t instance per allocation.

Actually, it’s more than that. The fact that each caller builds the kmem_guard_t on the fly is important and helps, because it means we have two properties:

  1. It’s not attacker controlled. For example, the kmg_atomic field is set from a constant value in the code, which means it cannot be altered by the attacker.
  2. It’s derived on the fly based on rules and control flow per client.

We can consider these guards as a way for the callers to describe how they expect the allocation to look, based on the control flow.

Example

The following logic in kfree_large calls kmem_size_guard and kmem_free_guard while passing a dynamically generated guard, with constant kmg_atomic, kmg_tag, and kmg_type_hash, and a context generated from owner:

static void
kfree_large(
	vm_offset_t             addr,
	vm_size_t               size,
	kmf_flags_t             flags,
	void                   *owner)
{
#if CONFIG_KERNEL_TBI && KASAN_TBI
	if (flags & KMF_GUESS_SIZE) {
		size = kmem_size_guard(kernel_map, VM_KERNEL_TBI_FILL(addr),
		    kalloc_guard(VM_KERN_MEMORY_NONE, 0, owner));
		flags &= ~KMF_GUESS_SIZE;
	}
	addr = kasan_tbi_tag_large_free(addr, size);
#endif /* CONFIG_KERNEL_TBI && KASAN_TBI */
#if KASAN_KALLOC
	/* TODO: quarantine for kasan large that works with guards */
	kasan_poison_range(addr, size, ASAN_VALID);
#endif

	size = kmem_free_guard(kernel_map, addr, size, flags,
	    kalloc_guard(VM_KERN_MEMORY_NONE, 0, owner));

	counter_dec(&kalloc_large_count);
	counter_add(&kalloc_large_total, -(uint64_t)size);
	KALLOC_ZINFO_SFREE(size);
	DTRACE_VM3(kfree, vm_size_t, size, vm_size_t, size, void*, addr);
}

Indeed, we can see the call to kalloc_guard, which could not be simpler:

static kmem_guard_t
kalloc_guard(vm_tag_t tag, uint16_t type_hash, const void *owner)
{
	kmem_guard_t guard = {
		.kmg_atomic      = true,
		.kmg_tag         = tag,
		.kmg_type_hash   = type_hash,
		.kmg_context     = os_hash_kernel_pointer(owner),
	};

	/*
	 * TODO: this use is really not sufficiently smart.
	 */

	return guard;
}

Additional fields

Before we keep going, it might be worth mentioning the other fields and values besides *_context.

Tag?

Well, as you probably noticed, the function __kmem_entry_validate_guard checks all the fields of kmem_guard_t besides kmg_tag and kmg_type_hash. I’m intentionally not discussing these tags here because Apple mostly uses them for statistics/counting (if you are curious, these are vm_tag_t, you can see it in the source).

Since it’s not used for security, I’m going to ignore these tags right now :)

Bounds?

Well, unlike vm_tag_t, bounds and size certainly have great value for security (it’s like saying water has value for life). That’s interesting, because the function kmem_realloc_guard has the following “TODO”:

	/*
	 *	Locate the entry:
	 *	- wait for it to quiesce.
	 *	- validate its guard,
	 *	- learn its correct tag,
	 */
again:
	if (!vm_map_lookup_entry(map, oldaddr, &oldentry)) {
		__kmem_entry_not_found_panic(map, oldaddr);
	}
	if ((flags & KMR_KOBJECT) && oldentry->in_transition) {
		oldentry->needs_wakeup = true;
		vm_map_entry_wait(map, THREAD_UNINT);
		goto again;
	}
	kmem_entry_validate_guard(map, oldentry, oldaddr, oldsize, guard);
	if (!__kmem_entry_validate_object(oldentry, ANYF(flags))) {
		__kmem_entry_validate_object_panic(map, oldentry, ANYF(flags));
	}
	/*
	 *	TODO: We should validate for non atomic entries that the range
	 *	      we are acting on is what we expect here.
	 */

...

It seems we all agree about what should be checked, even for non-atomic entries :)

Security value

First of all, the threat model here is that the attacker does not have memory corruption yet - they have the ability to influence/control addr/size/etc., and they are looking to expand the set of primitives they hold. With these new validations in place, we get really nice hardening against that: one can no longer use arbitrary free gadgets to mess with the kmem_* subsystem. Common attacks on kmem_* are:

  • calling free with a huge size.
  • calling free with a compromised/incorrect addr.

This lets attackers deallocate ranges spanning several allocations, which they can then corrupt with a UAF, and then the fun begins. There have been a lot of exploits using such techniques (you probably recall games with OSData, OSArray, and in general backing storages aliasing with regular allocations). This mitigation helps the allocator verify the input it gets.

Please note that Apple generates the context from the location that stores the pointer to the allocation (and not simply the allocation’s VA itself). With this behavior in place, if an attacker has the primitive to free an arbitrary range, in addition to matching the bounds, they now need to do it through the right “owner”.
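
To make this concrete, here is a minimal, self-contained model of the check (this is not the real XNU code - the struct below holds only the couple of vm_map_entry fields we care about, and the fake_* names are mine): an allocation records the owner-derived context, and a later free rebuilds the guard from whatever owner it is handed. A free gadget routed through the wrong owner rebuilds the wrong context and trips the validation.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Only the bits of vm_map_entry we care about, heavily simplified. */
struct fake_entry {
	uint32_t  vme_atomic  : 1;
	uint32_t  vme_context : 30;
	uintptr_t vme_start;
	uintptr_t vme_end;
};

/* Same math as os_hash_kernel_pointer, shown earlier. */
static uint32_t
hash_kernel_pointer(const void *pointer)
{
	uintptr_t key = (uintptr_t)pointer >> 4;
	key *= 0x5052acdb;
	return (uint32_t)key ^ __builtin_bswap32((uint32_t)key);
}

/* "Allocation": record the owner-derived context on the entry. */
static void
fake_alloc(struct fake_entry *e, uintptr_t addr, size_t size, const void *owner)
{
	e->vme_atomic  = 1;
	e->vme_context = hash_kernel_pointer(owner);   /* truncated to 30 bits */
	e->vme_start   = addr;
	e->vme_end     = addr + size;
}

/* "Free": the caller rebuilds the guard from its own owner slot, and the
 * kmem layer compares it (plus the bounds) against what was recorded. */
static void
fake_free(struct fake_entry *e, uintptr_t addr, size_t size, const void *owner)
{
	uint32_t kmg_context = hash_kernel_pointer(owner) & 0x3fffffff;

	if (addr != e->vme_start || addr + size != e->vme_end) {
		fprintf(stderr, "panic: bound mismatch\n");
		abort();
	}
	if (e->vme_atomic && kmg_context != e->vme_context) {
		fprintf(stderr, "panic: guard mismatch\n");
		abort();
	}
	printf("free ok\n");
}

/* Keep the two owner slots 16+ bytes apart, like real kernel storage. */
static void *owner_slots[4];

int
main(void)
{
	struct fake_entry e;

	fake_alloc(&e, 0x10000, 0x4000, &owner_slots[0]);
	fake_free(&e, 0x10000, 0x4000, &owner_slots[0]);   /* right owner slot: ok */
	fake_free(&e, 0x10000, 0x4000, &owner_slots[2]);   /* wrong owner slot: guard mismatch */
	return 0;
}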

Example

If you recall, in the 3rd part of my ipc_kmsg blogpost I wrote the following sentence: “I love coffee, whiskey, and IOSurface”. And, well, to be fair, I should have added “OSData” to this list. To make it up to this structure, which plays a significant role in our lives, let’s look at it right now and see what this owner is in an OSData-related flow.

Here is the flow to kfree_large from OSData:

OSData::ensureCapacity
	krealloc_ext
		kfree_large
			kalloc_guard

And, OSData::ensureCapacity looks as follows:

unsigned int
OSData::ensureCapacity(unsigned int newCapacity)
{
	struct kalloc_result kr;
	unsigned int finalCapacity;

	if (newCapacity <= capacity) {
		return capacity;
	}

	finalCapacity = (((newCapacity - 1) / capacityIncrement) + 1)
	    * capacityIncrement;

	// integer overflow check
	if (finalCapacity < newCapacity) {
		return capacity;
	}

	kr = krealloc_ext((void *)KHEAP_DATA_BUFFERS, data, capacity, finalCapacity,
	    Z_VM_TAG_BT(Z_WAITOK_ZERO | Z_FULLSIZE | Z_MAY_COPYINMAP,
	    VM_KERN_MEMORY_LIBKERN), (void *)&this->data);

	if (kr.addr) {
		size_t delta = 0;

		data     = kr.addr;
		delta   -= capacity;
		capacity = (uint32_t)MIN(kr.size, UINT32_MAX);
		delta   += capacity;
		OSCONTAINER_ACCUMSIZE(delta);
	}

	return capacity;
}

Fantastic. The last argument to krealloc_ext is the owner, and we can see it is set to (void *)&this->data.

Now let’s discuss the threat model and see why all of that makes sense :)

The right kind of mitigations

This section is going to take a turn toward “generic memory safety mitigations”, but it’s important.

When we build mitigations, the first question should be “what is the threat model?”. Obviously, the threat model CANNOT be “the attacker has arbitrary r/w”. That contradicts one of the fundamental laws of physics, which reads “arbitrary r/w –> game over, always”. Therefore, the threat model is usually “an attacker has the ability to trigger a memory corruption of the following type…”, which can be UAF / OOB / straightforward type confusion / etc.

I have pretty strong (maybe too strong) feelings about how we should build mitigations. In my opinion, for a mitigation to have high ROI, it should target 1st order primitives rather than specific exploitation techniques (i.e., we should aim to kill bug classes). Or at least, we should aim to get as close to the 1st order primitive as possible. That’s precisely why I was so excited to see the following text in the best blogpost ever:

Most kernel memory corruption exploits go through a similar progression:

vulnerability → constrained memory corruption → strong memory corruption → memory read/write → control flow integrity bypass → arbitrary code execution

The idea is that the attacker starts from the initial vulnerability and builds up stronger and stronger primitives before finally achieving their goal: the ability to read and write kernel memory, or execute arbitrary code in the kernel. It’s best to mitigate these attacks as early in the chain as possible, for two reasons.

Example - kalloc_type and dataPAC

Clearly, kalloc_type does precisely that. It’s probably one of the best modern mitigations currently in production. So many UAFs are not exploitable anymore because one can only corrupt an instance of type A with another instance of type A - let’s call it “same-type-confusion” - which highly limits the attacker. To be precise (yes, I’m annoying, but that’s important) - it’s not “same-type-confusion”, because, unlike IsoHeap, kalloc_type doesn’t have 100% isolation. It isolates types based on signatures. So it’s really “same-signature-confusion”. We can keep referring to it as “same-type-confusion”; just please keep this in mind. It’s also worth mentioning that the signatures are built in a pretty wise and useful way (see SEAR’s blog for the details).

Some people asked me “why do you like dataPAC, if you always talk in favor of killing bug classes instead of protecting specific structures?”. Let’s consider one example. Clearly, even with kalloc_type, we could still do some nice UAF exploits, provided the vulnerable structure in question has some nice properties. For example, if the structure in question has:

  • fields that specify count/length/size – we could corrupt them with another value, and convert the UAF into an OOB.
  • unions – well, everything inside that union is up for type confusion now.

However - what if the pointers inside these structures are signed using dataPAC, with the type as auxiliary? That’s right, dataPAC just moved way up the chain, and it targets something much closer to the 1st order primitive. Apple actually created here a scenario of “this UAF is not exploitable”, even though the structure has unions, because you need a PAC bypass to create this confusion.

That’s exactly what we should aim for when we build mitigations. The scenario of “hey, we have this new memory safety bug, but we can say that without further primitives, it’s not exploitable”.

What does kmem have to do with all of that?

I believe protecting allocator-related structures/metadata is quite important. That’s because the allocator can expose highly powerful exploitation primitives from highly restricted ones. Consider the attack of calling free with a huge size - from underflowing/modifying one integer, you get a huge, massive UAF on a ton of structures. That’s a huge leap. And since the cost of this mitigation is relatively low, it seems very worth it.

Going back to the threat model - indeed, the threat model here is that the attacker does not have memory corruption yet. They have the ability to influence/control addr/size/etc., and they are looking to expand the set of primitives they hold - and thanks to the allocator, they can create a powerful memory corruption with relatively low effort. This is exactly what this mitigation addresses.

I’m very happy to see all these efforts from Apple.

I hope you enjoyed this blogpost.

Thanks,

Saar Amar.