Originally I wanted to do a better job of presenting the contents of this article, but there is just too much information and required background. For now this just looks like some notes that I took down 😅, sorry about that. You are welcome to email me if you have any questions.
“guest”, “VM”, and “guest VM” are used interchangeably in this article.
Arm Architecture Reference Manual K.a is assumed.
The Piece of Code that I Couldn’t Understand
I saw this during my code-reading:
```c
static int kvm_arch_vcpu_ioctl_vcpu_init(struct kvm_vcpu *vcpu,
					 struct kvm_vcpu_init *init)
```
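The part that matters for this article is further down in the same function. Roughly, it does the following (a paraphrased sketch; the exact helper and field names vary between kernel versions):

```c
	if (vcpu_has_run_once(vcpu)) {
		/*
		 * Ensure a rebooted VM will fault in RAM pages and detect
		 * if the guest MMU is turned off and flush the caches as
		 * needed.
		 */
		if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
			stage2_unmap_vm(vcpu->kvm);	/* drop every stage 2 mapping */
		else
			icache_inval_all_pou();		/* with FEAT_FWB only the I-cache needs help */
	}
```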
I’ll assume FEAT_FWB is not present for the rest of this article.
Now I don’t really understand why reinitializing a vCPU requires KVM to unmap all stage2 mappings. Let’s check the comment first.
Ensure a rebooted VM will fault in RAM pages
This is just describing what the code is doing, not the reason, so not helpful.
and detect if the guest MMU is turned off and flush the caches as needed.
Now this piece of code is not “detecting” anything (aside from checking for FEAT_FWB), let alone detecting whether the guest’s MMU is off, so I’m not sure what that part is about. Next, the comment says to flush the caches as needed; that sounds like a reasonable thing to do, but when it is needed is not explained.
What does the Mailing List Say?
When some code doesn’t make sense, it’s time to turn to the mailing list. Running git blame and following the commits often leads to the email thread where the code was discussed. This piece of code has gone through a few modifications; this thread has the more helpful discussion, in which Alexandru Elisei explained why the original author, Christoffer Dall, added stage2_unmap_vm:
I had a chat with Christoffer about stage2_unmap_vm, and as I understood it, the purpose was to make sure that any changes made by userspace were seen by the guest while the MMU is off. When a stage 2 fault happens, we do clean+inval on the dcache, or inval on the icache if it was an exec fault. This means that whatever the host userspace writes while the guest is shut down and is still in the cache, the guest will be able to read/execute.
This can be relevant if the guest relocates the kernel and overwrites the original image location, and userspace copies the original kernel image back in before restarting the vm.
Analysis
The sequence of events that is described in the mailing list is:
1. The guest runs and accesses memory.
2. Some interrupt or exception causes the CPU to switch back to host userspace.
3. Userspace pauses the guest and prepares to restart the VM, so it rewrites all guest memory, vCPU registers, etc.
4. Userspace calls the KVM API to reinitialize the vCPU (see the sketch after this list).
5. The guest restarts, with the MMU initially off.
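In step 4, “the KVM API” concretely means the KVM_ARM_VCPU_INIT ioctl on the vCPU file descriptor. A minimal sketch of what a VMM does (error handling mostly omitted; vm_fd and vcpu_fd are assumed to have been set up earlier with KVM_CREATE_VM / KVM_CREATE_VCPU):

```c
#include <linux/kvm.h>
#include <sys/ioctl.h>

static int reinit_vcpu(int vm_fd, int vcpu_fd)
{
	struct kvm_vcpu_init init;

	/* Ask KVM which vCPU target this host prefers. */
	if (ioctl(vm_fd, KVM_ARM_PREFERRED_TARGET, &init) < 0)
		return -1;

	/*
	 * Calling this on a vCPU that has already run is what ends up in
	 * kvm_arch_vcpu_ioctl_vcpu_init() and, without FEAT_FWB, in
	 * stage2_unmap_vm().
	 */
	return ioctl(vcpu_fd, KVM_ARM_VCPU_INIT, &init);
}
```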
I can think of two issues that could happen during this process.

1. Coherency gets broken in the cache, because the guest and the host use separate address spaces. For example:
   - the guest accesses virtual address X, whose corresponding physical address is P, so X(P) is in the cache
   - the host then rewrites guest memory through virtual address Y, which maps to the same physical address P, so Y(P) and X(P) are both in the cache
   - the hardware decides to write back (clean) Y(P) and invalidate it
   - the hardware later decides to write back (clean) X(P) and invalidate it; this causes the new data (Y) to be overwritten by the old data (X), so we lose the host’s updates
2. After the restart, at step 5, the guest accesses the old data (the data from before the restart) instead of the new data written by the host in step 3 above, because:
   - before the guest restarts, its contents are in both the cache and memory
   - at step 3 above, the host rewrites guest memory, but the updated data is only in the cache, not yet written back to main memory
   - the guest restarts, and since the MMU is off, the CPU reads directly from memory instead of the cache, hence it sees the old data
Problem 1
Let’s first discuss whether issue 1 can happen. There are two cases.

First, if the guest’s memory access attributes are identical to the host’s, then it’s simple:
See D8.17.1, “Data and unified caches”.
The cache acts like a PIPT cache in this case, so no coherency problem from different VAs mapping to the same PA. Ordering won’t be an issue either, because switching exception levels is a context synchronization event. However, because there aren’t guarantees of coherency between instruction cache and data cache, KVM must invalidate the instruction cache before restarting the vCPU.
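For illustration, invalidating the whole instruction cache to the PoU for the Inner Shareable domain looks like this (a minimal sketch of the architectural sequence; the kernel has its own helpers for this, so this is not the exact KVM code):

```c
/*
 * Invalidate all instruction caches to the PoU for the Inner Shareable
 * domain, then synchronize. This is the kind of maintenance that must
 * happen before the guest fetches instructions that the host has just
 * written through the data cache.
 */
static inline void invalidate_icache_all(void)
{
	asm volatile(
		"ic	ialluis\n"	/* I-cache invalidate, all, PoU, Inner Shareable */
		"dsb	ish\n"		/* wait for the invalidation to complete */
		"isb\n"			/* resynchronize instruction fetch */
		::: "memory");
}
```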
Second, even if the guest’s memory access attributes are different from the host’s, there still wouldn’t be a problem, even though we can see:
See B2.15.1.2, “Non-shareable Normal memory”.
We know the host uses Inner Cacheable, Inner Shareable attributes, and the stage 2 attributes KVM sets for the guest are also Inner Shareable, so the guest would never end up with a Non-shareable attribute, and there won’t be a coherency problem. The rules for combining stage 1 and stage 2 attributes are summarized below.
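As a rough summary (my own reading of the combining rules, without FEAT_FWB; a sketch rather than a restatement of the manual): for Normal memory, the combined cacheability is the weaker of the two stages, and the combined shareability is the wider of the two stages.

```c
enum cacheability { NON_CACHEABLE, WRITE_THROUGH, WRITE_BACK };
enum shareability { NON_SHAREABLE, INNER_SHAREABLE, OUTER_SHAREABLE };

/* Non-cacheable beats Write-Through beats Write-Back. */
static enum cacheability
combine_cacheability(enum cacheability s1, enum cacheability s2)
{
	return s1 < s2 ? s1 : s2;
}

/*
 * Outer Shareable beats Inner Shareable beats Non-shareable. With the
 * stage 2 attribute fixed to Inner Shareable by KVM, the combined
 * result can never be Non-shareable.
 */
static enum shareability
combine_shareability(enum shareability s1, enum shareability s2)
{
	return s1 > s2 ? s1 : s2;
}
```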
Now, given that both the host and the guest use the Inner Shareable attribute, we can see:
See B2.15.1.1.1, “Shareable, Inner Shareable, and Outer Shareable Normal memory”.
Very safe. Still got to flush the instruction cache though.
There are no cacheability concerns here, because the host and the guest both use the cache by definition, so the attribute must be Inner Cacheable. All caches in an operating system are expected to be in the same Inner Shareable domain.
To clarify, the need for cache maintenance operations in the statement below applies to cases where the cacheability may be arbitrary, which is not our case: we know both the guest and the host use the Inner Cacheable attribute.
See D8.17.3, “Cache maintenance requirements due to changing memory region attributes”.
Problem 2
This is the primary situation discussed in the mailing list thread, and it can indeed happen. The reason unmapping stage 2 solves this issue is that, after the restart, the guest keeps taking stage 2 page faults, and KVM cleans + invalidates the cache to the PoC by VA when handling those faults. For these cache maintenance operations:
See D7.5.9.5, “Effects of instructions that operate by VA to the PoC”.
All CPUs, whether they use the Inner Shareable or the Non-shareable attribute, will see the results of the clean + invalidate.
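To make “clean + invalidate to the PoC by VA” concrete, here is a minimal sketch of the loop such maintenance boils down to (illustrative only; the real kernel helpers read the cache line size from CTR_EL0 and, for exec faults, also invalidate the I-cache by VA with ic ivau):

```c
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE_SIZE	64	/* illustrative; real code reads CTR_EL0 */

/*
 * Clean + invalidate a VA range of the data cache to the Point of
 * Coherency, so that data written by host userspace while the guest was
 * stopped reaches memory and is visible to a guest running with its MMU
 * (and therefore its cacheable mappings) turned off.
 */
static void clean_inval_dcache_poc(void *addr, size_t size)
{
	uintptr_t cur = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
	uintptr_t end = (uintptr_t)addr + size;

	for (; cur < end; cur += CACHE_LINE_SIZE)
		asm volatile("dc civac, %0" :: "r"(cur) : "memory");

	asm volatile("dsb ish" ::: "memory");
}
```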
But do we have to unmap? Can’t we just clean + invalidate all VA ranges instead? It might be that unmapping is preferred because it delays the cache maintenance operations. A small comparison of the two methods:
unmap:
- walk all guest stage 2 page tables and clear everything
- the guest page faults on each page it accesses, and KVM does the clean + invalidate then

clean + invalidate:
- walk all guest stage 2 page tables and clean + invalidate everything
It’s an interesting question which one is better, but my guess is that the original author chose to unmap everything because it’s just easier to implement XD
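For comparison, the eager alternative would look roughly like this. Note that for_each_stage2_mapped_range() and ipa_to_host_va() are made-up helpers standing in for a stage 2 page table walk and an IPA-to-host-VA lookup; they are not real kernel APIs, and clean_inval_dcache_poc() is the sketch from the previous section.

```c
/*
 * Hypothetical eager approach: instead of unmapping, walk every range
 * that is currently mapped at stage 2 and clean + invalidate it to the
 * PoC right away. All the cost is paid up front, whereas unmapping
 * spreads it across later stage 2 faults.
 */
static void stage2_clean_inval_all(struct kvm *kvm)
{
	phys_addr_t ipa;
	size_t size;

	for_each_stage2_mapped_range(kvm, ipa, size) {	/* made-up iterator */
		void *hva = ipa_to_host_va(kvm, ipa);	/* made-up helper */

		clean_inval_dcache_poc(hva, size);
	}
}
```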
Background
- Linux assumes all CPUs under its control are in the same Inner Shareable domain, which is also assumed by the Arm architecture.
- The Normal memory attributes that Linux uses are Inner Shareable, Inner Write-Back Cacheable Non-transient, Outer Write-Back Cacheable Non-transient (see the sketch after this list).
- Device memory is not cached.
- When stage 2 translation is enabled, the stage 1 memory access attributes (controlled by the guest) are combined with the KVM-controlled stage 2 memory access attributes.
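As a reference for the second point, this is roughly how those attributes are encoded on arm64: the cacheability lives in MAIR_EL1 (Linux encodes its Normal memory type as the attribute byte 0xff, i.e. Inner and Outer Write-Back Non-transient with read/write allocation), and the shareability lives in the SH[1:0] bits of each page table descriptor (0b11 selects Inner Shareable). A sketch, not a copy of the kernel’s definitions:

```c
/*
 * MAIR_EL1 attribute byte for Normal memory as Linux configures it:
 * high nibble 0xf = Outer Write-Back Non-transient, R/W-allocate,
 * low nibble 0xf = Inner Write-Back Non-transient, R/W-allocate.
 */
#define MAIR_ATTR_NORMAL	0xffUL

/* SH[1:0] (descriptor bits [9:8]) = 0b11 selects Inner Shareable. */
#define PTE_SH_INNER		(0x3UL << 8)
```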
Related ARM Architecture Reference Manual Sections
B2.12 Caches and memory hierarchy
B2.15 Memory types and attributes
B2.16 Mismatched memory attributes
D7.5 Cache support
D8.2.12 The effects of disabling an address translation stage
D8.6 Memory region attributes
D8.17 Caches