Interaction of fork and user-space memory mapped in the kernel
Consider a Linux driver that uses get_user_pages (or get_page) to map pages from the calling process. The physical addresses of the pages are then passed to a hardware device. Both the process and the device may read and write to the pages until the parties decide to end the communication. In particular, the communication may continue using the pages after the system call that calls get_user_pages returns. The system call is in effect setting up a shared memory zone between the process and the hardware device.
I'm concerned about what happens if the process calls fork (it could be from another thread, and could happen either while the syscall that calls get_user_pages is in progress or later). In particular, if the parent writes to the shared memory area after the fork, what do I know about the underlying physical address (presumably changed due to copy-on-write)? I want to understand:
- what the kernel needs to do to defend against a potentially misbehaving process (I don't want to create a security hole!);
- what restrictions the process needs to obey so that the functionality of our driver works correctly (i.e. the physical memory remains mapped at the same address in the parent process).
  - Ideally, I would like the common case where the child process doesn't use our driver at all (it probably calls exec almost immediately) to work.
  - Ideally, the parent process should not have to take any special steps when allocating the memory, as we have existing code that passes a stack-allocated buffer to the driver.
  - I'm aware of madvise with MADV_DONTFORK, and it would be ok to have the memory disappear from the child process's space, but it's not applicable to a stack-allocated buffer.
  - “Don't use fork while you have a connection active with our driver” would be annoying, but acceptable as a last resort if point 1 is satisfied.
I'm willing to be pointed to documentation or source code. I've looked in particular at Linux Device Drivers, but didn't find this issue addressed. RTFS applied to even just the relevant part of the kernel source is a bit overwhelming.
The kernel version is not completely fixed but is a recent one (let's say ≥2.6.26). We're only targeting Arm platforms (single-processor so far but multicore is just round the corner), if it matters.
A fork() will not interfere with get_user_pages(): get_user_pages() will give you a struct page. You would need to kmap() it before being able to access it, and this mapping is done in kernel space, not userspace.
EDIT: get_user_pages() touches the page table, but you should not be worried about this (it just makes sure that the pages are mapped in userspace), and it returns -EFAULT if it had any problem doing so.
If you fork(), until copy-on-write is performed, the child will be able to see that page. Once copy-on-write is done (because the child/the driver/the parent wrote to the page through the userspace mapping -- not the kernel kmap() the driver has), that page will no longer be shared. If you still hold a kmap() on the page (in the driver code), you will not be able to know if you are holding the parent page or the child's.
1) It's not a security hole, because once you execve(), all of that is gone.
2) When you fork() you want both processes to be identical (it's a fork!). I would think that your design should allow both the parent and the child to access the driver. execve() will flush everything.
What about adding some functionality in userspace like:
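int fd = open("/dev/your_thing", O_RDWR);
void *mapping = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); /* len and the flags are illustrative choices */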
When mmap() is called on your device, you install a memory mapping, with special flags: http://os1a.cs.columbia.edu/lxr/source/include/linux/mm.h#071
You have some interesting things like:
#define VM_SHARED 0x00000008
#define VM_LOCKED 0x00002000
#define VM_DONTCOPY 0x00020000 /* Do not copy this vma on fork */
VM_SHARED will disable copy-on-write.
VM_LOCKED will disable swapping of those pages.
VM_DONTCOPY will tell the kernel not to copy the vma region on fork, although I don't think it's a good idea.
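The effect of a shared mapping on copy-on-write can be observed from userspace without any driver. The following is a minimal sketch that uses anonymous memory in place of the device mapping (an assumption made only so the example runs anywhere); the helper name parent_sees_child_write is hypothetical.

```c
#define _DEFAULT_SOURCE 1 /* expose MAP_ANONYMOUS in glibc */
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: map one anonymous page with MAP_SHARED or
 * MAP_PRIVATE, let a forked child write to it, and report whether the
 * parent observes the write.
 * Returns 1 if the parent sees the child's write, 0 if not, -1 on error.
 */
static int parent_sees_child_write(int map_flag)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    volatile char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              map_flag | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    buf[0] = 1; /* fault the page in before forking */

    pid_t pid = fork();
    if (pid < 0) {
        munmap((void *)buf, len);
        return -1;
    }
    if (pid == 0) {
        /* Child: with MAP_PRIVATE this write triggers copy-on-write. */
        buf[0] = 2;
        _exit(0);
    }

    int status;
    int result = -1;
    if (waitpid(pid, &status, 0) == pid)
        result = (buf[0] == 2) ? 1 : 0;

    munmap((void *)buf, len);
    return result;
}
```

With MAP_SHARED the parent and child keep using the same physical page, so the child's write is visible to the parent; with MAP_PRIVATE the child gets its own copy and the parent still reads the old value.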
The short answer is to use madvise(addr, len, MADV_DONTFORK) on any userspace buffers you give to your driver. This tells the kernel that the mapping should not be copied from parent to child and so there is no CoW.
The drawback is that the child inherits no mapping at that address, so if you want the child to then start using the driver it will need to remap that memory. But that is fairly easy to do in userspace.
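To illustrate the drawback, here is a small sketch, assuming a Linux system with glibc, that marks an anonymous page MADV_DONTFORK and checks what a forked child sees. The helper name child_can_read is hypothetical, and the buffer is a stand-in for whatever would be handed to the driver.

```c
#define _DEFAULT_SOURCE 1 /* expose MAP_ANONYMOUS and madvise() in glibc */
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a forked child can read the buffer,
 * 0 if the child is killed by SIGSEGV because the mapping was not
 * inherited, and -1 on any setup failure.
 */
static int child_can_read(int use_dontfork)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    volatile char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    buf[0] = 42;

    if (use_dontfork && madvise((void *)buf, len, MADV_DONTFORK) != 0) {
        munmap((void *)buf, len);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        munmap((void *)buf, len);
        return -1;
    }
    if (pid == 0) {
        /* Child: this read faults if the mapping is absent. */
        _exit(buf[0] == 42 ? 0 : 1);
    }

    int status;
    int result = -1;
    if (waitpid(pid, &status, 0) == pid) {
        if (WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV)
            result = 0;
        else if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            result = 1;
    }

    munmap((void *)buf, len);
    return result;
}
```

Without the madvise call the child reads the inherited page normally; with it, the child has no mapping at that address and faults on first access, which is why a child that wants to use the driver has to set up its own buffer.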
Update: A buffer on the stack is problematic, I'm not sure you can make it safe in general.
You can't mark it DONTFORK, because your child might be running on that stack page when it forks, or (worse in a way) it might do a function return later and hit the unmapped stack page. (I even tested this: you can happily mark your stack DONTFORK, but bad things happen when you fork.)
The other way to avoid a CoW is to create a shared mapping, but you can't map your stack shared for obvious reasons.
That means you risk a CoW if you fork. Even if the child "just" execs it might still touch the stack page and cause a CoW, leading to the parent getting a different page, which is bad.
The one minor point in your favor is that code using an on-stack buffer only needs to worry about code it calls forking, i.e. you can't use an on-stack buffer after the function has returned. So you only need to audit your callees, and if they never fork you're safe. That may still be infeasible, though, and it is fragile if the code ever changes.
I think you really want all memory that is given to your driver to come from a custom allocator in userspace. It shouldn't be that intrusive. The allocator can either mmap your device directly, as the other answer suggested, or just use anonymous mmap, madvise(DONTFORK), and probably mlock() to avoid swap out.
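A minimal sketch of the anonymous-mmap variant of such an allocator might look like the following. The names drv_buf_alloc and drv_buf_free are hypothetical, and treating an mlock() failure as non-fatal is an assumption (it can fail under a low RLIMIT_MEMLOCK); a real driver interface may want to reject unlocked buffers instead.

```c
#define _DEFAULT_SOURCE 1 /* expose MAP_ANONYMOUS and madvise() in glibc */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round len up to a whole number of pages. */
static size_t round_to_page(size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    return (len + page - 1) / page * page;
}

/*
 * Hypothetical allocator for buffers handed to the driver.
 * The memory is page-aligned, excluded from fork() via MADV_DONTFORK,
 * and locked in RAM if the process limits allow it.
 * If locked is non-NULL, *locked is set to 1 when mlock() succeeded
 * and 0 otherwise. Returns NULL on failure.
 */
static void *drv_buf_alloc(size_t len, int *locked)
{
    if (len == 0)
        return NULL;

    size_t rounded = round_to_page(len);
    void *p = mmap(NULL, rounded, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* Keep the mapping out of any child, so fork() cannot cause CoW here. */
    if (madvise(p, rounded, MADV_DONTFORK) != 0) {
        munmap(p, rounded);
        return NULL;
    }

    /* Assumption: a failed mlock() is reported to the caller, not fatal. */
    int ok = (mlock(p, rounded) == 0);
    if (locked)
        *locked = ok;

    return p;
}

/* Release a buffer from drv_buf_alloc(); munmap() also drops the lock. */
static void drv_buf_free(void *p, size_t len)
{
    if (p != NULL && len != 0)
        munmap(p, round_to_page(len));
}
```

Callers would replace a stack buffer with something like `char *buf = drv_buf_alloc(size, NULL);` and pass buf to the driver, freeing it with drv_buf_free once the connection is closed.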