Memory access after fork extremely slow on Mac OS X
The following code executes about 200 times slower on Mac OS X than on Linux. I don't know why and the problem does not seem to be trivial. I suspect a bug in gcc on the Mac or in Mac OS X itself or in my hardware.
The code forks the process, which on Mac OS X copies the page table entries but not the memory itself. The memory is copied when it is written to, which happens in the for loop at the end of the run method. There, for the first 4 calls of run, all pages have to be copied because every page is touched. For the second 4 calls to run where skip is 512, every second page needs to be copied since every second page is touched. Intuitively, the first 4 calls should take about twice as long as the second 4 calls, which is absolutely not the case. For me, the output of the program is as follows:
169.655ms
670.559ms
2784.18ms
16007.1ms
16.207ms
25.018ms
42.712ms
79.676ms
On Linux it is
5.306ms
10.69ms
20.91ms
41.042ms
6.115ms
12.203ms
23.939ms
40.663ms
Total runtime on Mac OS X is roughly 20 seconds, versus about 0.5 seconds on Linux, for the exact same program compiled with gcc both times. I've tried compiling the Mac OS X version with gcc 4, 4.2 and 4.4, with no change.
Any ideas?
Code:
#include <stdint.h>
#include <iostream>
#include <sys/types.h>
#include <unistd.h>
#include <signal.h>
#include <cstring>
#include <cstdlib>
#include <sys/time.h>
using namespace std;
class Timestamp
{
private:
    timeval time;
public:
    Timestamp() { gettimeofday(&time,0); }
    // difference between two timestamps in milliseconds
    double operator-(const Timestamp& other) const { return static_cast<double>((static_cast<long long>(time.tv_sec)*1000000+(time.tv_usec))-(static_cast<long long>(other.time.tv_sec)*1000000+(other.time.tv_usec)))/1000.0; }
};
class ForkCoW
{
public:
    void run(uint64_t size, uint64_t skip) {
        // allocate and initialize array
        void* arrayVoid;
        posix_memalign(&arrayVoid, 4096, sizeof(uint64_t)*size);
        uint64_t* array = static_cast<uint64_t*>(arrayVoid);
        for (uint64_t i = 0; i < size; ++i)
            array[i] = 0;
        // the child only sleeps, keeping the pages shared with the parent
        pid_t p = fork();
        if (p == 0)
            sleep(99999999);
        if (p < 0) {
            cerr << "ERROR: Fork failed." << endl;
            exit(-1);
        }
        {
            // parent: time the writes that trigger copy-on-write
            Timestamp start;
            for (uint64_t i = 0; i < size; i += skip) {
                array[i] = 1;
            }
            Timestamp stop;
            cout << (stop-start) << "ms" << endl;
        }
        kill(p,SIGTERM);
    }
};
int main(int argc, char* argv[]) {
    ForkCoW f;
    f.run(1ull*1000*1000, 512);
    f.run(2ull*1000*1000, 512);
    f.run(4ull*1000*1000, 512);
    f.run(8ull*1000*1000, 512);
    f.run(1ull*1000*1000, 513);
    f.run(2ull*1000*1000, 513);
    f.run(4ull*1000*1000, 513);
    f.run(8ull*1000*1000, 513);
}
The only reason for such a long delay would be this line:
sleep(300000);
Since sleep() takes its argument in seconds, this asks for 300,000 seconds of sleep. Maybe the implementation of fork() on Mac OS X is different than you expect (and it always returns 0).
This has nothing to do with C++. I rewrote your example in C, using waitpid(2) instead of sleep/SIGCHLD, and cannot reproduce the problem:
#include <errno.h>
#include <inttypes.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
void ForkCoWRun(uint64_t size, uint64_t skip) {
    // allocate and initialize array
    uint64_t* array;
    posix_memalign((void **)&array, 4096, sizeof(uint64_t)*size);
    for (uint64_t i = 0; i < size; ++i)
        array[i] = 0;
    pid_t p = fork();
    switch(p) {
    case -1:
        fprintf(stderr, "ERROR: Fork failed: %s\n", strerror(errno));
        exit(EXIT_FAILURE);
    case 0:
        {
            // child: time the writes that trigger copy-on-write, then exit
            struct timeval start, stop;
            gettimeofday(&start, 0);
            for (uint64_t i = 0; i < size; i += skip) {
                array[i] = 1;
            }
            gettimeofday(&stop, 0);
            long microsecs = (long)(stop.tv_sec - start.tv_sec) *1000000 + (long)(stop.tv_usec - start.tv_usec);
            printf("%ld.%03ld ms\n", microsecs / 1000, microsecs % 1000);
            exit(EXIT_SUCCESS);
        }
    default:
        {
            // parent: wait for the child to finish and reap it
            int exit_status;
            waitpid(p, &exit_status, 0);
            break;
        }
    }
}
int main(int argc, char* argv[]) {
    ForkCoWRun(1ull*1000*1000, 512);
    ForkCoWRun(2ull*1000*1000, 512);
    ForkCoWRun(4ull*1000*1000, 512);
    ForkCoWRun(8ull*1000*1000, 512);
    ForkCoWRun(1ull*1000*1000, 513);
    ForkCoWRun(2ull*1000*1000, 513);
    ForkCoWRun(4ull*1000*1000, 513);
    ForkCoWRun(8ull*1000*1000, 513);
}
and on OS X 10.8, 10.9, and 10.10, I get results like:
6.163 ms
12.239 ms
24.529 ms
49.223 ms
6.027 ms
12.081 ms
24.270 ms
49.498 ms
You are allocating 400 megabytes once, and once again from the fork() (since the process is duplicated, including the memory allocation).
The reason for the slowness could simply be that after the fork(), with two processes, you run out of available physical memory and start using swap space on disk.
This is usually much slower than using physical memory.
Edit following comments
I suggest you change the code to start the timing measurement after writing to the first element of the array.
array[0] = 1;
Timestamp start;
for (uint64_t i = 1; i < size; i++) {
    array[i] = 1;
}
This way, the time used by the memory allocation following the first write will not be taken into account in the timestamp.
I suspect your problem is the order of execution: Linux runs the parent first, the parent finishes, and the child then terminates because its parent is gone, whereas Mac OS X runs the child first, which involves the long sleep.
There is absolutely no guarantee in any Unix standard that the two processes after a fork will run in parallel, your assertions about the capability of the OS to do so notwithstanding.
Just to prove it's the sleep time, I replaced the "30000" in your code with "SLEEPTIME", then compiled and ran it with g++ -DSLEEPTIME=?? foo.c && ./a.out:
SLEEPTIME  output
20         20442.1
30         30468.5
40         40431.4
10         10449    <just to prove it wasn't getting longer each run>
What happens when you have the parent waitpid() on the child and ensure that it has exited (and, to be safe, handle SIGCHLD to ensure that the process is reaped)? It seems possible that on Linux the child exited sooner, so the page fault handler has less work to do for copy-on-write because the pages are referenced by only a single process.
Second... Do you have any idea what kind of work fork() has to do? In particular, it should not be assumed to be "fast". Semantically speaking, it has to duplicate every page in the process's address space. Historically, this is what old Unix did, so they say. This is improved by initially marking these pages as "copy-on-write" (that is, the pages are marked read-only and the kernel's page fault handler duplicates them at the first write), but this is still a lot of work, and it means that your first write access to every page will be slow.
I congratulate the Linux developers for getting their fork() and their copy-on-write implementation very fast for your access pattern... But it seems a very strange thing to claim that it's a huge problem if Mac OS's kernel is not as good, or if other parts of the system happen to generate different access patterns, or whatever. Fork, and writing pages after a fork, is not supposed to be fast.
I suppose what I am trying to say is that if you move your code to a kernel that has a different set of design choices and all of a sudden your fork()s are slower, tough, that's part of moving your code to a different OS.
Have you verified that fork() is working?
#include <iostream>
#include <sys/types.h>
#include <unistd.h>
int main()
{
    pid_t pid = fork();
    if( pid > 0 ) {
        std::cout << "Parent\n";
    } else if( pid == 0 ) {
        std::cout << "Child\n";
    } else {
        std::cout << "Failed to fork!\n";
    }
}
Maybe there is some restriction on Mac OS X about forking child processes.