clone()/fork()/process creation is slow on some machines

Creating new processes is very slow on some of my machines, but not on others.

The machines are all similar, and some of the slow machines are running the exact same workloads on the same hardware and kernel (2.6.32-26, Ubuntu 10.04) as some of the fast machines. Tasks that do not involve process creation run at the same speed on all machines.

For example, this program executes ~50 times slower on the affected machines:

int main()
{
    int i;
    for (i=0;i<10000;i++)
    {
        int p = fork();
        if (!p) exit(0);
        waitpid(p);
    }
    return 0;
}

What could be causing task creation to be much slower, and what other differences could I look for in the machines?

Edit1: Running bash scripts (as they spawn a lot of subprocesses) is also very slow on these machines, and strace on the slow scripts shows that the slowdown is in the clone() system call.

Edit2: vmstat doesn't show any significant differences on the fast vs. slow machines. They all have more than enough RAM for their workloads and don't go to swap.

Edit3: I don't see anything suspicious in dmesg

Edit4: I'm not sure why this is on Stack Overflow now; I'm not asking about the example program above (it's just there to demonstrate the problem) but about Linux administration/tuning. But if people think it belongs here, cool.


We experienced the same issue with our application stack, noticing massive degradation in application performance and longer clone() times in strace. Using your test program across 18 nodes, I reproduced your results on the same 3 nodes where we were seeing slow clone() times. All nodes were provisioned the same way, but with slightly different hardware. We checked the BIOS, vmstat, and vm.overcommit_memory, and replaced the RAM, with no improvement. We then moved our drives to updated hardware and the issue was resolved.

CentOS 5.9 2.6.18-348.1.1.el5 #1 SMP Tue Jan 22 16:19:19 EST 2013 x86_64 x86_64 x86_64 GNU/Linux

"bad" and "good" lspci:

$ diff ../bad_lspci_sort ../good_lspci_sort 
< Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (rev 05)
> Ethernet controller: Intel Corporation 82574L Gigabit Network Connection

< Host bridge: Intel Corporation Xeon E3-1200 Processor Family DRAM Controller (rev 09)
> Host bridge: Intel Corporation Xeon E3-1200 v2/Ivy Bridge DRAM Controller (rev 09)

< ISA bridge: Intel Corporation C204 Chipset Family LPC Controller (rev 05)
> ISA bridge: Intel Corporation C202 Chipset Family LPC Controller (rev 05)

< PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 6 (rev b5)
> PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 7 (rev b5)

< VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200e [Pilot] ServerEngines (SEP1) (rev 04)
> VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 (rev 0a)


I might start by using strace to see what system calls are being run, and where the slow ones hang. I'm also curious as to how you're using waitpid() here. On my systems, the signature for waitpid is

pid_t waitpid(pid_t pid, int *status, int options);

It sort of looks like you meant wait(), but you're calling waitpid() with only the child's pid and leaving out the int *status and options arguments entirely. I would expect that to cause some strange behavior if whatever garbage ends up in the options argument is interpreted as a flag mask (e.g. WNOHANG).
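
For comparison, here is a minimal sketch of the same loop calling waitpid() with its full three-argument signature. This is only an illustration of the signature above, not necessarily what the original program intended:

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int i, status;
    for (i = 0; i < 10000; i++)
    {
        pid_t p = fork();
        if (p == 0)
            _exit(0);              /* child exits immediately */
        waitpid(p, &status, 0);    /* parent blocks until that child is reaped */
    }
    return 0;
}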


Differences to look at are the kernel (parameters, device drivers, active modules) and the hardware (CPU model, number of CPUs, memory configuration, peripheral devices).

Also: do machines change behavior after a reboot/power cycle?

EDIT:

The low performance is probably related to the (virtual) memory hierarchy. This hierarchy is very complex, and that complexity can lead to strange effects. Somewhere along the path from the TLB through the data caches to main memory, strange conflicts may occur. These can be caused by a slightly different memory layout of the kernels on the different machines, or because the memory hierarchy (hardware) is actually slightly different.

Of course there could be other reasons: a strange peripheral (generating interrupts), a different workload (e.g. the number of active processes), and so on.

If you can solve this problem, please share the results! Thanks.


Have you checked the BIOS configuration, in case the CPU caches are disabled, the power management settings are misconfigured, some systems are overheating, or some memory is underclocked?


Are the values of /sbin/sysctl vm.overcommit_memory the same for all the systems? If not, that could explain the difference.

Allowing overcommitment will make fork() much faster, but it means newly-allocated pages won't be guaranteed backing by RAM or swap. If and when you touch an unbacked page, the OS has to find backing for it, and if it can't, the process gets killed. This isn't a problem if all the child does is exec(), as happens in a system() call, since all those unbacked pages are discarded; however, it could lead to serious problems for other users of fork().
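
If you want to put a number on it, something like the quick sketch below (just the poster's fork/wait loop wrapped with gettimeofday() timing) could be run on a fast and a slow machine, or before and after changing the setting, to compare the per-fork cost:

#include <stdio.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define ITERATIONS 10000

int main(void)
{
    struct timeval start, end;
    int i, status;

    gettimeofday(&start, NULL);
    for (i = 0; i < ITERATIONS; i++)
    {
        pid_t p = fork();
        if (p == 0)
            _exit(0);               /* child does nothing but exit */
        waitpid(p, &status, 0);     /* reap the child before the next fork */
    }
    gettimeofday(&end, NULL);

    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_usec - start.tv_usec) / 1e6;
    printf("%d fork/wait cycles: %.3f s total, %.1f us per fork\n",
           ITERATIONS, elapsed, elapsed * 1e6 / ITERATIONS);
    return 0;
}

Comparing the per-fork figure with vm.overcommit_memory set to 0, 1, and 2 (heuristic, always, and never overcommit) should show whether that setting accounts for the ~50x difference.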
