clone()/fork()/process creation is slow on some machines

Creating new processes is very slow on some of my machines, but not on others.

The machines are all similar, and some of the slow machines are running the exact same workloads on the same hardware and kernel (2.6.32-26, Ubuntu 10.04) as some of the fast machines. Tasks that do not involve process creation run at the same speed on all machines.

For example, this program executes ~50 times slower on the affected machines:

int main()
{
    int i;
    for (i=0;i<10000;i++)
    {
        int p = fork();
        if (!p) exit(0);
        waitpid(p);
    }
    return 0;
}

What could be causing task creation to be much slower, and what other differences could I look for in the machines?

Edit1: Running bash scripts (as they spawn a lot of subprocesses) is also very slow on these machines, and strace on the slow scripts shows that the slowdown is in the clone() system call.

Edit2: vmstat doesn't show any significant differences on the fast vs. slow machines. They all have more than enough RAM for their workloads and don't go to swap.

Edit3: I don't see anything suspicious in dmesg

Edit4: I'm not sure why this is on Stack Overflow now; I'm not asking about the example program above (it's just there to demonstrate the problem) but about Linux administration/tuning. But if people think it belongs here, cool.


We experienced the same issue with our application stack, noticing massive degradation in application performance and longer clone() times in strace. Using your test program across 18 nodes, I reproduced your results on the same 3 nodes where we were seeing slow clone() times. All nodes were provisioned the same way, but with slightly different hardware. We checked the BIOS, vmstat, and vm.overcommit_memory, and replaced the RAM, with no improvement. We then moved our drives to updated hardware and the issue was resolved.

CentOS 5.9 2.6.18-348.1.1.el5 #1 SMP Tue Jan 22 16:19:19 EST 2013 x86_64 x86_64 x86_64 GNU/Linux

"bad" and "good" lspci:

$ diff ../bad_lspci_sort ../good_lspci_sort 
< Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (rev 05)
> Ethernet controller: Intel Corporation 82574L Gigabit Network Connection

< Host bridge: Intel Corporation Xeon E3-1200 Processor Family DRAM Controller (rev 09)
> Host bridge: Intel Corporation Xeon E3-1200 v2/Ivy Bridge DRAM Controller (rev 09)

< ISA bridge: Intel Corporation C204 Chipset Family LPC Controller (rev 05)
> ISA bridge: Intel Corporation C202 Chipset Family LPC Controller (rev 05)

< PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 6 (rev b5)
> PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 7 (rev b5)

< VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200e [Pilot] ServerEngines (SEP1) (rev 04)
> VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 (rev 0a)


I might start by using strace to see what system calls are being run, and where the slow ones hang. I'm also curious as to how you're using waitpid() here. On my systems, the signature for waitpid is

pid_t waitpid(pid_t pid, int *status, int options);

It sort of looks like you meant wait(), but you're calling waitpid() with only the child's pid and leaving out the int *status and options arguments entirely. I would expect that to cause some strange behavior if whatever garbage ends up in the options argument is interpreted as a flag mask (e.g. WNOHANG).
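
For comparison, here is a minimal sketch of the same loop calling waitpid() with its full three-argument signature. This is only an illustration of the signature above, not necessarily what the original program intended:

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int i, status;
    for (i = 0; i < 10000; i++)
    {
        pid_t p = fork();
        if (p == 0)
            _exit(0);              /* child exits immediately */
        waitpid(p, &status, 0);    /* parent blocks until that child is reaped */
    }
    return 0;
}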


Differences to look at are the kernel (parameters, device drivers, active modules) and the hardware (CPU model, number of CPUs, memory configuration, peripheral devices).

Also: do machines change behavior after a reboot/power cycle?

EDIT:

The low performance is probably related to the (virtual) memory hierarchy. This hierarchy is very complex, and that complexity can lead to strange effects. Somewhere along the path from the TLB through the data caches to main memory, strange conflicts may occur. These can be caused by a slightly different memory layout of the kernels on the different machines, or because the memory hierarchy (hardware) is actually slightly different.

Of course there could be other reasons: a strange peripheral (generating interrupts), a different workload (e.g. the number of active processes), and so on.

If you can solve this problem, please share the results! Thanks.


Have you checked the BIOS configuration, in case the CPU caches are disabled, the power management settings are misconfigured, some systems are overheating, or some memory is underclocked?


Are the values of /sbin/sysctl vm.overcommit_memory the same for all the systems? If not, that could explain the difference.

Allowing overcommitment will make fork() much faster, but it means newly-allocated pages won't be guaranteed backing by RAM or swap. If and when you touch an unbacked page, the OS has to find backing for it, and if it can't, the process gets killed. This isn't a problem if all the child does is exec(), as happens in a system() call, since all those unbacked pages are discarded; however, it could lead to serious problems for other users of fork().
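
If you want to put a number on it, something like the quick sketch below (just the poster's fork/wait loop wrapped with gettimeofday() timing) could be run on a fast and a slow machine, or before and after changing the setting, to compare the per-fork cost:

#include <stdio.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define ITERATIONS 10000

int main(void)
{
    struct timeval start, end;
    int i, status;

    gettimeofday(&start, NULL);
    for (i = 0; i < ITERATIONS; i++)
    {
        pid_t p = fork();
        if (p == 0)
            _exit(0);               /* child does nothing but exit */
        waitpid(p, &status, 0);     /* reap the child before the next fork */
    }
    gettimeofday(&end, NULL);

    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_usec - start.tv_usec) / 1e6;
    printf("%d fork/wait cycles: %.3f s total, %.1f us per fork\n",
           ITERATIONS, elapsed, elapsed * 1e6 / ITERATIONS);
    return 0;
}

Comparing the per-fork figure with vm.overcommit_memory set to 0, 1, and 2 (heuristic, always, and never overcommit) should show whether that setting accounts for the ~50x difference.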
