Linux application profiling

2022-12-19 10:54

How can I record the performance of an application on a Linux machine? I won't have an IDE.

Ideally, I need an application that will attach to a process and log periodic snapshots of:

memory usage
number of threads
CPU usage

> Ideally, I need an application that will attach to a process and log periodic snapshots of: memory usage, number of threads, CPU usage

To collect this kind of information about your process on Linux, you don't actually need a profiler.

You can use top in batch mode. It runs in batch mode either until it is killed or until N iterations are done:

```
top -b -p `pidof a.out`

top -b -p `pidof a.out` -n 100
```

and you will get this:

```
$ top -b -p `pidof a.out`

top - 10:31:50 up 12 days, 19:08,  5 users,  load average: 0.02, 0.01, 0.02
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16330584k total,  2335024k used, 13995560k free,   241348k buffers
Swap:  4194296k total,        0k used,  4194296k free,  1631880k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
24402 SK        20   0 98.7m 1056  860 S 43.9  0.0   0:11.87 a.out


top - 10:31:53 up 12 days, 19:08,  5 users,  load average: 0.02, 0.01, 0.02
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.9%us,  3.7%sy,  0.0%ni, 95.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16330584k total,  2335148k used, 13995436k free,   241348k buffers
Swap:  4194296k total,        0k used,  4194296k free,  1631880k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
24402 SK        20   0 98.7m 1072  860 S 19.0  0.0   0:12.44 a.out
```
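
To turn this batch output into a compact log, you can drop the summary lines and keep only the row for the process. The following is a minimal sketch of that idea using awk. The helper name `top_rows` and the choice of output columns are my own, and the field positions assume the default top column order shown above (PID in column 1, RES in 6, %CPU in 9, %MEM in 10); a customized top layout would need different indexes.

```shell
#!/bin/sh
# top_rows PID: read `top -b` output on stdin and print one CSV line
# (pid,res,cpu_percent,mem_percent) per iteration for the given PID.
# Hypothetical helper; assumes top's default column order:
# PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
top_rows() {
    awk -v pid="$1" '$1 == pid { print $1 "," $6 "," $9 "," $10 }'
}

# Example (not run here): sample a.out every 5 seconds, 100 times.
# pid=$(pidof a.out)
# top -b -d 5 -n 100 -p "$pid" | top_rows "$pid" > a.out.top.csv
```

For the first sample shown above this would print `24402,1056,43.9,0.0`.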

You can also use ps (for instance, in a shell script):
```
ps --format pid,pcpu,cputime,etime,size,vsz,cmd -p `pidof a.out`
```
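
The ps command above does not report the number of threads, which the question also asks for. As a sketch of what such a shell script could look like, the functions below log timestamped snapshots at a fixed interval. The function names are hypothetical, and the script assumes the procps version of ps, where `nlwp` is the thread count and `rss`/`vsz` are reported in KiB.

```shell
#!/bin/sh
# snapshot PID: print one line "epoch_seconds pid nlwp pcpu rss vsz".
# nlwp = number of threads, rss/vsz = resident/virtual size in KiB.
# The "=" after each column name suppresses the header line.
# Returns non-zero if the process no longer exists.
snapshot() {
    row=$(ps -o pid= -o nlwp= -o pcpu= -o rss= -o vsz= -p "$1") || return 1
    printf '%s %s\n' "$(date +%s)" "$row"
}

# log_snapshots PID INTERVAL COUNT: take COUNT snapshots, INTERVAL seconds
# apart, stopping early if the process exits.
log_snapshots() {
    i=0
    while [ "$i" -lt "$3" ]; do
        snapshot "$1" || break
        i=$((i + 1))
        if [ "$i" -lt "$3" ]; then sleep "$2"; fi
    done
}

# Example (not run here): log a.out every 5 seconds, 100 times.
# log_snapshots "$(pidof a.out)" 5 100 >> a.out.ps.log
```

Note that the `pcpu` value from ps is CPU time divided by the lifetime of the process, so it is an average rather than an instantaneous reading; the %CPU column of top reflects the last sampling interval.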
> I need some means of recording the performance of an application on a Linux machine

To do this, use perf if your Linux kernel is 2.6.32 or newer, or OProfile if it is older. Neither tool requires you to instrument your program (as Gprof does). However, to get a correct call graph in perf you need to build your program with -fno-omit-frame-pointer. For example: g++ -fno-omit-frame-pointer -O2 main.cpp.

As for Linux perf:

To record performance data:

```
perf record -p `pidof a.out`
```

or to record for 10 seconds:

```
perf record -p `pidof a.out` sleep 10
```

or to record with a call graph:

```
perf record -g -p `pidof a.out`
```

To analyze the recorded data:
```
perf report --stdio
perf report --stdio --sort=dso -g none
perf report --stdio -g none
perf report --stdio -g
```
On RHEL 6.3, /boot/System.map-2.6.32-279.el6.x86_64 is readable, so I usually add --kallsyms=/boot/System.map-2.6.32-279.el6.x86_64 when running perf report:
```
perf report --stdio -g --kallsyms=/boot/System.map-2.6.32-279.el6.x86_64
```
Here I wrote some more information on using Linux `perf`:

First of all, this is a tutorial about Linux profiling with perf.

You can use perf if your Linux kernel is 2.6.32 or newer, or OProfile if it is older. Neither tool requires you to instrument your program (as Gprof does). However, to get a correct call graph in perf you need to build your program with -fno-omit-frame-pointer. For example: g++ -fno-omit-frame-pointer -O2 main.cpp.

You can see a "live" analysis of your application with perf top:
```
 sudo perf top -p `pidof a.out` -K
```

Or you can record performance data of a running application and analyze it afterwards.

To record performance data:

```
perf record -p `pidof a.out`
```

or to record for 10 seconds:

```
perf record -p `pidof a.out` sleep 10
```

or to record with a call graph:

```
perf record -g -p `pidof a.out`
```

To analyze the recorded data:

```
perf report --stdio
perf report --stdio --sort=dso -g none
perf report --stdio -g none
perf report --stdio -g
```

Alternatively, you can launch the application under perf, wait for it to exit, and analyze the recorded data afterwards:

```
perf record ./a.out
```

This is an example of profiling a test program.

The test program is in the file main.cpp (listed at the bottom of this answer).

I compile it in this way:

```
g++ -m64 -fno-omit-frame-pointer -g main.cpp -L. -ltcmalloc_minimal -o my_test
```

I use libtcmalloc_minimal.so since it is compiled with -fno-omit-frame-pointer, while libc malloc seems to be compiled without this option. Then I run my test program:

```
./my_test 100000000
```

Then I record performance data of the running process:

```
perf record -g -p `pidof my_test` -o ./my_test.perf.data sleep 30
```

Then I analyze the load per module:

```
perf report --stdio -g none --sort comm,dso -i ./my_test.perf.data

# Overhead  Command                 Shared Object
# ........  .......  ............................
#
    70.06%  my_test  my_test
    28.33%  my_test  libtcmalloc_minimal.so.0.1.0
     1.61%  my_test  [kernel.kallsyms]
```

Then I analyze the load per function:

```
perf report --stdio -g none -i ./my_test.perf.data | c++filt

# Overhead  Command                 Shared Object                       Symbol
# ........  .......  ............................  ...........................
#
    29.30%  my_test  my_test                       [.] f2(long)
    29.14%  my_test  my_test                       [.] f1(long)
    15.17%  my_test  libtcmalloc_minimal.so.0.1.0  [.] operator new(unsigned long)
    13.16%  my_test  libtcmalloc_minimal.so.0.1.0  [.] operator delete(void*)
     9.44%  my_test  my_test                       [.] process_request(long)
     1.01%  my_test  my_test                       [.] operator delete(void*)@plt
     0.97%  my_test  my_test                       [.] operator new(unsigned long)@plt
     0.20%  my_test  my_test                       [.] main
     0.19%  my_test  [kernel.kallsyms]             [k] apic_timer_interrupt
     0.16%  my_test  [kernel.kallsyms]             [k] _spin_lock
     0.13%  my_test  [kernel.kallsyms]             [k] native_write_msr_safe

     and so on ...
```
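
Since the per-function report can run to many lines, it may help to keep only the hottest entries when logging results from a script. The filter below is a small sketch of that; the name `top_entries` is made up for this example, and it assumes the flat `-g none` layout shown above, where `#` lines are headers and every other non-blank line is one entry, already sorted by overhead.

```shell
#!/bin/sh
# top_entries N: read `perf report --stdio -g none` output on stdin and
# print only the first N data rows, skipping '#' header lines and blanks.
# Hypothetical helper; relies on perf having sorted rows by overhead.
top_entries() {
    awk -v n="$1" '
        /^#/ || NF == 0 { next }   # skip headers and empty lines
        seen >= n       { exit }   # stop once N rows were printed
                        { print; seen++ }'
}

# Example (not run here): keep the five most expensive functions.
# perf report --stdio -g none -i ./my_test.perf.data | c++filt | top_entries 5
```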

Then I analyze the call chains:

```
perf report --stdio -g graph -i ./my_test.perf.data | c++filt

# Overhead  Command                 Shared Object                       Symbol
# ........  .......  ............................  ...........................
#
    29.30%  my_test  my_test                       [.] f2(long)
            |
            --- f2(long)
               |
                --29.01%-- process_request(long)
                          main
                          __libc_start_main

    29.14%  my_test  my_test                       [.] f1(long)
            |
            --- f1(long)
               |
               |--15.05%-- process_request(long)
               |          main
               |          __libc_start_main
               |
                --13.79%-- f2(long)
                          process_request(long)
                          main
                          __libc_start_main

    15.17%  my_test  libtcmalloc_minimal.so.0.1.0  [.] operator new(unsigned long)
            |
            --- operator new(unsigned long)
               |
               |--11.44%-- f1(long)
               |          |
               |          |--5.75%-- process_request(long)
               |          |          main
               |          |          __libc_start_main
               |          |
               |           --5.69%-- f2(long)
               |                     process_request(long)
               |                     main
               |                     __libc_start_main
               |
                --3.01%-- process_request(long)
                          main
                          __libc_start_main

    13.16%  my_test  libtcmalloc_minimal.so.0.1.0  [.] operator delete(void*)
            |
            --- operator delete(void*)
               |
               |--9.13%-- f1(long)
               |          |
               |          |--4.63%-- f2(long)
               |          |          process_request(long)
               |          |          main
               |          |          __libc_start_main
               |          |
               |           --4.51%-- process_request(long)
               |                     main
               |                     __libc_start_main
               |
               |--3.05%-- process_request(long)
               |          main
               |          __libc_start_main
               |
                --0.80%-- f2(long)
                          process_request(long)
                          main
                          __libc_start_main

     9.44%  my_test  my_test                       [.] process_request(long)
            |
            --- process_request(long)
               |
                --9.39%-- main
                          __libc_start_main

     1.01%  my_test  my_test                       [.] operator delete(void*)@plt
            |
            --- operator delete(void*)@plt

     0.97%  my_test  my_test                       [.] operator new(unsigned long)@plt
            |
            --- operator new(unsigned long)@plt

     0.20%  my_test  my_test                       [.] main
     0.19%  my_test  [kernel.kallsyms]             [k] apic_timer_interrupt
     0.16%  my_test  [kernel.kallsyms]             [k] _spin_lock
     and so on ...
```

So at this point you know where your program spends time.

And this is the main.cpp file for the test:

```
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Increments the value 10 times; allocates and frees a double at j = 0 and j = 5.
time_t f1(time_t time_value)
{
  for (int j = 0; j < 10; ++j) {
    ++time_value;
    if (j%5 == 0) {
      double *p = new double;
      delete p;
    }
  }
  return time_value;
}

// Increments the value 40 times, then delegates to f1.
time_t f2(time_t time_value)
{
  for (int j = 0; j < 40; ++j) {
    ++time_value;
  }
  time_value = f1(time_value);
  return time_value;
}

// Does some allocation and counting itself, then calls f1 and f2 ten times each.
time_t process_request(time_t time_value)
{
  for (int j = 0; j < 10; ++j) {
    int *p = new int;
    delete p;
    for (int m = 0; m < 10; ++m) {
      ++time_value;
    }
  }
  for (int i = 0; i < 10; ++i) {
    time_value = f1(time_value);
    time_value = f2(time_value);
  }
  return time_value;
}

int main(int argc, char* argv[])
{
  // The number of iterations is taken from the first command-line argument.
  int number_loops = argc > 1 ? atoi(argv[1]) : 1;
  time_t time_value = time(0);
  printf("number loops %d\n", number_loops);
  // Cast explicitly: time_t is not guaranteed to match %d or %ld.
  printf("time_value: %ld\n", (long)time_value);

  for (int i = 0; i < number_loops; ++i) {
    time_value = process_request(time_value);
  }
  printf("time_value: %ld\n", (long)time_value);
  return 0;
}
```

Quoting Linus Torvalds himself:

> Don't use gprof. You're much better off using the newish Linux 'perf' tool.

And later ...

> I can pretty much guarantee that once you start using it, you'll never use gprof or oprofile again.

See Re: [PATCH] grep: do not do external grep on skip-worktree entries (2010-01-04)

If you are looking for things to do to possibly speed up the program, you need stackshots. A simple way to do this is to use the pstack utility, or lsstack if you can get it.
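
As a rough sketch of how stackshots could be collected without an IDE, the function below calls pstack several times at a fixed interval. The function name and the `STACK_CMD` override variable are hypothetical additions for this example; only the `pstack PID` invocation itself comes from the tool.

```shell
#!/bin/sh
# stackshots PID COUNT INTERVAL: print COUNT stack dumps of PID, INTERVAL
# seconds apart, each preceded by a "=== shot N ===" marker line.
# STACK_CMD is a hypothetical override for the dump command; it defaults
# to pstack and is called with the PID as its only argument.
stackshots() {
    i=1
    while [ "$i" -le "$2" ]; do
        echo "=== shot $i ==="
        ${STACK_CMD:-pstack} "$1" || return 1
        if [ "$i" -lt "$2" ]; then sleep "$3"; fi
        i=$((i + 1))
    done
}

# Example (not run here): 10 shots of a.out, one second apart.
# stackshots "$(pidof a.out)" 10 1 > a.out.stacks.txt
```

The usual way to read the result is that a function or line appearing in a large fraction of the shots is on the stack for roughly that fraction of wall-clock time, which makes it a candidate for optimization.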

You can do better than Gprof. If you want to use an official profiling tool, you want something that samples the call stack on wall-clock time and presents line-level cost, such as OProfile or RotateRight Zoom.

You can use Valgrind. It records data in a file which you can analyse later using a proper GUI, like KCacheGrind.

A usage example would be:

```
valgrind --tool=callgrind --dump-instr=yes --simulate-cache=yes your_program
```

It'll generate a file called callgrind.out.xxx where xxx is the PID of the program.
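
If no GUI is available on the machine, the same file can be summarized in the terminal with callgrind_annotate, which ships with Valgrind. Because the file name depends on the PID, a script needs a way to find it; the helper below is one possible sketch (the name `newest_callgrind_out` is made up, and it assumes the output files are in a directory you pass in or in the current one).

```shell
#!/bin/sh
# newest_callgrind_out [DIR]: print the most recently modified
# callgrind.out.* file in DIR (default: current directory).
# Prints nothing if no such file exists. Parsing ls is acceptable here
# because callgrind output names contain no spaces or newlines.
newest_callgrind_out() {
    ls -t "${1:-.}"/callgrind.out.* 2>/dev/null | head -n 1
}

# Example (not run here): text summary of the latest run.
# callgrind_annotate "$(newest_callgrind_out)"
```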

Unlike Gprof, Valgrind works with many different languages, including Java, with some limitations.

Look into Gprof. You need to compile the code with the -pg option, which instruments the code. After that, you can run the program and use Gprof to see the results.

You can also try out cpuprofiler.com. It gets the information you would normally get from the top command, and the CPU usage data can even be viewed remotely from a web browser.
