Back to basics – gprof

A while ago at work we were discussing the pros and cons of the classic Linux profiler gprof. In a previous job I spent a lot of time looking at profiles, so hopefully this writeup is useful.

What are profiles useful for?

Profiles allow you to observe what a program is doing at runtime without access to the source code. A software performance regression or improvement in a given program will often show up in a profile as a change in either duration or number of calls to some set of functions that changed between software builds. The output of a profile run against a very simple program looks like the example below.

It tells us that of the 5.82 seconds of runtime, 2.38 seconds were spent in main() itself and 3.44 seconds in functions called by main(). It also tells us that the function return3() was called 1/3 of the time by calcme1 and 2/3 by calcme2. Profiles can give a good overview of where time is spent, albeit with some overhead.

                     Call graph (explanation follows)


granularity: each sample hit covers 4 byte(s) for 0.17% of 5.82 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]    100.0    2.38    3.44                 main [1]
                0.88    0.55 666666667/666666667     calcme2 [2]
                0.73    0.28 333333333/333333333     calcme1 [3]
                0.99    0.00 1000000000/1000000000     increment [4]
-----------------------------------------------
                0.88    0.55 666666667/666666667     main [1]
[2]     24.6    0.88    0.55 666666667         calcme2 [2]
                0.55    0.00 666666667/1000000000     return3 [5]
-----------------------------------------------
                0.73    0.28 333333333/333333333     main [1]
[3]     17.3    0.73    0.28 333333333         calcme1 [3]
                0.28    0.00 333333333/1000000000     return3 [5]
-----------------------------------------------
                0.99    0.00 1000000000/1000000000     main [1]
[4]     17.1    0.99    0.00 1000000000         increment [4]
-----------------------------------------------
                0.28    0.00 333333333/1000000000     calcme1 [3]
                0.55    0.00 666666667/1000000000     calcme2 [2]
[5]     14.3    0.83    0.00 1000000000         return3 [5]
-----------------------------------------------

Working with gprof

When we want to obtain a profile we ask gcc to insert additional code (a function call) into the compiled binary, then use the gprof tool to process the output. No special hardware support is needed, so there are no issues running the tool in a VM or container, whereas some tools need direct access to the CPU via pass-through, or require you to be root.

The general workflow is as follows:

Build a binary using the flag -pg. This creates a “special” build of your program which implements profiling; the program should operate as normal, albeit a bit slower. The implementation of profiling is detailed in the gprof man/help/doc page.

There will be an additional symbol/function inserted into the binary that gcc produces. In the output of nm you will see this function; it is called mcount. This is the function that does the work of figuring out which function calls what, so an easy way to see if a binary has profiling enabled is to look for that symbol.

During compilation/linking, gcc inserts a call to mcount into each function in the assembly/binary that it produces for you.

How to determine if a binary has profiling enabled

In the output below the symbol counter is just a variable in my program.

gary:~/git/unixfun/simple-call$ cc -o 1b-pg -pg simple_call_1B.c
gary:~/git/unixfun/simple-call$ nm ./1b-pg |grep count

0000000000004018 B counter
                 U mcount@GLIBC_2.2.5    <----- This binary has profiling enabled

gary:~/git/unixfun/simple-call$ cc -o 1b simple_call_1B.c
gary@perflabjump:~/git/unixfun/simple-call$ nm ./1b |grep count

0000000000004014 B counter
                                         <------- No mcount, means no profiling

How is the profiling actually achieved?

In the assembly you will see calls to mcount. To see the ASM produced, use objdump -d. Here is part of the output of objdump -d <binary>:

    1242:       53                      push   %rbx
    1243:       ff 15 9f 2d 00 00       call   *0x2d9f(%rip) # 3fe8 <mcount@GLIBC_2.2.5>  <----- this is only seen in pg builds
    1249:       bb 00 ca 9a 3b          mov    $0x3b9aca00,%ebx

Reading the profile

When you run the binary compiled with -pg it will create a binary output file called gmon.out in the current working directory.

Then use the gprof tool to read gmon.out. gprof needs to reference the binary (compiled with -pg) to show the right symbols/function names. In our example, after the run completes, issue gprof 1b-pg: the gprof command will read the file gmon.out and use the symbols in the binary to display the correct function names.

Output of gprof

The output of gprof is just text containing information on the program as it ran, so you can read it with a standard text editor. There are two main sections, the Flat Profile and the Call Graph, e.g. something like this:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ns/call  ns/call  name    
 40.98      2.38     2.38                             main
 17.10      3.38     0.99 1000000000     0.99     0.99  increment
 15.12      4.26     0.88 666666667     1.32     2.15  calcme2
 14.26      5.09     0.83 1000000000     0.83     0.83  return3
 12.54      5.82     0.73 333333333     2.19     3.02  calcme1

Overhead of profiling

The overhead of using -pg comes from the mcount call inserted into every function, which means the overhead depends on how many calls are made and how “large” the functions are. For instance, in the trivial example below the functions are very short and do almost nothing, so the overhead of that additional call to mcount is quite high. If the functions were larger, the fraction of time spent in mcount relative to the work done in the functions would be much smaller.

Profiling overhead for tiny functions called many times

In the examples below (tiny functions) the overhead is roughly 7X with profiling enabled (1.94 s vs 13.3 s).

$ time ./1b-O # Optimized - no profiling
real    0m1.943s
user    0m1.941s
sys     0m0.002s

$ time ./1b-O-pg # Optimized - profiling enabled
real    0m13.314s
user    0m13.304s
sys     0m0.003s

Profiling overhead with larger functions

If I change the function “increment” to spin 10,000 times, the overhead is only around 2X.

gary@perflabjump:~/git/unixfun/simple-call$ time ./1b-O-spin

real    0m17.494s
user    0m17.482s
sys     0m0.002s
gary@perflabjump:~/git/unixfun/simple-call$ time ./1b-O-spin-pg 

real    0m34.647s
user    0m34.632s
sys     0m0.003s

Profiling overhead for large functions called few times

If I decrease the number of calls to “increment” and increase the number of spins, I expect the overhead to decrease. With a high ratio of spins to calls (1B:1) the overhead is barely measurable, which is what we expect because there are very few function calls relative to how long processing spends inside the function.

long z=10000000000;
    for (long i=0; i<z; i++) {
        z=z-1;
    }

Above is the spin loop inserted into the increment function.

$ time ./1b-O-spin ; time ./1b-O-spin-pg

real    0m18.070s
user    0m18.056s
sys     0m0.002s

real    0m18.192s
user    0m18.183s
sys     0m0.001s

The output of gprof is below; all the time is spent in the increment function.

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
100.00     18.18    18.18       10     1.82     1.82  increment


                     Call graph (explanation follows)


granularity: each sample hit covers 4 byte(s) for 0.06% of 18.18 seconds

index % time    self  children    called     name
               18.18    0.00      10/10          main [2]
[1]    100.0   18.18    0.00      10         increment [1]
-----------------------------------------------
                                                 <spontaneous>
[2]    100.0    0.00   18.18                 main [2]
               18.18    0.00      10/10          increment [1]

Examples

Example 1

  • default compilation cc -o 1b simple_call_1B.c
  • compilation allowing compiler optimization cc -o 1b-O -O simple_call_1B.c
  • default compilation with profiling cc -o 1b-pg -pg simple_call_1B.c
  • compilation allowing compiler optimization and profiling cc -o 1b-O-pg -O -pg simple_call_1B.c

Example 1 – Unoptimized

A simple userland binary that makes function calls

Example 1 Timings

Default compilation (no compiler optimization) with/without profiling

gary:~/git/unixfun/simple-call$ time ./1b 
real    0m7.219s
user    0m7.212s
sys     0m0.004s

gary@:~/git/unixfun/simple-call$ time ./1b-pg 
real    0m43.157s
user    0m43.127s
sys     0m0.002s

Optimized by compiler -O with/without profiling

$ time ./1b-O
real    0m1.943s
user    0m1.941s
sys     0m0.002s

$ time ./1b-O-pg
real    0m13.314s
user    0m13.304s
sys     0m0.003s

Example 1 gprof output – unoptimized

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ns/call  ns/call  name    
 40.98      2.38     2.38                             main
 17.10      3.38     0.99 1000000000     0.99     0.99  increment
 15.12      4.26     0.88 666666667     1.32     2.15  calcme2
 14.26      5.09     0.83 1000000000     0.83     0.83  return3
 12.54      5.82     0.73 333333333     2.19     3.02  calcme1

Call graph

                     Call graph (explanation follows)


granularity: each sample hit covers 4 byte(s) for 0.17% of 5.82 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]    100.0    2.38    3.44                 main [1]
                0.88    0.55 666666667/666666667     calcme2 [2]
                0.73    0.28 333333333/333333333     calcme1 [3]
                0.99    0.00 1000000000/1000000000     increment [4]
-----------------------------------------------
                0.88    0.55 666666667/666666667     main [1]
[2]     24.6    0.88    0.55 666666667         calcme2 [2]
                0.55    0.00 666666667/1000000000     return3 [5]
-----------------------------------------------
                0.73    0.28 333333333/333333333     main [1]
[3]     17.3    0.73    0.28 333333333         calcme1 [3]
                0.28    0.00 333333333/1000000000     return3 [5]
-----------------------------------------------
                0.99    0.00 1000000000/1000000000     main [1]
[4]     17.1    0.99    0.00 1000000000         increment [4]
-----------------------------------------------
                0.28    0.00 333333333/1000000000     calcme1 [3]
                0.55    0.00 666666667/1000000000     calcme2 [2]
[5]     14.3    0.83    0.00 1000000000         return3 [5]
-----------------------------------------------

Example 1 gprof output – optimized

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ps/call  ps/call  name    
 52.14      0.73     0.73 1000000000   730.00   730.00  increment
 37.50      1.25     0.53                             main
 10.36      1.40     0.14                             calcme2

Call graph

                     Call graph (explanation follows)


granularity: each sample hit covers 4 byte(s) for 0.71% of 1.40 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]     89.6    0.53    0.73                 main [1]
                0.73    0.00 1000000000/1000000000     increment [2]
-----------------------------------------------
                0.73    0.00 1000000000/1000000000     main [1]
[2]     52.1    0.73    0.00 1000000000         increment [2]
-----------------------------------------------
                                                 <spontaneous>
[3]     10.4    0.14    0.00                 calcme2 [3]

In this optimized case, we see calcme2 shown as <spontaneous>, which means that gprof can’t figure out who called it. What is interesting is that in the ASM for the optimized case we don’t see any calls to calcme2 in the main loop.

ASM (optimized no profile)

000000000000114e <main>:
    114e:       f3 0f 1e fa             endbr64 
    1152:       53                      push   %rbx
    1153:       bb 00 ca 9a 3b          mov    $0x3b9aca00,%ebx
    1158:       b8 00 00 00 00          mov    $0x0,%eax
    115d:       e8 e0 ff ff ff          call   1142 <increment>
    1162:       83 eb 01                sub    $0x1,%ebx
    1165:       75 f1                   jne    1158 <main+0xa>
    1167:       b8 00 00 00 00          mov    $0x0,%eax
    116c:       5b                      pop    %rbx
    116d:       c3                      ret

Whereas in the optimized+profile case the ASM is a bit different, presumably to accommodate profiling. But we still don’t see any call to calcme2; most likely the optimizer inlined the small functions, which would explain why gprof cannot attribute their time to a caller.
ASM (optimized + profile) (Output of objdump -d)

000000000000123a <main>:
    123a:       f3 0f 1e fa             endbr64 
    123e:       55                      push   %rbp
    123f:       48 89 e5                mov    %rsp,%rbp
    1242:       53                      push   %rbx
    1243:       ff 15 9f 2d 00 00       call   *0x2d9f(%rip)        # 3fe8 <mcount@GLIBC_2.2.5>
    1249:       bb 00 ca 9a 3b          mov    $0x3b9aca00,%ebx
    124e:       b8 00 00 00 00          mov    $0x0,%eax
    1253:       e8 cb ff ff ff          call   1223 <increment>
    1258:       83 eb 01                sub    $0x1,%ebx
    125b:       75 f1                   jne    124e <main+0x14>
    125d:       b8 00 00 00 00          mov    $0x0,%eax
    1262:       5b                      pop    %rbx
    1263:       5d                      pop    %rbp
    1264:       c3                      ret    
    1265:       66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
    126c:       00 00 00 
    126f:       90                      nop

Example 1 source

Runs 1B times

#include <stdio.h>
#include <stdlib.h>

/* 
  runs 1 Billion times
*/

int counter = 0;

int return3(int f) {
    return(f*3);
}   

int calcme1(int f) {
    return(return3(f));
}

int calcme2(int f) {
    return(return3(f+1));
}

void increment() {
    counter +=1;
}

int main() {
    int i,res;
    int iterations = 1000000000; 

    for (i = 0; i < iterations; i++) {
        // Your main loop code here

        // Call calcme every three iterations
        if ((i + 1) % 3 == 0) {
            res=calcme1(i);
        } else {
            res=calcme2(i);
        }
        increment();
    }
    return 0;
}
