A while ago at work we were discussing the pros and cons of the classic Linux profiler gprof. In a previous job I spent a lot of time looking at profiles, so hopefully this write-up is useful.
What are profiles useful for?
Profiles allow you to observe what a program is doing at runtime without access to the source code. A software performance regression or improvement in a given program will often show up in a profile as a change in either duration or number of calls to some set of functions that changed between software builds. The output of a profile run against a very simple program looks like the example below.
It tells us that of the 5.82 seconds of runtime, 2.38 seconds were spent in main() itself and 3.44 seconds in functions called by main(). It also tells us that return3() was called 1/3 of the time by calcme1 and 2/3 of the time by calcme2 (the called column records this as 333333333/1000000000 and 666666667/1000000000). Profiles can give a good overview of where time is spent – albeit with some overhead.
Call graph (explanation follows)
granularity: each sample hit covers 4 byte(s) for 0.17% of 5.82 seconds
index % time self children called name
<spontaneous>
[1] 100.0 2.38 3.44 main [1]
0.88 0.55 666666667/666666667 calcme2 [2]
0.73 0.28 333333333/333333333 calcme1 [3]
0.99 0.00 1000000000/1000000000 increment [4]
-----------------------------------------------
0.88 0.55 666666667/666666667 main [1]
[2] 24.6 0.88 0.55 666666667 calcme2 [2]
0.55 0.00 666666667/1000000000 return3 [5]
-----------------------------------------------
0.73 0.28 333333333/333333333 main [1]
[3] 17.3 0.73 0.28 333333333 calcme1 [3]
0.28 0.00 333333333/1000000000 return3 [5]
-----------------------------------------------
0.99 0.00 1000000000/1000000000 main [1]
[4] 17.1 0.99 0.00 1000000000 increment [4]
-----------------------------------------------
0.28 0.00 333333333/1000000000 calcme1 [3]
0.55 0.00 666666667/1000000000 calcme2 [2]
[5] 14.3 0.83 0.00 1000000000 return3 [5]
-----------------------------------------------
Working with gprof
To obtain a profile we ask gcc to insert additional code (a function call) into the compiled binary, then we use the gprof tool to process the output. No special hardware support is needed, so there are no issues running it in a VM or container, whereas some other tools need direct access to the CPU (e.g. via pass-through) or require root.
The general workflow is as follows:
- Build a binary using the flag -pg. This creates a “special” build of your program which implements profiling – the program should operate as normal, albeit a bit slower. The implementation of profiling is detailed in the gprof man/help/doc page.
- gcc inserts an additional symbol/function into the binary it produces, called mcount – you will see it in the output of nm. This is the function that does the work of figuring out which function calls what, so an easy way to check whether a binary has profiling enabled is to look for that symbol.
- During compilation/linking gcc inserts a call to mcount into each function it compiles for you (a sketch of a closely related mechanism you can experiment with follows below).
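You can see the same idea in miniature without gprof: gcc’s -finstrument-functions option inserts a call at every function entry and exit, much like -pg inserts mcount. The sketch below is not what -pg actually does (that goes through mcount and glibc’s profiling machinery) – it is just an illustration of compiler-inserted per-call instrumentation, and the file name, hook bodies and the triple() function are my own invention.

/* hooks.c - build with: gcc -O0 -finstrument-functions -o hooks hooks.c */
#include <stdio.h>

static unsigned long entries;

/* gcc calls these hooks on every function entry/exit when the file is
   compiled with -finstrument-functions. They must not be instrumented
   themselves, hence the attribute. */
void __cyg_profile_func_enter(void *fn, void *call_site)
    __attribute__((no_instrument_function));
void __cyg_profile_func_exit(void *fn, void *call_site)
    __attribute__((no_instrument_function));

void __cyg_profile_func_enter(void *fn, void *call_site) {
    (void)fn; (void)call_site;
    entries++;                     /* count every instrumented entry */
}

void __cyg_profile_func_exit(void *fn, void *call_site) {
    (void)fn; (void)call_site;
}

static int triple(int x) { return x * 3; }

int main(void) {
    int sum = 0;
    for (int i = 0; i < 1000; i++)
        sum += triple(i);
    printf("sum=%d instrumented entries=%lu\n", sum, entries);
    return 0;
}

Running it should print sum=1498500 instrumented entries=1001 – one entry for main plus 1,000 for triple(). gprof’s mcount does something similar, but it also records who called whom and combines that with periodic PC sampling.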
How to determine if a binary has profiling enabled
In the output below, the symbol counter is just a variable in my program.
gary:~/git/unixfun/simple-call$ cc -o 1b-pg -pg simple_call_1B.c
gary:~/git/unixfun/simple-call$ nm ./1b-pg |grep count
0000000000004018 B counter
U mcount@GLIBC_2.2.5 <----- This binary has profiling enabled
gary:~/git/unixfun/simple-call$ cc -o 1b simple_call_1B.c
gary@perflabjump:~/git/unixfun/simple-call$ nm ./1b |grep count
0000000000004014 B counter
<------- No mcount, means no profiling
How is the profiling actually achieved?
In the assembly you will see calls to _mcount. To see the ASM produced you can use objdump -d. Here is part of the output of objdump -d <binary>:
1242: 53 push %rbx
1243: ff 15 9f 2d 00 00 call *0x2d9f(%rip) # 3fe8 <mcount@GLIBC_2.2.5> <----- this is only seen in pg builds
1249: bb 00 ca 9a 3b mov $0x3b9aca00,%ebx
Reading the profile
When you run the binary compiled with -pg it will create a binary output file called gmon.out in the current working directory. Then use the gprof tool to read gmon.out – gprof needs to reference the binary (the one compiled with -pg) in order to show the right symbols/function names. In our example, after the run completes, issue gprof 1b-pg and gprof will read gmon.out and use the symbols in the binary to display the correct function names.
Output of gprof
The output of gprof is just text describing the program as it ran (gprof writes to stdout, so redirect it to a file if you want to read it in an editor). There are two main sections, the Flat Profile and the Call Graph, e.g. something like this:
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ns/call ns/call name
40.98 2.38 2.38 main
17.10 3.38 0.99 1000000000 0.99 0.99 increment
15.12 4.26 0.88 666666667 1.32 2.15 calcme2
14.26 5.09 0.83 1000000000 0.83 0.83 return3
12.54 5.82 0.73 333333333 2.19 3.02 calcme1
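A quick note on reading the flat profile: self seconds is the time spent in the function itself and is what the table is sorted by, cumulative seconds is a running total down the table, and the two per-call columns are the average cost per call excluding and including the function’s children respectively. main has no call count because nothing instrumented with mcount called it, which is also why the call graph shows it as <spontaneous>.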
Overhead of profiling
The overhead of using -pg
is caused by inserting the _mcount
call into every function call (or return for e.g. main) – this means that the overhead is dependent on how many calls are made and how “large” the functions are. For instance in the trivial example below – the functions are very short and do almost nothing. Therefore the overhead of that additional call to _mcount
is quite high. If the functions were larger, the fraction of time spent in _mcount relative to the work done in the functions would be much smaller.
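As a rough back-of-the-envelope check using the unoptimized timings further down (7.2 seconds without profiling, 43.2 seconds with): each of the 1 billion loop iterations makes about three instrumented calls (calcme1 or calcme2, return3 and increment), so

(43.157s - 7.219s) / 3,000,000,000 calls ≈ 12 ns per instrumented call

which dwarfs the nanosecond or two of actual work each of these tiny functions does (see the ns/call column in the flat profile).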
Profiling overhead for tiny functions called many times
In the example below (tiny functions) the overhead is close to 7X with profiling enabled (1.9s vs 13.3s).
$ time ./1b-O # Optimized - no profiling
real 0m1.943s
user 0m1.941s
sys 0m0.002s
$ time ./1b-O-pg # Optimized - profiling enabled
real 0m13.314s
user 0m13.304s
sys 0m0.003s
Profiling overhead with larger functions
If I change the function “increment” to spin 10,000 times, the overhead is only around 2X:
gary@perflabjump:~/git/unixfun/simple-call$ time ./1b-O-spin
real 0m17.494s
user 0m17.482s
sys 0m0.002s
gary@perflabjump:~/git/unixfun/simple-call$ time ./1b-O-spin-pg
real 0m34.647s
user 0m34.632s
sys 0m0.003s
Profiling overhead for large functions called few times
If I decrease the number of calls to “increment” and increase the number of spins, I expect the overhead to decrease. With a very high ratio of spins to calls (billions of loop iterations per call) the overhead is barely measurable, which is what we expect because there are very few function calls relative to the time spent inside each function.
long z = 10000000000;
for (long i = 0; i < z; i++) {
    z = z - 1;
}
The above is the spin loop inserted into the increment function.
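Dropped into the example source (which defines the global counter), the modified increment() looks roughly like this – a sketch, since exactly where the loop sits relative to the counter update is my assumption and doesn’t change the timing argument:

void increment() {
    counter += 1;
    /* busy-work so each call does far more than the mcount bookkeeping */
    long z = 10000000000;
    for (long i = 0; i < z; i++) {
        z = z - 1;
    }
}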
$ time ./1b-O-spin ; time ./1b-O-spin-pg
real 0m18.070s
user 0m18.056s
sys 0m0.002s
real 0m18.192s
user 0m18.183s
sys 0m0.001s
The output of gprof is below – all the time is spent in the increment function
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
100.00 18.18 18.18 10 1.82 1.82 increment
Call graph (explanation follows)
granularity: each sample hit covers 4 byte(s) for 0.06% of 18.18 seconds
index % time self children called name
18.18 0.00 10/10 main [2]
[1] 100.0 18.18 0.00 10 increment [1]
-----------------------------------------------
<spontaneous>
[2] 100.0 0.00 18.18 main [2]
18.18 0.00 10/10 increment [1]
Examples
Example 1
- default compilation
cc -o 1b simple_call_1B.c
- compilation allowing compiler optimization
cc -o 1b-O -O simple_call_1B.c
- default compilation with profiling
cc -o 1b-pg -pg simple_call_1B.c
- compilation allowing compiler optimization and profiling
cc -o 1b-O-pg -O -pg simple_call_1B.c
Example 1 – Unoptimized
A simple userland binary that makes function calls
Example 1 Timings
Default compilation (no compiler optimization) with/without profiling
gary:~/git/unixfun/simple-call$ time ./1b
real 0m7.219s
user 0m7.212s
sys 0m0.004s
gary@:~/git/unixfun/simple-call$ time ./1b-pg
real 0m43.157s
user 0m43.127s
sys 0m0.002s
Optimized by compiler -O
with/without profiling
$ time ./1b-O
real 0m1.943s
user 0m1.941s
sys 0m0.002s
$ time ./1b-O-pg
real 0m13.314s
user 0m13.304s
sys 0m0.003s
Example 1 gprof output – unoptimized
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ns/call ns/call name
40.98 2.38 2.38 main
17.10 3.38 0.99 1000000000 0.99 0.99 increment
15.12 4.26 0.88 666666667 1.32 2.15 calcme2
14.26 5.09 0.83 1000000000 0.83 0.83 return3
12.54 5.82 0.73 333333333 2.19 3.02 calcme1
Call graph
Call graph (explanation follows)
granularity: each sample hit covers 4 byte(s) for 0.17% of 5.82 seconds
index % time self children called name
<spontaneous>
[1] 100.0 2.38 3.44 main [1]
0.88 0.55 666666667/666666667 calcme2 [2]
0.73 0.28 333333333/333333333 calcme1 [3]
0.99 0.00 1000000000/1000000000 increment [4]
-----------------------------------------------
0.88 0.55 666666667/666666667 main [1]
[2] 24.6 0.88 0.55 666666667 calcme2 [2]
0.55 0.00 666666667/1000000000 return3 [5]
-----------------------------------------------
0.73 0.28 333333333/333333333 main [1]
[3] 17.3 0.73 0.28 333333333 calcme1 [3]
0.28 0.00 333333333/1000000000 return3 [5]
-----------------------------------------------
0.99 0.00 1000000000/1000000000 main [1]
[4] 17.1 0.99 0.00 1000000000 increment [4]
-----------------------------------------------
0.28 0.00 333333333/1000000000 calcme1 [3]
0.55 0.00 666666667/1000000000 calcme2 [2]
[5] 14.3 0.83 0.00 1000000000 return3 [5]
-----------------------------------------------
Example 1 gprof output – optimized
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ps/call ps/call name
52.14 0.73 0.73 1000000000 730.00 730.00 increment
37.50 1.25 0.53 main
10.36 1.40 0.14 calcme2
Call graph
Call graph (explanation follows)
granularity: each sample hit covers 4 byte(s) for 0.71% of 1.40 seconds
index % time self children called name
<spontaneous>
[1] 89.6 0.53 0.73 main [1]
0.73 0.00 1000000000/1000000000 increment [2]
-----------------------------------------------
0.73 0.00 1000000000/1000000000 main [1]
[2] 52.1 0.73 0.00 1000000000 increment [2]
-----------------------------------------------
<spontaneous>
[3] 10.4 0.14 0.00 calcme2 [3]
In this optimized case, we see calcme2 shown as <spontaneous>, which means that gprof can’t figure out who called it. What is interesting is that in the ASM for the optimized case we don’t see any calls to calcme2 in the main loop.
ASM (optimized no profile)
000000000000114e <main>:
114e: f3 0f 1e fa endbr64
1152: 53 push %rbx
1153: bb 00 ca 9a 3b mov $0x3b9aca00,%ebx
1158: b8 00 00 00 00 mov $0x0,%eax
115d: e8 e0 ff ff ff call 1142 <increment>
1162: 83 eb 01 sub $0x1,%ebx
1165: 75 f1 jne 1158 <main+0xa>
1167: b8 00 00 00 00 mov $0x0,%eax
116c: 5b pop %rbx
116d: c3 ret
Whereas in the optimized+profile case the ASM is a bit different – presumably to accommodate profiling – but again we don’t see any call to calcme2.
ASM (optimized + profile) (output of objdump -d)
000000000000123a <main>:
123a: f3 0f 1e fa endbr64
123e: 55 push %rbp
123f: 48 89 e5 mov %rsp,%rbp
1242: 53 push %rbx
1243: ff 15 9f 2d 00 00 call *0x2d9f(%rip) # 3fe8 <mcount@GLIBC_2.2.5>
1249: bb 00 ca 9a 3b mov $0x3b9aca00,%ebx
124e: b8 00 00 00 00 mov $0x0,%eax
1253: e8 cb ff ff ff call 1223 <increment>
1258: 83 eb 01 sub $0x1,%ebx
125b: 75 f1 jne 124e <main+0x14>
125d: b8 00 00 00 00 mov $0x0,%eax
1262: 5b pop %rbx
1263: 5d pop %rbp
1264: c3 ret
1265: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
126c: 00 00 00
126f: 90 nop
Example 1 source
Runs 1B times
#include <stdio.h>
#include <stdlib.h>

/*
 * runs 1 Billion times
 */

int counter = 0;

int return3(int f) {
    return(f*3);
}

int calcme1(int f) {
    return(return3(f));
}

int calcme2(int f) {
    return(return3(f+1));
}

void increment() {
    counter += 1;
}

int main() {
    int i, res;
    int iterations = 1000000000;

    for (i = 0; i < iterations; i++) {
        // call calcme1 every third iteration, calcme2 otherwise
        if ((i + 1) % 3 == 0) {
            res = calcme1(i);
        } else {
            res = calcme2(i);
        }
        increment();
    }
    return 0;
}