Tuesday, May 23, 2017

Can Using goto Create Optimizations that the Compiler Can't Generate in C++?


Can using gotos instead of loops result in a series of jump instructions more efficient than what the compiler would have generated if loops had been used instead?

For example: If I had a while loop nested inside a switch statement, which would be nested in another loop which would be nested inside of another switch case, could using goto actually outsmart the jump instructions the compiler generates when using just loops and no gotos?

5 Answers

Answer 1

It may be possible to gain a small speed advantage by using goto. However, the opposite may also be true. Compilers have become very good at detecting loops and then unrolling them or optimizing them with SIMD instructions. You will most likely kill all of those optimization options for the compiler, since compilers are not built to optimize goto statements. You can also write small functions instead of using gotos; this way you enable the compiler to inline the function and get rid of the jump.
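To illustrate that last point with a hypothetical sketch (the names and the task are made up, not taken from the question): the body of a loop is factored into a small helper function instead of being stitched together with goto, so at -O2 the compiler can inline the call and still apply its loop optimizations.

```cpp
// Hypothetical example: sum only the positive entries of an array.
// The per-element logic lives in a helper; the compiler is free to
// inline it, keeping the loop in a form it knows how to optimize.
static int add_if_positive(int acc, int x) {
    return x > 0 ? acc + x : acc;
}

int sum_positive(const int* data, int n) {
    int acc = 0;
    for (int i = 0; i < n; ++i)
        acc = add_if_positive(acc, data[i]);  // call is typically inlined at -O2
    return acc;
}
```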

If you are considering goto for optimization purposes, I would say that it is a very bad idea. Express your algorithm in clean code and optimize later. And if you need more performance, think about your data and how you access it. That is where you can gain or lose performance.
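A minimal sketch of what "think about your data" can mean in practice (a hypothetical illustration, not part of the original answer): storing one field contiguously (structure-of-arrays) lets a traversal touch far less memory than striding over an array of large structs, and that is usually where real performance is won or lost.

```cpp
#include <vector>

// Array-of-structures: reading only `mass` still drags the whole
// struct through the cache, one 40-byte stride per element.
struct ParticleAoS { double x, y, z, mass, charge; };

double total_mass_aos(const std::vector<ParticleAoS>& ps) {
    double m = 0;
    for (const auto& p : ps) m += p.mass;  // strided access
    return m;
}

// Structure-of-arrays: the masses are contiguous, so the same loop
// streams through memory cache line by cache line.
struct ParticlesSoA { std::vector<double> x, y, z, mass, charge; };

double total_mass_soa(const ParticlesSoA& ps) {
    double m = 0;
    for (double v : ps.mass) m += v;       // contiguous access
    return m;
}
```

Both functions compute the same result; only the memory layout differs.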

Since you wanted code to prove the point, I constructed the following example. I used gcc 6.3.1 with -O3 on an Intel i7-3537U. If you try to reproduce the example, your results may differ depending on your compiler and hardware.

#include <iostream>
#include <array>
#include "ToolBox/Instrumentation/Profiler.hpp"

constexpr size_t kilo = 1024;
constexpr size_t mega = kilo * 1024;
constexpr size_t size = 512*mega;
using container = std::array<char, size>;

enum class Measurements {
    jump,
    loop
};

// Simple vector addition using a for loop
void sum(container& result, const container& data1, const container& data2) {
    profile(Measurements::loop);
    for(unsigned int i = 0; i < size; ++i) {
        result[i] = data1[i] + data2[i];
    }
}

// Simple vector addition using jumps
void sum_jump(container& result, const container& data1, const container& data2) {
    profile(Measurements::jump);
    unsigned int i = 0;
label:
    result[i] = data1[i] + data2[i];
    i++;
    if(i != size) goto label;
}

int main() {
    // This segment is just for benchmarking purposes
    // Just ignore this
    ToolBox::Instrumentation::Profiler<Measurements, std::chrono::nanoseconds, 2> profiler(
        std::cout,
        {
            {Measurements::jump, "jump"},
            {Measurements::loop, "loop"}
        }
    );

    // allocate memory to execute our sum functions on
    container data1, data2, result;

    // run the benchmark 100 times to account for caching of the data
    for(unsigned i = 0; i < 100; i++) {
        sum_jump(result, data1, data2);
        sum(result, data1, data2);
    }
}

The output of the program is the following:

Runtimes for 12Measurements
        jump:       100x            2972 nanoseconds            29 nanoseconds/execution
        loop:       100x            2820 nanoseconds            28 nanoseconds/execution

OK, we see that there is no real difference in the runtimes, because we are limited by memory bandwidth rather than by CPU instructions. But let's look at the assembler instructions that are generated:

Dump of assembler code for function sum(std::array<char, 536870912ul>&, std::array<char, 536870912ul> const&, std::array<char, 536870912ul> const&):
   0x00000000004025c0 <+0>:     push   %r15
   0x00000000004025c2 <+2>:     push   %r14
   0x00000000004025c4 <+4>:     push   %r12
   0x00000000004025c6 <+6>:     push   %rbx
   0x00000000004025c7 <+7>:     push   %rax
   0x00000000004025c8 <+8>:     mov    %rdx,%r15
   0x00000000004025cb <+11>:    mov    %rsi,%r12
   0x00000000004025ce <+14>:    mov    %rdi,%rbx
   0x00000000004025d1 <+17>:    callq  0x402110 <_ZNSt6chrono3_V212system_clock3nowEv@plt>
   0x00000000004025d6 <+22>:    mov    %rax,%r14
   0x00000000004025d9 <+25>:    lea    0x20000000(%rbx),%rcx
   0x00000000004025e0 <+32>:    lea    0x20000000(%r12),%rax
   0x00000000004025e8 <+40>:    lea    0x20000000(%r15),%rsi
   0x00000000004025ef <+47>:    cmp    %rax,%rbx
   0x00000000004025f2 <+50>:    sbb    %al,%al
   0x00000000004025f4 <+52>:    cmp    %rcx,%r12
   0x00000000004025f7 <+55>:    sbb    %dl,%dl
   0x00000000004025f9 <+57>:    and    %al,%dl
   0x00000000004025fb <+59>:    cmp    %rsi,%rbx
   0x00000000004025fe <+62>:    sbb    %al,%al
   0x0000000000402600 <+64>:    cmp    %rcx,%r15
   0x0000000000402603 <+67>:    sbb    %cl,%cl
   0x0000000000402605 <+69>:    test   $0x1,%dl
   0x0000000000402608 <+72>:    jne    0x40268b <sum(std::array<char, 536870912ul>&, std::array<char, 536870912ul> const&, std::array<char, 536870912ul> const&)+203>
   0x000000000040260e <+78>:    and    %cl,%al
   0x0000000000402610 <+80>:    and    $0x1,%al
   0x0000000000402612 <+82>:    jne    0x40268b <sum(std::array<char, 536870912ul>&, std::array<char, 536870912ul> const&, std::array<char, 536870912ul> const&)+203>
   0x0000000000402614 <+84>:    xor    %eax,%eax
   0x0000000000402616 <+86>:    nopw   %cs:0x0(%rax,%rax,1)
   0x0000000000402620 <+96>:    movdqu (%r12,%rax,1),%xmm0
   0x0000000000402626 <+102>:   movdqu 0x10(%r12,%rax,1),%xmm1
   0x000000000040262d <+109>:   movdqu (%r15,%rax,1),%xmm2
   0x0000000000402633 <+115>:   movdqu 0x10(%r15,%rax,1),%xmm3
   0x000000000040263a <+122>:   paddb  %xmm0,%xmm2
   0x000000000040263e <+126>:   paddb  %xmm1,%xmm3
   0x0000000000402642 <+130>:   movdqu %xmm2,(%rbx,%rax,1)
   0x0000000000402647 <+135>:   movdqu %xmm3,0x10(%rbx,%rax,1)
   0x000000000040264d <+141>:   movdqu 0x20(%r12,%rax,1),%xmm0
   0x0000000000402654 <+148>:   movdqu 0x30(%r12,%rax,1),%xmm1
   0x000000000040265b <+155>:   movdqu 0x20(%r15,%rax,1),%xmm2
   0x0000000000402662 <+162>:   movdqu 0x30(%r15,%rax,1),%xmm3
   0x0000000000402669 <+169>:   paddb  %xmm0,%xmm2
   0x000000000040266d <+173>:   paddb  %xmm1,%xmm3
   0x0000000000402671 <+177>:   movdqu %xmm2,0x20(%rbx,%rax,1)
   0x0000000000402677 <+183>:   movdqu %xmm3,0x30(%rbx,%rax,1)
   0x000000000040267d <+189>:   add    $0x40,%rax
   0x0000000000402681 <+193>:   cmp    $0x20000000,%rax
   0x0000000000402687 <+199>:   jne    0x402620 <sum(std::array<char, 536870912ul>&, std::array<char, 536870912ul> const&, std::array<char, 536870912ul> const&)+96>
   0x0000000000402689 <+201>:   jmp    0x4026d5 <sum(std::array<char, 536870912ul>&, std::array<char, 536870912ul> const&, std::array<char, 536870912ul> const&)+277>
   0x000000000040268b <+203>:   xor    %eax,%eax
   0x000000000040268d <+205>:   nopl   (%rax)
   0x0000000000402690 <+208>:   movzbl (%r15,%rax,1),%ecx
   0x0000000000402695 <+213>:   add    (%r12,%rax,1),%cl
   0x0000000000402699 <+217>:   mov    %cl,(%rbx,%rax,1)
   0x000000000040269c <+220>:   movzbl 0x1(%r15,%rax,1),%ecx
   0x00000000004026a2 <+226>:   add    0x1(%r12,%rax,1),%cl
   0x00000000004026a7 <+231>:   mov    %cl,0x1(%rbx,%rax,1)
   0x00000000004026ab <+235>:   movzbl 0x2(%r15,%rax,1),%ecx
   0x00000000004026b1 <+241>:   add    0x2(%r12,%rax,1),%cl
   0x00000000004026b6 <+246>:   mov    %cl,0x2(%rbx,%rax,1)
   0x00000000004026ba <+250>:   movzbl 0x3(%r15,%rax,1),%ecx
   0x00000000004026c0 <+256>:   add    0x3(%r12,%rax,1),%cl
   0x00000000004026c5 <+261>:   mov    %cl,0x3(%rbx,%rax,1)
   0x00000000004026c9 <+265>:   add    $0x4,%rax
   0x00000000004026cd <+269>:   cmp    $0x20000000,%rax
   0x00000000004026d3 <+275>:   jne    0x402690 <sum(std::array<char, 536870912ul>&, std::array<char, 536870912ul> const&, std::array<char, 536870912ul> const&)+208>
   0x00000000004026d5 <+277>:   callq  0x402110 <_ZNSt6chrono3_V212system_clock3nowEv@plt>
   0x00000000004026da <+282>:   sub    %r14,%rax
   0x00000000004026dd <+285>:   add    %rax,0x202b74(%rip)        # 0x605258 <_ZN7ToolBox15Instrumentation6detail19ProfilerMeasurementI12MeasurementsLS3_1EE14totalTimeSpentE>
   0x00000000004026e4 <+292>:   incl   0x202b76(%rip)        # 0x605260 <_ZN7ToolBox15Instrumentation6detail19ProfilerMeasurementI12MeasurementsLS3_1EE10executionsE>
   0x00000000004026ea <+298>:   add    $0x8,%rsp
   0x00000000004026ee <+302>:   pop    %rbx
   0x00000000004026ef <+303>:   pop    %r12
   0x00000000004026f1 <+305>:   pop    %r14
   0x00000000004026f3 <+307>:   pop    %r15
   0x00000000004026f5 <+309>:   retq
End of assembler dump.

As we can see, the compiler vectorized the loop and uses SIMD instructions (e.g. paddb).

Now the version with the jumps:

Dump of assembler code for function sum_jump(std::array<char, 536870912ul>&, std::array<char, 536870912ul> const&, std::array<char, 536870912ul> const&):
   0x0000000000402700 <+0>:     push   %r15
   0x0000000000402702 <+2>:     push   %r14
   0x0000000000402704 <+4>:     push   %r12
   0x0000000000402706 <+6>:     push   %rbx
   0x0000000000402707 <+7>:     push   %rax
   0x0000000000402708 <+8>:     mov    %rdx,%rbx
   0x000000000040270b <+11>:    mov    %rsi,%r14
   0x000000000040270e <+14>:    mov    %rdi,%r15
   0x0000000000402711 <+17>:    callq  0x402110 <_ZNSt6chrono3_V212system_clock3nowEv@plt>
   0x0000000000402716 <+22>:    mov    %rax,%r12
   0x0000000000402719 <+25>:    mov    (%rbx),%al
   0x000000000040271b <+27>:    add    (%r14),%al
   0x000000000040271e <+30>:    mov    %al,(%r15)
   0x0000000000402721 <+33>:    callq  0x402110 <_ZNSt6chrono3_V212system_clock3nowEv@plt>
   0x0000000000402726 <+38>:    sub    %r12,%rax
   0x0000000000402729 <+41>:    add    %rax,0x202b38(%rip)        # 0x605268 <_ZN7ToolBox15Instrumentation6detail19ProfilerMeasurementI12MeasurementsLS3_0EE14totalTimeSpentE>
   0x0000000000402730 <+48>:    incl   0x202b3a(%rip)        # 0x605270 <_ZN7ToolBox15Instrumentation6detail19ProfilerMeasurementI12MeasurementsLS3_0EE10executionsE>
   0x0000000000402736 <+54>:    add    $0x8,%rsp
   0x000000000040273a <+58>:    pop    %rbx
   0x000000000040273b <+59>:    pop    %r12
   0x000000000040273d <+61>:    pop    %r14
   0x000000000040273f <+63>:    pop    %r15
   0x0000000000402741 <+65>:    retq
End of assembler dump.

And here we did not trigger the optimization.

You can compile the program and check the asm code yourself with:

gdb -batch -ex 'file a.out' -ex 'disassemble sum' 

I also tried this approach for a matrix multiplication, but gcc was smart enough to detect the matrix multiplication written with goto/label syntax too.

Conclusion:

Even though we did not see a speed loss here, we saw that gcc could not apply an optimization that might speed up the computation. In more CPU-bound tasks, that may cause a drop in performance.

Answer 2

Could it? Sure, theoretically, if you were using an especially dumb compiler or were compiling with optimizations disabled.

In practice, absolutely not. An optimizing compiler has very little difficulty optimizing loops and switch statements. You are far more likely to confuse the optimizer with unconditional jumps than if you play by the normal rules and use looping constructs that it is familiar with and programmed to optimize accordingly.

This just goes back to a general rule that optimizers do their best work with standard, idiomatic code because that's what they have been trained to recognize. Given their extensive use of pattern matching, if you deviate from normal patterns, you are less likely to get optimal code output.

For example: If I had a while loop nested inside a switch statement, which would be nested in another loop which would be nested inside of another switch case

Yeah, I'm confused by reading your description of the code, but a compiler would not be. Unlike humans, compilers have no trouble with nested loops. You aren't going to confuse it by writing valid C++ code, unless the compiler has a bug, in which case all bets are off. No matter how much nesting there is, if the logical result is a jump outside of the entire block, then a compiler will emit code that does that, just as if you had written a goto of your own. If you had tried to write a goto of your own, you would be more likely to get sub-optimal code because of scoping issues, which would require the compiler to emit code that saved local variables, adjusted the stack frame, etc. There are many standard optimizations that are normally performed on loops, yet become either impossible or simply not applied if you throw a goto inside of that loop.
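To make the nested-loop case concrete (a hypothetical sketch, not the asker's actual code): a search over a 2-D range that needs to bail out early, the classic "goto out of both loops" situation, can be written without goto by putting the loops in a function and returning, which keeps the loops in exactly the shape the optimizer is built to recognize.

```cpp
#include <cstddef>

// Find the first pair (i, j) with grid[i][j] == target in a rows x cols
// grid stored row-major. The early `return` exits both loops at once,
// doing the job of the traditional "goto found;".
bool find(const int* grid, std::size_t rows, std::size_t cols,
          int target, std::size_t& out_i, std::size_t& out_j) {
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            if (grid[i * cols + j] == target) {
                out_i = i;
                out_j = j;
                return true;  // leaves both loops, no goto needed
            }
    return false;
}
```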

Furthermore, it is not the case that jumps always result in faster code—even when jumping over other instructions. In many cases on modern, heavily-pipelined, out-of-order processors, mispredicted branches amount to a significant penalty, far more than if the intervening instructions had simply been executed and their results thrown away (ignored). Optimizing compilers that are targeting a particular platform know about this, and may decide to emit branchless code for performance reasons. If you throw in a jump, you force an unconditional branch and eliminate their ability to make these decisions strategically.
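As a small illustration of that point (hypothetical example, not from the question): the two functions below are semantically identical, and optimizing compilers will often compile the ternary form to a conditional move (cmov on x86) rather than a branch. Writing an explicit jump yourself takes that choice away from the compiler.

```cpp
// Branchy form: written with an explicit conditional jump in mind.
int max_branchy(int a, int b) {
    if (a > b) return a;
    return b;
}

// Equivalent form that compilers typically emit branch-free at -O2,
// e.g. as a cmov, avoiding any chance of a branch misprediction.
int max_branchless_style(int a, int b) {
    return a > b ? a : b;
}
```

Whether either form actually becomes a branch or a cmov is entirely the compiler's decision; the point is that idiomatic code leaves that decision open.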

You claim that you want "an actual example instead of theory", but this is setting up a logical fallacy. It makes the assumption that you are correct and that the use of goto can indeed lead to better optimized code than a compiler could generate with looping constructs, but if that is not the case, then there is no code that can prove it. I could show you hundreds of examples of code where a goto resulted in sub-optimal object code being generated, but you could just claim that there still existed some snippet of code where the reverse is true, and there is no way that I (or anyone else) could exhaustively demonstrate that you were wrong.

On the contrary, theoretical arguments are the only way to answer this question (at least if answering in the negative). Furthermore, I would argue that a theoretical understanding is all that is required to inform one's writing of code. You should write code assuming that your optimizer is not broken, and only after determining that it actually is broken should you go back and try to figure out how to revise the code to get it to generate the output you expected. If, in some bizarre circumstance, you find that that involves a goto, then you can use it. Until that point, assume that the compiler knows how to optimize loops and write the code in the normal, readable way, assured that in the vast majority of circumstances (if not actually 100%) that the output will be better than what you could have gotten by trying to outsmart the compiler from the outset.

Answer 3

If you are asking about the exact assembly code, then probably yes: there is some sequence of assembly instructions you can't force the compiler to generate without using goto. This is not an excuse to use goto, however. Compilers tend to change their assembly output (across versions, flags, optimization settings, and minor unrelated changes in the source code), and if you need an exact sequence, you should write that exact sequence as inline assembly.

In general, no: you can always write equivalent code, at the cost of at most a few extra variables.
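A hypothetical sketch of what those "few extra variables" look like: a tiny word counter written first as a goto state machine, then rewritten with one extra variable (the state) and a plain loop. Both functions are equivalent; the second is the form the optimizer is built for.

```cpp
#include <string>
#include <cstddef>

// Goto version: two labels act as the two states of the machine.
int count_words_goto(const std::string& s) {
    int words = 0;
    std::size_t i = 0;
in_space:
    if (i == s.size()) return words;
    if (s[i] != ' ') { ++words; goto in_word; }
    ++i;
    goto in_space;
in_word:
    if (i == s.size()) return words;
    if (s[i] == ' ') goto in_space;
    ++i;
    goto in_word;
}

// Loop version: the label pair becomes one explicit state variable.
int count_words_loop(const std::string& s) {
    enum State { Space, Word } state = Space;  // the "extra variable"
    int words = 0;
    for (char c : s) {
        if (state == Space && c != ' ') { ++words; state = Word; }
        else if (state == Word && c == ' ') state = Space;
    }
    return words;
}
```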

Answer 4

goto/while/do are just high-level language constructs. Once they are lowered to the compiler's intermediate representation (IR), they disappear and everything is implemented in terms of branches.

Take for example this code

double calcsum( double* val, unsigned int count ) {
    double sum = 0;
    for ( unsigned int j=0; j<count; ++j ) {
        sum += val[j];
    }
    return sum;
}

Compile it such that it generates IR:

clang++ -S -emit-llvm -O3 test.cpp   # will create test.ll 

Look at the generated IR:

$ cat test.ll

define double @_Z7calcsumPdj(double* nocapture readonly, i32) local_unnamed_addr #0 {
  %3 = icmp eq i32 %1, 0
  br i1 %3, label %26, label %4

; <label>:4:                                      ; preds = %2
  %5 = load double, double* %0, align 8, !tbaa !1
  %6 = sub i32 0, %1
  %7 = and i32 %6, 7
  %8 = icmp ugt i32 %1, -8
  br i1 %8, label %12, label %9

; <label>:9:                                      ; preds = %4
  %10 = sub i32 %6, %7
  br label %28

; <label>:11:                                     ; preds = %28
  br label %12

; <label>:12:                                     ; preds = %11, %4
  %13 = phi double [ undef, %4 ], [ %38, %11 ]
  %14 = phi double [ 0.000000e+00, %4 ], [ %38, %11 ]
  %15 = icmp eq i32 %7, 0
  br i1 %15, label %24, label %16

; <label>:16:                                     ; preds = %12
  br label %17

; <label>:17:                                     ; preds = %17, %16
  %18 = phi double [ %14, %16 ], [ %20, %17 ]
  %19 = phi i32 [ %7, %16 ], [ %21, %17 ]
  %20 = fadd double %18, %5
  %21 = add i32 %19, -1
  %22 = icmp eq i32 %21, 0
  br i1 %22, label %23, label %17, !llvm.loop !5

; <label>:23:                                     ; preds = %17
  br label %24

; <label>:24:                                     ; preds = %12, %23
  %25 = phi double [ %13, %12 ], [ %20, %23 ]
  br label %26

; <label>:26:                                     ; preds = %24, %2
  %27 = phi double [ 0.000000e+00, %2 ], [ %25, %24 ]
  ret double %27

; <label>:28:                                     ; preds = %28, %9
  %29 = phi double [ 0.000000e+00, %9 ], [ %38, %28 ]
  %30 = phi i32 [ %10, %9 ], [ %39, %28 ]
  %31 = fadd double %29, %5
  %32 = fadd double %31, %5
  %33 = fadd double %32, %5
  %34 = fadd double %33, %5
  %35 = fadd double %34, %5
  %36 = fadd double %35, %5
  %37 = fadd double %36, %5
  %38 = fadd double %37, %5
  %39 = add i32 %30, -8
  %40 = icmp eq i32 %39, 0
  br i1 %40, label %11, label %28
}

You can see that there are no while/do constructs anymore; everything is a branch (effectively a goto). Obviously this still gets compiled further into assembly language.

But besides this, "fast" today depends on whether the compiler can match a good optimization pass to your code. By using goto you would potentially forfeit some loop-optimization passes, such as LCSSA, LICM, loop deletion, loop strength reduction, loop simplification, unrolling, and unswitching:

http://llvm.org/docs/Passes.html#lcssa-loop-closed-ssa-form-pass

Answer 5

I think that if you are trying to optimize your code this way, you are doing something wrong.

  1. Compiler optimizations can be tuned per platform - you cannot realistically do this by hand for every target
  2. There are many compilers - how can you be sure your "optimization" will not be a degradation on a different platform
  3. There are so many operating systems, processors, and other factors that such an optimization is unlikely to survive them all
  4. Programming languages are made for humans, not machines - if you really want that level of control, why not use machine instructions directly, where you have full control and can optimize things (taking point 3 into account)

Anyway, if you are looking for micro-optimizations on your current platform, it is better to take full control and use asm.
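For completeness, a minimal sketch of what "use asm" means in practice with GCC/Clang extended inline assembly on x86-64 (a hypothetical example; the syntax is compiler- and platform-specific, and the function name is made up):

```cpp
#include <cstdint>

// Add two integers with an explicit x86-64 instruction. Everything
// inside the asm template is opaque to the optimizer, so this is only
// worth doing when you truly need exact control over the instructions.
std::int64_t add_asm(std::int64_t a, std::int64_t b) {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    std::int64_t result = a;
    asm("addq %1, %0"        // result += b, AT&T syntax
        : "+r"(result)       // in/out: result in a register
        : "r"(b));           // input: b in a register
    return result;
#else
    return a + b;            // fallback for other compilers/targets
#endif
}
```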
