Can using gotos instead of loops result in a series of jump instructions that is more efficient than what the compiler would have generated if loops had been used instead?
For example: if I had a while loop nested inside a switch statement, which was nested in another loop, which was nested inside another switch case, could using goto actually beat the jump instructions the compiler generates when using only loops and no gotos?
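To make the structure concrete, here is a minimal sketch of the kind of nesting I mean (the state values and the work inside are just placeholders):

#include <cstdio>

// Placeholder illustration of the nesting described above:
// a while loop inside a switch, inside a loop, inside another switch case.
void process(int outer_state, int inner_state)
{
    switch (outer_state) {            // outer switch
    case 0:
        for (int i = 0; i < 4; ++i) { // loop inside the outer switch case
            switch (inner_state) {    // inner switch
            case 0: {
                int j = 0;
                while (j < 8) {       // while loop inside the inner switch
                    std::printf("%d %d\n", i, j);
                    ++j;
                }
                break;
            }
            default:
                break;
            }
        }
        break;
    default:
        break;
    }
}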
5 Answers
Answer 1
It may be possible to gain a small speed advantage by using goto, but the opposite may also be true. Compilers have become very good at detecting loops and then unrolling them or optimizing them with SIMD instructions. You will most likely kill all of those optimization options, since compilers are not built to optimize goto statements. You can also extract code into functions instead of using gotos; that way you enable the compiler to inline the function and get rid of the jump.
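For example, one of the classic uses of goto, breaking out of nested loops, can instead be expressed by extracting the loops into a function and returning from it. A minimal sketch (find2d is a made-up name):

#include <cstddef>

// 'return' exits both loops at once, replacing "goto done;",
// and the compiler is free to inline the whole function.
bool find2d(const int* m, std::size_t rows, std::size_t cols, int needle)
{
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            if (m[i * cols + j] == needle)
                return true;  // no label, no goto
    return false;
}

The early return exits both loops in one step, and after inlining there is no call overhead either.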
If you are considering goto for optimization purposes, I would say it is a very bad idea. Express your algorithm in clean code and optimize later. And if you need more performance, think about your data and how you access it. That is the point where you can gain or lose performance.
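To illustrate the point about data access: traversal order alone usually matters far more than control-flow micro-tweaks. A minimal sketch (the sizes and names here are mine, not from the benchmark below):

#include <cstddef>
#include <vector>

// Row-major traversal: touches memory sequentially, cache-friendly.
long long sum_rows(const std::vector<int>& m, std::size_t n)
{
    long long s = 0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            s += m[i * n + j];       // stride 1
    return s;
}

// Column-major traversal of the same data: strided access,
// typically much slower for large n despite identical control flow.
long long sum_cols(const std::vector<int>& m, std::size_t n)
{
    long long s = 0;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            s += m[i * n + j];       // stride n
    return s;
}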
Since you wanted code to prove the point, I constructed the following example. I used gcc 6.3.1 with -O3 on an Intel i7-3537U. If you try to reproduce it, your results may differ depending on your compiler and hardware.
#include <array>
#include <chrono>
#include <cstddef>
#include <iostream>

#include "ToolBox/Instrumentation/Profiler.hpp"

constexpr size_t kilo = 1024;
constexpr size_t mega = kilo * 1024;
constexpr size_t size = 512 * mega;

using container = std::array<char, size>;

enum class Measurements
{
    jump,
    loop
};

// Simple vector addition using a for loop
void sum(container& result, const container& data1, const container& data2)
{
    profile(Measurements::loop);
    for (unsigned int i = 0; i < size; ++i)
    {
        result[i] = data1[i] + data2[i];
    }
}

// Simple vector addition using jumps
void sum_jump(container& result, const container& data1, const container& data2)
{
    profile(Measurements::jump);
    unsigned int i = 0;
label:
    result[i] = data1[i] + data2[i];
    i++;
    if (i < size) goto label;
}

int main()
{
    // This segment is just for benchmarking purposes
    // Just ignore this
    ToolBox::Instrumentation::Profiler<Measurements, std::chrono::nanoseconds, 2> profiler(
        std::cout,
        {
            {Measurements::jump, "jump"},
            {Measurements::loop, "loop"}
        }
    );

    // memory to execute our sum functions on
    // (static: three 512 MiB arrays would overflow the stack)
    static container data1, data2, result;

    // run the benchmark 100 times to account for caching of the data
    for (unsigned i = 0; i < 100; i++)
    {
        sum_jump(result, data1, data2);
        sum(result, data1, data2);
    }
}
The output of the program is the following:
Runtimes for 12Measurements
  jump: 100x 2972 nanoseconds 29 nanoseconds/execution
  loop: 100x 2820 nanoseconds 28 nanoseconds/execution
Ok, we see that there is practically no difference in the runtimes, because we are limited by memory bandwidth, not by CPU instructions. But let's look at the assembly that is generated:
Dump of assembler code for function sum(std::array<char, 536870912ul>&, std::array<char, 536870912ul> const&, std::array<char, 536870912ul> const&):
   0x00000000004025c0 <+0>:   push %r15
   0x00000000004025c2 <+2>:   push %r14
   0x00000000004025c4 <+4>:   push %r12
   0x00000000004025c6 <+6>:   push %rbx
   0x00000000004025c7 <+7>:   push %rax
   0x00000000004025c8 <+8>:   mov %rdx,%r15
   0x00000000004025cb <+11>:  mov %rsi,%r12
   0x00000000004025ce <+14>:  mov %rdi,%rbx
   0x00000000004025d1 <+17>:  callq 0x402110 <_ZNSt6chrono3_V212system_clock3nowEv@plt>
   0x00000000004025d6 <+22>:  mov %rax,%r14
   0x00000000004025d9 <+25>:  lea 0x20000000(%rbx),%rcx
   0x00000000004025e0 <+32>:  lea 0x20000000(%r12),%rax
   0x00000000004025e8 <+40>:  lea 0x20000000(%r15),%rsi
   0x00000000004025ef <+47>:  cmp %rax,%rbx
   0x00000000004025f2 <+50>:  sbb %al,%al
   0x00000000004025f4 <+52>:  cmp %rcx,%r12
   0x00000000004025f7 <+55>:  sbb %dl,%dl
   0x00000000004025f9 <+57>:  and %al,%dl
   0x00000000004025fb <+59>:  cmp %rsi,%rbx
   0x00000000004025fe <+62>:  sbb %al,%al
   0x0000000000402600 <+64>:  cmp %rcx,%r15
   0x0000000000402603 <+67>:  sbb %cl,%cl
   0x0000000000402605 <+69>:  test $0x1,%dl
   0x0000000000402608 <+72>:  jne 0x40268b <sum(std::array<char, 536870912ul>&, std::array<char, 536870912ul> const&, std::array<char, 536870912ul> const&)+203>
   0x000000000040260e <+78>:  and %cl,%al
   0x0000000000402610 <+80>:  and $0x1,%al
   0x0000000000402612 <+82>:  jne 0x40268b <sum(std::array<char, 536870912ul>&, std::array<char, 536870912ul> const&, std::array<char, 536870912ul> const&)+203>
   0x0000000000402614 <+84>:  xor %eax,%eax
   0x0000000000402616 <+86>:  nopw %cs:0x0(%rax,%rax,1)
   0x0000000000402620 <+96>:  movdqu (%r12,%rax,1),%xmm0
   0x0000000000402626 <+102>: movdqu 0x10(%r12,%rax,1),%xmm1
   0x000000000040262d <+109>: movdqu (%r15,%rax,1),%xmm2
   0x0000000000402633 <+115>: movdqu 0x10(%r15,%rax,1),%xmm3
   0x000000000040263a <+122>: paddb %xmm0,%xmm2
   0x000000000040263e <+126>: paddb %xmm1,%xmm3
   0x0000000000402642 <+130>: movdqu %xmm2,(%rbx,%rax,1)
   0x0000000000402647 <+135>: movdqu %xmm3,0x10(%rbx,%rax,1)
   0x000000000040264d <+141>: movdqu 0x20(%r12,%rax,1),%xmm0
   0x0000000000402654 <+148>: movdqu 0x30(%r12,%rax,1),%xmm1
   0x000000000040265b <+155>: movdqu 0x20(%r15,%rax,1),%xmm2
   0x0000000000402662 <+162>: movdqu 0x30(%r15,%rax,1),%xmm3
   0x0000000000402669 <+169>: paddb %xmm0,%xmm2
   0x000000000040266d <+173>: paddb %xmm1,%xmm3
   0x0000000000402671 <+177>: movdqu %xmm2,0x20(%rbx,%rax,1)
   0x0000000000402677 <+183>: movdqu %xmm3,0x30(%rbx,%rax,1)
   0x000000000040267d <+189>: add $0x40,%rax
   0x0000000000402681 <+193>: cmp $0x20000000,%rax
   0x0000000000402687 <+199>: jne 0x402620 <sum(std::array<char, 536870912ul>&, std::array<char, 536870912ul> const&, std::array<char, 536870912ul> const&)+96>
   0x0000000000402689 <+201>: jmp 0x4026d5 <sum(std::array<char, 536870912ul>&, std::array<char, 536870912ul> const&, std::array<char, 536870912ul> const&)+277>
   0x000000000040268b <+203>: xor %eax,%eax
   0x000000000040268d <+205>: nopl (%rax)
   0x0000000000402690 <+208>: movzbl (%r15,%rax,1),%ecx
   0x0000000000402695 <+213>: add (%r12,%rax,1),%cl
   0x0000000000402699 <+217>: mov %cl,(%rbx,%rax,1)
   0x000000000040269c <+220>: movzbl 0x1(%r15,%rax,1),%ecx
   0x00000000004026a2 <+226>: add 0x1(%r12,%rax,1),%cl
   0x00000000004026a7 <+231>: mov %cl,0x1(%rbx,%rax,1)
   0x00000000004026ab <+235>: movzbl 0x2(%r15,%rax,1),%ecx
   0x00000000004026b1 <+241>: add 0x2(%r12,%rax,1),%cl
   0x00000000004026b6 <+246>: mov %cl,0x2(%rbx,%rax,1)
   0x00000000004026ba <+250>: movzbl 0x3(%r15,%rax,1),%ecx
   0x00000000004026c0 <+256>: add 0x3(%r12,%rax,1),%cl
   0x00000000004026c5 <+261>: mov %cl,0x3(%rbx,%rax,1)
   0x00000000004026c9 <+265>: add $0x4,%rax
   0x00000000004026cd <+269>: cmp $0x20000000,%rax
   0x00000000004026d3 <+275>: jne 0x402690 <sum(std::array<char, 536870912ul>&, std::array<char, 536870912ul> const&, std::array<char, 536870912ul> const&)+208>
   0x00000000004026d5 <+277>: callq 0x402110 <_ZNSt6chrono3_V212system_clock3nowEv@plt>
   0x00000000004026da <+282>: sub %r14,%rax
   0x00000000004026dd <+285>: add %rax,0x202b74(%rip) # 0x605258 <_ZN7ToolBox15Instrumentation6detail19ProfilerMeasurementI12MeasurementsLS3_1EE14totalTimeSpentE>
   0x00000000004026e4 <+292>: incl 0x202b76(%rip) # 0x605260 <_ZN7ToolBox15Instrumentation6detail19ProfilerMeasurementI12MeasurementsLS3_1EE10executionsE>
   0x00000000004026ea <+298>: add $0x8,%rsp
   0x00000000004026ee <+302>: pop %rbx
   0x00000000004026ef <+303>: pop %r12
   0x00000000004026f1 <+305>: pop %r14
   0x00000000004026f3 <+307>: pop %r15
   0x00000000004026f5 <+309>: retq
End of assembler dump.
As we can see, the compiler vectorized the loop and uses SIMD instructions (e.g. paddb).
Now the version with the jumps:
Dump of assembler code for function sum_jump(std::array<char, 536870912ul>&, std::array<char, 536870912ul> const&, std::array<char, 536870912ul> const&):
   0x0000000000402700 <+0>:   push %r15
   0x0000000000402702 <+2>:   push %r14
   0x0000000000402704 <+4>:   push %r12
   0x0000000000402706 <+6>:   push %rbx
   0x0000000000402707 <+7>:   push %rax
   0x0000000000402708 <+8>:   mov %rdx,%rbx
   0x000000000040270b <+11>:  mov %rsi,%r14
   0x000000000040270e <+14>:  mov %rdi,%r15
   0x0000000000402711 <+17>:  callq 0x402110 <_ZNSt6chrono3_V212system_clock3nowEv@plt>
   0x0000000000402716 <+22>:  mov %rax,%r12
   0x0000000000402719 <+25>:  mov (%rbx),%al
   0x000000000040271b <+27>:  add (%r14),%al
   0x000000000040271e <+30>:  mov %al,(%r15)
   0x0000000000402721 <+33>:  callq 0x402110 <_ZNSt6chrono3_V212system_clock3nowEv@plt>
   0x0000000000402726 <+38>:  sub %r12,%rax
   0x0000000000402729 <+41>:  add %rax,0x202b38(%rip) # 0x605268 <_ZN7ToolBox15Instrumentation6detail19ProfilerMeasurementI12MeasurementsLS3_0EE14totalTimeSpentE>
   0x0000000000402730 <+48>:  incl 0x202b3a(%rip) # 0x605270 <_ZN7ToolBox15Instrumentation6detail19ProfilerMeasurementI12MeasurementsLS3_0EE10executionsE>
   0x0000000000402736 <+54>:  add $0x8,%rsp
   0x000000000040273a <+58>:  pop %rbx
   0x000000000040273b <+59>:  pop %r12
   0x000000000040273d <+61>:  pop %r14
   0x000000000040273f <+63>:  pop %r15
   0x0000000000402741 <+65>:  retq
End of assembler dump.
And here we did not trigger the optimization.
You can compile the program and check the asm code yourself with:
gdb -batch -ex 'file a.out' -ex 'disassemble sum'
I also tried this approach for a matrix multiplication, but gcc was smart enough to detect the matrix multiplication in goto/label form too.
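For reference, a goto/label matrix multiplication along these lines (a sketch of mine, assuming n > 0, not the exact code I benchmarked) still presents gcc with a control-flow graph it recognizes as three nested loops:

#include <cstddef>

// n x n matrix multiply written only with backward gotos,
// structurally identical to three nested for loops
// (do-while style: assumes n > 0).
void matmul(const double* a, const double* b, double* c, std::size_t n)
{
    std::size_t i = 0;
loop_i:
    std::size_t j = 0;
loop_j:
    double acc = 0.0;
    std::size_t k = 0;
loop_k:
    acc += a[i * n + k] * b[k * n + j];
    if (++k < n) goto loop_k;
    c[i * n + j] = acc;
    if (++j < n) goto loop_j;
    if (++i < n) goto loop_i;
}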
Conclusion:
Even though we did not see a speed loss here, we saw that gcc could not apply an optimization that might speed up the computation. In more CPU-bound tasks, that may cause a drop in performance.
Answer 2
Could it? Sure, theoretically, if you were using an especially dumb compiler or were compiling with optimizations disabled.
In practice, absolutely not. An optimizing compiler has very little difficulty optimizing loops and switch statements. You are far more likely to confuse the optimizer with unconditional jumps than if you play by the normal rules and use looping constructs that it is familiar with and programmed to optimize accordingly.
This just goes back to a general rule that optimizers do their best work with standard, idiomatic code because that's what they have been trained to recognize. Given their extensive use of pattern matching, if you deviate from normal patterns, you are less likely to get optimal code output.
For example: If I had a while loop nested inside a switch statement, which would be nested in another loop which would be nested inside of another switch case
Yeah, I'm confused by reading your description of the code, but a compiler would not be. Unlike humans, compilers have no trouble with nested loops. You aren't going to confuse one by writing valid C++ code, unless the compiler has a bug, in which case all bets are off. No matter how deep the nesting, if the logical result is a jump out of the entire block, the compiler will emit code that does exactly that, just as if you had written a goto yourself. If anything, writing the goto yourself makes sub-optimal code more likely, because of scoping issues that can require the compiler to emit code to save local variables, adjust the stack frame, and so on. And there are many standard optimizations that are normally performed on loops but become either impossible or simply not applied once you throw a goto inside the loop.
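As a minimal sketch of the scoping point (Guard is a made-up type): jumping out of a scope with goto still obliges the compiler to run destructors on the way out, so the goto is not one free jump:

#include <cstdio>

struct Guard {                  // any type with a nontrivial destructor
    ~Guard() { std::puts("cleanup"); }
};

void f(bool bail)
{
    for (int i = 0; i < 10; ++i) {
        Guard g;                // local with a destructor
        if (bail)
            goto out;           // legal, but the compiler must still
                                // emit the call to g's destructor here
        std::printf("%d\n", i);
    }
out:
    return;
}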
Furthermore, it is not the case that jumps always result in faster code, even when they jump over other instructions. On modern, heavily pipelined, out-of-order processors, a mispredicted branch carries a significant penalty, often far more than if the intervening instructions had simply been executed and their results thrown away. Optimizing compilers targeting a particular platform know this and may decide to emit branchless code for performance reasons. If you throw in a goto, you force an unconditional branch and take away their ability to make these decisions strategically.
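As an illustration (a sketch, not guaranteed output of any particular compiler): a simple conditional like the first function below is commonly compiled to a branchless cmov at -O2, while the hand-written goto version pins real control flow that the optimizer must first undo:

// Branch-friendly form: compilers commonly emit a cmov here,
// i.e. no jump at all.
int max_plain(int a, int b)
{
    return a > b ? a : b;
}

// goto form of the same logic: forces actual control flow that the
// optimizer has to canonicalize away before it can go branchless.
int max_goto(int a, int b)
{
    if (a > b) goto pick_a;
    return b;
pick_a:
    return a;
}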
You claim that you want "an actual example instead of theory", but this is setting up a logical fallacy. It makes the assumption that you are correct and that the use of goto can indeed lead to better optimized code than a compiler could generate with looping constructs, but if that is not the case, then there is no code that can prove it. I could show you hundreds of examples of code where a goto resulted in sub-optimal object code being generated, but you could just claim that there still existed some snippet of code where the reverse is true, and there is no way that I (or anyone else) could exhaustively demonstrate that you were wrong.
On the contrary, theoretical arguments are the only way to answer this question (at least if answering in the negative). Furthermore, I would argue that a theoretical understanding is all that is required to inform one's writing of code. You should write code assuming that your optimizer is not broken, and only after determining that it actually is broken should you go back and try to figure out how to revise the code to get the output you expected. If, in some bizarre circumstance, you find that this involves a goto, then you can use it. Until that point, assume that the compiler knows how to optimize loops and write the code in the normal, readable way, assured that in the vast majority of circumstances (if not actually 100%) the output will be better than what you could have gotten by trying to outsmart the compiler from the outset.
Answer 3
If you are asking about the exact assembly code, then probably yes: there may be some sequence of assembly instructions you can't force the compiler to generate without using goto. This is not a reason to use goto, however. Compilers tend to change their assembly output with different versions, flags, optimization settings, and even minor unrelated changes in the source code; if you need an exact sequence, you should write that exact sequence as inline assembly.
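For instance, with GCC/Clang extended inline assembly (x86-64 AT&T syntax here; a sketch, not a recommendation), you can pin an exact instruction regardless of optimization level:

// Fix an exact instruction with GCC/Clang extended asm (x86-64 only).
// "=r" is the output register; "0" ties input a to the same register.
unsigned add_exact(unsigned a, unsigned b)
{
    unsigned r;
    asm("addl %2, %0"           // exactly one addl, whatever the -O level
        : "=r"(r)
        : "0"(a), "r"(b));
    return r;
}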
In general, no: you can always write equivalent loop-based code, at the cost of at most a few extra variables, as sketched below.
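A sketch of that claim: any goto-based state machine can be rewritten as a loop over one explicit state variable (the states and the work here are placeholders):

// goto version:
//   s0: ... goto s1;
//   s1: ... goto s0 or fall through to the end
//
// Equivalent loop version: one extra variable ('state') replaces the labels.
enum class State { S0, S1, Done };

void run()
{
    State state = State::S0;
    while (state != State::Done) {
        switch (state) {
        case State::S0:
            /* ... work ... */
            state = State::S1;
            break;
        case State::S1:
            /* ... work ... */
            state = State::Done;  // or back to State::S0 on some condition
            break;
        default:
            break;
        }
    }
}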
Answer 4
goto/while/do are just high-level language constructs. Once they are lowered from the AST into the compiler's intermediate representation (IR), they disappear and everything is implemented in terms of branches.
Take, for example, this code:
double calcsum(double* val, unsigned int count)
{
    double sum = 0;
    for (unsigned int j = 0; j < count; ++j)
    {
        sum += val[j];
    }
    return sum;
}
Compile it such that it generates IR:
clang++ -S -emit-llvm -O3 test.cpp # will create test.ll
Look at the generated IR:
$ cat test.ll
define double @_Z7calcsumPdj(double* nocapture readonly, i32) local_unnamed_addr #0 {
  %3 = icmp eq i32 %1, 0
  br i1 %3, label %26, label %4

; <label>:4:                                      ; preds = %2
  %5 = load double, double* %0, align 8, !tbaa !1
  %6 = sub i32 0, %1
  %7 = and i32 %6, 7
  %8 = icmp ugt i32 %1, -8
  br i1 %8, label %12, label %9

; <label>:9:                                      ; preds = %4
  %10 = sub i32 %6, %7
  br label %28

; <label>:11:                                     ; preds = %28
  br label %12

; <label>:12:                                     ; preds = %11, %4
  %13 = phi double [ undef, %4 ], [ %38, %11 ]
  %14 = phi double [ 0.000000e+00, %4 ], [ %38, %11 ]
  %15 = icmp eq i32 %7, 0
  br i1 %15, label %24, label %16

; <label>:16:                                     ; preds = %12
  br label %17

; <label>:17:                                     ; preds = %17, %16
  %18 = phi double [ %14, %16 ], [ %20, %17 ]
  %19 = phi i32 [ %7, %16 ], [ %21, %17 ]
  %20 = fadd double %18, %5
  %21 = add i32 %19, -1
  %22 = icmp eq i32 %21, 0
  br i1 %22, label %23, label %17, !llvm.loop !5

; <label>:23:                                     ; preds = %17
  br label %24

; <label>:24:                                     ; preds = %12, %23
  %25 = phi double [ %13, %12 ], [ %20, %23 ]
  br label %26

; <label>:26:                                     ; preds = %24, %2
  %27 = phi double [ 0.000000e+00, %2 ], [ %25, %24 ]
  ret double %27

; <label>:28:                                     ; preds = %28, %9
  %29 = phi double [ 0.000000e+00, %9 ], [ %38, %28 ]
  %30 = phi i32 [ %10, %9 ], [ %39, %28 ]
  %31 = fadd double %29, %5
  %32 = fadd double %31, %5
  %33 = fadd double %32, %5
  %34 = fadd double %33, %5
  %35 = fadd double %34, %5
  %36 = fadd double %35, %5
  %37 = fadd double %36, %5
  %38 = fadd double %37, %5
  %39 = add i32 %30, -8
  %40 = icmp eq i32 %39, 0
  br i1 %40, label %11, label %28
}
You can see that there are no while/do constructs anymore; everything is branches. Obviously, this still gets compiled further into assembly language.
But besides this, how "fast" code is today depends on whether the compiler can match a good optimization pass to it. By using goto you would potentially forfeit loop-optimization passes such as LCSSA, LICM, loop deletion, loop strength reduction, loop simplify, loop unrolling, and loop unswitching:
http://llvm.org/docs/Passes.html#lcssa-loop-closed-ssa-form-pass
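You can watch some of these passes work on the IR above with LLVM's opt tool; the flag spelling depends on your LLVM version (legacy pass manager vs. the newer -passes syntax):

opt -S -licm -loop-unroll test.ll -o test_opt.ll          # older LLVM (legacy pass manager)
opt -S -passes='licm,loop-unroll' test.ll -o test_opt.ll  # newer LLVM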
Answer 5
I think that if you are trying to optimize your code this way, you are doing something wrong.
- Compiler optimizations are made per target platform; you cannot do that by hand.
- There are a lot of compilers; how can you be sure your "optimization" will not be a degradation on a different platform?
- There are so many operating systems, processors, and other variables that such an optimization is unlikely to survive them all.
- Programming languages are made for humans, not machines. If you want that level of control, why not write machine instructions directly? You would have better control and could optimize more (but take point 3 into account).
Anyway, if you are looking for micro-optimizations on your current platform, it is better to take full control and use asm.