微优化c比较功能

微优化c比较功能,第1张

概述我有一个Compare()函数,如下所示: inline bool Compare(bool greater, int p1, int p2) { if (greater) return p1>=p2; else return p1<=p2;} 我决定优化以避免分歧: inline bool Compare2(bool greater, int p1, int p2) { bool 我有一个Compare()函数,如下所示:
inline bool Compare(bool greater,int p1,int p2) {  if (greater) return p1>=p2;  else return p1<=p2;}

我决定优化以避免分歧:

inline bool Compare2(bool greater,int p2) {  bool ret[2] = {p1<=p2,p1>=p2};  return ret[greater];}

然后我通过这样做测试:

bool x = true;int M = 100000;int N = 100;bool a[N];int b[N];int c[N];for (int i=0;i<N; ++i) {  a[i] = rand()%2;  b[i] = rand()%128;  c[i] = rand()%128;}// Timed the below loop with both Compare() and Compare2()for (int j=0; j<M; ++j) {  for (int i=0; i<N; ++i) {    x ^= Compare(a[i],b[i],c[i]);  }}

结果:

Compare(): 3.14ns avgCompare2(): 1.61ns avg

我会说个案封闭,避免分支FTW.但为了完整,我换了

a[i] = rand()%2;

有:

a[i] = true;

并得到〜3.14ns的完全相同的测量.大概当时没有分支,编译器实际上是重写Compare()来避免if语句.但是,为什么Compare2()更快?

不幸的是,我是汇编代码文盲,否则我本来会尝试回答这个问题.

编辑:下面是一些程序集:

_Z7Comparebii:.LFB4:    .cfi_startproc    .cfi_personality 0x3,__gxx_personality_v0    pushq   %rbp    .cfi_def_cfa_offset 16    movq    %rsp,%rbp    .cfi_offset 6,-16    .cfi_def_cfa_register 6    movl    %edi,%eax    movl    %esi,-8(%rbp)    movl    %edx,-12(%rbp)    movb    %al,-4(%rbp)    cmpb    
class StackOverflowFixture : public celero::TestFixture{  public:    StackOverflowFixture()    {    }    inline bool NoOp(bool greater,int p2)     {      return true;    }    inline bool Compare(bool greater,int p2)     {      if(greater == true)      {        return p1>=p2;      }      return p1<=p2;    }    inline bool Compare2(bool greater,int p2)    {      bool ret[2] = {p1<=p2,p1>=p2};      return ret[greater];    }    inline bool Compare3(bool greater,int p2)     {      return (!greater != !(p1 <= p2)) | (p1 == p2);    }    inline bool Compare4(bool greater,int p2)     {      return (greater ^ (p1 <= p2)) | (p1 == p2);    }};BASEliNE_F(StackOverflow,Baseline,StackOverflowFixture,100,5000000){  celero::DoNotoptimizeAway(NoOp(rand()%2,rand(),rand()));}BENCHMARK_F(StackOverflow,Compare,5000000){  celero::DoNotoptimizeAway(Compare(rand()%2,Compare2,5000000){  celero::DoNotoptimizeAway(Compare2(rand()%2,Compare3,5000000){  celero::DoNotoptimizeAway(Compare3(rand()%2,Compare4,5000000){  celero::DoNotoptimizeAway(Compare4(rand()%2,rand()));}
,-4(%rbp) je .L2 movl -8(%rbp),%eax cmpl -12(%rbp),%eax setge %al jmp .L3.L2: movl -8(%rbp),%eax setle %al.L3: leave ret .cfi_endproc.LFE4: .size _Z7Comparebii,.-_Z7Comparebii .section .text._Z8Compare2bii,"axG",@progbits,_Z8Compare2bii,comdat .weak _Z8Compare2bii .type _Z8Compare2bii,@function_Z8Compare2bii:.LFB5: .cfi_startproc .cfi_personality 0x3,-24(%rbp) movl %edx,-28(%rbp) movb %al,-20(%rbp) movw
[==========][  CELERO  ][==========][ STAGE    ] Baselining[==========][ RUN      ] StackOverflow.Baseline -- 100 samples,5000000 calls per run.[     DONE ] StackOverflow.Baseline  (0.690499 sec) [5000000 calls in 690499 usec] [0.138100 us/call] [7241140.103027 calls/sec][==========][ STAGE    ] Benchmarking[==========][ RUN      ] StackOverflow.Compare -- 100 samples,5000000 calls per run.[     DONE ] StackOverflow.Compare  (0.782818 sec) [5000000 calls in 782818 usec] [0.156564 us/call] [6387180.672902 calls/sec][ BASEliNE ] StackOverflow.Compare 1.133699[ RUN      ] StackOverflow.Compare2 -- 100 samples,5000000 calls per run.[     DONE ] StackOverflow.Compare2  (0.700767 sec) [5000000 calls in 700767 usec] [0.140153 us/call] [7135039.178500 calls/sec][ BASEliNE ] StackOverflow.Compare2 1.014870[ RUN      ] StackOverflow.Compare3 -- 100 samples,5000000 calls per run.[     DONE ] StackOverflow.Compare3  (0.709471 sec) [5000000 calls in 709471 usec] [0.141894 us/call] [7047504.408214 calls/sec][ BASEliNE ] StackOverflow.Compare3 1.027476[ RUN      ] StackOverflow.Compare4 -- 100 samples,5000000 calls per run.[     DONE ] StackOverflow.Compare4  (0.712940 sec) [5000000 calls in 712940 usec] [0.142588 us/call] [7013212.893091 calls/sec][ BASEliNE ] StackOverflow.Compare4 1.032500[==========][ COMPLETE ][==========]
,-16(%rbp) movl -24(%rbp),%eax cmpl -28(%rbp),%eax setle %al movb %al,%eax setge %al movb %al,-15(%rbp) movzbl -20(%rbp),%eax cltq movzbl -16(%rbp,%rax),%eax leave ret .cfi_endproc.LFE5: .size _Z8Compare2bii,.-_Z8Compare2bii .text

现在,执行测试的实际代码可能是使用上述两个函数的内联版本,所以有可能这是错误的代码分析.就这样说,我在Compare()中看到一个jmp命令,所以我认为这意味着它是分支.如果是这样,我猜这个问题就变成了:为什么当我将一个[i]从rand()%2更改为true(或者对于该事情是false)时,为什么分支预测器不能提高Compare()的性能?

编辑2:我用“分支”代替“分支预测”,使我的帖子更加明智.

解决方法 我写了一个名为Celero的C库,旨在测试这样的优化和替代方案. (无耻自我推销: https://github.com/DigitalInBlue/Celero)

我使用以下代码运行您的案例:

cmp r8d,r9dmovzx   eax,dlsetle   BYTE PTR ret$[rsp]cmp r8d,r9dsetge   BYTE PTR ret$[rsp+1]movzx   eax,BYTE PTR ret$[rsp+rax]

结果如下:

xor r11d,r11dcmp r8d,r9dmov r10d,r11dsetg    r10btest    dl,dlmov ecx,r11dsete    clmov eax,r11dcmp ecx,r10dsetne   alcmp r8d,r9dsete    r11bor  eax,r11d

考虑到这个测试,看起来Compare2是这个微型优化的最佳选择.

编辑:

Compare2装配(最好的情况):

比较3装配(最好的情况):

总结

以上是内存溢出为你收集整理的微优化c比较功能全部内容,希望文章能够帮你解决微优化c比较功能所遇到的程序开发问题。

如果觉得内存溢出网站内容还不错,欢迎将内存溢出网站推荐给程序员好友。

欢迎分享,转载请注明来源:内存溢出

原文地址: http://outofmemory.cn/langs/1233596.html

(0)
打赏 微信扫一扫 微信扫一扫 支付宝扫一扫 支付宝扫一扫
上一篇 2022-06-06
下一篇 2022-06-06

发表评论

登录后才能评论

评论列表(0条)

保存