linux-kernel - Why does the CPU "insn per cycle" differ between similar CPUs, and how does "MONITOR-MWAIT" work in Linux?

Background:
I have two servers. Both run kernel 4.18.7, built with CONFIG_BPF_SYSCALL=y.

I created a shell script, x.sh:

    i=0
    while (( i < 1000000 ))
    do
        (( i ++ ))
    done

and ran it with:

    perf stat ./x.sh

The shell version is "4.2.6(1)-release" on both servers.

S1:
CPU: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, microcode 0xb00002e
perf stat result:

       5391.653531      task-clock (msec)         #    1.000 CPUs utilized
                 4      context-switches          #    0.001 K/sec
                 0      cpu-migrations            #    0.000 K/sec
               107      page-faults               #    0.020 K/sec
    12,910,036,202      cycles                    #    2.394 GHz
    27,055,073,385      instructions              #    2.10  insn per cycle
     6,527,267,657      branches                  # 1210.624 M/sec
        34,787,686      branch-misses             #    0.53% of all branches

       5.392121575 seconds time elapsed

S2:
CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz, microcode 0xb00002e
perf stat result:

      10688.669439      task-clock (msec)         #    1.000 CPUs utilized
                 6      context-switches          #    0.001 K/sec
                 0      cpu-migrations            #    0.000 K/sec
               105      page-faults               #    0.010 K/sec
    24,583,857,467      cycles                    #    2.300 GHz
    27,117,299,405      instructions              #    1.10  insn per cycle
     6,571,204,123      branches                  #  614.782 M/sec
        32,996,513      branch-misses             #    0.50% of all branches

      10.688907278 seconds time elapsed

Question:
The CPUs are similar and the kernel is the same, so why do the cycle counts reported by perf stat differ so much?
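
The "insn per cycle" column is the instruction count divided by the cycle count. As a quick sanity check, the small sketch below recomputes it from the counters in the two perf stat runs above. The helper name `ipc` is only illustrative and is not part of perf.

```shell
#!/usr/bin/env bash
# ipc INSTRUCTIONS CYCLES: print instructions per cycle, rounded to two decimals.
# Both arguments are raw counter values copied from perf stat (commas removed).
ipc() {
    awk -v insn="$1" -v cyc="$2" 'BEGIN { printf "%.2f\n", insn / cyc }'
}

echo "S1: $(ipc 27055073385 12910036202) insn per cycle"
echo "S2: $(ipc 27117299405 24583857467) insn per cycle"
```

Both servers retire almost the same number of instructions (about 27 billion), so the workload is the same; S2 simply needs roughly 1.9 times as many cycles to get through it.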

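Edit:
I modified the script and the command.
x.sh, with the loop count reduced to shorten the run time:

    i=0
    while (( i < 10000 ))
    do
        (( i ++ ))
    done

The command, with more detail and repeated runs:

    perf stat -d -d -d -r 100 ~/1.sh

Results
S1: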

         54.007015      task-clock (msec)         #    0.993 CPUs utilized            ( +-  0.09% )
                 0      context-switches          #    0.002 K/sec                    ( +- 29.68% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
               106      page-faults               #    0.002 M/sec                    ( +-  0.12% )
       128,380,832      cycles                    #    2.377 GHz                      ( +-  0.09% )  (30.52%)
       252,497,672      instructions              #    1.97  insn per cycle           ( +-  0.01% )  (39.75%)
        60,741,861      branches                  # 1124.703 M/sec                    ( +-  0.01% )  (40.63%)
           451,011      branch-misses             #    0.74% of all branches          ( +-  0.29% )  (40.72%)
        66,621,188      L1-dcache-loads           # 1233.565 M/sec                    ( +-  0.01% )  (40.76%)
            52,248      L1-dcache-load-misses     #    0.08% of all L1-dcache hits    ( +-  4.55% )  (39.86%)
             1,568      LLC-loads                 #    0.029 M/sec                    ( +-  9.58% )  (29.75%)
               168      LLC-load-misses           #   21.47% of all LL-cache hits     ( +-  3.87% )  (29.66%)
   <not supported>      L1-icache-loads
           672,212      L1-icache-load-misses                                         ( +-  0.85% )  (29.62%)
        67,630,589      dTLB-loads                # 1252.256 M/sec                    ( +-  0.01% )  (29.62%)
             1,051      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +- 33.11% )  (29.62%)
            13,929      iTLB-loads                #    0.258 M/sec                    ( +- 17.85% )  (29.62%)
            44,327      iTLB-load-misses          #  318.24% of all iTLB cache hits   ( +-  8.12% )  (29.62%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

       0.054370018 seconds time elapsed                                          ( +-  0.08% )

S2:

        106.405511      task-clock (msec)         #    0.996 CPUs utilized            ( +-  0.07% )
                 0      context-switches          #    0.002 K/sec                    ( +- 18.92% )
                 0      cpu-migrations            #    0.000 K/sec
               106      page-faults               #    0.994 K/sec                    ( +-  0.09% )
       242,242,714      cycles                    #    2.277 GHz                      ( +-  0.07% )  (30.55%)
       260,394,910      instructions              #    1.07  insn per cycle           ( +-  0.01% )  (39.00%)
        62,877,430      branches                  #  590.923 M/sec                    ( +-  0.01% )  (39.65%)
           407,887      branch-misses             #    0.65% of all branches          ( +-  0.25% )  (39.81%)
        68,137,265      L1-dcache-loads           #  640.355 M/sec                    ( +-  0.01% )  (39.84%)
            70,330      L1-dcache-load-misses     #    0.10% of all L1-dcache hits    ( +-  2.91% )  (39.38%)
             3,526      LLC-loads                 #    0.033 M/sec                    ( +-  7.33% )  (30.28%)
               153      LLC-load-misses           #    8.69% of all LL-cache hits     ( +-  6.29% )  (30.12%)
   <not supported>      L1-icache-loads
           878,021      L1-icache-load-misses                                         ( +-  0.43% )  (30.09%)
        68,442,021      dTLB-loads                #  643.219 M/sec                    ( +-  0.01% )  (30.07%)
             9,518      dTLB-load-misses          #    0.01% of all dTLB cache hits   ( +-  2.58% )  (30.07%)
           233,190      iTLB-loads                #    2.192 M/sec                    ( +-  3.73% )  (30.07%)
            17,837      iTLB-load-misses          #    7.65% of all iTLB cache hits   ( +- 13.21% )  (30.07%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

       0.106858870 seconds time elapsed                                          ( +-  0.07% )

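Edit:
I checked that the md5sum of /usr/bin/sh is identical on both servers and added the shebang line #!/usr/bin/sh to the script. The results are the same as before.

Edit:
I found some useful differences with the command perf diff perf.data.s2 perf.data.s1

It first prints some warnings:

    /usr/lib64/ld-2.17.so with build ID 93d2e4a501823d041413eeb652b89044d1f680ee not found, continuing without symbols
    /usr/lib64/libc-2.17.so with build ID b04a54c443d36058702ab4060c63f4ab3273eae9 not found, continuing without symbols

This led me to find that the rpm versions differ between the two servers.

perf diff shows: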

# Event 'cycles'
#
# Baseline    Delta  Shared Object      Symbol
# ........  .......  .................  ..............................................
#
    21.20%   +3.83%  bash               [.] 0x000000000002c0f0
    10.22%           libc-2.17.so       [.] _int_free
     9.11%           libc-2.17.so       [.] _int_malloc
     7.97%           libc-2.17.so       [.] malloc
     4.09%           libc-2.17.so       [.] __gconv_transform_utf8_internal
     3.71%           libc-2.17.so       [.] __mbrtowc
     3.48%   -1.63%  bash               [.] execute_command_internal
     3.48%   +1.18%  [unknown]          [k] 0xfffffe0000032000
     3.25%   -1.87%  bash               [.] xmalloc
     3.12%           libc-2.17.so       [.] __strcpy_sse2_unaligned
     2.44%   +2.22%  [kernel.kallsyms]  [k] syscall_return_via_sysret
     2.09%   -0.24%  bash               [.] evalexp
     2.09%           libc-2.17.so       [.] __ctype_get_mb_cur_max
     1.92%           libc-2.17.so       [.] free
     1.41%   -0.95%  bash               [.] dequote_string
     1.19%   +0.23%  bash               [.] stupidly_hack_special_variables
     1.16%           libc-2.17.so       [.] __strlen_sse2_pminub
     1.16%           libc-2.17.so       [.] __memcpy_ssse3_back
     1.16%           libc-2.17.so       [.] __strcmp_sse42
     0.93%   -0.01%  bash               [.] mbschr
     0.93%   -0.47%  bash               [.] hash_search
     0.70%           libc-2.17.so       [.] __sigprocmask
     0.70%   -0.23%  bash               [.] dispose_words
     0.70%   -0.23%  bash               [.] execute_command
     0.70%   -0.23%  bash               [.] set_pipestatus_array
     0.70%           bash               [.] run_pending_traps
     0.47%           bash               [.] malloc@plt
     0.47%           bash               [.] var_lookup
     0.47%           bash               [.] fmtumax
     0.47%           bash               [.] do_redirections
     0.46%           bash               [.] dispose_word
     0.46%   -0.00%  bash               [.] alloc_word_desc
     0.46%   -0.00%  [kernel.kallsyms]  [k] _copy_to_user
     0.46%           libc-2.17.so       [.] __ctype_b_loc
     0.46%           bash               [.] new_fd_bitmap
     0.46%           bash               [.] add_unwind_protect
     0.46%   -0.00%  bash               [.] discard_unwind_frame
     0.46%           bash               [.] memcpy@plt
     0.46%           bash               [.] __ctype_get_mb_cur_max@plt
     0.46%           bash               [.] signal_in_progress
     0.40%           libc-2.17.so       [.] _IO_vfscanf
     0.40%           ld-2.17.so         [.] do_lookup_x
     0.27%           bash               [.] mbrtowc@plt
     0.24%   +1.60%  [kernel.kallsyms]  [k] __x64_sys_rt_sigprocmask
     0.23%           bash               [.] list_append
     0.23%           bash               [.] bind_variable
     0.23%   +0.69%  [kernel.kallsyms]  [k] entry_SYSCALL_64_stage2
     0.23%   +0.69%  [kernel.kallsyms]  [k] do_syscall_64
     0.23%           libc-2.17.so       [.] _dl_mcount_wrapper_check
     0.23%   +0.69%  bash               [.] make_word_list
     0.23%   +0.69%  [kernel.kallsyms]  [k] copy_user_generic_unrolled
     0.23%           [kernel.kallsyms]  [k] unmap_page_range
     0.23%           libc-2.17.so       [.] __sigjmp_save
     0.23%   +0.23%  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
     0.20%           [kernel.kallsyms]  [k] swapgs_restore_regs_and_return_to_usermode
     0.03%           [kernel.kallsyms]  [k] page_fault
     0.00%           [kernel.kallsyms]  [k] xfs_bmapi_read
     0.00%           [kernel.kallsyms]  [k] xfs_release
     0.00%   +0.00%  [kernel.kallsyms]  [k] native_write_msr
            +45.33%  libc-2.17.so       [.] 0x0000000000027cc6
             +0.52%  [kernel.kallsyms]  [k] __mod_node_page_state
             +0.46%  bash               [.] free@plt
             +0.46%  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
             +0.46%  bash               [.] begin_unwind_frame
             +0.46%  bash               [.] make_bare_word
             +0.46%  bash               [.] find_variable_internal
             +0.37%  ld-2.17.so         [.] 0x0000000000009b13

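Maybe the glibc difference is the answer!

Edit:
Finally, I checked the BIOS configuration and saw that the S2 server was in power-saving mode. That is the real answer!

However, one BIOS option still confuses me: MONITOR-MWAIT. Even in "Max Performance Mode", S2 performs poorly as long as "MONITOR-MWAIT" is enabled. The command cpupower idle-info -o shows the CPU using C-states, although they are supposed to be disabled in "Max Performance Mode". I have to disable MONITOR-MWAIT in addition to selecting "Max Performance Mode" to get better performance.

The description of "MONITOR-MWAIT" says that some operating systems check this option to re-enable C-states. I cannot find how the Linux kernel uses it to change C-states...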

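Solution:
I found the answer.

First, consider the BIOS MONITOR/MWAIT option under kernel 4.18.7.
This kernel uses the intel_idle driver. The driver only checks whether the system supports the mwait instruction and does not care whether C-states are enabled in the BIOS.
So once MONITOR/MWAIT is enabled, the intel_idle driver is used and C-states are forced on, which looks like power-saving mode.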
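
To see which idle driver a machine ended up with, the kernel's cpuidle sysfs files can be read directly. The following is a minimal sketch: the function name `show_cpuidle` is made up for illustration, and it only looks at cpu0, assuming the other CPUs expose the same states.

```shell
#!/usr/bin/env bash
# show_cpuidle [SYSFS_CPU_DIR]: print the active cpuidle driver and the idle
# states of cpu0 with their exit latencies in microseconds.
# The directory argument exists only so the function can be pointed at a copy
# of the tree; by default it reads the live sysfs.
show_cpuidle() {
    local root="${1:-/sys/devices/system/cpu}"
    local state
    if [ -r "$root/cpuidle/current_driver" ]; then
        echo "driver: $(cat "$root/cpuidle/current_driver")"
    else
        echo "driver: unknown"
    fi
    for state in "$root"/cpu0/cpuidle/state*; do
        # Skip the unexpanded glob when no idle states are exposed.
        [ -d "$state" ] || continue
        echo "$(cat "$state/name") latency=$(cat "$state/latency")us"
    done
}

show_cpuidle "$@"
```

On a server in the state described here, the first line would be expected to read `driver: intel_idle`. With MONITOR/MWAIT disabled in the BIOS it should show a different driver or none, though the exact value depends on the platform.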

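Second, why does insn per cycle differ?
Because the tuned service is in use, the active profile is "latency-performance", and its force_latency is 1 us.
When C-states are in use, only the C-state levels whose latency is below force_latency will be used: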

# cpupower idle-info
CPUidle driver: intel_idle
CPUidle governor: menu
analyzing CPU 0:

Number of idle states: 5
Available idle states: POLL C1 C1E C3 C6
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
Usage: 13034605
Duration: 820867557
C1:
Flags/Description: MWAIT 0x00
Latency: 2
Usage: 349471619
Duration: 344311623672
C1E:
Flags/Description: MWAIT 0x01
Latency: 10
Usage: 237
Duration: 55999
C3:
Flags/Description: MWAIT 0x10
Latency: 40
Usage: 350
Duration: 168988
C6:
Flags/Description: MWAIT 0x20
Latency: 133
Usage: 3696
Duration: 17809893

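Only the POLL level has a latency below 1 us, and the POLL level keeps the CPU busy executing nop instructions instead of halting.
With hyper-threading enabled, this cuts the instruction execution rate of the sibling thread roughly in half.
The two logical cores share one ALU, so while one of them spins on nop instructions, the other has to wait for it.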
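
The selection rule described above can be sketched as a simple filter over the latencies from the cpupower idle-info output. This is an illustration of the idea, not the kernel's governor code: the helper name `allowed_states` is hypothetical, and treating a state whose latency equals the limit as allowed is an assumption that does not affect these values.

```shell
#!/usr/bin/env bash
# allowed_states LIMIT_US: read "name latency_us" pairs on stdin and print the
# names of the idle states whose exit latency does not exceed LIMIT_US.
allowed_states() {
    awk -v limit="$1" '$2 + 0 <= limit + 0 { print $1 }'
}

# Latencies copied from the cpupower idle-info output.
states='POLL 0
C1 2
C1E 10
C3 40
C6 133'

echo "limit 1us:   $(printf '%s\n' "$states" | allowed_states 1 | tr '\n' ' ')"
echo "limit 100us: $(printf '%s\n' "$states" | allowed_states 100 | tr '\n' ' ')"
```

With a 1 us limit only POLL survives the filter, which matches the behaviour described above: an idle logical core never reaches C1 or deeper and keeps spinning.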

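If you disable the MONITOR/MWAIT option, the intel_idle driver is disabled, so tuned's force_latency no longer takes effect. The idle logical core then halts, which lets the other one use the ALU exclusively.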
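
To confirm what tuned is asking for, the active profile can be listed with `tuned-adm active`, and the force_latency setting can be read from the profile file. The sketch below assumes the stock profile lives at /usr/lib/tuned/latency-performance/tuned.conf, which may differ on your distribution; the helper name `force_latency_of` is illustrative.

```shell
#!/usr/bin/env bash
# force_latency_of FILE: print the force_latency value set in a tuned profile,
# or nothing if the profile does not set one.
force_latency_of() {
    awk -F= '/^[[:space:]]*force_latency[[:space:]]*=/ {
        gsub(/[[:space:]]/, "", $2)   # drop spaces around the value
        print $2
        exit
    }' "$1"
}

# Assumed location of the stock profile; adjust the path if yours differs.
conf=/usr/lib/tuned/latency-performance/tuned.conf
if [ -r "$conf" ]; then
    echo "force_latency: $(force_latency_of "$conf")"
else
    echo "no tuned profile found at $conf"
fi
```

A value of 1 here corresponds to the 1 us limit discussed in this answer.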

Finally, thanks to everyone, especially @Peter Cordes and @osgx, for getting me to check the BIOS. The command echo 2^1234567%2 | bc is neat!


Original URL: https://outofmemory.cn/yw/1032289.html