I have 2 servers. Both run OS kernel version 4.18.7, built with CONFIG_BPF_SYSCALL=y.
I created a shell script, x.sh:
    i=0
    while (( i < 1000000 ))
    do
      (( i ++ ))
    done
and ran the command: perf stat ./x.sh
The shell version on both servers is "4.2.6(1)-release".
S1:
CPU: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, microcode 0xb00002e
perf stat result:
       5391.653531      task-clock (msec)         #    1.000 CPUs utilized
                 4      context-switches          #    0.001 K/sec
                 0      cpu-migrations            #    0.000 K/sec
               107      page-faults               #    0.020 K/sec
    12,910,036,202      cycles                    #    2.394 GHz
    27,055,073,385      instructions              #    2.10  insn per cycle
     6,527,267,657      branches                  # 1210.624 M/sec
        34,787,686      branch-misses             #    0.53% of all branches

       5.392121575 seconds time elapsed
S2:
CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz, microcode 0xb00002e
perf stat result:
      10688.669439      task-clock (msec)         #    1.000 CPUs utilized
                 6      context-switches          #    0.001 K/sec
                 0      cpu-migrations            #    0.000 K/sec
               105      page-faults               #    0.010 K/sec
    24,583,857,467      cycles                    #    2.300 GHz
    27,117,299,405      instructions              #    1.10  insn per cycle
     6,571,204,123      branches                  #  614.782 M/sec
        32,996,513      branch-misses             #    0.50% of all branches

      10.688907278 seconds time elapsed
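Question:
The CPUs are similar and the OS kernel is the same, so why do the cycle counts reported by perf stat differ so much? Both servers retire about 27 billion instructions, yet S2 needs almost twice as many cycles.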
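The "insn per cycle" column is simply instructions divided by cycles, so the gap can be checked by hand. The sketch below recomputes it from the counters in the two perf stat runs above; the helper name ipc is only an illustration, not part of perf.

```shell
# ipc INSTRUCTIONS CYCLES: print instructions per cycle with two decimals.
ipc() {
    awk -v insn="$1" -v cyc="$2" 'BEGIN { printf "%.2f\n", insn / cyc }'
}

# Counter values copied from the two perf stat runs above.
echo "S1: $(ipc 27055073385 12910036202) insn per cycle"
echo "S2: $(ipc 27117299405 24583857467) insn per cycle"
```

The instruction counts are nearly equal, so the difference is almost entirely in the cycles: S2 spends roughly 1.9 times as many cycles on the same work.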
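Edit:
I modified the script and the command.
x.sh, with a smaller loop count to reduce the run time:
    i=0
    while (( i < 10000 ))
    do
      (( i ++ ))
    done
The command, with more detail and repeated runs:
    perf stat -d -d -d -r 100 ~/1.sh
Results: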
S1:
         54.007015      task-clock (msec)         #    0.993 CPUs utilized            ( +-  0.09% )
                 0      context-switches          #    0.002 K/sec                    ( +- 29.68% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
               106      page-faults               #    0.002 M/sec                    ( +-  0.12% )
       128,380,832      cycles                    #    2.377 GHz                      ( +-  0.09% )  (30.52%)
       252,497,672      instructions              #    1.97  insn per cycle           ( +-  0.01% )  (39.75%)
        60,741,861      branches                  # 1124.703 M/sec                    ( +-  0.01% )  (40.63%)
           451,011      branch-misses             #    0.74% of all branches          ( +-  0.29% )  (40.72%)
        66,621,188      L1-dcache-loads           # 1233.565 M/sec                    ( +-  0.01% )  (40.76%)
            52,248      L1-dcache-load-misses     #    0.08% of all L1-dcache hits    ( +-  4.55% )  (39.86%)
             1,568      LLC-loads                 #    0.029 M/sec                    ( +-  9.58% )  (29.75%)
               168      LLC-load-misses           #   21.47% of all LL-cache hits     ( +-  3.87% )  (29.66%)
   <not supported>      L1-icache-loads
           672,212      L1-icache-load-misses                                         ( +-  0.85% )  (29.62%)
        67,630,589      dTLB-loads                # 1252.256 M/sec                    ( +-  0.01% )  (29.62%)
             1,051      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +- 33.11% )  (29.62%)
            13,929      iTLB-loads                #    0.258 M/sec                    ( +- 17.85% )  (29.62%)
            44,327      iTLB-load-misses          #  318.24% of all iTLB cache hits   ( +-  8.12% )  (29.62%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

       0.054370018 seconds time elapsed                                          ( +-  0.08% )
S2:
        106.405511      task-clock (msec)         #    0.996 CPUs utilized            ( +-  0.07% )
                 0      context-switches          #    0.002 K/sec                    ( +- 18.92% )
                 0      cpu-migrations            #    0.000 K/sec
               106      page-faults               #    0.994 K/sec                    ( +-  0.09% )
       242,242,714      cycles                    #    2.277 GHz                      ( +-  0.07% )  (30.55%)
       260,394,910      instructions              #    1.07  insn per cycle           ( +-  0.01% )  (39.00%)
        62,877,430      branches                  #  590.923 M/sec                    ( +-  0.01% )  (39.65%)
           407,887      branch-misses             #    0.65% of all branches          ( +-  0.25% )  (39.81%)
        68,137,265      L1-dcache-loads           #  640.355 M/sec                    ( +-  0.01% )  (39.84%)
            70,330      L1-dcache-load-misses     #    0.10% of all L1-dcache hits    ( +-  2.91% )  (39.38%)
             3,526      LLC-loads                 #    0.033 M/sec                    ( +-  7.33% )  (30.28%)
               153      LLC-load-misses           #    8.69% of all LL-cache hits     ( +-  6.29% )  (30.12%)
   <not supported>      L1-icache-loads
           878,021      L1-icache-load-misses                                         ( +-  0.43% )  (30.09%)
        68,442,021      dTLB-loads                #  643.219 M/sec                    ( +-  0.01% )  (30.07%)
             9,518      dTLB-load-misses          #    0.01% of all dTLB cache hits   ( +-  2.58% )  (30.07%)
           233,190      iTLB-loads                #    2.192 M/sec                    ( +-  3.73% )  (30.07%)
            17,837      iTLB-load-misses          #    7.65% of all iTLB cache hits   ( +- 13.21% )  (30.07%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

       0.106858870 seconds time elapsed                                          ( +-  0.07% )
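Edit:
I checked that the md5sum of /usr/bin/sh is identical on both servers and added the shebang line #! /usr/bin/sh to the script. The results are the same as before.
Edit:
I found some notable differences using the command perf diff perf.data.s2 perf.data.s1
It first prints some warnings:
    /usr/lib64/ld-2.17.so with build ID 93d2e4a501823d041413eeb652b89044d1f680ee not found, continuing without symbols
    /usr/lib64/libc-2.17.so with build ID b04a54c443d36058702ab4060c63f4ab3273eae9 not found, continuing without symbols
From these I found that the rpm versions differ between the two servers.
perf diff shows: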
# Event 'cycles'
#
# Baseline    Delta  Shared Object      Symbol
# ........  .......  .................  ..............................................
#
    21.20%   +3.83%  bash               [.] 0x000000000002c0f0
    10.22%           libc-2.17.so       [.] _int_free
     9.11%           libc-2.17.so       [.] _int_malloc
     7.97%           libc-2.17.so       [.] malloc
     4.09%           libc-2.17.so       [.] __gconv_transform_utf8_internal
     3.71%           libc-2.17.so       [.] __mbrtowc
     3.48%   -1.63%  bash               [.] execute_command_internal
     3.48%   +1.18%  [unknown]          [k] 0xfffffe0000032000
     3.25%   -1.87%  bash               [.] xmalloc
     3.12%           libc-2.17.so       [.] __strcpy_sse2_unaligned
     2.44%   +2.22%  [kernel.kallsyms]  [k] syscall_return_via_sysret
     2.09%   -0.24%  bash               [.] evalexp
     2.09%           libc-2.17.so       [.] __ctype_get_mb_cur_max
     1.92%           libc-2.17.so       [.] free
     1.41%   -0.95%  bash               [.] dequote_string
     1.19%   +0.23%  bash               [.] stupidly_hack_special_variables
     1.16%           libc-2.17.so       [.] __strlen_sse2_pminub
     1.16%           libc-2.17.so       [.] __memcpy_ssse3_back
     1.16%           libc-2.17.so       [.] __strcmp_sse42
     0.93%   -0.01%  bash               [.] mbschr
     0.93%   -0.47%  bash               [.] hash_search
     0.70%           libc-2.17.so       [.] __sigprocmask
     0.70%   -0.23%  bash               [.] dispose_words
     0.70%   -0.23%  bash               [.] execute_command
     0.70%   -0.23%  bash               [.] set_pipestatus_array
     0.70%           bash               [.] run_pending_traps
     0.47%           bash               [.] malloc@plt
     0.47%           bash               [.] var_lookup
     0.47%           bash               [.] fmtumax
     0.47%           bash               [.] do_redirections
     0.46%           bash               [.] dispose_word
     0.46%   -0.00%  bash               [.] alloc_word_desc
     0.46%   -0.00%  [kernel.kallsyms]  [k] _copy_to_user
     0.46%           libc-2.17.so       [.] __ctype_b_loc
     0.46%           bash               [.] new_fd_bitmap
     0.46%           bash               [.] add_unwind_protect
     0.46%   -0.00%  bash               [.] discard_unwind_frame
     0.46%           bash               [.] memcpy@plt
     0.46%           bash               [.] __ctype_get_mb_cur_max@plt
     0.46%           bash               [.] signal_in_progress
     0.40%           libc-2.17.so       [.] _IO_vfscanf
     0.40%           ld-2.17.so         [.] do_lookup_x
     0.27%           bash               [.] mbrtowc@plt
     0.24%   +1.60%  [kernel.kallsyms]  [k] __x64_sys_rt_sigprocmask
     0.23%           bash               [.] list_append
     0.23%           bash               [.] bind_variable
     0.23%   +0.69%  [kernel.kallsyms]  [k] entry_SYSCALL_64_stage2
     0.23%   +0.69%  [kernel.kallsyms]  [k] do_syscall_64
     0.23%           libc-2.17.so       [.] _dl_mcount_wrapper_check
     0.23%   +0.69%  bash               [.] make_word_list
     0.23%   +0.69%  [kernel.kallsyms]  [k] copy_user_generic_unrolled
     0.23%           [kernel.kallsyms]  [k] unmap_page_range
     0.23%           libc-2.17.so       [.] __sigjmp_save
     0.23%   +0.23%  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
     0.20%           [kernel.kallsyms]  [k] swapgs_restore_regs_and_return_to_usermode
     0.03%           [kernel.kallsyms]  [k] page_fault
     0.00%           [kernel.kallsyms]  [k] xfs_bmapi_read
     0.00%           [kernel.kallsyms]  [k] xfs_release
     0.00%   +0.00%  [kernel.kallsyms]  [k] native_write_msr
            +45.33%  libc-2.17.so       [.] 0x0000000000027cc6
             +0.52%  [kernel.kallsyms]  [k] __mod_node_page_state
             +0.46%  bash               [.] free@plt
             +0.46%  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
             +0.46%  bash               [.] begin_unwind_frame
             +0.46%  bash               [.] make_bare_word
             +0.46%  bash               [.] find_variable_internal
             +0.37%  ld-2.17.so         [.] 0x0000000000009b13
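Maybe the glibc difference is the answer!
Edit:
Finally, I checked the BIOS configuration and saw that server S2 was using a power-saving mode. That is the real answer!
However, one BIOS setting still confuses me: MONITOR-MWAIT. Even with "Max Performance Mode" selected, S2 performs badly as long as "MONITOR-MWAIT" is enabled. The command cpupower idle-info -o shows the CPU using C-states, although "Max Performance Mode" is supposed to disable them. Only when MONITOR-MWAIT is disabled in addition to selecting "Max Performance Mode" does performance improve.
The description of "MONITOR-MWAIT" says that some operating systems check this option to re-enable C-states. I cannot find how the Linux kernel uses it to change C-states...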
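Answer:
I found the answer. First, let's look at what the BIOS MONITOR / MWAIT option means for kernel 4.18.7.
On that kernel the intel_idle driver is used. This driver only checks whether the system supports the mwait instruction; it does not care whether C-states are enabled in the BIOS.
So as soon as the MONITOR / MWAIT instructions are available, the intel_idle driver is loaded and C-states are forced on, which behaves like a power-saving mode.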
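One way to confirm which idle driver the kernel actually picked, independently of the BIOS menu, is to read the cpuidle entries in sysfs. The following is a minimal sketch, assuming the usual /sys/devices/system/cpu layout; the function name cpuidle_summary and its optional root-directory argument are illustrative and not from the original post.

```shell
# cpuidle_summary [SYSFS_ROOT] [CPU]: print the active cpuidle driver and the
# idle states of one CPU. SYSFS_ROOT defaults to the real sysfs location; it is
# a parameter only so the function can be pointed at a copy of the tree.
cpuidle_summary() {
    root=${1:-/sys/devices/system/cpu}
    cpu=${2:-cpu0}
    if [ -r "$root/cpuidle/current_driver" ]; then
        printf 'driver: %s\n' "$(cat "$root/cpuidle/current_driver")"
    else
        printf 'driver: unknown\n'
    fi
    # Each stateN directory describes one idle state (POLL, C1, ...).
    for state in "$root/$cpu"/cpuidle/state*; do
        [ -d "$state" ] || continue
        printf '%s latency=%sus disabled=%s\n' \
            "$(cat "$state/name")" "$(cat "$state/latency")" "$(cat "$state/disable")"
    done
}

cpuidle_summary
```

If the driver line reads intel_idle while the BIOS claims C-states are off, that matches the behavior described above.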
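Second, why does insn per cycle differ?
Because the tuned service is in use and the active profile is latency-performance, whose force_latency is 1us.
With C-states in use, the kernel will only pick C-state levels whose latency is below force_latency: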
# cpupower idle-info
CPUidle driver: intel_idle
CPUidle governor: menu
analyzing CPU 0:

Number of idle states: 5
Available idle states: POLL C1 C1E C3 C6
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
Usage: 13034605
Duration: 820867557
C1:
Flags/Description: MWAIT 0x00
Latency: 2
Usage: 349471619
Duration: 344311623672
C1E:
Flags/Description: MWAIT 0x01
Latency: 10
Usage: 237
Duration: 55999
C3:
Flags/Description: MWAIT 0x10
Latency: 40
Usage: 350
Duration: 168988
C6:
Flags/Description: MWAIT 0x20
Latency: 133
Usage: 3696
Duration: 17809893
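As you can see, only the POLL level has a latency below 1us. The POLL level does not halt the CPU at all: it keeps the idle CPU spinning in a busy loop, in effect executing nop instructions.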
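The selection can be reproduced with the latencies listed in the cpupower output. This is only a sketch of the filtering idea, not the menu governor itself: the helper name allowed_states is hypothetical, and the comparison is assumed to be "exit latency not greater than the limit".

```shell
# allowed_states LIMIT_US: read "name latency_us" lines on stdin and print the
# names of the idle states whose latency does not exceed LIMIT_US.
allowed_states() {
    awk -v limit="$1" '$2 + 0 <= limit + 0 { print $1 }'
}

# Latencies (in microseconds) copied from the cpupower idle-info output above.
states='POLL 0
C1 2
C1E 10
C3 40
C6 133'

# force_latency of the latency-performance profile is 1us.
printf '%s\n' "$states" | allowed_states 1
```

With a 1us limit nothing deeper than POLL qualifies, because even C1 reports a 2us latency.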
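When an idle logical CPU is stuck in POLL and Hyper-Threading is enabled, the instruction rate of its sibling drops by about half.
The two logical cores share the execution resources (ALUs) of one physical core, so while one of them spins on nop instructions the other has to wait for it.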
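If you disable the MONITOR / MWAIT option in the BIOS, the intel_idle driver is disabled as well, so the force_latency set by tuned no longer keeps the CPU in POLL. The idle logical core is then really halted, and its sibling gets exclusive use of the ALUs.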
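To see which logical CPUs share a physical core, and therefore which idle sibling can slow the benchmark down, the topology files in sysfs can be read. A small sketch, assuming the standard sysfs topology layout; the function name smt_siblings and its optional root argument are made up for illustration.

```shell
# smt_siblings [CPU_NUMBER] [SYSFS_ROOT]: print the logical CPUs that share a
# physical core with the given CPU, e.g. "0,20", or "unavailable".
smt_siblings() {
    cpu=${1:-0}
    root=${2:-/sys/devices/system/cpu}
    file="$root/cpu$cpu/topology/thread_siblings_list"
    if [ -r "$file" ]; then
        cat "$file"
    else
        echo "unavailable"
    fi
}

smt_siblings 0
```

Pinning the script to one logical CPU with taskset and watching the idle state of the sibling reported here is one way to test the explanation on a given machine.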
Finally, thanks to everyone, especially @Peter Cordes and @osgx, for pushing me to check the BIOS. The command echo 2^1234567%2 | bc is beautiful!