Intel MKL on AMD Zen
Aug 31, 2020
Tags: dev
Introduction
Disclaimer: this post investigates how recent MKL versions behave
on Zen CPUs. You may still want to read the MKL license before using
MKL. I shall not be held responsible for how you use MKL.
Intel MKL has been known to use SSE code paths on AMD CPUs that
support newer SIMD instructions, such as those that use the Zen
microarchitecture. A (by now) well-known trick has been to set the
MKL_DEBUG_CPU_TYPE environment variable to the value 5 to force
the use of AVX2 kernels on AMD Zen CPUs. Unfortunately, this variable
has been removed from Intel MKL 2020 Update 1 and later. This can
easily be confirmed by running a program that uses MKL with
ltrace -e getenv.
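For completeness, on MKL versions before 2020 Update 1 the trick boiled down to setting a single variable before starting the program (this has no effect on newer releases):

```shell
export MKL_DEBUG_CPU_TYPE=5
```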
Good news: Intel appears to be adding Zen kernels
However, it appears that Intel removed this option because they are
adding Zen kernels to MKL. For example, if we run the ACES dgemm
benchmark with MKL 2020.2.254 on a Ryzen 3700X, performance is good:
$ ./mt-dgemm 4000 | grep GF
GFLOP/s rate: 382.756063 GF/s
A quick inspection with perf shows that most cycles are spent
in a Zen-optimized kernel:
79.95% mt-dgemm libmkl_def.so [.] mkl_blas_def_dgemm_kernel_zen
Bad news: sgemm is not yet implemented
However, it appears that they have not yet implemented Zen kernels
for every BLAS function. I modified the ACES benchmark to
use the sgemm BLAS function and the results are not quite as
good:
$ ./mt-sgemm 4000 | grep GF
GFLOP/s rate: 237.352720 GF/s
And indeed, perf reveals that MKL does not use a Zen kernel:
88.90% mt-sgemm libmkl_def.so [.] LM_LOOPgas_1
A temporary workaround
Some quick tracing shows that MKL uses a single function,
mkl_serv_intel_cpu_true, to detect whether it is dealing
with a genuine Intel CPU. Fortunately, the function is
fairly trivial, so we can replace it with our own function:
int mkl_serv_intel_cpu_true() {
  /* Always report a genuine Intel CPU. */
  return 1;
}
And compile it as a shared library:
$ gcc -shared -fPIC -o libfakeintel.so fakeintel.c
And make sure the library gets preloaded:
$ export LD_PRELOAD=libfakeintel.so
Now the sgemm benchmark shows good performance:
$ ./mt-sgemm 4000 | grep GF
GFLOP/s rate: 851.541946 GF/s
And indeed, an AVX2-optimized code path is used:
82.73% mt-sgemm libmkl_avx2.so [.] mkl_blas_avx2_sgemm_kernel_0
The only minor downside is that MKL will also use AVX2 kernels for
other functions such as dgemm. But this does not seem to affect
performance negatively. In fact, for the dgemm benchmark, performance
is a bit better on my machine (430 GF/s).
Making it permanent
Setting LD_PRELOAD every time on a machine can get tiresome and one
can easily forget it. A simple solution is to add our small library to
the ELF dynamic section of your binary as a DT_NEEDED entry, using
patchelf. For example:
$ patchelf --add-needed libfakeintel.so yourbinary