Intel MKL on AMD Zen
Aug 31, 2020
Tags: dev
Introduction
Disclaimer: this post investigates how recent MKL versions behave
on Zen CPUs. You may still want to read the MKL license before using
MKL. I shall not be held responsible for how you use MKL.
Intel MKL has been known to use SSE code paths on AMD CPUs that
support newer SIMD instructions, such as those that use the Zen
microarchitecture. A (by now) well-known trick has been to set the
MKL_DEBUG_CPU_TYPE environment variable to the value 5 to force
the use of AVX2 kernels on AMD Zen CPUs. Unfortunately, this variable
has been removed from Intel MKL 2020 Update 1 and later. This can
easily be confirmed by running a program that uses MKL with
ltrace -e getenv.
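For completeness, on MKL versions before 2020 Update 1 the trick boiled down to setting a single variable before starting the program (this has no effect on newer releases):

```shell
export MKL_DEBUG_CPU_TYPE=5
```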
Good news: Intel appears to be adding Zen kernels
However, it appears that Intel removed this option because they are
adding Zen kernels to MKL. For example, if we run the ACES dgemm
benchmark with MKL 2020.2.254 on a Ryzen 3700X, performance is good:
$ ./mt-dgemm 4000 | grep GF
GFLOP/s rate: 382.756063 GF/s
A quick inspection with perf shows that most cycles are spent
in a Zen-optimized kernel:
79.95% mt-dgemm libmkl_def.so [.] mkl_blas_def_dgemm_kernel_zen
Bad news: sgemm is not yet implemented
However, it appears that they have not yet implemented Zen kernels
for every BLAS function. I modified the ACES benchmark to
use the sgemm BLAS function and the results are not quite as
good:
$ ./mt-sgemm 4000 | grep GF
GFLOP/s rate: 237.352720 GF/s
And indeed, perf reveals that MKL does not use a Zen kernel:
88.90% mt-sgemm libmkl_def.so [.] LM_LOOPgas_1
A temporary workaround
Some quick tracing shows that MKL uses a single function,
mkl_serv_intel_cpu_true, to detect whether it is dealing
with a genuine Intel CPU. Fortunately, the function is
fairly trivial, so we can replace it with our own function:
int mkl_serv_intel_cpu_true() {
  /* Always report a genuine Intel CPU. */
  return 1;
}
And compile it as a shared library:
$ gcc -shared -fPIC -o libfakeintel.so fakeintel.c
And make sure the library gets preloaded:
$ export LD_PRELOAD=libfakeintel.so
Now the sgemm benchmark shows good performance:
$ ./mt-sgemm 4000 | grep GF
GFLOP/s rate: 851.541946 GF/s
And indeed, an AVX2-optimized code path is used:
82.73% mt-sgemm libmkl_avx2.so [.] mkl_blas_avx2_sgemm_kernel_0
The only minor downside is that MKL will also use AVX2 kernels for
other functions such as dgemm. But this does not seem to affect
performance negatively. In fact, for the dgemm benchmark, performance
is a bit better on my machine (430 GF/s).
Making it permanent
Setting LD_PRELOAD every time on a machine can get tiresome and one
can easily forget it. A simple solution is to add our small library to
the ELF dynamic section of your binary as a DT_NEEDED entry, using
patchelf. For example:
$ patchelf --add-needed libfakeintel.so yourbinary