By James Darnley

Introduction

As covered in previous blog posts, we work extensively with SIMD instructions on Intel CPUs to speed up video processing in open source libraries such as FFmpeg and Upipe.

Recently we have been considering using Intel’s new instruction set, AVX-512, and its wider vector registers, the 512-bit (64-byte) ZMM registers, to see if we can eke more speed out of the code anywhere.  While we were gearing up to test this, incorporating a very new assembler and an update to the x264asm compat layer, Cloudflare published its own findings on using these features in On the dangers of Intel’s frequency scaling.

Briefly put, they showed that running even a little code that uses ZMM registers can slow everything else down.  The processor reduces its operating frequency when it hits a ZMM instruction in order to limit power consumption and heat output.

Because of that we decided not to test ZMM registers at all.  Like Cloudflare, we don’t spend enough time in assembly functions to be able to take the CPU clock speed hit.  However, the new instructions and the EVEX prefix are also available for the narrower XMM and YMM registers, and the number of those registers increases to 32.  Specifically, this requires the AVX-512 Vector Length Extensions (VL) feature, which the Skylake-X and new Xeon processors have.  If you can make use of the new features they may provide you with some speed gains.

Where to Start

Where would one begin?  There are so many new features that it can be hard to know.  There are op-masks, op-mask registers, blends, compares, permutes, conversions, scatters, and more.

I will start by covering a few operations I have emulated in the past: maximum and minimum of packed signed quadwords; arithmetic right shift of packed quadwords; convert quadwords to doublewords.  These now exist as single instructions.  AVX-512 has added or extended many functions for quadwords; see Intel’s Optimization Reference Manual (pdf) section 15.13.

Arithmetic shift right of quadwords could be emulated with a pregenerated sign-extend mask and pxor; pcmpgtq; pand; psrlq; por and a spare register.  That is 5 instructions, only 1 of which could be done in parallel with the others, plus however many are needed to create the mask.  In the function where I needed this, the shift was constant for the duration of the function, so creating the mask was a once-only cost.  The five instructions could have a latency of 7 cycles whereas vpsraq is 1, 4, or 8 cycles, depending on the precise form used, according to Intel’s own documents about latency (pdf).
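
As a rough sketch of that emulation, here is a scalar C model of a single 64-bit lane, with one statement per instruction.  The function names are made up for illustration and are not taken from any real codebase.  It assumes the usual two's complement conversion from unsigned to signed, which mainstream compilers provide.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: build the pregenerated sign-extend mask, which has
 * the top `shift` bits set.  Valid for shifts of 1 to 63. */
uint64_t make_sign_mask(unsigned shift)
{
    assert(shift >= 1 && shift <= 63);
    return ~UINT64_C(0) << (64 - shift);
}

/* Scalar model of one quadword lane of the five-instruction emulation.
 * The unsigned-to-signed conversion at the end assumes two's complement. */
int64_t emulated_sra64(int64_t x, unsigned shift, uint64_t sign_mask)
{
    int64_t  zero   = 0;                             /* pxor: zero a spare register */
    uint64_t is_neg = (zero > x) ? ~UINT64_C(0) : 0; /* pcmpgtq: all ones if 0 > x  */
    uint64_t fill   = is_neg & sign_mask;            /* pand: sign bits to fill in  */
    uint64_t shifted = (uint64_t)x >> shift;         /* psrlq: logical shift right  */
    return (int64_t)(shifted | fill);                /* por: merge in the sign bits */
}
```

The psrlq does not depend on the pxor; pcmpgtq; pand chain, which is why it is the one instruction that can run in parallel with the others.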

Maximum and minimum of packed signed quadwords can be emulated with pcmpgtq; pand; pandn; por and a spare register.  That is 4 instructions, 5 if a memory operand is needed for the minimum, and none can be done in parallel.  The four instructions to emulate could have a 6 cycle latency whereas vpmaxsq is 3 cycles or 10 with a memory operand.
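
The same kind of scalar C model works for the min/max emulation.  Again this is only a sketch of one 64-bit lane with hypothetical function names, assuming two's complement conversions.

```c
#include <stdint.h>

/* Scalar model of one quadword lane of the emulated signed maximum. */
int64_t emulated_max_s64(int64_t a, int64_t b)
{
    uint64_t gt     = (a > b) ? ~UINT64_C(0) : 0; /* pcmpgtq: all ones if a > b  */
    uint64_t keep_a = (uint64_t)a & gt;           /* pand: a where a > b         */
    uint64_t keep_b = ~gt & (uint64_t)b;          /* pandn: b where not (a > b)  */
    return (int64_t)(keep_a | keep_b);            /* por: combine the two halves */
}

/* The minimum uses the same compare but selects the opposite operand. */
int64_t emulated_min_s64(int64_t a, int64_t b)
{
    uint64_t gt     = (a > b) ? ~UINT64_C(0) : 0; /* pcmpgtq: all ones if a > b  */
    uint64_t keep_b = (uint64_t)b & gt;           /* pand: b where a > b         */
    uint64_t keep_a = ~gt & (uint64_t)a;          /* pandn: a where not (a > b)  */
    return (int64_t)(keep_a | keep_b);            /* por: combine the two halves */
}
```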

Convert quadwords to doublewords: this now exists as a single instruction.  AVX-512 adds many down-convert instructions for doublewords and quadwords with truncation, signed saturation, and unsigned saturation.  These are a bit like the reverse of the pmovsx and pmovzx instructions from SSE 4.1, which move with sign or zero extension.  The min/max emulation mentioned above was there to work around this particular limitation: I needed to pack and saturate the quadwords, so I was clipping with min/max and then shuffling or blending values back together.
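
To make the difference between truncation and signed saturation concrete, here is a small scalar C model of the quadword to doubleword down-convert.  It is a sketch, not real SIMD code: the function names are hypothetical, LANES assumes the four quadwords of a YMM register, and the conversions assume two's complement.

```c
#include <stdint.h>

#define LANES 4 /* assumption: four quadwords, as in a YMM register */

/* Truncating down-convert of one lane: keep only the low 32 bits. */
int32_t trunc_s64_to_s32(int64_t x)
{
    return (int32_t)(uint32_t)(uint64_t)x;
}

/* Signed-saturating down-convert of one lane. */
int32_t sat_s64_to_s32(int64_t x)
{
    if (x > INT32_MAX) return INT32_MAX;
    if (x < INT32_MIN) return INT32_MIN;
    return (int32_t)x;
}

/* Model of the old workaround: clip with min and max, then truncate. */
int32_t clip_then_trunc(int64_t x)
{
    int64_t clipped = (x < INT32_MAX) ? x : INT32_MAX;       /* min */
    clipped = (clipped > INT32_MIN) ? clipped : INT32_MIN;   /* max */
    return trunc_s64_to_s32(clipped);
}

/* Saturating pack of a whole register's worth of quadwords. */
void pack_sat_s64_to_s32(const int64_t src[LANES], int32_t dst[LANES])
{
    for (int i = 0; i < LANES; i++)
        dst[i] = sat_s64_to_s32(src[i]);
}
```

In this model clip_then_trunc and sat_s64_to_s32 give the same result for every input, which is the point: the clipping existed only to get saturation out of a truncating pack.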

It would take a rewrite of the function to make good use of the new features, because the rather ugly logic is partly a result of the limitations of older instruction sets.  It would also need a rewrite because the older blend instructions do not have an EVEX-encoded form and so cannot use the 16 new registers.  Because the x264asm compat layer, which Upipe and FFmpeg use, prefers the new registers, AVX-512 isn’t a simple drop-in replacement here.

Op-masks

Which brings me onto op-masks.  Op-masks are a feature that could see a great deal of use in code which has run-time branching, conditionals, or masking.  Blends can now be done with op-masks.

The EVEX encoding means instructions now have a form like this: vpaddw m1 {k1}, m2, m3 in which k1 is the op-mask.  k1 is one of eight dedicated op-mask registers.  They are manipulated using dedicated instructions, which begin with a ‘K’; see the instruction set reference of Intel’s Software Development Manuals.  They can also be set using the result of the various compare instructions.  In this example each word in m1 will only be changed to the result of m2+m3 if the corresponding bit in k1 is set, otherwise it is left unchanged.  The lowest word checks bit 0, and so on up to the highest word of a YMM register, which checks bit 15.
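
The merge behaviour can be sketched in scalar C.  This models vpaddw with an op-mask on the 16 words of a YMM register; the function name is hypothetical and the loop only illustrates the semantics, it is not how you would write the real thing.

```c
#include <stdint.h>

#define WORDS 16 /* sixteen 16-bit words in a YMM register */

/* Scalar model of: vpaddw dst {k}, a, b
 * Word i is written only when bit i of k is set; otherwise it keeps
 * whatever value dst already held. */
void model_vpaddw_merge(uint16_t dst[WORDS], const uint16_t a[WORDS],
                        const uint16_t b[WORDS], uint16_t k)
{
    for (int i = 0; i < WORDS; i++) {
        if (k & (1u << i))
            dst[i] = (uint16_t)(a[i] + b[i]); /* wraps like the instruction */
    }
}
```

With k set to 0x8001, for example, only the lowest and highest words of dst receive the sum and the fourteen words in between are untouched.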

It is similar for a move, which you can turn into a blend with an op-mask.  New move instructions have been added: vmovdqu8; vmovdqu16; vmovdqu32; vmovdqu64.  With vmovdqu16 m1 {k1}, m2 each word value in the destination will only be changed to the source value if the corresponding bit is set.  Either the destination or the source could also be a memory location, like with the older moves.  This is a conditional move of packed values.
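
Here is a scalar C sketch of a compare feeding a masked move, which together act as a blend.  It models a signed word compare that writes an op-mask followed by a masked vmovdqu16.  The function names are hypothetical and the 16 lanes assume a YMM register.

```c
#include <stdint.h>

#define WORDS 16 /* sixteen 16-bit words in a YMM register */

/* Scalar model of a signed greater-than compare that writes an op-mask:
 * bit i of the result is set when a[i] > b[i]. */
uint16_t model_cmpgt_w(const int16_t a[WORDS], const int16_t b[WORDS])
{
    uint16_t k = 0;
    for (int i = 0; i < WORDS; i++) {
        if (a[i] > b[i])
            k |= (uint16_t)(1u << i);
    }
    return k;
}

/* Scalar model of: vmovdqu16 dst {k}, src
 * Word i of dst is replaced by src[i] only when bit i of k is set. */
void model_masked_move_w(int16_t dst[WORDS], const int16_t src[WORDS],
                         uint16_t k)
{
    for (int i = 0; i < WORDS; i++) {
        if (k & (1u << i))
            dst[i] = src[i];
    }
}
```

Comparing a register of values against a register holding an upper limit and then doing the masked move from the limit register clips the values to that limit, with no vector register needed to hold the mask.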

Another feature of these op-masks is the zeroing bit of the EVEX encoding.  In the form vpaddw m1 {k1}{z}, m2, m3 the instruction will change each word of m1 to m2+m3 where the corresponding bit in k1 is set.  However, when the bit is not set the corresponding word value will be set to zero.  The benefit is that the instruction no longer depends on the values in m1 before the instruction.  If you can use the zero values then it will be useful in that fashion too.
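
A scalar C sketch of the zeroing form shows why the old contents of the destination no longer matter.  As before the function name is hypothetical and the 16 lanes assume a YMM register.

```c
#include <stdint.h>

#define WORDS 16 /* sixteen 16-bit words in a YMM register */

/* Scalar model of: vpaddw dst {k}{z}, a, b
 * Word i becomes a[i] + b[i] when bit i of k is set and zero otherwise.
 * Every word of dst is overwritten, so its old contents are never read. */
void model_vpaddw_zero(uint16_t dst[WORDS], const uint16_t a[WORDS],
                       const uint16_t b[WORDS], uint16_t k)
{
    for (int i = 0; i < WORDS; i++)
        dst[i] = (k & (1u << i)) ? (uint16_t)(a[i] + b[i]) : 0;
}
```

Calling it on two destinations that start with different contents gives identical results, which is the independence from m1 described above.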

These op-masks are probably the biggest reason to rewrite functions because of the conditionals they let you use.  The op-mask registers free vector registers from holding masks, the new instructions free more registers that may have been used in emulation, and there are 16 added registers on top of that, so there are now more registers than I know what to do with.  Most of the functions I’ve worked on were not short on registers, at least on x86-64.  I could store more constants in them rather than loading from memory but that only gets you a small speedup in most cases.

Summary

For those looking for a summary or a TL;DR of what they should look at in their own code, I think you should focus on these areas:

  • Any function that stores intermediate data into memory because of register pressure.
  • Any function with conditionals, any function with a compare instruction.
  • Any function that uses quadwords, uint64_t, or int64_t data types.
