This does provide some speedup for this function. However, the IDCT could make much better use of the wider xmm registers and more registers (on 64-bit). The data block is 128 bytes, an 8x8 array of int16_t. Always look at your specific problem.
Having tried to implement this directly myself and failing, Ronald pointed me to the existing 10-bit IDCT which does use xmm registers and all 16 available on x86-64.
Comparing that with the C function and comparing the 8 and 10-bit C functions I thought it would be as simple as calling the existing macros with the right parameters. If only. It was pretty close but didn't pass testing.
Ronald pointed out that I should be using different rounding at one point and, very helpfully, contributed the value. Incorporating that still wasn't completely accurate. At first I thought this was because of tiny differences in the coefficients FFmpeg uses between the 8- and 10-bit functions. Minor differences of +/- 1, for example 16383 vs 16384 and 19265 vs 19266 but the former is used most often. This isn't too hard to correct. Do a little more macro magic and the code can be made to use the right ones. This is a little messy but keeps the functional code clean. This is commit 8221c71703.
After boxing this all up into a few neat commits I submitted the patches for review. Almost immediately, Michael found a problem. Most of the problem came from using one but not all of the 3 functions I wanted to add. Simply stated: it left the internal state of the decoder broken. It is expected that the IDCT functions all have the same permutation and that add/put functions exist.
While addressing this problem we kicked off a lengthy and slightly heated discussion about the legacy of the IDCT any why it uses the coefficients it does. That discussion is covered below.
There was one final problem with my new code which more thorough testing revealed. Unfortunately I don't remember who first highlighted this last one as we were discussing the legacy of the IDCT and other issues.
Ronald identified that in the case when there is only a DC coefficient in a row that the C code uses a short cut to prevent the need to do a full IDCT. Instead of multiplying by 16383 it only shifts to make it the equivalent of multiplying by 16384. After telling me the specifics he said that the code would need to perform a full IDCT and then merge in the "shortcut" values.
He must have been bored because while I was working on some other corner of the code he suddenly presented me with the appropriate patch. That can be seen in commit 8b19467d07.
Other than a few mostly minor cosmetic patches which precede it the commit which adds the new functions is d7246ea9f2.
A nice post on the subject by Michael is The MPEG1/2/4 and H.261/2/3 IDCT. A quick summary is that the encoder and decoder must have matching transforms otherwise the differences quickly accumulate.
Thanks to careful preservation of FFmpeg's history you can trace the origin of the different coefficients change back to 2002. Using git-blame to identify the last time a line was changed you can use git-show to show that patch and the message. If that wasn't related then use git-blame again on the preceding commit to see the time it was changed before then. Keep going until you discover the origin.
I did ask Michael about why he made the change but he couldn't remember the specifics. He guessed that the reason was it was a better match for encoders/files of the time. Another unfortunate effect of legacy code: you don’t always remember why and shows why it’s important to document why things are done a certain way.
Part three will cover other bugs uncovered by changing this key piece of legacy code and the assumptions made about it.