Famously, Mark Twain popularised the phrase “Lies, damned lies, and latency statistics”, having clearly got bored of seeing claims of ever-lower video latency on LinkedIn and other social media. Regular readers of this blog will know we have spent a huge amount of time and effort building our own SDI cards in order to get latency as low as possible by controlling the electronics that capture/create pixels on the wire. So why, then, are there claims on LinkedIn and other social media of 30ms latency one week and 10ms the next, using protocols like WebRTC and Media-over-QUIC? Was building our own SDI card a complete waste of time?
In most cases these claims are based on cherry-picked latency measurements where portions of the actual glass-to-glass latency have been ignored. This isn’t always the case: there are of course specialist applications, such as military encoding, where these numbers are genuine, achieved by going to extreme lengths and extreme cost to keep latency down.
Let’s look at the ways people can selectively measure latency in order to get a lower number:
Pretending capture is instantaneous
In most applications a video frame from a camera doesn’t appear instantaneously in the memory of the video encoder. The video data usually travels over a transport layer like USB or SDI, and this isn’t an instantaneous process. The video frame usually arrives at the encoder at least one frame after the camera has sent it on the wire – i.e. there is a capture latency, which can be large in consumer applications. Just because a consumer application has no control over this latency doesn’t mean it is nonexistent, and it doesn’t mean you can start measuring latency from the point the video arrives in software. Note that a camera might have inherent latency of its own, but there are cameras available with practically no latency (i.e. the capture-to-pixel-on-the-wire time is low).
Advanced capture mechanisms that have access to the data as it arrives on the wire can start processing with less than a frame of delay. Consumer applications, however, are not capable of this: they have to wait for a complete frame to arrive, and this delay is often simply ignored. One way to handwave away this delay in the real world is to use a desktop screen share in the latency comparison, which means the data arrives instantaneously at the encoder, conveniently hiding capture latency.
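As a rough sketch of the arithmetic involved (the wire-level and USB figures below are illustrative assumptions, not measurements), the gap between the two capture styles at 1080p60 looks something like this:

```python
# Illustrative capture-latency arithmetic at 1080p60.
FPS = 60.0
frame_duration_ms = 1000.0 / FPS  # ~16.7ms per frame

# Assumed figures: a wire-level capture card can begin processing after a
# handful of lines, whereas a consumer USB stack may buffer several frames.
wire_level_ms = 0.1                       # assumption: process lines as they arrive
frame_based_ms = frame_duration_ms        # must wait for the whole frame
consumer_usb_ms = 4 * frame_duration_ms   # assumption: driver/queue buffering

print(f"wire-level capture:   ~{wire_level_ms} ms")
print(f"frame-based capture:  >{frame_based_ms:.1f} ms")
print(f"consumer USB capture: ~{consumer_usb_ms:.1f} ms")
```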
Latency ignored at 1080p60: 16-64ms
Pretending there is only one frame rate
Most operations in a pipeline process data a frame at a time. It’s therefore beneficial to keep the frame duration as low as possible (i.e. use a higher number of frames per second). Often in broadcast applications latency measurements are quoted using a high frame rate like 1080p50, even though a lot of production is still done at 25 frames per second. This allows numbers to be quoted that are half the actual latency in the real world.
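The arithmetic is simple: frame duration is the reciprocal of the frame rate, so quoting at a higher rate halves every per-frame cost in the pipeline:

```python
# Frame duration is the reciprocal of frame rate.
for fps in (25, 50, 60):
    print(f"{fps} fps -> {1000.0 / fps:.1f} ms per frame")
# 25 fps -> 40.0 ms per frame
# 50 fps -> 20.0 ms per frame
# 60 fps -> 16.7 ms per frame
```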
Latency ignored at 1080p60: N/A
Pretending encoding is always fast
Another trick that’s used is assuming the encoder can encode frames quickly. For example, at 1080p60 each frame lasts 16ms, but some implementations assume that the encoder can complete the encoding of a frame in less time, such as 8ms. This may not be guaranteed in reality. It might happen to work in low-bitrate consumer applications with simple content like talking heads, but at the higher bitrates and more complex content used in professional applications this may not be the case. Assuming an encode latency that’s too low is also problematic: if, for example, the encode delay is assumed to be 8ms and a particularly difficult frame takes 12ms, there will be a stutter. This might be acceptable for videoconferencing applications but is not in professional ones.
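A minimal sketch of the failure mode, with encode_frame standing in for a real encoder call (the function and budget are hypothetical, not any particular API):

```python
import time

ASSUMED_ENCODE_BUDGET_MS = 8.0    # the optimistic assumption
FRAME_DURATION_MS = 1000.0 / 60   # ~16.7ms at 1080p60

def encode_frame(frame):
    ...  # placeholder for a real encoder call

def encode_with_budget(frame):
    start = time.monotonic()
    bitstream = encode_frame(frame)
    elapsed_ms = (time.monotonic() - start) * 1000.0
    if elapsed_ms > ASSUMED_ENCODE_BUDGET_MS:
        # A difficult frame blew the assumed budget: downstream stages sized
        # for 8ms now receive late data and the output stutters.
        print(f"budget overrun: {elapsed_ms:.1f} ms > {ASSUMED_ENCODE_BUDGET_MS} ms")
    return bitstream
```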
Latency ignored at 1080p60: 8-10ms
Pretending (de)muxing delay doesn’t exist
Latency measurements often also just include the encode portion of the process and ignore the fact that data is processed a frame at a time. Virtually all software solutions and many hardware solutions process data a frame at a time, so even if the encode delay is low, there are still downstream processes like (de)muxing that happen frame by frame, negating many of the benefits of a fast encoder.
Often audio is compressed, which means that the audio frame duration no longer matches that of a video frame. For example, a 1080p60 video frame lasts 16ms but an AAC-LC frame lasts 21.3ms. This means that somewhere in the pipeline there needs to be additional buffering so that each video frame has enough audio data associated with it. Professional applications don’t have this issue as they are able to send audio data uncompressed, meaning audio and video frame durations match and no additional buffering is needed.
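The mismatch is easy to verify: AAC-LC carries 1024 samples per frame, so at a 48kHz sample rate the durations never line up, whereas uncompressed PCM can be cut into exactly one video frame’s worth of samples:

```python
SAMPLE_RATE = 48000       # Hz
AAC_FRAME_SAMPLES = 1024  # samples per AAC-LC frame

video_frame_ms = 1000.0 / 60                             # ~16.7 ms
aac_frame_ms = 1000.0 * AAC_FRAME_SAMPLES / SAMPLE_RATE  # ~21.3 ms
pcm_per_video_frame = SAMPLE_RATE / 60                   # 800 samples, cuts cleanly

print(f"video frame:  {video_frame_ms:.1f} ms")
print(f"AAC-LC frame: {aac_frame_ms:.1f} ms")
print(f"PCM per video frame: {pcm_per_video_frame:.0f} samples")
```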
Latency ignored at 1080p60: 16ms
Pretending the network instantaneously transfers data
Another trick that’s often played is assuming the network transfers data instantaneously. Often this takes place in a lab, and the time taken for compressed video to arrive at the receiver is assumed to be negligible. This isn’t the case in the real world, where the worst case needs to be accounted for. For example, on a 3Mbit/s connection the worst-case (largest) video frame at 1080p60 takes exactly one frame duration of 16ms to transfer over the wire (i.e. in the worst case there is 16ms between the first packet of a video frame arriving and the last). Outside of a lab you can’t burst up to 1000Mbit/s. Note this isn’t the same as network latency (the time taken for a packet to traverse the network).
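A worked version of that example. The arithmetic is deliberately circular: at a constant bitrate, the largest coded frame by definition fills the whole frame interval on the wire:

```python
BITRATE_BPS = 3_000_000   # 3 Mbit/s CBR link from the example above
FPS = 60.0
frame_duration_s = 1.0 / FPS

worst_case_frame_bits = BITRATE_BPS * frame_duration_s             # 50,000 bits
serialisation_delay_ms = worst_case_frame_bits / BITRATE_BPS * 1000.0

print(f"worst-case frame size: {worst_case_frame_bits / 8:.0f} bytes")
print(f"serialisation delay:   {serialisation_delay_ms:.1f} ms")   # ~16.7 ms
```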
Latency ignored at 1080p60: 16ms
Pretending decoding is always fast
This is the same trick as pretending encoding is always fast, just at the other end of the pipeline. It’s a more realistic assumption, as today’s decoders are very fast, but it suffers from the same worst-case complexity problem, especially at high bitrates. In professional applications the video can be decoded as it arrives on the wire, slice by slice, negating this latency altogether: as soon as the last slice arrives, the video frame is completely decoded.
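A sketch of that slice-by-slice approach; receive_slice, decode_slice and get_picture are hypothetical names rather than a real decoder API:

```python
def decode_frame_sliced(connection, decoder):
    # Decode each slice as it arrives instead of waiting for the whole
    # coded frame, so decode overlaps with the network transfer.
    while True:
        s = connection.receive_slice()   # hypothetical network call
        decoder.decode_slice(s)          # hypothetical decoder call
        if s.is_last_in_frame:
            # The frame is ready almost as soon as its last slice arrives.
            return decoder.get_picture()
```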
Latency ignored at 1080p60: 16ms
Pretending there is no further processing
In the real world there is also further processing such as audio resampling and frame synchronisation. In the professional world it’s important to do a like-for-like comparison between a device that’s frame synchronising to a genlock signal and one that isn’t and is merely passing data through.
Latency ignored at 1080p60: 4-16ms
Pretending display is instantaneous
With our in-house developed SDI card we went to painful lengths to make sure that when we write data to the card it appears on screen in less than 1ms (assuming the monitor does no processing). But it’s widely known that computer desktops have around 100-200ms of latency. Measuring the rendering delay of a desktop operating system is a complex problem with many factors involved: GPUs don’t immediately present video to the screen, operating system scheduling is unpredictable, and there are application latencies, power-saving modes and so on. The linked articles explain very well how this latency accumulates in modern computer systems.
So how do desktop/browser-based videoconferencing protocols like WebRTC claim 30ms? They often show a side-by-side comparison, which inherently cherry-picks the latency measurement because both sides of the comparison have the desktop display latency included. That might be fine for screen sharing, but for interactive applications with video there’s an extra 100ms or so of latency that is ignored. This is why, while the numbers for WebRTC latency might look good, they are not comparable to professional broadcast equipment designed specifically to minimise latency from capture to playback.
Latency ignored at 1080p60: Up to 100ms
Conclusions
We’ve shown that many measurements of latency are heavily cherry-picked to produce low numbers that ignore the full capture and playback pipeline. The accumulation of all these values (adding up to 200ms or more in many cases) produces a much larger latency than many people claim.
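Tallying the per-stage figures quoted in each section above makes this concrete (a rough sketch; the display figure is taken as 0-100ms since side-by-side comparisons can hide it entirely):

```python
# (low, high) in ms at 1080p60, from the figures quoted in each section.
ignored_ms = {
    "capture":    (16, 64),
    "encode":     (8, 10),
    "(de)muxing": (16, 16),
    "network":    (16, 16),
    "decode":     (16, 16),
    "processing": (4, 16),
    "display":    (0, 100),
}
low = sum(lo for lo, _ in ignored_ms.values())
high = sum(hi for _, hi in ignored_ms.values())
print(f"total ignored latency: {low}-{high} ms")  # 76-238 ms
```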
We’ve always been open and transparent about the latency in our products. We ship 100ms at 1080p60 in our encoders and have demonstrated 64ms latency at tradeshows using more sophisticated techniques not commonly used in software. This is the lowest latency possible for an encoder in our class.
We also covered the limitations of technologies like WebRTC in a professional environment (for example, the lack of interlaced support) in the blog post Why Does MPEG Transport Stream still exist.