A few reminders about dav1d

If you follow this blog, you should know everything about dav1d.

The VideoLAN, VLC and FFmpeg communities have been working on a new AV1 decoder, dav1d, in order to create the best and fastest decoder.

A new very fast release

0.7.0 is a major new release, whose focus is, once again, speed. It is doubly interesting, because the improvements matter for both computers and smartphones.

The ref_mv rewrite

For once, the biggest speed improvement for desktops and laptops does not come from writing more assembly code, but from Ronald's rewrite of the ref_mv algorithm.

This new algorithm gives an 8-12% speed improvement, measured on Haswell machines, while reducing memory usage by 25%.
We're talking about roughly 10% faster for the complete AV1 decoding, which is a bigger impact than a lot of the assembly we wrote.

x86 Assembly

With the 0.7.0 release, the assembly for x86 CPUs (32-bit and 64-bit) is now totally complete for 8-bit bitdepth.

We finished up all the small optimizations that remained for SSSE3 and AVX2, notably film grain, during the 0.6.0 and 0.7.0 development cycles. We added more AVX-512 assembly, for those with very recent CPUs.

In the future, getting faster on those Intel CPUs is going to be very difficult (I know I have said that many times already, but this time it's true).

dav1d is still around 3x to 5x faster than aomdec on normal computers, and we are now even faster :). See older posts for more information.

ARM Assembly

As with 0.6.0, an important focus of dav1d 0.7.0 was ARM assembly, notably for the 10-bit bitdepth cases.

As of 0.7.0, most of the assembly you should care about is done for 8bit/10bit/12bit on ARM64, and this makes decoding AV1 on phones affordable.

ARM speed vs gav1

gav1 is an open source decoder made by Google to compete with dav1d on Android and ARM.

As of 0.7.0, dav1d is between 1.8x and 2.5x faster on 8b content and 2.4x to 5x faster on 10b content than gav1 on different CPUs.

[Graph: dav1d vs gav1]

This graph was made on an ODroid N2, for example.

Deep dive on ARM cores and performance

ARM CPUs for mobile devices have an architecture with both LITTLE and big cores, which offer different speeds and different power usage.

Using different types of cores lets a device consume only the power it needs for normal tasks, and go to maximum power when requested.

It is therefore extremely important to analyze the performance of our ARM code on both types of cores, and when mixing them.
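On Linux and Android, one common way to restrict a benchmark to a given type of core is CPU affinity. The snippet below is only a minimal sketch of that idea, not a description of how the measurements in this post were made. The helper name `pin_to_cores` and the core numbers in `ASSUMED_LITTLE_CORES` are assumptions for illustration; which core IDs are LITTLE or big depends on the SoC.

```python
import os

# Hypothetical layout: which core IDs are the LITTLE ones varies by SoC.
# These values are an assumption for illustration, check your own device.
ASSUMED_LITTLE_CORES = {0, 1}


def pin_to_cores(wanted):
    """Restrict the current process to the requested cores.

    Only cores the process is currently allowed to run on are kept.
    Returns the set of cores actually used, or raises ValueError if
    none of the requested cores is available.
    """
    available = os.sched_getaffinity(0)  # 0 means "the calling process"
    usable = set(wanted) & available
    if not usable:
        raise ValueError("none of the requested cores is available")
    os.sched_setaffinity(0, usable)
    return usable
```

Calling `pin_to_cores(ASSUMED_LITTLE_CORES)` before starting a decoding run would keep the process, and the threads it creates afterwards, on those cores. `os.sched_setaffinity` is only available on platforms that support it, such as Linux.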

So let's have a look at how dav1d and gav1 compare on Chimera, the reference AV1 sample made by Netflix, on the SnapDragon 821 (Pixel 1 phone):

[Graphs: dav1d cores]

Learnings

What we can learn from those graphs is the following:

  • dav1d can decode this sample at 24fps, in all the above configurations, starting from 2 threads
  • gav1 is never able to decode that sample at 24fps, in LITTLE, big and big.LITTLE configurations
  • threading in gav1 is catastrophic: the more threads you add, the less efficient the decoding is
  • threading in dav1d is quite good: adding more threads always increases the performance
  • max performance is around 2.3x faster in dav1d than gav1
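The notions used in this list (decoding at 24fps, and threading becoming less efficient as threads are added) can be made concrete with a little arithmetic. The following is a rough sketch, assuming you have a mapping from thread count to measured frames per second that includes a single-thread run. The function names are made up for this example and are not part of dav1d or gav1.

```python
TARGET_FPS = 24.0  # playback rate the decoder has to sustain for this sample


def speedup(fps_by_threads):
    """Speed of each configuration relative to the single-thread run."""
    base = fps_by_threads[1]
    return {n: fps / base for n, fps in fps_by_threads.items()}


def efficiency(fps_by_threads):
    """Speedup divided by thread count: 1.0 would be perfect scaling."""
    return {n: s / n for n, s in speedup(fps_by_threads).items()}


def min_realtime_threads(fps_by_threads, target=TARGET_FPS):
    """Smallest thread count that reaches the target fps, or None."""
    ok = [n for n, fps in fps_by_threads.items() if fps >= target]
    return min(ok) if ok else None
```

With these definitions, a decoder that never reaches the target gives `None` from `min_realtime_threads`, and threading that scales badly shows up as an `efficiency` value that drops quickly as the thread count grows.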

For 10b, the situation is even worse for gav1.

I want to emphasize that dav1d can decode Chimera with 2 threads on the Pixel 1, from 2016, using only the LITTLE cores.

Focus on LITTLE cores on Android

So, what's interesting is to look at the performance of the LITTLE cores on Android, to see the actual speed of the decoder in low-power cases.

Here, we tested all the thread configurations on the following Android devices:

  • Google Pixel 1 (SnapDragon 821) (2016)
  • Google Pixel 2 (SnapDragon 835) (2017)
  • Google Pixel 3 (SnapDragon 845) (2018)
  • Xiaomi Mi 9T Pro (SnapDragon 855) (2019)

Here are the results:

[Graph: dav1d cores]

Once again, we can see, on LITTLE cores:

  • dav1d is always at least 2x faster than gav1
  • we still see the previously mentioned threading issues on gav1
  • dav1d can decode Chimera at 24fps starting with 2 threads on the LITTLE cores, gav1 cannot

AV1 10bit on LITTLE

For the sake of completeness, here are the results for 10b on the LITTLE cores:

[Graph: dav1d cores]

You can find all the details here, in the spreadsheet done by Nathan.

Conclusion and Thanks

dav1d is now a very fast decoder on desktops and laptops, but mostly on mobile, where it shows very impressive performance on 8b and 10b. It can decode 1080p with a couple of cores on mobile.

Thanks a lot to Nathan Egge, from Mozilla, who gathered all the data required for this post. He therefore did all the work for this blogpost.