Bug 229

Summary: AV1 optimizations
Product: Libre-SOC's first SoC
Reporter: cand
Component: Source Code
Assignee: Konstantinos Margaritis (markos) <konstantinos>
Status: RESOLVED FIXED
Severity: enhancement
CC: libre-soc-bugs, lkcl
Priority: ---    
Version: unspecified   
Hardware: PC   
OS: Linux   
See Also: http://bugs.libre-riscv.org/show_bug.cgi?id=230
NLnet milestone: NLNet.2019.10.031.Video
total budget (EUR) for completion of task and all subtasks: 4000
budget (EUR) for this task, excluding subtasks' budget: 4000
parent task for budget allocation: 137
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:
markos={amount=3200, submitted=2022-10-14, paid=2022-10-20}
lkcl={amount=800, submitted=2022-10-14, paid=2022-10-20}
Bug Depends on:    
Bug Blocks: 952, 137    

Description cand 2020-03-13 09:59:06 GMT
Optimizing AV1 code in dav1d with new instructions.
Comment 1 Luke Kenneth Casson Leighton 2022-09-28 18:14:58 BST
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=b58869c4f2efc7ab4a885e3a1de39fda616ddd57

added a horizontal-or demo which is easily adapted to do
horizontal-add or horizontal-mul (both useful in VPUs)
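
To illustrate why one demo adapts so easily to the others, here is a minimal Python model of a horizontal reduction. It is not the SVP64 assembler from the commit above; the function name `horizontal` and the sample vector are hypothetical. Only the binary operator changes between the or, add and mul variants.

```python
from functools import reduce
import operator


def horizontal(op, elements):
    """Fold every element of a vector into one scalar using a binary op."""
    return reduce(op, elements)


# Hypothetical 4-element vector, chosen only for illustration.
vec = [0x01, 0x02, 0x04, 0x08]

h_or = horizontal(operator.or_, vec)   # bitwise OR across all elements
h_add = horizontal(operator.add, vec)  # sum across all elements
h_mul = horizontal(operator.mul, vec)  # product across all elements
```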
Comment 2 Luke Kenneth Casson Leighton 2022-10-14 11:30:19 BST
first working version of AV1 assembler.
needs more work.

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=a65084c24742b43e79da714e5cd08f0d24a83eab
Comment 3 Konstantinos Margaritis (markos) 2022-10-14 11:44:09 BST
Using a method similar to the VP9 investigation, we wrote an SVP64 implementation of dav1d's cdef_find_dir function (from src/cdef_tmpl.c).

The SVP64 function demonstrates using all the available registers to minimize loads (unfortunately we cannot do zero loads at the moment, but we will be able to once elwidth/subvl are fully operational). The function loads an 8x8 array of pixels and processes it in multiple ways: horizontally, vertically and along diagonals (normal and slanted), producing a "cost" array of 8 elements. The results of the C reference function and the SVP64 implementation are exactly the same:

C ref:
04858917 05cf5742 021c7323 01c68c56 05931132 03de109a 02f8e489 00f02d4b
SVP64 (register dump):
reg 24 04858917 05cf5742 021c7323 01c68c56 05931132 03de109a 02f8e489 00f02d4b
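
The kind of directional processing described above can be sketched in Python. This is a simplified model, not dav1d's cdef_find_dir and not the SVP64 code: it covers only four directions (rows, columns and the two 45-degree diagonals) and leaves out the slanted ones. The names `direction_costs`, `SCALE` and the striped test block are hypothetical, and the input is assumed to be already centred around zero. The weight `840 // length` is an assumption of this sketch, used so that lines of different lengths contribute comparably (840 is divisible by every length from 1 to 8).

```python
N = 8        # the block is N x N pixels
SCALE = 840  # assumption of this sketch: divisible by all line lengths 1..8


def direction_costs(block):
    """Sum pixels along lines in four directions and score each direction."""
    rows = [0] * N
    cols = [0] * N
    diag = [0] * (2 * N - 1)  # lines where y + x is constant
    anti = [0] * (2 * N - 1)  # lines where y - x is constant
    for y in range(N):
        for x in range(N):
            px = block[y][x]
            rows[y] += px
            cols[x] += px
            diag[y + x] += px
            anti[N - 1 + y - x] += px

    def cost(sums, lengths):
        # Squared line sums, weighted by SCALE / line length.
        return sum(s * s * (SCALE // n) for s, n in zip(sums, lengths))

    full = [N] * N  # every row and column has N pixels
    ramp = [min(k, 2 * N - 2 - k) + 1 for k in range(2 * N - 1)]  # 1..8..1
    return {
        "rows": cost(rows, full),
        "cols": cost(cols, full),
        "diag": cost(diag, ramp),
        "anti": cost(anti, ramp),
    }


# Hypothetical test block: horizontal stripes alternating +10 / -10.
stripes = [[10 if y % 2 == 0 else -10] * N for y in range(N)]
costs = direction_costs(stripes)
best = max(costs, key=costs.get)
```

Each pixel is read once and accumulated into several line sums, which is why keeping the whole block in registers pays off: the same data feeds every direction.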

As a future improvement we could adopt elwidth=16 packed loads to reduce the number of registers used even further and do the whole processing without a single memory access, apart from the initial buffer load!
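
As a rough illustration of the register saving, the following Python sketch packs 16-bit elements four to a 64-bit word. It is a model only, not SVP64 code: the helper names `pack` and `unpack` are hypothetical, and placing element 0 in the least significant lane is an assumption of this sketch.

```python
ELWIDTH = 16                   # bits per element
REG_BITS = 64                  # bits per register
PER_REG = REG_BITS // ELWIDTH  # elements that fit in one register
MASK = (1 << ELWIDTH) - 1


def pack(elements):
    """Pack 16-bit elements into 64-bit words, element 0 in the lowest lane
    (lane order is an assumption of this sketch)."""
    regs = []
    for i in range(0, len(elements), PER_REG):
        word = 0
        for lane, value in enumerate(elements[i:i + PER_REG]):
            word |= (value & MASK) << (lane * ELWIDTH)
        regs.append(word)
    return regs


def unpack(regs, count):
    """Recover the first `count` elements from packed 64-bit words."""
    out = []
    for word in regs:
        for lane in range(PER_REG):
            out.append((word >> (lane * ELWIDTH)) & MASK)
    return out[:count]


# Hypothetical 8x8 block flattened to 64 values.
pixels = list(range(64))
packed = pack(pixels)
```

With one element per register the 64 pixels would occupy 64 registers; packed at 16 bits they occupy 16 in this model.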

This implementation demonstrates how complicated algorithms can be optimized with SVP64 and how the abundance of registers can almost eliminate memory access.
Comment 5 Luke Kenneth Casson Leighton 2022-10-14 15:02:17 BST
https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=380d9fc5bb078c313dc0d7dc4fcfcef63a990ca2

max-of-array-plus-index-of-the-last-max-element

for max-of-array-plus-index-of-first a small tweak
will be needed, to make sure the cmp doesn't activate
when the element-being-compared is equal.
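
The difference between the two variants can be modelled in a few lines of Python. This is a sketch of the comparison logic only, not the assembler in the commit; the function name `max_with_index` and its `keep` parameter are hypothetical.

```python
def max_with_index(values, keep="last"):
    """Return (maximum, index). With keep="last" an equal element replaces
    the current best; with keep="first" only a strictly greater one does."""
    best, idx = values[0], 0
    for i in range(1, len(values)):
        v = values[i]
        if v > best or (keep == "last" and v == best):
            best, idx = v, i
    return best, idx


# Hypothetical input with the maximum appearing twice.
data = [3, 7, 2, 7]
last = max_with_index(data, keep="last")
first = max_with_index(data, keep="first")
```

The only change between the variants is whether equality counts as a match, which corresponds to the small tweak to the cmp mentioned above.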