| Summary: | REMAP CSR for Matrix Multiplies | ||
|---|---|---|---|
| Product: | Libre-SOC's first SoC | Reporter: | Luke Kenneth Casson Leighton <lkcl> |
| Component: | Specification | Assignee: | Luke Kenneth Casson Leighton <lkcl> |
| Status: | RESOLVED FIXED | ||
| Severity: | enhancement | CC: | libre-soc-bugs |
| Priority: | --- | ||
| Version: | unspecified | ||
| Hardware: | Other | ||
| OS: | Linux | ||
| URL: | https://libre-soc.org/openpower/sv/remap | ||
| See Also: | https://bugs.libre-soc.org/show_bug.cgi?id=701 https://bugs.libre-soc.org/show_bug.cgi?id=788 | ||
| NLnet milestone: | NLnet.2019.02.012 | total budget (EUR) for completion of task and all subtasks: | 0 |
| budget (EUR) for this task, excluding subtasks' budget: | 0 | parent task for budget allocation: | |
| child tasks for budget allocation: | The table of payments (in EUR) for this task; TOML format: | ||
| Bug Depends on: | |||
| Bug Blocks: | 213 | ||
|
Description
Luke Kenneth Casson Leighton
2019-10-07 13:01:01 BST
Apologies, I hadn't realised quite how important swizzling really is.

https://libre-riscv.org/simple_v_extension/vblock_format/#swizzle_format

I have been looking at the PLX 3D paper and it contains an algorithm for a 4x4 matrix times a 4x1 vector. That algorithm is:

    fmul f2, f1.xxxx, f10
    fmac f2, f1.yyyy, f11, f2
    fmac f2, f1.zzzz, f12, f2
    fmac f2, f1.wwww, f13, f2

The VBLOCK swizzle table format can cope with this in a single block by setting a swizzler onto four registers that are *redirected* to f1, each with a different swizzle setting. Macro-op fusion would result in *doubling* the number of instructions. Neither is ideal.

For this particular case, however, I am inclined to review the decision to put the REMAP CSR on the back burner.

https://libre-riscv.org/simple_v_extension/remap/

These were intended for Matrices; however, I forgot about them after thinking that Vector Mul was not as high a priority. Swizzle looks to be extremely awkward and costly, making the REMAP CSRs attractive by comparison.

With the right REMAP, setting

* SHAPE1 to operate on a 4-element continuous loop, attached to f2
* SHAPE2 to wait 4 elements before incrementing by 1, attached to f1

the Matrix Multiply is LITERALLY reduced to 2 instructions: one clears f2 to 4 zeros, the other is an FMAC with a VL of 16 (no SUBVLs). VL could be set with an SVP-64 instruction, so there is no need to set up a VBLOCK. The alternative is to add REMAP to VBLOCK. |
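
To make the index arithmetic behind the two SHAPEs concrete, here is a minimal Python sketch. It models element indices only, not the actual CSR encoding. It assumes the matrix is held column by column in f10..f13, so a VL=16 operation walks it as one flat run of 16 elements. The function names `matvec_swizzle` and `matvec_remap` are hypothetical, introduced purely for illustration.

```python
def matvec_swizzle(matrix_cols, vec):
    """Reference model of the four-instruction fmul/fmac sequence.

    matrix_cols[k] stands in for register f10+k (4 elements each),
    vec stands in for f1 (x, y, z, w).
    """
    acc = [0.0] * 4
    for k in range(4):              # k picks the broadcast element: x, y, z, w
        for lane in range(4):       # one 4-wide multiply-accumulate per k
            acc[lane] += vec[k] * matrix_cols[k][lane]
    return acc


def matvec_remap(matrix_flat, vec, vl=16):
    """Model of the two-instruction REMAP version.

    Assumption: matrix_flat is f10..f13 viewed as 16 consecutive elements.
    """
    acc = [0.0] * 4                 # instruction 1: clear f2 to 4 zeros
    for i in range(vl):             # instruction 2: a single FMAC with VL=16
        dst = i % 4                 # SHAPE1: 4-element continuous loop on f2
        src = i // 4                # SHAPE2: f1 index advances once every 4 elements
        acc[dst] += vec[src] * matrix_flat[i]
    return acc


# Small usage example: both models should agree on the same inputs.
cols = [[1.0, 2.0, 3.0, 4.0],
        [5.0, 6.0, 7.0, 8.0],
        [9.0, 10.0, 11.0, 12.0],
        [13.0, 14.0, 15.0, 16.0]]
flat = [element for col in cols for element in col]
vector = [1.0, 0.0, 2.0, 0.0]
print(matvec_swizzle(cols, vector))
print(matvec_remap(flat, vector))
```

The point of the sketch is that the swizzle version needs four separate broadcasts of f1, whereas the remapped version expresses the same 16 multiply-accumulates as one linear loop with two simple index transforms.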