Bug 776

Summary: Documentation of designs, code, processes, and other relevant things as needed
Product: Libre-SOC's second ASIC
Reporter: Jacob Lifshay <programmerjake>
Component: source code
Assignee: Luke Kenneth Casson Leighton <lkcl>
Status: CONFIRMED
Severity: enhancement
CC: andy.miroshnikov, ghostmansd, libre-soc-bugs, lkcl, programmerjake, shriya.sharma
Priority: ---
Version: unspecified
Hardware: Other
OS: Linux
See Also: https://bugs.libre-soc.org/show_bug.cgi?id=1152
NLnet milestone: NLnet.2021.02A.052.CryptoRouter
total budget (EUR) for completion of task and all subtasks: 8000
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation: 589
child tasks for budget allocation: 968 1006 1158 1166
The table of payments (in EUR) for this task; TOML format:
Bug Depends on: 809
Bug Blocks: 589

Description Jacob Lifshay 2022-02-15 06:49:56 GMT
In order to make it more likely for our project to be understandable and useful,
documentation of designs, code, processes, and other relevant things is necessary.
ISA Standard creation and submission are covered by bug #952.

Comment 1 Luke Kenneth Casson Leighton 2023-09-05 04:53:08 BST
konstantinos i am assigning this bugreport to you as a reminder to
add an ed25519 sub-bug, and also to discuss who is going to
add/document a "bigint long-multiply REMAP Schedule" that i
need to sketch an outline for, as well. as jacob has done
a Prefix-Sum REMAP a few months back he can guide on doing it.

Comment 2 Jacob Lifshay 2023-09-05 05:00:50 BST
(In reply to Luke Kenneth Casson Leighton from comment #1)
> konstantinos i am assigning this bugreport to you as a reminder to
> add an ed25519 sub-bug, and also to discuss who is going to
> add/document a "bigint long-multiply REMAP Schedule" that i
> need to sketch an outline for, as well. as jacob has done
> a Prefix-Sum REMAP a few months back he can guide on doing it.

unfortunately, because a long-multiply needs 2 kinds of insns (carrying-wide-madd and carrying-add), you can't easily do that as a REMAP schedule. Additionally, it is substantially faster to use Karatsuba multiplication once you get inputs more than a few hundred bits wide (and other more complex algorithms for wider multiplies).
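
To illustrate why two kinds of operation appear, here is a minimal Python sketch of schoolbook O(n^2) long multiplication on little-endian limbs. The 64-bit limb size and the helper names `wide_madd` and `add_carry` are assumptions made for this example; they only model a widening multiply-add and an add-with-carry, and are not the actual instruction definitions.

```python
MASK = (1 << 64) - 1  # assumption: 64-bit limbs


def wide_madd(a, b, c):
    """Hypothetical model of a widening multiply-add: (low, high) of a*b + c."""
    t = a * b + c
    return t & MASK, t >> 64


def add_carry(a, b, carry_in):
    """Hypothetical model of an add-with-carry: (sum, carry_out)."""
    t = a + b + carry_in
    return t & MASK, t >> 64


def to_limbs(value, count):
    """Split a non-negative int into `count` little-endian 64-bit limbs."""
    return [(value >> (64 * i)) & MASK for i in range(count)]


def from_limbs(limbs):
    """Recombine little-endian 64-bit limbs into an int."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))


def long_multiply(x, y):
    """Schoolbook product of two little-endian limb lists."""
    result = [0] * (len(x) + len(y))
    for j, y_limb in enumerate(y):
        # Operation kind 1: build one row of partial products, chaining
        # the high half of each product into the next multiply-add.
        row = []
        hi = 0
        for x_limb in x:
            lo, hi = wide_madd(x_limb, y_limb, hi)
            row.append(lo)
        row.append(hi)
        # Operation kind 2: add the row into the accumulator at offset j,
        # chaining the carry bit from limb to limb.
        carry = 0
        for i, limb in enumerate(row):
            result[i + j], carry = add_carry(result[i + j], limb, carry)
        assert carry == 0  # the running partial product always fits
    return result
```

The two inner loops are the point of the sketch: the first chains a multi-bit "high half" between steps, the second chains a single carry bit, so one loop body does not cover both.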

Comment 3 Konstantinos Margaritis (markos) 2023-09-05 18:21:33 BST
(In reply to Jacob Lifshay from comment #2)
> unfortunately, because a long-multiply needs 2 kinds of insns
> (carrying-wide-madd and carrying-add), you can't easily do that as a REMAP
> schedule. Additionally, it is substantially faster to use Karatsuba
> multiplication once you get inputs more than a few hundred bits wide (and
> other more complex algorithms for wider multiplies).

I would pick the simplest and fastest to implement long-multiply method for this one, speed is not a requirement. We can always optimize later.

Comment 4 Jacob Lifshay 2023-09-05 18:50:47 BST
(In reply to Konstantinos Margaritis (markos) from comment #3)
> (In reply to Jacob Lifshay from comment #2)
> > unfortunately, because a long-multiply needs 2 kinds of insns
> > (carrying-wide-madd and carrying-add), you can't easily do that as a REMAP
> > schedule. Additionally, it is substantially faster to use Karatsuba
> > multiplication once you get inputs more than a few hundred bits wide (and
> > other more complex algorithms for wider multiplies).
> 
> I would pick the simplest and fastest to implement long-multiply method for
> this one, speed is not a requirement. We can always optimize later.

yes, except that the stuff that's going into the PowerISA spec. needs to actually be as fast as we can make it since it's for forever, not just for the crypto-router.

imho doing REMAP for just O(n^2) multiply is fine (except for the complexity due to multiple different insns), since Karatsuba multiplication can just run those insns a bunch of times.
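
A minimal Python sketch of how Karatsuba multiplication can sit on top of an O(n^2) routine: the recursion splits the operands until they fall below a threshold, then hands off to the base-case multiply. The function name `karatsuba`, the `threshold_bits` parameter and its default are assumptions for illustration, and Python's built-in `*` stands in for the O(n^2) base case.

```python
def karatsuba(x, y, threshold_bits=256):
    """Multiply non-negative ints x and y using Karatsuba recursion.

    threshold_bits (hypothetical tuning knob, keep it at 64 or above):
    operands at or below this width go to the base-case multiply.
    """
    if x.bit_length() <= threshold_bits or y.bit_length() <= threshold_bits:
        # Base case: stand-in for the schoolbook O(n^2) multiply.
        return x * y

    # Split both operands at the same bit position.
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x_lo, x_hi = x & mask, x >> half
    y_lo, y_hi = y & mask, y >> half

    # Three half-size multiplies instead of four.
    low = karatsuba(x_lo, y_lo, threshold_bits)
    high = karatsuba(x_hi, y_hi, threshold_bits)
    cross = karatsuba(x_lo + x_hi, y_lo + y_hi, threshold_bits) - low - high

    return (high << (2 * half)) + (cross << half) + low
```

With this structure, the base-case multiply is the only place where the limb-level instructions would run; the recursion above it is shifts, adds and subtracts.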

Comment 5 Luke Kenneth Casson Leighton 2023-09-05 19:06:05 BST
(In reply to Konstantinos Margaritis (markos) from comment #3)

> I would pick the simplest and fastest to implement long-multiply method for
> this one, speed is not a requirement. We can always optimize later.

the top priority for the embedded application which is commercially
confidential is to fit within 1 to 2 L1 cache lines.

that is *real* tight.

optimisation for "speed" is very low priority indeed.

Knuth Algorithms D and M are perfectly fine, and Jacob and I already
did the conversion when doing the madd, dsld and divmod instructions.

but for ed25519 a totally different approach is needed because they
did carry-save.  please read the edited comment on that, raise the
bugreports so i can properly fill them in.
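
One possible reading of "carry-save" here, stated as an assumption since the thread does not spell it out: field elements modulo 2^255 - 19 are held in limbs narrower than the machine word, so several additions can be done limb by limb before any carry is propagated. The minimal Python sketch below uses five 51-bit limbs; that layout and all function names are illustrative choices, not taken from this bug report.

```python
P = 2**255 - 19
LIMB_BITS = 51  # assumption: five 51-bit limbs in 64-bit words
LIMB_MASK = (1 << LIMB_BITS) - 1


def to_limbs(value):
    """Split an int below 2^255 into five little-endian 51-bit limbs."""
    return [(value >> (LIMB_BITS * i)) & LIMB_MASK for i in range(5)]


def from_limbs(limbs):
    """Recombine limbs into an int; limbs may be wider than 51 bits."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def add_lazy(a, b):
    """Limb-wise add with no carry chain; limbs may grow past 51 bits."""
    return [x + y for x, y in zip(a, b)]


def propagate(limbs):
    """One deferred carry pass.

    The carry out of the top limb has weight 2^255, which is congruent
    to 19 modulo P, so it is folded back into limb 0 multiplied by 19.
    """
    out = list(limbs)
    carry = 0
    for i in range(5):
        out[i] += carry
        carry = out[i] >> LIMB_BITS
        out[i] &= LIMB_MASK
    out[0] += 19 * carry
    return out
```

Because `add_lazy` has no dependency between limbs, it does not need a carrying-add chain at all, which is one way such code differs from the Knuth-style long arithmetic.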