Bug 1085

Summary: parallel prefix sum polynomial interpolation
Product: Libre-SOC's first SoC
Reporter: Luke Kenneth Casson Leighton <lkcl>
Component: Documentation
Assignee: Luke Kenneth Casson Leighton <lkcl>
Status: CONFIRMED
Severity: enhancement
CC: libre-soc-bugs, programmerjake
Priority: High
Version: unspecified
Hardware: Other
OS: Linux
NLnet milestone: ---
total budget (EUR) for completion of task and all subtasks: 0
budget (EUR) for this task, excluding subtasks' budget: 0
parent task for budget allocation:
child tasks for budget allocation:
The table of payments (in EUR) for this task; TOML format:
Bug Depends on:
Bug Blocks: 953, 1019

Description Luke Kenneth Casson Leighton 2023-05-19 09:47:04 BST
a recipe using prefix sum REMAP is needed that does polynomial
interpolation of non-linear equations, Taylor series, etc.
Comment 1 Luke Kenneth Casson Leighton 2023-05-19 09:48:28 BST
https://github.com/jiahao/parallel-prefix
Comment 2 Luke Kenneth Casson Leighton 2023-05-19 12:18:34 BST
<markos_> I've done a similar method in the past
<programmerjake> idk, but parallel prefix sum and parallel tree reduction can be used for polynomial evaluation:
<programmerjake> reduce(element_wise_mul(c, prefix_sum([1] + splat(x), op=mul)), op=add) ==
<programmerjake> c[0]+c[1]*x+c[2]*x^2+c[3]*x^3+c[4]*x^4...
<programmerjake> though it tends to have lower precision than using fma because of rounding before adding
Comment 3 Luke Kenneth Casson Leighton 2023-05-19 13:59:44 BST
(In reply to Luke Kenneth Casson Leighton from comment #2)
> <markos_> I've done a similar method in the past
> <programmerjake> idk, but parallel prefix sum and parallel tree reduction
> can be used for polynomial evaluation:
> <programmerjake> reduce(element_wise_mul(c, prefix_sum([1] + splat(x),
> op=mul)), op=add) ==
> <programmerjake> c[0]+c[1]*x+c[2]*x^2+c[3]*x^3+c[4]*x^4...
> <programmerjake> though it tends to have lower precision than using fma
> because of rounding before adding

ok writing this out myself so that my brain absorbs it

* start with:
    [1, x, x, x, x...]
* perform prefix sum multiply (parallel is fine)
    [1, x, x^2, x^3, ....]
* take vector of c
    [c0, c1, c2, c3, c4...]
* straight multiply:
    [c0, c1*x, c2*x^2, ....]
* perform reduction
  c0 + c1*x + c2*x^2 ...

ok that is dead easy.
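The steps above can be modelled in plain Python, with
itertools.accumulate standing in for prefix-sum REMAP and sum() for the
final reduction (function name is illustrative, not from the codebase):

```python
from itertools import accumulate
from operator import mul

def poly_eval_prefix(c, x):
    # step 1: start with [1, x, x, x, ...] -- a 1 followed by splat(x)
    v = [1.0] + [x] * (len(c) - 1)
    # step 2: prefix-sum with multiply -> [1, x, x^2, x^3, ...]
    powers = list(accumulate(v, mul))
    # steps 3+4: straight element-wise multiply by the coefficients
    terms = [ci * p for ci, p in zip(c, powers)]
    # step 5: reduction -> c0 + c1*x + c2*x^2 + ...
    return sum(terms)
```

both the prefix-sum and the reduction are associative, so in hardware
they can each run as a parallel tree rather than the sequential loops
Python uses here.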

so now what about parallel reduction using fmac?  let us first try
mapreduce on fmac:
   
    sv.madd/mr res, *c, *xv, res

that looks perfectly fine, and can get accuracy from
fmac.
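the mapreduce accumulation that sv.madd/mr performs corresponds to this
sequential loop (a pure-Python model: note that a real fmac rounds only
once per multiply-add, which the separate `*` and `+` here do not
capture; xv is assumed to hold the precomputed powers of x):

```python
def poly_eval_mapreduce(c, x):
    # xv: precomputed powers [1, x, x^2, ...] (from the prefix-sum step)
    xv = [x ** i for i in range(len(c))]
    res = 0.0
    # sv.madd/mr res, *c, *xv, res : each step does res = c[i]*xv[i] + res
    for ci, xi in zip(c, xv):
        res = ci * xi + res
    return res
```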

now for parallel reduction: res would have to be a vector,
but i don't believe it works, because *c (or *xv) would
end up getting multiplied in more than once -- unless done
as Vertical-First.
Comment 4 Jacob Lifshay 2023-05-19 17:58:38 BST
(In reply to Luke Kenneth Casson Leighton from comment #3)
> (In reply to Luke Kenneth Casson Leighton from comment #2)
> > <markos_> I've done a similar method in the past
> > <programmerjake> idk, but parallel prefix sum and parallel tree reduction
> > can be used for polynomial evaluation:
> > <programmerjake> reduce(element_wise_mul(c, prefix_sum([1] + splat(x),
> > op=mul)), op=add) ==
> > <programmerjake> c[0]+c[1]*x+c[2]*x^2+c[3]*x^3+c[4]*x^4...
> > <programmerjake> though it tends to have lower precision than using fma
> > because of rounding before adding
> 
> ok writing this out myself so that my brain absorbs it
> 
> * start with:
>     [1, x, x, x, x...]
> * perform prefix sum multiply (parallel is fine)
>     [1, x, x^2, x^3, ....]
> * take vector of c
>     [c0, c1, c2, c3, c4...]
> * straight multiply:
>     [c0, c1*x, c2*x^2, ....]
> * perform reduction
>   c0 + c1*x + c2*x^2 ...

parallel reduction is fine here too -- reduce using fadd, though as noted it gets less accuracy than using fma (because fma doesn't round between the mul and add, and because higher powers of x tend to give monomials (coefficients included) that are smaller, so adding a smaller approximate value to a larger exact value (the coefficient) gives a result with higher relative precision).
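the fadd tree reduction over the pre-multiplied terms can be modelled
like so (pairwise order, as a hardware parallel reduce would sum; a
sketch, not the SVP64 semantics):

```python
def tree_reduce_add(v):
    # pairwise (tree) reduction: log2(n) levels of adds
    v = list(v)
    while len(v) > 1:
        pairs = [v[i] + v[i + 1] for i in range(0, len(v) - 1, 2)]
        if len(v) % 2:          # odd element carries to the next level
            pairs.append(v[-1])
        v = pairs
    return v[0]
```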
Comment 5 Luke Kenneth Casson Leighton 2023-05-19 18:06:54 BST
(In reply to Jacob Lifshay from comment #4)

> parallel reduction is fine here too -- reduce using fadd, though as noted it
> gets less accuracy than using fma (because fma doesn't round between the mul
> and add and because higher powers of x tend to give monomials (includes
> coefficients) that are smaller, so adding a smaller approximate value to a
> larger exact value (the coefficient) gives a value with higher relative
> precision).

sounds to me like /mrr (reverse-gear) would be sensible to use,
at the end where the tinier values would accumulate with better
accuracy.

exactly the kind of thing to put in an app note.
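the reverse-gear element order can be modelled as follows (Python
sketch of the ordering only -- the point being that for the |x| < 1
range typical of Taylor-series use, the higher-power monomials are the
smaller ones, so accumulating them first loses less to rounding):

```python
def poly_eval_mrr(c, x):
    # /mrr: mapreduce in reverse element order -- highest power
    # (typically the smallest monomial for |x| < 1) accumulates first
    res = 0.0
    for i in reversed(range(len(c))):
        res = c[i] * x ** i + res
    return res
```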