| Summary: | allowing LD/ST-Update to select individual registers needed | | |
|---|---|---|---|
| Product: | Libre-SOC's first SoC | Reporter: | Luke Kenneth Casson Leighton <lkcl> |
| Component: | Specification | Assignee: | Luke Kenneth Casson Leighton <lkcl> |
| Status: | CONFIRMED | | |
| Severity: | enhancement | CC: | ghostmansd, libre-soc-isa, programmerjake |
| Priority: | --- | | |
| Version: | unspecified | | |
| Hardware: | Other | | |
| OS: | Linux | | |
| See Also: | https://bugs.libre-soc.org/show_bug.cgi?id=1150 | | |
| NLnet milestone: | --- | total budget (EUR) for completion of task and all subtasks: | 0 |
| budget (EUR) for this task, excluding subtasks' budget: | 0 | parent task for budget allocation: | |
| child tasks for budget allocation: | | The table of payments (in EUR) for this task; TOML format: | |
| Bug Depends on: | | | |
| Bug Blocks: | 1047, 1079, 1056 | | |
Description
Luke Kenneth Casson Leighton
2023-05-08 18:02:32 BST
(In reply to Luke Kenneth Casson Leighton from comment #0)
> lwzux,LDST_IDX,,2P,EXTRA2,EN,d:RT,d:RA,s:RB,0,RA_OR_ZERO,RB,0,RT,0,0,RA
> stdux,LDST_IDX,,2P,EXTRA2,EN,d:RA,s:RS;s:RA,s:RB,0,RA_OR_ZERO,RB,RS,0,0,0,RA
>
> these become harder as the encoding space is only 6 bits (and there
> are 3 regs, RT/RS RA RB) due to Twin-Predication taking up 3 bits
> of EXTRA

this cannot be lost as it destroys VSPLAT VINDEX VGATHER VSCATTER:

    MASK_SRC 16:18 Execution Mask for Source

so has to stay. that leaves just 6 bits to cover 3 registers.
here's the bits of RM:

| Field Name | Field bits | Description |
|---|---|---|
| MASKMODE | 0 | Execution (predication) Mask Kind |
| MASK | 1:3 | Execution Mask |
| SUBVL | 8:9 | Sub-vector length |
| ELWIDTH | 4:5 | Element Width |
| ELWIDTH_SRC | 6:7 | Element Width for Source |
| EXTRA | 10:18 | Register Extra encoding |
| MODE | 19:23 | changes Vector behaviour |

can't lose mask. can't lose SUBVL (priority for Pack/Unpack, already
discussed bug #1077). *could* consider ELWIDTH_SRC, what effect does
that have?

* Vector of RB offsets could no longer be compressed
* SEA becomes pointless

could ELWIDTH instead be considered, and the operation width
(ld lw lh lb) be used in its place?

* yes as long as losing saturation and sign-extending is ok.
  (setting a larger ELWIDTH than the operation is a way to do zero or
  sign extending without needing intermediary registers to perform the
  extsb/h/w. losing ELWIDTH would require the extra instruction and
  registers.)
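[editor's note] the RM layout above can be sanity-checked with a small decoder. This is a minimal illustrative sketch, not spec code: the field ranges are copied from the table, and bit numbering is assumed MSB0 over the 24-bit RM (Power-ISA convention) -- that ordering is an assumption here.

```python
# Hypothetical decoder for the 24-bit SVP64 RM field, using the bit
# ranges from the table above. Bits are numbered MSB0: bit 0 is the
# most significant of the 24 (an assumed convention for illustration).
RM_FIELDS = {
    "MASKMODE":    (0, 0),
    "MASK":        (1, 3),
    "ELWIDTH":     (4, 5),
    "ELWIDTH_SRC": (6, 7),
    "SUBVL":       (8, 9),
    "EXTRA":       (10, 18),
    "MODE":        (19, 23),
}

def decode_rm(rm24):
    """Split a 24-bit RM value into its named sub-fields (MSB0)."""
    out = {}
    for name, (lo, hi) in RM_FIELDS.items():
        width = hi - lo + 1
        shift = 24 - 1 - hi          # MSB0: bit 0 is the top bit
        out[name] = (rm24 >> shift) & ((1 << width) - 1)
    return out
```

note the ranges tile the full 0..23 with no overlap, which is the point of the "can't lose X" trade-off: every bit given to one field is taken from another.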
(In reply to Luke Kenneth Casson Leighton from comment #1)
> (In reply to Luke Kenneth Casson Leighton from comment #0)
> > lwzux,LDST_IDX,,2P,EXTRA2,EN,d:RT,d:RA,s:RB,0,RA_OR_ZERO,RB,0,RT,0,0,RA
> > stdux,LDST_IDX,,2P,EXTRA2,EN,d:RA,s:RS;s:RA,s:RB,0,RA_OR_ZERO,RB,RS,0,0,0,RA
> >
> > these become harder as the encoding space is only 6 bits (and there
> > are 3 regs, RT/RS RA RB) due to Twin-Predication taking up 3 bits
> > of EXTRA
>
> this cannot be lost as it destroys VSPLAT VINDEX VGATHER VSCATTER

please define VINDEX -- it is non-standard terminology -- do you mean
load/store with index remap? that's basically gather/scatter but done
using a different mechanism.

actually, assuming the above definition of VINDEX, none of
splat/gather/scatter (also includes VINDEX since that's basically
gather/scatter) need more than one predicate. They work just fine on
other ISAs with at most one predicate (e.g. RVV and AVX2/AVX512 all
have separate splat/scatter/gather/compress/expand instructions that
only have 1 predicate).

The only load/store ops that need more than one predicate are
compress/expand load/store (since they are only expressible by
twin-predication in SVP64, there being no dedicated compress/expand
instructions or SVP64 MODEs), which can easily be done using ld/std
(and maybe the *u or *x versions, but not both) instead of ldux/stdux.

iirc the plan was originally to have twin-predication only on 1-in/1-out
operations, which ldux/stdux clearly are not.

> > MASK_SRC 16:18 Execution Mask for Source
>
> so has to stay. that leaves just 6 bits to cover 3 registers.
> here's the bits of RM:
>
> Field Name  Field bits  Description
> MASKMODE    0           Execution (predication) Mask Kind
> MASK        1:3         Execution Mask
> SUBVL       8:9         Sub-vector length
> ELWIDTH     4:5         Element Width
> ELWIDTH_SRC 6:7         Element Width for Source
> EXTRA       10:18       Register Extra encoding
> MODE        19:23       changes Vector behaviour
>
> can't lose mask.
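[editor's note] the compress/expand point is easiest to see from twin-predication's stepping rule: the source index advances over set bits of the source mask, the destination index over set bits of the dest mask. A rough sketch with illustrative names and deliberately simplified semantics (zeroing and the other SVP64 modes are ignored):

```python
def twin_pred_move(src, smask, dmask, vl):
    """Sketch of a twin-predicated element move: the source index
    steps over set bits of smask, the dest index over set bits of
    dmask, and an element is copied at each paired position."""
    dst = [None] * vl          # None marks untouched destination slots
    s = d = 0
    for _ in range(vl):
        while s < vl and not (smask >> s) & 1:
            s += 1             # skip masked-out source elements
        while d < vl and not (dmask >> d) & 1:
            d += 1             # skip masked-out destination slots
        if s >= vl or d >= vl:
            break
        dst[d] = src[s]
        s += 1
        d += 1
    return dst
```

with a sparse smask and an all-ones dmask this behaves as a compress; swapping the roles gives an expand -- which is why compress/expand are the one load/store pattern that genuinely needs both masks.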
> can't lose SUBVL (priority for Pack/Unpack, already
> discussed bug #1077). *could* consider ELWIDTH_SRC, what effect does
> that have?
>
> * Vector of RB offsets could no longer be compressed
> * SEA becomes pointless
>
> could ELWIDTH instead be considered, and the operation width
> (ld lw lh lb) be used in its place?
>
> * yes as long as losing saturation and sign-extending is ok.

simple -- just set ELWIDTH larger than the load op and the load op
intrinsically will do the sign/zero extend, no need for SVP64 to add
sign/zero extension on top of that. (with the sole exception of signed
bytes, thanks PowerISA for being non-orthogonal)

saturation can still be done -- saturating from the load's type to the
dest type (ELWIDTH + saturation's unsigned/signed bit).

so this removes any need for ELWIDTH_SRC on any load/store ops afaict.

(In reply to Jacob Lifshay from comment #2)
> (In reply to Luke Kenneth Casson Leighton from comment #1)
> > (In reply to Luke Kenneth Casson Leighton from comment #0)
> > > lwzux,LDST_IDX,,2P,EXTRA2,EN,d:RT,d:RA,s:RB,0,RA_OR_ZERO,RB,0,RT,0,0,RA
> > > stdux,LDST_IDX,,2P,EXTRA2,EN,d:RA,s:RS;s:RA,s:RB,0,RA_OR_ZERO,RB,RS,0,0,0,RA
> > >
> > > these become harder as the encoding space is only 6 bits (and there
> > > are 3 regs, RT/RS RA RB) due to Twin-Predication taking up 3 bits
> > > of EXTRA
> >
> > this cannot be lost as it destroys VSPLAT VINDEX VGATHER VSCATTER
>
> please define VINDEX

sm=1<<r3. or just sm=r3 where one bit is set. there is probably another
name for it.

> -- it is non-standard terminology -- do you mean
> load/store with index remap?

no. i would have said Indexed REMAP.

> predicate). The only load/store ops that need more than one predicate are
> compress/expand load/store

i.e. all of them (as far as the actual scalar ld/sts are concerned)

> iirc the plan was originally to have twin-predication only on 1-in/1-out
> operations, which ldux/stdux clearly are not.

the address (EA) is considered to be "1" in this case.
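[editor's note] the "set ELWIDTH larger than the load op" trick can be sketched numerically: the load itself performs the sign/zero extension up to the destination element width, so no separate extsb/extsh/extsw pass (or intermediary registers) is needed. An illustrative model, not spec code; little-endian byte order assumed:

```python
def load_and_widen(mem_bytes, signed, dest_elwidth_bits):
    """Load a narrow in-memory value and widen it to the destination
    element width, sign- or zero-extending as the load op dictates."""
    val = int.from_bytes(mem_bytes, "little", signed=signed)
    # mask to the destination element width (two's-complement wrap
    # turns a negative value into its sign-extended bit pattern)
    return val & ((1 << dest_elwidth_bits) - 1)
```

usage note: a zero-extending byte load (lbz-style) into a 16-bit element gives 0x00FF for the byte 0xFF, while a sign-extending load of the same byte gives 0xFFFF -- the extend falls out of the load, for free.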
> > could ELWIDTH instead be considered, and the operation width
> > (ld lw lh lb) be used in its place?
> >
> > * yes as long as losing saturation and sign-extending is ok.
>
> simple -- just set ELWIDTH larger than the load op and the load op
> intrinsically will do the sign/zero extend, no need for SVP64 to add
> sign/zero extension on top of that. (with the sole exception of signed
> bytes, thanks PowerISA for being non-orthogonal)

deep joy. and it isn't _particularly_ useful to do shorter (load then
truncate, that's just dumb).

> saturation can still be done -- saturating from the load's type to the dest
> type (ELWIDTH + saturation's unsigned/signed bit).
>
> so this removes any need for ELWIDTH_SRC on any load/store ops afaict.

okaay. now we are cooking with gas. next stage, given two free bits,
is to work out what regs can be expanded from EXTRA2 to EXTRA3.

* lwzux RT,RA,RB: if vectorised and used for list-pointer-chaining,
  it is RT and RA that must be allowed to be one-different. RB, because
  it is not updated, need not be EXTRA3.
* stdux RS,RA,RB: likewise.

aieee this is going to be fun.

(In reply to Luke Kenneth Casson Leighton from comment #3)
> (In reply to Jacob Lifshay from comment #2)
> > (In reply to Luke Kenneth Casson Leighton from comment #1)
> > > (In reply to Luke Kenneth Casson Leighton from comment #0)
> > > > lwzux,LDST_IDX,,2P,EXTRA2,EN,d:RT,d:RA,s:RB,0,RA_OR_ZERO,RB,0,RT,0,0,RA
> > > > stdux,LDST_IDX,,2P,EXTRA2,EN,d:RA,s:RS;s:RA,s:RB,0,RA_OR_ZERO,RB,RS,0,0,0,RA
> > > >
> > > > these become harder as the encoding space is only 6 bits (and there
> > > > are 3 regs, RT/RS RA RB) due to Twin-Predication taking up 3 bits
> > > > of EXTRA
> > >
> > > this cannot be lost as it destroys VSPLAT VINDEX VGATHER VSCATTER
> >
> > please define VINDEX
>
> sm=1<<r3. or just sm=r3 where one bit is set. there is probably another
> name for it.
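[editor's note] the "two free bits" arithmetic above can be spelled out as a back-of-envelope check, on the assumption that the two bits freed by dropping ELWIDTH_SRC can be repurposed for register extension:

```python
# Bit-budget sketch for sv.lwzux/sv.stdux register extension, per the
# discussion above. EXTRA is 9 bits; twin-predication's MASK_SRC
# occupies 3 of them; dropping ELWIDTH_SRC frees 2 more (assumption:
# those bits can be handed over to EXTRA duty).
EXTRA_BITS = 9
MASK_SRC_BITS = 3
FREED_BITS = 2

budget = EXTRA_BITS - MASK_SRC_BITS + FREED_BITS   # 8 bits total

# RT (or RS) and RA want EXTRA3 (3 bits each) so update-form
# pointer-chaining can make them "one-different"; RB is never
# updated, so EXTRA2 (2 bits) suffices for it.
assert budget == 3 + 3 + 2
```

without the freed bits the budget is only 6, which forces EXTRA2 on all three registers -- exactly the starting position of this bug.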
the standard name is extractelement or extract
https://llvm.org/docs/LangRef.html#extractelement-instruction

imho it may be more efficient to simply add r3 to the load address and
perform a scalar load (optionally SVP64 prefixed) rather than setting
sm=1<<r3, since that's much simpler and simple hardware then won't issue
VL load ops for only one of them to succeed.

extractelement is only really useful when extracting from a vector
already in registers, since you can't always just add to the address
for that.

(In reply to Jacob Lifshay from comment #4)
> (In reply to Luke Kenneth Casson Leighton from comment #3)
> > (In reply to Jacob Lifshay from comment #2)
> > > (In reply to Luke Kenneth Casson Leighton from comment #1)
> > > > (In reply to Luke Kenneth Casson Leighton from comment #0)
> > > > > lwzux,LDST_IDX,,2P,EXTRA2,EN,d:RT,d:RA,s:RB,0,RA_OR_ZERO,RB,0,RT,0,0,RA
> > > > > stdux,LDST_IDX,,2P,EXTRA2,EN,d:RA,s:RS;s:RA,s:RB,0,RA_OR_ZERO,RB,RS,0,0,0,RA
> > > > >
> > > > > these become harder as the encoding space is only 6 bits (and there
> > > > > are 3 regs, RT/RS RA RB) due to Twin-Predication taking up 3 bits
> > > > > of EXTRA
> > > >
> > > > this cannot be lost as it destroys VSPLAT VINDEX VGATHER VSCATTER
> > >
> > > please define VINDEX
> >
> > sm=1<<r3. or just sm=r3 where one bit is set. there is probably another
> > name for it.
>
> the standard name is extractelement or extract
> https://llvm.org/docs/LangRef.html#extractelement-instruction
>
> imho it may be more efficient to simply add r3 to the load address and
> perform a scalar load (optionally SVP64 prefixed) rather than setting
> sm=1<<r3, since that's much simpler and simple hardware then won't issue VL
> load ops for only one of them to succeed.
>
> extractelement is only really useful when extracting from a vector already
> in registers, since you can't always just add to the address for that.
oh, and insert or insertelement for the other way:
https://llvm.org/docs/LangRef.html#insertelement-instruction

(In reply to Jacob Lifshay from comment #4)
> the standard name is extractelement or extract
> https://llvm.org/docs/LangRef.html#extractelement-instruction
>
> imho

(please do drop that, it's an affectation that gets tiring. we don't
need to know that your opinion is "humble" - here we just need to know
what your [valued] insights are, as third-person-objective constructive
input. also please trim)

> it may be more efficient to simply add r3 to the load address and
> perform a scalar load (optionally SVP64 prefixed) rather than setting
> sm=1<<r3, since that's much simpler and simple hardware then won't issue VL
> load ops for only one of them to succeed.

ta-daaa, now you're getting it. and that's an optimisation that would
be performed by hardware that chose to implement micro-coding (which does
*not* mean "like intel does it", it just means "some form of rewriting"
rather than "straight naive 1:1". microwatt does micro-coding into OP_ADD)

(In reply to Luke Kenneth Casson Leighton from comment #6)
> (In reply to Jacob Lifshay from comment #4)
> > it may be more efficient to simply add r3 to the load address and
> > perform a scalar load (optionally SVP64 prefixed) rather than setting
> > sm=1<<r3, since that's much simpler and simple hardware then won't issue VL
> > load ops for only one of them to succeed.
>
> ta-daaa, now you're getting it. and that's an optimisation that would
> be performed by hardware that chose to implement micro-coding (which does
> *not* mean "like intel does it", it just means "some form of rewriting"
> rather than "straight naive 1:1". microwatt does micro-coding into OP_ADD)

umm, you seem to have missed my point, which is that programmers should
write a scalar load instruction (sv.ldx r4, r5, r3) rather than
sv.ld/sm=1<<r3 r4, 0(r5), since simple cpus won't perform that
optimization since that's more complex to do.
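[editor's note] the trade-off in this exchange can be made concrete with a toy model (memory as a python list, illustrative names): the masked vector form walks all VL element slots and loads only where its single mask bit is set, while the indexed scalar form computes the one effective address directly.

```python
def masked_vector_load(mem, base, elwidth, vl, smask):
    """sv.ld/sm=1<<r3 style: iterate all VL elements, loading only
    where the source mask bit is set (simple hardware still issues
    the full element-loop even though one element survives)."""
    out = [None] * vl
    for i in range(vl):
        if (smask >> i) & 1:
            out[i] = mem[base + i * elwidth]
    return out

def indexed_scalar_load(mem, base, elwidth, r3):
    """sv.ldx style: EA = base + r3*elwidth, exactly one load."""
    return mem[base + r3 * elwidth]
```

both return element r3, but the second does it in one access with no mask machinery -- jacob's point about what simple cpus can be expected to do.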
(In reply to Jacob Lifshay from comment #7)
> umm, you seem to have missed my point which is that programmers should write
> a scalar load instruction (sv.ldx r4, r5, r3) rather than sv.ld/sm=1<<r3 r4,
> 0(r5) since simple cpus won't perform that optimization since that's more
> complex to do.

whoops, that should have been sv.ld/sm=1<<r3 r4, 0(*r5)

(In reply to Jacob Lifshay from comment #7)
> umm, you seem to have missed my point which is that programmers should write
> a scalar load instruction (sv.ldx r4, r5, r3) rather than sv.ld/sm=1<<r3 r4,
> 0(*r5) since simple cpus won't perform that optimization since that's more
> complex to do.

blech, costs an extra register (RB=r3) but it is the same thing...
or is it? ermermerm... oh! it isn't! not quite - it's a multiply/shift
on r3. and needs a vector source. no, you can use /els, then the
immediate becomes a multiplier... let me check, i can never remember:

    if RA.isvec:
        svctx.ldstmode = indexed
    elif els == 0:
        svctx.ldstmode = unitstride
    elif immediate != 0:
        svctx.ldstmode = elementstride

and then:

    elif svctx.ldstmode == elementstride:
        # element stride mode
        srcbase = ireg[RA]
        offs = i * immed  # j*immed for a ST

and... oh hang on, if you really want r3 as an index, you can do
element-strided on RB:

    if els and !RA.isvec and !RB.isvec:
        svctx.ldstmode = elementstride
    if svctx.ldstmode == elementstride:
        EA = ireg[RA] + ireg[RB]*j  # register-strided

so the syntax for that is:

    sv.ldx/els *RT, RA, RB  # yes, just scalar on RB.
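[editor's note] the mode-selection and EA pseudocode above, made runnable as a sketch. Names (ldstmode, els, ireg) follow the comment; the fall-through case is an assumption, since the original snippet does not say what happens when /els is set but the immediate is zero.

```python
def select_ldstmode(RA_isvec, els, immediate):
    """First variant quoted above: vector RA => indexed mode, no /els
    => unit-stride, otherwise the immediate acts as element stride."""
    if RA_isvec:
        return "indexed"
    elif els == 0:
        return "unitstride"
    elif immediate != 0:
        return "elementstride"
    return "unitstride"   # assumption: case not specified in the comment

def element_strided_ea(ireg, RA, RB, j):
    """Register-strided variant: EA = ireg[RA] + ireg[RB]*j, so a
    scalar RB (e.g. r3) scales the element index j -- the basis of
    the sv.ldx/els *RT, RA, RB form."""
    return ireg[RA] + ireg[RB] * j
```

with RB holding r3 and j stepping over destination elements, each element's EA advances by ireg[RB], which is the multiply-on-r3 behaviour noted in the comment.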