| Summary: | DataPointer concept: long-immediate references | | |
|---|---|---|---|
| Product: | Libre-SOC's first SoC | Reporter: | Luke Kenneth Casson Leighton <lkcl> |
| Component: | Specification | Assignee: | Luke Kenneth Casson Leighton <lkcl> |
| Status: | CONFIRMED | | |
| Severity: | enhancement | CC: | cand, libre-soc-isa, programmerjake |
| Priority: | --- | | |
| Version: | unspecified | | |
| Hardware: | PC | | |
| OS: | Linux | | |
| See Also: | https://bugs.libre-soc.org/show_bug.cgi?id=213, https://bugs.libre-soc.org/show_bug.cgi?id=894 | | |
| NLnet milestone: | NLNet.2019.10.032.Formal | total budget (EUR) for completion of task and all subtasks: | 0 |
| budget (EUR) for this task, excluding subtasks' budget: | 0 | parent task for budget allocation: | |
| child tasks for budget allocation: | | table of payments (in EUR) for this task, TOML format: | |
| Bug Depends on: | 894 | | |
| Bug Blocks: | | | |
Description: Luke Kenneth Casson Leighton, 2020-04-14 18:02:03 BST
Comment #1 (cand):

A banked data scheme, like on the 65816, would be a slightly simpler
version. Instead of incrementing a pointer, the data bank register
contains the high bits, and the instruction contains the low bits. Such
a setup also lets you access larger-than-11-bit immediates, and is
basically MIPS's small data section made customizable. It would still
be relocatable, just in 11-bit alignments. The linker would adjust the
set-data-bank-register instructions on program load, for
position-independent code.

Comment #2 (Luke Kenneth Casson Leighton):

(In reply to cand from comment #1)
> A banked data scheme, like on the 65816, would be a slightly simpler
> version. Instead of incrementing a pointer, the data bank register contains
> the high bits, and the instruction contains the low bits.

that's the 2nd version i came up with

> Such a setup also
> lets you access larger-than-11-bit immediates,

however you still need bits (or an instruction) specifying the
immediate width, somehow. which is why i borrowed a couple of bits
from the immediate to do that.

> and is basically MIPS's small
> data section made customizable.

interesting. worth investigating.

> It would still be relocatable, just in 11-bit alignments. The linker would
> adjust the set-data-bank-register instructions on program load, for
> position-independent code.

nice. my feeling is we should take this seriously because large
immediate data loads are such a massive part of programs, yet they are
highly space-inefficient.

Comment #3 (cand):

Yes, the number of instructions POWER needs to load a 64-bit immediate
is not very efficient.

Comment #4 (Luke Kenneth Casson Leighton):

(In reply to cand from comment #3)
> Yes, the number of instructions POWER needs to load a 64-bit immediate is
> not very efficient.

mitch alsup mentioned that those comprise something like 6% of all
instructions (!). i'm blown away that this has been a missed
opportunity to significantly reduce code size and i-cache usage.
particularly as a reduction in i-cache size obeys a square-law
reduction in power consumption. therefore if we can get it down 5%
that's a 10.25% reduction in power (!)

Comment #5 (Jacob Lifshay):

(In reply to Luke Kenneth Casson Leighton from comment #4)
> particularly as a reduction in i-cache
> size obeys a square-law reduction in power consumption.

I would expect the cache power usage to be linearly proportional to its
size in bits once it's big enough to split into many smaller sram
chunks, since each sram chunk would use about the same power and you'd
only need a linear number of them. I have not been able to find any
useful references for that though.

> therefore if we can
> get it down 5% that's a 10.25% reduction in power (!)

yes -- except that cache sizes are almost always a power of 2 in size
(or rarely 3 * 2^n), so the only effect is probably to reduce the
access rate.

Comment #6 (Luke Kenneth Casson Leighton):

(In reply to Jacob Lifshay from comment #5)
> (In reply to Luke Kenneth Casson Leighton from comment #4)
> > particularly as a reduction in i-cache
> > size obeys a square-law reduction in power consumption.
>
> I would expect the cache power usage to be linearly proportional to its
> size in bits once it's big enough to split into many smaller sram chunks,
> since each sram chunk would use about the same power and you'd only need a
> linear number of them.

whilst the cache power usage itself is linear (that being one dimension
of the square effect), the increased size means that on loops you
cannot fit as much code of the average loop *into* that cache, meaning
that you end up with more L2 cache lookups. this i believe is where the
2nd of the two dimensions of the square effect comes in. it's quite...
odd :)

Comment #7 (Jacob Lifshay):

(In reply to Luke Kenneth Casson Leighton from comment #6)
> (In reply to Jacob Lifshay from comment #5)
> > (In reply to Luke Kenneth Casson Leighton from comment #4)
> > > particularly as a reduction in i-cache
> > > size obeys a square-law reduction in power consumption.
> >
> > I would expect the cache power usage to be linearly proportional to its
> > size in bits once it's big enough to split into many smaller sram chunks,
> > since each sram chunk would use about the same power and you'd only need a
> > linear number of them.
>
> whilst the cache power usage itself is linear (that being one dimension
> of the square effect), the increased size means that on loops you cannot
> fit as much code of the average loop *into* that cache, meaning that you
> end up with more L2 cache lookups. this i believe is where the 2nd of
> the two dimensions of the square effect comes in.

I think you might be incorrect: increasing the size of the L1 cache
should increase the hit rate, which decreases the rate of accessing L2
and memory. Having a bigger cache allows fitting more code into the
cache, reducing the miss rate. So, increasing the cache size would have
a slightly less than linear increase in power consumption due to
decreasing L2 accesses.

https://reverseengineering.stackexchange.com/questions/21944/powerpc-toc-and-sda

DataPointer turns out to be an extension of the concept of the "TOC"
register (r2).
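The 64-bit-immediate cost discussed in comments #3 and #4 can be sketched concretely. The snippet below splits a 64-bit constant into the classic five-instruction `lis`/`ori`/`rldicr`/`oris`/`ori` sequence used on 64-bit Power; it is an illustrative sketch only (the register name `r4` is arbitrary, and sign-extension subtleties of `lis` for constants with bit 47 set are ignored), showing why a single 8-byte constant costs 20 bytes of instruction stream:

```python
def materialize_64bit(value):
    """Split a 64-bit constant into the classic 5-instruction
    Power sequence: lis/ori build the upper 32 bits, rldicr
    shifts them into place, oris/ori fill the lower 32 bits.
    (Sketch only: ignores lis sign-extension corner cases.)"""
    v = value & 0xFFFF_FFFF_FFFF_FFFF
    highest = (v >> 48) & 0xFFFF   # bits 48..63
    higher  = (v >> 32) & 0xFFFF   # bits 32..47
    high    = (v >> 16) & 0xFFFF   # bits 16..31
    low     =  v        & 0xFFFF   # bits  0..15
    return [
        f"lis    r4, {highest:#x}",
        f"ori    r4, r4, {higher:#x}",
        "rldicr r4, r4, 32, 31",    # shift the upper pair left by 32
        f"oris   r4, r4, {high:#x}",
        f"ori    r4, r4, {low:#x}",
    ]

seq = materialize_64bit(0x1234_5678_9ABC_DEF0)
# 5 instructions * 4 bytes each = 20 bytes to load one 8-byte constant
assert len(seq) == 5
```

That 20-bytes-per-constant overhead is exactly what a banked/DataPointer scheme, where a bank register supplies the high bits once and each instruction carries only a short low-bits immediate, is trying to amortize.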
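As a quick arithmetic check on the square-law discussion in comments #4 through #7: assuming, as comment #4 does, that i-cache power scales with the square of cache size, a 5% size change produces roughly a 10% power change. The 10.25% figure quoted there corresponds to the "5% larger" direction; a 5% *reduction* under the same model saves 9.75%:

```python
# Arithmetic behind the square-law power claim (comments #4-#7),
# under the assumption that power is proportional to size squared.
size_down = 0.95                    # cache shrunk by 5%
size_up   = 1.05                    # cache grown by 5%

power_saved = 1 - size_down ** 2    # 1 - 0.9025 = 0.0975 ->  9.75% less
power_added = size_up ** 2 - 1      # 1.1025 - 1 = 0.1025 -> 10.25% more

assert abs(power_saved - 0.0975) < 1e-12
assert abs(power_added - 0.1025) < 1e-12
```

Whether the square law itself holds is exactly what comments #5 and #7 dispute; this only checks the arithmetic that follows from assuming it.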