| Summary: | evaluate minerva for base in libre-soc | | |
|---|---|---|---|
| Product: | Libre-SOC's first SoC | Reporter: | Luke Kenneth Casson Leighton <lkcl> |
| Component: | Source Code | Assignee: | Luke Kenneth Casson Leighton <lkcl> |
| Status: | RESOLVED FIXED | | |
| Severity: | enhancement | CC: | libre-soc-bugs, programmerjake |
| Priority: | --- | | |
| Version: | unspecified | | |
| Hardware: | PC | | |
| OS: | Linux | | |
| NLnet milestone: | --- | total budget (EUR) for completion of task and all subtasks: | 0 |
| budget (EUR) for this task, excluding subtasks' budget: | 0 | parent task for budget allocation: | |
| child tasks for budget allocation: | | The table of payments (in EUR) for this task; TOML format: | |
Description — Luke Kenneth Casson Leighton — 2020-03-11 14:31:19 GMT

**Comment #1 — Jacob Lifshay**

will need adjusting to make the datapath between the core and L1 wider -- 64-bit at the very least, 128-bit or wider preferred.

**Comment #2 — Luke Kenneth Casson Leighton**

(In reply to Jacob Lifshay from comment #1)
> will need adjusting to make the datapath between the core and L1 wider --
> 64-bit at the very least, 128-bit or wider preferred.

yes. four LD/STs @ 32-bit is the minimum viable data width to the L1 cache, realistically. preferably four LD/STs @ 64-bit. this is a monster we're designing!

address widths also need to be updated: i'm going to suggest parameterising them, because we might not have time to do an MMU (compliant with the POWER ISA). just have to see how it goes.

**Comment #3 — Jacob Lifshay**

do note that compressed texture decoding needs to be able to load 128-bit wide values (a single compressed texture block), so our scheduling circuitry should be designed to support that. They should always be aligned, so we won't need to worry about that in the realignment network.

**Comment #4 — Luke Kenneth Casson Leighton**

(In reply to Jacob Lifshay from comment #3)
> do note that compressed texture decoding needs to be able to load 128-bit
> wide values (a single compressed texture block),

okaaay.

> so our scheduling circuitry
> should be designed to support that. They should always be aligned, so we
> won't need to worry about that in the realignment network.

whew. so that's 128-bit-wide for _textures_... that's on the *load* side. are there any simultaneous (overlapping) "store" requirements? are the code-loops tight enough to require simultaneous 128-bit LD *and* 128-bit ST?

**Comment #5 — Jacob Lifshay**

(In reply to Luke Kenneth Casson Leighton from comment #4)
> so that's 128-bit-wide for _textures_...
> that's on the *load* side. are there any simultaneous (overlapping) "store"
> requirements? are the code-loops tight enough to require simultaneous
> 128-bit LD *and* 128-bit ST?

yes and no -- there is code that will benefit from simultaneous loads and stores (memcpy, and probably most other code that has both loads and stores in a loop), however it isn't strictly necessary.

It will be highly beneficial to support multiple simultaneous 8, 16, 32, or 64-bit loads to a single cache line, all able to complete simultaneously regardless of their alignment within that cache line. The same goes for misaligned loads that cross cache lines (and possibly page boundaries), though those don't need to complete in a single cache access. All of the above also applies to stores, though stores can be a little slower since they are less common. I realize that this will require a really big realignment network, however I think the performance advantages are worth it.

For a scheduling algorithm for loads that are ready to run (the 6600-style scheduler has sent them to the load/store unit for execution: no conflicting stores in front, no memory fences in front), we can keep a queue of memory ops. Each cycle we pick the load at the head of the queue, then search from head to tail for additional loads that target the same cache line, stopping at the first memory fence, conflicting store, etc. Once those loads are selected, they are removed from the queue (probably by marking them as removed) and sent through the execution pipeline. We can use a similar algorithm for stores.

To find the required loads, we can use a network based on recursively summarizing chunks of the queue entries' per-cycle ready state, then reversing direction from the summary back to the queue entries to tell each entry which execution port, if any, it will be running on this cycle. There is then a mux for each execution port in the load pipeline to move the required info from the queue to the pipeline.
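A behavioural sketch (in Python, purely illustrative -- not the actual Libre-SOC HDL, and all names here are made up for the sketch) of the queue-scan selection described above, assuming a 64-byte cache line and treating "conflicting store" as any older store to the same cache line:

```python
# Illustrative model of the per-cycle load selection: take the load at
# the head of the queue, gather further loads to the same cache line,
# and stop the scan at the first fence or conflicting store.

CACHE_LINE = 64  # bytes; assumption for this sketch

class MemOp:
    def __init__(self, kind, addr, width):
        self.kind = kind       # "load", "store" or "fence"
        self.addr = addr
        self.width = width     # bytes
        self.removed = False   # mark-as-removed instead of physical dequeue

def line_of(addr):
    return addr // CACHE_LINE

def pick_loads(queue, num_ports):
    """Select up to num_ports loads hitting one cache line this cycle."""
    live = [op for op in queue if not op.removed]
    if not live or live[0].kind != "load":
        return []
    target = line_of(live[0].addr)
    picked = []
    for op in live:
        if op.kind == "fence":
            break              # a fence blocks everything behind it
        if op.kind == "store" and line_of(op.addr) == target:
            break              # conflicting store: stop the head-to-tail scan
        if op.kind == "load" and line_of(op.addr) == target:
            picked.append(op)
            if len(picked) == num_ports:
                break
    for op in picked:
        op.removed = True      # "removed" marker, as described in the comment
    return picked
```

For example, with a queue of load@0x1000, load@0x1010, store@0x1020, load@0x1030 (all in the same line), the scan picks the first two loads and stops at the store; the third load must wait until the store drains.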
The network design is based on the carry-lookahead network of a carry-lookahead adder, which gives O(N*log(N)) space and O(log(N)) gate latency.

Loads/stores that cross a cache-line boundary can be split into 2 load/store ops when sent to the queue, and the loads reunited when both halves complete. Line-crossing ops should be relatively rare, so we can probably support reuniting only 1 op per cycle.

RMW atomic ops and fences can be put in both the load and store queues, where they are executed once they reach the head of both queues.
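The split-and-reunite step for line-crossing ops can be sketched as follows (again illustrative Python, not the HDL; `split_crossing` and `reunite` are hypothetical names, and a 64-byte cache line is assumed):

```python
# Illustrative model of splitting a load/store that crosses a
# cache-line boundary into two ops at queue-insertion time, and
# reuniting the two load halves once both complete.

CACHE_LINE = 64  # bytes; assumption for this sketch

def split_crossing(addr, width):
    """Return one (addr, width) op, or two ops if it crosses a line."""
    end = addr + width
    next_line = (addr // CACHE_LINE + 1) * CACHE_LINE
    if end <= next_line:
        return [(addr, width)]                    # fits within one line
    lo = next_line - addr
    return [(addr, lo), (next_line, width - lo)]  # low half, high half

def reunite(lo_bytes, hi_bytes):
    """Concatenate the two completed halves back into one value
    (little-endian byte order assumed for this sketch)."""
    return lo_bytes + hi_bytes
```

Since only one reunite port is assumed per cycle, two line-crossing loads completing in the same cycle would arbitrate for it; that matches the "relatively rare, so 1 op per cycle" argument above.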