multi-cycle_enhancements

Introduction

This page lists optimizations which could give additional performance with multi-cycling. In many cases multi-cycling already produces circuits much faster than without it (e.g. 30% improvements in execution time); results could be this good for all circuits with a few extra changes that I wasn't able to finish. They are listed in order of importance.

See Profiling-Driven Multi-Cycling for an introduction.

-through Constraints

It is often the case that a register begins a number of multi-cycle paths, and these paths end at the FSM (cur_state register) or at a RAM (memory controller, usually a store). Even though these paths end at different instructions in the circuit (different store instructions for the memory controller, different icmp/branch instructions for the FSM), they all finish at the same register: the memory controller or the FSM. So we often have multiple paths starting from the same point and ending at the same point (and usually that end point is one of these two common cases).

The problem is that our multi-cycle constraints are of the form:

set_multicycle_path -from [get_registers {*KeySchedule__preheader8_indvar41_reg*}] -to [get_registers {*cur_state*}] -setup -end 4
set_multicycle_path -from [get_registers {*KeySchedule__preheader8_indvar41_reg*}] -to [get_registers {*cur_state*}] -hold -end 3

I.e., each constraint specifies a source, a destination, and a slack. So if we have two paths with the same src/dst, one with multi-cycle latency 2 and the other with latency 20, then for the constraint to apply safely to both paths we must pick the minimum of the two, and almost certainly the 20-cycle path (now forced to finish in 2 cycles) becomes critical.

My fix applies -through constraints, and is turned on in legup.tcl using: set_parameter MULTI_CYCLE_ADD_THROUGH_CONSTRAINTS 1. However, the algorithm isn't very smart and doesn't always find a good way to add the -through constraints. This is the biggest problem with multi-cycling, and if we could print the -through constraints properly (needs some more work) most circuits would get much faster.
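
For illustration, -through lets the two paths keep separate slacks by naming an intermediate net that each path passes through. A hypothetical sketch (the -through net names are made up; the real names depend on the Verilog LegUp generates):

set_multicycle_path -from [get_registers {*KeySchedule__preheader8_indvar41_reg*}] -through [get_nets {*icmp_a*}] -to [get_registers {*cur_state*}] -setup -end 2
set_multicycle_path -from [get_registers {*KeySchedule__preheader8_indvar41_reg*}] -through [get_nets {*icmp_a*}] -to [get_registers {*cur_state*}] -hold -end 1
set_multicycle_path -from [get_registers {*KeySchedule__preheader8_indvar41_reg*}] -through [get_nets {*icmp_b*}] -to [get_registers {*cur_state*}] -setup -end 20
set_multicycle_path -from [get_registers {*KeySchedule__preheader8_indvar41_reg*}] -through [get_nets {*icmp_b*}] -to [get_registers {*cur_state*}] -hold -end 19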

Another solution is to change the way we do the de-pipelining to make these kinds of unbalanced paths impossible, e.g. add some registers back so that it's not the same src/dest register for these paths.

Latency Investigation

Because the multi-cycling algorithm finds lots of cycle slack (e.g. across basic blocks), we don't need timing constraints in the scheduler for every basic block. E.g. if timing constraints would give a basic block a latency of 4 cycles, but the paths in that BB would already have slack 7 from the multi-cycling slack computation, the timing constraints can be omitted for that basic block. Its latency might then drop from 4 to 2, and the multi-cycle slack from 7 to 5, but that might still be more than enough.

A smart way to do it would be to find the multi-cycle slack (done now in GenerateRTL), then redo scheduling without timing constraints for the BBs that have enough slack. Or, if software profiling is enabled, we know we'll add slack for infrequent BBs, so don't apply timing constraints to those. A sketch of the per-BB decision is below.
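
As a rough sketch of that per-BB decision in TCL (the proc is hypothetical; LegUp has no such helper, and the slack values would come from GenerateRTL):

# Hypothetical sketch: decide per basic block whether the scheduler
# should apply timing constraints, given the multi-cycle slack already
# computed for the paths through that block.
proc needs_timing_constraints {bb_slack min_required_slack} {
    # skip timing constraints when multi-cycling already guarantees
    # at least min_required_slack cycles on every path in the block
    return [expr {$bb_slack < $min_required_slack}]
}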

I didn't have time to implement this, so instead I just turned timing constraints off for every BB. Of course this hurt Fmax for some circuits even though it improved cycle counts, but the hope was that multi-cycling would win back some of the Fmax. And while some circuits did slow down, others stayed almost the same in Fmax (likely because their critical path is not in the datapath, so the timing constraints weren't helping). The circuits whose Fmax stayed flat or dropped only slightly ended up with reduced execution time because their cycle counts dropped.

FYI, the cycle drops can be seen here:

The cycle counts from the EUC paper:

http://www.legup.org:9100/builders/linux_x86_stratix4/builds/968/steps/shell_11/logs/stdio

The cycle counts with all timing constraints off and with multi-cycling + SW profiling (SW profiling raises cycle counts, but they are still way down; note that most circuits get slower because I just turned off all constraints, but some sped up):

http://www.legup.org:9100/builders/linux_x86_stratix4/builds/1018/steps/shell_11/logs/stdio

Load Registers

From Profiling-Driven Multi-Cycling:

MULTI_CYCLE_DUPLICATE_LOAD_REG will force each load from memory (local and global) to have a unique load register, so that it can hold the loaded value and feed multi-cycle paths. While this is not necessary for multi-cycling, not setting this will reduce the opportunities for multi-cycling and I have not recently tested without it.

Right now, without this option we won't have as many multi-cycle paths (and it might not even work). But we don't really need to pull out a load register for every load, just for loads which are scheduled close to other loads. If a load has no other load scheduled right after it, e.g. the next load is 3 cycles away, then this load can still feed multi-cycle paths of latency up to 3 or maybe 4 (since loads take 2 cycles to complete). This would save a lot of registers. A sketch of the per-load decision is below. It's also worth simply trying with MULTI_CYCLE_DUPLICATE_LOAD_REG turned off, since that might fix the problem without making things worse.
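
A minimal TCL sketch of that per-load check (the proc is hypothetical; issue cycles would come from the scheduler), assuming the shared read-data register holds a load's value only until the next load's data arrives:

# Hypothetical sketch: a load needs its own register only if another load
# is scheduled so soon after it that the shared read-data register would
# be overwritten before the longest multi-cycle path it feeds finishes.
proc needs_load_reg {load_cycle next_load_cycle needed_latency} {
    # no later load: the value persists and can feed any path
    if {$next_load_cycle < 0} { return 0 }
    # data arrives 2 cycles after issue and is live until the next
    # load's data arrives, i.e. for (next_load_cycle - load_cycle) cycles
    return [expr {($next_load_cycle - $load_cycle) < $needed_latency}]
}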

Also, as long as MULTI_CYCLE_DUPLICATE_LOAD_REG is on, local RAMs must have a latency of 2. That boosts Fmax compared to latency 1, but of course hurts the cycle count.

SW Profiling Enhancements

The SW profiling only pushes multi-cycle destinations within a BB (stores, instructions used across BBs, function calls), but not PHIs or loads. But loads can also be destinations of multi-cycle paths on their address port, so maybe we should push those too. However, any "downstream" instruction of such a load then needs to be pushed twice as much.

For example, if we load a value, do some computation, then store it: if we push the load (to dilate the schedule of whatever path terminates at the load), we need to push the store twice as much in order to also dilate the path from the load to the store. One good idea Jason had was to detect when we push a load and immediately push everything scheduled after the load by 1 as well, i.e. do an incremental re-scheduling where pushing the load pushes everything after it, and when it is those instructions' turn to be pushed, they are pushed an additional cycle, and so on. A sketch is below.
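
A minimal TCL sketch of that incremental push (the proc is hypothetical; sched maps each instruction to its scheduled cycle):

# Hypothetical sketch: push a load by 1 and immediately push everything
# scheduled at or after it, so a downstream store ends up pushed twice
# overall (once here, once again when the store itself is pushed).
proc push_load {load sched} {
    set c [dict get $sched $load]
    dict for {inst cycle} $sched {
        if {$cycle >= $c} {
            dict set sched $inst [expr {$cycle + 1}]
        }
    }
    return $sched
}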

We might also benefit from “pushing” PHIs. If a PHI is in an infrequent BB, then pushing it by 1 (or making its BB “start” 1 cycle later) would dilate the schedule of every incoming multi-cycle path.

Other

  • Multi-Cycling Dividers: The option in legup.tcl set_parameter MULTI_CYCLE_REMOVE_REG_DIVIDERS 1 was originally added to multi-cycle dividers, but this was never completed. Since in many cases dividers are nowhere near fully pipelined (e.g. maybe 1 or 2 divisions in flight at once), we can multi-cycle the dividers and save many registers; see the sketch after this list.
  • "MULTICYCLE_TODO" areas: I got rid of most of these, but there are still some comments which I left as TODOs.
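
For the divider idea, the constraints might look like the following (a hedged sketch: the net name and cycle counts are hypothetical and would depend on the divider's actual latency):

set_multicycle_path -through [get_nets {*main_div*}] -setup -end 8
set_multicycle_path -through [get_nets {*main_div*}] -hold -end 7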