Some ideas for scheduling:
Scheduler 0: All-in-one-ASAP-scheduler (current LegUp scheduler)
Summary: Right now, we have an ASAP scheduler that supports multi-cycle operations (e.g., loads), zero-cycle operations (chaining of fast operations) and a hardware constraint of one load/store/call instruction per state. In terms of cycle count, we are a bit on the high side, but this is acceptable because our fmax is very high. The fmax peaks around 100 MHz when we synthesize the memory controller using the Cyclone II's memory bits, so there is still much room to improve scheduling.
Pros: Functionality, Simplicity
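A minimal sketch of the ASAP idea with the one-memory-op-per-state constraint, in Python for brevity (the real scheduler is in the LegUp C++ codebase; all names and structures here are illustrative). Ops are assumed to arrive in dependency order; a latency of 0 models a chained, zero-cycle operation.

```python
# Hypothetical ASAP sketch, not the LegUp code itself.
# Each op carries a latency in cycles and a list of predecessor names;
# at most one memory op (load/store/call) may start per state.

def asap_schedule(ops):
    """ops: list of dicts with 'name', 'latency', 'deps', 'is_mem'.
    Returns a map from op name to the state it starts in."""
    latency = {op['name']: op['latency'] for op in ops}
    start = {}                      # op name -> start state
    mem_busy = set()                # states already holding a memory op
    for op in ops:                  # assumes topological (dependency) order
        # earliest state: after every predecessor has finished
        # (a latency-0 predecessor lets us chain in the same state)
        s = max((start[d] + latency[d] for d in op['deps']), default=0)
        if op['is_mem']:
            while s in mem_busy:    # constraint: one memory op per state
                s += 1
            mem_busy.add(s)
        start[op['name']] = s
    return start
```

Note how the two loads B and C below both become ready in state 1, but the constraint pushes C into state 2, while the latency-0 op E chains into the same state as D.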
Useful Thing 1: Personalized Dependency Graph
Summary: It would be nice to only have to calculate all data dependencies once and store them in an efficient data structure. Each node would contain an instruction and other instruction-specific data such as state, dependencies, uses and any new data the schedulers may require.
Pros: Modularity, Runtime
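One way the shared graph could look, sketched in Python (names are illustrative, not the eventual LegUp data structure): each node caches its instruction's dependencies and uses once, plus a free-form dict so individual schedulers can attach their own data (state, delay, mobility, ...).

```python
# Hypothetical personalized dependency graph node.

class DepNode:
    def __init__(self, instr):
        self.instr = instr    # the instruction itself
        self.state = None     # state assigned by the current scheduler
        self.deps = []        # nodes this instruction depends on
        self.uses = []        # nodes that consume this instruction's result
        self.extra = {}       # scheduler-specific data (delay, mobility, ...)

def build_graph(instrs, deps_of):
    """instrs: list of instruction ids; deps_of: id -> list of dep ids.
    Builds the graph once; every scheduler then reuses it."""
    nodes = {i: DepNode(i) for i in instrs}
    for i in instrs:
        for d in deps_of(i):
            nodes[i].deps.append(nodes[d])
            nodes[d].uses.append(nodes[i])  # record the reverse edge too
    return nodes
```

Because both forward (deps) and backward (uses) edges are stored, a scheduler that moves an instruction can immediately find the instructions it must reschedule.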
Scheduler 1: List scheduling
Summary: Given hardware constraints and the mobility of each operation, schedule the constrained operations so that no state exceeds its resource limits.
Requires: Personalized Dependency Graph, Simple ASAP Scheduler, Simple ALAP Scheduler, means to specify constraints independent of scheduler
Implementation: When the scheduler finds a hardware constraint that is not met, it moves the operation with the highest mobility (ALAP state - ASAP state) to the next state and reschedules the operations that depend on it.
Pros: Resource Constraint Functionality, Decrease Latency somewhat
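The deferral rule could look something like this (a Python sketch under assumed inputs, not the planned implementation): given precomputed ASAP and ALAP states for the memory ops, whenever more ops land in a state than the hardware allows, push the one with the highest mobility into the next state and try again. Rescheduling of dependent instructions is omitted here for brevity.

```python
# Hypothetical list-scheduling sketch for the one-memory-op-per-state
# constraint. asap/alap: op name -> state, as produced by the simple
# ASAP and ALAP schedulers.

def list_schedule(mem_ops, asap, alap, per_state=1):
    """Returns op name -> state, with at most per_state ops per state."""
    state = dict(asap)              # start from the ASAP schedule
    changed = True
    while changed:
        changed = False
        by_state = {}
        for op in mem_ops:
            by_state.setdefault(state[op], []).append(op)
        for s, ops in sorted(by_state.items()):
            if len(ops) > per_state:
                # defer the most mobile op (ALAP - ASAP); ties arbitrary
                victim = max(ops, key=lambda o: alap[o] - asap[o])
                state[victim] += 1
                changed = True
                break               # recompute occupancy and repeat
    return state
```

In the test below, x has mobility 0 and y has mobility 2, so y is the one deferred out of state 0.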
Scheduler 2: Packed ASAP
Summary: Using timing data gathered for each instruction, chain as many operations as possible into one state to decrease the total number of states while maintaining a similar fmax.
Requires: Personalized Dependency Graph, Timing Data
Implementation: Same as ASAP, but each instruction will also need to remember its place within the current state. Assume A has delay 40, B 30, C 20 and D 50, that a cycle has delay 100, and that each operation depends on the previous one. A would schedule in state 0 ending at delay 40, B in state 0 at delay 70, C in state 0 at delay 90. D cannot fit in state 0, so it gets scheduled into state 1, ending at delay 50. The delays will be "remembered" in the personalized dependency graph.
Pros: Decrease Latency
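The A/B/C/D walkthrough above, as a small Python sketch (illustrative only; the real version would read per-instruction delays from device timing data): each op remembers both its state and its end delay within that state, and an op that would overflow the clock period starts at delay 0 of the next state.

```python
# Hypothetical packed-ASAP chaining for a straight dependency chain.

def packed_asap(chain, period):
    """chain: list of (name, delay), each op depending on the previous.
    period: the target clock period, in the same delay units.
    Returns name -> (state, end delay within that state)."""
    sched, state, t = {}, 0, 0
    for name, delay in chain:
        if t + delay > period:      # op doesn't fit in the current state
            state, t = state + 1, 0
        t += delay
        sched[name] = (state, t)    # the "remembered" place in the state
    return sched
```

Running it on the example from the text reproduces the schedule given there: A, B, C packed into state 0 ending at delays 40, 70 and 90, and D spilling into state 1.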
Scheduler 3: Force-Directed Scheduling
Summary: A heuristic that reduces parallelism, thus decreasing area when resources are shared. This should also work on a chained schedule.
Requires: Personalized Dependency Graph, Simple ASAP scheduler, simple ALAP scheduler, Binding (to get any results)
Pros: Decreased Area
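A minimal Python sketch of the distribution graph at the heart of force-directed scheduling (a simplification for illustration: real FDS computes self-forces and successor forces, which are omitted here): each op is spread uniformly over its [ASAP, ALAP] window, states where same-resource ops pile up get a high distribution value, and scheduling favors the lightest state in an op's window.

```python
# Hypothetical distribution-graph sketch for one resource type.

def distribution_graph(windows, n_states):
    """windows: op name -> (asap, alap) for ops sharing one resource.
    Returns the expected number of such ops per state."""
    dg = [0.0] * n_states
    for lo, hi in windows.values():
        p = 1.0 / (hi - lo + 1)     # uniform probability over the window
        for s in range(lo, hi + 1):
            dg[s] += p
    return dg

def best_state(op, windows, n_states):
    """Pick the state in op's window where the distribution is lightest,
    i.e. where scheduling it adds the least concurrency."""
    dg = distribution_graph(windows, n_states)
    lo, hi = windows[op]
    return min(range(lo, hi + 1), key=lambda s: dg[s])
```

With two multipliers m1 (fixed in state 0) and m2 (mobile over states 0-1), the graph peaks at state 0, so m2 is pushed to state 1 and one multiplier can be shared.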
Scheduler 4: Force-Directed List Scheduling
Summary: A resource-constrained scheduler heuristic, similar to Force-Directed Scheduling.
Requires: Force-Directed Scheduling
And in the possible distant future:
Scheduler 5: Code Speculation, aka optimizing across Basic Blocks
Summary: Optimize across basic blocks by speculating branches.
Implementation: If there is a good amount of area to spare, go ahead and speculate branches. Not as hard as it sounds. Instead of moving instructions across basic blocks, move states across basic blocks. Reuse ASAP scheduling, but on states instead of instructions. If a basic block has predecessors, then try to ASAP schedule its states into each predecessor. Change which state the predecessor branches to if needed.
For example, BB1 has states A1, B1, C1 and T1, with T1 branching to BB3. BB2 has states A2, B2 and T2, with T2 branching to BB3. BB3 has states A3, B3 and T3, which terminates. BB3 has 2 predecessors, BB1 and BB2. If A3 and B3 only depend on A1 in BB1, then schedule A3 in parallel with B1, B3 in parallel with C1, and have T1 branch to BB3's state T3 (saving 2 states). If A3 and B3 both depend on B2 in BB2, schedule A3 in parallel with T2 and have T2 branch to BB3's state B3 (saving one state). Now that A3 is speculated into both predecessors, it is never run in BB3 and can be removed from it.
Update: To solve phi dependencies, instead of moving the phi assign statement to its basic block's last state, move it to the same state in which the phi variable is assigned. If it is assigned a constant, put it in the first state (and make sure speculation pushes it up at most to the second state). Also, since we're moving entire states, I think that in the state machine an extra case label could cover the entire state, effectively adding just one OR per code speculation instead of duplicating the entire state's area. For example: case stateA: case stateB: /* do operation */.
Requires: Personalized Data Dependency Graph, Resource Usage Estimation
Pros: Can decrease latency by trading off area
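A toy Python model of the BB1/BB3 example (hypothetical throughout: states are just names, and a `dep_index` function says which predecessor state a successor state depends on, with -1 for none). A successor state can run one slot after its dependency, in parallel with whatever the predecessor runs then, and the branch is retargeted to the first successor state that could not be hoisted.

```python
# Hypothetical state-level speculation sketch; real code would work on
# the personalized dependency graph and check resource usage first.

def speculate(pred_states, succ_states, dep_index):
    """pred_states/succ_states: ordered lists of state names.
    dep_index: succ state -> index in pred_states it depends on (-1 if none).
    Returns (pairs run in parallel, new branch target, states saved)."""
    pairs, slot = [], 0
    for s in succ_states[:-1]:      # the last successor state keeps its branch
        slot = max(slot, dep_index(s) + 1)
        if slot >= len(pred_states):
            break                   # nothing left in the predecessor to pair with
        pairs.append((pred_states[slot], s))
        slot += 1                   # hoisted states stay sequential
    saved = len(pairs)
    return pairs, succ_states[saved], saved
```

On the example from the text: from BB1 (A3 and B3 depend only on A1), A3 pairs with B1 and B3 with C1, and T1 branches straight to T3, saving 2 states; from BB2 (both depend on B2), only A3 fits alongside T2, which then branches to B3, saving 1 state.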
Scheduler 6: Pipelining, aka optimizing across Backward Branching Basic Blocks
Summary: Pipeline basic block(s) if there is some area to spare.
Implementation: Find how many states it takes to calculate the value needed by the phi instruction (the loop index). This gives an upper limit on how many pipeline stages can be used (number of states / states to calculate the loop index). Then do something equivalent to splitting the basic block into n basic blocks (similar to loop unrolling) and running code speculation on those blocks. We may have to introduce some new registers for the pipeline stages, but otherwise resources can be shared.
Requires: Personalized Data Dependency Graph, Resource Usage Estimation, Code Speculation
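The stage-count bound and its payoff, as standard pipelining arithmetic (a sketch with hypothetical numbers, under the assumption in the text that a new iteration can start as soon as the next loop index is ready):

```python
# Hypothetical pipelining arithmetic; states per iteration and states to
# compute the loop index would come from the schedule and the graph.

def max_pipeline_stages(total_states, index_states):
    """Upper bound on pipeline depth: the index computation sets the
    initiation interval, so at most this many iterations overlap."""
    return total_states // index_states

def pipelined_cycles(total_states, index_states, iterations):
    """First iteration takes the full block; each later iteration starts
    index_states cycles after the previous one."""
    return total_states + (iterations - 1) * index_states
```

For instance, a loop body of 8 states whose index takes 2 states to compute supports at most 4 stages; over 10 iterations it would run in 26 cycles instead of the sequential 80.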