meeting_minutes

June 3, 2011

June 3 Meeting minutes

[BB] = put on the back burner

Topics:

  • Throughput driven scheduling.
    • Optimize latency*period
  • Profiling-driven scheduling.
  • Loop pipelining/clever function inlining
  • SDC Implementation
  • User constrained scheduling (pragmas)
  • [BB] Partitioning program to software vs hardware
  • Binding
    • sharing functional units good only for large, possibly for smaller units chained together?
    • 4 LUT vs 6 LUT
    • collapse mux into functional unit
  • [BB] Other HW architectures
  • HLS ↔ memory architecture interactions
    • pipelining
    • number of memory ports vs latencies
  • [BB] auto parallelising function calls
  • LLVM compiler passes
  • Clang
    • investigate pragmas
  • Speculative scheduling

Outcomes:

Andrew:

  • Loop pipelining
  • Loop unrolling
  • port clang
  • HLS ↔ Memory architecture interactions.

Jason:

  • SDC implementation

Stefan:

  • Binding utility study

Kevin:

  • Memory profiling

Leave LLVM passes till later

Post GUI/debugging framework as ECE design project (2 man team)

June 1, 2011

Research areas:

  1. µP/Accelerator Interface
  2. Parallel µP/Accelerators

1. µP/Accelerator Interface

DE4 Port

  • Contact Steve to set up meeting

Ways of talking to RAM (multiport and multipump)

Multiport:

  • Currently, if two accelerators access the single-port cache they block each other, so 2 ports should give an improvement. This was not seen because off-chip memory is slow, hence the switch to DE4, which uses DDR2
  • Extreme cases are 1 port (Avalon arbitration) vs. one port per accelerator. In between, could have e.g. 2 accelerators / port, saving area since # RAMs = # input x # output ports
  • We could offer priority to accelerators which are computation bound → requires memory access profiling
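
The area trade-off between a single shared port and one port per accelerator can be sketched roughly as below. This is only an illustration of the "# RAMs = # input x # output ports" rule from the notes; the function names are hypothetical, and it assumes every port counts as both an input and an output port.

```python
import math

def ports_needed(num_accelerators: int, accelerators_per_port: int) -> int:
    """Ports required when several accelerators share each port (hypothetical helper)."""
    return math.ceil(num_accelerators / accelerators_per_port)

def ram_copies(num_input_ports: int, num_output_ports: int) -> int:
    """RAM count under the rule from the notes: # RAMs = # input x # output ports."""
    return num_input_ports * num_output_ports

def rams_for_sharing(num_accelerators: int, accelerators_per_port: int) -> int:
    # Assumption: each port is used for both reads and writes.
    ports = ports_needed(num_accelerators, accelerators_per_port)
    return ram_copies(ports, ports)
```

Under this model, 4 accelerators with one port each need 4 ports and 16 RAMs, while pairing 2 accelerators per port needs 2 ports and 4 RAMs.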

Multipump / Multi-Clock Domain:

  • Less area, but the system must be clocked slower. Only practical to clock the system 2x (maybe 4x) slower

Parallelization Schemes

  • James has implemented polling

Partition Program

  • Allow the programmer to partition the program data

Partition Data

  • DE4 has multiple banks, and so each accelerator can have data in a different bank

Pre-Fetching

  • One FSM can fetch data and the other can perform computation

Multiple Caches / customization of cache parameters

  • E.g. Intel has 2 L1 caches instead of a multi-port large L1 cache
  • When are dual L1 caches beneficial over a large L1 cache?

Cache Size

  • Easy to change on DE4

Memory Access Profiling

  • LLVM pass
  • Analyze Parallelism

Priorities

  • DE4 Port
  • Combining multi-port and multi-pump with DE4. If this is not successful, examine multiple memories / cache size.

Projects for Stefan and Kevin

  • Cache Simulator for memory access profiling: simulate CHStone designs in modelsim and save memory accesses (address being accessed and where it is placed in the cache) in a text file, then build cache simulator
  • Benchmarks: analyze the rest of the CHStone benchmarks and also e.g. tiled matrix multiply
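
A minimal sketch of the trace-driven cache simulator described above, assuming a direct-mapped cache and a text file with one hexadecimal byte address per line. The trace format, the default cache geometry, and the function names are assumptions for illustration, not the format actually saved from ModelSim.

```python
def parse_trace(lines):
    """Parse one hex address per line (e.g. '0x00800010'); blank lines are skipped."""
    return [int(line.strip(), 16) for line in lines if line.strip()]

def simulate_direct_mapped(addresses, num_lines=512, line_bytes=16):
    """Replay byte addresses through a direct-mapped cache; return (hits, misses)."""
    tags = [None] * num_lines          # tag currently stored in each cache line
    hits = misses = 0
    for addr in addresses:
        block = addr // line_bytes     # memory block containing this byte
        index = block % num_lines      # cache line the block maps to
        tag = block // num_lines       # distinguishes blocks sharing a line
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag          # fill the line on a miss
    return hits, misses
```

Re-running the same trace with different `num_lines` and `line_bytes` values gives a quick way to compare cache sizes before changing the hardware.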

August 10, 2010

Andrew

  • Looked at jpeg, some output makes no sense
  • Keep jpeg for now, email CHStone about jpeg and a golden result
  • Start writing paper after returning from UBC

Victor

  • Present at BA 1170 10:30 tomorrow

Ahmed

  • Present at BA 1180 2:30 tomorrow

Mark

  • Dynamic power is consistent around 12 mW for VanProf (through half of CHStone)
  • High confidence results
  • Power improvement likely better than area improvement

James

  • Long discussion about second data points for paper
  • Move 3rd most compute-intensive function to hardware
    • Alternative, most compute-intensive function with no memory access
  • Move a different function if the 3rd is very similar to 2nd function
  • Try crossing clock domains on ModelSim

Jason

  • Added PC speculation for branching, moved some ID into IF, etc…
  • Removed exception handling mux to save one logic level
  • Basically performing manual retiming on critical path

August 6, 2010

James

  • blowfish is working on the hybrid system, currently working on jpeg
  • Look into crossing clock domains for a hybrid flow
    • Maybe ask Paul Chow for advice
    • Measure execution time even if only functional in simulation for comparison
  • Try more than one accelerator at once
  • Some benchmarks jump into the accelerator multiple times and this is working
  • Want 2nd data point for September paper
    • More important than automating the hybrid flow

Victor + Ahmed

  • Give dry run UnERD presentation at SF2104, Monday at noon

August 5, 2010

Ahmed

  • Pipelined division: easy to change number of pipeline stages
  • Pipelined multiplier is more difficult
    • Stall logic doesn't think multiplication requires a stall
  • Division is functional, working on multiplication

Andrew

  • Pipelined divider works
  • Some questions after presenting:
  • How do we measure parallelism?
    • Maybe count number of instructions scheduled into a state and take the average
  • How much of our performance is due to LLVM?
    • Would running different LLVM compiler passes help increase or decrease our results?
  • How do we know if/when the final output from LegUp is good enough for practical use?
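
The parallelism metric suggested above (instructions scheduled per state, averaged) could be computed along these lines. The dictionary layout of the schedule is a hypothetical representation chosen for the example.

```python
def average_ops_per_state(schedule):
    """schedule maps each FSM state to the list of instructions scheduled in it.

    Returns the mean number of instructions per state (0.0 for an empty schedule).
    """
    if not schedule:
        return 0.0
    total_ops = sum(len(ops) for ops in schedule.values())
    return total_ops / len(schedule)
```

For example, a schedule with three instructions in one state and one in another averages 2.0 instructions per state.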

August 4, 2010

Mark

  • Preliminary power results are looking good, VanProf beats SnoopP by a few times
  • Output is not correct though from either profiler

James

  • Cache fix works on motion, trying out aes
  • BuildBot nightly run: 23/27 tests work on navy.eecg, 17/27 work on Andrew's computer
    • jpeg, aes, shift and div_const fail on navy

Victor + Ahmed

  • Made UnERD podium presentation on Wednesday, August 11
  • Dry run Monday meeting

Victor

  • Presented Force-Directed Scheduling (FDS) and List Scheduling on the whiteboard
  • Go over the math of FDS on Friday

Jason

  • Retiming analysis changed the critical path, which isn't present in the HDL
  • Without register retiming, FMax ~ 55MHz
  • Without register retiming, the critical path is (chained):
    • 32-1 mux for register decoding (2 in parallel)
    • 32-bit register value compare
    • Addition of PC to the branch offset
  • Possible fixes:
    • Use M4K to store registers instead of using registers
    • Speculate new PC to allow add and compare to go in parallel

August 3, 2010

Mark

  • Hierarchy profiling works
  • Working on measuring the power of VanProf
    • Trying to do this with tcl script instead of readmemh
  • Read through other profiler paper from DAC 2009
    • Monitors loops (short backward branching) across functions and context switching (multi-threading)
    • Area: ~10% of ARM processor, hard to compare with FPGA area
    • Only monitors loops, not functions
    • Can filter out user-specified functions
    • Uses its own cache to store data, about 98% accuracy
    • Monitors PC, similar to VanProf
    • No power data
    • Loops could be a future target for VanProf

Ahmed

  • Tried cache enable fix on board
  • Does not work when reused on accelerators
  • May want to remove ModelSim warnings for Tiger (but tuning Tiger is higher priority)

Jason

  • Putting more of the processor together, such as the instruction cache, lowers Fmax from ~93 MHz to ~55 MHz
  • ALU (mult and div) fix seems easy compared to instruction cache + decode loop

James

  • Pushed new processor on git
  • adpcm benchmark now works
  • Working on jpeg benchmark
  • Give short summary on progress on Thursday

Andrew

  • Figure out what operations are worth sharing
  • Leave multipliers and dividers for now
    • Try to fit benchmarks on DE2 quickly

Victor

  • Give a white board talk on Force-Directed Scheduling tomorrow

July 27, 2010

Mark

  • try to get power results for vanProf vs. SnoopP

Ahmed

  • debugging cacheClkEn with SignalTap
  • had to install XP virtual machine for Mips Communication Server
  • divider does have max number of pipeline stages in GUI

Victor

  • will look into software division on chstone, see the instruction count difference

Andrew

  • looking at build bot performance graphs (like Chrome has)
  • will want to script HW/SW results overnight when all parts are ready
  • will investigate accelerating different functions for sha

James

  • sha/mips/gsm work with accelerator/processor system
  • motion gives wrong result b/c of cacheClkEn

Jason

  • we need to be able to distinguish between accelerating functions with descendants and leaf functions
  • if number of accelerated functions > number of functions in benchmark, just use the all-HW value
  • normalize x-axis with % exec time, not number of functions (i.e. 30-40%, 40-50%, etc.)

July 26, 2010

Ahmed

  • Posted estimation results to wiki
  • working on optimizing tiger this week

Li

  • divider runs @ 97 MHz
    • pipeline stages ≤ number of bits
    • Hopefully Tom can direct us to better divider
    • Need to confirm how Ahmed's numbers gave results for >32 stages & 32 bits
    • Divider requires one new signal over lpm – signed or unsigned divider
    • Works with small cases but untested/not integrated in Tiger
    • Floating Point divider with 23-bit mantissa and 5 stages → 130 MHz, so seems we should be able to be faster
  • Vacation from Tuesday until end of August

Jason

  • Pushed dfsin through excite flow
    • “Round & Pack” function issue – C flips LSB, verilog was flipping all
  • now all chstone benchmarks correctly run through excite flow for comparison with LegUp
  • Reviews due August 18

James

  • SHA benchmark works with accelerator flow
  • working on 64 bit data accesses (caches, etc)
  • cacheClkEn signal in cache controller screws up 3 benchmarks
    • Ahmed looking into it with SignalTap
  • His verilog code stays the same across different benchmarks, just small bugs pop up with new benchmarks

Andrew

  • Problem with new LLVM-gcc, debugging
  • build bot installed, waiting on new port from eecg
  • UBC presentation August 12
  • Weighted bipartite matching to be implemented next (GAUT uses this, seems to be most popular algorithm)
    • most binding emphasizes sharing registers & adders, but this doesn't make sense for FPGAs
  • Possible optimization is to trade mux for register to save value to next state
  • For LegUp release, will have force-directed scheduling & weighted bipartite matching binding
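
To make the weighted bipartite matching idea concrete, here is a small sketch that binds operations to functional units by minimizing a total cost. It uses brute-force enumeration, which is only practical for a handful of operations; it is not the algorithm used by GAUT or LegUp, and the cost matrix (for instance, mux inputs added by each binding) is a made-up example.

```python
from itertools import permutations

def bind_min_cost(cost):
    """Assign each operation to a distinct functional unit at minimum total cost.

    cost[i][j] is the (hypothetical) cost of binding operation i to unit j.
    Requires at least as many units as operations.
    Returns (total_cost, units) where units[i] is the unit chosen for operation i.
    """
    num_ops = len(cost)
    num_units = len(cost[0])
    best = None
    # Try every way of giving each operation its own unit.
    for units in permutations(range(num_units), num_ops):
        total = sum(cost[op][unit] for op, unit in enumerate(units))
        if best is None or total < best[0]:
            best = (total, units)
    return best
```

A production binder would replace the enumeration with a polynomial-time assignment solver such as the Hungarian algorithm; the input and output would stay the same.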

July 22, 2010

Mark

  • Will be checking VanProf's power today
  • Will also check if the debug stub in Tiger is required

Li

  • Will finish up divider modification
  • Mark or Ahmed will pick up whatever is left

Andrew

  • Will make separate class for muxes
  • Set up buildbot last night
  • Will set up graphing component which shows metrics vs. revisions

July 21, 2010

Jason

  • GProf on x86 is what everyone does
    • Probably the best methodology for now
  • Tried to get dfsin through eXcite
  • Reduced errors from 35 to 15
  • Got an extension for the eXcite license till October
  • Also tried pushing dhrystone through eXcite
    • Didn't work
    • eXcite doesn't support string constants
    • Will try to work around

Li

  • Divider in Tiger looks like it's fully pipelined
  • Creating a wrapper
    • Will get rid of one divider and multiplier since they're not needed
  • Got a clean version of Tiger which is working now
  • Going to China on Tuesday, and returning sometime in August

Mark

  • Checking out other Hardware Profilers

Ahmed

  • Compared results to estimation
    • Results are not credible since some CycloneII devices used up to 300 DSPs
    • Restricted them to 70 and running the tests again
  • Will have new results by tomorrow
  • Will also try synthesis keep to see if results get better

James

  • 3 benchmarks still not running on newly modified Tiger
    • Will probably solve those bugs by today

July 20, 2010

Andrew

  • Looking at first binding cut
  • Victor's chaining making merge a bit difficult
  • Looking at more binding implementations
  • Need to make scheduler more aware of binding
  • Will set up build bot to regularly test new commits or versions of LegUp

Victor

  • implementing list scheduling
  • Will not be here tomorrow

Ahmed

  • Will look into what resources can be shared
    • Maybe a bool ShouldWeShare() function
  • Will look into mux area growth
  • Also will push a couple of simple designs through LegUp to check:
    • Redundant registers
  • Looking into a full no-prune option for Quartus

Mark

  • Can take down Tiger's area by removing redundant operations
  • Still trying to change critical path with memory in it
  • Is there a way to select target architecture in GProf?
  • Area for VanProf almost matches earlier estimate (~3%)

Li

  • Check throughput of LPM_mult to see if we can do a new divide operation
  • Tiger checked out but still not working

James

  • Will look into getting chstone running through Tiger with Mark
  • Will then try to setup a test suite
  • Also lending a hand to Li to get Tiger up

July 19, 2010

Mark

  • Has areas for the profiler (VanProf)
  • Currently verifying results
  • Made scripts to verify results against other profilers (gprof etc.)
  • Will get power results for profilers after results are verified
  • Tiger's critical path is in a FIFO
    • Can find out how to play around with its settings
  • Would like to increase Tiger's frequency to 50MHz asap
    • Can do that using higher effort compilation (Design Space Explorer maybe)
  • Changing Tiger's divider to one with more pipeline stages shouldn't be too hard

Li

  • Still looking at improving Tiger
  • Might be able to reduce area by removing signed multiplier/divider
    • Would add some preceding mux/combinational logic to make signed/unsigned into one
    • Checking the differences between signed and unsigned multiply/divide to see whether this is even possible
  • Looked at Tiger's 11-stage pipelined divider and found out how it is set up

James

  • Moved cache out of Tiger
  • Transfer became slower when crossing clock domains
  • Can play around with Avalon's settings to try and increase the transfer speed
  • Regardless, accelerators can now access the cache which is beneficial

Andrew

  • Almost done with binding

Ahmed

  • Profiled no-dsp multipliers
  • Added the code that uses them into estimator
  • Will try to find a way to specify whether each multiplier is LUT based
  • Working on a script that parses estimator's results
  • The results will be posted soon

July 9, 2010

Mark

  • Implemented Profiler in hardware
  • It matched C++ Profiler results
  • Still some small subtleties to take care of
  • Might have to write SnoopP code to test against
  • Might be able to use LLVM flow to get information about the C programs
    • Or maybe use simulation to get that information
  • Will chstone suffice as tests for the profiler?
    • Some of them are sufficiently large
    • Will use them for now
  • Will be interesting to find instructions/cycle for Tiger MIPS
  • Also instructions/cycle for chstone using the profiler
  • What might cause less than 1 instruction/cycle?
  • Profiler should (and will) be able to track multiple metrics

Ahmed

  • The output bitwidth of the profiled multipliers was the reason for the Fmax discrepancy
  • MIPS timing result more accurate now
  • Looking at other benchmarks and debugging results
  • Added new LUT count in order to have both register and combinational components of LEs
  • Will run the script and get complete data even for pipelined multipliers/muxes
  • Maybe put estimation/area code into its own pass?
  • Will simulate power without sdf file to see difference when glitches completely ignored

James

  • When processor polls for done signal, runs the cache's state machine
    • In that time, accelerator can't access cache
  • Should we make the cache dual port?
  • Or would it be easier to create a “busy” signal from the cache
  • For the time being, will do “blocking” implementation
  • To support multiple accelerators:
    • We eventually will have to either do multi-port caches
    • Or we can just use a single-port cache with arbitration
    • Or dual-port cache with arbitration for the accelerators

Victor

  • Modifying ASAP scheduler to be more modular
  • Making it easy to add a new scheduler pass
  • Working on ALAP scheduler as well
  • We want to give multiple schedules based on resource constraints
  • Force-directed list scheduling considers hardware constraints
    • Can add extra states accordingly
  • as H/W constraints would be more efficient than human specified constraints
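
The ASAP and ALAP schedulers mentioned above can be sketched as follows, assuming unit-latency operations and a dependency graph given as a dictionary. The data layout and function names are assumptions for illustration, not LegUp's scheduler interface. The gap between an operation's ALAP and ASAP states is the mobility range that force-directed scheduling works within.

```python
def asap(deps):
    """deps maps each op to the list of ops it depends on (all ops must be keys).

    Returns op -> earliest state, with unit latency per operation.
    """
    state = {}

    def visit(op):
        if op not in state:
            # One state after the latest predecessor; state 0 if there are none.
            state[op] = 1 + max((visit(p) for p in deps[op]), default=-1)
        return state[op]

    for op in deps:
        visit(op)
    return state

def alap(deps, last_state):
    """Returns op -> latest state such that every op finishes by last_state."""
    users = {op: [] for op in deps}
    for op, preds in deps.items():
        for p in preds:
            users[p].append(op)
    state = {}

    def visit(op):
        if op not in state:
            # One state before the earliest user; last_state if nothing uses it.
            state[op] = min((visit(u) for u in users[op]), default=last_state + 1) - 1
        return state[op]

    for op in deps:
        visit(op)
    return state

def mobility(deps, last_state):
    """Slack of each op: ALAP state minus ASAP state."""
    early, late = asap(deps), alap(deps, last_state)
    return {op: late[op] - early[op] for op in deps}
```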

Andrew

  • Happy Birthday!
  • Took Friday off (it's OK, the project lives on!!)

July 8, 2010

Tom

  • glitch filtering: leave on (default)
    • simulation time step is very small (1ps)
    • glitches can occur that only last for a few time steps
    • these glitches are too short to cause a voltage swing on the wire
    • glitch filtering removes glitches that are too short to matter
  • Quartus 10.0 vector-less power
    • can be enabled with a quartus.ini variable
    • but doesn't support CycloneII

Victor

  • presentation on structs
  • LegUp struct support is equal to or better than all HLS tools we've looked at
  • doesn't support passing structs by value to a function or returning structs from a function
    • rarely occurs in practice
  • memory controller:
    • Steve: could we duplicate the input multiple times and just use the ram byte enable?
      • we do this - but you still need muxing
      • we'd have to show you a schematic of the steering logic to fully explain this
    • claim: we need a 20-1 mux
      • I don't think this is true when you look at the generated hardware
  • Tom: can run NiosII at 250-300MHz if you separate the processor from the rest of the system
    • might be able to get a higher DMIPS number

Steve

  • We want LegUp to support a multi-core and multi-FPGA environment
    • can accelerators be running on different FPGAs?
    • the current bottleneck is the shared memory
      • global memory must be shared between all cores/accelerators
      • note: accelerated functions with no side effects will not have this problem
  • eXcite: using 'channels' to communicate between s/w and h/w
    • this approach is clumsy and probably a side effect of compiling the s/w C code separately from the h/w C code
    • we have a shared memory so we shouldn't need the 'channel' construct

Ahmed

  • mismatch between LegUp 64bit multiply fmax and script
    • take a look at critical path: timing analyzer→locate in chip planner
    • script: 64bits x 64bits = 128bits
      • LegUp will truncate the result to 64bits
  • area estimate: usually estimates slightly less than actual
  • Tom: by default Quartus converts state machines into one-hot encoding
    • need to set 'user defined' flag to use the encoding in the verilog file

James

  • C2H flow works for an accelerator reading/writing to an array
  • Steve: what if the processor stores global data in a register?
    • if the accelerator modifies the global the register will be wrong
    • do we need to mark globals as 'volatile' (like interrupt handlers)?
    • LLVM might handle this assuming the h/w accelerator matches the C code exactly and doesn't write to other memory

Li

  • still working on Lab 6
  • Tom: make sure you record what problems you encountered and how you solved them
  • using SignalTap and Modelsim

Andrew

  • working on binding infrastructure
  • send Tom an example of a ModelSim testbench for students

Mark

  • profiler working in hardware and simulated in Modelsim

July 7, 2010

Ahmed

  • read a few papers on early resource estimation in HLS
  • some algorithms perform a simple place & route to improve estimation
  • power:
    • Tom: should we turn glitch filtering on or off?
    • quartus_eda produces a verilog netlist and .sdf (standard delay format) file.
      • .sdf has 3 delays for each timing arc: min/max/avg
      • are we using average?
  • Quartus 10 just came out
    • we should stick to the older version to avoid rerunning all our results

Jason

  • eXcite df* benchmarks:
    • the input/output vectors were initialized wrong: upper 32 bits mangled
      • now only 2 errors
      • after fix dfsin latency went up and got closer to legup time
    • problems seem to be related to 64 bit operations
      • in chstone, only the df* benchmarks use 64 bit integers
    • fmax results were using CycloneII Auto device with speed grade 6

Li

  • missed meeting due to headache

July 6, 2010

Mark

  • C++ version of profiler working
    • reads ELF file, runs gxemul, captures instr trace, uses hashing algorithm, counts each function call
    • verifying counts
  • profiler doesn't take into account cache misses
    • could extend to count the number of cycles instead of the number of instructions
  • could use this profile info to drive C2H flow
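
The counting step of the profiler described above might look like the sketch below: a hash table keyed by function entry address, incremented whenever the program counter in the trace lands on an entry point. The trace and symbol-table formats are assumptions for the example; extracting them from the ELF file and gxemul is not shown.

```python
from collections import Counter

def count_calls(pc_trace, entry_points):
    """Count function calls in an instruction trace.

    pc_trace is an iterable of program counter values.
    entry_points maps a function's entry address to its name (hypothetical
    layout, e.g. built from the ELF symbol table).
    """
    calls = Counter()
    for pc in pc_trace:
        name = entry_points.get(pc)   # hash lookup on the PC
        if name is not None:
            calls[name] += 1          # PC hit a function entry: one call
    return calls
```

Counting cycles instead of calls, as suggested in the notes, would need a per-instruction cycle cost in the trace rather than just the PC.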

Ahmed

  • debugging early delay estimate
    • at least 30% off, sometimes much more
    • mips: detects correct critical path (64bit multiplier)
      • estimate: 67 MHz
      • actual: 97 MHz
      • doesn't match intuition: we should be estimating a higher fmax
        • our estimates are derived from single operations that can be easily place & routed
  • area estimates
    • currently underestimating area: need to take into account muxing
  • read some papers on high level synthesis early estimation
  • fast estimation could allow us to produce multiple verilog files:
    • best_delay.v, best_area.v, best_power.v

Jason

Andrew

  • incrementally adding binding infrastructure
  • will still allow raw verilog to be output
    • in the case of testbench non-synthesizable code “initial begin”

Victor

  • looking into Tiger modelsim simulation on Linux
  • reading force directed scheduling paper
  • will need to implement an as-late-as-possible (ALAP) scheduler

James

  • debugging C2H flow: works in simulation but doesn't work on board
  • going to simulate post-synthesis .vo file

Li

  • revising lab solns
  • will look into optimizing Tiger processor fmax

Undergrad presentations are on August 11.

July 5, 2010

Mark

James

  • basic C2H architecture is working
  • modified TigerMIPS data cache to be shared with accelerators
    • 1 cycle reads if cache hit
  • added 2 slave ports to Tiger dcache. added 2 master ports to accelerator
    • 1 port: address
      • need a separate port because the avalon address is actually an offset from the base address. We need the full address
    • 1 port: data
  • architecture can support multiple accelerators without new ports
  • uses polling (like C2H)
    • add a tcl option for waitrequest - less power, and slightly fewer cycles than polling
    • every cycle Tiger reads from cache - wastes power
  • So we'll have the functionality of C2H
    • but better because we don't flush the cache, we share the processor cache
  • tcl file specifies functions to accelerate
    • add arguments to handle overloading
  • how does the user modify the C wrapper to allow processor and accelerator to run concurrently?
    • modify the generated wrapper
    • generate an async wrapper and then a polling function
    • create pthread to simulate parallelism
    • use SystemC
  • use interrupt?
    • user specifies a callback function
  • how to put everything in h/w?
    • accelerate 'main' in tcl file
    • flow automatically puts all functions called by main() into h/w

July 1, 2010

Ahmed

  • started code to map LLVM instructions to the corresponding operation name in the tcl configuration file
  • adding delay estimation code into the Verilog backend pass
    • adding the LegupConfig analysis pass as a prereq to get data from the tcl file

James

  • gsm still has problems
    • avalon write signals look correct
    • global variable read/write works, but local variables on the stack aren't being written
    • could be the TigerMIPS cache
      • will try to flush cache after calling an accelerator
      • try making local variables 'volatile'?
  • Andrew: C2H pass is creating many unnecessary altsyncrams for gsm

Victor

  • disabling link time optimization
    • this avoids inlining every function
    • more test coverage. Will be better for James's C2H flow
    • we can turn LTO on manually when benchmarking chstone to get the 10% improvement
  • Working on TigerMIPS linux flow

Andrew

  • binding infrastructure

June 30, 2010

Ahmed

  • Dynamic Power is linearly proportional to frequency.
  • So if we get the dynamic power at 1MHz we can use this formula to calculate power:
    • power = static power + dynamic power * fmax in MHz
  • We can run a non-power aware scheduler to get a rough estimate of fmax, then iterate to achieve better power
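
The formula above translates directly into a small helper. The units (mW and MHz) and the function name are assumptions made for the example; the dynamic term is the power measured at 1 MHz, scaled linearly with frequency.

```python
def total_power_mw(static_mw, dynamic_mw_at_1mhz, fmax_mhz):
    """power = static power + (dynamic power at 1 MHz) * fmax in MHz."""
    return static_mw + dynamic_mw_at_1mhz * fmax_mhz
```

With illustrative values of 40 mW static power and 0.5 mW of dynamic power at 1 MHz, a design running at 100 MHz would be estimated at 90 mW.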

Victor

  • fixing a bug with memset/memcpy when you disable function inlining: http://legup.org/bugs/show_bug.cgi?id=8
  • Pointers have been shrunk to 32 bits, dhrystone is running correctly
  • Looking into getting the Tiger modelsim flow running on linux

Mark

  • working on testbench for profiler
    • run Tiger in Modelsim, collect program counters, then run profiler testbench using the collected PCs
  • Tiger MIPS CHstone results
    • dfsin, jpeg aren't running correctly
    • adpcm is faster on Tiger than Nios
    • will look into adding a performance counter onto the avalon bus to confirm results

James

  • accelerated gsm benchmark
    • moved the compute intensive function: gsm_lpc_analysis to h/w
    • all memory is stored in SRAM accessed over the avalon bus
  • looking into using the C2H legup flow to automatically generate the C wrapper

Li

  • met with Tom to discuss Lab 6
    • isolating solutions for each part of the lab, not just the final part
  • recommends teaching the students modelsim, possibly with a provided testbench

Andrew

  • binding infrastructure

June 29, 2010

Short meeting today

General

  • Victor and Ahmed will participate in the “undergraduate research day” in August, and will prepare an abstract for submission and send it out to the group for comment.
  • Some people would rather work on Thursday (Canada Day) instead of Friday.
  • On Monday, we'll have a brainstorming session on the options for processor/accelerator interface and the way that users could control that.

Victor

  • Fixed switch statement problems in MIPS back-end of LLVM. Communicated the patch to LLVM developer who maintains the back-end.
  • Will start reading a couple papers on scheduling, with a view to improving the current scheduler. A start point might be to implement as-late-as-possible scheduling.

Ahmed

  • Added command line parameters related to power characterization to his script. User can set # of vectors. User can set toggle probability for primary inputs.
  • Currently working on some minor bugs.
  • Most likely will start working on early delay estimation, early area estimation. I.e., actually using the data from his models in high-level synthesis.

Andrew:

  • Generated a very nice list of tasks that need to be done. This is for Ahmed & Victor, who are close to wrapping up their current activities.

June 28, 2010

Andrew:

  • Working on binding infrastructure: separating binding from RTL code generation. The circuit will be represented in memory using a generic representation not tied to any RTL language. The intent is that in future, we may want to add other back-end code generators, e.g. VHDL.
  • Mark pointed out that SPREE may have some related work.

Mark

  • Generating #s for the CHStone benchmarks running on Tiger: cycle count. Will post results on-line when ready.
  • Seeing that some of the CHStone circuits are producing incorrect results.
  • Also working on his profiler. See that 2/3 M4K RAMs will be sufficient

Ahmed

  • Writing script for power characterization of hardware modules.
  • Going to add parameters to control input activity, # of vectors to simulate for. The user will be able to specify a toggle probability for the primary inputs.
  • For at least a few modules, will compare the power #s with and without the full delays (affects glitches).

Victor

  • Fixed “select operator” ( ? : ) MIPS code generation in LLVM. Fix also seems to repair the 64-bit comparison bug. Isn't sure if these bugs are indeed fixed.
  • Wrote a script to alter LLVM MIPS assembly to fix the switch statement bug.
  • Filed bugs against LLVM MIPS back-end.
  • All the CHStone benchmarks work through LLVM, targeting MIPS, with -O3 compiler optimizations.
  • Found the DMIPS for dhrystone has increased, is investigating why.

James

  • Working on writes to memory for passing parameters/data between HW / SW.
  • We agreed that we should have a brainstorming session on the key interfaces we should provide for the processor talking to the HW accelerators, taking the best ideas from C2H & eXCite.

Li

  • ECE342 labs are done; he is still debugging Lab 6.
  • Going to start analyzing the performance of Tiger MIPS, see if anything obvious that can be done to improve the crit path & clock freq.

June 22, 2010

Andrew

  • Keep working on Binding.
  • Currently, the tool just prints out Verilog text. Instead, we want some separation between the Verilog output and everything else.
  • Will have internal presentation on Binding.
  • Idea on supporting SystemC.
  • Cleaning up Scheduler(Binding should go after Scheduling)

Ahmed

  • Getting the power data.
  • Tuned NIOS so that it can run at a frequency of 131-187 MHz (size does not increase as much).
  • Try using ModelSim to generate random numbers for simulation.

Mark

  • The NIOS is working now.
  • Blocking at one spot (large mux latency)
  • Have an accurate cycle result now
  • Will have a comparison table for NIOS vs. Tiger

Li

  • Finishing up Lab 6 (trying to put all the components together)

June 21, 2010

James

  • Fixed the problem with M4K (some Altera option can double the size of M4K)
  • Measured the performance of Tiger MIPS (4x slower than NIOS).
  • The measurement is done in seconds, should be more accurate in cycles.

Ahmed

  • Tried the NIOS II running at a high frequency of about 130 MHz.
  • The optimal configuration can be run at 160-170 MHz.
  • Cyclone II can support up to 50 MHz. If we want to make it higher, we need to tune it ourselves.

Andrew

  • Went through C2H functions to find out what needs to be put into hardware and what needs to be in software.

Mark

  • Implementing the system; stopped at some spots, which should be easy fixes.
  • Once all the parts are ready, it will be easy to implement SnoopP and compare the two.

Li

  • Reading tutorials on Tiger MIPS, and trying to understand all the files in the MIPS project related to timing.
  • keep working on Lab 6

June 18, 2010

Mark

  • fixing up system from GIT screwup

James

  • MIPS benchmark + processor system works
    • only in simulation b/c synthesis doesn't fit on board (too many M4Ks)
  • Obstacles
    • mips.v large, had to figure out timing
    • this process can be generalized/automated – next benchmarks will be easier
    • each accelerator has own SW wrapper
  • data cache + ins. cache = ~70/105 M4Ks
    • this is twice as large as it should be
    • for some reason caches are using dual port despite being specified as single
  • some benchmarks don't fit on chip with pure HW

Li

  • look into optimizing tiger processor to improve fmax
  • working on Lab 6 for Steve/Tom

Ahmed

  • Interface with TCL set up
  • using an operations class that contains the data like fmax, latency, etc
  • legup_config class that contains all operators and can access them
    • uses STL map – maps (string)op_name –> (operations)op
    • could also contain chip resources?
  • run NiosII F with speed opt, different seed, fmax constraint, etc.
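
The operations class and op_name lookup described above could be organised roughly as follows. The notes describe a C++ class built on an STL map; this is a Python sketch of the same shape, and the field names, units, and the example operation name are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Operation:
    """Characterisation data for one hardware operator."""
    name: str
    fmax_mhz: float
    latency_cycles: int

class LegupConfig:
    """Holds all operators and looks them up by name (op_name -> Operation)."""

    def __init__(self):
        self._ops = {}

    def add(self, op: Operation) -> None:
        self._ops[op.name] = op

    def get(self, op_name: str) -> Operation:
        return self._ops[op_name]   # raises KeyError for unknown operators

    def has(self, op_name: str) -> bool:
        return op_name in self._ops
```

Chip resources, as raised in the notes, could be stored in the same object as a second dictionary keyed by resource name.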

Victor

  • debugger works, run on chstone
  • all chstone match traces exactly, except GSM on old LegUp code, which shows error with phi nodes
    • this is the same bug Andrew found last week, proving that the debugger works and is efficient

Andrew

  • Put up graphs on wiki comparing HW with Nios (of benchmarks that fit on DE2)
    • x axis - HW area increase over Nios II f
    • y axis - speedup of HW over Nios II f

June 16, 2010

Ahmed

  • Nios 2 benchmarking done
    • data + walkthrough up on wiki
    • area + fmax will be posted
    • modelsim gives deterministic results for performance counter (wasn't deterministic on chip)
  • Operator helper object next
    • To get data (latency, area, etc) into LLVM at run
    • will use TCL instead of a comma-separated value file
    • TCL can use loops to estimate data (ie for new devices we don't have data for, like Xilinx)
    • pass llvm instruction (or opcode) into class to get data
    • pre-populate vectors in helper class at startup and just return them as needed
    • wildcard on operators in TCL? (ie set all operators delay = 5)
    • how can we use llvm instruction to indicate operators not in C (ie muxes and registers)
    • llvm does signed/unsigned extend then unsigned add (as opposed to signed/unsigned add)

Victor

  • debug tool – trace between lli & modelsim
    • mostly done, making minor tweaks
  • to make changes to structs
    • mips = 32 bits, verilog = 64, so pointers don't correspond
    • need to match verilog struct and pointer to MIPS
    • use MSB of pointer to indicate whether there is a tag inside pointer or not (ie MSB == 1 –> global variable, no tag, MSB == 0 –> local variable, with tag)
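
A minimal sketch of the MSB convention described above, assuming a 32-bit pointer word. Only the top-bit test comes from the notes; the layout of the tag itself is not specified there and is left out.

```python
POINTER_BITS = 32                      # assumption: MIPS-side pointer width
MSB_MASK = 1 << (POINTER_BITS - 1)


def is_global(ptr):
    """MSB == 1 -> global variable, no tag inside the pointer."""
    return bool(ptr & MSB_MASK)


def has_tag(ptr):
    """MSB == 0 -> local variable, pointer carries a tag."""
    return not is_global(ptr)
```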

James

  • processor + accelerator take up too many BRAMS for cyclone 2
  • can map small arrays into same M4K instead of one for each array
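
One way to picture sharing M4Ks between small arrays is a first-fit packing by bit count, sketched below. The 4096-bit capacity is an assumption (usable data bits, parity excluded), and a real mapping would also have to respect port width and depth configurations, which this sketch ignores.

```python
M4K_BITS = 4096  # assumption: usable data bits per M4K block


def pack_arrays(array_bits):
    """First-fit packing.

    array_bits: dict of array name -> size in bits.
    Returns (placement, block_count), where placement maps each
    array name to (block index, bit offset inside that block).
    """
    free = []        # remaining bits in each allocated block
    placement = {}
    for name, bits in array_bits.items():
        if bits > M4K_BITS:
            raise ValueError(f"{name} does not fit in a single block")
        for blk, remaining in enumerate(free):
            if bits <= remaining:
                placement[name] = (blk, M4K_BITS - remaining)
                free[blk] -= bits
                break
        else:
            # No existing block has room: open a new one.
            free.append(M4K_BITS - bits)
            placement[name] = (len(free) - 1, 0)
    return placement, len(free)
```

With one M4K per array, three small arrays cost three blocks; packed this way they may need fewer.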

Jason

  • Compiled 3 benchmarks with Excite onto Cyclone II
  • had to use “no_prune” so it didn't synthesize all logic away since return value wasn't an output

Top Priorities

  • Debugger (Victor)
  • Helper object (Ahmed)
  • Tiger Latencies (Mark)
  • Built-in Self Test for MIPS and Tiger

Lower Priorities

  • Timing Analysis
  • Power data (Ahmed)
  • Tune Tiger MIPS

June 14, 2010

Andrew

  • finishing up accelerator wrapper code (cleaning up and separating)
  • move modified code (ie llc, opt) to own folder

James

  • working on Mips benchmark in processor+accelerator system
  • accelerator alone works, but starting with processor causes wrong output and array-out-of-bounds error

Ahmed

  • will get delay results from modelsim (using $time) to confirm Nios's profiling results (since non deterministic)
  • modelsim still giving issues with Nios system
    • might not be linking in altera_mf libraries
  • all benchmarks but SHA are working on Nios

Victor

  • Adding a pass to allow for easier HW debugging
    • add printfs to end of basic blocks (or after assignments) of post-optimized bytecode
    • when compiled through legup, printf –> $display
    • lli and modelsim can be used to compare outputs like traces to track down bugs
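
The trace comparison could look something like the sketch below: both runs print one line per event, and the first differing line points at the bug. The function name and the trace format are hypothetical; the notes do not describe the actual tool.

```python
def first_divergence(sw_trace, hw_trace):
    """Return (index, sw_line, hw_line) for the first mismatch, or None.

    sw_trace: lines printed by the software run (e.g. lli).
    hw_trace: lines printed by the simulation (e.g. $display output).
    """
    for i, (sw, hw) in enumerate(zip(sw_trace, hw_trace)):
        if sw.strip() != hw.strip():
            return (i, sw, hw)
    if len(sw_trace) != len(hw_trace):
        # One trace ended early; report the first unmatched line.
        i = min(len(sw_trace), len(hw_trace))
        sw = sw_trace[i] if i < len(sw_trace) else None
        hw = hw_trace[i] if i < len(hw_trace) else None
        return (i, sw, hw)
    return None
```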

Mark

  • outlining schedule of tasks
  • simplify hashing as much as possible
  • get profiler working in modelsim
  • working on the data transfer between PC & FPGA for initializing profiler & retrieving data

Li

  • At convocation

Jason

  • At thesis defense.

June 11, 2010

Andrew

  • Added a tcl configuration file
    • Easy to call from C
    • Can define which function to accelerate
  • Legup will now require TCL library
    • Widely used in many applications so no worries
  • Calls to the accelerated function will be replaced by calls to a wrapper function
  • Thought of an optimization for Victor's memory code
    • When coded in verilog nothing changed
    • Quartus probably optimizing it already
  • Made changes with James to overcome latency caused by wait request signal when accessing SRAM
  • LLVM's 2.7 update caused some performance results to decrease

Ahmed

  • Got all data for NiosII-f
  • Will be tweaking NiosII to match Tiger MIPS not other way around
    • Much easier to play with NiosII
  • Benchmark results are not constant: they vary by about ±1%
    • The results are from actual board runs
  • Seeing if Modelsim results are constant
  • Trying Modelsim rather than Modelsim-Altera
  • Will be working alongside Victor on helper class
  • Will be adding register support in Profiler afterwards
    • Will also try to overcome max I/O pin issue

Mark

  • Looking into other “perfect hashing” algorithms
  • Looking into settings that use 2 and 3 input M4Ks rather than registers
    • Will scale nicely

Jason

  • Will try to push eXCite verilog into quartus and get LUT count and fmax

Victor

  • Looking at an OGG decoder
    • Open source benchmark
  • Looked at Mediabench
    • simplest one would take a very long time to run
  • June 19th – Getting 4 wisdom teeth out

June 9, 2010

Ahmed

  • Looking into disabling printf/UART overhead for the benchmarks
  • All 13 benchmarks ran through NiosII successfully
  • Looking into profilers for the NiosII

Jason

  • Will check if eXCite can support structs
  • How early do we need to support Floating Point
    • LLVM mips backend does not support FP
    • Would take a lot of effort and time

Victor

  • Looking into more useful benchmarks
    • One that has possible and distinct software/hardware components in order to demonstrate speedup
  • Should look into Mediabench
  • Should we get a floating point benchmark?
  • Does CtoH support structs? Do the others?

Mark

  • Decided against binary search
    • Turned out to take extra cycles
  • Found an alternative “Perfect Hashing”
    • Always has a predictable number of cycles
    • Collision free
    • Tested it for multiple benchmarks
      • Took 5 cycles
    • Looking into why Fmax is a little low
    • Will try to use fewer cycles and see impact on Fmax
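
To illustrate what collision free means here, the sketch below searches a very simple hash family (shift, then mask) for a setting that sends every function start address to a distinct slot. This is not the algorithm Mark found; the hash family, the function names, and the example addresses are all assumptions chosen for brevity.

```python
def find_shift(keys, table_bits, max_shift=32):
    """Find a shift so that (key >> shift) & mask is collision free.

    Returns the first shift that works, or None if this simple
    family cannot separate the keys.
    """
    mask = (1 << table_bits) - 1
    for shift in range(max_shift):
        slots = {(k >> shift) & mask for k in keys}
        if len(slots) == len(keys):
            return shift
    return None


def slot_of(key, shift, table_bits):
    """Single lookup: one shift and one mask, so the cost is fixed."""
    return (key >> shift) & ((1 << table_bits) - 1)


# Hypothetical function start addresses.
addresses = [0x400000, 0x400120, 0x4003F8, 0x400A00]
shift = find_shift(addresses, table_bits=3)
```

Because every key lands in its own slot, a lookup never needs a probe sequence, which is what makes the cycle count predictable.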

June 8, 2010

Jason

  • Ran 7/12 through eXCite
  • 4 df_* did not give correct output
  • JPEG will be running overnight
  • Contacted eXCite
    • They had to debug other 3 benchmarks (not df)

Mark & Victor

  • Preparing for Friday's presentation

June 7, 2010

Jason

  • Ran 3/12 CHstone benchmarks on eXCite (mips,aes,sha)
    • Very slow runs
    • Will redo JPEG, since first run took about 5 hours then died
  • eXCite can run on multiple families and has options for optimization/pipelining
    • Will try them out
  • Will do a demo of eXCite on Thursday, 24th

Mark

  • Outlined presentation for Thursday
  • Created diagrams and notes
  • Looking into how to implement power profiling
  • Can try looking at power per instruction class
    • But we want power per function
    • Can work around that, but will take a while
  • Looking into implementing CAMs for increase in speed
  • Slight increase in area/power as well
  • Might not be supported in CycloneII, but is in high-end families like Stratix
  • Can try to build them through logic
  • Can just implement later when moving to high-end families
  • Will map thoughts out on blackboard/slides next meeting

Ahmed

  • Should check Registers vs. BRAMs in power profiling
  • Check data on comparators (which Mark might need)
  • Will measure NIOSII DMIPS
  • Decided on output file layout with victor

Victor

June 4, 2010

Mark

  • Most of the profiler design planned out
  • How to map address in memory to corresponding functions
    • Some sort of hash table can be used (Snoopy method)
    • Conversion table that indicates which address correspond to which function, then binary search into the table
  • Started to code verilog
  • Will create flow chart to show flow of profiler
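
The conversion-table option above amounts to a sorted list of function start addresses searched with a binary search. A rough software sketch, with hypothetical function names and addresses:

```python
import bisect


def build_table(functions):
    """functions: dict of name -> start address.

    Returns two parallel lists sorted by address: (starts, names).
    """
    items = sorted(functions.items(), key=lambda kv: kv[1])
    return [addr for _, addr in items], [name for name, _ in items]


def function_at(pc, starts, names):
    """Return the function whose start address is the largest one <= pc."""
    i = bisect.bisect_right(starts, pc) - 1
    return names[i] if i >= 0 else None


starts, names = build_table({"main": 0x400000, "foo": 0x400120, "bar": 0x4003F8})
```

The number of comparisons grows with log2 of the table size, so the lookup time depends on how many functions are profiled.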

Ahmed

  • Went through 4 tutorials to learn about Modelsim, SoPC Builder
  • Will run 12 CHStone benchmarks and Dhrystone on Nios to compare performance against Tiger

Victor

  • Made update to how Legup handles structs
  • DMIPS (Dhrystone MIPS) for Legup
    • For Cyclone 2: 100 DMIPS
    • For Stratix 2: 140 DMIPS
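
For reference, DMIPS figures are conventionally obtained by dividing Dhrystone iterations per second by 1757, the score of the VAX 11/780 reference machine. A small sketch of that arithmetic (the run counts below are made-up inputs, not our measurements):

```python
VAX_DHRYSTONES_PER_SECOND = 1757  # reference machine score (1 DMIPS)


def dmips(iterations, seconds):
    """Dhrystone MIPS from an iteration count and elapsed time."""
    return iterations / seconds / VAX_DHRYSTONES_PER_SECOND


def dmips_per_mhz(iterations, seconds, clock_mhz):
    """Clock-normalised score, useful when comparing across devices."""
    return dmips(iterations, seconds) / clock_mhz
```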

Jason

  • Tried out Excite on MIPS benchmark
  • MIPS goes through and shows correct result
  • MIPS takes 4000-5000 clock cycles to complete for code generated by Excite, ~11,000 clock cycles for code generated by Legup
  • Trying to push JPEG benchmark, really slow to compile (just compiling to verilog) on Excite, whereas Legup takes 4 seconds.
  • Runtime may be another metric for paper
  • Excite has a nice GUI which generates a state diagram, and clicking on each state shows which line of C code it corresponds to
  • Excite only supports one C file

James

  • Got SRA example working with having accelerator directly access main memory
  • Memory access (SDRAM) is really slow, takes around 20 clock cycles, will be a huge bottleneck
  • For SRA, getting 2 inputs takes ~50 clock cycles, bursting reduces this to ~20 clock cycles
  • Burst through and grab all needed inputs at the beginning, store them in a FIFO, and access the FIFO during computation

Li

  • Discussed lab 1-4 solutions with Tom
  • Working on lab 5 now

June 2, 2010

Ahmed

  • Done pipelined dividers and multiplexers
  • Multiplexers are really fast, don't need to pipeline
  • Discussion on Muxes (a big Mux may be larger than the shared component (adder), hence may be less total area to replicate the component rather than use mux)

Discussion for future paper:

  • On one end of the spectrum, we can have software only solution, tiger proc running everything in software, on the other end, full hardware solution, Legup synthesizing everything to hardware
  • Along the spectrum, we can have successive different configurations with more or less functions synthesized to hardware
  • This will create a good latency vs area graph for paper
  • Results needed for paper : Area, latency, power in comparison to
    • tiger mips vs NiosII (full software solution)
    • Legup vs other HLS tool (full hardware solution)
      • Other HLS tools : Xpilot (UCLA), SPARK (UCSD), Excite (commercial), Forte Cynthesizer (commercial), C2H (commercial)
      • Jason will look into if these tools can be used for comparison in paper
    • Hybrid solution results where some functions are accelerated based on profiling
  • We need benchmark which show advantageous results for the hybrid solution
  • Chstone was created for HW only solution (does not contain things that should be run in SW, i.e. malloc, linked lists), hence may not exhibit good results for the hybrid solution
  • Some tuning may be needed for tiger proc to increase performance

June 1, 2010

Short meeting today

Ahmed

  • Finished Pipelined multipliers
  • Working on Muxes (parameters are bitwidth, depth)

Victor

  • Complicated structs (structs of arrays, arrays of structs) are working
  • Presentation next Thursday

James

  • C2H is easy to use: click on a function to accelerate it, and building the project runs SOPC Builder and Quartus in the background, which automatically connect the accelerator to Avalon and synthesize it
  • An Avalon master port is created for: 1. pointer dereference (* operator), 2. indexing into an array ([ ] operator), 3. accessing a struct or union member (. or -> operators), 4. use of a global or static variable
  • Cache is flushed whenever accelerator, which has access to same memory as CPU, executes

May 31, 2010

Andrew

  • Fixing bug caused by updating to LLVM 2.7
  • Hack fix for now, all benchmarks work except for GSM
  • Not checked in yet, thus current version in git fails
  • Will check in code for C wrapper

James

  • Reading C2H User Guide and trying out the tool to understand their flow
  • Has lot of good information which can be useful for LegUp

Ahmed

  • Updating src code to test pipelined multipliers
  • After getting results for pipelined multipliers, will move on to dividers, muxes, then onto power
  • Jason suggested building a helper object in C++ which accepts LLVM IR as inputs and outputs area, latency, power

Mark

  • Trying to outline the profiler design
  • Planned out interface b/w profiler/processor, user/processor
  • Will modify mipsload/debug stub for communication
  • Problems with compiling mipsload due to unknown libraries being used (part of binutil package)
  • Snoopy more or less mapped out, which will be useful as a baseline design

Li

  • Received DE2 from Tom
  • Finished ECE342 Lab3
  • For lab1, used an Altera Function for interrupts, whereas the solution used assembly code

Victor

  • Still working on nested structs/arrays
  • Rewrote some of Andrew's code for nested arrays and this code was checked in Thursday
  • Trying to get better, working from home today

May 28, 2010

Li

  • Has gone through the first two 342 labs.
  • Considering improvements to recommend.
  • The former 342 TAs in the room mentioned that labs 1 & 2 were “do-able” by most students.
  • Li: Possibly could give students “skeleton” code that they would modify, rather than having them take a from-scratch approach.
  • Li will formulate suggestions, communicate to Tom & Steve.

Discussion with Davor around Tiger MIPS

  • Davor suggested we consider SPREE which is from UofT and would allow us to use something “invented here”, thereby raising its profile in the research community.
  • Mark showed Davor his analysis of the different processors considered.
  • SPREE appears to have smaller area and faster performance than Tiger MIPS.
  • SPREE supports less of the MIPS instruction set than Tiger. SPREE doesn't support the div instruction, and some others.
  • SPREE has no cache / memory architecture and only uses the on-FPGA block SRAM memory, therefore cannot accommodate big programs.
  • The extra functionality of Tiger likely explains its larger size versus SPREE.
  • Tiger already works with Avalon, SPREE hasn't been tried in that context.
  • Tiger is used in courses in Cambridge and elsewhere and is therefore likely well-vetted/tested.
  • SPREE doesn't offer the complete development/debugging “ecosystem” that Tiger comes with.
  • Davor mentioned that Kaveh (Moshovos' student) is building his own NIOS.
  • Andrew pointed out that there is no LLVM back-end for NIOS.
  • After hearing the pluses/minuses, Davor agreed that Tiger is the best choice for us.
  • General agreement that we want to use UofT-based infrastructure, where possible and convenient.
  • Davor proposed we have some discussion around FPGA soft-processor usage within the different research groups in ECE. Jason: Perhaps one or more of the monthly FPGA seminars could be used for this.
  • Idea for a small project: analyze the critical path of Tiger and see if perhaps some simple Verilog changes can be applied to improve its speed.

Discussion around courses ECE241/342

  • Davor suggested that both courses be moved to System Verilog, as it has some clarity advantages / ease versus Verilog.
  • It was pointed out that System Verilog may not have much traction in industry, versus standard Verilog and VHDL.
  • Davor suggested the project be cut from 241.
  • Jason mentioned Steve's proposal that the 241 project be made optional (with potential for bonus/incentive), with the alternative to the project being additional labs.

May 26, 2010

Andrew

  • Pragmas can only be supported with clang using IR metadata (latest version of LLVM)
  • Use an external config file instead?
  • Config files cannot be as fine-grained as pragmas, ie for loops, blocks of code, etc…
  • In the long term, we can support both
  • We will need a config file anyways for timing/power data (Ahmed's stuff)
  • Start with hw/sw partitioning before allocation, binding so James has something to work with

James

  • Got simulation working with MIPS processor and Avalon bus
  • Try sra example next
  • Value returned on Avalon bus
  • Figure out hardware accelerators writing to main memory (later)

Mark

  • Most of newlib supported on mips processor and simulator
  • Push processor stuff onto git when it's ready

Victor

  • Look how to support newlib library (as built-in functions) later
  • .mif file generation and getelementptr for structs in progress for arbitrary struct/array nesting

Ahmed

  • Created wiki page for perl script usage
  • Links to generated output files and graphs on wiki too
  • Working on Thursday's presentation

Li

  • Still working on 342 labs, currently on lab 2

May 25, 2010

Andrew

  • Work on pragmas (focus of this week)
  • Checked in modularized scheduler code
  • Read up on C to H, how they handle pragmas
    • Didn't have pragmas to split hardware and software though

James

  • Look at how C to H handles the bus
  • Prototyped with NIOS (via polling) and a simple Verilog adder via Avalon bus
  • Work on sra next
  • Handshaking mechanism via memory-mapped registers (read/write, also a wait signal may be necessary on the Verilog side)

Mark

  • Check in Tiger MIPS flow into git when ready
  • Check in Windows (working) and Linux (almost working) toolsets
  • Problem in jpeg benchmark with exit call
    • Increment an error counter in jpeg?
    • For now, exit replaced with a function call that prints error
  • Develop libraries for system call-like functionality as we go, ie main return value
  • Function lowering, ie printf
  • Try MIPS libraries on Cyclone II
  • Processor should eventually support rand function, etc…
  • Start profiling this week

Ahmed

  • Different device output errors under control
  • TimeQuest gets stuck on fat dividers
    • Use regular timing analysis if TimeQuest fails (ie fmax = 0)
  • Create an html page with graphs per device
  • Make short presentation on Thurs (powerpoint)
  • Register-to-register delay working for all devices
  • Look into pipelined multipliers, dividers, mods (report how many cycles)
  • Start power analysis this week
  • Ask Tom for advice on power

Victor

  • Commit current struct code
  • Make struct code more generic (arrays of structs, structs of structs…)
  • Linked list support? Aka traversal, but not allocation
  • malloc and free probably not worth it to implement in hardware

May 21, 2010

Andrew

James

  • can simulate the whole tiger mips processor in modelsim on windows
    • SOPC builder generates modelsim files
  • steps:
    • use llvm-gcc to compile C code into an ELF binary
    • use elf2dat to transform ELF into a .dat SRAM initialization file
      • mark wrote elf2dat in C
    • run modelsim script to simulate

Mark

Li

  • done first ece342 lab
  • McMaster uses SystemVerilog

Ahmed

  • parsing problem:
    • Stratix: uses ALUTs - 'adaptive' look up tables
    • Cyclone II: uses LUTs
  • No DSP elements on Cyclone II, just embedded multipliers
  • switched to TimeQuest instead of TAN
  • make sure delay is reg-to-reg and not i/o pin to register
  • Power:
    • vectorless: probably need activity factors, the toggle rate at each node
    • vector: uses a .vcd (value change dump) file as input
  • Could generate a few data points:
    • toggle rate = 50% (very high), 25%, 10%

Victor

  • arrays of structs work
  • found function: EmitGEPOffset()
    • can turn getelementptr instruction into a series of adds and multiplies
    • but only useful for constants known at compile time
  • filed bug with legup on 64-bit:

Talk today: Michael McCool

  • problem with OpenCL: only supports 2 memory hierarchies
    • McCool thinks hierarchy should be recursive
  • QuantLib:
    • tons of virtual functions: 10x speedup just converting to C!

May 20, 2010

Andrew

  • new scheduler code is done
    • adding doxygen comments
    • separated classes into separate files
  • next: will look into h/w s/w partition pragmas
    • how do c-to-h, xpilot, impulsec, do it?
  • need documentation for core data structures (like ABC)

Mark

  • bug reports filed to LLVM bugzilla
    • can you post links on the wiki?
    • Mark is subscribed and will be notified of fixes
  • sha has a bug on 64-bit:
    • Jason/Mark fixed using a 32-bit mask
    • Victor had a better fix: change type from 'unsigned long' to 'uint32_t'

Victor

  • struct with non-uniform primitive types work
  • struct assignment works
    • uses llvm memcpy
  • struct initialization of mif files works
    • mif file is passed as a parameter to altsyncram
  • struct support will be a great bullet point in the paper
    • some C-to-H tools don't support structs (ie. eXcite used in the chstone paper)

James

  • created small h/w component attached to the avalon bus
  • waiting on Mark to get fixes to Tiger mips
  • lumping function arguments on the bus for now
    • bus width max is 1024 bits
    • multi-pump: use multiple cycles to send function arguments
      • area/speed trade off could be explored
    • does each slave to master connection have a different data width?
    • optimal speed is when bus width = 256 bits
  • stall processor execution with an avalon waitrequest
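
The multi-pump trade-off above comes down to how many bus cycles it takes to send all function arguments for a given bus width. A back-of-the-envelope sketch, assuming arguments are simply concatenated and sent in bus-width chunks:

```python
def transfer_cycles(arg_bits, bus_width):
    """Cycles needed to send all arguments over a bus of bus_width bits.

    arg_bits: list of argument widths in bits.
    Uses ceiling division: a partly filled last beat still costs a cycle.
    """
    total = sum(arg_bits)
    return -(-total // bus_width)
```

For example, ten 32-bit arguments fit in one cycle on a 1024-bit bus but need two cycles at 256 bits, which is the area/speed trade-off mentioned above.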

Li

  • doing 342 labs
    • last lab only ~6 people finished
  • Jason: will look into keeping a lab room open 24 hours

Ahmed

  • generated delay plots overnight
    • investigating plots where the delay doesn't increase monotonically as bitwidth increases
  • if quartus project exists script just parses the quartus .rpt files
    • add a command line switch to override this
  • eecg server mint is 3x slower than laptop (even with 4 Xeon 1.6GHz machines)
  • to run overnight use 'screen'
    • allows you to detach shell sessions and then re-attach later
    • allows multiple tabbed bash terminals in the same ssh session
    • free, created in the 80s
  • try compiling without DSP blocks
  • future: use delay lookup to perform timing analysis of legup output
    • loop over all states, check dependencies between instructions, lookup delay
    • also could put multiple operations in a state based on the slack available
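
The delay-lookup idea could be prototyped along these lines: for each state, chain the delays of instructions that depend on each other within that state and report the longest chain. The data layout (instruction name plus operator kind, a dependency dict, a delay table in ns) is an assumption for illustration, and instructions are assumed to be listed in dependency order.

```python
def state_delay(instructions, deps, delay_ns):
    """Longest chained combinational delay inside one state.

    instructions: list of (name, operator kind), in dependency order.
    deps: dict of name -> names it depends on.
    delay_ns: dict of operator kind -> delay from the lookup table.
    """
    finish = {}
    for name, kind in instructions:
        # Only dependencies computed in this same state add to the chain;
        # values from earlier states are assumed to come from registers.
        start = max((finish[d] for d in deps.get(name, ()) if d in finish),
                    default=0.0)
        finish[name] = start + delay_ns[kind]
    return max(finish.values(), default=0.0)


def critical_state(states, deps, delay_ns):
    """Loop over all states and return (state, delay) for the slowest one."""
    delays = {s: state_delay(instrs, deps, delay_ns)
              for s, instrs in states.items()}
    worst = max(delays, key=delays.get)
    return worst, delays[worst]
```

The difference between the clock period and a state's delay is the slack that could be used to pack more operations into that state.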

May 19, 2010

Andrew

  • states are now stored in a doubly linked list to make insertion at arbitrary positions easy

Mark

  • emailed ecehelp to install dejaGNU for legup testsuite
  • working on filing chstone bugs and checking changes into git
  • next step: profiling
    • start with snoopy profiler
    • snoopy: two comparators to determine when program counter is within an address range
      • can't add new functions at runtime
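
A software model of the two-comparator check might look like this. The class name is hypothetical, and treating the upper bound as inclusive is an assumption.

```python
class RangeProfiler:
    """Counts cycles whose program counter falls in each fixed range."""

    def __init__(self, ranges):
        # ranges: dict of name -> (low, high). Fixed at construction,
        # mirroring the note that functions cannot be added at runtime.
        self._ranges = dict(ranges)
        self.counts = {name: 0 for name in self._ranges}

    def clock(self, pc):
        """Call once per cycle with the current program counter."""
        for name, (low, high) in self._ranges.items():
            # The two comparisons stand in for the two hardware comparators.
            if low <= pc <= high:
                self.counts[name] += 1
```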

James

  • getting tiger mips working
    • tiger mips communication server doesn't work on 64bit windows
    • installed 32-bit WinXP inside a VirtualBox virtual machine
    • had to adjust usb settings and install a driver for altera usb blaster
  • how to pass function arguments to h/w accelerators?
    • right now we use input ports (wires/flip flops)
    • could use a stack in a shared SRAM
      • support recursion? Jianwen Zhu has a paper on this

Victor

  • byte-enable of ram allows us to avoid instantiating multiple rams
    • byte-enable masks bytes when writing
  • structs containing only long integers work in legup
  • struct memory will be 64bits wide (8 bytes)
    • to match llvm alignment
  • struct alignment can cause wasted space for example:
    • struct {char a; int b; }
      +----+----+----+----+----+----+----+----+
      | a  |    |    |    |        b          |
      +----+----+----+----+----+----+----+----+
      
  • 3 bytes wasted right after 'char a' because b must be 4-byte aligned (integer)
  • C99 standard doesn't allow reordering members
  • does byte-enable add delay?
    • is it supported on Cyclone II?
      • altera doc says “All TriMatrix memory blocks that are implemented as RAMs, support byte enables…”
  • interesting fact: byte-enable allows the user to set the number of bits in a byte
  • future:
    • right now all global variables are put in a separate ram
    • could run out of rams
    • cluster global variables into one ram
  • can we use the SRAM?
    • we will have to share the Tiger MIPS memory eventually
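
The padding in the struct {char a; int b;} example above can be checked with Python's ctypes module, which lays out structures using the platform's native C alignment rules. The sketch assumes a platform where int is 4 bytes and 4-byte aligned.

```python
import ctypes


class Example(ctypes.Structure):
    # Same member order as the struct in the notes.
    _fields_ = [("a", ctypes.c_char), ("b", ctypes.c_int)]


offset_b = Example.b.offset                 # where 'b' starts
padding = offset_b - ctypes.sizeof(ctypes.c_char)
total_size = ctypes.sizeof(Example)
print(f"b at offset {offset_b}, {padding} padding bytes, size {total_size}")
```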

Ahmed

  • added arithmetic shift
    • a >> b: if a is signed, you must sign extend
  • looked at verilog in RTL viewer - looks correct
  • not a priority to make the script incremental - will be run overnight
  • plotting results using 'ploticus'
  • next step: power
  • focus on dynamic power for now
    • leakage power might be hard to separate for the operation alone
  • does speed grade affect each operation equally?
    • speed grade is encoded in device part number
    • ie. for Cyclone II used on DE2 the device is EP2C35F672C6
      • Device: EP2C = Cyclone II
      • LEs: 35 = ~35,000 logic elements
      • Package: F = FineLine BGA package
      • Pin Count: 672 = 672 pins
      • Temperature: C = Commercial range
      • Speed Grade: 6 = 6 speed grade (lower is faster)
  • subtract setup time
    • caused by registers so shouldn't count towards operation delay
    • but this is a second order effect
      • we don't take into account placement and routing delays
      • we mainly need a relative ordering between operations
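
The part-number breakdown above (EP2C35F672C6) can be turned into a small parser, for example to let a script pick the speed grade out of a device name. The pattern below only follows the Cyclone II fields listed in the notes and is not a general Altera part-number decoder.

```python
import re

# Fields, in order: family, LE count (thousands), package letter,
# pin count, temperature letter, speed grade.
_PART_RE = re.compile(r"^(EP2C)(\d+)([A-Z])(\d+)([A-Z])(\d+)$")


def decode_part(part):
    m = _PART_RE.match(part)
    if m is None:
        raise ValueError(f"not a recognised Cyclone II part number: {part}")
    family, les, package, pins, temp, speed = m.groups()
    return {
        "family": family,
        "logic_elements_thousands": int(les),
        "package": package,
        "pin_count": int(pins),
        "temperature": temp,
        "speed_grade": int(speed),   # lower is faster
    }


info = decode_part("EP2C35F672C6")
```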

Li

  • helping Tom improve 342 labs

May 18, 2010

Conference yesterday

  • Paul Chow's group has used MPI to communicate between the processor and hardware accelerators
    • could pass function arguments using MPI?

James

  • try to prototype the whole system
    • use a simple function from chstone or create one
    • hack the MIPS exe file to call the h/w through avalon
    • don't use memory for now
    • assume processor is halted
    • generate SOPC with gui for now - does it generate a tcl script?

Mark

  • submitting bug reports to LLVM for MIPS backend
    • 64-bit comparison
    • switch statements with >5 cases
  • still isolating adpcm bug
  • send processor flow to James
  • modifying chstone to fix these bugs and checking into git

Andrew/Victor

  • Victor checked in pre-llc pass into git
  • new mailing list: legup-commits@legup.org
  • propose to implement structs in 8 columns of one-byte wide rams
    • a generic byte addressable memory with a 64-bit word size
    • allows writing a single byte in one cycle
    • otherwise to write a single byte we'd have to:
      • read the whole word
      • mask the other bits, and modify one byte of the word
      • write the whole word
  • Jason: this will have very high power consumption
  • is there an altera ram with a mask?
    • usually SRAM controllers allow you to write just to the upper byte
  • most of the work is in the steering logic to handle byte addressability
    • could be reused for multiple ram columns or a single big ram

Ahmed

  • done perl script to gather area/delay for most operators
  • plot the results
    • curves should make sense
  • start getting power info
    • simulate with random vectors
    • only want the adder power so will need to account for the register and i/o pin power
      • measure the power of a baseline without the adder but with registers and i/o pins locked down?

Li

  • looking into relationship b/w C and Verilog
  • possible debugging methods:
    • h/w based: like signaltap
    • s/w based: use modelsim

May 14, 2010

Mark

  • at the hospital all of yesterday with girlfriend
  • reading paper describing 'Design Trotter' a profiling tool found by James

Li

  • reading chstone paper
  • brainstorming how we can pause modelsim simulation to debug
    • could we use verilog procedural interface (VPI) to call C code?
    • maybe only allow users to stop at the end of a state
    • might not make sense to pause in the middle of combinational logic statements
  • look into llvm-db: llvm's debugger

Ahmed

  • wrote perl script that runs quartus and parses area (ALUTs, registers) and reg-to-reg delay
    • supports signed addition, multiplication for 8, 16, 32, 64 bits
    • pass FPGA family on command line
  • Adding other LLVM operators: /, %, and, or, xor, >, ==
    • look at llvm assembly language reference
      • binary operations
      • bitwise binary operations
      • 'icmp' instruction

Victor

  • pushing changes to git repository for legup pre-link pass

Andrew

  • gave Victor access to git
  • new mailing list: legup-commits@legup.org
    • every time someone commits a message is sent to the list
    • will allow discussion of commits
  • give git access to Ahmed to add perl script and Mark to modify chstone

May 13, 2010

Ahmed

  • surround operations by registers
    • otherwise you'll route from I/O pins and have artificially high delays
  • set a high clock constraint and make sure tsu and tco constraints aren't set
    • so registers don't move towards pins

Victor

  • Steve: do we care about strcpy? is it useful to support?
    • we added a strcpy function to Dhrystone to avoid synthesizing the whole C library

James - Soc Bus Alternatives

  • ARM AMBA has fees for the IP core
  • Wishbone is only a specification, not enough value add
  • Avalon looks like best option
    • don't need to implement our own SOPC Generator
    • SOPC can be driven by tcl scripts
    • Tiger MIPS uses avalon buses
    • cannot be used on xilinx hardware (license)
  • we should allow Xilinx CoreConnect to be added in the future

Jason

  • top bullet: h/w designs offer compelling advantages
  • do we want people to think we're competing with Intel?
  • process node debate:
    • Intel:
      • 32nm in production (Core i3, Jan 2010)
      • 22nm expected Q4 2011
    • TSMC:
      • 40nm in production
      • 28nm volume production by Q3 2010
      • 20nm expected Q3 2012 (risk production) Q1 2013 (volume production)
    • Altera
      • Stratix IV (40nm) shipping
      • Stratix V (28nm) samples by Q1 2010, Quartus support by Q2 2010
    • Intel ahead of Altera (32nm vs 40nm) until Q3 2010
    • Altera ahead of Intel (28nm vs 32nm) until Q4 2011
    • Cool table (As of April 27,2010)
  • could beat Intel in very parallel applications: option pricing

Andrew

  • add the Legup flowchart slide
    • explain that LLVM is a series of passes
  • show results on Nios II/f
  • don't use jargon, SSA: static single assignment
  • don't publicize legup.org yet, first impressions are important
  • release set for early fall
  • double precision floating point adder in C sounds like one line

Desh

  • new project at Altera to use OpenCL to program FPGAs
    • OpenCL is a popular library for programming GPUs
    • allows user to program parallelism
  • Apple has added support to LLVM
    • Legup could be used to generate hardware kernels
  • more details on Monday

May 12, 2010

Jason

  • should be done jury duty by fri

Mark

  • all chstone benchmarks except jpeg run on processor (minus 5 optimization passes)
  • little-endian: just had to pass -el flag to llc
  • strangely 7 benchmarks passed even without the correct endian
    • assembly instructions aren't affected by endian
    • but the data section of assembly will change
    • endian determines the byte ordering, not the bits within a byte
    • so only affects data overflowing the first byte (>255)
  • update wiki:
    • changes to chstone so we can update git repository
    • command options to llc
    • which passes are missing
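
The point that endianness changes byte order but not the bits within a byte can be seen with Python's struct module. The value below is arbitrary and chosen only so that each byte is recognisable.

```python
import struct

value = 0x12345678

big = struct.pack(">I", value)      # most significant byte first
little = struct.pack("<I", value)   # least significant byte first

# Same four bytes, opposite order; each byte's bits are unchanged.
print(big.hex(), little.hex())

# Reading big-endian data as little-endian yields a different number.
misread = struct.unpack("<I", big)[0]
```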

Andrew

  • Send chstone changes to mark
  • working on scheduler
  • dry-run tomorrow: a couple slides
    • current status of our tool: using LLVM, etc
    • what we're working on right now
    • Steve: should we mention legup.org yet?
    • might want to wait until we're ready to release
  • give git access to Victor

James

  • looking into ARM AMBA bus
    • specification is free
    • must pay for IP cores that implement the bus
  • CoreConnect: unclear whether you can freely distribute
  • Leaning towards wishbone: free to distribute
  • presenting bus alternatives tomorrow
  • can we use multi-port caches?
    • See Eric LaForest's paper in FPGA 2010: Efficient Multi-Ported Memories for FPGAs

Victor

  • integration issues
    • separating out llvm and legup code into different folders
    • new exe for legup?
      • but we can pass target specific options to llc
      • or use environment variables for now
    • we modified llc to output “.v” file. Could we use -o instead?
    • integration isn't high priority so don't work on it now
  • SPECint is not free
  • dealing with structs is a higher priority
    • could make one big word sized ram like an actual computer memory
    • how do you deal with char arrays?
    • could choose not to support hierarchical structs or arrays of structs
  • check in memcpy and memset changes
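
One possible way to handle char arrays in a single word-sized RAM, sketched in Python: split the byte address into a word index and a byte lane. The 32-bit word width, the little-endian lane order, and the function names are assumptions for illustration, not the design chosen in the meeting.

```python
WORD_BYTES = 4  # assumed 32-bit RAM words

def read_char(ram, byte_addr):
    """Read one byte from a word-addressed RAM (list of 32-bit ints)."""
    word_index, lane = divmod(byte_addr, WORD_BYTES)
    # Little-endian lanes (assumed): lane 0 is the least significant byte.
    return (ram[word_index] >> (8 * lane)) & 0xFF

def write_char(ram, byte_addr, value):
    """Write one byte using a read-modify-write of the containing word."""
    word_index, lane = divmod(byte_addr, WORD_BYTES)
    mask = 0xFF << (8 * lane)
    ram[word_index] = (ram[word_index] & ~mask) | ((value & 0xFF) << (8 * lane))

ram = [0x64636261, 0x00000065]  # "abcde" packed little-endian
write_char(ram, 5, ord("f"))
text = "".join(chr(read_char(ram, i)) for i in range(6))
print(text)  # abcdef
```

In hardware, the read-modify-write in `write_char` would typically be replaced by per-byte write enables on the RAM, if the target memory block supports them.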

Ahmed

  • Reading a book on perl

Discussion

  • could have a separate cache for each accelerator and processor
    • flush the caches for now then add cache coherency later

May 11, 2010

Ahmed

  • Modelsim installed
  • Ran chstone benchmarks: 11/12 pass
    • adpcm fails (return value 1)
    • Victor had run into this problem and fixed it with a printf
      • probably a scheduling problem
  • Reading Gajski's high-level synthesis book chapter
  • Look into Cyclone II EP2C35F672C6 for now (DE2)
    • Manually get timing/area for verilog operators
    • Look into pipelined functional units
    • Use perl for scripting. Andrew can help.

Li

  • hasn't heard back about internet
  • Reading Gajski's high-level synthesis book chapter
  • msn: fireofnight@msn.com
  • might not make it tomorrow: having lunch with team at IBM

Mark

  • all 12 chstone benchmarks simulate correctly with optimizations
    • but 5 optimization passes must be disabled: simplifycfg, mem2reg, scalarrepl, LICM, GVN
  • 7 chstone benchmarks work on the processor
    • simulator is big-endian
    • processor is little-endian

James

  • SONIC bus
    • looks very similar to the other buses (CoreConnect, Avalon, etc)
    • hard to determine licensing. Must email them to receive details

Victor

  • pre-link pass finished
  • memcpy and memset can handle types other than 'int'
    • supports long long, int, char
  • note: currently only works for llvm-gcc not clang
    • alignment parameter is different for clang
  • observation: LLVM optimizes based on the number of elements in an array
    • < ~8: use individual instructions
    • 8-16: use llvm intrinsics (llvm.memset, llvm.memcpy)
    • > ~16: use a loop
  • could we make the backend pass a .so shared library?
    • actually we'll probably need a new exe for legup
    • to handle command options and file i/o

Andrew

  • working on scheduler
  • went through the legup presentation with Victor, Ahmed, and Li
  • send original chstone paper

Discussion

  • do we need another pass to generate the bus connecting the processor to hardware accelerators?
    • could be slow. incremental compilation?
  • wishbone has licensing advantages and works for both altera/xilinx
    • could just support avalon for now and let xilinx users rewrite the generation pass
    • similar to altsyncram
  • hardware accelerators
    • would they run at different clock frequencies?
    • which are masters vs slaves?
    • memory controller should be a master

May 10, 2010

Ahmed

  • setting up internet
  • installing modelsim
  • download legup and run chstone tests

Mark

  • chstone working with debug flag -g
  • disabled simplifyCFG optimization pass
    • 7/12 chstone benchmarks now work

James

  • still working on a course report due this friday
  • will present bus interconnection alternatives thursday
  • received IBM CoreConnect license
  • looking into SONIC SoC bus architecture
  • note: Mark's processor has an avalon bus
    • isn't it possible to connect an avalon bus to another bus type?
    • can you use avalon on xilinx?
    • do we need SOPC builder for verilog generation?
    • could generate one SOPC slave and modify this slave as necessary to support multiple hardware accelerators
      • avoids having to constantly regenerate in SOPC builder
  • does the license allow us to distribute Avalon or CoreConnect?
  • Design Trotter: profiler mentioned in paper
    • HW/SW Interface Impact on an Adaptive Multimedia System Performance: Case study
    • Gives a metric measuring the parallelism of a function
    • Used to decide which functions to convert into hardware

Li

  • msn: fireofnight@msn.com
  • contact Steve about internet
  • email ecehelp@ece.utoronto.ca? Li is not a registered student so they might not have him on file

Victor

  • added a new pass before llc, run with 'opt'
  • lowers llvm intrinsics like llvm.memset and llvm.memcpy to C library calls
  • modifying memcpy and memset to handle types other than 'int'

Andrew

  • working on scheduler
  • forward email about FPGA seminar this monday
  • show Victor and Ahmed the legup presentation given to Desh

Discussion

  • stall the processor while calling hardware accelerators
  • mux the cache between the processor and accelerators
  • Need a few passes to handle memory accesses from legup
    • Pass one:
      • Determine memory mapped addresses of hardware accelerators
      • Generate assembly (.s) and hardware LLVM bitcode (.bc)
    • Pass two:
      • Use scripts to convert assembly into an ELF file and extract the addresses of global variables
    • Pass three:
      • Convert hardware bitcode to verilog using the address information from pass two
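
The address extraction in pass two could look roughly like the Python sketch below. It assumes the ELF symbol table has already been dumped as text in the default `nm` layout (address, type letter, name); the function name and the sample symbols are hypothetical, and the minutes do not say which tool the scripts use.

```python
def global_addresses(nm_output):
    """Map global data symbol names to addresses from nm-style text."""
    data_types = {"B", "D", "R"}  # global bss, data, read-only data
    addresses = {}
    for line in nm_output.splitlines():
        fields = line.split()
        # Undefined symbols have no address column, so they are skipped.
        if len(fields) == 3 and fields[1] in data_types:
            addresses[fields[2]] = int(fields[0], 16)
    return addresses

# Hypothetical symbol table; names and addresses are made up.
sample = """\
00400000 T main
10000000 D input_data
10000400 B result
         U printf
"""
table = global_addresses(sample)
print(table)
```

The resulting name-to-address table is the kind of information pass three would need when converting the hardware bitcode to Verilog.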

May 7, 2010

Victor

  • creating a new pass before backend
    • will handle llvm intrinsics like memset and memcpy
  • also fixing memcpy to handle char types
  • will need to implement memset for dhrystone

Mark

  • all chstone benchmarks working in mips simulator with '-g' option
  • using optimizations '-O1' or '-O2'
    • mips and jpeg don't compile
    • all other benchmarks return incorrect results
    • investigating which opt pass is causing the problem
  • could create a simple C program to demonstrate the mips backend bug
    • see what llvm instructions cause the problem
  • could try -m32 to target a 32-bit machine from the gcc front-end

Andrew

  • add minutes page on wiki
  • working on scheduler.
    • states are no longer named after basic blocks
      • easier to correlate with modelsim

James

  • will request IBM CoreConnect license

Lee

  • Steve's new M.Sc. student from McMaster
  • will look into a debugging framework
    • display state machine and datapath in a GUI
    • debug in hardware and scan out registers?
  • Will look into:
    • Qt - C++ GUI framework
    • download legup and run testbench
    • does modelsim have an api we can hook into for debugging
    • possible to call C functions from verilog?
  • Eclipse plugin might be a possible framework
    • cons: uses java, heavy weight
  • Planning to sit in SF2206 with James

Jason

  • Send out Lee's email
  • Will IM for Monday meeting if not on jury duty

May 6, 2010

Mark

  • 10/12 chstone working in simulator
  • comparing 64-bit integers has bugs in llvm mips backend
  • working on dfdiv and then dfsin

Andrew

  • working on scheduler. refactoring functions to be <40 lines

Victor

  • dhrystone has unions and structs
  • need to add struct support to legup
  • altsyncrams have uniform data width, structs can have different field sizes
  • how do we store structs?
    • wide rams that store entire struct
      • cons: you must read the entire struct before writing a single field
    • make the ram as wide as the largest field of struct
      • wastes memory
    • one ram for each field
      • too many rams?
    • group fields by size. 4 integer fields are stored in one integer sized ram
  • need to support an array of structs
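
The "group fields by size" option can be sketched in Python as follows. The field names, bit widths, and the `group_fields_by_size` helper are hypothetical; this only illustrates the grouping idea, not an actual legup data structure.

```python
from collections import defaultdict

def group_fields_by_size(fields):
    """Assign each (name, bit_width) field to a RAM shared by fields of that width.

    Returns {bit_width: [field names]}; a field's index in its list is its
    address offset inside that RAM for one struct instance.
    """
    rams = defaultdict(list)
    for name, width in fields:
        rams[width].append(name)
    return dict(rams)

# Hypothetical struct: four int fields, two chars, one long long.
record = [("a", 32), ("tag", 8), ("b", 32), ("c", 32),
          ("flag", 8), ("total", 64), ("d", 32)]
layout = group_fields_by_size(record)
print(layout)
```

Under this scheme the four integer fields share one 32-bit RAM, as in the example from the discussion. One way to extend it to an array of structs would be to place field `j` of element `i` at address `i * len(layout[width]) + j` in the RAM for that width.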

James

  • last presentation today. then done courses
  • will send out report on bus interfaces done for Jason's course
  • looking into licensing for various bus interfaces
  • can avalon be used on xilinx fpgas?
  • what are the restrictions on IBM CoreConnect, etc.
  • will try out a few bus interfaces to measure performance
  • no papers gave performance information: highly application specific
  • will present results at a future meeting

May 5, 2010

Wiki setup: http://legup.org/wiki/

  • has instructions to set up Modelsim SE

Victor

  • Setting up desk
  • Need an ethernet cable
  • move scanner computer and Thierry's old computer
  • looking into Dhrystone
    • could be issues with string functions: strcpy, strcmp

Mark/Jason

  • 6/12 chstone benchmarks compile with llvm and run on GXemul mips simulator
  • GXemul mips simulator is reliable and useful for debugging (supports function traces)
  • long term issues with llvm mips backend
  • many serious bugs, e.g. not allocating enough stack space for local variables
  • do we fix the backend ourselves?
    • very complicated, big time commitment
  • an alternative is an llvm/gcc mashup
    • split code into hardware and software portions
    • use gcc to get mips executable
    • use llvm to get hardware
    • scripts to combine the two
    • would probably take as much time as fixing the llvm backend

Andrew

  • will show presentation given to Desh on Monday (Ahmed hopefully will be better)
  • working on scheduler pass

Jason

  • jury duty next week
  • can skype on monday
meeting_minutes.txt · Last modified: 2011/06/03 14:45 (external edit)