June 3, 2011
June 3 Meeting minutes
[BB] = put on the back burner
Topics:
Throughput driven scheduling.
Profiling-driven scheduling.
Loop pipelining/clever function inlining
SDC Implementation
User constrained scheduling (pragmas)
[BB] Partitioning program to software vs hardware
Binding
sharing functional units good only for large, possibly for smaller units chained together?
4 LUT vs 6 LUT
collapse mux into functional unit
[BB] Other HW architectures
HLS ↔ memory architecture interactions
[BB] auto parallelising function calls
LLVM compiler passes
Clang
Speculative scheduling
Outcomes:
Andrew:
Jason:
Stefan:
Kevin:
Leave LLVM passes till later
Post GUI/debugging framework as ECE design project (2 man team)
June 1, 2011
Research areas:
µP/Accelerator Interface
Parallel µP/Accelerators
1. µP/Accelerator Interface
DE4 Port
Ways of talking to RAM (multiport and multipump)
Multiport:
Currently, if two accelerators access the single-port cache they block each other, so two ports should improve performance. This was not observed because off-chip memory is slow, hence the switch to DE4, which uses DDR2
Extreme cases are 1 port (Avalon arbitration) vs. one port per accelerator. In between, could have e.g. 2 accelerators / port, saving area since # RAMs = # input x # output ports
We could give priority to accelerators that are computation bound; this requires memory access profiling
Multipump / Multi-Clock Domain:
Parallelization Schemes
Partition Program
Partition Data
Pre-Fetching
Multiple Caches / customization of cache parameters
Cache Size
Memory Access Profiling
LLVM pass
Analyze Parallelism
Priorities
Projects for Stefan and Kevin
Cache Simulator for memory access profiling: simulate CHStone designs in modelsim and save memory accesses (address being accessed and where it is placed in the cache) in a text file, then build cache simulator
Benchmarks: analyze the rest of CHStones and also e.g. tiled matrix multiply
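A minimal sketch of the cache simulator project described above, assuming a direct-mapped cache and a text trace with one hexadecimal byte address per line. The trace format, the cache geometry (512 lines of 16 bytes), and all function names are assumptions for illustration, not the group's actual design.

```python
def simulate_direct_mapped(addresses, num_lines=512, line_bytes=16):
    """Replay byte addresses through a direct-mapped cache; return (hits, misses)."""
    tags = [None] * num_lines          # tag stored per cache line, None = invalid
    hits = misses = 0
    for addr in addresses:
        block = addr // line_bytes     # memory block containing this byte
        index = block % num_lines      # cache line the block maps to
        tag = block // num_lines       # identifies which block occupies the line
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag          # fill the line on a miss
    return hits, misses


def read_trace(path):
    """Parse a text trace with one hex address per line (assumed format)."""
    with open(path) as f:
        return [int(line, 16) for line in f if line.strip()]


# Same line twice, then a conflicting block, then the first block again.
example_trace = [0x0000, 0x0004, 0x2000, 0x0000]
print(simulate_direct_mapped(example_trace))
```

Sweeping `num_lines` and `line_bytes` over the same trace gives a quick way to compare cache sizes before touching the hardware.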
August 10, 2010
Andrew
Looked at jpeg, some output makes no sense
Keep jpeg for now, email CHStone about jpeg and a golden result
Start writing paper after returning from UBC
Victor
Ahmed
Mark
James
Long discussion about second data points for paper
Move 3rd most compute-intensive function to hardware
Move a different function if the 3rd is very similar to 2nd function
Try crossing clock domains on ModelSim
Jason
Added PC speculation for branching, moved some ID into IF, etc…
Removed exception handling mux to save one logic level
Basically performing manual retiming on critical path
August 6, 2010
James
blowfish is working on the hybrid system, currently working on jpeg
Look into crossing clock domains for a hybrid flow
Try more than one accelerator at once
Some benchmarks jump into the accelerator multiple times and this is working
Want 2nd data point for September paper
Victor + Ahmed
August 5, 2010
Ahmed
Pipelined division: easy to change number of pipeline stages
Pipelined multiplier is more difficult
Division is functional, working on multiplication
Andrew
Some questions after presenting:
How do we measure parallelism?
How much of our performance is due to LLVM?
How do we know if/when the final output from LegUp is good enough for practical use?
August 4, 2010
Mark
Preliminary power results are looking good, VanProf beats SnoopP by a few times
Output is not correct though from either profiler
James
Cache fix works on motion, trying out aes
BuildBot nightly run: 23/27 tests work on navy.eecg, 17/27 work on Andrew's computer
Victor + Ahmed
Victor
Jason
Retiming analysis changed the critical path, which isn't present in the HDL
Without register retiming, FMax ~ 55MHz
Without register retiming, the critical path is (chained):
32-1 mux for register decoding (2 in parallel)
32-bit register value compare
Addition of PC to the branch offset
Possible fixes:
August 3, 2010
Mark
Hierarchy profiling works
Working on measuring the power of VanProf
Read through other profiler paper from DAC 2009
Monitors loops (short backward branching) across functions and context switching (multi-threading)
Area: ~10% of ARM processor, hard to compare with FPGA area
Only monitors loops, not functions
Can filter out user-specified functions
Uses its own cache to store data, about 98% accuracy
Monitors PC, similar to VanProf
No power data
Loops could be a future target for VanProf
Ahmed
Tried cache enable fix on board
Does not work when reused on accelerators
May want to remove ModelSim warnings for Tiger (but tuning Tiger is higher priority)
Jason
James
Pushed new processor on git
adpcm benchmark now works
Working on jpeg benchmark
Give short summary on progress on Thursday
Andrew
Victor
July 27, 2010
Mark
Ahmed
Victor
Andrew
looking at build bot performance graphs (like Chrome has)
will want to script HW/SW results overnight when all parts are ready
will investigate accelerating different functions for sha
James
Jason
we need to be able to distinguish between accelerating functions with descendants and leaf functions
if number of accelerated functions > number of functions in benchmark, just use the all-HW value
normalize x-axis with % exec time, not number of functions (i.e. 30-40%, 40-50%, etc.)
July 26, 2010
Ahmed
Li
Jason
James
SHA benchmark works with accelerator flow
working on 64 bit data accesses (caches, etc)
cacheClkEn signal in cache controller screws up 3 benchmarks
His verilog code stays the same across different benchmarks, just small bugs pop up with new benchmarks
Andrew
Problem with new LLVM-gcc, debugging
build bot installed, waiting on new port from eecg
UBC presentation August 12
Weighted bipartite matching to be implemented next (GAUT uses this, seems to be most popular algorithm)
Possible optimization is to trade mux for register to save value to next state
For LegUp release, will have force-directed scheduling & weighted bipartite matching binding
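A minimal sketch of binding as weighted bipartite matching, assuming a square cost matrix where `cost[i][j]` is the cost of binding operation `i` to functional unit `j`. The cost values are hypothetical, and the brute-force search over permutations is only a stand-in suitable for tiny examples; a practical implementation would use a polynomial-time algorithm such as the Hungarian method.

```python
from itertools import permutations


def min_cost_binding(cost):
    """Return (total_cost, assignment); assignment[i] is the unit bound to operation i."""
    n = len(cost)
    best = None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if best is None or total < best[0]:
            best = (total, perm)
    return best


# Hypothetical costs, e.g. extra mux inputs needed for each operation/unit pairing.
cost = [[4, 1, 3],
        [2, 0, 5],
        [3, 2, 2]]
print(min_cost_binding(cost))
```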
July 22, 2010
Mark
Li
Andrew
Will make separate class for muxes
Set up buildbot last night
Will set up graphing component which shows metrics vs. revisions
July 21, 2010
Jason
GProf on x86 is what everyone does
Tried to get dfsin through eXcite
Reduced errors from 35 to 15
Got an extension for the eXcite license till October
Also tried pushing dhrystone through eXcite
Li
Divider in Tiger looks like it's fully pipelined
Creating a wrapper
Got a clean version of Tiger which is working now
Going to China on Tuesday, and returning sometime in August
Mark
Ahmed
Compared results to estimation
Will have new results by tomorrow
Will also try synthesis keep to see if results get better
James
July 20, 2010
Andrew
Looking at first binding cut
Victor's chaining making merge a bit difficult
Looking at more binding implementations
Need to make scheduler more aware of binding
Will set up build bot to regularly test new commits or versions of LegUp
Victor
Ahmed
Will look into what resources can be shared
Will look into mux area growth
Also will push a couple of simple designs through LegUp to check:
Looking into a full no-prune option for Quartus
Mark
Can take down Tiger's area by removing redundant operations
Still trying to change critical path with memory in it
Is there a way to select target architecture in GProf?
Area for VanProf almost matches earlier estimate (~3%)
Li
James
Will look into getting chstone running through Tiger with Mark
Will then try to setup a test suite
Also lending a hand to Li to get Tiger up
July 19, 2010
Mark
Has areas for the profiler (VanProf)
Currently verifying results
Made scripts to verify results against other profilers (gprof etc.)
Will get power results for profilers after results are verified
Tiger's critical path is in a FIFO
Would like to increase Tiger's frequency to 50MHz asap
Changing Tiger's divider to one with more pipeline stages shouldn't be too hard
Li
Still looking at improving Tiger
Might be able to reduce area by removing signed multiplier/divider
Looked at Tiger's 11-stage pipelined divider and found out how it is set up
James
Moved cache out of Tiger
Transfer became slower when crossing clock domains
Can play around with Avalon's settings to try and increase the transfer speed
Regardless, accelerators can now access the cache which is beneficial
Andrew
Ahmed
Profiled no-dsp multipliers
Added the code that uses them into estimator
Will try to find a way to specify whether each multiplier is LUT based
Working on a script that parses estimator's results
The results will be posted soon
July 9, 2010
Mark
Implemented Profiler in hardware
It matched C++ Profiler results
Still some small subtleties to take care of
Might have to write Snoopy code to test against
Might be able to use LLVM flow to get information about the C programs
Will chstone suffice as tests for the profiler?
Will be interesting to find instructions/cycle for Tiger MIPS
Also instructions/cycle for chstone using the profiler
What might cause less than 1 instruction/cycle?
Profiler should (and will) be able to track multiple metrics
Ahmed
The output bitwidth of the profiled multipliers was the reason for the Fmax discrepancy
MIPS timing result more accurate now
Looking at other benchmarks and debugging results
Added new LUT count in order to have both register and combinational components of LEs
Will run the script and get complete data even for pipelined multipliers/muxes
Maybe put estimation/area code into its own pass?
Will simulate power without sdf file to see difference when glitches completely ignored
James
When processor polls for done signal, runs the cache's state machine
Should we make the cache dual port?
Or would it be easier to create a “busy” signal from the cache
For the time being, will do “blocking” implementation
To support multiple accelerators:
We eventually will have to either do multi-port caches
Or we can just use a single-port cache with arbitration
Or dual-port cache with arbitration for the accelerators
Victor
Modifying ASAP scheduler to be more modular
Making it easy to add a new scheduler pass
Working on ALAP scheduler as well
We want to give multiple schedules based on resource constraints
Force-directed list scheduling considers hardware constraints
as H/W constraints would be more efficient than human specified constraints
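A minimal sketch of ASAP and ALAP scheduling on a small dependence graph, assuming unconstrained resources and a per-operation latency in cycles. The graph, the latencies, and the function names are hypothetical; the difference between the two schedules (the mobility) is the slack that force-directed scheduling works with.

```python
def asap(deps, latency):
    """deps[op] lists the ops that must finish before op starts. Returns earliest start cycles."""
    start = {}

    def visit(op):
        if op not in start:
            start[op] = max((visit(p) + latency[p] for p in deps[op]), default=0)
        return start[op]

    for op in deps:
        visit(op)
    return start


def alap(deps, latency, length):
    """Latest start cycles such that every op still finishes by cycle `length`."""
    succs = {op: [] for op in deps}
    for op, preds in deps.items():
        for p in preds:
            succs[p].append(op)
    start = {}

    def visit(op):
        if op not in start:
            start[op] = min((visit(s) for s in succs[op]), default=length) - latency[op]
        return start[op]

    for op in deps:
        visit(op)
    return start


deps = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"], "e": []}
latency = {op: 1 for op in deps}
early = asap(deps, latency)
length = max(early[op] + latency[op] for op in deps)   # schedule length from ASAP
late = alap(deps, latency, length)
mobility = {op: late[op] - early[op] for op in deps}
print(early, late, mobility)
```

Here only `e` has non-zero mobility, so it is the one operation a resource-aware scheduler could move to relieve pressure on a shared unit.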
Andrew
July 8, 2010
Tom
Victor
presentation on structs
LegUp struct support is equal to or better than all HLS tools we've looked at
doesn't support passing structs by value to a function or returning structs from a function
memory controller:
Tom: can run NiosII at 250-300MHz if you separate the processor from the rest of the system
Steve
Ahmed
mismatch between Legup 64bit multiply fmax and script
area estimate: usually estimates slightly less than actual
Tom: by default Quartus converts state machines into one-hot encoding
James
Li
Andrew
Mark
July 7, 2010
Ahmed
Jason
eXcite df* benchmarks:
the input/output vectors were initialized wrong: upper 32 bits mangled
problems seem to be related to 64 bit operations
fmax results were using CycloneII Auto device with speed grade 6
Li
July 6, 2010
Mark
C++ version of profiler working
reads ELF file, runs gxemul, captures instr trace, uses hashing algorithm, counts each function call
verifying counts
profiler doesn't take into account cache misses
could use this profile info to drive C2H flow
Ahmed
debugging early delay estimate
area estimates
read some papers on high level synthesis early estimation
fast estimation could allow us to produce multiple verilog files:
Jason
Andrew
Victor
looking into Tiger modelsim simulation on Linux
reading force directed scheduling paper
will need to implement an as-late-as-possible (ALAP) scheduler
James
Li
Undergrad presentations are on August 11.
July 5, 2010
Mark
James
basic C2H architecture is working
modified TigerMIPS data cache to be shared with accelerators
added 2 slave ports to Tiger dcache. added 2 master ports to accelerator
1 port: address
1 port: data
architecture can support multiple accelerators without new ports
uses polling (like C2H)
add a tcl option for waitrequest - less power, and slightly less cycles than polling
every cycle Tiger reads from cache - wastes power
So we'll have the functionality of C2H
tcl file specifies functions to accelerate
how does the user modify the C wrapper to allow processor and accelerator to run concurrently?
modify the generated wrapper
generate an async wrapper and then a polling function
create pthread to simulate parallelism
use SystemC
use interrupt?
how to put everything in h/w?
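A minimal sketch of the "async wrapper plus polling function" option listed above, using a thread to simulate the accelerator running concurrently with the processor. The `Accelerator` class and its method names are hypothetical and only model the control flow in software; they say nothing about the generated C wrapper itself.

```python
import threading


class Accelerator:
    """Software stand-in for a hardware accelerator (hypothetical model)."""

    def __init__(self, func):
        self._func = func
        self._done = threading.Event()
        self._result = None

    def start(self, *args):
        """Asynchronous wrapper: launch the function and return immediately."""
        self._done.clear()

        def run():
            self._result = self._func(*args)
            self._done.set()

        threading.Thread(target=run).start()

    def poll(self):
        """Polling function: True once the accelerator has finished."""
        return self._done.is_set()

    def result(self):
        """Block until done (analogous to stalling on the done signal), then return."""
        self._done.wait()
        return self._result


acc = Accelerator(lambda a, b: a + b)
acc.start(2, 3)                      # processor continues while the accelerator runs
other_work = sum(range(10))          # unrelated processor-side work
print(acc.result(), other_work)
```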
July 1, 2010
Ahmed
James
Victor
Andrew
June 30, 2010
Ahmed
Dynamic Power is linearly proportional to frequency.
So if we get the dynamic power at 1MHz we can use this formula to calculate power:
We can run a non-power aware scheduler to get a rough estimate of fmax, then iterate to achieve better power
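Linear proportionality implies a simple scaling from a 1 MHz measurement, sketched below. This is an assumption-based illustration that covers dynamic power only (static power does not scale this way); the function name and the example figures are hypothetical.

```python
def dynamic_power_mw(p_dyn_at_1mhz_mw, f_mhz):
    """Scale dynamic power measured at 1 MHz linearly to a target frequency in MHz."""
    return p_dyn_at_1mhz_mw * f_mhz


# Hypothetical numbers: 0.5 mW of dynamic power at 1 MHz, target clock of 50 MHz.
print(dynamic_power_mw(0.5, 50))
```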
Victor
Pointers have been shrunk to 32 bits, dhrystone is running correctly
Looking into getting the Tiger modelsim flow running on linux
Mark
James
Li
met with Tom to discuss Lab 6
recommends teaching the students modelsim, possibly with a provided testbench
Andrew
June 29, 2010
Short meeting today
General
Victor and Ahmed will participate in the “undergraduate research day” in August, and will prepare an abstract for submission and send it out to the group for comment.
Some people would rather work on Thursday (Canada Day) instead of Friday.
On Monday, we'll have a brainstorming session on the options for processor/accelerator interface and the way that users could control that.
Victor
Fixed switch statement problems in MIPS back-end of LLVM. Communicated the patch to LLVM developer who maintains the back-end.
Will start reading a couple papers on scheduling, with a view to improving the current scheduler. A start point might be to implement as-late-as-possible scheduling.
Ahmed
Added command line parameters related to power characterization to his script. User can set # of vectors. User can set toggle probability for primary inputs.
Currently working on some minor bugs.
Most likely will start working on early delay estimation, early area estimation. I.e., actually using the data from his models in high-level synthesis.
Andrew:
June 28, 2010
Andrew:
Working on binding infrastructure: separating binding from RTL code generation. The circuit will be represented in memory using a generic representation not tied to any RTL language. The intent is that in future, we may want to add other back-end code generators, e.g. VHDL.
Mark pointed out that SPREE may have some related work.
Mark
Generating #s for the CHStone benchmarks running on Tiger: cycle count. Will post results on-line when ready.
Seeing that some of the CHStone circuits are producing incorrect results.
Also working on his profiler. See that 2/3 M4K RAMs will be sufficient
Ahmed
Writing script for power characterization of hardware modules.
Going to add parameters to control input activity, # of vectors to simulate for. The user will be able to specify a toggle probability for the primary inputs.
For at least a few modules, will compare the power #s with and without the full delays (affects glitches).
Victor
Fixed “select operator” ( ? : ) MIPS code generation in LLVM. Fix also seems to repair the 64-bit comparison bug. Isn't sure if these bugs are indeed fixed.
Wrote a script to alter LLVM MIPS assembly to fix the switch statement bug.
Filed bugs against LLVM MIPS back-end.
All the CHStone benchmarks work through LLVM, targeting MIPS, with -O3 compiler optimizations.
Found the DMIPS for dhrystone has increased, is investigating why.
James
Working on writes to memory for passing parameters/data between HW / SW.
We agreed that we should have a brainstorming session on the key interfaces we should provide for the processor talking to the HW accelerators, taking the best ideas from C2H & eXCite.
Li
ECE342 labs are done; he is still debugging Lab 6.
Going to start analyzing the performance of Tiger MIPS, see if anything obvious that can be done to improve the crit path & clock freq.
June 22, 2010
Andrew
Will have internal presentation on Binding.
Ahmed
Mark
The NIOS is working now.
Blocking at one spot (large mux latency)
Have an accurate cycle result now
Will have a comparison table for NIOS vs. Tiger
Li
June 21, 2010
James
Fixed the problem with M4K (some Altera option can double the size of M4K)
Measured the performance of Tiger MIPS (4X slower than NIOS)
The measurement is done in seconds; it would be more accurate in cycles.
Ahmed
Tried the NIOS II running at high frequency, about 130MHz.
The optimal configuration can be run at 160-170MHz.
Cyclone II can support up to 50MHz. If we want to make it higher, we need to tune it ourselves.
Andrew
Mark
Implementing the system; stuck at some spots, which should be easy fixes.
Once all the parts are ready, it will be easy to implement SnoopP and compare between them.
Li
June 18, 2010
Mark
James
MIPS benchmark + processor system works
Obstacles
mips.v large, had to figure out timing
this process can be generalized/automated – next benchmarks will be easier
each accelerator has own SW wrapper
data cache + ins. cache = ~70/105 M4Ks
some benchmarks don't fit on chip with pure HW
Li
Ahmed
Interface with TCL set up
using an operations class that contains the data like fmax, latency, etc
legup_config class that contains all operators and can access them
run NiosII F with speed opt, different seed, fmax constraint, etc.
Victor
debugger works, run on chstone
all chstone match traces exactly, except GSM on old LegUp code, which shows error with phi nodes
Andrew
June 16, 2010
Ahmed
Nios 2 benchmarking done
data + walkthrough up on wiki
area + fmax will be posted
modelsim gives deterministic results for performance counter (wasn't deterministic on chip)
Victor
James
Jason
Top Priorities
Lower Priorities
Timing Analysis
Power data (Ahmed)
Tune Tiger MIPS
June 14, 2010
Andrew
finishing up accelerator wrapper code (cleaning up and separating)
move modified code (ie llc, opt) to own folder
James
working on Mips benchmark in processor+accelerator system
accelerator alone works, but starting with processor causes wrong output and array-out-of-bounds error
Ahmed
will get delay results from modelsim (using $time) to confirm Nios's profiling results (since non deterministic)
modelsim still giving issues with Nios system
all benchmarks but SHA are working on Nios
Victor
Mark
outlining schedule of tasks
simplify hashing as much as possible
get profiler working in modelsim
working on the data transfer between PC & FPGA for initializing profiler & retrieving data
Li
Jason
June 11, 2010
Andrew
Added a tcl configuration file
Legup will now require TCL library
Calls to the accelerated function will be replaced by calls to a wrapper function
Thought of an optimization for Victor's memory code
Made changes with James to overcome latency caused by wait request signal when accessing SRAM
LLVM's 2.7 update caused some performance results to decrease
Ahmed
Got all data for NiosII-f
Will be tweaking NiosII to match Tiger MIPS not other way around
Benchmark results are not constant (vary by ±1%)
Seeing if Modelsim results are constant
Trying Modelsim rather than Modelsim-Altera
Will be working alongside Victor on helper class
Will be adding register support in Profiler afterwards
Mark
Jason
Victor
June 9, 2010
Ahmed
Looking into disabling printf/UART overhead for the benchmarks
All 13 benchmarks ran through NiosII successfully
Looking into profilers for the NiosII
Jason
Victor
Looking into more useful benchmarks
Should look into Mediabench
Should we get a floating point benchmark?
Does CtoH support structs? Do the others?
Mark
June 8, 2010
June 7, 2010
Jason
Ran 3/12 CHStone benchmarks on eXCite (mips, aes, sha)
eXCite can run on multiple families and has options for optimization/pipelining
Will do a demo of eXCite on Thursday, 24th
Mark
Outlined presentation for Thursday
Created diagrams and notes
Looking into how to implement power profiling
Can try looking at power per instruction class
Looking into implementing CAMs for increase in speed
Slight increase in area/power as well
Might not be supported in CycloneII, but is in high-end families like Stratix
Can try to build them through logic
Can just implement later when moving to high-end families
Will map thoughts out on blackboard/slides next meeting
Ahmed
Should check Registers vs. BRAMs in power profiling
Check data on comparators (which Mark might need)
Will measure NIOSII DMIPS
Decided on output file layout with Victor
Victor
June 4, 2010
Mark
Most of the profiler design planned out
How to map address in memory to corresponding functions
Some sort of hash table can be used (Snoopy method)
Conversion table that indicates which address correspond to which function, then binary search into the table
Started to code verilog
Will create flow chart to show flow of profiler
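A minimal sketch of the conversion-table option described above: a sorted table of function start addresses, searched by binary search to map each program counter value to its function. The addresses and function names are hypothetical, and the sketch counts sampled PCs per function rather than function calls.

```python
from bisect import bisect_right
from collections import Counter

# (start address, function name), sorted by address; values are hypothetical.
table = [(0x0000, "main"), (0x0120, "sha_transform"), (0x0400, "memcpy")]
starts = [start for start, _ in table]


def function_at(pc):
    """Return the function whose address range contains pc."""
    i = bisect_right(starts, pc) - 1    # last entry with start <= pc
    if i < 0:
        raise ValueError("pc below first function")
    return table[i][1]


def profile(pcs):
    """Count how many sampled PCs fall into each function."""
    return Counter(function_at(pc) for pc in pcs)


print(profile([0x0000, 0x0004, 0x0124, 0x0404, 0x0128]))
```

A hash table trades this logarithmic lookup for a constant-time one at the cost of storage, which is the trade-off between the two options in the notes.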
Ahmed
Went through 4 tutorials to learn about Modelsim, SoPC Builder
Will run 12 CHStone benchmarks and Dhrystone on Nios to compare performance against Tiger
Victor
Jason
Tried out Excite on MIPS benchmark
MIPS goes through and shows correct result
MIPS takes 4000-5000 clock cycles to complete for code generated by Excite, ~11,000 clock cycles for code generated by Legup
Trying to push JPEG benchmark, really slow to compile (just compiling to verilog) on Excite, whereas Legup takes 4 seconds.
Runtime may be another metric for paper
Excite has a nice GUI which generates a state diagram, and clicking on each state shows which line of C code it corresponds to
Excite only supports one C file
James
Got SRA example working with having accelerator directly access main memory
Memory access (SDRAM) is really slow, takes around 20 clock cycles, will be a huge bottleneck
For SRA, getting 2 inputs takes ~50 clock cycles, bursting reduces this to ~20 clock cycles
Burst through and grab all inputs needed in the beginning and store them in a FIFO; access the FIFO during computation
Li
June 2, 2010
Ahmed
Done pipelined dividers and multiplexers
Multiplexers are really fast, don't need to pipeline
Discussion on Muxes (a big Mux may be larger than the shared component (adder), hence may be less total area to replicate the component rather than use mux)
Discussion for future paper:
On one end of the spectrum, we can have software only solution, tiger proc running everything in software, on the other end, full hardware solution, Legup synthesizing everything to hardware
Along the spectrum, we can have successive different configurations with more or less functions synthesized to hardware
This will create a good latency vs area graph for paper
We need benchmark which show advantageous results for the hybrid solution
Chstone was created for HW only solution (does not contain things that should be run in SW, i.e. malloc, linked lists), hence may not exhibit good results for the hybrid solution
June 1, 2010
Short meeting today
Ahmed
Victor
James
Ease of use of C2H: click on the function to accelerate; building the project runs SOPC and Quartus in the background, which automatically makes the connection to Avalon and synthesizes
An Avalon master port is created for: (1) pointer dereference (* operator), (2) indexing into an array ([ ] operator), (3) indexing into a struct or union (. or → operators), (4) use of a global or static variable
Cache is flushed whenever accelerator, which has access to same memory as CPU, executes
May 31, 2010
Andrew
Fixing bug caused by updating to LLVM 2.7
Hack fix for now, all benchmarks work except for GSM
Not checked in yet, thus current version in git fails
Will check in code for C wrapper
James
Ahmed
Updating src code to test pipelined multipliers
After getting results for pipelined multipliers, will move on to dividers, muxes, then onto power
Jason suggested building a helper object in C++ which accepts LLVM IR as inputs and outputs area, latency, power
Mark
Trying to outline the profiler design
Planned out interface b/w profiler/processor, user/processor
Will modify mipsload/debug stub for communication
Problems with compiling mipsload due to unknown libraries being used (part of binutils package)
Snoopy more or less mapped out, which will be useful as a baseline design
Li
Victor
Still working on nested structs/arrays
Rewrote some of Andrew's code for nested arrays and this code was checked in Thursday
Trying to get better, working from home today
May 28, 2010
Li
Has gone through the first two 342 labs.
Considering improvements to recommend.
The former 342 TAs in the room mentioned that labs 1 & 2 were “do-able” by most students.
Li: Possibly could give students “skeleton” code that they would modify, rather than having them take a from-scratch approach.
Li will formulate suggestions, communicate to Tom & Steve.
Discussion with Davor around Tiger MIPS
Davor suggested we consider SPREE which is from UofT and would allow us to use something “invented here”, thereby raising its profile in the research community.
Mark showed Davor his analysis of the different processors considered.
SPREE appears to have smaller area and faster performance than Tiger MIPS.
SPREE supports less of the MIPS instruction set than Tiger. SPREE doesn't support the div instruction, and some others.
SPREE has no cache / memory architecture and only uses the on-FPGA block SRAM memory, therefore cannot accommodate big programs.
The extra functionality of Tiger likely explains its larger size versus SPREE.
Tiger already works with Avalon, SPREE hasn't been tried in that context.
Tiger is used in courses in Cambridge and elsewhere and is therefore likely well-vetted/tested.
SPREE doesn't offer the complete development/debugging “ecosystem” that Tiger comes with.
Davor mentioned that Kaveh (Moshovos' student) is building his own NIOS.
Andrew pointed out that there is no LLVM back-end for NIOS.
After hearing the pluses/minuses, Davor agreed that Tiger is the best choice for us.
General agreement that we want to use UofT-based infrastructure, where possible and convenient.
Davor proposed we have some discussion around FPGA soft-processor usage within the different research groups in ECE. Jason: Perhaps one or more of the monthly FPGA seminars could be used for this.
Idea for a small project: analyze the critical path of Tiger and see if perhaps some simple Verilog changes can be applied to improve its speed.
Discussion around courses ECE241/342
Davor suggested that both courses be moved to System Verilog, as it has some clarity advantages / ease versus Verilog.
It was pointed out that System Verilog may not have much traction in industry, versus standard Verilog and VHDL.
Davor suggested the project be cut from 241.
Jason mentioned Steve's proposal that the 241 project be made optional (with potential for bonus/incentive), with the alternative to the project being additional labs.
May 26, 2010
Andrew
Pragmas can only be supported with clang using IR metadata (latest version of LLVM)
Use an external config file instead?
Config files cannot be as fine-grained as pragmas, ie for loops, blocks of code, etc…
In the long term, we can support both
We will need a config file anyways for timing/power data (Ahmed's stuff)
Start with hw/sw partitioning before allocation, binding so James has something to work with
James
Got simulation working with MIPS processor and Avalon bus
Try sra example next
Value returned on Avalon bus
Figure out hardware accelerators writing to main memory (later)
Mark
Victor
Ahmed
Created wiki page for perl script usage
Links to generated output files and graphs on wiki too
Working on Thursday's presentation
Li
May 25, 2010
Andrew
Work on pragmas (focus of this week)
Checked in modularized scheduler code
Read up on C to H, how they handle pragmas
James
Look at how C to H handles the bus
Prototyped with NIOS (via polling) and a simple Verilog adder via Avalon bus
Work on sra next
Handshaking mechanism via memory-mapped registers (read/write, also a wait signal may be necessary on the Verilog side)
Mark
Check in Tiger MIPS flow into git when ready
Check in Windows (working) and Linux (almost working) toolsets
Problem in jpeg benchmark with exit call
Develop libraries for system call-like functionality as we go, ie main return value
Function lowering, ie printf
Try MIPS libraries on Cyclone II
Processor should eventually support rand function, etc…
Start profiling this week
Ahmed
Different device output errors under control
TimeQuest gets stuck on fat dividers
Create an html page with graphs per device
Make short presentation on Thurs (powerpoint)
Register-to-register delay working for all devices
Look into pipelined multipliers, dividers, mods (report how many cycles)
Start power analysis this week
Ask Tom for advice on power
Victor
Commit current struct code
Make struct code more generic (arrays of structs, structs of structs…)
Linked list support? Aka traversal, but not allocation
malloc and free probably not worth it to implement in hardware
May 21, 2010
Andrew
James
Mark
fixing a bug with jpeg benchmark running on mips
sent bug reports to LLVM bugzilla:
filed bug with legup on 64-bit:
Li
Ahmed
parsing problem:
No DSP elements on Cyclone II, just embedded multipliers
switched to TimeQuest instead of TAN
make sure delay is reg-to-reg and not i/o pin to register
Power:
vectorless: probably need activity factors, the toggle rate at each node
vector: uses a .vcd (value change dump) file as input
Could generate a few data points:
Victor
Talk today: Michael McCool
May 20, 2010
Andrew
new scheduler code is done
next: will look into h/w s/w partition pragmas
need documentation for core data structures (like ABC)
Mark
Victor
structs with non-uniform primitive types work
struct assignment works
struct initialization of mif files works
struct support will be a great bullet point in the paper
James
created small h/w component attached to the avalon bus
waiting on Mark to get fixes to Tiger mips
lumping function arguments on the bus for now
bus width max is 1024 bits
multi-pump: use multiple cycles to send function arguments
does each slave to master connection have a different data width?
optimal speed is when bus width = 256 bits
stall processor execution with an avalon waitrequest
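A minimal sketch of the multi-pump idea above: if the function arguments are lumped together and sent over the bus, the number of transfer cycles is the total argument width divided by the bus width, rounded up. The function name and the argument widths are hypothetical, and the sketch ignores any per-transfer bus overhead.

```python
def transfer_cycles(arg_bits, bus_width_bits):
    """Cycles needed to send lumped arguments over a bus of the given width."""
    total = sum(arg_bits)
    return -(-total // bus_width_bits)   # ceiling division


print(transfer_cycles([32, 32, 64], 256))    # fits in a single 256-bit transfer
print(transfer_cycles([64] * 5, 256))        # 320 bits need two transfers
```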
Li
Ahmed
generated delay plots overnight
if quartus project exists script just parses the quartus .rpt files
eecg server mint is 3x slower than laptop (even with 4 Xeon 1.6GHz machines)
to run overnight use 'screen'
try compiling without DSP blocks
future: use delay lookup to perform timing analysis of legup output
loop over all states, check dependencies between instructions, lookup delay
also could put multiple operations in a state based on the slack available
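A minimal sketch of the timing analysis outlined above: for each state, follow the dependencies between the instructions scheduled in that state, look up each operation's delay, and take the longest chained path. The state contents, dependence edges, and delay values (in ns) are hypothetical placeholders for the characterized delay table.

```python
def state_delay(ops, deps, delay):
    """Longest chained combinational delay among the ops scheduled in one state."""
    arrival = {}

    def visit(op):
        if op not in arrival:
            # Only dependencies inside the same state are chained combinationally.
            preds = [p for p in deps.get(op, []) if p in ops]
            arrival[op] = delay[op] + max((visit(p) for p in preds), default=0.0)
        return arrival[op]

    return max((visit(op) for op in ops), default=0.0)


def critical_state(states, deps, delay):
    """Return (state name, delay) for the state with the longest chained delay."""
    delays = {name: state_delay(set(ops), deps, delay) for name, ops in states.items()}
    worst = max(delays, key=delays.get)
    return worst, delays[worst]


states = {"S0": ["m1", "a1"], "S1": ["a2", "c1"]}
deps = {"a1": ["m1"], "a2": ["a1"], "c1": ["a2"]}
delay = {"m1": 5.0, "a1": 2.0, "a2": 2.0, "c1": 1.5}   # hypothetical delays in ns
print(critical_state(states, deps, delay))
```

The slack of a state is the clock period minus its chained delay, which is the quantity that would decide whether another operation can be packed into it.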
May 19, 2010
Andrew
Mark
James
Victor
byte-enable of ram allows us to avoid instantiating multiple rams
structs containing only long integers work in legup
struct memory will be 64bits wide (8 bytes)
struct alignment can cause wasted space for example:
+----+----+----+----+----+----+----+----+
| a  |    |    |    |         b         |
+----+----+----+----+----+----+----+----+
3 bytes wasted right after 'char a' because b must be 4-byte aligned (integer)
C99 standard doesn't allow reordering members
does byte-enable add delay?
interesting fact: byte-enable allows the user to set the number of bits in a byte
future:
can we use the SRAM?
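The alignment example above (a char followed by an integer) can be checked on a host machine with a short sketch using Python's `ctypes`, which lays out structures following the native C ABI. The struct name `Example` is hypothetical, and the concrete numbers depend on the platform; on common targets where `int` is 4 bytes the padding is 3 bytes and the struct occupies 8 bytes.

```python
import ctypes


class Example(ctypes.Structure):
    # Mirrors: struct { char a; int b; };
    _fields_ = [("a", ctypes.c_char), ("b", ctypes.c_int)]


# Bytes wasted between the end of 'a' and the start of 'b'.
padding = Example.b.offset - (Example.a.offset + ctypes.sizeof(ctypes.c_char))
print("offset of b:", Example.b.offset)
print("padding after a:", padding)
print("sizeof struct:", ctypes.sizeof(Example))
```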
Ahmed
added arithmetic shift
looked at verilog in RTL viewer - looks correct
not a priority to make the script incremental - will be run overnight
plotting results using 'ploticus'
next step: power
focus on dynamic power for now
does speed grade affect each operation equally?
subtract setup time
Li
May 18, 2010
Conference yesterday
James
Mark
submitting bug reports to LLVM for MIPS backend
still isolating adpcm bug
send processor flow to James
modifying chstone to fix these bugs and checking into git
Andrew/Victor
Victor checked in pre-llc pass into git
new mailing list: legup-commits@legup.org
propose to implement structs in 8 columns of one-byte wide rams
a generic byte addressable memory with a 64-bit word size
allows writing a single byte in one cycle
otherwise to write a single byte we'd have to:
Jason: this will have very high power consumption
is there an altera ram with a mask?
most of the work is in the steering logic to handle byte addressability
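A behavioural sketch of the proposal above: a 64-bit word built from 8 byte-wide lanes, with a byte-enable mask selecting which lanes are written. The class and method names are hypothetical and this models behaviour only, not the Verilog or any Altera RAM primitive.

```python
class ByteEnableMemory:
    """Model of a 64-bit-wide memory made of 8 one-byte-wide RAM columns."""
    WORD_BYTES = 8

    def __init__(self, num_words):
        # One bytearray per lane stands in for one byte-wide RAM column.
        self.lanes = [bytearray(num_words) for _ in range(self.WORD_BYTES)]

    def write(self, word_addr, data, byte_enable):
        """Write only the bytes of 64-bit `data` whose enable bit is set."""
        for lane in range(self.WORD_BYTES):
            if (byte_enable >> lane) & 1:
                self.lanes[lane][word_addr] = (data >> (8 * lane)) & 0xFF

    def read(self, word_addr):
        """Reassemble the 64-bit word from all 8 lanes."""
        return sum(self.lanes[lane][word_addr] << (8 * lane)
                   for lane in range(self.WORD_BYTES))

    def write_byte(self, byte_addr, value):
        """Steering logic: turn a byte address into word address, lane, enable."""
        word_addr, lane = divmod(byte_addr, self.WORD_BYTES)
        self.write(word_addr, value << (8 * lane), 1 << lane)

mem = ByteEnableMemory(num_words=4)
mem.write(0, 0x1122334455667788, 0xFF)  # full-word write, all lanes enabled
mem.write_byte(2, 0xAB)                 # single-byte write, one lane enabled
print(hex(mem.read(0)))
```

The single-byte write touches one lane in one call, which corresponds to the one-cycle write the proposal is after; without the mask the same update would need a read of the full word followed by a write.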
Ahmed
Li
May 14, 2010
Mark
Li
Ahmed
wrote perl script that runs quartus and parses area (ALUTs, registers) and reg-to-reg delay
supports signed addition, multiplication for 8, 16, 32, 64 bits
pass FPGA family on command line
Adding other LLVM operators: /, %, and, or, xor, >, ==
Victor
Andrew
gave Victor access to git
new mailing list: legup-commits@legup.org
give git access to Ahmed to add perl script and Mark to modify chstone
May 13, 2010
Ahmed
Victor
James - SoC Bus Alternatives
ARM AMBA has fees for the IP core
Wishbone only specifies a spec, not enough value add
Avalon looks like best option
don't need to implement our own SOPC Generator
SOPC can be driven by tcl scripts
Tiger MIPS uses avalon buses
cannot be used on xilinx hardware (license)
we should allow Xilinx CoreConnect to be added in the future
Jason
top bullet: h/w designs offer compelling advantages
do we want people to think we're competing with Intel?
process node debate:
Intel:
TSMC:
Altera
Intel ahead of Altera (32nm vs 40nm) until Q3 2010
Altera ahead of Intel (28nm vs 32nm) until Q4 2011
Cool table (as of April 27, 2010)
could beat Intel in very parallel applications: option pricing
Andrew
add the Legup flowchart slide
show results on Nios II/f
don't use jargon, SSA: static single assignment
don't publicize legup.org yet, first impressions are important
release set for early fall
double precision floating point adder in C sounds like one line
Desh
May 12, 2010
Jason
Mark
all chstone benchmarks except jpeg run on processor (minus 5 optimization passes)
little-endian: just had to pass -el flag to llc
strangely 7 benchmarks passed even without the correct endian
assembly instructions aren't affected by endian
but the data section of assembly will change
endian determines the byte ordering, not the bits within a byte
so only affects data overflowing the first byte (>255)
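The byte-ordering point above can be illustrated with Python's struct module. This is a sketch of the general effect, not of the llc data section itself: byte-sized data is identical under either ordering, while a multi-byte word has its bytes reversed.

```python
import struct

value = 0x12345678

# The same 32-bit word laid out in each byte order.
little = struct.pack("<I", value)
big = struct.pack(">I", value)

# A single byte has no ordering, so both layouts agree.
small_le = struct.pack("<B", 200)
small_be = struct.pack(">B", 200)

print(little.hex(), big.hex(), small_le == small_be)
```

Only the order of whole bytes changes; the bits inside each byte stay the same.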
update wiki:
Andrew
Send chstone changes to mark
working on scheduler
dry-run tomorrow: a couple slides
current status of our tool: using LLVM, etc
what we're working on right now
Steve: should we mention legup.org yet?
might want to wait until we're ready to release
give git access to Victor
James
looking into ARM AMBA bus
CoreConnect: unclear whether you can freely distribute
Leaning towards wishbone: free to distribute
presenting bus alternatives tomorrow
can we use multi-port caches?
Victor
Ahmed
Discussion
May 11, 2010
Ahmed
Modelsim installed
Ran chstone benchmarks: 11/12 pass
Reading Gajski's high-level synthesis book chapter
Look into Cyclone II EP2C35F672C6 for now (DE2)
Manually get timing/area for verilog operators
Look into pipelined functional units
Use perl for scripting. Andrew can help.
Li
hasn't heard back about internet
Reading Gajski's high-level synthesis book chapter
msn: fireofnight@msn.com
might not make it tomorrow: having lunch with team at IBM
Mark
James
SONIC bus
looks very similar to the other buses (CoreConnect, Avalon, etc)
hard to determine licensing. Must email them to receive details
Victor
pre-link pass finished
memcpy and memset can handle types other than 'int'
note: currently only works for llvm-gcc not clang
observation: LLVM optimizes based on the number of elements in an array
could we make the backend pass a .so shared library?
Andrew
working on scheduler
went through the legup presentation with Victor, Ahmed, and Li
send original chstone paper
Discussion
May 10, 2010
Ahmed
Mark
James
still working on a course report due this Friday
will present bus interconnection alternatives Thursday
received IBM CoreConnect license
looking into SONIC SoC bus architecture
note: Mark's processor has an avalon bus
isn't it possible to connect an avalon bus to another bus type?
can you use avalon on xilinx?
do we need SOPC builder for verilog generation?
could generate one SOPC slave and modify this slave as necessary to support multiple hardware accelerators
does the license allow us to distribute Avalon or CoreConnect?
Design Trotter: profiler mentioned in paper
HW/SW Interface Impact on an Adaptive Multimedia System Performance: Case study
Gives a metric measuring the parallelism of a function
Used to decide which functions to convert into hardware
Li
Victor
added a new pass before llc, run with 'opt'
lowers llvm intrinsics like llvm.memset and llvm.memcpy to C library calls
modifying memcpy and memset to handle types other than 'int'
Andrew
Discussion
stall the processor while calling hardware accelerators
mux the cache between the processor and accelerators
Need a few passes to handle memory accesses from legup
Pass one:
Pass two:
Pass three:
May 7, 2010
Victor
creating a new pass before backend
also fixing memcpy to handle char types
will need to implement memset for dhrystone
Mark
all chstone benchmarks working in mips simulator with '-g' option
using optimizations '-O1' or '-O2'
mips and jpeg don't compile
all other benchmarks return incorrect results
investigating which opt pass is causing the problem
could create a simple C program to demonstrate the mips backend bug
could try -m32 to target a 32-bit machine from the gcc front-end
Andrew
add minutes page on wiki
working on scheduler.
James
Lee
Steve's new M.Sc. student from McMaster
will look into a debugging framework
Will look into:
download legup and run testbench
does modelsim have an api we can hook into for debugging
possible to call C functions from verilog?
Eclipse plugin might be a possible framework
Planning to sit in SF2206 with James
Jason
May 6, 2010
Mark
10/12 chstone working in simulator
comparing 64-bit integers has bugs in llvm mips backend
working on dfdiv and then dfsin
Andrew
Victor
dhrystone has unions and structs
need to add struct support to legup
altsyncrams have uniform data width, structs can have different field sizes
how do we store structs?
wide rams that store entire struct
make the ram as wide as the largest field of struct
one ram for each field
group fields by size. 4 integer fields are stored in one integer sized ram
need to support an array of structs
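A rough sketch of the "group fields by size" option above, extended to an array of structs. The function names and the address mapping are assumptions made for illustration; the notes do not say how fields would actually be addressed inside each RAM.

```python
from collections import defaultdict

def group_fields_by_size(fields):
    """fields: list of (name, size_in_bits). Returns {size_in_bits: [names]},
    i.e. one RAM per distinct field width."""
    rams = defaultdict(list)
    for name, bits in fields:
        rams[bits].append(name)
    return dict(rams)

def ram_address(rams, name, bits, struct_index):
    """Assumed mapping: each array element occupies one consecutive slot
    per field of that width inside the RAM for that width."""
    slot = rams[bits].index(name)
    return struct_index * len(rams[bits]) + slot

# Hypothetical struct with two 8-bit fields and three 32-bit fields.
fields = [("a", 8), ("x", 32), ("y", 32), ("b", 8), ("z", 32)]
rams = group_fields_by_size(fields)
addr_y = ram_address(rams, "y", 32, struct_index=2)
print(rams, addr_y)
```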
James
last presentation today. then done courses
will send out report on bus interfaces done for Jason's course
looking into licensing for various bus interfaces
can avalon be used on xilinx fpgas?
what are the restrictions on IBM CoreConnect, etc.
will try out a few bus interfaces to measure performance
no papers gave performance information: highly application specific
will present results at a future meeting
May 5, 2010
Wiki setup: http://legup.org/wiki/
Victor
Mark/Jason
function traces)
long term issues with llvm mips backend
many serious bugs, i.e. not allocating enough stack space for local variables
do we fix the backend ourselves?
an alternative is a llvm/gcc mashup
split code into hardware and software portions
use gcc to get mips executable
use llvm to get hardware
scripts to combine the two
would probably take as much time as fixing the llvm backend
Andrew
will be better)
Jason
jury duty next week
can Skype on Monday