Upgrading dokuwiki is very easy: basically just extract the new tarball over the existing installation.
— Andrew Canis 2009/10/18 16:00
There is a bug in the “make hybrid” flow:
dfadd.o: In function `main':
(_main_section+0x4c): undefined reference to `float64_add'
dfadd.o: In function `main':
(_main_section+0x68): undefined reference to `float64_add'
make: *** [hybrid] Error 1
Looking in dfadd.sw.ll:
%4 = call i64 bitcast (i64 (i64, i64)* @legup_wrap_float64_add to i64 (i64, i64 (i64, i64)*)*)(i64 %3, i64 (i64, i64)* @float64_add)
There is a reference to float64_add that shouldn't be there. Breaking down this function call:
%4 = call i64
    bitcast (i64 (i64, i64)* @legup_wrap_float64_add to i64 (i64, i64 (i64, i64)*)*)
    (i64 %3, i64 (i64, i64)* @float64_add)
What is that strange bitcast? Before llvm-ld this was:
%4 = call i64 @legup_wrap_float64_add(i64 %3, i64 (i64, i64)* @float64_add)
Before the sw pass it was:
%4 = tail call i64 @float64_add(i64 %2, i64 %3)
So something is going wrong in the sw pass. It's a bug in ReplaceCallWith() in utils.cpp:
%4 = tail call i64 @float64_add(i64 %2, i64 %3)
Becomes:
%4 = call i64 @legup_wrap_float64_add(i64 %3, i64 (i64, i64)* @float64_add)
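For reference, a correct replacement has to forward the original call's operands to the wrapper unchanged; passing the old callee in as an extra argument is exactly what produced the bitcast and the stray float64_add reference. A minimal sketch of that (assuming an LLVM 2.9-era API, not the actual utils.cpp code):

#include "llvm/Function.h"
#include "llvm/Instructions.h"
#include "llvm/ADT/SmallVector.h"

using namespace llvm;

// Replace a call like "%4 = tail call i64 @float64_add(i64 %2, i64 %3)"
// with a call to the wrapper that keeps the same operand list.
static CallInst *replaceCallWith(CallInst *CI, Function *Wrapper) {
  SmallVector<Value *, 8> Args;
  for (unsigned i = 0, e = CI->getNumArgOperands(); i != e; ++i)
    Args.push_back(CI->getArgOperand(i));           // keep %2, %3, ... as-is
  CallInst *NewCI = CallInst::Create(Wrapper, Args.begin(), Args.end(),
                                     "", CI);       // insert before old call
  NewCI->setCallingConv(CI->getCallingConv());
  CI->replaceAllUsesWith(NewCI);
  CI->eraseFromParent();
  return NewCI;
}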
Okay fixed it.
Seeing another problem with aes when accelerating aes_main:
acanis@acanis-desktop:~/git/legup/examples/chstone_hybrid/aes$ export LEGUP_ACCELERATOR_FILENAME=aes; ../../../llvm/Release+Asserts/bin/opt -legup-config=config.tcl -load=../../../cloog/install/lib/libcloog-isl.so -load=../../../cloog/install/lib/libisl.so -load=../../../llvm/tools/polly/Release+Asserts/lib/LLVMPolly.so -load=../../../llvm/Release+Asserts/lib//LLVMLegUp.so -legup-sw-only < aes.prelto.bc > aes.prelto.sw.bc
opt: SwOnly.cpp:205: virtual bool legup::SwOnly::runOnModule(llvm::Module&): Assertion `0 && "Accelerated function is never called or optimized away!\n"' failed.
— Andrew Canis 2011/10/05 12:15
/*-----------------------------------------------*
 *           CLooG configuration is OK           *
 *-----------------------------------------------*/
It appears that your system is OK to start CLooG compilation. You need now to type "make". After compilation, you should check CLooG by typing "make check". If no problem occur, you can type "make uninstall" if you are upgrading an old version. Lastly type "make install" to install CLooG on your system (log as root if necessary).
make -C cloog
make[1]: Entering directory `/home/acanis/git/new/legup/cloog'
CDPATH="${ZSH_VERSION+.}:" && cd . && /bin/bash /home/acanis/git/new/legup/cloog/autoconf/missing --run aclocal-1.11 -I m4
/home/acanis/git/new/legup/cloog/autoconf/missing: line 54: aclocal-1.11: command not found
WARNING: `aclocal-1.11' is missing on your system. You should only need it if you modified `acinclude.m4' or `configure.ac'. You might want to install the `Automake' and `Perl' packages. Grab them from any GNU archive site.
CDPATH="${ZSH_VERSION+.}:" && cd . && /bin/bash /home/acanis/git/new/legup/cloog/autoconf/missing --run autoconf
cd . && /bin/bash /home/acanis/git/new/legup/cloog/autoconf/missing --run automake-1.11 --foreign
/home/acanis/git/new/legup/cloog/autoconf/missing: line 54: automake-1.11: command not found
WARNING: `automake-1.11' is missing on your system. You should only need it if you modified `Makefile.am', `acinclude.m4' or `configure.ac'. You might want to install the `Automake' and `Perl' packages. Grab them from any GNU archive site.
configure.ac:59: error: possibly undefined macro: AM_INIT_AUTOMAKE
      If this token and others are legitimate, please use m4_pattern_allow.
      See the Autoconf documentation.
configure.ac:73: error: possibly undefined macro: AC_PROG_LIBTOOL
configure.ac:75: error: possibly undefined macro: AM_CONDITIONAL
make[1]: *** [configure] Error 1
To fix this I just added back the ./autogen.sh step in the cloog directory.
— Andrew Canis 2011/10/04 12:15
Looking into why the xor-xor pattern in blowfish is taking up more registers.
strict (no sharing) -> strict off
reg:   6853 -> 7318
aluts: 6795 -> 6575
Just sharing the size 5 patterns:
Pattern Size: 5 (contents: addi32, xori32, addi32, xori32, xori32, ) Frequency: 15 Number of Pairs: 7
I get an improvement from:
; Combinational ALUTs ; 8,579 / 58,080 ( 15 % )  ;
; Total registers     ; 8389                     ;
; Logic utilization   ; 10,935 / 58,080 ( 19 % ) ;
To:
; Combinational ALUTs ; 6,945 / 58,080 ( 12 % ) ;
; Total registers     ; 7494                    ;
; Logic utilization   ; 9,586 / 58,080 ( 17 % ) ;
So that's a reduction of 895 registers, 1634 ALUTs, and 1349 logic utilization.
Just sharing size 3 patterns:
Function: BF_encrypt
Pattern Size: 3 (contents: addi32, xori32, addi32, ) Frequency: 16 Number of Pairs: 7
Function: BF_cfb64_encrypt
Pattern Size: 3 (contents: ori32, ori32, ori32, ) Frequency: 2 Number of Pairs: 1
I see:
; Combinational ALUTs ; 7,796 / 58,080 ( 13 % ) ;
; Total registers     ; 7446                    ;
; Logic utilization   ; 9,706 / 58,080 ( 17 % ) ;
Only sharing size 1 patterns:
Function: BF_encrypt
Pattern Size: 1 (contents: xori32, ) Frequency: 50 Number of Pairs: 23
Pattern Size: 1 (contents: addi32, ) Frequency: 32 Number of Pairs: 15
Function: BF_cfb64_encrypt
Pattern Size: 1 (contents: ori32, ) Frequency: 6 Number of Pairs: 3
Function: main
Pattern Size: 1 (contents: ori32, ) Frequency: 3 Number of Pairs: 1
Pattern Size: 1 (contents: addi32, ) Frequency: 3 Number of Pairs: 1
; Combinational ALUTs ; 7,168 / 58,080 ( 12 % ) ;
; Total registers     ; 7336                    ;
; Logic utilization   ; 9,711 / 58,080 ( 17 % ) ;
— Andrew Canis 2011/09/22 12:15
Look into the restrict keyword for pointers
TODO:
- make sure srem/sdiv share together.
- srem is only a problem with aes
- function inlining would save 2 dividers in jpeg, 1 in sha, 1 in aes (assuming div/rem sharing)
  Damn. sdiv and srem are used in the same state.
- binding aware scheduling would be crucial here
Side note: C2H has some useful benchmarks
— Andrew Canis 2011/09/15 12:15
Stefan just noticed a large drop in LEs on the quality of results page. It occurs for these commits (which doesn't make sense):
Stefan Hadjis [Thu, 25 Aug 2011 16:17:22 +0000] Fixed compilation error
Stefan Hadjis [Thu, 25 Aug 2011 15:54:48 +0000] Removed include Signals.h
Stefan Hadjis [Thu, 25 Aug 2011 15:40:26 +0000] Merge branch 'master' of legup.org:legup
Stefan Hadjis [Thu, 25 Aug 2011 14:50:44 +0000] Binding changes for new LLVM version
    Made small changes to be compatible with the new version of LLVM.
Maybe that's when I changed the quartus version? Build before: Version 9.1 Build 350 03/24/2010
Cycle geomean: 14576.0837656035
Fmax geomean: 80.3956562368825
Latency geomean: 181.702363878993
cat benchmark.csv
name              time   cycles   Fmax    LEs    regs   comb   mults  membits
chstone/adpcm     407    31523    77.38   24284  10585  21786  300    27072
chstone/aes       209    15716    75.18   21590  11386  18041  0      36800
chstone/blowfish  2811   197978   70.43   15967  8368   14198  0      150240
chstone/dfadd     6      804      124.98  10113  3911   9564   0      17056
chstone/dfdiv     29     2256     78.27   18079  12521  12256  48     12416
chstone/dfmul     3      291      107.33  5095   2382   4545   32     12032
chstone/dfsin     1010   64433    63.80   33363  18077  26105  86     12832
chstone/gsm       84     5358     63.54   19058  5813   17477  70     10144
chstone/jpeg      33949  1323338  38.98   46485  19051  42071  240    468784
chstone/mips      52     5118     98.11   5042   2044   4492   16     4480
chstone/motion    54     6379     117.52  5449   2406   5000   0      33312
chstone/sha       2843   233875   82.25   17015  8563   14004  0      134368
dhrystone         82     7424     90.93   6893   3737   5611   4      2256
program finished with exit code 0
elapsedTime=25113.470937
Build after: Version 9.1 Build 350 03/24/2010
Cycle geomean: 14576.0837656035
Fmax geomean: 84.5261619961444
Latency geomean: 173.020465101761
cat benchmark.csv
name              time   cycles   Fmax    LEs    regs   comb   mults  membits
chstone/adpcm     410    31523    76.91   24100  10173  21358  172    27072
chstone/aes       190    15716    82.52   19730  10508  16113  0      36800
chstone/blowfish  2724   197978   72.69   13367  7684   11544  0      150240
chstone/dfadd     6      804      134.68  8531   3879   7965   0      17056
chstone/dfdiv     25     2256     89.90   14962  10736  9430   48     12416
chstone/dfmul     3      291      101.49  4451   2147   3965   32     12032
chstone/dfsin     943    64433    68.32   30048  16602  23622  86     12832
chstone/gsm       69     5358     77.93   11112  5864   9598   52     10144
chstone/jpeg      33026  1323338  40.07   40614  19051  36075  172    468784
chstone/mips      53     5118     95.83   4244   1718   3871   16     4480
chstone/motion    51     6379     125.57  4726   2322   4273   0      33312
chstone/sha       2738   233875   85.43   15692  8657   12623  0      134368
dhrystone         82     7424     90.43   6566   3673   5338   4      2256
program finished with exit code 0
elapsedTime=21144.868643
No. It's caused by a combination of 1) the VerilogWriter fix and 2) a few more dividers that might have been shared.
Just installed quartus 10.1 sp1. Took 40 minutes to compile dfsin with no_dsps. So the new version seems to be working.
Adding stratix4 to the buildbot:
buildmaster@acanis-desktop:~/buildbot/public_html/perf$ generate_perf.py
And modifying dashboard/overview.html and dashboard/perf.html. Also need to modify process_log.py. Then restart the buildbot.
Actually very easy! Wow. Just noticed I wasn't backing up my buildmaster stuff. Just added it to the backup system. Updating the quartus version on buildbot to 10.1sp1. Do I have to do something with sdc files? Also I have to fix benchmark.pl to actually work properly. Do I need sdc files? Yep, otherwise you get a critical warning.
— Andrew Canis 2011/09/14 12:15
TODO:
- make sure srem/sdiv share together.
- why are sdiv/srem with constant inputs being instantiated?
Turns out sharing between functions is actually more complicated than I originally thought. You need to instantiate the bound functional unit in the main module and then set up a mux between each instantiated module.
Just set up a branch for this half-way-done function inlining code (in ~/git/legup):
git checkout -b inlining
git commit -a
Cases where there are two sext/zext operations feeding an adder occur in: dfdiv, dfmul, dfsin, gsm, mips, sha, dhrystone
— Andrew Canis 2011/09/13 12:15
I'm trying to turn LegupPass into a ModulePass so we can do binding across function boundaries.
Very strange. It seems like LegupConfig is getting constructed twice…
So basically there are two versions. One is created by llc and I'm not sure about the other one. I think one of them is for function passes?
acanis@acanis-desktop:~/git/legup/examples/loop$ ../../llvm/Release+Asserts/bin/llc -legup-config=../../hwtest/CycloneII.tcl -march=v loop.bc -o loop.v --debug-pass=Details
Adding from llc
Constructing LegupConfig 0x9f13d40
Constructing LegupConfig 0x9f4b230
Pass Arguments: -targetdata -legupconfig
Target Data Layout
Legup Configuration
  ModulePass Manager
    LegupPass backend
    Unnamed pass: implement Pass::getPassName()
Pass Arguments: -no-aa -legupconfig -legup-LiveVariableAnalysis -memdep -legup scheduler DAG -sdc-sched -simple asap -meta-asap
No Alias Analysis (always returns 'may' alias)
Legup Configuration
  FunctionPass Manager
    LVA
    Memory Dependence Analysis
    Legup directed acyclic graph with dependency and other information
    SDC Scheduler -- use linear programming for scheduling
    ASAP scheduler without resource constraints
    Complete ASAP Scheduling
0x9f13c60 Executing Pass 'LegupPass backend' on Module 'loop.bc'...
0x9f47fc0   Required Analyses: LVA, Complete ASAP Scheduling, Legup Configuration
Starting doInitialization
op_name: signed_comp_lt_32 count: 155 this 0x9f13d40
op_name: signed_comp_lt_32 count: 155 this 0x9f13d40
Starting function: main
op_name: signed_comp_lt_32 count: 155 this 0x9f13d40
op_name: signed_comp_lt_32 count: 155 this 0x9f13d40
0x9f4a2d8 Executing Pass 'LVA' on Function 'main'...
0x9f4a2d8 Made Modification 'LVA' on Function 'main'...
0x9f4a2d8 Executing Pass 'Memory Dependence Analysis' on Function 'main'...
0x9f4ad90   Required Analyses: No Alias Analysis (always returns 'may' alias)
0x9f4a2d8 Executing Pass 'Legup directed acyclic graph with dependency and other information' on Function 'main'...
0x9f4aba8   Required Analyses: Memory Dependence Analysis, Legup Configuration
op_name: signed_comp_lt_32 count: 0 this 0x9f4b230
llc: /home/acanis/git/legup/llvm/include/llvm/LegupConfig.h:356: legup::Operation* legup::LegupConfig::getOperationRef(std::string): Assertion `Operations.find(op_name) != Operations.end()' failed.
Okay. So this doesn't happen if I make LegupPass a function pass. Strange. But isn't TargetData an immutable pass? Okay. So one example is MergeFunctions, which is a ModulePass that also uses the TargetData info. Okay - it never actually adds TargetData as a required analysis pass. And actually, _none_ of the passes ever add TargetData as a required pass.
Strange. So I can't even get the TargetData analysis from LegupSchedulerDAG. I'll just have to make LegupConfig a global variable for now.
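For the record, this is how a ModulePass normally declares and fetches an analysis like TargetData (a hedged sketch against the LLVM 2.9-era API; MyModulePass is a made-up name, not LegUp code):

#include "llvm/Pass.h"
#include "llvm/Module.h"
#include "llvm/Target/TargetData.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

namespace {
struct MyModulePass : public ModulePass {
  static char ID;
  MyModulePass() : ModulePass(ID) {}

  // Declare the analyses this pass depends on; the pass manager then
  // guarantees they are available in runOnModule().
  virtual void getAnalysisUsage(AnalysisUsage &AU) const {
    AU.addRequired<TargetData>();   // immutable pass provided by llc/opt
    AU.setPreservesAll();
  }

  virtual bool runOnModule(Module &M) {
    TargetData &TD = getAnalysis<TargetData>();
    errs() << "pointer size: " << TD.getPointerSize() << "\n";
    return false;
  }
};
}

char MyModulePass::ID = 0;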
So let's get divider sharing working. aes has 11 dividers/remainders, which reduces to 4 after binding. I see at least one case where they aren't being shared across function boundaries.
TODO: make sure srem/sdiv share together.
Very strange. If I call the LVA pass twice
— Andrew Canis 2011/09/12 12:15
I'm going to install the latest version of ubuntu to see if the roccc binaries work.
Error compiling gcc in the roccc installation. I had to install gcc-multilib. Okay. roccc works in the latest version of ubuntu!
— Andrew Canis 2011/09/08 12:15
Installed roccc:
acanis@acanis-desktop:~/roccc/roccc-0.6-distribution$ ./rocccInstall.sh -t ~/roccc/roccc-0.6-install/
ROCCC INSTALLER
This process will install ROCCC 2.0 onto your system.
Warnings will be recorded in the file warning.log
Some steps may take a while
The GUI requires Eclipse 3.5 or higher. Please download from www.eclipse.org.
Installing modified gcc 4.0.2 for Hi-CIRRF
Installing llvm-gcc for Lo-CIRRF
Compiling the roccc-compiler proper
ROCCC already installed
Floating point cores added to the database
All of ROCCC is set up!
When prompted by the GUI, please enter: /home/acanis/roccc/roccc-0.6-distribution as the ROCCC distribution directory
All of ROCCC has been set up. The binaries are located in /home/acanis/roccc/roccc-0.6-distribution/Install
Installed eclipse in ~/eclipse.
Damn when I try to build in roccc I get the error:
/home/acanis/roccc/roccc-0.6-distribution//Install/roccc-compiler/src/../bin/parser: symbol lookup error: /home/acanis/roccc/roccc-0.6-distribution//Install/roccc-compiler//solib/libstdc++.so.6: undefined symbol: _ZNSt7num_getIcSt19istreambuf_iteratorIcSt11char_traitsIcEEE2idE, version GLIBCXX_3.4
Compilation of FFT.c failed.
So I moved the c++ lib into a tmp directory:
acanis@acanis-desktop:~/roccc/roccc-0.6-distribution/Install/roccc-compiler/solib$ mv libstdc++.so.6* tmp/
Now the gui opens okay but I get a new error:
/home/acanis/roccc/roccc-0.6-distribution/Install/roccc-compiler/src/llvm-2.3//Release/bin/opt: /usr/lib/libstdc++.so.6: version `GLIBCXX_3.4.14' not found (required by /home/acanis/roccc/roccc-0.6-distribution/Install/roccc-compiler/src/llvm-2.3//Release/bin/opt)
/home/acanis/roccc/roccc-0.6-distribution/Install/roccc-compiler/src/llvm-2.3//Release/bin/opt: /usr/lib/libstdc++.so.6: version `GLIBCXX_3.4.11' not found (required by /home/acanis/roccc/roccc-0.6-distribution/Install/roccc-compiler/src/llvm-2.3//Release/bin/opt)
Compilation of FFT.c failed.
They're using quite an old version of llvm (2.3). Do I need to downgrade to libstdc++.so.5? Looking at the ldd output:
ldd /home/acanis/roccc/roccc-0.6-distribution/Install/roccc-compiler/src/llvm-2.3//Release/bin/opt
/home/acanis/roccc/roccc-0.6-distribution/Install/roccc-compiler/src/llvm-2.3//Release/bin/opt: /usr/lib/libstdc++.so.6: version `GLIBCXX_3.4.14' not found (required by /home/acanis/roccc/roccc-0.6-distribution/Install/roccc-compiler/src/llvm-2.3//Release/bin/opt)
/home/acanis/roccc/roccc-0.6-distribution/Install/roccc-compiler/src/llvm-2.3//Release/bin/opt: /usr/lib/libstdc++.so.6: version `GLIBCXX_3.4.11' not found (required by /home/acanis/roccc/roccc-0.6-distribution/Install/roccc-compiler/src/llvm-2.3//Release/bin/opt)
	linux-gate.so.1 => (0xb772c000)
	libsqlite3.so.0 => /usr/lib/libsqlite3.so.0 (0xb7696000)
	libpthread.so.0 => /lib/tls/i686/cmov/libpthread.so.0 (0xb767c000)
	libdl.so.2 => /lib/tls/i686/cmov/libdl.so.2 (0xb7678000)
	libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0xb7589000)
	libm.so.6 => /lib/tls/i686/cmov/libm.so.6 (0xb7563000)
	libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xb7554000)
	libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0xb73f1000)
	/lib/ld-linux.so.2 (0xb772d000)
Damn. I'm not sure what to do here…
— Andrew Canis 2011/09/07 12:15
I need to look into roccc. Try to compile the chstone benchmarks.
Todo:
1) The interface for the fsm needs to be changed. How do you know how many cycles an instruction takes?
2) Makefile needs to have an option for debugging mode
3) Gui - cleanup APIs
4) forum for mailing list
Looking at the gantt chart for popcount is very interesting. So much has been inlined that the gantt chart is huge. There are 53 states.
Looking into the multi fmax bug in benchmark.pl. jpeg seems to have the bug:
Type          : Clock Setup: 'pll50MHz:pll50|altpll:altpll_component|_clk0'
Slack         : 1.303 ns
Required Time : 50.00 MHz ( period = 20.000 ns )
Actual Time   : 57.49 MHz ( period = 17.394 ns )
From          : tiger:tiger_sopc|data_cache_0:the_data_cache_0|Cache:data_cache_0|dcacheMem:dcacheMemIns|altsyncram:altsyncram_component|altsyncram_9hd2:auto_generated|ram_block1a0~porta_address_reg8
To            : tiger:tiger_sopc|tiger_top_0:the_tiger_top_0|tiger_top:tiger_top_0|tiger_tiger:core|tiger_decode:de|always0~1_Duplicate_OTERM447_OTERM459
From Clock    : pll50MHz:pll50|altpll:altpll_component|_clk0
To Clock      : pll50MHz:pll50|altpll:altpll_component|_clk0
Failed Paths  : 0

Type          : Clock Setup: 'altera_internal_jtag~TCKUTAP'
Slack         : N/A
Required Time : None
Actual Time   : 48.41 MHz ( period = 20.658 ns )
From          : tiger:tiger_sopc|tigers_jtag_uart_1:the_tigers_jtag_uart_1|vJTAGUart:tigers_jtag_uart_1|FIFO:DataOut|dcfifo:dcfifo_component|dcfifo_4sp1:auto_generated|altsyncram_vu11:fifo_ram|altsyncram_rd91:altsyncram14|ram_block15a0~porta_address_reg7
To            : sld_hub:auto_hub|tdo
From Clock    : altera_internal_jtag~TCKUTAP
To Clock      : altera_internal_jtag~TCKUTAP
Failed Paths  : 0
I'm recompiling jpeg in quartus to double check this. Strange. jpeg doesn't compile for me. Very strange. I have a blank function that gets called 3 times:
declare void @mexit_spin(i32) noreturn
How did buildbot not catch this? Okay, never mind, it was due to some new changes I've been making. Retesting with a fresh copy of the repository. Remember: to compile with quartus you use “make p” to set up the project and then “make f”.
— Andrew Canis 2011/08/23 12:15
I should probably add a forum to the legup website. Actually what I really need to do is turn the mailing list into more of a forum. Like the nabble forum for llvm.
I think writing a gui is actually very useful. Because I'll be able to clean up the APIs.
How to inline everything? What does inline-threshold do? From the code:
InlineLimit("inline-threshold", cl::Hidden, cl::init(225), cl::ZeroOrMore,
            cl::desc("Control the amount of inlining to perform (default = 225)"));
I'm going to try to run opt on adpcm:
acanis@acanis-desktop:~/work/legup/examples/chstone/adpcm$ ../../../llvm/Debug+Asserts/bin/opt -debug -inline -inline-threshold=0 < adpcm.bc > adpcm.new.bc; ../../../llvm/Debug+Asserts/bin/llvm-dis adpcm.new.bc
Args: ../../../llvm/Debug+Asserts/bin/opt -debug -inline -inline-threshold=0
Inliner visiting SCC: upzero: 0 call sites.
Inliner visiting SCC: INDIRECTNODE: 0 call sites.
Inliner visiting SCC: printf: 0 call sites.
Inliner visiting SCC: main: 4 call sites.
    NOT Inlining: cost=370, thres=0, Call: tail call fastcc void @upzero(i32 %76, i32* getelementptr inbounds ([6 x i32]* @delay_dltx, i32 0, i32 0), i32* getelementptr inbounds ([6 x i32]* @delay_bpl, i32 0, i32 0)) nounwind
    NOT Inlining: cost=370, thres=0, Call: tail call fastcc void @upzero(i32 %148, i32* getelementptr inbounds ([6 x i32]* @delay_dhx, i32 0, i32 0), i32* getelementptr inbounds ([6 x i32]* @delay_bph, i32 0, i32 0)) nounwind
    NOT Inlining: cost=370, thres=0, Call: tail call fastcc void @upzero(i32 %258, i32* getelementptr inbounds ([6 x i32]* @dec_del_dltx, i32 0, i32 0), i32* getelementptr inbounds ([6 x i32]* @dec_del_bpl, i32 0, i32 0)) nounwind
    NOT Inlining: cost=370, thres=0, Call: tail call fastcc void @upzero(i32 %333, i32* getelementptr inbounds ([6 x i32]* @dec_del_dhx, i32 0, i32 0), i32* getelementptr inbounds ([6 x i32]* @dec_del_bph, i32 0, i32 0)) nounwind
Inliner visiting SCC: INDIRECTNODE: 0 call sites.
So it's not inlining, even though I've set the inline-threshold. Oh wait, you have to set the threshold to be high. To inline all functions run:
acanis@acanis-desktop:~/work/legup/examples/chstone/adpcm$ ../../../llvm/Debug+Asserts/bin/opt -debug -inline-threshold=100000 -inline < adpcm.bc > adpcm.new.bc; ../../../llvm/Debug+Asserts/bin/llvm-dis adpcm.new.bc
Args: ../../../llvm/Debug+Asserts/bin/opt -debug -inline-threshold=100000 -inline
Inliner visiting SCC: upzero: 0 call sites.
Inliner visiting SCC: INDIRECTNODE: 0 call sites.
Inliner visiting SCC: printf: 0 call sites.
Inliner visiting SCC: main: 4 call sites.
    Inlining: cost=370, thres=100000, Call: tail call fastcc void @upzero(i32 %76, i32* getelementptr inbounds ([6 x i32]* @delay_dltx, i32 0, i32 0), i32* getelementptr inbounds ([6 x i32]* @delay_bpl, i32 0, i32 0)) nounwind
    Inlining: cost=370, thres=100000, Call: tail call fastcc void @upzero(i32 %410, i32* getelementptr inbounds ([6 x i32]* @dec_del_dhx, i32 0, i32 0), i32* getelementptr inbounds ([6 x i32]* @dec_del_bph, i32 0, i32 0)) nounwind
    Inlining: cost=370, thres=100000, Call: tail call fastcc void @upzero(i32 %335, i32* getelementptr inbounds ([6 x i32]* @dec_del_dltx, i32 0, i32 0), i32* getelementptr inbounds ([6 x i32]* @dec_del_bpl, i32 0, i32 0)) nounwind
    Inlining: cost=-14630, thres=100000, Call: tail call fastcc void @upzero(i32 %225, i32* getelementptr inbounds ([6 x i32]* @delay_dhx, i32 0, i32 0), i32* getelementptr inbounds ([6 x i32]* @delay_bph, i32 0, i32 0)) nounwind
    -> Deleting dead function: upzero
CGSCCPASSMGR: Refreshing SCC with 1 nodes:
  Call graph node for function: 'main'<<0x9d0da18>>  #uses=1
    CS<0x9d297a4> calls function 'printf'
CGSCCPASSMGR: SCC Refresh didn't change call graph.
Inliner visiting SCC: INDIRECTNODE: 0 call sites.
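The same thing can be done from inside a PassManager instead of on the opt command line; a hedged sketch against the LLVM 2.9-era API (the 100000 threshold mirrors the command above):

#include "llvm/Module.h"
#include "llvm/PassManager.h"
#include "llvm/Transforms/IPO.h"

using namespace llvm;

// Run the standard inliner with a huge threshold so effectively every call
// site gets inlined, matching -inline -inline-threshold=100000 above.
void inlineEverything(Module &M) {
  PassManager PM;
  PM.add(createFunctionInliningPass(100000));
  PM.run(M);
}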
— Andrew Canis 2011/08/23 12:15
The linker seems to be optimizing away everything in the sw/hw partitioning case.
To remove untracked files in git (be careful): this removes all untracked directories (-d) and ignored files too (-x):
git clean -fdx
Makefile dependencies are annoying. Dry run can help you see what's going on:
acanis@acanis-desktop:~/git/legup$ make -n
mkdir -p cloog/install
cd cloog && ./configure --prefix=/home/acanis/git/legup/cloog/install
make -C cloog
make install -C cloog
cd llvm && ./configure --with-cloog=/home/acanis/git/legup/cloog/install --with-isl=/home/acanis/git/legup/cloog/install
make -C mips-binutils
make -C llvm
make -C tiger/hybrid/processor
make -C tiger/processor
make clean -C tiger/linux_tools
make -C tiger/linux_tools
make clean -C examples/lib/llvm
make -C examples/lib/llvm
Okay, I figured out why -j doesn't propagate to recursive calls of make: I need to call $(MAKE) instead of 'make'. So make -j4 works fine now. The only problem is that this screws up the nice clean “make -n” output shown above.
There's a slight dependency problem with the Transforms/LegUp makefile:
LDFLAGS = $(LLVM_OBJ_ROOT)/lib/CodeGen/$(BuildMode)/IntrinsicLowering.o
Sometimes CodeGen isn't built before this object file is needed. To reproduce, run:
rm -rf llvm/lib/Transforms/LegUp/Release+Asserts/ llvm/lib/CodeGen/Release+Asserts/
And you'll see:
llvm[4]: Linking Release+Asserts Loadable Module LLVMLegUp.so
g++: /home/acanis/git/legup/llvm/lib/CodeGen/Release+Asserts/IntrinsicLowering.o: No such file or directory
— Andrew Canis 2011/08/23 12:15
Trying to move IterativeModuloScheduling into the Target/Verilog directory. Running into the same errors as before:
IterativeModuloScheduling.cpp:18:33: error: polly/LinkAllPasses.h: No such file or directory
IterativeModuloScheduling.cpp:19:37: error: polly/Support/GICHelper.h: No such file or directory
IterativeModuloScheduling.cpp:20:38: error: polly/Support/ScopHelper.h: No such file or directory
IterativeModuloScheduling.cpp:21:25: error: polly/Cloog.h: No such file or directory
IterativeModuloScheduling.cpp:22:31: error: polly/Dependences.h: No such file or directory
IterativeModuloScheduling.cpp:23:28: error: polly/ScopInfo.h: No such file or directory
IterativeModuloScheduling.cpp:24:32: error: polly/TempScopInfo.h: No such file or directory
IterativeModuloScheduling.cpp:39:25: error: cloog/cloog.h: No such file or directory
IterativeModuloScheduling.cpp:40:29: error: cloog/isl/cloog.h: No such file or directory
In tools/polly/lib/Makefile there is the line:
CPP.Flags += $(POLLY_INC)
Where POLLY_INC is defined in the polly Makefile.config file. I need to add this include path in the base makefile. Okay I can just add this to the Target/Verilog makefile:
CPP.Flags += -I$(LLVM_SRC_ROOT)/../cloog/install/include \
             -I$(LLVM_SRC_ROOT)/tools/polly/include
Actually I'm going to move this to the Transforms/LegUp directory so I can run this as a prepass.
This is annoying:
../../llvm/Debug+Asserts/bin/opt -load=../../llvm/Debug+Asserts/lib/LLVMLegUp.so -legup-prelto < pipeline.prelto.linked.bc > pipeline.prelto.bc
Error opening '../../llvm/Debug+Asserts/lib/LLVMLegUp.so': ../../llvm/Debug+Asserts/lib/LLVMLegUp.so: undefined symbol: _ZNK5polly8ScopPass5printERN4llvm11raw_ostreamEPKNS1_6ModuleE
The polly shared library is stored in the tools directory now:
./tools/polly/Debug+Asserts/lib/LLVMPolly.so
This problem again:
Error opening '../../llvm/tools/polly/Debug+Asserts/lib/LLVMPolly.so': libisl.so.7: cannot open shared object file: No such file or directory
Okay, I can fix this by loading these shared libraries manually.
Missing the SchedulerDAG from the Target/Verilog:
Error opening '../../llvm/Debug+Asserts/lib/LLVMLegUp.so': ../../llvm/Debug+Asserts/lib/LLVMLegUp.so: undefined symbol: _ZN5legup17LegupSchedulerDAG2IDE
And SchedulerPass:
../../llvm/Debug+Asserts/bin/opt: symbol lookup error: ../../llvm/Debug+Asserts/lib/LLVMLegUp.so: undefined symbol: _ZN5legup13SchedulerPass14canChainBeforeEPN4llvm11InstructionE
Had to add the .o files to the makefile:
$(LLVM_OBJ_ROOT)/lib/Target/Verilog/$(BuildMode)/LegupSchedulerDAG.o \
$(LLVM_OBJ_ROOT)/lib/Target/Verilog/$(BuildMode)/SchedulerPass.o
Okay. Seems to be working now. Let's just confirm I get the same results as before, and then I can commit what I have so far. I'm seeing 407 cycles for examples/pipeline:
# run 7000000000000000ns
# At t= 815000 cycles= 407 clk=1 finish=1 return_val= 0
# ** Note: $finish    : pipeline.v(1390)
Interesting. Before the update I was seeing ~500 cycles. This might be due to the new sdc scheduler? Anyway, it still should be ~300 cycles, so I need to fix the cross basic block latency issue. Oh wait. It's because I've modified pipeline.c to have a loop carried dependence. Nope. I changed it back and the cycle count doesn't change. Luckily I saved everything in examples/pipeline/ece1754/
I need to actually distribute the gantt.sty latex style.
Trying to debug this tiger issue. I'm running “make tigersim”. After a while the runtest.log is stalled at:
Running ./../dejagnu/tiger_sim/dfadd.exp ...
Looking at htop:
1899 acanis 20 0 3336 1008 784 S 0.0 0.0 0:00.00 | | `- make test_tiger_sim
1901 acanis 20 0 8360 4792 1560 S 0.0 0.2 0:01.22 | | `- /usr/bin/expect -- /usr/share/dejagnu/runtest.exp -v -v -v -v --all --status=1 ../dejagnu/tiger_sim/adpcm.exp ../dejagnu/tiger_sim/aes.exp ../d
2164 acanis 20 0 3468 1056 804 S 0.0 0.0 0:00.00 | | `- make tigersim
2189 acanis 20 0 3944 1244 1044 S 0.0 0.0 0:00.00 | | | `- /bin/bash -e -c cd /home/acanis/work/legup/examples/chstone/dfadd/../../../tiger/processor/tiger_DE2/tiger_sim && ./simulate
2190 acanis 20 0 4460 1068 600 S 0.0 0.0 0:00.00 | | | `- /bin/bash -e -c cd /home/acanis/work/legup/examples/chstone/dfadd/../../../tiger/processor/tiger_DE2/tiger_sim && ./simulate
2197 acanis 20 0 3068 628 532 S 0.0 0.0 0:00.00 | | | `- tee transcript.txt
2196 acanis 20 0 17016 8416 3004 S 0.0 0.3 0:00.29 | | | `- vish -- -vsim -c -do ../run_sim_nowave.tcl
2212 acanis 20 0 71084 18480 3964 R 96.0 0.7 9:24.91 | | | `- /opt/modelsim/install/modeltech/linux/vsimk -port 56352 -stdoutfilename /tmp/VSOUTwrXtYr -c -do ../run_sim_nowave.tcl
2213 acanis 20 0 71084 18480 3964 S 0.0 0.7 0:00.00 | | | | `- /opt/modelsim/install/modeltech/linux/vsimk -port 56352 -stdoutfilename /tmp/VSOUTwrXtYr -c -do ../run_sim_nowave.tcl
2205 acanis 20 0 3172 960 784 S 0.0 0.0 0:00.00 | | | `- /opt/modelsim/install/modeltech/linux/vlm 1598714592 1226522872
2206 acanis 20 0 4320 2784 1388 S 0.0 0.1 0:00.14 | | | `- /opt/modelsim/install/modeltech/linux/mgls/lib/mgls_asynch -f6,10
1925 acanis 20 0 8360 4792 1560 S 0.0 0.2 0:00.00 | | `- /usr/bin/expect -- /usr/share/dejagnu/runtest.exp -v -v -v -v --all --status=1 ../dejagnu/tiger_sim/adpcm.exp ../dejagnu/tiger_sim/aes.exp
2236 acanis 20 0 6812 4056 1448 S 0.0 0.1 0:00.12 |
Looking in /tmp/VSOUTwrXtYr all I see is:
...
Tap Controller State machine output error
Time: 0  Instance: test_bench.DUT.the_tiger_top_0.tiger_top_0.debug_controller.VJTInst.sld_virtual_jtag_component.jtag.output_logic
a_input=
— Andrew Canis 2011/08/23 12:15
I need to add something to shrink the integer sizes down. There is a presentation here:
llvm.org/pubs/2007-07-25-LLVM-2.0-and-Beyond.pdf
The llvm 2.0 release added arbitrary precision integers:
Primarily useful to EDA / hardware synthesis business:
 * An 11-bit multiplier is significantly cheaper/smaller than a 16-bit one
 * Can use LLVM analysis/optimization framework to shrink variable widths
 * Patch available that adds an attribute in llvm-gcc to get this
Implementation impact of arbitrary width integers:
 * Immediates, constant folding, intermediate arithmetic simplifications
 * New APInt class used internally to represent/manipulate these
 * Makes LLVM more portable, not using uint64_t everywhere for arithmetic
I need to get my hands on that patch. Can't seem to find it, so I'll have to implement this myself.
For instance, I think this was the case Stefan was looking at in mips:
%6 = phi i32 [ %227, %226 ], [ 0, %.preheader ]
%7 = lshr i32 %pc.0, 2
%8 = and i32 %7, 63
63 is all zeros and then six ones, so only the low 6 bits of %7 matter. The above code can be turned into:
%pc.0 = phi i32 [ %pc.1, %226 ], [ 4194304, %.preheader ]
%7 = lshr i32 %pc.0, 2
%8 = trunc i32 %7 to i6
%9 = and i6 %8, 63
%10 = zext i6 %9 to i32
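A hedged sketch of a pass helper that would do this rewrite (written against a roughly LLVM 3.0-style IRBuilder API, so details like Type constness may differ in our tree; this is not existing LegUp code):

#include "llvm/Constants.h"
#include "llvm/DerivedTypes.h"
#include "llvm/Instructions.h"
#include "llvm/Support/IRBuilder.h"

using namespace llvm;

// Shrink "X & 63" style masks: truncate to the mask width, mask there, and
// zero-extend back, producing the trunc/and/zext sequence shown above.
static bool shrinkAndMask(BinaryOperator *AndI) {
  if (AndI->getOpcode() != Instruction::And)
    return false;
  ConstantInt *Mask = dyn_cast<ConstantInt>(AndI->getOperand(1));
  if (!Mask)
    return false;
  unsigned Bits = Mask->getValue().getActiveBits();           // 63 -> 6 bits
  unsigned FullBits = AndI->getType()->getPrimitiveSizeInBits();
  if (Bits == 0 || Bits >= FullBits)
    return false;                                             // nothing to gain
  IRBuilder<> B(AndI);
  Type *NarrowTy = IntegerType::get(AndI->getContext(), Bits);
  Value *Narrow = B.CreateTrunc(AndI->getOperand(0), NarrowTy);
  Value *Masked = B.CreateAnd(Narrow, ConstantInt::get(NarrowTy, Mask->getZExtValue()));
  Value *Wide = B.CreateZExt(Masked, AndI->getType());
  AndI->replaceAllUsesWith(Wide);
  AndI->eraseFromParent();
  return true;
}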
We need to run this pass after link time optimization. What is the impact of this change? Probably won't affect area, because quartus would have already made this optimization. Let's double check. No change in cycles. Yep, no impact on area.
— Andrew Canis 2011/08/22 12:15
Getting rid of the array initialization takes us down to:
73735 / 2 = 36867 cycles
So it saves exactly 1024 cycles. Just noticed that there are actually no stores happening in the code right now, so that's actually cheating. Where are all these cycles coming from? Is it roughly 32 * 1024 = 32768? Whereas fully pipelined you could do it in 4 * 1024 = 4096. I forgot about unrolling. Would that fix this?
Interesting: run the command (note the -debug option):
opt -mem2reg -loops -loop-simplify -loop-unroll -unroll-threshold=192 -debug
This fully unrolls the 32-iteration inner loop but leaves the bigger outer loop.
Loop Unroll: F[main] Loop %
  Loop Size = 105
  Too large to fully unroll with count: 1024 because size: 107520>192
  will not try to unroll partially because -unroll-allow-partial not given
Now we finish in 26631 / 2 = 13315 cycles, so quite a bit better, but still worse. Try to partially unroll the outer loop? Doesn't work. How about fully unrolling the outer loop by setting the threshold to 107520? Wow, that produces a lot of code. Turning off the -debug flag. Now llc is taking forever. Oh shit, this is stupid: llvm just optimizes everything away. All that's left is printf statements.
Interesting. -unroll-allow-partial works if I increase unroll-threshold to 512:
Loop Unroll: F[main] Loop %
  Loop Size = 105
  Too large to fully unroll with count: 1024 because size: 107520>512
  partially unrolling with count: 4
  Trip Count = 1024
UNROLLING loop % by 4 with a breakout at trip 0!
I guess this is because four copies of the 105-instruction body still fit under the 512 threshold (512/105 ≈ 4), and 1024 is divisible by 4. The cycle count doesn't change at all though. So we can still get a 3x improvement by pipelining, which is expected because I think the inner loop has about 3 dependent operations.
There's a bug with the new polly:
acanis@acanis-desktop:~/work/legup/examples/popcount$ ~/work/legup/llvm/Debug+Asserts/bin/opt -load /home/acanis/work/legup/llvm/tools/polly/Debug+Asserts/lib/LLVMPolly.so
Error opening '/home/acanis/work/legup/llvm/tools/polly/Debug+Asserts/lib/LLVMPolly.so': libisl.so.7: cannot open shared object file: No such file or directory
  -load request ignored.
Damn. Can I statically link it in? Well for now I'll just do:
export LD_LIBRARY_PATH=/home/acanis/git/legup/cloog/install/lib/:$LD_LIBRARY_PATH
Getting an error with pollycc:
acanis@acanis-desktop:~/work/legup/examples/popcount$ ~/work/legup/llvm/tools/polly/utils/pollycc popcount.c
Polly support not available in opt
Looks like the python script parses the output of the opt help:
['opt', '-load', '/home/acanis/work/legup/llvm/tools/polly/Debug+Asserts/lib/LLVMPolly.so', '-help']
Wrong opt, updating my PATH
export PATH=/home/acanis/work/legup/llvm//Debug+Asserts/bin/:$PATH
Seems to be working now. pollycc produces an a.out file
Okay, my old code was in:
/home/acanis/work/legup/llvm/tools/polly_old/lib/IterativeModuloScheduling.cpp
Note: runOnScop() immediately returns false right now. Okay, I need to figure out how to move this file out of the polly directory…
— Andrew Canis 2011/08/19 12:15
Getting a strange error from quartus. “Word too long”. Okay turns out my PATH is longer than 1024 characters.
Looking at the c-to-verilog example. I think Nadav had pipelining implemented. There is a testbench inside the code. Looks like the two array parameters come from two dual-ported brams. There is some initialization of the arrays in the testbench:
integer i;
initial begin
  for (i = 0; i < (1<<(ADDRESS_WIDTH-1)); i = i + 1) begin
    mem[i] <= i;
  end
end
Looks like the mem is just initialized to 0, 1, 2, …
Is it typical to pass arrays into the main module like this in c-to-verilog?
We take significantly longer: 75783/2 = 37891 cycles vs ctoverilog: 41050ns / 10 = 4105 cycles. So about 10x slower, as expected because we don't have pipelining.
There are memory accesses every 40 / 10 = 4 cycles
# 40975w mem[ 1021] == 1021; in= 9
# 41015w mem[ 1022] == 1022; in= 9
# 41055w mem[ 1023] == 1023; in= 10
Having enough memory ports is crucial. Here there are actually 4 ports available. When we pipeline this we will only have 1…
— Andrew Canis 2011/08/18 12:15
Okay. Still a few modelsim warnings on mips, fir, memset. Fixed.
— Andrew Canis 2011/08/15 12:15
Recompiling llvm-gcc 2.8 on the eecg machines. First you need to compile llvm 2.8:
acanis@navy:~/llvm-2.8$ ./configure
acanis@navy:~/llvm-2.8$ make -j 2 ENABLE_OPTIMIZED=1
Then compile llvm-gcc:
acanis@navy:~/llvm-gcc-4.2-2.8.source/obj$ ../configure --target=i686-pc-linux-gnu --with-tune=generic --with-arch=pentium4 --prefix=`pwd`/../install --program-prefix=llvm- --enable-llvm=/home/acanis/llvm-2.8/ --enable-languages=c,c++
acanis@navy:~/llvm-gcc-4.2-2.8.source/obj$ make -j2 LLVM_VERSION_INFO=2.8
Received the error:
/brown/r/r0/acanis/llvm-gcc-4.2-2.8.source/obj/./gcc/xgcc -B/brown/r/r0/acanis/llvm-gcc-4.2-2.8.source/obj/./gcc/ -B/brown/r/r0/acanis/llvm-gcc-4.2-2.8.source/obj/../install/i686-pc-linux-gnu/bin/ -B/brown/r/r0/acanis/llvm-gcc-4.2-2.8.source/obj/../install/i686-pc-linux-gnu/lib/ -isystem /brown/r/r0/acanis/llvm-gcc-4.2-2.8.source/obj/../install/i686-pc-linux-gnu/include -isystem /brown/r/r0/acanis/llvm-gcc-4.2-2.8.source/obj/../install/i686-pc-linux-gnu/sys-include -O2 -O2 -g -O2 -DIN_GCC -DCROSS_DIRECTORY_STRUCTURE -W -Wall -Wwrite-strings -Wstrict-prototypes -Wmissing-prototypes -Wold-style-definition -isystem ./include -fPIC -g -DHAVE_GTHR_DEFAULT -DIN_LIBGCC2 -D__GCC_FLOAT_NOT_NEEDED -Dinhibit_libc -msse -c \
  ../../gcc/config/i386/crtfastmath.c \
  -o crtfastmath.o
/brown/r/r0/acanis/llvm-gcc-4.2-2.8.source/obj/./gcc/as: line 2: exec: -Q: invalid option
I'm going to try again but get rid of x86 specific target stuff:
acanis@navy:~/llvm-gcc-4.2-2.8.source/obj$ ../configure --prefix=`pwd`/../install --program-prefix=llvm- --enable-llvm=/home/acanis/llvm-2.8/ --enable-languages=c,c++
acanis@navy:~/llvm-gcc-4.2-2.8.source/obj$ make -j2 LLVM_VERSION_INFO=2.8
acanis@navy:~/llvm-gcc-4.2-2.8.source/obj$ make install
Okay that worked. The new version of llvm-gcc is in:
~/llvm-gcc-4.2-2.8.source/install/bin
Okay. So the mips bug was related to the fact we're still using llvm-gcc. I think we should move to clang. clang 2.9 works fine on mint.
Basic blocks don't have names in clang. This will make debugging more difficult. Clang has some warnings on the benchmarks: jpeg, malloc
memset fails with:
FAIL: memset
Dest Pointer: i8* %arr
Unknown pointer destination in intrinsic argument
UNREACHABLE executed at PreLTO.cpp:159!
0  opt            0x088d7379
1  opt            0x088d7a41
2                 0x4001e400 __kernel_sigreturn + 0
3  libc.so.6      0x402ae098 abort + 392
4  opt            0x088c4788 llvm::report_fatal_error(llvm::Twine const&) + 0
5  LLVMLegUp.so   0x404175a5 legup::LegUp::getIntrinsicMemoryAlignment(llvm::CallInst*) + 255
6  LLVMLegUp.so   0x40417846 legup::LegUp::lowerLegupInstrinsic(llvm::CallInst*, llvm::Function*) + 278
7  LLVMLegUp.so   0x40417b54 legup::LegUp::lowerIfIntrinsic(llvm::CallInst*, llvm::Function*) + 286
8  LLVMLegUp.so   0x4041aa07 legup::LegUp::runOnFunction(llvm::Function&) + 207
9  opt            0x0885e70d llvm::FPPassManager::runOnFunction(llvm::Function&) + 343
10 opt            0x0885e8f6 llvm::FPPassManager::runOnModule(llvm::Module&) + 114
11 opt            0x0885e3cc llvm::MPPassManager::runOnModule(llvm::Module&) + 398
12 opt            0x0885fc15 llvm::PassManagerImpl::run(llvm::Module&) + 129
13 opt            0x0885fc7b llvm::PassManager::run(llvm::Module&) + 39
14 opt            0x083f9d19 main + 4778
15 libc.so.6      0x40297775 __libc_start_main + 229
16 opt            0x083ea0b1
struct fails with:
llc: Ram.cpp:155: void legup::RAM::visitConstant(const llvm::Constant*, uint64_t*, std::stack<const llvm::Constant*, std::deque<const llvm::Constant*, std::allocator<const llvm::Constant*> > >&, std::stack<unsigned int, std::deque<unsigned int, std::allocator<unsigned int> > >&, std::stack<unsigned int, std::deque<unsigned int, std::allocator<unsigned int> > >&, unsigned int&, unsigned int&): Assertion `isa<ConstantAggregateZero>(c) || isa<ConstantPointerNull>(c)' failed.
0  llc        0x09144595
1  llc        0x09144c5d
2             0x4001e400 __kernel_sigreturn + 0
3  libc.so.6  0x402ae098 abort + 392
4  libc.so.6  0x402a55ce __assert_fail + 238
5  llc        0x086a4755 legup::RAM::visitConstant(llvm::Constant const*, unsigned long long*, std::stack<llvm::Constant const*, std::deque<llvm::Constant const*, std::allocator<llvm::Constant const*> > >&, std::stack<unsigned int, std::deque<unsigned int, std::allocator<unsigned int> > >&, std::stack<unsigned int, std::deque<unsigned int, std::allocator<unsigned int> > >&, unsigned int&, unsigned int&) + 353
6  llc        0x086a4f05 legup::RAM::initializeStruct() + 595
7  llc        0x086a5077 legup::RAM::buildInitializer() + 111
8  llc        0x086a50fd legup::RAM::generateMIF() + 47
9  llc        0x08671518 legup::VerilogWriter::printMemoryController() + 128
10 llc        0x086741e2 legup::VerilogWriter::print() + 214
11 llc        0x08661480 legup::LegupPass::printVerilog(std::set<llvm::Function*, std::less<llvm::Function*>, std::allocator<llvm::Function*> >) + 112
12 llc        0x086616a2 legup::LegupPass::doFinalization(llvm::Module&) + 220
13 llc        0x09072031 llvm::FPPassManager::doFinalization(llvm::Module&) + 75
14 llc        0x09076902 llvm::FPPassManager::runOnModule(llvm::Module&) + 178
15 llc        0x09076398 llvm::MPPassManager::runOnModule(llvm::Module&) + 398
16 llc        0x09077be1 llvm::PassManagerImpl::run(llvm::Module&) + 129
17 llc        0x09077c47 llvm::PassManager::run(llvm::Module&) + 39
18 llc        0x085fd21f main + 2887
19 libc.so.6  0x40297775 __libc_start_main + 229
20 llc        0x085fb511
— Andrew Canis 2011/08/12 12:15
Added cloog/isl into the repository.
git clone git://repo.or.cz/cloog.git
cd cloog
./get_submodules.sh
./autogen.sh
./configure --prefix=~/work/polly/cloog/install
make
make install

I needed to copy .gitmodules into the base legup folder and modify the path:

-path = isl
+path = cloog/isl

Then run:

acanis@acanis-desktop:~/work/legup$ cloog/get_submodules.sh
Submodule 'isl' (git://repo.or.cz/isl.git) registered for path 'cloog/isl'
Cloning into cloog/isl…
warning: templates not found /usr/local/share/git-core/templates
remote: Counting objects: 9585, done.
remote: Compressing objects: 100% (2180/2180), done.
remote: Total 9585 (delta 7127), reused 9585 (delta 7127)
Receiving objects: 100% (9585/9585), 2.05 MiB | 328 KiB/s, done.
Resolving deltas: 100% (7127/7127), done.
Submodule path 'cloog/isl': checked out '24e309472a53920bdf19130a12c9ccec320c1867'
Now I added the new folder:
git add cloog/isl
Whoops. That didn't work. Okay I don't think I can use submodules here. I just need to check out both paths.
Looking in cloog/.gitmodules the repo for isl is git://repo.or.cz/isl.git

cloog revision:

commit 225c2ed62fe37a4db22bf4b95c3731dab1a50dde
Author: Sven Verdoolaege <skimo@kotnet.org>
Date:   Sun Jul 10 09:27:24 2011 +0200

isl revision:

commit e536653cbc99d7349eafa5e1a9cba873db3135eb
Author: Sven Verdoolaege <skimo@kotnet.org>
Date:   Sat Aug 6 22:30:40 2011 +0200

Wait. This revision is different than the submodule one listed above… Doesn't matter.

Seeing an error for the hybrids:

acanis@acanis-desktop:~/work/legup/examples/chstone_hybrid/adpcm$ ./sim_all_functions
…
export LEGUP_ACCELERATOR_FILENAME=adpcm; \
../../../llvm/Debug+Asserts/bin/opt -legup-config=config.tcl -load=../../../llvm/Debug+Asserts/libLLVMLegUp.so -legup-sw-only < adpcm.prelto.bc > adpcm.prelto.sw.bc
LLVM ERROR: IO failure on output stream.
This error can't be debugged with gdb. Looking in raw_ostream.cpp
// If there are any pending errors, report them now. Clients wishing
// to avoid report_fatal_error calls should check for errors with
// has_error() and clear the error flag with clear_error() before
// destructing raw_ostream objects which may have errors.
if (has_error())
  report_fatal_error("IO failure on output stream.");
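In other words, the client writing the stream is expected to handle the failure itself. A hedged sketch of the pattern the comment describes (this is just how the API is meant to be used, not a LegUp fix, and the file-writing helper is hypothetical):

#include "llvm/Support/raw_ostream.h"
#include <string>

using namespace llvm;

// Write to a file and surface any I/O error instead of letting the
// raw_ostream destructor call report_fatal_error().
bool writeOutputSafely(const char *Path) {
  std::string ErrInfo;
  raw_fd_ostream Out(Path, ErrInfo);
  if (!ErrInfo.empty()) {
    errs() << "could not open " << Path << ": " << ErrInfo << "\n";
    return false;
  }
  Out << "...bitcode or text output goes here...\n";
  if (Out.has_error()) {
    errs() << "error writing " << Path << "\n";
    Out.clear_error();   // prevents "IO failure on output stream"
    return false;
  }
  return true;
}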
— Andrew Canis 2011/08/10 12:15
llc infinite loops on gsm. Adding -debug segfaults. I'm going to have to recompile in debug mode.
./configure --disable-optimized --with-cloog=/home/acanis/work/polly/cloog/install/ --with-isl=/home/acanis/work/polly/cloog/install/
Seems to be something to do with the new SDC scheduler. Or could it just be taking a long time? I doubt it, gsm never took this long before. There are 30 recursive calls to the function:
(gdb) bt
#0  0xb744341d in memmove () from /lib/tls/i686/cmov/libc.so.6
#1  0x0916e8dd in mat_appendrow ()
#2  0x09161c57 in add_constraintex ()
#3  0x086a90db in legup::SDCScheduler::addTimingConstraints (this=0x97754b8, Root=0x978d6f8, Curr=0x978ac78, PartialPathDelay=108.549011) at SDCScheduler.cpp:227
#4  0x086a9117 in legup::SDCScheduler::addTimingConstraints (this=0x97754b8, Root=0x978d6f8, Curr=0x978ad18, PartialPathDelay=104.96801) at SDCScheduler.cpp:232
#5  0x086a9117 in legup::SDCScheduler::addTimingConstraints (this=0x97754b8, Root=0x978d6f8, Curr=0x978af98, PartialPathDelay=101.168007) at SDCScheduler.cpp:232
...
#29 0x086a9117 in legup::SDCScheduler::addTimingConstraints (this=0x97754b8, Root=0x978d6f8, Curr=0x978d6f8, PartialPathDelay=4.29199982) at SDCScheduler.cpp:232
---Type <return> to continue, or q <return> to quit---
#30 0x086a91f5 in legup::SDCScheduler::addTimingConstraints (this=0x97754b8, F=@0x970f0d8) at SDCScheduler.cpp:246
#31 0x086a98c6 in legup::SDCScheduler::runOnFunction (this=0x97754b8, F=@0x970f0d8) at SDCScheduler.cpp:430
#32 0x09065971 in llvm::FPPassManager::runOnFunction (this=0x9774d00, F=@0x970f0d8) at PassManager.cpp:1513
#33 0x09065b55 in llvm::FPPassManager::runOnModule (this=0x9774d00, M=@0x970dee0) at PassManager.cpp:1535
#34 0x09065630 in llvm::MPPassManager::runOnModule (this=0x970e378, M=@0x970dee0) at PassManager.cpp:1589
#35 0x09066e65 in llvm::PassManagerImpl::run (this=0x9713f00, M=@0x970dee0) at PassManager.cpp:1671
#36 0x09066ecb in llvm::PassManager::run (this=0xbfd39164, M=@0x970dee0) at PassManager.cpp:1715
#37 0x085f8f0f in main (argc=6, argv=0xbfd392a4) at llc.cpp:396
There are about 300 recursive calls to addTimingConstraints() for some reason. There's actually a huge basic block in gsm with about 100 instructions.
bb.nph.i.i.i: ; preds = %bb17.i.i
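Based purely on the backtrace, the recursion appears to have roughly this shape (a hedged sketch, not the real SDCScheduler.cpp, with a hypothetical Node type): it walks back through combinational dependencies, accumulating path delay and touching the LP at each level, so a long dependence chain in a huge basic block means a very deep recursion and a lot of add_constraintex calls.

#include <cstddef>
#include <vector>

// Hypothetical node type for the sketch only.
struct Node {
  std::vector<Node *> Preds;   // combinational predecessors in the DAG
  double Delay;                // estimated operator delay (ns)
};

void addTimingConstraints(Node *Root, Node *Curr, double PartialPathDelay) {
  for (std::size_t i = 0; i < Curr->Preds.size(); ++i) {
    Node *Pred = Curr->Preds[i];
    double NewDelay = PartialPathDelay + Pred->Delay;
    // here the real code would add an LP constraint between Root and Pred
    // (the add_constraintex frame in the backtrace), e.g. forcing enough
    // cycles between them to cover NewDelay at the target clock period
    addTimingConstraints(Root, Pred, NewDelay);
  }
}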
Seeing one last problem with make tiger
../../mips-binutils/bin/mipsel-elf-ld -T ../../tiger/linux_tools/lib/prog_link.ld -e main struct.o ../../tiger/tool_source/lib/altera_avalon_performance_counter.o -o struct.elf -EL -L ../../tiger/linux_tools/lib -lgcc -lfloat -luart
struct.o: In function `main':
(_main_section+0x2c): undefined reference to `memcpy'
struct.o: In function `main':
(_main_section+0x6c): undefined reference to `memcpy'
make: *** [tiger] Error 1
I don't get this. Why didn't I run into this before? Don't we lower memcpys into legup instructions? I see the memcpy in the .s file:
main:
...
# BB#0:        # %entry
...
	jal	memcpy
I don't see a memcpy in the .ll (there is a legup_memcpy_4 though). Damn. This must be created in the MIPS backend? I might have to write a memcpy manually. Just like Mark had to write a printf.
Tiger libraries are stored in:
../../tiger/linux_tools/lib
Sources are in:
../../tiger/tool_source/lib
I can find memcpy inside libgcc.a, which should be included. I see a mem.c file in the source directory. It compiles fine if I add:
../../tiger/tool_source/lib/mem.o
So this compiles okay. But now make emulwatch doesn't match:
acanis@acanis-desktop:~/work/legup/examples/struct$ diff -u lli.txt sim.txt
--- lli.txt	2011-08-04 16:04:13.000000000 -0400
+++ sim.txt	2011-08-04 16:04:15.000000000 -0400
@@ -69,7 +69,7 @@
 %exitcond=0
 legup_memcpy_4:bb
 %indvar=d
-%3=cdcd1514
+%3=1514
 %indvar.next=e
 %exitcond=1
 legup_memcpy_4:return
The code looks like:
void legup_memcpy_4(uint32_t * d, const uint32_t * s, size_t n) {
    uint32_t * dt = d;
    const uint32_t * st = s;
    n >>= 2;
    while (n--)
        *dt++ = *st++;
}
The .ll:
bb:                                               ; preds = %bb, %bb.nph
  %indvar = phi i32 [ 0, %bb.nph ], [ %indvar.next, %bb ]
  %st.04 = getelementptr i32* %s, i32 %indvar
  %dt.03 = getelementptr i32* %d, i32 %indvar
  %3 = load i32* %st.04, align 4
  store i32 %3, i32* %dt.03, align 4
  %indvar.next = add i32 %indvar, 1
  %exitcond = icmp eq i32 %indvar.next, %tmp
  %4 = call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([111 x i8]* @11, i32 0, i32 0), i32 %indvar, i32 %3, i32 %indvar.next, i1 %exitcond)
  br i1 %exitcond, label %return, label %bb
So it looks like the load doesn't match. cdcd = 1100 1101 1100 1101. For some reason this is zeroed out in gxemul. Indvar = d = 13. Actually this even happens with “make watch”, but the final result is still correct. I completely removed the legup_memcpy_4 code (disabling the prelto pass) and it still fails. Very strange. Now the make emulwatch simulation just stops right after:
pointSum:return %retval1=11
Why would nothing else get printed? There are three calls to pointSum. gxemul only calls the function once.
Interesting. So it looks like the breakpoint never triggers at the return address of main. Instead it triggers at the end of the code:
../lib/gxemul.exp -E testmips -e R3000 struct.elf -p `../../tiger/linux_tools/lib/../find_ra struct.emul.src` -p 0xffffffff80000180 -q
exit at: pc = 0xffffffff80000180
reg: v0 = 0x0000000000000011
The return address of main is:
acanis@acanis-desktop:~/work/legup/examples/struct$ ../../tiger/linux_tools/lib/../find_ra struct.emul.src
0xffffffff800319f8
So the return address of pointSum is probably incorrect… If I comment out the pointSum calls, everything works fine. Let's see what the return address is:
acanis@acanis-desktop:~/work/legup/examples/struct$ gxemul -E testmips -e R3000 struct.elf -p `../../tiger/linux_tools/lib/../find_ra struct.emul.src` -p 0xffffffff80000180 -q -p 0xffffffff8003002c
BREAKPOINT: pc = 0xffffffff8003002c
(The instruction has not yet executed.)
GXemul> s
ffffffff8003002c: 27bd0008  addiu  sp,sp,8
GXemul>
ffffffff80030030: 03e00008  jr  ra  <sum+0x110>
ffffffff80030034: 00000000 (d)  nop
GXemul>
ffffffff80030148: 00028021  addu  s0,zr,v0
Looks fine; it jumps back right after the pointSum call. Wait, what's this? I step a few more times and:
GXemul>
ffffffff80030150: 8fa4003e  lw  a0,62(sp)  [0xffffffffa0007e6e]
[ exception ADEL vaddr=0xffffffffa0007e6e pc=0xffffffff80030150 <sum+0x118> ]
GXemul>
ffffffff80000180: 00000000  nop
BREAKPOINT: pc = 0xffffffff80000180
(The instruction has not yet executed.)
There should really be another check in the test suite that we never break on the second breakpoint
Looking up this exception:
4.8.9 Address Error Exception — Instruction Fetch/Data Access
An address error exception occurs on an instruction or data access when an attempt is made to execute one of the following:
• Fetch an instruction, load a word, or store a word that is not aligned on a word boundary
• Load or store a halfword that is not aligned on a halfword boundary
• Reference the kernel address space from user mode
Note that in the case of an instruction fetch that is not aligned on a word boundary, PC is updated before the condition is detected. Therefore, both EPC and BadVAddr point to the unaligned instruction address. In the case of a data access the exception is taken if either an unaligned address or an address that was inaccessible in the current processor mode was referenced by a load or store instruction.
Cause Register ExcCode Value:
  ADEL: Reference was a load or an instruction fetch
  ADES: Reference was a store
The lw was trying to load a 32-bit word from address sp + 62 into register a0. 62 = 0x3e, and adding that to sp gives 0xffffffffa0007e6e (as shown by vaddr above). The last four bits are 1110, but the low two bits must be 0 for a 32-bit aligned access. I'm going to file a bug.
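A quick sanity check of that arithmetic (plain C++, using the sp value from the register dump below):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main() {
  uint64_t sp = 0xffffffffa0007e30ULL;   // sp from the gxemul register dump
  uint64_t vaddr = sp + 0x3e;            // the lw offset of 62 bytes
  printf("vaddr = 0x%llx, low two bits = %llu\n",
         (unsigned long long)vaddr, (unsigned long long)(vaddr & 0x3));
  assert((vaddr & 0x3) != 0 && "unaligned -> exactly the ADEL case");
  return 0;
}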
GXemul> reg
cpu0:  pc = 0xffffffff80000180  < no symbol >
...
cpu0:  a0 = 0x000000000b0a0908   s4 = 0x0000000000000004
...
cpu0:  t5 = 0x0000000000000000   sp = 0xffffffffa0007e30
I'm just going to file a bug report. Okay no. It seems to be something to do with this line:
../../mips-binutils/bin/mipsel-elf-ld -T ../../tiger/linux_tools/lib/prog_link_sim.ld -e main struct.o -o struct.elf -EL -L ../../tiger/linux_tools/lib -lgcc -lfloat -luart_el_sim -lmem_el_sim
That's because nothing gets run unless we link in those libraries.
You can run the Mips test Victor submitted manually like this:
acanis@acanis-desktop:~/work/legup/llvm/test$ ../Debug+Asserts/bin/llvm-lit -v CodeGen/Mips/2010-07-20-Switch.ll
-- Testing: 1 tests, 4 threads --
PASS: LLVM :: CodeGen/Mips/2010-07-20-Switch.ll (1 of 1)
Testing Time: 0.03s
  Expected Passes    : 1
Submitted an LLVM bug report: http://llvm.org/bugs/show_bug.cgi?id=10634
So lets just wait and see what Bruno Lopes has to say. Could it be an issue with the llvm-gcc?
The following tests don't run gxemul:
./div_const/dg.exp
./overflow_intrinsic/dg.exp
./signeddiv/dg.exp
./phi/dg.exp
./unaligned/dg.exp
./cpp/dg.exp
— Andrew Canis 2011/08/04 12:15
New LLVM version is almost working but I'm seeing the error:
[ 93%] Built target LLVMPolly
make -f tools/llvm-config/CMakeFiles/llvm-config.target.dir/build.make tools/llvm-config/CMakeFiles/llvm-config.target.dir/depend
make[2]: Entering directory `/home/acanis/work/legup/build'
cd /home/acanis/work/legup/build && /home/acanis/cmake-2.8.4-Linux-i386/bin/cmake -E cmake_depends "Unix Makefiles" /home/acanis/work/legup/llvm /home/acanis/work/legup/llvm/tools/llvm-config /home/acanis/work/legup/build /home/acanis/work/legup/build/tools/llvm-config /home/acanis/work/legup/build/tools/llvm-config/CMakeFiles/llvm-config.target.dir/DependInfo.cmake --color=
make[2]: Leaving directory `/home/acanis/work/legup/build'
make -f tools/llvm-config/CMakeFiles/llvm-config.target.dir/build.make tools/llvm-config/CMakeFiles/llvm-config.target.dir/build
make[2]: Entering directory `/home/acanis/work/legup/build'
/home/acanis/cmake-2.8.4-Linux-i386/bin/cmake -E cmake_progress_report /home/acanis/work/legup/build/CMakeFiles
[ 93%] Updating LibDeps.txt if necessary...
cd /home/acanis/work/legup/build/tools/llvm-config && /home/acanis/cmake-2.8.4-Linux-i386/bin/cmake -E copy_if_different LibDeps.txt.tmp LibDeps.txt
Error copying file (if different) from "LibDeps.txt.tmp" to "LibDeps.txt".
make[2]: *** [tools/llvm-config/LibDeps.txt] Error 1
I had to fix the llvm-config/CMakeLists.txt file as mentioned in:
http://comments.gmane.org/gmane.comp.compilers.llvm.cvs/89287
Now I see:
CMakeFiles/llvm-mc.dir/llvm-mc.cpp.o: In function `llvm::InitializeAllTargetMCs()':
/home/acanis/work/legup/build/include/llvm/Config/Targets.def:41: undefined reference to `LLVMInitializeVerilogTargetMC'
collect2: ld returned 1 exit status
Okay, slight interface change in LLVM.
Linker error for llc:
Linking CXX executable ../../bin/llc
../../lib/libLLVMVerilog.a(SDCScheduler.cpp.o): In function `legup::SDCScheduler::scheduleAXAP(bool)':
/home/acanis/work/legup/llvm/lib/Target/Verilog/SDCScheduler.cpp:379: undefined reference to `set_obj_fnex'
The cmake cache is really annoying. Every time you modify the cmake files you have to run:
rm CMakeCache.txt
To figure out what's happening when you're making:
make VERBOSE=1
Seems like tcl is being added properly here but the lpsolve library isn't being added:
/usr/bin/c++ -fPIC -fno-rtti -g CMakeFiles/llc.dir/llc.cpp.o -o ../../bin/llc -rdynamic -ltcl8.5 ../../lib/libLLVMVerilog.a (SKIPPED) -ltcl8.5 -ldl -lpthread
If I manually rerun this with “-L/usr/lib/lp_solve -llpsolve55” added it works fine. Okay. This was a problem with the llvm/CMakeLists.txt file. Great. Everything compiles with cmake. Now let's try autoconf.
Compiler warnings:
LegupTcl.cpp: In function ‘int legup::set_accelerator_function(void*, Tcl_Interp*, int, const char**)’:
LegupTcl.cpp:23: warning: cast from type ‘const char*’ to type ‘char*’ casts away constness
LegupTcl.cpp: In function ‘int legup::set_operation_attributes(void*, Tcl_Interp*, int, const char**)’:
LegupTcl.cpp:35: warning: cast from type ‘const char*’ to type ‘char*’ casts away constness
...
LegupTcl.cpp:97: warning: cast from type ‘const char*’ to type ‘char*’ casts away constness
LegupTcl.cpp: In function ‘int legup::set_device_specs(void*, Tcl_Interp*, int, const char**)’:
LegupTcl.cpp:108: warning: cast from type ‘const char*’ to type ‘char*’ casts away constness
...
LegupTcl.cpp:136: warning: cast from type ‘const char*’ to type ‘char*’ casts away constness
Strange error:
make[2]: Entering directory `/home/acanis/work/legup/llvm/unittests/VMCore'
llvm[2]: Compiling DerivedTypesTest.cpp for Release+Asserts build
DerivedTypesTest.cpp: In function ‘void<unnamed>::PR7658()’:
DerivedTypesTest.cpp:24: error: ‘PATypeHolder’ was not declared in this scope
I think the file has been deleted. Yep.
It seems like our llvm-gcc is too old:
llvm-gcc array.c -emit-llvm -c -fno-builtin -m32 -malign-double -I ../lib/include/ -O0 -fno-inline-functions -o array.prelto.1.bc
# linking may produce llvm mem-family intrinsics
../../llvm/Release+Asserts/bin/llvm-ld -disable-inlining -disable-opt array.prelto.1.bc -b=array.prelto.linked.bc
llvm-ld: error: Cannot load file 'array.prelto.1.bc': Bitcode file 'array.prelto.1.bc' could not be loaded: Invalid ALLOCA record
Clang doesn't have this error. But there are warnings:
clang array.c -emit-llvm -c -fno-builtin -m32 -malign-double -I ../lib/include/ -O0 -fno-inline-functions -o array.prelto.1.bc
clang: warning: argument unused during compilation: '-malign-double'
And also verilog errors:
-- Compiling module fct
** Error: array.v(550): 'LEGUP1_F_fct_BB_' already declared in this scope.
...
** Error: array.v(566): 'LEGUP3_F_fct_BB_' already declared in this scope.
-- Compiling module main
** Error: array.v(1338): 'LEGUP1_F_main_BB_' already declared in this scope.
...
** Error: array.v(1377): 'LEGUP8_F_main_BB_' already declared in this scope.
Damn. I can't run llvm-gcc 2.9:
llvm-gcc: /lib/tls/i686/cmov/libc.so.6: version `GLIBC_2.11' not found (required by llvm-gcc)
Okay great. llvm-gcc 2.8 works.
Seeing some bugs:
../../../llvm/Release+Asserts/bin/opt: symbol lookup error: ../../../llvm/Release+Asserts/lib/LLVMLegUp.so: undefined symbol: _ZN4llvm17IntrinsicLowering18LowerIntrinsicCallEPNS_8CallInstE
I thought I already fixed this… Looks fine. I included the dependency in lib/Transforms/LegUp/Makefile:
USED_LIBS = LLVMCodeGen
Tried adding back in:
LDFLAGS = $(LLVM_OBJ_ROOT)/lib/CodeGen/$(BuildMode)/IntrinsicLowering.o
Strange. So that worked. Wow. dfmul is suddenly fixed! I'm guessing this was caused by the newer llvm-gcc version?
llc seems to be running into an infinite loop on gsm…
— Andrew Canis 2011/08/03 12:15
Autoconf doesn't work for the latest git llvm and polly?
llvm[0]: Compiling ScheduleOptimizer.cpp for Debug+Asserts build (PIC) ScheduleOptimizer.cpp:30:26: error: isl/schedule.h: No such file or directory
Trying make -n to show makefile commands. The schedule.h file doesn't actually exist… Is this a new header file that has been added in the past few months? Yep. Needed to update my cloog version. Okay this works now. So I actually need to distribute these header files and .so manually. Damn. This also means I need to update LLVM.
Updating to commit:
commit b4f4cbd199318901d12737ded05ebebd8cb21336 Author: David Greene <greened@obbligato.org> Date: Fri Jul 29 20:50:18 2011 +0000
Damn. The merge totally fails. I see a lot of “both added” conflicts.
git checkout --theirs -- Transforms/ git add -u Transforms/
Actually usually you can just manually merge the makefiles and files we changed (looking at git log), then just checkout/add the whole directory.
Testing the Mips backend again. I'm going to submit some bug reports.
— Andrew Canis 2011/07/29 12:15
Adding polly to the repo. Cmake was working before. Trying to get autoconf working. In llvm running ./configure --with-cloog=~/work/polly/cloog/install/ --with-isl=~/work/polly/cloog/install/ gives:
=== configuring in tools/polly (/home/acanis/work/legup/llvm/tools/polly) ... checking for isl in inc_not_give_isl, lib_not_give_isl... configure: error: isl required but not found configure: error: ./configure failed for tools/polly
Wow. You can't use ~ in the path! So annoying.
./configure --with-cloog=/home/acanis/work/polly/cloog/install/ --with-isl=/home/acanis/work/polly/cloog/install/
How to do live variable analysis in SSA? I see from LiveVariables.cpp:
It uses the dominance properties of SSA form to efficiently compute live variables for virtual registers
What does this mean?
// Calculate live variable information in depth first order on the CFG of the // function. This guarantees that we will see the definition of a virtual // register before its uses due to dominance properties of SSA (except for PHI // nodes, which are treated as a special case).
Oh you can just do a depth first traversal of the CFG.
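A rough sketch of the idea (illustrative types, not LLVM's actual LiveVariables implementation): because SSA definitions dominate their uses, a depth-first walk of the CFG sees each definition before its non-PHI uses, and liveness can then be filled in by walking from each use back toward its definition:

#include <map>
#include <set>
#include <vector>

using Reg = int;
struct Block { std::vector<Block*> preds, succs; };

// Mark r live-in to every block on the paths from a use back to its def.
// Defs dominate uses in SSA, so this backwards walk always terminates at
// the defining block. A PHI operand is treated as a use at the end of the
// corresponding predecessor block instead.
void markLive(Reg r, Block *useBB, Block *defBB,
              std::map<Reg, std::set<Block*>> &liveIn) {
    if (useBB == defBB) return;                  // live range ends at the def
    if (!liveIn[r].insert(useBB).second) return; // already propagated here
    for (Block *p : useBB->preds)
        markLive(r, p, defBB, liveIn);
}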
— Andrew Canis 2011/07/26 12:15
Testing the git subtree method locally. Wow. Ran into a really annoying bug with git merge subtree. Turns out you need to specify the directory location otherwise the merge won't work properly:
I see a lot of “both added” conflicts. I'm just going to take LLVM's version and then manually merge the autoconfig changes
git checkout --theirs -- . git add -u .
— Andrew Canis 2011/07/25 12:15
What's the status on the LLVM update and loop pipelining integration?
There's still a bug with gxemul with the new LLVM mips backend:
Running ./chstone/dfmul/dg.exp ... FAIL: gxemul simulation. Expected: reg: v0 = 0x0000000000000000
make emulwatch gives:
acanis@acanis-desktop:~/work/legup/examples/chstone/dfmul$ diff -u sim.txt lli.txt --- sim.txt 2011-07-07 14:55:20.000000000 -0400 +++ lli.txt 2011-07-07 14:55:18.000000000 -0400 @@ -160,12 +160,14 @@ %87=ffff000000000000 %88=1 main:bb3.i26.i - %90=0 + %90=1 +main:bb5.i27.i + %retval.i.i=ffff000000000000 main:float64_mul.exit - %181=3ff8000000000000 + %181=ffff000000000000 %183=ffff000000000000 - %184=1 - %186=1 + %184=0 + %186=0 %189=4 %exitcond=1 main:bb2
The value of %90 is wrong:
float64_is_signaling_nan.exit.i.i: %84 = phi i32 [ %80, %bb.i.i.i ], [ %retval.i11.i.i, %float64_is_signaling_nan.exit14.i.i ], [ 0, %bb16.i.float64_is_signaling_nan.exit14.i.i_crit_edge ] bb3.i26.i: ; preds = %float64_is_signaling_nan.exit.i.i %90 = icmp eq i32 %84, 0
In both cases we're coming from main:bb16.i.float64_is_signaling_nan.exit14.i.i_crit_edge. So %84 = 0. Which is also correct in both traces in basic block main:float64_is_signaling_nan.exit.i.i
Where is this in the assembly? Looking at dfmul.s:
$BB0_40: # %bb3.i26.i # in Loop: Header=BB0_1 Depth=1 addiu $19, $zero, 0 lui $16, %hi(__unnamed_24) xor $19, $16, $19 addiu $4, $16, %lo(__unnamed_24) sltu $5, $19, 1 jal mprintf
From below. I already looked at this. If $19 < 1 then $5 = 1 else $5 = 0. If $19 represents %84 and $5 represents %90 then when $19=0 then $5=1. The sim says $5 (%90) is 0 when it should be 1. I would like to step through this code in gxemul. “make emul” runs the following commands:
../../../mips-binutils/bin/mipsel-elf-ld -T ../../../tiger/linux_tools/lib/prog_link_sim.ld -e main dfmul.o -o dfmul.elf -EL -L ../../../tiger/linux_tools/lib -lgcc -lfloat -luart_el_sim ../../../mips-binutils/bin/mipsel-elf-objdump -d dfmul.elf > dfmul.emul.src gxemul -E testmips -e R3000 dfmul.elf -p `../../../tiger/linux_tools/lib/../find_ra dfmul.emul.src` -p 0xffffffff80000180 -q
Before running “make emul” you need to compile the .s file with:
../../../mips-binutils/bin/mipsel-elf-as dfmul.s -mips1 -mabi=32 -o dfmul.o -EL ../../../mips-binutils/bin/mipsel-elf-ld -T ../../../tiger/linux_tools/lib/prog_link.ld -e main dfmul.o ../../../tiger/tool_source/lib/altera_avalon_performance_counter.o -o dfmul.elf -EL -L ../../../tiger/linux_tools/lib -lgcc -lfloat -luart ../../../mips-binutils/bin/mipsel-elf-objdump -D dfmul.elf > dfmul.src ../../../tiger/linux_tools/lib/../elf2sdram dfmul.elf sdram.dat
Doing an instruction trace with -i. The %90 is printed at line 305279.
Fixing lpsolve dependency. Makefile.config is generated by configure from Makefile.config.in. Useful guide: http://llvm.org/docs/MakefileGuide.html#Makefile.config
I had to add a new macro called AX_EXT_HAVE_LIB() because the lpsolve library isn't installed in /usr/lib but in /usr/lib/lp_solve/. The new macro adds the appropriate -L/usr/lib/lp_solve/ flag.
Note that both AX_EXT_HAVE_LIB and AC_SEARCH_LIBS modify the Makefile.config LIBS variable.
For some reason the makefile is broken. The LDFLAGS aren't being added properly:
/home/acanis/git/legup/llvm/Release/bin/tblgen: error while loading shared libraries: liblpsolve55.so: cannot open shared object file: No such file or directory
Okay I just changed this to use liblpsolve_pic.a which is compiled with -fPIC to allow shared linkage.
— Andrew Canis 2011/07/07 12:15
Okay. “make test_tiger_sim” seems to be failing with my new changes to libuart. Looks like it's a problem with mprintf not working. “make tigersim” doesn't produce the expected output.
Alright so adding this back into uart.h (included from stdio.h):
#define printf mprintf
But I still don't get the right output from tiger modelsim: For mips:
# 1008533759
For aes:
# 1008533759
What does this number mean? gxemul is working fine… Looks like an uninitialized value. Strange, when I explicitly add:
main_result = 0; printf ("%d\n", main_result);
I still get the same thing. mprintf() seems to be totally broken:
printf ("---->%d %d %d %d\n", 0, 1, -1, main_result);
Gives:
# ---->1008533759 935190524 201326600 0
Where are these numbers coming from?? Is it some sort of bug in the llvm mips backend maybe? I should test mprintf with gxemul. For some reason printf isn't working with gxemul. Strange, because make emulwatch uses printf. I do see a little bit of magic in the emulwatch target:
sed -i "s/\tprintf/\tmprintf/g" mips.s
Oh shit. I need to run “make tiger” first _before_ running “make emul”. Okay, I think something is broken with mprintf. This:
printf ("Start\n"); printf ("---->'%d' '%d' '%d' '%d'\n", 0, 1, -1, main_result); printf ("End\n");
Doesn't simulate properly in gxemul:
$ make tiger;make emul ... Start ---->'ffffffff80000180: 00000000 nop BREAKPOINT: pc = 0xffffffff80000180 (The instruction has not yet executed.)
The code dies in the middle of mprintf(). In particular, the variable arguments seem to be failing:
va_arg(arg, int)
Seems to crash the whole program. I bet the mips backend doesn't support variable arguments… I see in the release notes for a newer LLVM version something about improved support for variable arguments in the mips backend.
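For reference, this is the standard stdarg pattern that mprintf presumably relies on (a hedged sketch, not the actual libuart code); if the backend miscompiles va_start/va_arg, everything after the format string comes out as garbage:

#include <stdarg.h>

// sum a variable number of ints; the va_arg call here is the same kind of
// access that appears to crash inside mprintf
int sum_ints(int count, ...) {
    va_list ap;
    va_start(ap, count);
    int total = 0;
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);
    va_end(ap);
    return total;
}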
I'm going to have to install the mips-gcc after all. Installing from the site: http://crosstool-ng.org/. I'm putting the mips gcc in ~/crosstool/gcc; gcc gets installed in ~/x-tools/.
I need to recompile with hardware-float to avoid this warning:
../../../mips-binutils/bin/mipsel-elf-ld: Warning: mips.elf uses hard float, ../../../tiger/linux_tools/lib/libuart.a(uart.o) uses soft float
— Andrew Canis 2011/07/06 12:15
After Mark's push function_pointer seems to be failing:
make[1]: Leaving directory `/home/acanis/git/legup/examples/function_pointer' function_pointer.c: In function ‘a’: function_pointer.c:3: warning: ‘return’ with a value, in function returning void function_pointer.c: In function ‘b’: function_pointer.c:4: warning: ‘return’ with a value, in function returning void llc: utils.cpp:48: llvm::Function* legup::getCalledFunction(llvm::CallInst*): Assertion `called' failed.
Of course. Because function pointers aren't supported by LegUp. I should make this error more user friendly. Added new test suite files for this.
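CallInst::getCalledFunction() returns null for calls through a function pointer, so a friendlier check could look something like this (a sketch with a made-up helper name, not the exact code in utils.cpp):

#include "llvm/Instructions.h"
#include "llvm/Support/ErrorHandling.h"

// sketch: give a readable error for indirect calls instead of tripping an assert
llvm::Function *getCalledFunctionOrDie(llvm::CallInst *CI) {
    llvm::Function *called = CI->getCalledFunction(); // null for indirect calls
    if (!called)
        llvm::report_fatal_error(
            "LegUp does not support calls through function pointers");
    return called;
}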
llist is failing because NULL is undeclared. NULL is normally declared in stdio.h
— Andrew Canis 2011/07/05 12:15
I just pulled Victor's fix to mprintf(). So it seems to work. The make emulwatch now doesn't show any differences.
Interesting. So when I run:
make emulwatch make emultest
The result is correct. But running make tiger; make emultest fails:
exit at: pc = 0xffffffff80031d4c reg: v0 = 0x0000000000000002
To see the output from the dfmul printf, run “make emul”. I see two discrepancies:
a_input=7ff0000000000000 b_input=ffffffffffffffff expected=ffffffffffffffff output=7ff8000000000000 a_input=3ff0000000000000 b_input=ffff000000000000 expected=ffff000000000000 output=3ff8000000000000
But make emulwatch doesn't show any difference for these errors. How do I track down the problem? Well, this is because “make emulwatch” has the correct results. So adding printfs everywhere makes the bug go away.
Strange. When I diff the .src file between the watch version and the original the watch has a slightly different _i2h function:
--- dfmul.emul.src 2011-06-06 23:16:59.000000000 -0400 +++ watch.src 2011-06-06 23:16:53.000000000 -0400 @@ -125,7 +125,7 @@ 800301ac: 00000000 nop 800301b0: 00021880 sll v1,v0,0x2 800301b4: 3c028003 lui v0,0x8003 -800301b8: 244223a0 addiu v0,v0,9120 +800301b8: 24423060 addiu v0,v0,12384 800301bc: 00621021 addu v0,v1,v0 800301c0: 8c420000 lw v0,0(v0) 800301c4: 00000000 nop
There are massive differences in the main function. Very strange. When I shrink the array size to 2 (the two errors) the results are correct. If I remove the bottom 10 elements I still see the error. If I reduce it to 5 elements, the first error goes away. What could cause this? Some kind of bug with the stack when calling a function? When I change N to be 10 when there are only 5 array elements I get the same bug. If I call float64_mul on the first element 10 times I don't see the problem. Is this function call related? Weird, when I comment out the printf I don't get the error. But it can't be the printf because “make emulwatch” worked.
Specifically it appears to be the printing of a_input:
// error: printf ("a_input=%016llx\n", a_input[i]); // no error: //printf ("z_output=%016llx\n", z_output[i]); // no error: //printf ("results=%016llx\n", result); // no error //printf ("a_input=%016llx\n", b_input[i]);
It's some kind of bug inside propagateFloat64NaN(). But when I add a printf after each statement, llc dies:
../../../build/bin/llc dfmul.bc -march=mipsel -relocation-model=static -mips-ssection-threshold=0 -mcpu=mips1 -o dfmul.s llc: /home/acanis/work/legup/llvm/include/llvm/CodeGen/LiveInterval.h:355: llvm::SlotIndex llvm::LiveInterval::beginIndex() const: Assertion `!empty() && "Call to beginIndex() on empty interval."' failed. Stack dump: 0. Program arguments: ../../../build/bin/llc dfmul.bc -march=mipsel -relocation-model=static -mips-ssection-threshold=0 -mcpu=mips1 -o dfmul.s 1. Running pass 'Function Pass Manager' on module 'dfmul.bc'. 2. Running pass 'Linear Scan Register Allocator' on function '@propagateFloat64NaN'
I'm not sure if this is related.
Something about printing “a” seems to fix the final errors. Actually if I add a printf for bIsNaN (which is 1) I also fix one of the errors.
printf ("3: %d\n", bIsNaN);
Why would a printf fix anything?
It's like the if statement isn't working properly…
Okay. With my new modified code “make emulwatch” is now giving me this:
--- lli.txt 2011-06-07 02:02:10.000000000 -0400 +++ sim.txt 2011-06-07 02:02:12.000000000 -0400 @@ -160,14 +160,12 @@ %87=ffff000000000000 %88=1 main:bb3.i26.i - %90=1 -main:bb5.i27.i - %retval.i.i=ffff000000000000 + %90=0 main:float64_mul.exit - %181=ffff000000000000 + %181=3ff8000000000000 %183=ffff000000000000 - %184=0 - %186=0 + %184=1 + %186=1 %189=4 %exitcond=1 main:bb2
Looks like the sim.txt is missing a basic block: main:bb5.i27.i
The first difference is:
%90 = icmp eq i32 %84, 0
Maybe the icmp is invalid in the mips assembly?
bb3.i26.i: ; preds = %float64_is_signaling_nan.exit.i.i %90 = icmp eq i32 %84, 0 %91 = call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([42 x i8]* @23, i32 0, i32 0), i1 %90) br i1 %90, label %bb5.i27.i, label %float64_mul.exit
$BB0_40: # %bb3.i26.i # in Loop: Header=BB0_1 Depth=1 addiu $19, $zero, 0 lui $16, %hi(__unnamed_24) xor $19, $16, $19 addiu $4, $16, %lo(__unnamed_24) sltu $5, $19, 1 jal mprintf nop beq $16, $zero, $BB0_42 nop # BB#41: # in Loop: Header=BB0_1 Depth=1 lw $19, 296($sp) nop j $BB0_76 nop $BB0_42: # %bb5.i27.i # in Loop: Header=BB0_1 Depth=1 lui $2, %hi(__unnamed_25) lw $19, 296($sp) nop beq $17, $zero, $BB0_44 nop # BB#43: # %bb5.i27.i # in Loop: Header=BB0_1 Depth=1 lw $19, 292($sp) nop $BB0_44: # %bb5.i27.i # in Loop: Header=BB0_1 Depth=1 beq $17, $zero, $BB0_46 nop # BB#45: # %bb5.i27.i # in Loop: Header=BB0_1 Depth=1 addu $18, $zero, $21 $BB0_46: # %bb5.i27.i # in Loop: Header=BB0_1 Depth=1 addiu $4, $2, %lo(__unnamed_25) addu $5, $zero, $19 addu $6, $zero, $18 jal mprintf nop j $BB0_76 nop ... $BB0_76: # %float64_mul.exit # in Loop: Header=BB0_1 Depth=1
Why is bb5.i27.i split into so many different basic blocks?
Does this represent the icmp? Yes. If $19 < 1 then $5 = 1 else $5 = 0. If $19 represents %84 then when $19=0 then $5=1.
sltu $5, $19, 1
In the emulwatch run, %84 seems to be correct. When does %90=0 in gxemul?
It's very hard to correlate the .s file to the final disassembled .src file.
Can I use bugpoint to make this bug smaller?
— Andrew Canis 2011/06/06 12:15
Cool tool for calculating lines of code: sloccount
Okay. There is a very strange bug:
int main () { volatile unsigned long long testing = 0x7FFFFFFFFFFFFFFFULL; printf ("testing=%016llx\n", testing); }
When I run make emulwatch:
acanis@acanis-desktop:~/work/legup/examples/mips_bug$ make emulwatch acanis@acanis-desktop:~/work/legup/examples/mips_bug$ diff -u lli.txt sim.txt --- lli.txt 2011-06-03 15:51:39.000000000 -0400 +++ sim.txt 2011-06-03 15:51:40.000000000 -0400 @@ -1,2 +1,2 @@ main:entry - %0=7fffffffffffffff + %0=ffffffffffffffff
The gxemul emulator seems to sign extend the unsigned long long. What about if it's just a normal 32-bit long? Okay. That matches fine. No sign extend problem. Must be an issue with 64-bit integers. Let's compare the .s assembly with the old version of LLVM. Same problem… Wow. So this is a bug that hasn't been filed yet. The issue must be with something else. I'll file this bug right now.
I think it's just treating an unsigned number as a signed number. Victor mentioned a problem with the ldu instruction.
This works fine:
volatile unsigned long long testing = 0x7FFFFFFFULL;
But this has the sign extend problem:
volatile unsigned long long testing = 0xFFFFFFFFULL;
The diff between the above two snippets:
acanis@acanis-desktop:~/work/legup/examples/mips_bug$ diff -u bad.s good.s --- bad.s 2011-06-03 16:44:36.000000000 -0400 +++ good.s 2011-06-03 16:45:03.000000000 -0400 @@ -18,8 +18,9 @@ sw $16, 20($sp) sw $17, 16($sp) addiu $2, $sp, 24 + lui $3, 32767 ori $2, $2, 4 - addiu $3, $zero, -1 + ori $3, $3, 65535 sw $3, 24($sp) sw $zero, 0($2) lui $3, %hi($.str)
MIPS reference:
LUI -- Load upper immediate Description: The immediate value is shifted left 16 bits and stored in the register. The lower 16 bits are zeroes. Operation: $t = (imm << 16); advance_pc (4); ADDIU -- Add immediate unsigned (no overflow) Description: Adds a register and a sign-extended immediate value and stores the result in a register Operation: $t = $s + imm; advance_pc (4);
In the good case: (32767 << 16) + 65535 = 2147483647 = 0x7FFFFFFF, which is right. But in the bad case, -1 sign-extended is all ones. If $3 is actually 64 bits wide this will be wrong.
Could it be a problem with the emulator not supporting 64-bit integers? I doubt it.
Looking on the LLVM release notes for 3.0:
Known problems with the MIPS back-end * 64-bit MIPS targets are not supported yet.
But that shouldn't matter because Tiger MIPS is a 32-bit processor.
When I run 'make tigerwatch' I get:
acanis@acanis-desktop:~/work/legup/examples/mips_bug$ diff -u lli.txt sim.txt --- lli.txt 2011-06-03 16:50:21.000000000 -0400 +++ sim.txt 2011-06-03 16:50:27.000000000 -0400 @@ -1,2 +1,2 @@ main:entry - %0=ffffffff + %0=0
How are 64-bit integers treated in a MIPS1 ISA? MIPS1 is a 32-bit ISA. So actually “addiu $3, $zero, -1” should be correct.
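A quick model of the two sequences makes the difference concrete: on a true 32-bit register file they agree, but on 64-bit registers (which gxemul turns out to be simulating, see below) the addiu version smears ones into the upper half (a hedged model, not gxemul code):

#include <cstdint>
#include <cstdio>

int main() {
    // good case: lui $3, 32767 ; ori $3, $3, 65535
    int64_t lui  = (int64_t)(int32_t)(32767u << 16);  // 0x000000007FFF0000
    int64_t good = lui | 65535u;                      // 0x000000007FFFFFFF
    // bad case: addiu $3, $zero, -1 -- the 16-bit immediate is sign-extended
    // to the full register width, so a 64-bit register ends up all ones
    int64_t bad  = 0 + (int64_t)(int16_t)-1;          // 0xFFFFFFFFFFFFFFFF
    printf("good=%016llx bad=%016llx\n",
           (unsigned long long)good, (unsigned long long)bad);
    return 0;
}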
I'm going to try to step through the code in gxemul. The normal 'make emul' command:
gxemul -E testmips -e R3000 mips_bug.elf -p `../../tiger/linux_tools/lib/../find_ra mips_bug.emul.src`
find_ra finds the return address of the main() function so you know when to break the gxemul simulation. In this case it returns 0xffffffff80031400. When I look in mips_bug.emul.src I see:
80031400: 03e00008 jr ra
So you have to pad the breakpoint address with 1's.
There is a slight difference between mips_bug.emul.src and the mips_bug.s file. The mips_bug.s file:
sw $17, 16($sp) addiu $2, $sp, 24 ori $2, $2, 4 addiu $3, $zero, -1 sw $3, 24($sp)
The mips_bug.emul.src file:
8003138c: afb10010 sw s1,16(sp) 80031390: 27a20018 addiu v0,sp,24 80031394: 34420004 ori v0,v0,0x4 80031398: 2403ffff li v1,-1 8003139c: afa30018 sw v1,24(sp)
The addiu became an li. Let's step through:
gxemul -E testmips -e R3000 mips_bug.elf -p 0x80031394
Seems like the registers are actually 64-bit in this machine…
GXemul> s ffffffff80031394: 34420004 ori v0,v0,0x0004 GXemul> s ffffffff80031398: 2403ffff addiu v1,zr,-1 GXemul> reg cpu0: pc = 0xffffffff8003139c <main+0x1c> ... cpu0: v1 = 0xffffffffffffffff s3 = 0x0000000000000000
It seems like the gxemul is simulating a 64-bit little-endian machine:
GXemul> machine serial nr: 1 (nr of NICs: 1) memory: 32 MB cpu0: 5KE, running 64-bit Little-endian (MIPS64, revision 2), 48 TLB entries L1 I-cache: 32 KB, 32 bytes per line, 2-way L1 D-cache: 32 KB, 32 bytes per line, 2-way
— Andrew Canis 2011/06/03 12:15
Did I even apply Victor's changes to lib/Target/Mips/MipsRegisterInfo.cpp?
No I didn't. The MIPS backend code exactly matches the git version. So I need to apply Victor's patches manually.
Okay, a bunch of the stack code has been moved into a new file:
MipsFrameLowering.cpp
Okay I've tried to reapply the patch. Only dfmul is failing now.
Seems to be some kind of sign extension problem? In most cases gxemul seems to be sign extending while lli isn't.
— Andrew Canis 2011/06/02 12:15
There is a CallInst function called getArgOperand() which I should be using.
They finally fixed the llvm.vim syntax file.
Okay. I fixed the uadd.with.overflow.* intrinsic problem.
Now I'm down to some gxemul errors for dfmul, llist, loopbug, memset.
Could this be caused by Victor's MIPS changes? Maybe the MIPS backend has been fixed/broken?
Looking into loopbug benchmark, from git commit:
commit 8cdf9e016927d9361144260ccaf87d74e58ebaa8 Author: Andrew Canis <andrew.canis@gmail.com> Date: Tue Aug 24 22:20:04 2010 -0400 Test case for LLVM MIPS backend bug. Expected: $ make $ lli loopbug.bc On MIPS (using gxemul emulator): $ make tiger $ make emul
Just double checked that simple backup is working again. Looks good.
Can I try this with the unmodified llvm version? Here's the command that produces the mips assembly:
../../build/bin/llc loopbug.bc -march=mipsel -relocation-model=static -mips-ssection-threshold=0 -mcpu=mips1 -o loopbug.s
I just installed the 2.9 binaries in ~/downloads/llvm-2.9-mingw32-i386. Same error with the newer version of llc.
I probably incorporated the mips backend changes incorrectly.
— Andrew Canis 2011/06/01 12:15
Whoops, noticed that simple backup wasn't working (isis has gone down).
Mips, gsm fail:
# ** Error: gsm.v(1692): Module 'memset' is not defined. # ** Error: mips.v(740): Module 'memset' is not defined.
This also causes gxemul to fail:
../../../mips-binutils/bin/mipsel-elf-ld -T ../../../tiger/linux_tools/lib/prog_link.ld -e main gsm.o ../../../tiger/tool_source/lib/altera_avalon_performance_counter.o -o gsm.elf -EL -L ../../../tiger/linux_tools/lib -lgcc -lfloat -luart make[1]: Leaving directory `/home/acanis/work/legup/examples/chstone/gsm' gsm.o: In function `main': (_main_section+0x118): undefined reference to `memset'
Looking at the memset test. It looks like the legup versions aren't being linked properly.
../../build/bin/llvm-ld memset.prelto.bc ../lib/llvm/liblegup.a -b=memset.premodulo.bc
First of all it seems like the intrinsic lowering pass is no longer working properly:
acanis@acanis-desktop:~/work/legup/examples/memset$ diff -u memset.prelto.ll memset.premodulo.ll ... -declare void @llvm.memcpy.p0i8.p0i8.i32(i8* nocapture, i8* nocapture, i32, i32, i1) nounwind - -declare void @llvm.memset.p0i8.i64(i8* nocapture, i8, i64, i32, i1) nounwind - declare i8* @memcpy(i8*, i8*, i32) declare i8* @memset(i8*, i32, i32)
There are still two intrinsic calls in there. Rebuilding ../lib/llvm/liblegup.a doesn't help. There is some sort of problem. Basically memset() is not being linked into memset.premodulo.ll.
So previously the prelto pass replaced:
call void @llvm.memset.i64(i8* %arr_addr.04.1.i31, i8 0, i64 11, i32 1) nounwind
With:
%16 = call i8* @legup_memset_1(i8* %arr_addr.04.1.i31, i8 0, i64 11) ; <i8*> [#uses=0]
The postfix “_1” indicates a 1 byte alignment. The type of the 3rd argument (length) is i64.
I see in the LLVM manual for the SVN head (http://llvm.org/docs/) that the function name has changed:
declare void @llvm.memset.p0i8.i32(i8* <dest>, i8 <val>, i32 <len>, i32 <align>, i1 <isvolatile>) declare void @llvm.memset.p0i8.i64(i8* <dest>, i8 <val>, i64 <len>, i32 <align>, i1 <isvolatile>)
It's pretty amazing how fast LLVM changes. We're at the 2.7 release and 3.0 is coming out soon. The old 2.7 syntax (from http://llvm.org/releases/2.7/docs/LangRef.html#int_memset):
declare void @llvm.memset.i8(i8 * <dest>, i8 <val>, i8 <len>, i32 <align>) declare void @llvm.memset.i16(i8 * <dest>, i8 <val>, i16 <len>, i32 <align>) declare void @llvm.memset.i32(i8 * <dest>, i8 <val>, i32 <len>, i32 <align>) declare void @llvm.memset.i64(i8 * <dest>, i8 <val>, i64 <len>, i32 <align>)
In release notes for 2.8:
The memcpy, memmove, and memset intrinsics now take address space qualified pointers and a bit to indicate whether the transfer is "volatile" or not.
Our PreLTO seems to be failing. The intrinsics are successfully turned into memcpy/memset calls, but those should then be turned into legup_memcpy_*/legup_memset_* calls.
It's strange. The old version of the code doesn't lower anything, while the newer version prints (in -debug mode):
Lowering: call void @llvm.memcpy.p0i8.p0i8.i32(i8* %carray1, i8* getelementptr inbounds ([12 x i8]* @C.17.1564, i32 0, i32 0), i32 12, i32 1, i1 false) New instruction: %0 = call i8* @memcpy(i8* %carray1, i8* getelementptr inbounds ([12 x i8]* @C.17.1564, i32 0, i32 0), i32 12)
Wow really strange. After touching the file PreLTO.cpp I now get this error: unknown instruction on intrinsic argument
UNREACHABLE executed at /home/acanis/work/legup/llvm/lib/Transforms/LegUp/PreLTO.cpp:164! Stack dump: 0. Program arguments: ../../build/bin/opt -load=../../build/lib/LLVMLegUp.so -legup-prelto 1. Running pass 'Function Pass Manager' on module '<stdin>'. 2. Running pass 'Pre-Link Time Optimization Pass to lower intrinsics' on function '@main' /bin/bash: line 1: 15114 Aborted ../../build/bin/opt -load=../../build/lib/LLVMLegUp.so -legup-prelto < memset.prelto.linked.bc > memset.prelto.bc
Did the makefile not build this properly before?
Okay, so the code isn't handling getelementptr's properly. Actually I'm a little bit confused by the code. The legup_* prefix is determined by the destination pointer…
call void @llvm.memcpy.i32(i8* %carray1, i8* getelementptr inbounds ([12 x i8]* @C.17.1564, i32 0, i32 0), i32 12, i32 1) call void @llvm.memcpy.i32(i8* %sarray2, i8* bitcast ([12 x i16]* @C.18.1565 to i8*), i32 24, i32 2) call void @llvm.memcpy.i32(i8* %array3, i8* bitcast ([12 x i32]* @C.19.1566 to i8*), i32 48, i32 4) call void @llvm.memcpy.i32(i8* %larray4, i8* bitcast ([12 x i64]* @C.20.1567 to i8*), i32 96, i32 8)
Turns into:
%0 = call i8* @legup_memcpy_1(i8* %carray1, i8* getelementptr inbounds ([12 x i8]* @C.17.1564, i32 0, i32 0), i32 12) ; <i8*> [#uses=0] %1 = call i8* @legup_memcpy_2(i8* %sarray2, i8* bitcast ([12 x i16]* @C.18.1565 to i8*), i32 24) ; <i8*> [#uses=0] %2 = call i8* @legup_memcpy_4(i8* %array3, i8* bitcast ([12 x i32]* @C.19.1566 to i8*), i32 48) ; <i8*> [#uses=0] %3 = call i8* @legup_memcpy_8(i8* %larray4, i8* bitcast ([12 x i64]* @C.20.1567 to i8*), i32 96) ; <i8*> [#uses=0]
Why can't you just use the alignment parameter? For instance:
Lowering for LegUp: call void @llvm.memset.p0i8.i64(i8* %16, i8 0, i64 96, i32 8, i1 false)
The destination is: %16 = bitcast [12 x i64]* %larray to i8*, which points to an array of i64's, so the alignment is calculated to be 8 (64/8).
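If the align operand were used directly, the name could be derived straight from the call, something like this (a hypothetical sketch; the legup_memset_ prefix is from the log above, the helper name is made up):

#include <string>
#include "llvm/Constants.h"
#include "llvm/Instructions.h"
#include "llvm/ADT/StringExtras.h"

// hypothetical: llvm.memset.p0i8.iN(dest, val, len, align, isvolatile) carries
// the alignment as argument index 3, so read it from there instead of
// inferring it from the destination pointer's element type
static std::string legupMemsetName(llvm::CallInst *CI) {
    llvm::ConstantInt *Align =
        llvm::cast<llvm::ConstantInt>(CI->getArgOperand(3));
    return "legup_memset_" + llvm::utostr(Align->getZExtValue());
}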
Damn, I just got hit by the new API change again: there was an API change with CallInst operand order. The function is now stored as the last operand instead of the first.
Okay that worked. Down to 16 failures. I have a couple of unexplained gxemul simulation errors…
dfdiv, dfmul, dfsin, sha:
LLVM ERROR: Code generator does not support intrinsic function 'llvm.uadd.with.overflow.i64'!
The actual error comes from:
lib/CodeGen/IntrinsicLowering.cpp:353: report_fatal_error("Code generator does not support intrinsic function '"+
From the LLVM docs:
The 'llvm.uadd.with.overflow' family of intrinsic functions perform an unsigned addition of the two arguments, and indicate whether a carry occurred during the unsigned summation.
So I get code looking like:
%uadd.i = call %0 @llvm.uadd.with.overflow.i64(i64 %105, i64 %106) nounwind %108 = extractvalue %0 %uadd.i, 0 %109 = extractvalue %0 %uadd.i, 1
Which could easily be converted to Verilog: {a, b} = c + d;
But what's the best way of handling this? I think the easiest way is to turn this into an i65 addition and shift out the carry bit. Quartus should easily optimize this to the correct hardware. I'll just add this to the PreLTO pass.
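For reference, the semantics being replaced, spelled out in plain code (a sketch of what the intrinsic computes, not the PreLTO transformation itself):

#include <cstdint>
#include <utility>

// sum and carry of an unsigned 64-bit add, i.e. the two struct fields that
// llvm.uadd.with.overflow.i64 returns; doing the add at 65 bits and shifting
// out the top bit yields the same pair
std::pair<uint64_t, bool> uadd_with_overflow(uint64_t a, uint64_t b) {
    uint64_t sum = a + b;   // wraps modulo 2^64
    bool carry = sum < a;   // carry out of bit 63
    return std::make_pair(sum, carry);
}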
— Andrew Canis 2011/05/31 12:15
mips intrinsic error with new llvm version:
../../../build/bin/opt: symbol lookup error: ../../../build/lib/LLVMLegUp.so: undefined symbol: _ZN4llvm17IntrinsicLowering18LowerIntrinsicCallEPNS_8CallInst
This is a linker error I experienced previously. I fixed the autoconf makefile flow, now I need to fix cmake.
I need to include: LDFLAGS = $(LLVM_OBJ_ROOT)/lib/CodeGen/$(BuildMode)/IntrinsicLowering.o
How do I handle this in cmake?
What does this mean?
add_llvm_loadable_module( LLVMLegUp ..
This creates the build/lib/LLVMLegUp.so shared library. There are only a few other examples of this in the code. This doesn't fix it:
add_dependencies(LLVMLegUp LLVMCodeGen)
I need to actually link the LLVMCodeGen library into the LLVMLegUp.so library:
target_link_libraries(LLVMLegUp LLVMCodeGen)
Okay adding this to /llvm/lib/Transforms/LegUp/CMakeLists.txt works.
— Andrew Canis 2011/05/30 15:05
There's an interesting discussion on the CBackend on the LLVM mailing list: http://lists.cs.uiuc.edu/pipermail/llvmdev/2010-November/036278.html
Chris suggests a full rewrite if anyone wants to work on the CBackend:
If anyone was really interested in this, I'd strongly suggest a complete rewrite of the C backend: make use the existing target independent code generator code (for legalization etc) and then just put out a weird ".s file" at the end. -Chris
So I've finished iterative modulo scheduling for a simple example with no recurrences:
int a[N], b[N], c[N]; for (i = 0; i < N; i++) { a[i] = b[i] + c[i]; } return a[N-1];
But it takes 989ns = ~500 cycles. The II=3 so it should take 300 cycles. The prologue and epilogue both require 2 basic blocks.
I need to fix the prologue to branch to the epilog depending on the loop bound. For instance if N=1 then the kernel should be skipped.
Is there an easy way to generate a gantt chart for the reservation table? psTricks seems to have gantt chart generation for latex. Okay found a good sty here: http://www.martin-kumm.de/tex_gantt_package.php
Added a debug macro for legup. Use the option '-debug-only=legup' to only show debugging from LegUp.
I don't understand how this is executing in legup in 1000ns/2=~500 cycles. This means the loop body only takes 5 cycles when it should take 6. Seems like the getelementptr has been chained. It's weird though, because I see it gets scheduled in separate states at one point, then gets chained in later. What is going on here? The chaining happens somewhere between SchedulerASAP::scheduleBasicBlock() and SchedulerPass::createFSMforBB().
I'm noticing that the scheduler needs to be completely revamped. There is a ton of copy-pasted code all over the place. For instance, the SchedulerMapping::createFSM() function looks like an exact copy of the ASAP scheduler code. And the SchedulerPass has the exact same copied code too. Why does the DAG need its own custom ASAP code? And then this code is repeated again in SimpleASAPScheduler.
Okay so I think the bug was this code in SimpleASAPScheduler::getSoonestStateRegUses():
if (depIn->getAsapDelay() + in->getDelay() > InstructionNode::getMaxDelay()) {
Should be this:
if (depIn->getAsapDelay() >= InstructionNode::getMaxDelay() || depIn->getAsapDelay() + in->getDelay() > InstructionNode::getMaxDelay()) {
Basically, you look at the predecessors of the current instruction (depIn). If they have an asapDelay that's equal or greater than the getMaxDelay then you _must_ be in the next state. Otherwise, _only_ if the asapDelay of the predecessor + the delay of the current instruction is _greater_ than the maxDelay would you need to be moved to the next state (there isn't enough room for you to be in the current state with the predecessor).
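Restated as a standalone predicate (illustrative names, not the actual LegUp scheduler interface):

// Decide whether an instruction must be pushed to the next state rather than
// chained after its predecessor in the current state.
bool mustStartNewState(unsigned predAsapDelay, unsigned instrDelay,
                       unsigned maxDelay) {
    // the predecessor already uses up the whole clock period of this state
    if (predAsapDelay >= maxDelay)
        return true;
    // chaining would exceed the clock period budget of the current state
    return predAsapDelay + instrDelay > maxDelay;
}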
— Andrew Canis 2011/04/25 15:05
Finished calculating heights of the dependence graph. I need to fix the schedulerDAG - why are reg/mem dependencies split up?
Also a bigger question: what is the delay of an instruction? It's actually determined by the schedule and depends on the chaining that happens. There might be an opportunity here to improve the algorithm.
For some reason, the loads are aliasing:
%scevgep7 = getelementptr [100 x i32]* %b, i32 0, i32 %i.06 %scevgep8 = getelementptr [100 x i32]* %c, i32 0, i32 %i.06 %0 = load i32* %scevgep7, align 4 %1 = load i32* %scevgep8, align 4
Even though %b and %c clearly don't alias…
Running -print-alias-sets:
../../build/bin/opt -legup-config=../../hwtest/CycloneII.tcl -load=../../build/lib/LLVMPolly.so -basicaa -print-alias-sets -modulo-schedule pipeline.premodulo.bc > pipeline.bc Alias Set Tracker: 3 alias sets for 3 pointer values. AliasSet[0xa7957f0, 1] must alias, Ref Pointers: (i32* %scevgep7, 4) AliasSet[0xa795820, 1] must alias, Ref Pointers: (i32* %scevgep8, 4) AliasSet[0xa795850, 1] must alias, Mod Pointers: (i32* %scevgep, 4)
The mem dependence uses are:
%0 = load i32* %scevgep7, align 4 uses: %1 = load i32* %scevgep8, align 4 %0 = load i32* %scevgep7, align 4 uses: store i32 %2, i32* %scevgep, align 4 %1 = load i32* %scevgep8, align 4 uses: store i32 %2, i32* %scevgep, align 4
It's like everything is considered aliased… Not sure why this is happening. Don't have time to fix it now.
So the heights look good. Except for the aliasing issue between the loads.
Height: 6: %i.06 = phi i32 [ 0, %bb.nph ], [ %3, %bb ] Height: 6: %scevgep7 = getelementptr [100 x i32]* %b, i32 0, i32 %i.06 Height: 5: %0 = load i32* %scevgep7, align 4 Height: 4: %scevgep8 = getelementptr [100 x i32]* %c, i32 0, i32 %i.06 Height: 3: %1 = load i32* %scevgep8, align 4 Height: 2: %3 = add nsw i32 %i.06, 1 Height: 1: %scevgep = getelementptr [100 x i32]* %a, i32 0, i32 %i.06 Height: 1: %2 = add nsw i32 %1, %0 Height: 1: %exitcond = icmp eq i32 %3, 100 Height: 0: br i1 %exitcond, label %bb2, label %bb Height: 0: store i32 %2, i32* %scevgep, align 4
— Andrew Canis 2011/04/22 15:05
Moving to cmake (for polly). Debug build:
acanis@acanis-desktop:~/work/legup/build$ cmake ../llvm/ -DCMAKE_BUILD_TYPE=Debug -DCMAKE_PREFIX_PATH=/home/acanis/work/polly/cloog/install/
Had to add Verilog as a target to llvm/CMakeLists.txt
With cmake you can just say:
make llc
I can't include polly as an analysis pass in the backend. It won't work with the build system. I need to make modulo scheduling a prepass anyway. Just do all the development in the polly folder to simplify the build issues.
Created a simple example with no loop recurrences:
for (i = 0; i < N; i++) { a[i] = b[i] + c[i]; }
Takes 1007ns/2 = 500 cycles. The .ll:
bb: ; preds = %bb, %bb.nph %i.04 = phi i32 [ 0, %bb.nph ], [ %3, %bb ] %scevgep = getelementptr [100 x i32]* %a, i32 0, i32 %i.04 %scevgep5 = getelementptr [100 x i32]* %b, i32 0, i32 %i.04 %scevgep6 = getelementptr [100 x i32]* %c, i32 0, i32 %i.04 %0 = volatile load i32* %scevgep5, align 4 %1 = volatile load i32* %scevgep6, align 4 %2 = add nsw i32 %1, %0 volatile store i32 %2, i32* %scevgep, align 4 %3 = add nsw i32 %i.04, 1 %exitcond = icmp eq i32 %3, 100 br i1 %exitcond, label %bb2, label %bb
I would expect this to take 5 cycles: two loads can be pipelined sequentially, so 3 cycles. Then the add takes 1 cycle. The store takes 1 cycle. So 5 cycles * 100 = 500. Wait. But what about the getelementptr instructions? Shouldn't this take 6 cycles? What can I pipeline this to? The resource minimum initiation interval (ResMII) is limited by memory operations, of which there are 3. There is only a single memory port. Therefore ResMII = 3/1 = 3. So fully pipelined this loop should take 300 cycles. There are no recurrences.
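That ResMII arithmetic as a throwaway helper (illustrative, not LegUp code):

// resource-constrained minimum initiation interval: a new iteration can start
// at most every ceil(ops / units) cycles for the scarcest resource
int resMII(int numMemOps, int numMemPorts) {
    return (numMemOps + numMemPorts - 1) / numMemPorts;
}
// here: resMII(3, 1) == 3, so the 100-iteration loop bottoms out around
// 3 * 100 = 300 cycles once fully pipelined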
So you can't modify the LLVM IR when iterating over SCoPs:
// Because they operate on Polly IR, not the LLVM IR, ScopPasses are not allowed // to modify the LLVM IR. Due to this limitation, the ScopPass class takes
All I want to do is detect and iterate over the SCoPs in the LLVM IR! What's the point of polly if you can't modify the LLVM IR?
— Andrew Canis 2011/04/21 15:05
Okay. I finally figured out how to handle the fact that we have an upstream branch of LLVM in our repository: git subtree merging.
git submodules don't work. They are too complex and don't fit our workflow. I don't really want to have to run 'git submodule update' every time I modify something in the llvm directory. Instead what I want is a branch of LLVM that tracks the upstream changes. Then occasionally I want to merge in the latest LLVM upstream changes into that branch and then merge it into mainline. The best thing about git subtree is it doesn't change the workflow of anyone else working with LegUp. It's just up to me to occasionally merge these changes.
git remote add llvm_remote http://llvm.org/git/llvm.git git fetch llvm_remote
There are a bunch of releases:
* [new branch] master -> llvm_remote/master ... * [new branch] release_27 -> llvm_remote/release_27 * [new branch] release_28 -> llvm_remote/release_28 * [new branch] release_29 -> llvm_remote/release_29
We were at the LLVM 2.7 release.
git checkout -b llvm_2.7 llvm_remote/release_27 git checkout -b llvm_branch llvm_remote/master
Looking in the release27 branch, the last commit of 2.7 is:
commit 4bbf07421f101f00f4272927b60f7a8383b5cecf Author: Tanya Lattner <tonic@nondot.org> Date: Tue Apr 27 06:53:59 2010 +0000 Commit 2.7 release notes. Update getting started guide for 2.7 git-svn-id: https://llvm.org/svn/llvm-project/llvm/branches/release_27@102412 91177308-0d34-0410-b5e6-96231b3b80d
Very interesting. This commit doesn't exist in the LLVM mainline. I guess this makes sense. They created an svn branch to track the release of 2.7.
Probably the best way to deal with this is to forget about 2.7 and just grab the latest git repo.
git read-tree --prefix=llvm-git/ -u llvm_branch
The llvm commit head is:
commit e5ff344fc03351eaf8bb3303d0fe359378c09684
Now when I git mv into the subtree I lose all the previous history. No, I just need to use:
git log --follow
git annotate still works. Okay so this is fine.
Added LLVM git mainline as a git subtree Used the command: git read-tree --prefix=llvm-git/ -u llvm_branch The latest LLVM git commit in llvm_branch was: commit e5ff344fc03351eaf8bb3303d0fe359378c09684
Is there any way I can merge it into my existing folder? I don't really understand how the subtree feature works…
Okay, I tried a new strategy. Removed llvm-git and just ran:
git merge --squash -s subtree --no-commit llvm_branch
It somehow detected that llvm/ was where the merge should happen. git just completely removed the Verilog directory! Just manually go through and fix the merge.
I wonder if this would work better:
git read-tree --prefix=llvm/ -m -u llvm_branch
Nope. Doesn't work. Can't use prefix and -m options together
I think Victor will have to merge in:
lib/Target/Mips/MipsRegisterInfo.cpp
Something is wrong with the MemoryDependenceAnalysis pass. It's giving me a seg fault. There's a new LLVM idiom for passes:
-static RegisterPass<GVN> X("gvn", - "Global Value Numbering"); +INITIALIZE_PASS_BEGIN(GVN, "gvn", "Global Value Numbering", false, false) +INITIALIZE_PASS_DEPENDENCY(MemoryDependenceAnalysis) +INITIALIZE_PASS_DEPENDENCY(DominatorTree) +INITIALIZE_AG_DEPENDENCY(AliasAnalysis) +INITIALIZE_PASS_END(GVN, "gvn", "Global Value Numbering", false, false)
Another example:
// Register the default SparcV9 implementation... -static RegisterPass<TargetData> X("targetdata", "Target Data Layout", false, - true); +INITIALIZE_PASS(TargetData, "targetdata", "Target Data Layout", false, true) char TargetData::ID = 0;
Wow. Really annoying compiler error:
SchedulerPass.cpp:50: error: ‘void llvm::initializeSchedulerASAPPass(llvm::PassRegistry&)’ should have been declared inside ‘llvm’
Fix by adding:
namespace llvm { void initializeSchedulerASAPPass(llvm::PassRegistry&); }
Where is initializeSchedulerASAPPass() being called from? There is a short note about the change here: http://lists.cs.uiuc.edu/pipermail/llvmdev/2010-July/033293.html Detailed note: http://permalink.gmane.org/gmane.comp.compilers.llvm.devel/35362
There was also an api change with CallInst operand order. The function is now stored as the last operand instead of the first.
for (CallSite::arg_iterator AI = CI->op_begin()+1, AE = CI->op_end()-1; AI != AE; ++AI) {
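The order-independent way to walk the arguments is the getArgOperand() accessor mentioned elsewhere in this log (a sketch, assuming the post-reordering layout where the callee is the last operand):

#include "llvm/Instructions.h"

// visit each call argument without caring whether the callee is stored as the
// first or the last operand
void visitCallArgs(llvm::CallInst *CI) {
    for (unsigned i = 0, e = CI->getNumArgOperands(); i != e; ++i) {
        llvm::Value *Arg = CI->getArgOperand(i);
        (void)Arg; // ... process the argument
    }
}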
PreLTO required IntrinsicLowering so I kept getting the error:
../../../llvm/Debug+Asserts/bin/opt: symbol lookup error: ../../../llvm/Debug+Asserts/lib/LLVMLegUp.so: undefined symbol: _ZN4llvm17IntrinsicLowering18LowerIntrinsicCallEPNS_8CallInstE
I finally figured out you need to add this to the Transforms/LegUp Makefile:
LDFLAGS = $(LLVM_OBJ_ROOT)/lib/CodeGen/$(BuildMode)/IntrinsicLowering.o
I'm seeing this weird case with overflow intrinsics… llvm-ld seems to produce them.
../../../llvm/Debug+Asserts/bin/llc -march=c dfmul.prelto.linked.bc LLVM ERROR: Code generator does not support intrinsic function 'llvm.uadd.with.overflow.i64'!
Even the CBackend doesn't support these intrinsics. But the CppBackend does. Okay, I'm reading in the release notes that the CBackend is no longer actively maintained. So I should just be looking at the CppBackend.
If I turn on NO_INLINE (to prevent link time optimization) I can prevent these intrinsics from being generated.
Tons of gxemul errors. PreLTO seems to be failing occasionally with:
make[1]: Leaving directory `/home/acanis/work/legup/examples/memset' unknown instruction on intrinsic argument UNREACHABLE executed at PreLTO.cpp:164!
Dhrystone has a strange error:
# ** Error: dhry.v(5687): Module 'legup_memcpy_' is not defined.
I don't really have time to look into this anymore.
git cloned polly in the tools directory. Had to run 'make clean' on my llvm because cmake didn't like the in source build.
acanis@acanis-desktop:~/work/legup/build$ cmake ../llvm/ -DCMAKE_PREFIX_PATH=/home/acanis/work/polly/cloog/install/
Okay. Added a few temporary modifications to llc.cpp so I can load in the shared library. Appears to work:
acanis@acanis-desktop:~/work/legup/build$ bin/llc -load lib/LLVMPolly.so
— Andrew Canis 2011/04/20 15:05
The changes I pushed yesterday improved the geomean Time by 8%. Geomean Fmax went up and cycles went down.
Looking into installing Polly in ~/work/polly
Polly:
git clone http://llvm.org/git/llvm.git cd llvm/tools git clone git://repo.or.cz/polly.git
ISL/CLooG:
git clone git://repo.or.cz/cloog.git cd cloog ./get_submodules.sh ./autogen.sh ./configure --prefix=~/work/polly/cloog/install make make install
Now building polly:
cd ~/work/polly mkdir build cd build cmake ../llvm -DCMAKE_PREFIX_PATH=~/work/polly/cloog/install . make
Great setup btw. I should make LegUp more like this! Especially the use of submodules. The thing is, we have made some modifications to LLVM to add tcl and fix the MIPS backend.
Now modifying my path:
export PATH=~/work/polly/build/bin/:$PATH
Looking at examples:
cd ~/work/polly/test make
Polly can't seem to deal with constant integers. For instance, the dependencies are detected if I use:
#define N 1024
But not with:
const int N = 1024;
Polly seems to work. For this example:
for (i = 2; i < N; i++) { array[i] = array[i-2]+1; }
Detects a distance of 2:
Printing analysis 'Polly - Calculate dependences for Scop' for region: '%2 => %7' in function 'main': Must dependences: { Stmt_3[i0] -> Stmt_3[2 + i0] : i0 >= 0 and i0 <= 95; Stmt_3[i0] -> FinalRead[0] : i0 >= 0 and i0 <= 97 } May dependences: { } Must no source: { FinalRead[0] -> MemRef_array[o0] : o0 >= 100 or o0 <= 1; Stmt_3[i0] -> MemRef_array[i0] : i0 >= 0 and i0 <= 1 } May no source: { }
Probably the fastest way to get this up and running is to use the .json file export:
{ "name": "%2 => %7", "context": "{ [] }", "statements": [{ "name": "Stmt_3", "domain": "{ Stmt_3[i0] : i0 >= 0 and i0 <= 97 }", "schedule": "{ Stmt_3[i0] -> scattering[0, i0, 0] }", "accesses": [{ "kind": "read", "relation": "{ Stmt_3[i0] -> MemRef_array[i0] }" }, { "kind": "write", "relation": "{ Stmt_3[i0] -> MemRef_array[2 + i0] }" }] }] }
But how do I map 'Stmt_3' to the actual LLVM IR instruction? Looking in ScopInfo.cpp. Looks like 3 is the name of the basic block (label %3) with the % stripped off. Similarly with MemRef_array (%array with the % stripped).
The json export is missing all the dependencies. How can I actually integrate this? I need to use the Dependences analysis class:
CodeGeneration.cpp: Dependences *DP = &getAnalysis<Dependences>(); Pocc.cpp: Dependences *D = &getAnalysis<Dependences>();
What's the quickest way I can integrate this into LegUp? LegUp runs in the backend llc. Can llc load a library like opt can? llc gives an error when trying to load the Polly library:
Error opening '/home/acanis/work/polly/build/lib/LLVMPolly.so': /home/acanis/work/polly/build/lib/LLVMPolly.so: undefined symbol: _ZNK4llvm10RegionPass17createPrinterPassERNS_11raw_ostreamERKSs -load request ignored.
What is RegionPass::createPrinterPass? Do I just need to extend llc to include this? Added it manually to llc (based on opt). Now I get:
Error opening '/home/acanis/work/polly/build/lib/LLVMPolly.so': /home/acanis/work/polly/build/lib/LLVMPolly.so: undefined symbol: _ZTVN4llvm15AliasSetTracker13ASTCallbackVHE
Okay by moving these lines into llc.cpp:
PM.add(new RegionPassPrinter(NULL, Out->os())); createStandardModulePasses(&PM, 3, /*OptimizeSize=*/ false, /*UnitAtATime=*/ true, /*UnrollLoops=*/ true, true, /*HaveExceptions=*/ true, NULL);
I can now load the Polly library.
Next step: Polly requires the latest git version of LLVM. What's the easiest way to do this? Compiling. Used configure instead of cmake. cmake isn't configured for the tcl changes we made to legup. Errors: wow, the makefile completely doesn't work. So many bugs. Can I fix the tcl problem instead?
Needed to add the polly directory to tools/Makefile.
I think I got it:
INCLUDE(FindTCL) if (TCL_FOUND) SET(CMAKE_CXX_FLAGS ${CMAKE_CXX_FLAGS} "-I ${TCL_INCLUDE_PATH}") endif()
Now I get a link error:
../../lib/libLLVMTarget.a(LegupTcl.cpp.o): In function `legup::parseTclFile(std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, legup::LegupConfig*)': LegupTcl.cpp:(.text+0x15): undefined reference to `Tcl_CreateInterp'
Okay, fixed cmake:
INCLUDE(FindTCL) if (TCL_FOUND) # add in the Tcl values if found IF (TCL_INCLUDE_PATH) INCLUDE_DIRECTORIES(${TCL_INCLUDE_PATH}) ENDIF (TCL_INCLUDE_PATH) IF (TCL_LIB_PATH) LINK_DIRECTORIES (${TCL_LIB_PATH}) ENDIF (TCL_LIB_PATH) IF (TCL_LIBRARY) LINK_LIBRARIES (${TCL_LIBRARY}) ENDIF (TCL_LIBRARY) endif()
— Andrew Canis 2011/04/19 15:05
Creating a simple example of pipelining in examples/pipeline. First just a simple, parallel loop with no loop carried dependencies:
do i = 1,100 a(i) = 1
LegUp compiles this and it takes 811ns/2=406 cycles. I would expect this to take 2 cycles per load, which can be pipelined with incrementing the i induction variable. So 2*100=200 cycles. Where do the other 200 cycles come from? Okay, it's because I disabled all LLVM optimizations. Turned them back on. Now 607ns/2 = 304 cycles. Still off. The loop body:
bb: ; preds = %bb, %bb.nph %i.04 = phi i32 [ 0, %bb.nph ], [ %1, %bb ] ; <i32> [#uses=2] %tmp = shl i32 %i.04, 2 ; <i32> [#uses=1] %scevgep = getelementptr [400 x i8]* %0, i32 0, i32 %tmp ; <i8*> [#uses=1] %scevgep5 = bitcast i8* %scevgep to i32* ; <i32*> [#uses=1] store i32 1, i32* %scevgep5, align 4 %1 = add nsw i32 %i.04, 1 ; <i32> [#uses=2] %exitcond = icmp eq i32 %1, 100 ; <i1> [#uses=1] br i1 %exitcond, label %bb2, label %bb
Phi gets removed. Okay, the array offset expression (shl) will take 1 cycle. Adding will take 1 cycle, then the exit cond will take another cycle. The store will take 2 cycles. Still seems like the loop should be taking 2 cycles. Unless the getelementptr and bitcast don't get chained.
Can I print out a dot graph of the dependency graph? Running llc with -debug option, I see:
ASAP: bb State: 0 %i.04 = phi i32 [ 0, %bb.nph ], [ %1, %bb ] ; <i32> [#uses=2] State: 0 %tmp = shl i32 %i.04, 2 ; <i32> [#uses=1] State: 1 %scevgep = getelementptr [400 x i8]* %0, i32 0, i32 %tmp ; <i8*> [#uses=1] State: 1 %scevgep5 = bitcast i8* %scevgep to i32* ; <i32*> [#uses=1] State: 2 store i32 1, i32* %scevgep5, align 4 State: 1 %1 = add nsw i32 %i.04, 1 ; <i32> [#uses=2] State: 2 %exitcond = icmp eq i32 %1, 100 ; <i1> [#uses=1] State: 3 br i1 %exitcond, label %bb2, label %bb
So unfortunately, it seems like the getelementptr actually takes an extra cycle. This is because you need to take %0 (the address of the array) and add the %tmp offset. You need the shl because this is an integer array. Why didn't LLVM do strength reduction? TODO: strength reduction (-loop-reduce) isn't run by default?
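What -loop-reduce buys here, roughly: the per-iteration base + (i << 2) address computation is replaced by a pointer that is just bumped by 4 each trip around the loop (a sketch of the transformation at the C level, not LLVM's actual output):

void fill(int a[100]) {
    // before strength reduction: the address is recomputed from the induction
    // variable every iteration (shift + add, then the store)
    for (int i = 0; i < 100; i++)
        *(int *)((char *)a + (i << 2)) = 1;
}

void fill_lsr(int a[100]) {
    // after -loop-reduce: a running pointer replaces the shift and add
    for (int *p = a, *e = a + 100; p != e; ++p)
        *p = 1;
}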
Also, why is %1 started at control step 1? So the shl is actually cheap to chain because it is a constant shift. Strangely, the phi is actually requiring one cycle. I think there is a bug in the scheduler. I don't think Phi's should require a cycle. Okay. Fixed this issue. I'm going to push it to double check. Cycle count is still 607ns/2=307 cycles:
State: 0 %i.04 = phi i32 [ 0, %bb.nph ], [ %1, %bb ] ; <i32> [#uses=2] State: 0 %tmp = shl i32 %i.04, 2 ; <i32> [#uses=1] State: 1 %scevgep = getelementptr [400 x i8]* %0, i32 0, i32 %tmp ; <i8*> [#uses=1] State: 1 %scevgep5 = bitcast i8* %scevgep to i32* ; <i32*> [#uses=1] State: 2 store i32 1, i32* %scevgep5, align 4 State: 0 %1 = add nsw i32 %i.04, 1 ; <i32> [#uses=2] State: 1 %exitcond = icmp eq i32 %1, 100 ; <i1> [#uses=1] State: 2 br i1 %exitcond, label %bb2, label %bb
Okay, the other problem is that a branch is taking an extra cycle. Fixed. Final number of cycles: 407ns/2 = 204 cycles
cstep: 0 %i.04 = phi i32 [ 0, %bb.nph ], [ %1, %bb ] ; <i32> [#uses=2] cstep: 0 %tmp = shl i32 %i.04, 2 ; <i32> [#uses=1] cstep: 1 %scevgep = getelementptr [400 x i8]* %0, i32 0, i32 %tmp ; <i8*> [#uses=1] cstep: 1 %scevgep5 = bitcast i8* %scevgep to i32* ; <i32*> [#uses=1] cstep: 2 store i32 1, i32* %scevgep5, align 4 cstep: 0 %1 = add nsw i32 %i.04, 1 ; <i32> [#uses=2] cstep: 1 %exitcond = icmp eq i32 %1, 100 ; <i1> [#uses=1] cstep: 1 br i1 %exitcond, label %bb2, label %bb
Weird. It looks like this should be 3 cycles. Oh no. 0 literally means not a cycle. The cstep represents when the instruction finishes. The branch looks strange here, finishing before the store. But I think the finite state machine generation will handle that. So wait. If cstep is the ending state then I was probably wrong to chain after PhiNodes… Just wait until buildbot finishes… Turning off chaining after phinode:
cstep: 0 %i.04 = phi i32 [ 0, %bb.nph ], [ %1, %bb ] ; <i32> [#uses=2] cstep: 0 %tmp = shl i32 %i.04, 2 ; <i32> [#uses=1] cstep: 1 %scevgep = getelementptr [400 x i8]* %0, i32 0, i32 %tmp ; <i8*> [#uses=1] cstep: 1 %scevgep5 = bitcast i8* %scevgep to i32* ; <i32*> [#uses=1] cstep: 2 store i32 1, i32* %scevgep5, align 4 cstep: 1 %1 = add nsw i32 %i.04, 1 ; <i32> [#uses=2] cstep: 2 %exitcond = icmp eq i32 %1, 100 ; <i1> [#uses=1] cstep: 2 br i1 %exitcond, label %bb2, label %bb
This looks like what I want. But wait, doesn't the store require 2 cycles? So shouldn't the branch be forced to cstep 3? Increasing the load latency in the code doesn't change the branch.
Weird. If I change the lower loop bound to 1, I get 300 cycles:
cstep: 0 %indvar = phi i32 [ 0, %bb.nph ], [ %indvar.next, %bb ] ; <i32> [#uses=2] cstep: 0 %tmp = shl i32 %indvar, 2 ; <i32> [#uses=1] cstep: 1 %tmp5 = add i32 %tmp, 4 ; <i32> [#uses=1] cstep: 2 %scevgep = getelementptr [400 x i8]* %0, i32 0, i32 %tmp5 ; <i8*> [#uses=1] cstep: 2 %scevgep6 = bitcast i8* %scevgep to i32* ; <i32*> [#uses=1] cstep: 3 store i32 1, i32* %scevgep6, align 4 cstep: 0 %indvar.next = add i32 %indvar, 1 ; <i32> [#uses=2] cstep: 1 %exitcond = icmp eq i32 %indvar.next, 99 ; <i1> [#uses=1] cstep: 1 br i1 %exitcond, label %bb2, label %bb
This is a strength reduction issue. If I run:
opt -loop-reduce pipeline.bc > pipeline2.bc ../../llvm/Debug/bin/llc -legup-config=../../hwtest/CycloneII.tcl -march=v pipeline2.bc -o pipeline.v -debug &> log
I get it back down to 200 cycles. So this proves that strength reduction isn't run by default for some reason…
cstep: 0 %lsr.iv3 = phi [400 x i8]* [ %tmp6, %bb ], [ %scevgep12, %bb.nph ] ; <[400 x i8]*> [#uses=2] cstep: 0 %lsr.iv = phi i32 [ %lsr.iv.next, %bb ], [ 99, %bb.nph ] ; <i32> [#uses=1] cstep: 0 %lsr.iv37 = bitcast [400 x i8]* %lsr.iv3 to i32* ; <i32*> [#uses=1] cstep: 1 store i32 1, i32* %lsr.iv37, align 4 cstep: 0 %lsr.iv.next = add i32 %lsr.iv, -1 ; <i32> [#uses=2] cstep: 0 %scevgep4 = getelementptr [400 x i8]* %lsr.iv3, i32 0, i32 4 ; <i8*> [#uses=1] cstep: 0 %tmp6 = bitcast i8* %scevgep4 to [400 x i8]* ; <[400 x i8]*> [#uses=1] cstep: 1 %exitcond = icmp eq i32 %lsr.iv.next, 0 ; <i1> [#uses=1] cstep: 1 br i1 %exitcond, label %bb2, label %bb
Okay, I think I'm going to do a loop carried dependency instead:
array[0] = 1; for (i = 1; i < N; i++) { array[i] = array[i-1]+1; } return array[N-1];
300 cycles. With strength reduction 200 cycles. Made array volatile. Now up to 500 cycles. (volatile makes no difference on the previous loop - still 200 cycles w/strength red). New loop body:
bb: %indvar = phi i32 [ 0, %bb.nph ], [ %tmp, %bb ] ; <i32> [#uses=2] %tmp = add i32 %indvar, 1 ; <i32> [#uses=3] %scevgep = getelementptr [100 x i32]* %0, i32 0, i32 %tmp ; <i32*> [#uses=1] %scevgep5 = getelementptr [100 x i32]* %0, i32 0, i32 %indvar ; <i32*> [#uses=1] %1 = volatile load i32* %scevgep5, align 4 ; <i32> [#uses=1] %2 = add nsw i32 %1, 1 ; <i32> [#uses=1] volatile store i32 %2, i32* %scevgep, align 4 %exitcond = icmp eq i32 %tmp, 99 ; <i1> [#uses=1] br i1 %exitcond, label %bb2, label %bb
Control steps:
cstep: 0 %indvar = phi i32 [ 0, %bb.nph ], [ %tmp, %bb ] ; <i32> [#uses=2] cstep: 0 %tmp = add i32 %indvar, 1 ; <i32> [#uses=3] cstep: 1 %scevgep = getelementptr [100 x i32]* %0, i32 0, i32 %tmp ; <i32*> [#uses=1] cstep: 0 %scevgep5 = getelementptr [100 x i32]* %0, i32 0, i32 %indvar ; <i32*> [#uses=1] depend (state: 0): %scevgep5 = getelementptr [100 x i32]* %0, i32 0, i32 %indvar ; <i32*> [#uses=1] cstep: 1 %1 = volatile load i32* %scevgep5, align 4 ; <i32> [#uses=1] cstep: 3 %2 = add nsw i32 %1, 1 ; <i32> [#uses=1] depend (state: 3): %2 = add nsw i32 %1, 1 ; <i32> [#uses=1] cstep: 4 volatile store i32 %2, i32* %scevgep, align 4 cstep: 1 %exitcond = icmp eq i32 %tmp, 99 ; <i1> [#uses=1] cstep: 1 br i1 %exitcond, label %bb2, label %bb
Trying this code:
volatile int array_val; volatile int *array = &array_val; *array = 1; for (i = 1; i < N; i++) { *array = *array+1; //array[i] = 1; } return *array;
Also takes 500 cycles.
bb: ; preds = %bb, %bb.nph %indvar = phi i32 [ 0, %bb.nph ], [ %indvar.next, %bb ] ; <i32> [#uses=1] %1 = phi i32 [ %0, %bb.nph ], [ %3, %bb ] ; <i32> [#uses=1] %2 = add nsw i32 %1, 1 ; <i32> [#uses=1] volatile store i32 %2, i32* %array_val, align 4 %3 = volatile load i32* %array_val, align 4 ; <i32> [#uses=2] %indvar.next = add i32 %indvar, 1 ; <i32> [#uses=2] %exitcond = icmp eq i32 %indvar.next, 99 ; <i1> [#uses=1] br i1 %exitcond, label %bb2, label %bb
Control steps:
cstep: 0 %indvar = phi i32 [ 0, %bb.nph ], [ %indvar.next, %bb ] ; <i32> [#uses=1] cstep: 0 %1 = phi i32 [ %0, %bb.nph ], [ %3, %bb ] ; <i32> [#uses=1] cstep: 0 %2 = add nsw i32 %1, 1 ; <i32> [#uses=1] depend (state: 0): %2 = add nsw i32 %1, 1 ; <i32> [#uses=1] cstep: 1 volatile store i32 %2, i32* %array_val, align 4 depend (state: 1): volatile store i32 %2, i32* %array_val, align 4 cstep: 2 %3 = volatile load i32* %array_val, align 4 ; <i32> [#uses=2] cstep: 0 %indvar.next = add i32 %indvar, 1 ; <i32> [#uses=2] cstep: 1 %exitcond = icmp eq i32 %indvar.next, 99 ; <i1> [#uses=1] cstep: 1 br i1 %exitcond, label %bb2, label %bb
Note how the load is performed immediately after the store. We can do this because the memory controller is shared and we access it sequentially. I don't see how the control steps map to the number of cycles: here the max cstep is 2, whereas the previous example had a max cstep of 4, yet they both seem to take 5 cycles to iterate over the loop. Strength reduction does nothing here.
The actual states are:
state: 2 %indvar = phi i32 [ 0, %bb.nph ], [ %indvar.next, %bb ] ; <i32> [#uses=1] state: 2 %1 = phi i32 [ %0, %bb.nph ], [ %3, %bb ] ; <i32> [#uses=1] state: 2 %2 = add nsw i32 %1, 1 ; <i32> [#uses=1] state: 2 %indvar.next = add i32 %indvar, 1 ; <i32> [#uses=2] state: 7 volatile store i32 %2, i32* %array_val, align 4 state: 7 %exitcond = icmp eq i32 %indvar.next, 99 ; <i1> [#uses=1] state: 7 br i1 %exitcond, label %bb2, label %bb state: 8 %3 = volatile load i32* %array_val, align 4 ; <i32> [#uses=2]
Note that states 2, 7, and 8 execute sequentially. The branch actually happens at state 10, i.e. 2 cycles after the load, so there is some code to handle the memory instruction latency for branches. That makes 3 + 2 = 5 cycles in the inner loop, which matches what we see (99 iterations × 5 cycles ≈ 500 cycles). Honestly, I need to make this a lot easier to visualize! The control steps are really only local (within a BB). I think it's actually done in SchedulerMapping::createFSM(). There are a few lines of code to expand the number of states in a basic block to ensure a function finishes:
// need to ensure multi-cycle instructions finish in the basic block
unsigned delayState = SchedulerPass::getNumInstructionCycles(I);
...
Nope. Not here.
— Andrew Canis 2011/04/18 15:05
Fixed the problem with my Samba printing. It was this bug in Ubuntu: http://brainextender.blogspot.com/2009/01/ubuntu-intrepid-too-many-failed.html
— Andrew Canis 2011/04/17 15:05
Added analytics code to blog and wiki (wiki/lib/tpl/default/main.php)
— Andrew Canis 2011/04/15 15:05
Conditional gdb breakpoint:
b translate.cc:244 if !a
For the code:
Breakpoint 3, translate_source (r=0x99dc080) at translate.cc:244 244 if (!a) simple_error("temporary register used before defined");
— Andrew Canis 2011/04/08 15:05
Added Geolocation info:
crontab -e
# Update geocity data on the 3rd of every month
0 0 3 * * /var/www/updategeocity.sh 2>&1 | mailx -s "update GeoCity" andrew.canis@utoronto.ca
Used the code from:
http://www.sequentiallogic.com/2009/05/29/maxmind-geolite-country-and-geolite-city-made-easy/
Sample code in geo.php
— Andrew Canis 2011/02/16 15:05
Working on a user guide: ~/grad/legup/notes/
— Andrew Canis 2011/02/04 15:05
Changed legup.org → legup.eecg.utoronto.ca
Even with new pipelined dividers aes is slow:
Clock Setup: 'clk' -------------------------------------------------------------------------------------- Clock Setup: 'clk' -------------------------------------------------------------------------------------- Path Number : 1 Slack : -109.304 ns Actual fmax (period) : 8.38 MHz ( period = 119.304 ns ) From : decrypt:decrypt_inst|KeySchedule:KeySchedule_inst|KeySchedule_bb17_indvar_reg[0] To : decrypt:decrypt_inst|KeySchedule:KeySchedule_inst|KeySchedule_bb17_var14_reg[29] From Clock : clk To Clock : clk Required Setup Relationship : 10.000 ns Required Longest P2P Time : 9.809 ns Actual Longest P2P Time : 119.113 ns
What is this path?
acanis@acanis-desktop:~/work/legup/tiger/hybrid/aes$ grep KeySchedule_bb17_indvar_reg aes.v reg [31:0] KeySchedule_bb17_indvar_reg; KeySchedule_bb17_indvar_reg <= KeySchedule_bb17_indvar; KeySchedule_bb29_scevgep55 <= `TAG_g_word_a + 4 * KeySchedule_bb17_indvar_reg; KeySchedule_bb29_scevgep55_1 <= `TAG_g_word_a + 4 * KeySchedule_bb17_indvar_reg + 480; KeySchedule_bb29_scevgep55_2 <= `TAG_g_word_a + 4 * KeySchedule_bb17_indvar_reg + 960; KeySchedule_bb29_scevgep55_3 <= `TAG_g_word_a + 4 * KeySchedule_bb17_indvar_reg + 1440; KeySchedule_bb29_indvar_next <= KeySchedule_bb17_indvar_reg + 32'd1; acanis@acanis-desktop:~/work/legup/tiger/hybrid/aes$ grep KeySchedule_bb17_var14_reg aes.v reg KeySchedule_bb17_var14_reg; KeySchedule_bb17_var14_reg <= KeySchedule_bb17_var14; if (KeySchedule_bb17_var14_reg) begin
Here's the From gate:
always @(posedge clk) begin /* KeySchedule: bb17*/ if (cur_state == 14) begin /* %indvar = phi i32 [ 0, %bb.nph45 ], [ %indvar.next, %bb29 ] ; <i32> [#uses=8]*/ KeySchedule_bb17_indvar_reg <= KeySchedule_bb17_indvar; end end
Here's the To gate:
always @(posedge clk) begin /* KeySchedule: bb17*/ if (cur_state == 37) begin /* %13 = icmp eq i32 %12, 0 ; <i1> [#uses=1]*/ KeySchedule_bb17_var14_reg <= KeySchedule_bb17_var14; end always @(*) begin KeySchedule_bb17_var14 <= 0; /* KeySchedule: bb17*/ if (cur_state == 37) begin /* %13 = icmp eq i32 %12, 0 ; <i1> [#uses=1]*/ KeySchedule_bb17_var14 <= KeySchedule_bb17_var13_reg == 32'd0; end always @(posedge clk) begin /* KeySchedule: bb17*/ if (cur_state == 36) begin /* %12 = srem i32 %j.137, %nk.095 ; <i32> [#uses=2]*/ KeySchedule_bb17_var13_reg <= KeySchedule_bb17_var13; end
I bet it's something to do with the 'srem'. I should pipeline that…
Why is adpcm so slow?
Info: Slack time is -20.841 ns for clock "clk" between source memory "main:main_inst|altsyncram:Mux3_rtl_0|altsyncram_3lu:auto_generated|ram_block1a0~porta_address_reg8" and destination register "main:main_inst|main_quantl_exit_i_n__i_i_reg[31]" Info: Fmax is 45.79 MHz (period= 21.841 ns)
Here's what happens in state 179!
always @(*) begin if (cur_state == 179) begin /* %124 = load i32* getelementptr inbounds ([6 x i32]* @delay_bph, i32 0, i32 5), align 4 ; <i32> [#uses=1]*/ main_quantl_exit_i_var141 <= memory_controller_out; /* %126 = mul nsw i32 %124, %123 ; <i32> [#uses=1]*/ main_quantl_exit_i_var144 <= main_quantl_exit_i_var141 * main_quantl_exit_i_var140_reg; /* %132 = ashr i32 %131, 14 ; <i32> [#uses=2]*/ main_quantl_exit_i_var145 <= main_quantl_exit_i_var143_reg + main_quantl_exit_i_var144; /* %136 = add nsw i32 %135, %132 ; <i32> [#uses=2]*/ main_quantl_exit_i_var146 <= $signed(main_quantl_exit_i_var145) >>> 32'd14 % 32; /* %131 = add nsw i32 %130, %126 ; <i32> [#uses=1]*/ main_quantl_exit_i_var147 <= main_quantl_exit_i_var79_reg + main_quantl_exit_i_var146; /* %137 = sub nsw i32 %30, %136 ; <i32> [#uses=3]*/ main_quantl_exit_i_var148 <= main_bb5_i_var30_reg - main_quantl_exit_i_var147; /* %138 = icmp sgt i32 %137, -1 ; <i1> [#uses=2]*/ main_quantl_exit_i_var149 <= $signed(main_quantl_exit_i_var148) > $signed(-32'd1); /* %n..i.i = select i1 %138, i32 %137, i32 %141 ; <i32> [#uses=1]*/ main_quantl_exit_i_n__i_i <= main_quantl_exit_i_var149 ? main_quantl_exit_i_var148 : main_quantl_exit_i_var150; /* %n..i.i = select i1 %138, i32 %137, i32 %141 ; <i32> [#uses=1]*/ main_quantl_exit_i_n__i_i_reg <= main_quantl_exit_i_n__i_i;
I don't know how this was all put into one state. There must be a problem with the estimations.
Why is the multiplier count so high for adpcm? Because I stopped sharing 32-bit multipliers. Add that back in. Well, I can only add this back in if I pipeline the multipliers; otherwise I get a huge hit in fmax, especially with adpcm.
Am I sharing the dividers properly?
ac215364@1637b:~/Downloads >grep "lpm_divide " --before-context=1 aes_main.v |grep % /* %12 = srem i32 %j.137, %nk.095 ; <i32> [#uses=2]*/ /* %19 = sdiv i32 %j.137, %nk.095 ; <i32> [#uses=1]*/ /* %15 = sdiv i32 %14, 16 ; <i32> [#uses=1]*/ /* %16 = srem i32 %14, 16 ; <i32> [#uses=1]*/ /* %25 = sdiv i32 %24, 16 ; <i32> [#uses=1]*/ /* %26 = srem i32 %24, 16 ; <i32> [#uses=1]*/ /* %30 = sdiv i32 %29, 16 ; <i32> [#uses=1]*/ /* %31 = srem i32 %29, 16 ; <i32> [#uses=1]*/ /* %35 = sdiv i32 %34, 16 ; <i32> [#uses=1]*/ /* %36 = srem i32 %34, 16 ; <i32> [#uses=1]*/ /* %48 = sdiv i32 %46, 16 ; <i32> [#uses=1]*/ /* %49 = srem i32 %46, 16 ; <i32> [#uses=1]*/ /* %52 = sdiv i32 %45, 16 ; <i32> [#uses=1]*/ /* %53 = srem i32 %45, 16 ; <i32> [#uses=1]*/ /* %56 = sdiv i32 %44, 16 ; <i32> [#uses=1]*/ /* %57 = srem i32 %44, 16 ; <i32> [#uses=1]*/ /* %60 = sdiv i32 %43, 16 ; <i32> [#uses=1]*/ /* %61 = srem i32 %43, 16 ; <i32> [#uses=1]*/ /* %27 = srem i32 %tmp84, 4 ; <i32> [#uses=1]*/ /* %42 = srem i32 %tmp85, 4 ; <i32> [#uses=1]*/ /* %57 = srem i32 %tmp86, 4 ; <i32> [#uses=1]*/ /* %36 = srem i32 %type, 1000 ; <i32> [#uses=2]*/ /* %38 = sdiv i32 %36, 8 ; <i32> [#uses=2]*/
— Andrew Canis 2010/09/20 8:00
So blowfish hasn't compiled properly with quartus since Sept 2:
acanis@navy:/autofs/seth.eecg/brown.r/r0/acanis/buildbot/linux_x86_64/build/examples/chstone/blowfish$ ll --sort=time -rw-r--r-- 1 acanis browngrp 574K Sep 16 11:52 bf.v ... -rw-rw-r-- 1 acanis browngrp 25 Sep 2 15:26 top.done
Getting a strange error:
Info: Found 1 design units, including 1 entities, in source file db/altsyncram_1b13.tdf Info: Found entity 1: altsyncram_1b13 Info: Found 1 design units, including 1 entities, in source file db/mux_ujb.tdf Info: Found entity 1: mux_ujb Error: Current module quartus_map ended unexpectedly Error: Flow compile (for project /autofs/seth.eecg/brown.r/r0/acanis/buildbot/linux_x86_64/build/examples/chstone/blowfish/top) was not successful Error: ERROR: Error(s) found while running an executable. See report file(s) for error message(s). Message log indicates which executable was run last. make[2]: *** [f] Error 3 make[2]: Leaving directory `/autofs/seth.eecg/brown.r/r0/acanis/buildbot/linux_x86_64/build/examples/chstone/blowfish'
Trying to just manually build blowfish:
quartus_sh --flow compile top
Trying to turn off parallel synthesis to see if that helps. Nope.
Trying:
make cleanall
make
quartus_sh --flow compile top
— Andrew Canis 2010/09/16 8:00
Just had a conflict with a locally modified file:
git reset --hard
git pull again
Then I got 'xxxx would be overwritten by merge.' To fix it I did the following - DON'T DO THIS - I just lost my last two commits!
git fetch
git reset --hard origin/master
Debugging issue with gsm make generate-wrapper:
../../../llvm/Debug/bin/opt -legup-config=config.tcl -load=../../../llvm/Debug/lib//LLVMLegUp.so -legup-hw-only < gsm.prelto.bc > gsm.prelto.hw.bc Invalid user of intrinsic instruction! i8* bitcast (void (i8*, i8, i64, i32)* @llvm.memset.i64 to i8*) Broken module found, compilation aborted!
Verifier.cpp fails here:
680 // If this function is actually an intrinsic, verify that it is only used in 681 // direct call/invokes, never having its "address taken". 682 if (F.getIntrinsicID()) { 683 for (Value::use_iterator UI = F.use_begin(), E = F.use_end(); UI != E;++UI){ 684 User *U = cast<User>(UI); (gdb) l 685 if ((isa<CallInst>(U) || isa<InvokeInst>(U)) && UI.getOperandNo() == 0) 686 continue; // Direct calls/invokes are ok. 687 688 Assert1(0, "Invalid user of intrinsic instruction!", U); 689 } 690 }
So the new code is trying to take the address of an intrinsic? How do I print out the bitcode before this failure? Okay, I need to run with the -disable-verify flag:
../../../llvm/Debug/bin/opt -disable-verify -legup-config=config.tcl -load=../../../llvm/Debug/lib//LLVMLegUp.so -legup-hw-only < gsm.prelto.bc > gsm.prelto.hw.bc
That's weird, the error is coming from:
@llvm.used = appending global [5 x i8*] [i8* bitcast (void (i8*, i8, i64, i32)* @llvm.memset.i64 to i8*), i8* bitcast (void (i16*, i32*)* @Autocorrelation to i8*), i8* bitcast (void (i32*, i16*)* @Reflection_coefficients to i8*), i8* bit
What is @llvm.used?
If a global variable appears in the @llvm.used list, then the compiler, assembler, and linker are required to treat the symbol as if there is a reference to the global that it cannot see.
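For reference, the usual way @llvm.used entries show up from C source is the gcc/clang 'used' attribute. A minimal sketch in that direction (the function name is made up, and I'm not claiming this is where the gsm entries came from):

// Compile with e.g.  clang++ -S -emit-llvm used.cpp -o - | grep llvm.used
// The 'used' attribute forces the symbol to be kept even though nothing
// references it, and the frontend records that fact via @llvm.used.
#include <cstdio>

__attribute__((used)) static void keep_me() {   // made-up example function
    printf("kept alive only by the attribute\n");
}

int main() { return 0; }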
Okay I've just removed the code that produces this llvm.used variable. The only difference I see is that the functions are now all marked as internal. Not sure if this will matter.
— Andrew Canis 2010/09/14 8:00
Binding: sharing dividers has disappointing results. It saves ~1500 LEs on dfsin and dfdiv - so the geomean only drops ~1%.
Interesting synthesis options: settings→synthesis→more settings. Option to turn small rams into logic:
set_global_assignment -name AUTO_RAM_TO_LCELL_CONVERSION ON
One thing we are missing: we can't transform a simple branch into a mux. For instance:
if (a == 3) b = z; else b = y;
This if statement will need multiple states: state 1 computes a == 3, state 2 does b = z, state 3 does b = y. Really we should be able to feed a == 3 into a mux.
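Just to pin down what I mean, here is the mux-friendly way to write the same thing (a sketch; pick() is just a made-up wrapper). At the IR level this is a single select, which maps directly onto a 2-to-1 mux with the comparison as the select line:

// Same computation as the if/else above, written so it becomes one
// compare feeding a 2-to-1 mux (an LLVM 'select') instead of three states.
int pick(int a, int y, int z) {
    return (a == 3) ? z : y;   // the compare drives the mux select
}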
— Andrew Canis 2010/09/02 8:00
Hybrid jpeg simulation took 34 hours. Rest of the tests took about an hour. I'm just going to disable jpeg for now.
— Andrew Canis 2010/09/01 8:00
Made changes to buildbot for Tiger perf and generated new folders:
buildmaster@acanis-desktop:~/buildbot/public_html/perf$ git diff diff --git a/buildbot/public_html/perf/generate_perf.py b/buildbot/p index 10eaaac..66cd8cc 100644 --- a/buildbot/public_html/perf/generate_perf.py +++ b/buildbot/public_html/perf/generate_perf.py @@ -140,6 +140,8 @@ PerfTester_list = [ PerfTester('linux_x86', 'Linux Perf'), PerfTester('linux_x86_64', 'Linux 64 Perf'), PerfTester('perf_test', 'Test Perf'), + PerfTester('linux_x86_tiger', 'Linux Perf'), + PerfTester('linux_x86_64_tiger', 'Linux 64 Perf'), # PerfTester('xp-release-dual-core', 'XP Perf'), # PerfTester('xp-release-single-core', 'XP Perf (single)'), # PerfTester('vista-release-dual-core', 'Vista Perf'), buildmaster@acanis-desktop:~/buildbot/public_html/perf$ python generate_perf.py
Tiger simulation flow:
cd examples/sra
# compile for mips. Convert .elf to sdram.dat
make tiger
# $(PROC_DIR) = tiger/hybrid/processor/tiger_cache_on_avalon/tiger_sim
# copy sdram.dat into $(PROC_DIR)
# run vsim: cd $(PROC_DIR) && vsim -c -do "../run_sim.tcl"
make tigersim
Interesting. It's not possible to screw up the history with git --amend - git won't let you push.
There are still ModelSim warnings in Tiger. One is an incompatible clock port in the lpm_divide/lpm_mult modules. I don't understand why. One thing I noticed: removing the vsim flag +acc=rn (display all registers and nets) gets rid of the warning.
— Andrew Canis 2010/08/27 8:00
Interesting, shifters take a lot of area:
shift_ll_32 luts: 159 mux_2_32 luts: 32 shift_ll_64 luts: 410 mux_2_64 luts: 64 shift_rl_32 luts: 159 mux_2_32 luts: 32 shift_rl_64 luts: 410 mux_2_64 luts: 64 signed_comp_eq_mux_32 luts: 53 mux_2_32 luts: 32 signed_comp_eq_mux_64 luts: 106 mux_2_64 luts: 64 signed_multiply_64 luts: 169 mux_2_64 luts: 64 unsigned_divide_64 luts: 4285 mux_2_64 luts: 64
Old results for xpilot mips fast (recompiled for CycloneII):
xpilot_fast_mips/prj ------------ Fmax: 91.65 MHz Latency: 7347 cycles Latency: 80 us Verilog: 48 LOC Family : Cyclone II Device : EP2C15AF484C6 Timing Models : Final Total logic elements : 2,815 / 14,448 ( 19 % ) Total combinational functions : 2,620 / 14,448 ( 18 % ) Dedicated logic registers : 1,449 / 14,448 ( 10 % ) Total registers : 1449 Total pins : 213 / 315 ( 68 % ) Total virtual pins : 0 Total memory bits : 8,192 / 239,616 ( 3 % ) Embedded Multiplier 9-bit elements : 8 / 52 ( 15 % )
— Andrew Canis 2010/08/26 8:00
Literature review for PhD transfer. Areas:
What is the “state of the art”? Let's start with the easy one: high-level synthesis optimizations.
— Andrew Canis 2010/08/23 8:00
Got 2-cycle load/store working. Basically the struct memory controller didn't work if I put the memory_controller_out assignment in an always @(posedge clk) block, but it did work if I created a new memory_controller_out_reg signal. Strange.
Looking into fmaxes on latest run:
Info: Slack time is -13.703 ns for clock "clk" between source register "main:main_inst|main_Proc_3_exit_i_tmp18_reg[5]" and destination memory "memory_controller:memory_controller_inst|ram_one_port:Arr_2_Glob|altsyncram:altsyncram_component|altsyncram_hae1:auto_generated|ram_block1a22~porta_address_reg10" Info: Fmax is 68.01 MHz (period= 14.703 ns) Info: Slack time is -23.143 ns for clock "clk" between source memory "memory_controller:memory_controller_inst|ram_one_port:delay_dltx|altsyncram:altsyncram_component|altsyncram_p8d1:auto_generated|ram_block1a14~porta_address_reg2" and destination memory "memory_controller:memory_controller_inst|ram_one_port:accumd|altsyncram:altsyncram_component|altsyncram_9sc1:auto_generated|ram_block1a0~porta_datain_reg8" Info: Fmax is 41.42 MHz (period= 24.143 ns) Info: Slack time is -26.296 ns for clock "clk" between source memory "memory_controller:memory_controller_inst|ram_one_port:_str|altsyncram:altsyncram_component|altsyncram_glc1:auto_generated|ram_block1a0~porta_address_reg4" and destination memory "memory_controller:memory_controller_inst|ram_one_port:_str|altsyncram:altsyncram_component|altsyncram_glc1:auto_generated|ram_block1a0~porta_datain_reg6" Info: Fmax is 36.64 MHz (period= 27.296 ns) Info: Slack time is -14.011 ns for clock "clk" between source memory "memory_controller:memory_controller_inst|ram_one_port:out_key|altsyncram:altsyncram_component|altsyncram_44d1:auto_generated|ram_block1a2~porta_address_reg11" and destination register "main:main_inst|main_bb12_i_check_025_i_phi_temp[31]" Info: Fmax is 66.62 MHz (period= 15.011 ns) Info: Slack time is -12.449 ns for clock "clk" between source memory "memory_controller:memory_controller_inst|ram_one_port:b_input|altsyncram:altsyncram_component|altsyncram_s0d1:auto_generated|ram_block1a32~porta_address_reg5" and destination register "main:main_inst|main_float64_add_exit_var155_reg[31]" Info: Fmax is 74.35 MHz (period= 13.449 ns) Info: Slack time is -17.152 ns for clock "clk" between source memory "main:main_inst|altsyncram:Mux2_rtl_0|altsyncram_2lu:auto_generated|ram_block1a0~porta_address_reg8" and destination register "main:main_inst|main_bb4_i32_i_var113_reg[31]" Info: Fmax is 55.09 MHz (period= 18.152 ns) Info: Slack time is -11.244 ns for clock "clk" between source register "main:main_inst|cur_state.0011111" and destination register "main:main_inst|main_bb22_i_var102_reg[63]" Info: Fmax is 81.67 MHz (period= 12.244 ns) Info: Slack time is -16.636 ns for clock "clk" between source memory "main:main_inst|altsyncram:Selector1_rtl_0|altsyncram_nnu:auto_generated|ram_block1a0~porta_address_reg9" and destination register "main:main_inst|main_bb24_i_i_var193_reg[63]" Info: Fmax is 56.7 MHz (period= 17.636 ns) Info: Slack time is -17.503 ns for clock "clk" between source memory "memory_controller:memory_controller_inst|ram_one_port:main_bb_nph19_so|altsyncram:altsyncram_component|altsyncram_psa1:auto_generated|ram_block1a0~porta_address_reg7" and destination register "main:main_inst|main_bb_nph35_i_var106_reg[30]" Info: Fmax is 54.05 MHz (period= 18.503 ns) Info: Slack time is -28.608 ns for clock "clk" between source memory "memory_controller:memory_controller_inst|ram_one_port:hana|altsyncram:altsyncram_component|altsyncram_7pc1:auto_generated|ram_block1a6~porta_address_reg11" and destination memory "memory_controller:memory_controller_inst|ram_one_port:JpegFileBuf|altsyncram:altsyncram_component|altsyncram_igd1:auto_generated|ram_block1a7~porta_address_reg8" Info: Fmax is 33.77 MHz (period= 
29.608 ns) nfo: Slack time is -15.996 ns for clock "clk" between source memory "memory_controller:memory_controller_inst|ram_one_port:_str|altsyncram:altsyncram_component|altsyncram_njc1:auto_generated|ram_block1a0~porta_address_reg0" and destination register "main:main_inst|main_bb45_Hi_0_phi_temp[30]" Info: Fmax is 58.84 MHz (period= 16.996 ns) Info: Slack time is -15.312 ns for clock "clk" between source memory "memory_controller:memory_controller_inst|ram_one_port:ld_Bfr|altsyncram:altsyncram_component|altsyncram_hqc1:auto_generated|ram_block1a0~porta_we_reg" and destination register "main:main_inst|main_bb18_var29_reg[31]" Info: Fmax is 61.3 MHz (period= 16.312 ns) Info: Slack time is -12.528 ns for clock "clk" between source register "main:main_inst|cur_state.0001010" and destination memory "memory_controller:memory_controller_inst|ram_one_port:indata|altsyncram:altsyncram_component|altsyncram_l1d1:auto_generated|ram_block1a16~porta_address_reg11" Info: Fmax is 73.92 MHz (period= 13.528 ns)
Getting the fmax of every accelerator to 100 MHz. The pipeline depth of udivs is now equal to the bitwidth.
Looking at jpeg:
Info: Slack time is -20.027 ns for clock "clk" between source memory "main:main_inst|altsyncram:Selector1_rtl_0|altsyncram_onu:auto_generated|ram_block1a0~porta_address_reg9" and destination memory "memory_controller:memory_controller_inst|ram_one_port:DecodeInfo_comps_info_quant_tbl_no|altsyncram:altsyncram_component|altsyncram_vlf1:auto_generated|ram_block1a0~porta_we_reg" Info: Fmax is 47.56 MHz (period= 21.027 ns)
Where is the source memory? Can't really tell from the description. There shouldn't be any altsyncrams in main_inst… Dest ram:
reg [1:0] DecodeInfo_comps_info_quant_tbl_no_address; reg DecodeInfo_comps_info_quant_tbl_no_write_enable; reg [7:0] DecodeInfo_comps_info_quant_tbl_no_in; wire [7:0] DecodeInfo_comps_info_quant_tbl_no_out; /* @DecodeInfo_comps_info_quant_tbl_no = internal global [3 x i8] zeroinitializer ; <[3 x i8]*> [#uses=2] */ ram_one_port DecodeInfo_comps_info_quant_tbl_no ( .clk( clk ), .address( DecodeInfo_comps_info_quant_tbl_no_address ), .write_enable( DecodeInfo_comps_info_quant_tbl_no_write_enable ), .data( DecodeInfo_comps_info_quant_tbl_no_in ), .q( DecodeInfo_comps_info_quant_tbl_no_out ) ); defparam DecodeInfo_comps_info_quant_tbl_no.width_a = 8; defparam DecodeInfo_comps_info_quant_tbl_no.widthad_a = 2; defparam DecodeInfo_comps_info_quant_tbl_no.numwords_a = 3; defparam DecodeInfo_comps_info_quant_tbl_no.init_file = "DecodeInfo_comps_info_quant_tbl_no.mif";
Seems like a small altsyncram.
Actually, many of the circuits seem to have the altsyncram on their critical path. I'm just going to register the memory controller and use the waitrequest. The fastest circuit is dfadd:
Info: Slack time is -8.681 ns for clock "clk" between source memory "memory_controller:memory_controller_inst|ram_one_port:float_exception_flags|altsyncram:altsyncram_component|altsyncram_oce1:auto_generated|ram_block1a0~porta_we_reg" and destination register "main:main_inst|var146_reg[31]" Info: Fmax is 103.3 MHz (period= 9.681 ns)
I don't understand why we_reg would ever be driving something. Isn't that an input to the altsyncram? Oh of course. The altsyncram doesn't have output flops, just input registers. So we_reg is the input flop for the write enable. So the critical path is through the entire altsyncram.
— Andrew Canis 2010/08/19 8:00
To pretty print cpp:
a2ps -o print.ps MetaScheduler.h MetaScheduler.cpp ConstraintScheduling.* Simple* LegUpSchedulerDAG.* Scheduler*
— Andrew Canis 2010/08/17 8:00
To convert jpeg to raw text RGB (.ppm):
acanis@acanis-desktop:~/work/legup/examples/chstone/jpeg$ convert -compress None pic.jpeg pic.ppm
Should be 150×113 = 16,950 pixels. Assuming RGB, there should be 50850 entries. However, printing from YuvToRgb I get 18,432 pixels.
Note: we are decoding a 4:1:1.
An ascii R G B .ppm file looks like (150×113):
P3 150 113 255 202 246 0 ....
Looking at the code:
/* Transform from Yuv into RGB */ for (i = 0; i < 4; i++) { YuvToRgb (rgb_buf[i], IDCTBuff[i], IDCTBuff[4], IDCTBuff[5]); } for (i = 0; i < RGB_NUM; i++) { p_out_vpos = OutData_comp_vpos[i]; p_out_hpos = OutData_comp_hpos[i]; for (j = 0; j < BMP_OUT_SIZE; j++) { p_out_buf[j] = OutData_comp_buf[i][j]; } Write4Blocks (rgb_buf[0][i], rgb_buf[1][i], rgb_buf[2][i], rgb_buf[3][i]); }
So we have four rgb_buf entries:
#define DCTSIZE2 64 #define RGB_NUM 3 int rgb_buf[4][RGB_NUM][DCTSIZE2];
I just cannot figure this code out. It makes no sense to me. I took a look at the original source from Stanford. They have made massive changes, to the point where this is basically not comparable to the original.
— Andrew Canis 2010/08/09 8:00
Buildbot mailing list isn't working. I forgot to run:
sudo newaliases
— Andrew Canis 2010/07/30 8:00
So the problem circuits are:
very high LEs - jpeg, dfsin
high LEs - adpcm, dfdiv
Looking at Ahmed's resource estimates:
acanis@acanis-desktop:~/work/legup/examples/chstone$ grep -i div `find . -name resources.summary` ./dfdiv/resources.summary:Operation "unsigned_divide_64" x 2 ./dfdiv/resources.summary:Critical path contains operation: unsigned_divide_64 ./dfsin/resources.summary:Operation "unsigned_divide_64" x 2 acanis@acanis-desktop:~/work/legup/examples/chstone$ grep -i mult `find . -name resources.summary` ./jpeg/resources.summary:Operation "signed_multiply_32" x 11 ./jpeg/resources.summary:Operation "signed_multiply_nodsp_32" x 32 ./adpcm/resources.summary:Operation "signed_multiply_32" x 11 ./adpcm/resources.summary:Operation "signed_multiply_nodsp_32" x 78 ./dfmul/resources.summary:Operation "signed_multiply_64" x 3 ./dfmul/resources.summary:Operation "signed_multiply_nodsp_64" x 1 ./dfdiv/resources.summary:Operation "signed_multiply_64" x 3 ./dfdiv/resources.summary:Operation "signed_multiply_nodsp_64" x 8 ./motion/resources.summary:Operation "signed_multiply_32" x 2 ./sha/resources.summary:Operation "signed_multiply_32" x 1 ./sha/resources.summary:Critical path contains operation: signed_multiply_32 ./dfsin/resources.summary:Operation "signed_multiply_32" x 1 ./dfsin/resources.summary:Operation "signed_multiply_64" x 3 ./dfsin/resources.summary:Operation "signed_multiply_nodsp_64" x 12 ./mips/resources.summary:Operation "signed_multiply_64" x 2 ./mips/resources.summary:Critical path contains operation: signed_multiply_64 ./gsm/resources.summary:Operation "signed_multiply_32" x 11 ./gsm/resources.summary:Operation "signed_multiply_nodsp_32" x 46
Surprisingly, gsm has a ton of multipliers: 57. But they are only 32-bit. jpeg has only 43 32-bit multipliers, but probably has more logic in the circuit. gsm is still fairly big; it's right after dfdiv in terms of LEs. So we can save up to:
jpeg: 42 32-bit mults (with a 43-1 mux)
dfsin: 14 64-bit mults (with a 15-1 mux). 1 64-bit div.
adpcm: 88 32-bit mults (with a 89-1 mux)
Why don't we try this incrementally? Just share half of the 'big' functional units (div/mul) with muxes for now.
If you look in hwtest/CycloneII.tcl, all the operation characteristics are given.
set_operation_attributes <name> <LUTs> <Registers> <LogicElements> <ALUTs> <DSPElements>
set_operation_attributes signed_multiply_32 29 96 111 0 6
set_operation_attributes signed_multiply_64 169 192 315 0 20
set_operation_attributes signed_multiply_nodsp_32 694 96 758 0 0
set_operation_attributes signed_multiply_nodsp_64 2748 192 2876 0 0
set_operation_attributes signed_divide_32 1214 96 1278 0 0
set_operation_attributes signed_divide_64 4509 192 4637 0 0
set_operation_attributes signed_modulus_32 1277 96 1341 0 0
set_operation_attributes signed_modulus_64 4604 192 4732 0 0
So cutting down these operations will save a ton of DSPs. When there are no DSPs the LUT count just explodes. So why don't we share after we max out DSPs?
Just looking at nodsps:
./jpeg/resources.summary:Operation "signed_multiply_nodsp_32" x 32 ./adpcm/resources.summary:Operation "signed_multiply_nodsp_32" x 78 ./dfmul/resources.summary:Operation "signed_multiply_nodsp_64" x 1 ./dfdiv/resources.summary:Operation "signed_multiply_nodsp_64" x 8 ./dfsin/resources.summary:Operation "signed_multiply_nodsp_64" x 12 ./gsm/resources.summary:Operation "signed_multiply_nodsp_32" x 46
We can save a ton here. How much do muxes cost?
set_operation_attributes <name> <LUTs> <Registers> <LogicElements> <ALUTs> <DSPElements>
set_operation_attributes mux_2_32 32 97 97 0 0
set_operation_attributes mux_2_64 64 193 193 0 0
set_operation_attributes mux_4_32 64 162 194 0 0
set_operation_attributes mux_4_64 128 322 386 0 0
set_operation_attributes mux_8_32 160 291 419 0 0
Pretty cheap. For 2-1 muxes, LUTs = bitwidth. Probably best to stick to 2-1 muxes for now, though the cost appears to increase linearly up to 4-1; 8-1 looks higher than that.
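Back-of-the-envelope on the savings, using the CycloneII.tcl numbers above. This is only a sketch: it assumes each shared unit needs an n-to-1 mux on each of its two operands, built out of (n-1) 2-1 muxes, which is not necessarily how the binding pass will actually do it:

// Rough LUT estimate for sharing adpcm's 78 signed_multiply_nodsp_32 units
// down to one, using the set_operation_attributes numbers above.
#include <cstdio>

int main() {
    const int mult_nodsp_32 = 694;  // LUTs per signed_multiply_nodsp_32
    const int mux_2_32      = 32;   // LUTs per mux_2_32

    const int n = 78;               // nodsp 32-bit multipliers in adpcm
    int lutsSaved = (n - 1) * mult_nodsp_32;  // multipliers we remove
    int lutsMuxes = 2 * (n - 1) * mux_2_32;   // two operand muxes, (n-1) 2-1 muxes each

    printf("save %d LUTs in multipliers, pay ~%d LUTs in muxes, net ~%d LUTs\n",
           lutsSaved, lutsMuxes, lutsSaved - lutsMuxes);
    return 0;
}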
Chrome buildbot changed:
acanis@acanis-desktop:~/buildbot/chrome_buildbot$ svn up U master.chromium.memory/master.cfg U scripts/common/chromium_utils.py U scripts/slave/chromium/sizes.py A scripts/slave/gsutil U scripts/slave/zip_build.py D scripts/master/unittests/master_utils_test.py A scripts/master/unittests/chromium_commands_test.py U scripts/master/unittests/runtests.py U scripts/master/factory/nacl_commands.py U scripts/master/factory/chromeos_commands.py U scripts/master/factory/commands.py U scripts/master/factory/chromium_commands.py U scripts/master/factory/nacl_factory.py U scripts/master/factory/chromeos_factory.py U scripts/master/factory/chromium_factory.py U scripts/master/factory/gclient_factory.py U scripts/master/chromium_status.py U master.nacl.sdk/public_html/announce.html U master.nacl.sdk/master.cfg U perf/dashboard/sizes.html U perf/dashboard/ui/js/plotter.js U perf/generate_perf.py U master.naclports/public_html/announce.html U master.naclports/master.cfg U pylibs/buildbot/README.chromium U pylibs/buildbot/status/web/console.py U master.chromium.fyi/master.cfg U master.chromium.fyi/slaves.cfg U master.nacl/public_html/announce.html U master.nacl/master.cfg U master.nacl/slaves.cfg U master.chromeos/public_html/announce.html U master.chromeos/master.cfg U master.chromeos/slaves.cfg U slave/run_slave.py U master.chromium/master.cfg Updated to revision 54543.
Nice. So I merged in plotter.js and we got slightly nicer graphs (can now click a variable to highlight it)
Set up a tiling window manager in GNOME.
To set it up, run:
gconftool-2 -s /desktop/gnome/session/required_components/windowmanager xmonad --type string
Create the file:
$ cat /usr/share/applications/xmonad.desktop
[Desktop Entry]
Type=Application
Encoding=UTF-8
Name=Xmonad
Exec=xmonad
NoDisplay=true
X-GNOME-WMName=Xmonad
X-GNOME-Autostart-Phase=WindowManager
X-GNOME-Provides=windowmanager
X-GNOME-Autostart-Notify=false
Then create ~/.xmonad/xmonad.hs with:
import XMonad
import XMonad.Config.Gnome
main = xmonad gnomeConfig
— Andrew Canis 2010/07/30 8:00
So it's sort of working. The graph doesn't work because git doesn't have numerical revisions. The important files:
buildmaster@acanis-desktop:~$ ll ./buildbot/public_html/perf/linux-release-hardy/moz/ -rwxr-xr-x 1 buildmaster buildmaster 60 2010-08-01 13:33 graphs.dat -rw-r--r-- 1 buildmaster buildmaster 183 2010-08-01 13:33 total_byte_b-summary.dat
Inside:
buildmaster@acanis-desktop:~$ cat ./buildbot/public_html/perf/linux-release-hardy/moz/graphs.dat [{"units": "kb", "important": true, "name": "total_byte_b"}] buildmaster@acanis-desktop:~$ cat ./buildbot/public_html/perf/linux-release-hardy/moz/total_byte_b-summary.dat {"traces": {"IO_b": ["5000.0", "0.0"]}, "rev": "a0d345abf9adf82074f0ad38ab6910b128c1147d"} {"traces": {"IO_b": ["43457.0", "0.0"]}, "rev": "a0d345abf9adf82074f0ad38ab6910b128c1147d"}
Had to modify the generic plotter to keep a map of build number to git revision.
I think I know what the issue is. Basically, the chromium step's __init__ takes 2 arguments: self and log_processor. But buildbot calls the factory with only 1:
step = factory(**args)
Okay, so I had to actually modify my factory.py in ~/buildbot/buildbot-0.8.1 and reinstall buildbot. Remember: you have to restart buildbot if you make changes to the Python classes.
diff --git a/buildbot-0.8.1/buildbot/process/factory.py b/buildbot-0.8.1/buildbot/process/factory.py index 384feb2..e981239 100644 --- a/buildbot-0.8.1/buildbot/process/factory.py +++ b/buildbot-0.8.1/buildbot/process/factory.py @@ -60,11 +60,14 @@ class BuildFactory(util.ComparableMixin): if kwargs: raise ArgumentsInTheWrongPlace() s = step_or_factory.getStepFactory() - elif type(step_or_factory) == type(BuildStep) and \ - issubclass(step_or_factory, BuildStep): - s = (step_or_factory, dict(kwargs)) + #elif type(step_or_factory) == type(BuildStep) and \ + # issubclass(step_or_factory, BuildStep): + # s = (step_or_factory, dict(kwargs)) + #else: + # raise ValueError('%r is not a BuildStep nor BuildStep subclass' % step_or_factory) + # Fix needed for chrome perf else: - raise ValueError('%r is not a BuildStep nor BuildStep subclass' % step_or_factory) + s = (step_or_factory, dict(kwargs)) self.steps.append(s) def addSteps(self, steps):
I've disabled perf expectations in master.cfg:
'expectations': False,
Remember: RESULT lines must have a space after the equals sign!
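For whatever harness ends up emitting the LegUp numbers, the lines have to match the GraphingLogProcessor format quoted in the older entries below. A minimal sketch (graph/trace names and values are made up) - note the space after the '=':

// Sketch of emitting dashboard-parsable lines; the format is
//   <*>RESULT <graph_name>: <trace_name>= <value> <units>
// with a mandatory space after '='. Names and numbers are placeholders.
#include <cstdio>

int main() {
    unsigned cycles = 12345;     // placeholder: simulated cycle count
    double fmax = 91.65;         // placeholder: post-fit Fmax in MHz
    printf("*RESULT cycles: dfadd= %u cycles\n", cycles);
    printf("*RESULT fmax: dfadd= %.2f MHz\n", fmax);
    return 0;
}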
Useful vim command: gF (go to the file under the cursor, at the line number after the filename)
— Andrew Canis 2010/07/30 8:00
Chrome buildbot templates:
acanis@acanis-desktop:~/buildbot/chrome_buildbot$ gvim pylibs/buildbot/status/web/index.html
Console doesn't work because:
The console view is still in development. At this moment it supports only the source control managers that have an integer based revision id, like svn.
Put chrome perf in buildbot/public_html. Ran:
python generate_perf.py
Generated a bunch of directories. I think make_expectations analyzes old data to calculate the max delta/variance and generate a 'perf_expectations.json' file; see:
In scripts/master/factory/chromium_commands.py:
def AddUploadPerfExpectations(self, factory_properties=None): """Adds a step to the factory to upload perf_expectations.json to the master. """ perf_id = factory_properties.get('perf_id') if not perf_id: logging.error("Error: cannot upload perf expectations: perf_id is unset") return slavesrc = "src/tools/perf_expectations/perf_expectations.json" masterdest = ("../scripts/master/log_parser/perf_expectations/%s.json" % perf_id)
So that's what the 'perf_id' property is used for.
In log_parser/process_log.py there is: class PerformanceLogProcessor(object)
There seems to be a 'graphs.dat' file that I'm missing, which is used in process_log.py.
So in master/factory/chromium_factory.py we have:
if R('page_cycler'): f.AddPageCyclerTests(fp)
Then in master/factory/chromium_commands.py we have:
def AddPageCyclerTests(self, factory_properties=None): """Adds a step to the factory to run the page-cycler tests.""" tests = [ {'name': 'moz'}, {'name': 'morejs', 'http': False}, {'name': 'intl1', 'http': False, 'target': 'Release'}, {'name': 'intl2', 'http': False, 'target': 'Release'}, {'name': 'bloat', 'http': True, 'target': 'Release'}, {'name': 'dhtml', 'http': False, 'target': 'Release'}, {'name': 'database', 'http': False, 'title': 'Database*'}, ] for test in tests: # Set the different names for this test. test['command_name'] = test['name'].capitalize() test['perf_name'] = test['name'] test['step_name'] = 'page_cycler_%s' % test['perf_name'] # Derive the class from the factory, name, and log processor. test['class'] = self.GetPerfStepClass( factory_properties, test['perf_name'], process_log.GraphingPageCyclerLogProcessor) # Get the test's command. cmd = self.GetPageCyclerCommand( test.get('title', test['command_name']), http_page_cyclers) # Add the test step to the factory. self.AddTestStep(test['class'], test['step_name'], cmd)
These names match the graphs on the perf dashboard. A couple are missing - bloat and database.
In master/factory/commands.py:
def GetPerfStepClass(self, factory_properties, test_name, log_processor_class, **kwargs): """Selects the right build step for the specified perf test.""" perf_id = factory_properties.get('perf_id') show_results = factory_properties.get('show_perf_results') if show_results and self._target in self.PERF_TEST_MAPPINGS: mapping = self.PERF_TEST_MAPPINGS[self._target] perf_name = mapping.get(perf_id) if not perf_name: raise Exception, ('There is no mapping for identifier %s in %s' % (perf_id, self._target)) report_link = '%s/%s/%s/%s' % (self.PERF_BASE_URL, perf_name, test_name, self.PERF_REPORT_URL_SUFFIX) output_dir = '%s/%s/%s' % (self.PERF_OUTPUT_DIR, perf_name, test_name) return self._CreatePerformanceStepClass(log_processor_class, report_link=report_link, output_dir=output_dir, factory_properties=factory_properties, perf_name=perf_name, test_name=test_name)
# -------------------------------------------------------------------------- # PERF TEST SETTINGS # In each mapping below, the first key is the target and the second is the # perf_id. The value is the directory name in the results URL. # Configuration of most tests. PERF_TEST_MAPPINGS = { 'Release': { 'chromium-linux-targets': 'linux-targets', 'chromium-rel-linux-hardy': 'linux-release-hardy', perf_base_url = 'http://build.chromium.org/buildbot/perf' perf_report_url_suffix = 'report.html?history=150' # Directory in which to save perf-test output data files. perf_output_dir = '~/www/perf'
Note in master.cfg we declare the perf_id:
f_cr_rel_linux_hardy_1 = F_LINUX('chromium-rel-linux-hardy', tests=['page_cycler', 'startup', 'page_cycler_http'], options=['startup_tests', 'page_cycler_tests'], factory_properties={ 'show_perf_results': True, 'expectations': True, 'perf_id': 'chromium-rel-linux-hardy'})
In master/factory/commands.py:
# Performance step utils. def _CreatePerformanceStepClass( self, log_processor_class, report_link=None, output_dir=None, factory_properties=None, perf_name=None, test_name=None, command_class=chromium_step.ProcessLogShellStep): """Returns ProcessLogShellStep class. Args: log_processor_class: class that will be used to process logs. Normally should be a subclass of process_log.PerformanceLogProcessor. report_link: URL that will be used as a link to results. If None, result won't be written into file. output_dir: directory where the log processor will write the results. command_class: command type to run for this step. Normally this will be chromium_step.ProcessLogShellStep. """ # We create a log-processor class using # chromium_utils.InitializePartiallyWithArguments, which uses function # currying to create classes that have preset constructor arguments. # This serves two purposes: # 1. Allows the step to instantiate its log processor without any # additional parameters; # 2. Creates a unique log processor class for each individual step, so # they can keep state that won't be shared between builds log_processor_class = chromium_utils.InitializePartiallyWithArguments( log_processor_class, report_link=report_link, output_dir=output_dir, factory_properties=factory_properties, perf_name=perf_name, test_name=test_name) # Similarly, we need to allow buildbot to create the step itself without # using additional parameters, so we create a step class that already # knows which log_processor to use. return chromium_utils.InitializePartiallyWithArguments(command_class, log_processor_class)
So basically just creates the class.
In scripts/master/log_parser/process_log.py:
class GraphingPageCyclerLogProcessor(GraphingLogProcessor): """Handles additional processing for page-cycler timing data.""" ... class GraphingLogProcessor(PerformanceLogProcessor): """Parent class for any log processor expecting standard data to be graphed. The log will be parsed looking for any lines of the form <*>RESULT <graph_name>: <trace_name>= <value> <units> or <*>RESULT <graph_name>: <trace_name>= [<value>,value,value,...] <units> or <*>RESULT <graph_name>: <trace_name>= {<mean>, <std deviation>} <units> For example, *RESULT vm_final_browser: OneTab= 8488 kb RESULT startup: reference= [167.00,148.00,146.00,142.00] msec The leading * is optional; if it's present, the data from that line will be included in the waterfall display. If multiple values are given in [ ], their mean and (sample) standard deviation will be written; if only one value is given, that will be written. A trailing comma is permitted in the list of values. Any of the <fields> except <value> may be empty, in which case not-terribly-useful defaults will be used. The <graph_name> and <trace_name> should not contain any spaces, colons (:) nor equals-signs (=). Furthermore, the <trace_name> will be used on the waterfall display, so it should be kept short. If the trace_name ends with '_ref', it will be interpreted as a reference value, and shown alongside the corresponding main value on the waterfall. """ RESULTS_REGEX = re.compile( r'(?P<IMPORTANT>\*)?RESULT ' '(?P<GRAPH>[^:]*): (?P<TRACE>[^=]*)= ' '(?P<VALUE>[\{\[]?[\d\., ]+[\}\]]?)( ?(?P<UNITS>.+))?')
So basically GraphingPageCyclerLogProcessor will be the command_class in:
def AddTestStep(self, command_class, test_name, test_command, test_description='', timeout=600, workdir=None, env=None, locks=None, halt_on_failure=False, do_step_if=True): """Adds a step to the factory to run a test. Args: command_class: the command type to run, such as shell.ShellCommand or gtest_command.GTestCommand test_name: a string describing the test, used to build its logfile name and its descriptions in the waterfall display timeout: the buildbot timeout for the test, in seconds. If it doesn't produce any output to stdout or stderr for this many seconds, buildbot will cancel it and call it a failure. test_command: the command list to run test_description: an auxiliary description to be appended to the test_name in the buildbot display; for example, ' (single process)'
GraphingPageCyclerLogProcessor's top parent is:
class PerformanceLogProcessor(object): """ Parent class for performance log parsers. """ def Process(self, revision, data): """Invoked by the step with data from log file once it completes. Each subclass needs to override this method to provide custom logic, which should include setting self._revision. Args: revision: changeset revision number that triggered the build. data: content of the log file that needs to be processed. Returns: A list of strings to be added to the waterfall display for this step. """ self._revision = revision return []
Okay, so I get this. Basically GraphingPageCyclerLogProcessor is ultimately a Buildbot step (like ShellCommand). It processes the output according to the regular expressions given above. So why don't I start by reusing one of these page cycler tests, just change the command, and see what happens.
Looking at chromes build log for a perf test:
# upload [uploading perf_expectations.json] [0 seconds] # page_cycler_moz [page_cycler_moz PERF_IMPROVE: total_op_b/IO_op_b IO_b: 43.5k (42.8k) IO_b_extcs1: 43.3k IO_op_b: 48.4k (53.6k) IO_op_b_extcs1: 53.6k IO_op_r: 28.2k (27.4k) IO_op_r_extcs1: 28.3k IO_r: 7.84k (7.55k) IO_r_extcs1: 7.68k t: 1.1k (1.09k) t_extcs1: 1.1k vm_pk_b: 15.0M (13.9M) vm_pk_b_extcs1: 16.4M vm_pk_r: 68.8M (83.1M) vm_pk_r_extcs1: 69.0M vm_spk_r: 68.8M (83.1M) vm_spk_r_extcs1: 69.0M ws_pk_b: 31.1M (28.8M) ws_pk_b_extcs1: 32.1M ws_pk_r: 65.7M (79.5M) ws_pk_r_extcs1: 65.5M ws_spk_r: 65.7M (79.5M) ws_spk_r_extcs1: 65.5M ] [70 seconds] 1. stdio 2. results
Looking at the stdio:
python_slave ..\..\..\scripts\slave\runtest.py --target Release --build-dir src/build page_cycler_tests.exe --gtest_filter=PageCycler*.MozFile in dir C:\b\slave\chromium-rel-xp-perf-1\build (timeout 600 secs) C:\b\slave\chromium-rel-xp-perf-1\build\src\build\Release\page_cycler_tests.exe --gtest_filter=PageCycler*.MozFile Note: Google Test filter = PageCycler*.MozFile [==========] Running 3 tests from 3 test cases. [----------] Global test environment set-up. [----------] 1 test from PageCyclerTest [ RUN ] PageCyclerTest.MozFile *RESULT vm_peak_b: vm_pk_b= 14979072 bytes *RESULT ws_peak_b: ws_pk_b= 31059968 bytes ... *RESULT total_byte_b: IO_b= 43457 kb *RESULT total_op_b: IO_op_b= 48440 RESULT other_byte_b: o_b= 413 kb *RESULT total_byte_b: IO_b_extcs1= 43324 kb ... *RESULT times: t= [31,55,33,...] ms [ OK ] PageCyclerTest.MozFile (23093 ms) [----------] 1 test from PageCyclerTest (23093 ms total) [----------] 1 test from PageCyclerReferenceTest [ RUN ] PageCyclerReferenceTest.MozFile *RESULT vm_peak_b: vm_pk_b_ref= 13942784 bytes ... *RESULT total_byte_b: IO_b_ref= 42783 kb
Remember the format is:
<*>RESULT <graph_name>: <trace_name>= <value> <units>
So this matches the format above, with the reference value in brackets. 'PERF_IMPROVE: total_op_b/IO_op_b' means that this perf metric is better than the variance/delta in the expectations file. What are these reference tests? Do they mean running the tests with a previous version of Chrome? Yes, it looks like they have an older version of Chrome called the 'reference':
class PageCyclerReferenceTest : public PageCyclerTest { public: // override the browser directory that is used by UITest::SetUp to cause it // to use the reference build instead. void SetUp() { FilePath dir; PathService::Get(chrome::DIR_TEST_TOOLS, &dir); dir = dir.AppendASCII("reference_build"); ... dir = dir.AppendASCII("chrome"); browser_directory_ = dir; PageCyclerTest::SetUp(); }
Yep, these are actually checked into the chrome svn:
src/chrome/tools/test/reference_build/chrome - Windows reference build for performance testing.
Also, the perf_expectations.json looks at the diff between current build and reference build to avoid glitches on the machine.
We don't need a reference because we will never see 'glitches', our perf data is deterministic.
graphs.dat comes from this function in GraphingLogProcessor:
def __SaveGraphInfo(self): """Keep a list of all graphs ever produced, for use by the plotter. Build a list of dictionaries: [{'name': graph_name, 'important': important, 'units': units}, ..., ] sorted by importance (important graphs first) and then graph_name. Save this list into the GRAPH_LIST file for use by the plotting page. (We can't just use a plain dictionary with the names as keys, because dictionaries are inherently unordered.) """
Okay. Let's try this out.
— Andrew Canis 2010/07/30 8:00
buildbot is leaving vsim jobs running if you do a 'force stop' or a build times out. I think the problem is that buildbot sends a SIGKILL instead of a SIGTERM.
Let's see what happens when I send a SIGKILL to runtest:
acanis@acanis-desktop:~$ pstree -p -a 16133 sh,16133 -c runtest\040../dejagnu/*.exp └─expect,16134 -- /usr/share/dejagnu/runtest.exp ../dejagnu/jpeg.exp ├─make,16185 v │ └─sh,16211 -c vsim\040-note\0402009\040-c\040-do\040"run\0407000000000000000ns;\040exit;"\040work.main_tb │ └─vish,16212 -- -vsim -note 2009 -c -do run\0407000000000000000ns;\040exit; work.main_tb │ ├─vlm,16220 652114806 814696814 │ │ └─mgls_asynch,16221 -f6,10 │ └─vsimk,16224 -port 40073 -stdoutfilename /tmp/VSOUTH6KLFK -note 2009 -c -do run\0407000000000000000ns;\040exit; work.main_tb └─{expect},16154
Ran:
acanis@acanis-desktop:~$ kill -9 16134
For some reason 'runtest' doesn't kill any of its children:
acanis@acanis-desktop:~$ pstree -p -a 16185 make,16185 v └─sh,16211 -c vsim\040-note\0402009\040-c\040-do\040"run\0407000000000000000ns;\040exit;"\040work.main_tb └─vish,16212 -- -vsim -note 2009 -c -do run\0407000000000000000ns;\040exit; work.main_tb ├─vlm,16220 652114806 814696814 │ └─mgls_asynch,16221 -f6,10 └─vsimk,16224 -port 40073 -stdoutfilename /tmp/VSOUTH6KLFK -note 2009 -c -do run\0407000000000000000ns;\040exit; work.main_tb
So kill -9 orphans all the children. I've posted a buildbot bug about this.
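For reference, what would actually take the whole vsim tree down is signalling the process group instead of the single pid. A minimal POSIX sketch of the idea (this is not buildbot's code; 'sleep 60' just stands in for runtest/vsim):

// Put the child (and everything it later spawns) in its own process group,
// then signal the whole group with a negative pid. Killing only the leader
// with -9 is what leaves the grandchildren orphaned.
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    pid_t child = fork();
    if (child == 0) {
        setpgid(0, 0);                         // child becomes its own group leader
        execlp("sh", "sh", "-c", "sleep 60", (char *)NULL);  // stand-in for runtest
        _exit(127);                            // only reached if exec fails
    }
    setpgid(child, child);                     // set it from the parent too (avoids a race)
    sleep(1);
    kill(-child, SIGTERM);                     // negative pid => signal the whole group
    waitpid(child, NULL, 0);
    printf("terminated process group %d\n", (int)child);
    return 0;
}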
Buildbot is slowing down my machine. So I'm lowering the priority on all buildslaves processes:
sudo renice +20 --user buildslave
You can also permanently set the priority of buildslave to the lowest in /etc/security/limits.conf:
buildslave hard priority 19
See: http://tldp.org/HOWTO/Xterminals/advanced.html
Had to create a .profile to source .bashrc on navy.
Created buildslave:
buildslave create-slave -r --umask=022 buildbot legup.org:3462 navy password
Final .bashrc on navy:
# buildslave
export PATH=~/buildbot-slave-0.8.1/bin:$PATH
export PYTHONPATH=~/legup/python/lib/python/
export PATH=~/legup/bin/dejagnu:$PATH
export PATH=~/legup/bin:$PATH
export PATH=~zhangvi1/altera9.1/quartus/bin:$PATH
export PATH=~zhangvi1/modeltech/bin:$PATH
export MGLS_LICENSE_FILE=7326@picton.eecg.utoronto.ca
export LM_LICENSE_FILE=1802@ra.eecg.toronto.edu
export PATH=~/legup/llvm-gcc4.2-2.7-x86_64-linux/bin/:$PATH
export MIPS_PREFIX=mipsel-elf-
Installing buildbot python module locally in ~/legup/python:
acanis@navy:~/buildbot-slave-0.8.1$ python setup.py install --home=~/legup/python/
buildbot git update wasn't working. Had to modify /usr/share/buildbot/contrib/git_buildbot.py:
master = "legup.org:3462"
Is that the best place to put that script? No, I've moved it inside the git repo folder.
— Andrew Canis 2010/07/28 8:00
Really cool top alternative: atop
Trying to figure out the chrome buildbot setup. So basically there are a bunch of builders: 'Chromium XP', 'Chromium Linux', 'XP Tests', 'XP Perf', 'Chromium Builder':
Looking at 'Chromium XP': unit test seems to save the results:
copying dashboard file gtest-results/gpu_unittests\results.json to \\chrome-web.jail.google.com\chrome-bot\www\gtest_results\chromium-rel-xp\gpu_unittests saving results to \\chrome-web.jail.google.com\chrome-bot\www\gtest_results\chromium-rel-xp\gpu_unittests
JSON is a data format, like XML.
— Andrew Canis 2010/07/23 8:00
Had to hack my git post-receive to get the code to work:
To run a full Quartus compile for all the benchmark circuits:

#!/bin/sh
data=$(cat)
echo "$data" | python /usr/share/buildbot/contrib/git_buildbot.py $1 $2 $3 | exit
echo "$data" | hooks/post-receive-email | exit
Because I want to generate a commit email and also notify the buildbot master.
Really cool command: set all permissions equal to the user permissions.
chmod -R a=u dir
Binding infrastructure done. Setting up buildbot. Not using the older Ubuntu version; I'd rather install from scratch, following the installation guide on the buildbot website. Needed to install python-dev.
Adding a new user:
sudo adduser --disabled-login --home /home/buildmaster buildmaster
To verify that there is no password look in /etc/shadow:
buildmaster:!:14812:0:99999:7:::
! indicates no password. * indicates account is locked.
Creating master directory:
buildmaster@acanis-desktop:/home/buildmaster/master$ buildbot create-master .
Setting up the buildslave:
sudo adduser --disabled-login --home /home/buildslave buildslave
git_buildbot.py is missing from the distribution. Had to manually download it from: http://github.com/buildbot/buildbot/raw/master/master/contrib/git_buildbot.py
Need to properly setup environment for the buildmaster and buildslave:
sudo su buildslave # modify .bashrc
— Andrew Canis 2010/07/22 8:00
Getting an error with memset:
FAIL: gxemul simulation. Expected: reg: v0 = 0x0000000000000000
Seems related to:
acanis@acanis-desktop:~/work/legup/examples/memset$ diff ../lib/llvm/liblegup.a /home/acanis/work/fresh/legup/examples/lib/llvm/liblegup.a Binary files ../lib/llvm/liblegup.a and /home/acanis/work/fresh/legup/examples/lib/llvm/liblegup.a differ
What generates the liblegup.a file?
Oh it's probably because I haven't pulled in a while… I'll do that later.
Watch out! dhrystone 'make watch' shows differences due to pointers:
sim: %3=5a000000 lli: %3=bf9f5978
Debugging binding: I have a call that's not being taken in dfadd:
299: begin /* %356 = tail call fastcc i64 @roundAndPackFloat64(i32 %zSign_addr.0.i, i32 %355, i64 %353) nounwind ; <i64> [#uses=2]*/ /* normalizeRoundAndPackFloat64.exit.i*/ 304: begin if (roundAndPackFloat64_finish_reg) begin cur_state = 305; end
Seems to have skipped over the function call. Maybe finish was left high? What's this:
if (cur_state == 302) begin roundAndPackFloat64_finish_reg = roundAndPackFloat64_finish;
Wrong state. Wait, there are two calls:
302: if (roundAndPackFloat64_finish) begin
Will have to special case finish to always be the wire.
Fresh repo available in ~/work/fresh
— Andrew Canis 2010/07/21 8:00
Trying to merge in changes. Problem with chaining (mips.v):
4: /* %pc.0 = phi i32 [ %pc.1, %bb45 ], [ 4194304, %bb7.preheader ] ; <i32> [#uses=6]*/ /* %9 = lshr i32 %pc.0, 2 ; <i32> [#uses=2]*/
Produces:
56: posedge clk: pc_0_phi_temp = 32'd4194304; /* for PHI node */ ... always @(posedge clk) begin if (cur_state == 4) begin var5 = pc_0 >>> 32'd2 % 32; end end always @(posedge clk) begin if (cur_state == 4) begin pc_0 = pc_0_phi_temp; end end
— Andrew Canis 2010/07/20 8:00
Eventually could add this to binding:
/* RTLBuilder *builder; RTLOp *finished = t->addCond("==", finish, RTLConst(1)); RTLOp *display = builder.create("display", "At t=%t clk=%b finish=%b return_val=%d", builder.create("$time"), clk, finish, return_val); finished.add(display); finished.add(builder.create("$finish")); // "initial \n" << // " clk = #0 0;\n" << RTLOp *initial = builder.create("initial"); initial.add(builder.create("=", clk, 0, "#0"); // "always @(clk)\n" << // " clk <= #1 ~clk;\n" << RTLOp *clk_neg = builder.create("~", clk); initial.add(builder.create("=", clk, clk_neg, "#1"); // "initial begin\n" << // "//$monitor(\"At t=%t clk=%b %b %b %b %d\", $time, clk, reset, start, finish, return_val);\n" << // "@(negedge clk);\n" << // "reset = 1;\n" << // "@(negedge clk);\n" << // "reset = 0;\n" << // "start = 1;\n" << // "\n" << // "end\n" << RTLOp *initial = builder.create("initial"); initial.add(builder.create("@(negedge)", clk)); initial.add(builder.create("=", reset, RTLConst(1))); initial.add(builder.create("@(negedge)", clk)); initial.add(builder.create("=", reset, RTLConst(0))); initial.add(builder.create("=", start, RTLConst(1))); */
— Andrew Canis 2010/07/19 8:00
Where do 'slots' (%2) get stored for instructions?
acanis@acanis-desktop:~/work/legup/examples/sra$ gdb llc Breakpoint 1 at 0x85ec45a: file Verilog.cpp, line 1933. (gdb) run -march=v sra.bc -o sra.v
llvm::operator<< (OS=@0x9173a40, V=@0xae73408) at /home/acanis/work/legup/llvm/include/llvm/Value.h:313 312 inline raw_ostream &operator<<(raw_ostream &OS, const Value &V) { 313 V.print(OS); 314 return OS; 315 } (gdb) s llvm::Value::print (this=0xae73408, ROS=@0x9173a40, AAW=0x0) at AsmWriter.cpp:2072
Okay found it:
llvm::Value::print (this=0xae73408, ROS=@0x9173a40, AAW=0x0) at AsmWriter.cpp:2072 2077 if (const Instruction *I = dyn_cast<Instruction>(this)) { 2078 const Function *F = I->getParent() ? I->getParent()->getParent() : 0; 2079 SlotTracker SlotTable(F); 2080 AssemblyWriter W(OS, SlotTable, getModuleFromVal(I), AAW); 2081 W.printInstruction(*I);
Unfortunately you can't move this code out of AsmWriter.cpp. SlotTracker is defined in the cpp file.
— Andrew Canis 2010/07/15 8:00
Debugging binding:
— Andrew Canis 2010/07/14 8:00
Modified system→pref→sound capture to usb device.
— Andrew Canis 2010/07/06 8:00
Install 'countperl' for cyclomatic complexity of perl programs:
sudo PERL_MM_USE_DEFAULT=1 perl -MCPAN -e 'install Perl::Metrics::Simple'
Had to run this command twice. The first time it failed for some reason.
Pidgin just crashed X11!
Jun 30 15:42:12 acanis-desktop kernel: [4059043.893782] pidgin[9256]: segfault at a98b000 ip b752c253 sp bf8004d0 error 6 in libX11.so.6.2.0[b74fd000+ea000]
— Andrew Canis 2010/06/30 8:00
Need to fix h/w partition for James
Can't assign signals in separate always blocks:
Can't resolve multiple constant drivers for net...
— Andrew Canis 2010/06/22 8:00
Working on the scheduler. Try setting every non-memory instruction to have a latency of 0. Problem:
/* %12 = load i32* %11, align 4 ; <i32> [#uses=1]*/ /* %load_noop = add i32 %12, 0 ; <i32> [#uses=19]*/ load_noop = var8 + 32'd0;
It's putting the load noop right after the load; that's wrong. Why do we have the load noop again? Okay, let's get rid of this load noop first. My hypothesis is that instructions that depend on the load don't look at the 'end' cycle, just the 'start'.
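The check I'd expect the scheduler to make, sketched generically (made-up struct, not LegUp's actual classes): a dependent instruction's earliest control step is each producer's start plus its latency, i.e. its end cycle.

// Generic ASAP-style sketch: schedule a dependent op no earlier than
// producer.start + producer.latency (the producer's 'end' cycle), not just
// producer.start. Made-up types; not LegUp's scheduler code.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Op {
    int start;    // control step the op is issued in
    int latency;  // 0 for chained ops, >0 for loads etc.
};

int earliestStart(const std::vector<Op> &deps) {
    int earliest = 0;
    for (size_t i = 0; i < deps.size(); ++i)
        earliest = std::max(earliest, deps[i].start + deps[i].latency);
    return earliest;
}

int main() {
    Op load = {0, 2};   // e.g. a 2-cycle load issued at step 0
    Op addr = {0, 0};   // chained address computation
    std::vector<Op> deps;
    deps.push_back(load);
    deps.push_back(addr);
    printf("load_noop may start at step %d\n", earliestStart(deps));  // prints 2
    return 0;
}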
Okay, fixed that. Now there is an error with the combinational logic:
5: /* %11 = getelementptr inbounds [44 x i32]* @imem, i32 0, i32 %10 ; <i32*> [#uses=1]*/ var7 = {`TAG_imem, 32'b0} + ((var6 + 44 * (32'd0)) << 2); ... 5: memory_controller_address = var7;
var7 is a register which isn't updated until the next state…
— Andrew Canis 2010/06/21 8:00
I finally figured out the reason for the createXXXXXPass() function. Because there are no header files for any of the passes, you need this global function to create the object.
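The same idiom outside of LLVM, as a one-file sketch (made-up names): the concrete class lives only in the .cpp, so the header exposes nothing but the base interface and a free factory function, which is exactly the role createXXXXXPass() plays.

// Sketch of the createXXXXXPass() idiom with made-up names, squeezed into
// one file. The "header" part exposes only the interface + factory; the
// "cpp" part holds the class definition nobody else can see.
#include <cstdio>

// --- header part ---------------------------------------------------------
struct Pass {                       // stand-in for the pass base class
    virtual void run() = 0;
    virtual ~Pass() {}
};
Pass *createMyHiddenPass();         // the only way to construct the pass

// --- cpp part ------------------------------------------------------------
namespace {
struct MyHiddenPass : public Pass { // never declared in any header
    void run() { printf("running the hidden pass\n"); }
};
}
Pass *createMyHiddenPass() { return new MyHiddenPass(); }

// --- a client ------------------------------------------------------------
int main() {
    Pass *p = createMyHiddenPass();
    p->run();
    delete p;
    return 0;
}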
Posted some results on:
See files in:
work/legup/results.xls work/legup/plot.m
git fetch just updates the origin (doesn't change working directory):
acanis@acanis-desktop:~/work/legup$ git fetch acanis@acanis-desktop:~/work/legup$ git diff --summary master..origin create mode 100644 examples/chstone/Makefile create mode 100644 examples/phi/phi.c create mode 100644 llvm/lib/Transforms/LegUp/WatchVariables.cpp create mode 100644 tiger/linux_tools/Makefile delete mode 100755 tiger/linux_tools/elf2sdram delete mode 100755 tiger/linux_tools/find_ra create mode 100644 tiger/linux_tools/lib/prog_link_sim.ld delete mode 100644 tiger/processor/tiger_mips/sdram.dat delete mode 100644 tiger/processor/tiger_mips/tiger.html delete mode 100644 tiger/processor/tiger_mips/tiger_sim/cacheMem.ver delete mode 100644 tiger/processor/tiger_mips/tiger_sim/sdram.dat delete mode 100644 tiger/processor/tiger_mips/tiger_sim/uart_0_log_module.txt delete mode 100644 tiger/processor/tiger_mips/tiger_top_hw.tcl~ create mode 100644 tiger/tool_source/hack_jt.cpp create mode 100644 tiger/tool_source/lib/dev_cons.h create mode 100644 tiger/windows_tools/lib/prog_link_sim.ld
— Andrew Canis 2010/06/18 8:00
Found a new command: llvm-extract --func <function_name>
jpeg$ llvm-extract -S --func DecodeHuffMCU main.ll
Does this command also keep functions called by the specified function? Trying with jpeg, DecodeHuffMCU calls DecodeHuffman and buf_getv. Nope:
declare fastcc i32 @buf_getv(i32) nounwind declare fastcc i32 @DecodeHuffman() nounwind
So I just need to build up a list of functions and their called functions.
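Sketch of the transitive closure I have in mind, over a toy call graph of plain strings (not the LLVM CallGraph API); the function names are the jpeg ones from above:

// Toy sketch of collecting a function plus everything it transitively calls,
// just to pin down the worklist algorithm before writing the real pass.
#include <cstdio>
#include <map>
#include <set>
#include <string>
#include <vector>

typedef std::map<std::string, std::vector<std::string> > CallGraph;

std::set<std::string> collect(const CallGraph &cg, const std::string &root) {
    std::set<std::string> keep;
    std::vector<std::string> worklist(1, root);
    while (!worklist.empty()) {
        std::string f = worklist.back();
        worklist.pop_back();
        if (!keep.insert(f).second)
            continue;                          // already visited
        CallGraph::const_iterator it = cg.find(f);
        if (it == cg.end())
            continue;                          // leaf: calls nothing
        for (size_t i = 0; i < it->second.size(); ++i)
            worklist.push_back(it->second[i]);
    }
    return keep;
}

int main() {
    CallGraph cg;
    cg["DecodeHuffMCU"].push_back("DecodeHuffman");
    cg["DecodeHuffMCU"].push_back("buf_getv");
    std::set<std::string> keep = collect(cg, "DecodeHuffMCU");
    for (std::set<std::string>::iterator i = keep.begin(); i != keep.end(); ++i)
        printf("keep %s\n", i->c_str());
    return 0;
}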
— Andrew Canis 2010/06/15 8:00
There are some transformation passes dealing with extracting Functions in the IPO folder. There's a takeName() function!
Function *New = Function::Create(I->getFunctionType(), GlobalValue::ExternalLinkage);
New->copyAttributesFrom(I);
// If it's not the named function, delete the body of the function
I->dropAllReferences();
M.getFunctionList().push_back(New);
NewFunctions.push_back(New);
New->takeName(I);
Interesting:
The members and base classes of a struct are public by default, while in class, they default to private. Note: you should make your base classes explicitly public, private, or protected, rather than relying on the defaults.
From: http://www.parashift.com/c++-faq-lite/classes-and-objects.html#faq-7.9
— Andrew Canis 2010/06/15 8:00
Confirmed again on llist, struct, dhrystone. Doesn't make any difference in area. Synthesis tools must be able to tell that byte enable means that those bytes don't matter. The code is uglier anyway.
I was right. The ram input doesn't depend on size. However, strangely enough this doesn't produce better synthesis results.
`define B0 8-1:0
`define B1 16-1:8
`define B2 24-1:16
`define B3 32-1:24
`define B4 40-1:32
`define B5 48-1:40
`define B6 56-1:48
`define B7 64-1:56

node1_in[`B0] = memory_controller_in[`B0];

case (memory_controller_address [0])
    // short/int/long - addr: 000
    0: node1_in[`B1] = memory_controller_in[`B1];
    // byte - addr: 001
    1: node1_in[`B1] = memory_controller_in[`B0];
endcase

case (memory_controller_address [1])
    // int/long - addr: 000
    0: node1_in[`B2] = memory_controller_in[`B2];
    // byte/short - addr: 010
    1: node1_in[`B2] = memory_controller_in[`B0];
endcase

case (memory_controller_address [1:0])
    // int/long - addr: 000
    0: node1_in[`B3] = memory_controller_in[`B3];
    // short - addr: 010
    2: node1_in[`B3] = memory_controller_in[`B1];
    // byte - addr: 011
    3: node1_in[`B3] = memory_controller_in[`B0];
    default: node1_in[`B3] = 'bx;
endcase

case (memory_controller_address [2:1])
    // long - addr: 000
    0: node1_in[`B4] = memory_controller_in[`B4];
    // short - addr: 011
    1: node1_in[`B4] = memory_controller_in[`B1];
    // byte/int - addr: 100
    2: node1_in[`B4] = memory_controller_in[`B0];
    default: node1_in[`B4] = 'bx;
endcase

case (memory_controller_address [2:0])
    // long - addr: 000
    0: node1_in[`B5] = memory_controller_in[`B5];
    // short/int - addr: 100
    4: node1_in[`B5] = memory_controller_in[`B1];
    // byte - addr: 101
    5: node1_in[`B5] = memory_controller_in[`B0];
    default: node1_in[`B5] = 'bx;
endcase

case (memory_controller_address [2:1])
    // long - addr: 000
    0: node1_in[`B6] = memory_controller_in[`B6];
    // int - addr: 100
    2: node1_in[`B6] = memory_controller_in[`B2];
    // byte/short - addr: 110
    3: node1_in[`B6] = memory_controller_in[`B0];
    default: node1_in[`B6] = 'bx;
endcase

case (memory_controller_address [2:0])
    // long - addr: 000
    0: node1_in[`B7] = memory_controller_in[`B7];
    // int - addr: 100
    4: node1_in[`B7] = memory_controller_in[`B3];
    // short - addr: 110
    6: node1_in[`B7] = memory_controller_in[`B1];
    // byte - addr: 111
    7: node1_in[`B7] = memory_controller_in[`B0];
    default: node1_in[`B7] = 'bx;
endcase
— Andrew Canis 2010/06/11 8:00
Found a great paper on Catapult C:
Some improvements can be made to the memory controller hardware. The only thing that depends on memory_controller_size is the byte-enable when you are writing to the ram. Otherwise the steering just depends on the addresses.
— Andrew Canis 2010/06/10 8:00
The original paper on tcl. Very very well written introduction:
Install tcl:
sudo apt-get install tcl8.5-dev
This was dealt with in the CBackend by using a _phi_temp variable:
llvm_cbe_legup_memcpy_4_2e_exit:
  llvm_cbe_L_ACF_2e_2_2e_08__PHI_TEMPORARY = 0u;   /* for PHI node */
  ...
llvm_cbe_bb_2e_bb_crit_edge:
  llvm_cbe_L_ACF_2e_2_2e_08__PHI_TEMPORARY = llvm_cbe_tmp__2;   /* for PHI node */
  ...
llvm_cbe_bb:
  llvm_cbe_L_ACF_2e_2_2e_08 = llvm_cbe_L_ACF_2e_2_2e_08__PHI_TEMPORARY;
Modified the Verilog to use a temp variable. It works. Is there an easy way to add a new unique name not tied to a Value*?
— Andrew Canis 2010/06/07 8:00
The problem is with phi dependencies:
/* %3 = phi i32 [ 1280, %legup_memcpy_4.exit ], [ %load_noop4, %bb.bb_crit_edge ] ; <i32> [#uses=2]*/
/* %L_ACF.2.08 = phi i32 [ 0, %legup_memcpy_4.exit ], [ %3, %bb.bb_crit_edge ] ; <i32> [#uses=1]*/
In the above code, you MUST take the old value of %3 when evaluating the second phi.
I should _definitely_ have a test case for this.
There is an extra assignment to L_ACF[3] in gcc version:
L_ACF[3]=0 L_ACF[3]=1857 L_ACF[3]=1280 ---- extra ...
So initializing s by a .mif and leaving:
//printf("s:%d\n", s[6]); s[6]=1280; //printf("s:%d\n", s[6]);
gives an error. printf's don't fix it. Both return 1280
gsm bug command:
gcc test.c ; ./a.out > log; make; make v; sed -n '/# run/,$p' transcript |grep -v run > hw
Found a good program to hash integers:
Perfect hash of 32 functions:
./perfect -hps < functions
Hashing methods:
Tried some hashing code (in ~/hash). Results don't look that great:
hashtable size: 15, full: 4 (26.666667%), collisions: 193 (97.969543%) hashtable size: 255, full: 34 (13.333333%), collisions: 163 (82.741117%) hashtable size: 4095, full: 178 (4.346764%), collisions: 19 (9.644670%) hashtable size: 65535, full: 197 (0.300603%), collisions: 0 (0.000000%)
Basically I need a massive table to avoid all collisions. I only have 197 elements in total. My hash function was probably inefficient: multiplicative hash by golden ratio of 2^32, then masking that down to the table size.
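For reference, the textbook multiplicative hash keeps the top bits after the multiply; masking off the low bits (what I did) throws away the best-mixed bits, which could explain the collision numbers. A sketch, assuming a table of size 2^table_bits:
<code>
#include <stdint.h>

// Multiplicative (Fibonacci) hashing: multiply by floor(2^32 / golden ratio)
// and keep the TOP table_bits bits.  table_bits must be between 1 and 32.
static inline uint32_t fib_hash(uint32_t key, unsigned table_bits) {
  const uint32_t GOLDEN = 2654435769u;       // 0x9E3779B9 = floor(2^32 / phi)
  return (key * GOLDEN) >> (32 - table_bits);
}
</code>
With only 197 keys this might behave noticeably better than masking the low bits, but I'd have to rerun the experiment to be sure.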
I've isolated the gsm bug into about 500 lines of .bc
To make transcript:
sed -n '/# run/,$p' transcript |grep -v run > hw
— Andrew Canis 2010/06/07 8:00
Mips uses uninitialized data. The very first instruction
0x8fa40000, // [0x00400000] lw $4, 0($29) ; 175: lw $a0 0($sp) # argc
Loads a word from an arbitrary uninitialized portion of the data mem.
400004: reg[29] = dmem[63] = -163754450 63 = (2147479548) + 0
2147479548 = 0x7fffeffc. In the code they manually assign the stack pointer $sp to this value:
reg[29] = 0x7fffeffc;
For now I'm just going to 0 initialize dmem.
Okay, going back to the drawing board on gsm. I found a huge problem. I print out the entire s[] array at the end of Autocorrelation, and in ModelSim s[0..95] = x!
Can I tell ModelSim to show me all x's? Maybe create waveforms? There should never be any x's. Actually, can I just add an assertion after each instruction to make sure it isn't x? Nothing should ever be x. Wait, there are many signals that just haven't been assigned yet. It's only after assignment that the signal shouldn't contain any x's (even in the upper bits).
Basically I have to add the following after every assignment:
if ( ^A === 1'bx) $display ("ERROR: unknown value in signal A");
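(Note the case-equality operator === is needed; with == the comparison against x just evaluates to x and the if never fires.) One way to automate this would be to have the Verilog writer print the check after every assignment it emits. A rough sketch, with a hypothetical helper that isn't part of the current printer:
<code>
#include <ostream>
#include <string>

// Hypothetical RTL-printer helper: whenever we print "signal = expr;", also
// print a simulation-only x-check on the assigned signal.
void printAssignWithXCheck(std::ostream &Out, const std::string &signal,
                           const std::string &expr) {
  Out << "\t" << signal << " = " << expr << ";\n";
  Out << "\t// synthesis translate_off\n";
  Out << "\tif (^" << signal << " === 1'bx) $display(\"ERROR: unknown value in "
      << signal << "\");\n";
  Out << "\t// synthesis translate_on\n";
}
</code>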
To change password (generate a random password):
pwgen sudo passwd <username>
— Andrew Canis 2010/06/03 8:00
Fixing the gsm bug (result=4). Putting a printf inside Autocorrelation() fixes the result. Also, disabling function inlining fixes the problem. The Verilog code changes a lot, so it is difficult to pin down the problem area.
Created two files: working.v and bad.v. working.v has:
STEP (3); printf("here"); STEP (4);
Code changes significantly:
acanis@acanis-desktop:~/work/legup/examples/chstone/gsm$ diff working.v bad.v |diffstat unknown | 3737 ++++++++++++++++++++++++++++++---------------------------------- 1 file changed, 1809 insertions(+), 1928 deletions(-)
What changes immediately after the printf? The only change that jumps out is, in working:
185: begin var149 = memory_controller_out[15:0]; ... load_noop22 = var149 + 16'd0;
bad:
185: begin var149 = memory_controller_out[31:0]; ... load_noop22 = var149 + 32'd0;
So the bitwidth changes… why would this happen? Also, why does the 32-bit one not work? You don't lose anything going from 16→32.
In the next state of working:
/* %165 = sext i16 %load_noop22 to i32 ; <i32> [#uses=1]*/ var163 = $signed(load_noop22);
Seems like more data is being read in working.
Oh, I think different things are being read from memory? In bad:
27: var79 = {`TAG_so, 32'b0} + ((32'd7 + 160 * (32'd0)) << 1); 183: memory_controller_address = var79;
'so' is being read; it should be 16 bits. Yep, okay. There is a state 184:
27: var80 = {`TAG_L_ACF_i, 32'b0} + ((32'd7 + 9 * (32'd0)) << 2); 184: memory_controller_address = var80;
Ya, so that looks fine actually. Interestingly, in the working file there are two more reads from memory right after state 185:
186: begin memory_controller_address = var67; memory_controller_write_enable = 0; end 187: begin memory_controller_address = var80; memory_controller_write_enable = 0; end
It seems like in the bad file there is never another read from var71, var69, or var67. But there is a read from var80, and also from var81 (which happens later in working).
Also note that the $write() is in the same state (27) where all these variables (var66-81) are defined.
Comparing the original bytecode, it looks like in bad we have a structure of code that repeats: store, mul, add:
%152 = getelementptr inbounds [160 x i16]* %so, i32 0, i32 7 ; <i16*> [#uses=1]
%153 = load i16* %152, align 2 ; <i16> [#uses=2]
%154 = sext i16 %153 to i32 ; <i32> [#uses=9]
...
store i32 %158, i32* %73, align 4
%159 = mul i32 %118, %154 ; <i32> [#uses=1]
%160 = add nsw i32 %159, %141 ; <i32> [#uses=2]
store i32 %160, i32* %84, align 4
%161 = mul i32 %103, %154 ; <i32> [#uses=1]
%162 = add nsw i32 %161, %143 ; <i32> [#uses=2]
store i32 %162, i32* %97, align 4
%163 = mul i32 %90, %154 ; <i32> [#uses=1]
%164 = add nsw i32 %163, %145 ; <i32> [#uses=2]
Working matches until right after the printf, which screws up the aliasing: LLVM doesn't know if printf will modify memory, so we can't keep using %154 in every multiply. No wait, it's not %154. In bad:
%88 = getelementptr inbounds [160 x i16]* %so, i32 0, i32 3 ; <i16*> [#uses=1]
%89 = load i16* %88, align 2 ; <i16> [#uses=2]
%90 = sext i16 %89 to i32 ; <i32> [#uses=9]
...
%163 = mul i32 %90, %154 ; <i32> [#uses=1]
In working:
%163 = call i32 (i8*, ...)* @printf(i8* noalias getelementptr inbounds ([5 x i8]* @.str, i32 0, i32 0)) nounwind ; <i32> [#uses=0]
%164 = load i16* %88, align 2 ; <i16> [#uses=2]
%165 = sext i16 %164 to i32 ; <i32> [#uses=1]
%166 = mul i32 %165, %154 ; <i32> [#uses=1]
%167 = add nsw i32 %166, %145 ; <i32> [#uses=2]
Why does %88 have to be reloaded after the printf? Why doesn't %154…
%152 = getelementptr inbounds [160 x i16]* %so, i32 0, i32 7 ; <i16*> [#uses=2]
%153 = load i16* %152, align 2 ; <i16> [#uses=1]
%154 = sext i16 %153 to i32 ; <i32> [#uses=9]
%88 is from so[3], %154 is so[7]. Doesn't make sense to me. Maybe the optimizer is doing a peephole optimization where it doesn't look across a function call?
Okay, I'm making modifications to working.ll and then recompiling. The funny thing is I can remove a bunch of stores in the code and the result stays 0!
acanis@acanis-desktop:~/work/legup/examples/chstone/gsm$ llc -march=v working.ll -o gsm.v
If I remove every store the result is x. Okay, so I doubt that section is the problem if I can remove _every_ store and still get a 0 result. I can remove tons of stores and still get result=0.
It's very troubling that you can remove instructions and still get the correct result. But running the modified bytecode through 'lli' gets a result of 0. The fundamental issue is: how do you know the h/w verilog matches the llvm bytecode?
— Andrew Canis 2010/06/02 8:00
Does GAUT support chstone? I doubt it. The latest version fails to even compile sra:
acanis@acanis-desktop:~/work/legup/examples/sra$ /home/acanis/GAUT_2_4_3/GautC/cdfgcompiler/bin/cdfgcompiler -S -c2dfg -O2 -I /home/acanis/GAUT_2_4_3/GautC/lib -I. sra.c Warning : Variable inData(1) is used but not defined (constant ?) !!! Warning : Variable inData(0) is used but not defined (constant ?) !!! sra.c: In function ‘int main()’: sra.c:10: internal compiler error: Segmentation fault Please submit a full bug report, with preprocessed source if appropriate. See <http://gcc.gnu.org/bugs.html> for instructions.
The fundamental problem with bugpoint right now is it depends on the linker. So to debug the mips pass, the code is split into two pieces <safe> and <test>. <safe> is compiled by gcc into a shared object (using the cbackend pass to convert .bc to .c). <test> is compiled by llc -march=mips. Then the exe is created by linking safe and test together. bugpoint minimizes the code in test.
Legup can't use this flow because there is no concept of a linker. The flow would have to be: remove a function from the bytecode, run it with 'lli' to get expected output. Then simulate the bytecode in modelsim and compare the return values. Note: the program would need to be able to run without the removed function… In fact there would be no guarantee that you could remove any of the byte code and still be able to reproduce the bug because the program functionally changes at that point. Whereas in the original bugpoint flow the program never functionally changes, just portions of it are 'safe' and you can assume they will be compiled correctly.
So we need something similar to the 'crash debugger'. Which keeps removing code to find the minimum bytecode to trigger a segfault in the optimizer. Could add an assertion in the scheduler to run modelsim? I'm not convinced this would work, as the code will probably not compile after removing random instructions. How can you remove random portions of the code and still have the error occur?
We don't have the benefit of a golden reference (gcc) to test against.
Actually we could solve this. If we implement a SystemC backend then we are compatible with gcc and could link. So bugpoint would work fine and we could debug the scheduler… There are other advantages to having a SystemC backend:
Shouldn't be that hard to implement. Would also be good to do this so we know that we can support vhdl in the future.
The only other options:
No luck using bugpoint to debug gsm. There is an option to specify a command to execute the bitcode. However, there are external globals without initialization.
acanis@acanis-desktop:~/work/legup/examples/chstone/gsm$ rm bug.log; bugpoint gsm.bc -run-custom -exec-command ../../bugpoint.pl
Bugpoint gives a .bc filename as an argument to bugpoint.pl. However, even just running this through lli has problems:
running: lli bugpoint.test.bc-Jkwtbo 2>&1 Output: LLVM ERROR: Could not resolve external global address: inData ... running: lli bugpoint.test.bc-QobI8c 2>&1 Output: 'main' function not found in module.
The patch to add this option was fairly recent from Pekka Jääskeläinen:
http://llvm.org/viewvc/llvm-project/llvm/trunk/tools/bugpoint/ExecutionDriver.cpp?r1=45421&r2=50373
Looked at C2H Verilog from James. To disable Quartus verilog warnings:
// turn off superfluous verilog processor warnings // altera message_level Level1 // altera message_off 10034 10035 10036 10037 10230 10240 10030
It's also cool how they specify simulation only (synthesis translate_off) and synthesis only (synthesis read_comments_as_HDL) code.
— Andrew Canis 2010/06/01 8:00
Wow. Really hard to figure out this memory bug. Possible causes:
Tried modifying Scheduler to be a ModulePass. I still get an error:
0x085e1cda in std::less<llvm::Function*>::operator() (this=0x9d29b38, __x=@0x10, __y=@0xbf8789d8) at /usr/include/c++/4.3/bits/stl_function.h:230 230 { return __x < __y; } (gdb) p __x $1 = (class llvm::Function * const&) @0x10: <error reading variable>
Coming from
#4 0x085f4e16 in legup::Scheduler::getFSM (this=0x9d29ad8, F=0x9d2af60) at Scheduler.h:46 46 FSM[F] = new FiniteStateMachine();
Here's a full backtrace:
(gdb) bt #0 0x085e1cda in std::less<llvm::Function*>::operator() (this=0x9d29b38, __x=@0x10, __y=@0xbf8789d8) at /usr/include/c++/4.3/bits/stl_function.h:230 #1 0x085f45e7 in std::_Rb_tree<llvm::Function*, std::pair<llvm::Function* const, legup::FiniteStateMachine*>, std::_Select1st<std::pair<llvm::Function* const, legup::FiniteStateMachine*> >, std::less<llvm::Function*>, std::allocator<std::pair<llvm::Function* const, legup::FiniteStateMachine*> > >::_M_insert_unique_ (this=0x9d29b38, __position={_M_node = 0x9d29b3c}, __v=@0xbf8789d8) at /usr/include/c++/4.3/bits/stl_tree.h:1183 #2 0x085f49a5 in std::map<llvm::Function*, legup::FiniteStateMachine*, std::less<llvm::Function*>, std::allocator<std::pair<llvm::Function* const, legup::FiniteStateMachine*> > >::insert (this=0x9d29b38, __position={_M_node = 0x9d29b3c}, __x=@0xbf8789d8) at /usr/include/c++/4.3/bits/stl_map.h:496 #3 0x085f4a9c in std::map<llvm::Function*, legup::FiniteStateMachine*, std::less<llvm::Function*>, std::allocator<std::pair<llvm::Function* const, legup::FiniteStateMachine*> > >::operator[] (this=0x9d29b38, __k=@0xbf878a44) at /usr/include/c++/4.3/bits/stl_map.h:419 #4 0x085f4e16 in legup::Scheduler::getFSM (this=0x9d29ad8, F=0x9d2af60) at Scheduler.h:46 #5 0x085e0b98 in legup::LLVMLegUpPass::runOnModule (this=0x9d2a0a0, M=@0x9d1c128) at Verilog.cpp:259 #6 0x08d7358e in llvm::MPPassManager::runOnModule (this=0x9d2ae48, M=@0x9d1c128) at PassManager.cpp:1424 #7 0x08d754ca in llvm::PassManagerImpl::run (this=0x9d28c60, M=@0x9d1c128) at PassManager.cpp:1506 #8 0x08d7552f in llvm::PassManager::run (this=0xbf878bc8, M=@0x9d1c128) at PassManager.cpp:1535 #9 0x0856fc74 in main (argc=3, argv=0xbf878d04) at llc.cpp:342
So llc calls PM→run(), finds the implementation of the pass, runs the module pass LLVMLegUpPass::runOnModule(), and tries to look up the FSM. It segfaults in std::map because of an invalid pointer in the tree.
This segfault occurs as we schedule 'main'. The first function. In fact the only function in 'sra'.
$8 = (class llvm::Function * const&) @0x10: <error reading variable>
Is the pointer overloaded? Maybe it's just getting overwritten, because it's a weird value: 0x10. It's not NULL.
Why is the main function being added to the fsm map twice?
INSERTING main Scheduling Function: main INSERTING main
Debug which passes are run:
llc -march=v --debug-pass=Structure sra.bc Pass Arguments: -preverify -domtree -verify -asap Target Data Layout Basic Alias Analysis (default AA impl) ModulePass Manager FunctionPass Manager Preliminary module verification Dominator Tree Construction Module Verifier ASAP Scheduling Unnamed pass: implement Pass::getPassName() LLVMLegUpPass backend Unnamed pass: implement Pass::getPassName() Pass Arguments: -memdep Basic Alias Analysis (default AA impl) FunctionPass Manager Memory Dependence Analysis Pass Arguments: -memdep Basic Alias Analysis (default AA impl) FunctionPass Manager Memory Dependence Analysis INSERTING main Scheduling Function: main INSERTING main
So already something really weird is going on.
Reverted back to FunctionPass version. I get a slightly different segfault.
Pass Arguments: -preverify -domtree -verify -memdep -asap Target Data Layout Basic Alias Analysis (default AA impl) FunctionPass Manager Preliminary module verification Dominator Tree Construction Module Verifier Memory Dependence Analysis As Soon As Possible Scheduling LLVMLegUpPass backend [New Thread 0xb74356d0 (LWP 9392)] Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 0xb74356d0 (LWP 9392)] 0x085e88a7 in __gnu_cxx::new_allocator<std::pair<llvm::Function* const, legup::FiniteStateMachine*> >::construct (this=0xbfe6c59f, __p=0xadf8828, __val=@0x12) at /usr/include/c++/4.3/ext/new_allocator.h:108 108 { ::new((void *)__p) _Tp(__val); }
Error is in the copy constructor of scheduler:
#7 0x085d2d3e in HwModule (this=0xadf88a0, LegUpPass=0xadf00a0, F=0xadf0f60) at Verilog.cpp:332 332 sched = LegUpPass->getAnalysis<Scheduler>(); (gdb) do #6 0x085efa14 in legup::Scheduler::operator= (this=0xadf892c) at Scheduler.h:26 26 class Scheduler {
Again there's a problem in that map<Function*, FiniteStateMachine*>. Some sort of invalid dangling pointer stored in there. I just don't understand why that would happen. In what order are the passes actually being run and destructed?
Starting program: /home/acanis/work/legup/llvm/Debug/bin/llc -march=v sra.bc --debug-pass=Details [Thread debugging using libthread_db enabled] -- 'LLVMLegUpPass backend' is not preserving 'As Soon As Possible Scheduling' -- 'LLVMLegUpPass backend' is not preserving 'As Soon As Possible Scheduling' -- 'LLVMLegUpPass backend' is not preserving 'Memory Dependence Analysis' -- 'LLVMLegUpPass backend' is not preserving 'Dominator Tree Construction' -- 'LLVMLegUpPass backend' is not preserving 'Preliminary module verification' -- 'LLVMLegUpPass backend' is not preserving 'Module Verifier' Pass Arguments: -preverify -domtree -verify -memdep -asap Target Data Layout Basic Alias Analysis (default AA impl) FunctionPass Manager Preliminary module verification Dominator Tree Construction Module Verifier Memory Dependence Analysis As Soon As Possible Scheduling LLVMLegUpPass backend 0x9d3ee58 Executing Pass 'Preliminary module verification' on Function 'main'... 0x9d3ee58 Executing Pass 'Dominator Tree Construction' on Function 'main'... 0x9d3ee58 Executing Pass 'Module Verifier' on Function 'main'... 0x9d3d0f8 Required Analyses: Preliminary module verification, Dominator Tree Construction -*- 'Module Verifier' is the last user of following pass instances. Free these instances 0x9d3ee58 Freeing Pass 'Dominator Tree Construction' on Function 'main'... 0x9d3ee58 Freeing Pass 'Module Verifier' on Function 'main'... 0x9d3ee58 Freeing Pass 'Preliminary module verification' on Function 'main'... 0x9d3ee58 Executing Pass 'Memory Dependence Analysis' on Function 'main'... 0x9d3d8f8 Required Analyses: No Alias Analysis (always returns 'may' alias) 0x9d3ee58 Executing Pass 'As Soon As Possible Scheduling' on Function 'main'... 0x9d3da58 Required Analyses: Memory Dependence Analysis 0x9d3ee58 Executing Pass 'LLVMLegUpPass backend' on Function 'main'... 0x9d3e0a0 Required Analyses: Memory Dependence Analysis, No Alias Analysis (always returns 'may' alias), Scheduler
Why is the scheduler there twice at the beginning? So we are running all the function passes on 'main' only. Note: the ASAP Scheduling pass is never deallocated.
Okay here is a definite problem. I removed the map<function*, FiniteStateMachine*> and now I get:
0xb16fe58 Executing Pass 'LLVMLegUpPass backend' on Function 'main'... 0xb16f0a0 Required Analyses: Memory Dependence Analysis, No Alias Analysis (always returns 'may' alias), Scheduler -- 'LLVMLegUpPass backend' is not preserving 'As Soon As Possible Scheduling' -- 'LLVMLegUpPass backend' is not preserving 'As Soon As Possible Scheduling' -- 'LLVMLegUpPass backend' is not preserving 'Memory Dependence Analysis' -*- 'LLVMLegUpPass backend' is the last user of following pass instances. Free these instances 0xb16fe58 Freeing Pass 'As Soon As Possible Scheduling' on Function 'main'... 0xb16fe58 Freeing Pass 'LLVMLegUpPass backend' on Function 'main'... 0xb16fe58 Freeing Pass 'Memory Dependence Analysis' on Function 'main'... [New Thread 0xb75a46d0 (LWP 6133)] Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 0xb75a46d0 (LWP 6133)] 0xb77d52f5 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string () from /usr/lib/libstdc++.so.6 (gdb) up #1 0x085ebc78 in legup::State::getName (this=0x0) at State.h:56 56 string getName() { return name; } (gdb) #2 0x085d0051 in printFunctionHandshaking (fsm=0xb16f740, Out=@0xbfcd14c4) at Verilog.cpp:2264 2264 Out << indent << "\t\tcur_state = " << firstState->getName() << ";\n"; (gdb) #3 0x085ddebe in legup::HwModule::printDatapath (this=0xb177850, Out=@0xbfcd14c4) at Verilog.cpp:2359 2359 printFunctionHandshaking(fsm, Out); (gdb) #4 0x085de76f in legup::HwModule::printVerilog (this=0xb177850, Out=@0xbfcd150c) at Verilog.cpp:1741 1741 printDatapath(Datapath); (gdb) #5 0x085dea8a in legup::LLVMLegUpPass::doFinalization (this=0xb16f0a0, M=@0xb161128) at Verilog.cpp:304 304 HW->printVerilog(SS);
So I use doFinalization() to print out the Verilog. This is called AFTER the scheduler is deallocated! The InstList in State will definitely be destroyed by the destructor.
“A class's destructor (whether or not you explicitly define one) automagically invokes the destructors for member objects. They are destroyed in the reverse order they appear within the declaration for the class. ”
Also: “A derived class's destructor (whether or not you explicitly define one) automagically invokes the destructors for base class subobjects. Base classes are destructed after member objects. In the event of multiple inheritance, direct base classes are destructed in the reverse order of their appearance in the inheritance list.”
See: http://www.parashift.com/c++-faq-lite/dtors.html
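A self-contained miniature of the problem and the obvious way around it (copy the scheduling data out of the analysis while it is still alive, rather than holding a pointer into it; the types here just stand in for Scheduler/LLVMLegUpPass):
<code>
#include <cstdio>
#include <vector>

struct FSM { std::vector<int> states; };

struct SchedulerAnalysis {            // stands in for the Scheduler pass
  FSM fsm;
};

struct Backend {                      // stands in for the LLVMLegUpPass backend
  const FSM *dangling;                // BAD: points into the analysis
  FSM copy;                           // GOOD: owned copy
  Backend() : dangling(0) {}
  void runOnFunction(SchedulerAnalysis &S) {
    dangling = &S.fsm;                // only valid while S is alive
    copy = S.fsm;                     // survives after S is destroyed
  }
  void doFinalization() {
    std::printf("copied FSM has %u states\n", (unsigned)copy.states.size());
    // std::printf("%u\n", (unsigned)dangling->states.size()); // use-after-free
  }
};

int main() {
  Backend B;
  {
    SchedulerAnalysis S;              // the pass manager creates the analysis...
    S.fsm.states.push_back(1);
    S.fsm.states.push_back(2);
    B.runOnFunction(S);
  }                                   // ...and frees it here, before finalization runs
  B.doFinalization();
  return 0;
}
</code>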
— Andrew Canis 2010/05/28 8:00
Required fields for a bug:
— Andrew Canis 2010/05/27 8:00
Options for config file:
optimization = 1
I think going with Tcl is the best option for now. Tcl is the standard in EDA and it is very powerful. It doesn't seem too bad to parse: Tcl_EvalFile() from libtcl8.4 (see the sketch below). One big disadvantage: it requires the Tcl C library. Interesting thread on the Tcl topic:
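A minimal sketch of reading a config file through the Tcl C API (the variable name "optimization" is just an example from the option list above):
<code>
#include <tcl.h>
#include <cstdio>

int main() {
  Tcl_Interp *interp = Tcl_CreateInterp();
  if (Tcl_EvalFile(interp, "config.tcl") != TCL_OK) {
    std::fprintf(stderr, "config error: %s\n", Tcl_GetStringResult(interp));
    return 1;
  }
  // Read a global variable that the config file is expected to set.
  const char *opt = Tcl_GetVar(interp, "optimization", TCL_GLOBAL_ONLY);
  std::printf("optimization = %s\n", opt ? opt : "(unset)");
  Tcl_DeleteInterp(interp);
  return 0;
}
</code>
Build with something like g++ readcfg.cpp -ltcl8.4.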
LLVM actually has a few different lex/parsers:
./lib/AsmParser/LLLexer.cpp ./lib/MC/MCAsmLexer.cpp ./utils/TableGen/TGLexer.cpp ./tools/llvm-mc/AsmLexer.cpp
Including one for the custom language for TableGen .td files.
Does LLVM support pragmas? pragmas appear to be preprocessor directives. So they will not show up in the LLVM IR… One test-suite example has OpenMP. LLVM supports OpenMP (llvm-gcc -fopenmp). LLVM IR has calls to (which must be linked in later):
declare void @GOMP_parallel_end() nounwind declare i32 @omp_get_thread_num() declare void @GOMP_barrier() nounwind declare i32 @omp_get_num_threads()
So we might have to use clang to support pragmas? BTW, clang does not support OpenMP. There is a bit of discussion on how to add a new pragma to clang: basically you use the pragma to set metadata in the LLVM IR (a new feature in 2.6).
To reassign a bug, send email to bugzilla-daemon with [Bug 6] in the subject and in the body:
@assigned_to = stetorvs@gmail.com
— Andrew Canis 2010/05/26 8:00
Created legup-bugs mailing list. Added to cc: for all new bugs
Completely reinstalled bugzilla 3.6 in /var/www/legup.org/bugs After every install run (to set permissions):
./checksetup.pl
Added bugzilla email support. Add the following to /etc/aliases:
bugzilla-daemon: "|/var/www/legup.org/bugs/email_in.pl"
Remember to run:
sudo newaliases
Then change bugzilla permissions (do AFTER running checksetup.pl)
sudo chown -R nobody:www-data bugs/
postfix runs /etc/aliases tasks as nobody:nogroup
So you can now reply to bugzilla-daemon emails with comments. Added a slight fix to email_in.pl to filter out gmail reply comments:
our $gmail = qr/^On .* wrote:$/;
To fix a bug, send an email with:
@status = resolved @resolution = fixed
To declare bug 4 a duplicate of bug 5:
to: bugzilla-daemon@legup.org subject: [bug 4] @dup_id = 5
Ran doxygen for all of LLVM: http://legup.org/doxygen/
For legup namespace: http://legup.org/doxygen/namespacelegup.html
To regenerate (takes a long time):
cd llvm/docs make doxygen
Also from the Target/Verilog folder you can run:
doxygen
But this will only create html for legup files. There will be no links to LLVM classes.
jpeg works:
# At t= 17621933000 clk=1 finish=1 return_val= 0 # ** Note: $finish : main.v(17100) # Time: 17621933 ns Iteration: 2 Instance: /main_tb real 69m12.512s user 69m8.219s sys 0m0.396s
— Andrew Canis 2010/05/20 8:00
Load isn't being given an extra state?
57: begin
    /* %7 = getelementptr inbounds [44 x i32]* @imem, i32 0, i32 %6 ; <i32*> [#uses=1]*/
    var8 = {`TAG_imem, 32'b0} + ((var7 + 44*(32'd0)) << 2);
    cur_state = 58;
end
59: begin
    /* %8 = load i32* %7, align 4 ; <i32> [#uses=18]*/
    var9 = memory_controller_out[31:0];
    /* %10 = lshr i32 %8, 26 ; <i32> [#uses=2]*/
    var10 = var9 >>> (32'd26 % 32);
    cur_state = 60;
end
Strange, 59 isn't put right after 58… Okay fixed small bug in Verilog.cpp
Why did I remove load_noop? It makes it really hard to diff the changes… Okay, put them back and mips works.
The memory_controller_out isn't always at the start of the state. Fixed.
Why is there a ret void in the middle of a random basic block?
/* %64 = load i32* %dlti, align 4 ; <i32> [#uses=1]*/ /* ret void*/ finish = 1; cur_state = 60;
finish = 1 is in the wrong place…
Fixed. All return instructions moved to last state. adpcm works.
Everything works except for gsm, which has a segfault. Very strange: there is a GlobalValue with addr=0x11 and I'm not sure what's causing this… so the instruction (I) must have been deleted; that's the only explanation. Okay, so one problem with the analysis pass is that it modifies the code to add the load_noop. Can I move this to the non-analysis pass? Moved insertLoadNoop() into the main LegUp pass. Didn't fix anything. Moved insertLoadNoop() back into the scheduler so I remember to remove it.
So some instruction is getting removed? It happens right on this instruction:
<code>
STARTING: 92
call void @llvm.memset.i64(i8* %scevgep90.796, i8 0, i64 32, i32 4)
store i32 0, i32* %L_ACF, align 4
</code>
Caused by the memset. gsm was the only benchmark that had this intrinsic.
Maybe I should git pull? No. I'd rather not have a segfault in this case. Why does this happen?
Interesting. So originally in gsm.ll the instr looks like:
call void @llvm.memset.i64(i8* %scevgep90.796, i8 0, i64 32, i32 4)
Then when I print the basic blocks inside I see this instruction:
bb.nph47:                                         ; preds = %gsm_mult_r.exit, %gsm_mult_r.exit.us, %bb7
  %62 = load i16* %s, align 2 ; <i16> [#uses=1]
  %scevgep90.7 = getelementptr i32* %L_ACF, i32 1 ; <i32*> [#uses=11]
  %scevgep90.796 = bitcast i32* %scevgep90.7 to i8* ; <i8*> [#uses=1]
  %63 = call i8* @memset(i8* %scevgep90.796, i32 0, i32 32) ; <i8*> [#uses=0]
Why is that memset() not in the original? The code has been subtly modified… why is uses=0? Also I noticed this in the diffs… the registers were slightly misnamed… When do I ever modify the instruction?? It's lowerIntrinsics that does the slight modification:
acanis@acanis-desktop:~/work/legup/examples/chstone/gsm$ make &> log; grep memset log call void @llvm.memset.i64(i8* %scevgep90.796, i8 0, i64 32, i32 4) call void @llvm.memset.i64(i8* %r_addr.075.1111, i8 0, i64 14, i32 2) define i8* @memset(i8* %m, i32 %c, i32 %n) nounwind { acanis@acanis-desktop:~/work/legup/examples/chstone/gsm$ make &> log; grep memset log %63 = call i8* @memset(i8* %scevgep90.796, i32 0, i32 32) ; <i8*> [#uses=0] %134 = call i8* @memset(i8* %r_addr.075.1111, i32 0, i32 14) ; <i8*> [#uses=0] define i8* @memset(i8* %m, i32 %c, i32 %n) nounwind {
So if we added the original instruction before lowerIntrinsics, we'll fail because it has been modified. So we must call the scheduler AFTER lowering intrinsics.
So I moved lowerIntrinsics to doInitialization() and made sure to return true (modified).
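For reference, a sketch of what the doInitialization() step could look like with LLVM's IntrinsicLowering (this is my reconstruction, not the actual LegUp code; I only lower the mem intrinsics here, and a real version would decide this more carefully):
<code>
#include "llvm/CodeGen/IntrinsicLowering.h"
#include "llvm/IntrinsicInst.h"
#include "llvm/Intrinsics.h"
#include "llvm/Module.h"
#include "llvm/Function.h"
#include "llvm/Target/TargetData.h"
using namespace llvm;

// Lower llvm.memset/memcpy/memmove into plain calls before any analysis pass
// sees the instructions; returns true if the module was modified.
bool lowerMemIntrinsics(Module &M) {
  TargetData TD(&M);
  IntrinsicLowering IL(TD);
  IL.AddPrototypes(M);                         // declare memset() etc. in the module
  bool Changed = false;
  for (Module::iterator F = M.begin(), FE = M.end(); F != FE; ++F)
    for (Function::iterator BB = F->begin(), BE = F->end(); BB != BE; ++BB)
      for (BasicBlock::iterator I = BB->begin(), E = BB->end(); I != E; ) {
        Instruction *Inst = &*I;
        ++I;                                   // advance first: lowering erases the call
        if (IntrinsicInst *II = dyn_cast<IntrinsicInst>(Inst)) {
          Intrinsic::ID ID = II->getIntrinsicID();
          if (ID == Intrinsic::memset || ID == Intrinsic::memcpy ||
              ID == Intrinsic::memmove) {
            IL.LowerIntrinsicCall(II);         // e.g. llvm.memset.i64 -> call @memset
            Changed = true;
          }
        }
      }
  return Changed;                              // so the pass manager knows the IR changed
}
</code>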
Running without function inlining. Everything works but aes. aes mif initialization has a problem. Fixed. Caused by one bit global variable.
Okay everything works. Running jpeg with function inlining.
To add new users to git:
cd ~/gitosis-admin/keydir
# open id_rsa.pub and get user/host name from the end of the line
cp /new/user/id_rsa.pub user@host.pub
# add user@host to gitosis.conf
git add user@host.pub
git commit -a -v
git push
— Andrew Canis 2010/05/19 12:00
Export to EMF from xfig for the image to show up properly in MS office.
Setup a new mailing list for legup-commits. Grabbed my old post-receive script from cashstream. Needed to run the following:
legup.git/hooks$ sudo git config hooks.mailinglist "legup-commits@legup.org" legup.git/hooks$ sudo git config hooks.envelopesender "legup-commits@legup.org"
Envelopesender makes the email look like it was sent from legup-commits@legup.org
— Andrew Canis 2010/05/14 12:00
Both Victor and Ahmed ran into some issues with adpcm returning 1. Victor fixed it by adding a random printf. So this is probably a scheduling issue.
Had to configure my username to make my xml-rpc work
I backed up the old verilog files in examples/backupVerilog
— Andrew Canis 2010/05/10 03:00
Upgraded wiki to latest version and moved to legup.org/wiki
— Andrew Canis 2010/05/05 11:00
So I was noticing a massive mismatch between 'df -h' and 'disk usage analyzer'. Basically my whole hard drive was completely full even though I should have had 200GB left. Turns out this was caused by rescue time:
acanis@acanis-desktop:/home$ lsof -s |grep deleted ... /usr/bin/ 5517 acanis 34w REG 8,3 259150815232 16190085 /home/acanis/.rescuetime/tmp/notifier.debuglog (deleted) ...
That's 259 150 815 232 bytes = 241.353004 gigabytes!
To get a thesis:
library.utoronto.ca e-resources search: dissertation Dissertations & Theses: Full Text
— Andrew Canis 2010/05/21 12:59
Okay works now:
acanis@acanis-desktop:~/work/legup/examples/sra$ llc -march=v --debug-pass=Structure sra.bc Pass Arguments: -preverify -domtree -verify -memdep -asap Target Data Layout Basic Alias Analysis (default AA impl) ModulePass Manager FunctionPass Manager Preliminary module verification Dominator Tree Construction Module Verifier Memory Dependence Analysis As Soon As Possible Scheduling LLVMLegupPass backend
Oh shit. I was compiling with 'make' instead of 'makellvm llc'!
Trying to add the scheduler analysis pass. Don't see it for some reason:
acanis@acanis-desktop:~/work/legup/examples/sra$ llc -march=v --debug-pass=Structure sra.bc Pass Arguments: -preverify -domtree -verify -memdep Target Data Layout Basic Alias Analysis (default AA impl) ModulePass Manager FunctionPass Manager Preliminary module verification Dominator Tree Construction Module Verifier Memory Dependence Analysis LLVMLegupPass backend
— Andrew Canis 2010/05/09 19:26
Passes can definitely be run in order based on how they are added to the pass manager: PM.add(pass)
Check out Support/StandardPasses.h for -O3 optimizations added to the pass manager
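For reference, a sketch of that API (LLVM 2.x-era; any backend pass added after the verifier here is hypothetical):
<code>
#include "llvm/PassManager.h"
#include "llvm/Module.h"
#include "llvm/Analysis/Verifier.h"
#include "llvm/Target/TargetData.h"
using namespace llvm;

// Passes run in the order they are added, subject to their declared
// dependencies being scheduled first by the pass manager.
void runMyPipeline(Module &M) {
  PassManager PM;
  PM.add(new TargetData(&M));        // target data layout first
  PM.add(createVerifierPass());      // then module verification
  // PM.add(createMyLegUpPass());    // hypothetical backend pass last
  PM.run(M);
}
</code>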
To run -O3 on a bitcode:
opt -O3 -time-passes sra.bc > /dev/null
There is an LLVM wiki:
— Andrew Canis 2010/04/08 09:46
Pretty cool command: lastlog
— Andrew Canis 2010/04/01 07:00
The time required to run quartus on all the chstone benchmarks: 5.5h
real 327m22.206s user 328m59.682s sys 2m31.025s
Probably should look into how to speed this up. Parallelism? right now I run everything serially. It was also slowed because I ran quartus_map before the full compile to initialize each project.
— Andrew Canis 2010/03/30 22:40
Before the shift changes to motion.c: -O3 fails (after LTO) but with no optimization the result is correct.
acanis@acanis-desktop:~/work/legup/examples/chstone/motion$ llvm-gcc mpeg2.c --emit-llvm -c -O3 -o mpeg2.prelto.bc
acanis@acanis-desktop:~/work/legup/examples/chstone/motion$ llvm-ld mpeg2.prelto.bc -b=mpeg2.bc
acanis@acanis-desktop:~/work/legup/examples/chstone/motion$ lli mpeg2.bc
2
acanis@acanis-desktop:~/work/legup/examples/chstone/motion$ llvm-gcc mpeg2.c --emit-llvm -c -o mpeg2.prelto.bc
acanis@acanis-desktop:~/work/legup/examples/chstone/motion$ llvm-ld mpeg2.prelto.bc -b=mpeg2.bc
acanis@acanis-desktop:~/work/legup/examples/chstone/motion$ lli mpeg2.bc
0
Result is 2 until both changes have been made:
diff --git a/examples/chstone/motion/getbits.c b/examples/chstone/motion/getbits.c
index bf3219d..aeacf1d 100755
--- a/examples/chstone/motion/getbits.c
+++ b/examples/chstone/motion/getbits.c
@@ -110,7 +110,7 @@ unsigned int Show_Bits (N)
      int N;
 {
-  return ld_Bfr >> (32 - N);
+  return ld_Bfr >> (unsigned)(32-N)%32;
 }
diff --git a/examples/chstone/motion/motion.c b/examples/chstone/motion/motion.c
index b2f7278..8d490a0 100755
--- a/examples/chstone/motion/motion.c
+++ b/examples/chstone/motion/motion.c
@@ -152,6 +152,7 @@ decode_motion_vector (pred, r_size, motion_code, motion_residual,
 {
   int lim, vec;
+  r_size = r_size % 32;
   lim = 16 << r_size;
   vec = full_pel_vector ? (*pred >> 1) : (*pred);
— Andrew Canis 2010/03/29 22:40
So gcc computes:
4042285200 >> 4294967128 = 240
which C doesn't guarantee (shifting by more than the type width is undefined behaviour), BUT it matches what the x86 shift instructions do, since they mask the count to the low 5 bits:
4042285200 >> (4294967128 % 32) = 4042285200 >> 24 = 240
One problem with motion is inside:
decode_motion_vector (&PMV[0], h_r_size, motion_code, motion_residual, full_pel_vector);
To match GCC we need to add:
r_size = r_size%32;
This is a legitimate discrepancy between LLVM and GCC:
acanis@acanis-desktop:~/work/legup/examples/chstone/motion$ lli mpeg2.bc PMV[0][0][0] = 45 PMV[0][0][0] = 286 2 acanis@acanis-desktop:~/work/legup/examples/chstone/motion$ gcc -fmudflap -lmudflap mpeg2.c; ./a.out PMV[0][0][0] = 45 PMV[0][0][0] = 1566 0
Again in Show_Bits() there is another shift problem:
volatile int tmp = (32-N); return ld_Bfr >> tmp;
Which LLVM also gets wrong:
acanis@acanis-desktop:~/work/legup/examples/chstone/motion$ lli mpeg2.bc PMV[0][0][0] = 45 PMV[0][0][0] = 1326 2 acanis@acanis-desktop:~/work/legup/examples/chstone/motion$ gcc -fmudflap -lmudflap mpeg2.c; ./a.out PMV[0][0][0] = 45 PMV[0][0][0] = 1566 0
After fixing the two problems above, lli performs correctly. However, LegUp doesn't work without an extra printf statement in Get_Bits(). There is one final scheduling error to deal with.
jpeg also works without any warnings/errors. I'm leaving it out of the test suite because it takes so long to run (45min).
# At t= 17621931000 clk=1 finish=1 return_val= 0 # ** Note: $finish : main.v(19128) # Time: 17621931 ns Iteration: 2 Instance: /main_tb real 44m57.348s user 44m50.964s sys 0m0.292s
Unfortunately I still get warning with gsm even after the fix…
# addr:000000000 # memset cur_state: 0 # main cur_state: 127 # addr:800000140 # memset cur_state: 0 # main cur_state: 128 # main_tb.main_inst.memory_controller_inst.so.altsyncram_component Warning : Address pointed at port A is out of bound! # main_tb.main_inst.memory_controller_inst.so.altsyncram_component Warning : Address pointed at port A is out of bound!
Found the problem: another off-by-one error that mudflap didn't find.
/* Rescaling of the array s[0..159] */ if (scalauto > 0) for (k = 160; k >= 0; k--) *s++ <<= scalauto;
The 160 should be changed to 159: s[] has 160 elements (indices 0..159), so starting k at 160 runs one iteration too many and writes past the end.
Checked all other benchmarks. None have errors with mudflap except for unaligned (as expected)
Running ./unaligned/dg.exp ... FAIL: unaligned Failed with exit(2)gcc -g -fmudflap -lmudflap unaligned.c; ./a.out Two integers: aabbccdd eeff0011 Byte 0: aabbccdd Byte 1: 11aabbcc Byte 2: 11aabb Byte 3: ff0011aa Byte 4: eeff0011 Byte 5: 65eeff00 Byte 6: fa66eeff Byte 7: a8fa67ee ******* mudflap violation 1 (check/read): time=1269831641.070330 ptr=0xbfa8fa65 size=4 pc=0x4003a8ed location=`unaligned.c:16:9 (main)' /usr/lib/libmudflap.so.0(__mf_check+0x3d) [0x4003a8ed] ./a.out(main+0x2d4) [0x8048a98] /usr/lib/libmudflap.so.0(__wrap_main+0x4f) [0x4003b34f] Nearby object 1: checked region begins 5B into and ends 1B after mudflap object 0x95d7b38: name=`unaligned.c:7:18 (main) a' bounds=[0xbfa8fa60,0xbfa8fa67] size=8 area=stack check=4r/1w liveness=5 alloc time=1269831641.070284 pc=0x4003b2ed number of nearby objects: 1 ******* mudflap violation 2 (check/read): time=1269831641.070605 ptr=0xbfa8fa66 size=4 pc=0x4003a8ed location=`unaligned.c:16:9 (main)' /usr/lib/libmudflap.so.0(__mf_check+0x3d) [0x4003a8ed] ./a.out(main+0x2d4) [0x8048a98] /usr/lib/libmudflap.so.0(__wrap_main+0x4f) [0x4003b34f] Nearby object 1: checked region begins 6B into and ends 2B after mudflap object 0x95d7b38: name=`unaligned.c:7:18 (main) a' number of nearby objects: 1 ******* mudflap violation 3 (check/read): time=1269831641.070673 ptr=0xbfa8fa67 size=4 pc=0x4003a8ed location=`unaligned.c:16:9 (main)' /usr/lib/libmudflap.so.0(__mf_check+0x3d) [0x4003a8ed] ./a.out(main+0x2d4) [0x8048a98] /usr/lib/libmudflap.so.0(__wrap_main+0x4f) [0x4003b34f] Nearby object 1: checked region begins 7B into and ends 3B after mudflap object 0x95d7b38: name=`unaligned.c:7:18 (main) a' number of nearby objects: 1 make: *** [all] Error 176
There is definitely some sort of array overrun in gsm. gcc -fbounds-check doesn't work. Apparently there are two other options: mudflap and ssp
Okay found something. There is a violation in Autocorrelation.
acanis@acanis-desktop:~/work/legup/examples/chstone/gsm$ gcc -g -fmudflap -lmudflap gsm.c acanis@acanis-desktop:~/work/legup/examples/chstone/gsm$ ./a.out ******* mudflap violation 1 (check/write): time=1269830619.922753 ptr=0xbf918c18 size=4 pc=0xb7e5d8ed location=`lpc.c:88:7 (Autocorrelation)' /usr/lib/libmudflap.so.0(__mf_check+0x3d) [0xb7e5d8ed] ./a.out(Autocorrelation+0x3e7) [0x80492e7] ./a.out(Gsm_LPC_Analysis+0x3b) [0x8051df4] Nearby object 1: checked region begins 1B after and ends 4B after mudflap object 0x90d9c68: name=`lpc.c:317:12 (Gsm_LPC_Analysis) L_ACF' bounds=[0xbf918bf4,0xbf918c17] size=36 area=stack check=0r/0w liveness=0 alloc time=1269830619.922702 pc=0xb7e5e2ed number of nearby objects: 1 ******* mudflap violation 2 (check/read): time=1269830619.923481 ptr=0xbf918c18 size=4 pc=0xb7e5d8ed location=`lpc.c:151:7 (Autocorrelation)' /usr/lib/libmudflap.so.0(__mf_check+0x3d) [0xb7e5d8ed] ./a.out(Autocorrelation+0x6762) [0x804f662] ./a.out(Gsm_LPC_Analysis+0x3b) [0x8051df4] Nearby object 1: checked region begins 1B after and ends 4B after mudflap object 0x90d9c68: name=`lpc.c:317:12 (Gsm_LPC_Analysis) L_ACF' number of nearby objects: 1 ******* mudflap violation 3 (check/write): time=1269830619.923933 ptr=0xbf918c18 size=4 pc=0xb7e5d8ed location=`lpc.c:151:7 (Autocorrelation)' /usr/lib/libmudflap.so.0(__mf_check+0x3d) [0xb7e5d8ed] ./a.out(Autocorrelation+0x6806) [0x804f706] ./a.out(Gsm_LPC_Analysis+0x3b) [0x8051df4] Nearby object 1: checked region begins 1B after and ends 4B after mudflap object 0x90d9c68: name=`lpc.c:317:12 (Gsm_LPC_Analysis) L_ACF' number of nearby objects: 1 Segmentation fault
Nice! Easy patch:
diff --git a/examples/chstone/gsm/lpc.c b/examples/chstone/gsm/lpc.c
index baa99d0..dfe126c 100755
--- a/examples/chstone/gsm/lpc.c
+++ b/examples/chstone/gsm/lpc.c
@@ -84,7 +84,7 @@ Autocorrelation (word * s /* [0..159] IN/OUT */ ,
 #define STEP(k) L_ACF[k] += ((longword)sl * sp[ -(k) ]);
 #define NEXTI sl = *++sp
 
-  for (k = 9; k >= 0; k--)
+  for (k = 8; k >= 0; k--)
     L_ACF[k] = 0;
 
   STEP (0);
@@ -147,7 +147,7 @@ Autocorrelation (word * s /* [0..159] IN/OUT */ ,
     STEP (8);
   }
 
-  for (k = 9; k >= 0; k--)
+  for (k = 8; k >= 0; k--)
     L_ACF[k] <<= 1;
 }
— Andrew Canis 2010/03/28 22:40
Figured out a solution to the memset() issue. Create a .bc for memset and then link it in with llvm-ld -disable-opt.
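A minimal sketch of the memset body to compile into that .bc (the i32 size argument matches the lowered @memset(i8*, i32, i32) call seen earlier; a real library version would use size_t):
<code>
// memset.cpp -- compile with llvm-gcc --emit-llvm -c, then link with
// llvm-ld -disable-opt so the body survives until LegUp sees it.
extern "C" void *memset(void *m, int c, unsigned n) {
  unsigned char *p = (unsigned char *)m;
  while (n--)
    *p++ = (unsigned char)c;      // plain byte-wise fill, easy to schedule
  return m;
}
</code>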
Looking into gsm segfault:
acanis@acanis-desktop:~/work/legup/examples/chstone/gsm$ valgrind ./a.out ==27584== Memcheck, a memory error detector. ==27584== Copyright (C) 2002-2008, and GNU GPL'd, by Julian Seward et al. ==27584== Using LibVEX rev 1884, a library for dynamic binary translation. ==27584== Copyright (C) 2004-2008, and GNU GPL'd, by OpenWorks LLP. ==27584== Using valgrind-3.4.1-Debian, a dynamic binary instrumentation framework. ==27584== Copyright (C) 2000-2008, and GNU GPL'd, by Julian Seward et al. ==27584== For more details, rerun with: -v ==27584== ==27584== Invalid write of size 4 ==27584== at 0x8049647: main (gsm.c:103) ==27584== Address 0xfffffff8 is not stack'd, malloc'd or (recently) free'd ==27584== ==27584== Process terminating with default action of signal 11 (SIGSEGV) ==27584== Access not within mapped region at address 0xFFFFFFF8 ==27584== at 0x8049647: main (gsm.c:103) ==27584== If you believe this happened as a result of a stack overflow in your ==27584== program's main thread (unlikely but possible), you can try to increase ==27584== the size of the main thread stack using the --main-stacksize= flag. ==27584== The main thread stack size used in this run was 8388608. ==27584== ==27584== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 13 from 1) ==27584== malloc/free: in use at exit: 0 bytes in 0 blocks. ==27584== malloc/free: 0 allocs, 0 frees, 0 bytes allocated. ==27584== For counts of detected errors, rerun with: -v ==27584== All heap blocks were freed -- no leaks are possible. Segmentation fault
If I add an int i declaration into Gsm_LPC_Analysis the segfault disappears. If I comment out Autocorrelation the segfault disappears.
I don't understand how this line could ever segfault:
for (i = 0; i < N; i++)
There must be a buffer overrun which overwrites the instruction?
Mmmm. So gsm doesn't segfault when -O3 is enabled:
acanis@acanis-desktop:~/work/legup/examples/chstone/gsm$ gcc -O3 -g gsm.c; ./a.out 0
Really strange. I don't think this is worth fixing.
This I should fix though:
# main_tb.main_inst.memory_controller_inst.so.altsyncram_component Warning : Address pointed at port A is out of bound! # main_tb.main_inst.memory_controller_inst.so.altsyncram_component Warning : Address pointed at port A is out of bound! # main_tb.main_inst.memory_controller_inst.so.altsyncram_component Warning : Address pointed at port A is out of bound! # main_tb.main_inst.memory_controller_inst.so.altsyncram_component Warning : Address pointed at port A is out of bound! # 0 # At t= 21583000 clk=1 finish=1 return_val= 0 # ** Note: $finish : gsm.v(6986)
For some reason I'm not seeing any latency improvement after enabling LTO. llvm-ld definitely performs optimizations:
acanis@acanis-desktop:~/work/legup/examples/chstone/dfsin$ llvm-ld dfsin.bc -b=dfsin.lto.bc -stats ===-------------------------------------------------------------------------=== ... Statistics Collected ... ===-------------------------------------------------------------------------=== 4 globalopt - Number of functions converted to fastcc 18 globalopt - Number of functions deleted 2 globalsmodref-aa - Number of functions without address taken 4 globalsmodref-aa - Number of global vars without address taken 2 gvn - Number of instructions deleted 1 gvn - Number of loads deleted 3 inline - Number of functions deleted because all callers found 3 inline - Number of functions inlined 4 instcombine - Number of insts combined 22 internalize - Number of functions internalized 4 internalize - Number of global vars internalized 1 loopsimplify - Number of pre-header or exit blocks inserted 2 memdep - Number of block queries that were completely cached 1180 memdep - Number of fully cached non-local ptr responses 527 memdep - Number of uncached non-local ptr responses 58 sccp - Number of basic blocks unreachable 1 sccp - Number of globals found to be constant by IPSCCP 180 sccp - Number of instructions removed 10 sccp - Number of instructions removed by IPSCCP
Definitely differences in bitcode:
acanis@acanis-desktop:~/work/legup/examples/chstone/dfsin$ wc -l dfsin.lto.ll 1678 dfsin.lto.ll acanis@acanis-desktop:~/work/legup/examples/chstone/dfsin$ wc -l dfsin.ll 2122 dfsin.ll acanis@acanis-desktop:~/work/legup/examples/chstone/dfsin$ ll dfsin.*bc -rw-r--r-- 1 acanis acanis 17K 2010-03-28 02:42 dfsin.bc -rwxr-xr-x 1 acanis acanis 14K 2010-03-28 02:42 dfsin.lto.bc
Ohh shit. It's calling the Verilog file something different!
-rw-r--r-- 1 acanis acanis 286K 2010-03-28 02:45 dfsin.lto.v
pre-LTO:
# At t= 126617000 clk=1 finish=1 return_val= 0 # ** Note: $finish : dfsin.v(13344)
post-LTO:
# At t= 114743000 clk=1 finish=1 return_val= 0 # ** Note: $finish : dfsin.v(9599)
About 10% latency improvement:
63309 - 57372 = 5937 cycles saved
5937 / 63309 ≈ 0.0938 (about 9.4%)
Trying to figure out how to enable link-time optimization. Enabling -flto does nothing as long as -c or -S is given; we get out the same LLVM bitcode. When you leave out -c/-S you get the error:
acanis@acanis-desktop:~/work/legup/examples/chstone/gsm$ make llvm-gcc gsm.c --emit-llvm -O3 -flto -fno-builtin -o gsm.bc /tmp/cc7Icg04.o: file not recognized: File format not recognized collect2: ld returned 1 exit status
This is basically saying the linker doesn't recognize LLVM bitcode. So I have to set this up now.
Okay, figured it out. Very simple actually. Note -c instead of -S: llvm-ld can't take textual LLVM IR (.ll) as input.
llvm-gcc gsm.c --emit-llvm -O3 -c -fno-builtin -o gsm.bc   # produces binary bitcode: gsm.bc
llvm-ld gsm.bc -b=gsm.bc.lto                               # produces a.out shell script (to call lli) and gsm.bc.lto binary bitcode
llvm-dis gsm.bc.lto                                        # produces textual bitcode: gsm.bc.lto.ll
— Andrew Canis 2010/03/28 02:09
There should be an IR to represent Verilog so errors like this are not possible:
-- Compiling module ByRef ** Error: functions.v(225): Bounds of part-select into 'memory_controller_out' are reversed. ** Error: functions.v(225): MSB of part-select into 'memory_controller_out' is out of bounds.
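For example, a tiny RTL IR could catch this class of bug when the backend builds the expression, instead of in ModelSim. A hypothetical sketch, not the planned class design:
<code>
#include <cassert>
#include <string>

struct RTLSignal {
  std::string name;
  unsigned width;                  // e.g. memory_controller_out is 64 bits wide
};

// A part-select node that validates its bounds at construction time, so
// "reversed bounds" and "MSB out of bounds" become assertion failures in the
// compiler rather than ModelSim errors.
struct PartSelect {
  const RTLSignal &sig;
  unsigned msb, lsb;
  PartSelect(const RTLSignal &s, unsigned hi, unsigned lo)
      : sig(s), msb(hi), lsb(lo) {
    assert(msb >= lsb && "bounds of part-select are reversed");
    assert(msb < sig.width && "MSB of part-select is out of bounds");
  }
};
</code>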
— Andrew Canis 2010/03/26 21:30
Found SPARK benchmark circuits MPEG-1 and adpcm:
New classes to be implemented:
class Writer { };
class VerilogWriter : Writer { };
class VHDLWriter : Writer { };

class Target { };
class TargetDevice : Target { };
class StratixIV : TargetDevice { };
class TargetTool : Target { };
class Altera : TargetTool { };
class Generic : TargetTool { };

class Binding { };
class FSM { };

class Resource { };
class FunctionalUnit : Resource { };
class AddSub : FunctionalUnit { };
class Multiplier : FunctionalUnit { };
class Shifter : FunctionalUnit { };
class MemoryUnit : Resource { };
class Register : MemoryUnit { };

class Constraints {
public:
    Resource* getResource(Instruction* I) { return NULL; }
};

class Schedule {
    virtual unsigned getState(HwModule *M, Instruction *I) = 0;
};
class ScheduleASAP : Schedule { };
class ScheduleALAP : Schedule { };
— Andrew Canis 2010/03/10 08:15
A good page with a lot of benchmark links:
Looking around for the '92 and '95 HLS benchmarks. Found a paper for '95.
Both of these links are broken:
— Andrew Canis 2010/03/10 16:45
So it's basically an unaligned access due to memset. Can I get rid of the need for memset? Trying -ffreestanding. Doesn't work:
-fno-builtin also doesn't work
Running into some warning on gsm:
# addr:800000024 # memset cur_state: 12 # Autocorrelation cur_state: 95 # memset cur_state: 0 # Reflection_coefficients cur_state: 0 # Quantization_and_coding cur_state: 0 # main cur_state: 9 # main_tb.main_inst.memory_controller_inst.L_ACF_i.altsyncram_component Warning : Address pointed at port A is out of bound!
— Andrew Canis 2010/03/09 02:31
Really really cool. There is an x86 instruction called 'rdtsc' that reads a time-stamp counter which increments every clock cycle!
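A quick sketch of reading it with GCC inline assembly (no serialization, so it only gives a rough cycle count):
<code>
#include <stdint.h>
#include <stdio.h>

// Read the x86 time-stamp counter: the counter increments every cycle,
// rdtsc just reads it into EDX:EAX.
static inline uint64_t rdtsc(void) {
  uint32_t lo, hi;
  __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
  return ((uint64_t)hi << 32) | lo;
}

int main() {
  uint64_t start = rdtsc();
  /* ... code under test ... */
  uint64_t end = rdtsc();
  printf("elapsed cycles: %llu\n", (unsigned long long)(end - start));
  return 0;
}
</code>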
Running into alignment issues with memory controller. I think I'm going to make it a requirement that all accesses be aligned.
Good description:
— Andrew Canis 2010/03/03 04:31
Got modelsim working. Added the following to /etc/rc.local (to run at startup)
ssh -L 7325:128.100.10.141:7326 isis.eecg.utoronto.ca & ssh -L 7327:128.100.10.141:7327 isis.eecg.utoronto.ca &
And added this to .bashrc:
export MGLS_LICENSE_FILE=7325@localhost export PATH=/home/acanis/modelsim/install/modeltech/linux:$PATH
To synthesize a verilog file:
quartus_map dfadd --analyze_file=../dfadd.v
— Andrew Canis 2010/02/16 08:40
Good verilog reference:
gnuplot:
— Andrew Canis 2010/02/10 22:29
Example of dominator tree:
— Andrew Canis 2010/02/02 16:04
Tip: if make is taking too long take out the -debug flag (speeds up code 100x)
— Andrew Canis 2010/02/01 16:25
To create .mif soft links:
acanis@acanis-desktop:~/work/legup/examples/chstone/aes/testbench$ for i in ../*.mif
> do
> ln -s $i
> done
Note: you need the line breaks or you can do this one-liner:
for i in ../*.mif; do ln -s $i; done
— Andrew Canis 2010/01/28 14:20
Multidimensional arrays are stored in row-major order in C (from Wikipedia). The offset of A[row][column] can then be computed as: offset = row*NUMCOLS + column.
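A quick self-check of that formula:
<code>
#include <cassert>
#include <cstddef>

int main() {
  const int NUMROWS = 3, NUMCOLS = 4;
  int A[NUMROWS][NUMCOLS];
  // Byte offset of A[row][col] from the start of the array is
  // (row*NUMCOLS + col) * sizeof(element).
  for (int row = 0; row < NUMROWS; ++row)
    for (int col = 0; col < NUMCOLS; ++col) {
      std::size_t offset = (char*)&A[row][col] - (char*)A;
      assert(offset == (row * NUMCOLS + col) * sizeof(int));
    }
  return 0;
}
</code>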
— Andrew Canis 2010/01/26 14:24
Benchmark problems:
%10 = load i8* %d.1, align 1 ; <i8> [#uses=1] addr: %d.1 = select i1 %8, i8* %7, i8* %data ; <i8*> [#uses=2] getRam: %d.1 = select i1 %8, i8* %7, i8* %data ; <i8*> [#uses=2] getRam: %8 = icmp ult i8* %7, %3 ; <i1> [#uses=1] LLVM ERROR: Cannot find ram!
— Andrew Canis 2010/01/22 16:44
adpcm problem. Dynamic pointers:
getRam: %ril.0.in = getelementptr inbounds [31 x i32]* %quant26bt_pos.pn, i32 0, i32 %5 ; <i32*> [#uses=1] getRam: %quant26bt_pos.pn = select i1 %abscond, [31 x i32]* @quant26bt_pos, [31 x i32]* @quant26bt_neg ; <[31 x i32]*> [#uses=1] LLVM ERROR: Cannot find ram!
— Andrew Canis 2010/01/21 17:16
All softfloat benchmarks work (dfadd, dfsin, dfmul, dfdiv). Unfortunately, each function call adds an additional cycle of latency to a load/store, so for now I've set the latency of a load/store to 10. This will have to be fixed.
Stepping through dfadd. Double precision floating point:
sign (1 bit) | exponent (11 bits) | fraction (52 bits)
-1 = 0xbff0000000000000
-1 = 1 | 01111111111 | 00000000000000000000000000...
-1 = (-1)^1 x 2^(1023 - 1023) x (1 + 0)
To print hex in gdb:
p/x b
Some old comments I found (single precision, 32-bit floating point):
// IEEE 754-1985 FP representation
// sign (1 bit) | exponent (8 bits) | fraction (23 bits)
// one = 0 | 01111111 | 00000000000000000000000;
// one = (-1)^0 x 2 ^ (127 - 127) x (1 + 0)
// two = 0 | 10000000 | 00000000000000000000000
// two = (-1)^0 x 2 ^ (128 - 127) x (1 + 0)
//
// matlab:
// dec2ieee754(1, 'single')
— Andrew Canis 2010/01/20 9:45
Working: DFMUL, DFDIV. DFADD only 38/46 of tests work. DFSIN only 1/36 tests work.
— Andrew Canis 2010/01/19 18:39
So I've noticed something. You can't simply call value.abs() on an APInt, because if the value is the maximum negative value you can't actually represent the absolute value. I.e., an 8-bit integer ranges from -128 to 127, so if you take abs(-128) you'll get -128 back out, not 128 as expected, because 128 is too big to represent.
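The same pitfall with a plain 8-bit integer on a typical two's-complement machine (APInt behaves analogously at its own bit width):
<code>
#include <stdint.h>
#include <stdio.h>

int main() {
  int8_t v = -128;                        // INT8_MIN
  // -v is computed in int (giving 128); truncating back to 8 bits wraps
  // around to -128 again, so the "absolute value" is still negative.
  int8_t a = (int8_t)(v < 0 ? -v : v);
  printf("abs(%d) as int8_t = %d\n", v, a);   // prints -128, not 128
  return 0;
}
</code>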
— Andrew Canis 2010/01/18 15:13
Add a 64'd size prefix to fix the error (note: the negative sign needs to go in front of this prefix, i.e. -64'd<value>):
# ** Error: ../dfadd.v(800): near "9221120237041090560": Numeric value exceeds 32-bit capacity.
To handle global variables:
dfadd is almost working.
— Andrew Canis 2010/01/17 18:26
Found a good introduction to alias analysis in a master's thesis: http://lenherr.name/~thomas/ma/introduction.page
— Andrew Canis 2010/01/16 06:35
Found a bug. Dependencies between load/store/call instructions are not accounted for. These instructions are not connected by a use-def chain, so we need alias analysis. LLVM seems to have an analysis pass called Memory Dependence Analysis:
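A sketch of how the scheduler could query it (LLVM 2.x-era API as I understand it; getDependency() only reports same-block dependencies, so non-local queries would need more work, and the scheduler hook here is hypothetical):
<code>
#include "llvm/Analysis/MemoryDependenceAnalysis.h"
#include "llvm/Function.h"
#include "llvm/Instructions.h"
using namespace llvm;

// For every memory-touching instruction, ask what it depends on so the
// scheduler can add an edge that the SSA use-def chains miss.
void addMemoryEdges(Function &F, MemoryDependenceAnalysis &MDA) {
  for (Function::iterator BB = F.begin(), FE = F.end(); BB != FE; ++BB)
    for (BasicBlock::iterator BI = BB->begin(), BE = BB->end(); BI != BE; ++BI) {
      Instruction *Inst = &*BI;
      if (!isa<LoadInst>(Inst) && !isa<StoreInst>(Inst) && !isa<CallInst>(Inst))
        continue;
      MemDepResult Dep = MDA.getDependency(Inst);
      if (Instruction *DepInst = Dep.getInst()) {
        // Hypothetical scheduler hook: DepInst must complete before Inst starts.
        // addMemoryDependency(DepInst, Inst);
        (void)DepInst;
      }
    }
}
</code>
(A real pass would declare the dependency in getAnalysisUsage() and fetch it with getAnalysis<MemoryDependenceAnalysis>().)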
— Andrew Canis 2010/01/14 15:58
Todo: fix the handling of reference parameters to functions.
— Andrew Canis 2010/01/13 07:43
To avoid function inlining:
llvm-gcc -fno-inline-functions
Other issues:
— Andrew Canis 2010/01/12 02:47
dfadd issues:
Really cool, type 'wh' in gdb.
Debug build: ./configure --disable-optimized
— Andrew Canis 2010/01/08 05:36
There is still one Quartus warning:
"can't check case statement for completeness because the case expression has too many possible states"
This warning occurs when the variable used in the switch() is 32 bits wide, which is too big for Quartus to check all 2^32 possibilities. I don't think this is a big deal to leave in for now. In the future we could downcast the variable to be just big enough for the biggest case label, which will usually be much smaller than 32 bits.
Mips is working! PC incremented properly. dmem is getting out the correct sorted values, and the return value is correct.
# pc=00400020, state=232, dmem_out=ffffffef, dmem_address= 1, dmem_write_enable=0, dmem_in=00000026, # pc=00400020, state=233, dmem_out=fffffff7, dmem_address= 2, dmem_write_enable=0, dmem_in=00000026, # pc=00400020, state=234, dmem_out=00000000, dmem_address= 3, dmem_write_enable=0, dmem_in=00000026, # pc=00400020, state=235, dmem_out=00000003, dmem_address= 4, dmem_write_enable=0, dmem_in=00000026, # pc=00400020, state=236, dmem_out=00000005, dmem_address= 5, dmem_write_enable=0, dmem_in=00000026, # pc=00400020, state=237, dmem_out=0000000b, dmem_address= 6, dmem_write_enable=0, dmem_in=00000026, # pc=00400020, state=238, dmem_out=00000016, dmem_address= 7, dmem_write_enable=0, dmem_in=00000026, # pc=00400020, state=239, dmem_out=00000026, dmem_address= 7, dmem_write_enable=0, dmem_in=00000026, # pc=00400020, state=240, dmem_out=00000026, dmem_address= 7, dmem_write_enable=0, dmem_in=00000026, # pc=00400020, state=241, dmem_out=00000026, dmem_address= 7, dmem_write_enable=0, dmem_in=00000026, # pc=00400020, state=242, dmem_out=00000026, dmem_address= 7, dmem_write_enable=0, dmem_in=00000026, # pc=00400020, state=243, dmem_out=00000026, dmem_address= 7, dmem_write_enable=0, dmem_in=00000026, # At t= 25126000 clk=1 return_val= 0 # exit
Initial stats (Stratix IV Auto): latency 25126/2 = 12563 cycles. Fmax = 234.63 MHz (slow corner). The latency is therefore 12563 / 234.63 MHz = 53.54 us.
Overall not that bad for a first cut. xPilot had a latency of 42us (fast constraint) and 30us (slow constraint).
Area:
Added a noop add instruction after each load to handle the RAM latency. This was really needed because phi instructions were being pushed back into the same state as the load, which is wrong: they need to be 2 cycles afterwards. Adding the noop solves this.
— Andrew Canis 2009/12/21 14:54
From wikipedia:
Verilog-2001 is the dominant flavor of Verilog supported by the majority of commercial EDA software packages.
So we can use the arithmetic shift right (>>>) operator.
I'm adding some assertions to the code to fail when we don't support instructions (floating point etc).
— Andrew Canis 2009/12/18 17:17
Running into an issue with signed/unsigned numbers I think. After the instruction:
0x1100000b, // [pc=0x00400070] beq $8, $0, 44 [L2-0x00400070] ; 23: beq $t0,$zero,L2
The pc jumps to 0x4440009c instead of:
0x8fbf0008, // [pc=0x0040009c] lw $31, 8($29) ; 34: lw $ra,8($sp) ; L2
Todo: fix ram to handle signed integers. Right now it assumes unsigned.
Note, the LLVM primitive types (i8, i32, etc) do not have a sign. Instead certain instructions have two versions: ashr, lshr. See: http://nondot.org/sabre/LLVMNotes/TypeSystemChanges.txt
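That distinction has to surface in the Verilog writer: the shift operator is chosen from the opcode, not from the operand type. A hypothetical helper to illustrate (our real writer is structured differently):
#include "llvm/Instruction.h"
#include <string>

// Map an LLVM shift opcode to the Verilog operator text.
static std::string shiftOperator(const llvm::Instruction &I) {
  switch (I.getOpcode()) {
  case llvm::Instruction::Shl:  return "<<";
  case llvm::Instruction::LShr: return ">>";   // logical: zero-fill
  case llvm::Instruction::AShr: return ">>>";  // arithmetic: operand must be $signed()
  default:                      return "";
  }
}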
.mif files don't support negative decimal numbers
I've updated the square root approx example to include a global array and a printf. It's working fine. Working on mips right now, the issues still to deal with:
Had an architecture meeting yesterday. The slides were a little too technical, focusing too much on C++ class details, but generally looked good. The goal now is just to push the rest of the CHStone benchmarks through and get a test suite. Then we can focus on QoR for a bit, including some basic resource constraints. When we're at roughly the same performance as xPilot we can make a public release.
Required reading for using LLVM: http://llvm.org/docs/ProgrammersManual.html#coreclasses
— Andrew Canis 2009/12/17 09:48
To generate doxygen documentation for Verilog.cpp:
doxygen Doxyfile
Played around with Spark a little more. Doesn't support any c++ features (const, c++ comments). Also doesn't seem to support arrays, for instance this snippet gives a segfault in common subexpression elimination:
int main() {
    int main_result;
    int i;
    int A[2];
    A[0] = 1;
    A[1] = 2;
    for (i = 0; i < 2; i++) {
        main_result = A[i];
    }
    return main_result;
}
Tried running spark on mips.c, I get a segfault:
acanis@acanis-desktop:~/spark/spark-linux-1.2/tutorial/mips$ spark -m -hli -hcs -hcp -hdc -hs -hcc -hvf -hb -hec mips.c Copyright (C) 2000-2003 The Regents of the University of California. All Rights Reserved. SPARK Version 1.2 (built on Feb 4 2004 17:31:10) is initializing ... Done! -- Start initializing IR (and dependences) for routine: mips_c_main() -- Start initializing routine: mips_c_main() At end of source: internal error: assertion failed: dump_expr: bad expr node kind (c_gen_be.c, line 4497) : init_stmt : -- Start Build Dominator Tree on mips_c_main -- Done Build Dominator Tree on mips_c_main -- Start Lowering Exprs of routine: mips_c_main() Got false from isVarConstField for OrigLeftHandOperand Got false from isVarConstField for OrigLeftHandOperand Got false from isVarConstField for OrigLeftHandOperand Got false from isVarConstField for OrigLeftHandOperand Got false from isVarConstField for OrigLeftHandOperand -- Done Lowering Exprs of routine: mips_c_main() -- Start DataDependence on mips_c_main -- Done DataDependence on mips_c_main -- Done initializing IR (and dependences) for routine: mips_c_main() WARNING: not computing loop bounds for WhileLoopNode just yet WARNING: not computing loop bounds for WhileLoopNode just yet -- Start Doing Loop Invariant Code Motion in routine: mips_c_main() -- Start LoopInvariantCM on mips_c_main -- Done LoopInvariantCM on mips_c_main -- Done Doing Loop Invariant Code Motion in routine: mips_c_main() -- Start Unrolling Loops in routine: mips_c_main() -- Done Unrolling Loops in routine: mips_c_main() -- Start Finding Common SubExprs (CSE) in routine: mips_c_main() Segmentation fault
— Andrew Canis 2009/12/15 13:36
Using the visitor pattern is actually really easy. Just inherit from InstVisitor, override the necessary visitXXX methods and then call visit(instr). I can use this to avoid a few of the case statements in my code.
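A minimal sketch of what that looks like (the class name and the set of handlers here are illustrative, not our actual writer):
#include "llvm/Instructions.h"
#include "llvm/Support/InstVisitor.h"
#include "llvm/Support/raw_ostream.h"
using namespace llvm;

// CRTP: InstVisitor dispatches each instruction to the most specific visitXXX.
struct VerilogEmitter : public InstVisitor<VerilogEmitter> {
  void visitBinaryOperator(BinaryOperator &I) { errs() << "binary op\n"; }
  void visitLoadInst(LoadInst &I)             { errs() << "load\n"; }
  void visitStoreInst(StoreInst &I)           { errs() << "store\n"; }
  void visitReturnInst(ReturnInst &I)         { errs() << "return\n"; }
  // Catch-all for anything we don't support yet (floating point, etc.).
  void visitInstruction(Instruction &I)       { errs() << "unsupported: " << I << "\n"; }
};
Then each instruction is handled with emitter.visit(I), or emitter.visit(F) to walk a whole function.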
— Andrew Canis 2009/12/10 18:00
The volatile keyword is handled in llvm by accessing the variables using 'load/store volatile'. So you can assume normal registers are never volatile.
Looking at mips.c. There are a few issues I immediately foresee:
— Andrew Canis 2009/12/10 00:19
Note: there is a df_iterator for depth first iteration of the CFG.
Added support for phi instructions. Used a function called printPHICopiesForSuccessor() from CBackend.
Creating a new user:
sudo adduser --home /home/xxxx xxxx
I can now split up basic blocks based on data dependencies. First I tried cloning instructions but I noticed the names were different, so I changed the code to move instructions to the new basic block instead. I made use of splitBasicBlock().
Now I'm going to get loop.c working. I'll have to fix up my ram address code.
— Andrew Canis 2009/12/09 01:54
Mark came up with a good idea of how to handle breaking basic blocks up based on data dependencies. First find which state each instruction should be in, then just sort the instructions based on their state, which lets you use splitBasicBlock() in LLVM.
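A sketch of that idea, assuming a map from each instruction to its scheduled state and that the instructions have already been stably sorted by state (the map and the surrounding pass are hypothetical):
#include "llvm/BasicBlock.h"
#include "llvm/Instruction.h"
#include <map>
using namespace llvm;

// Split off a new block every time the scheduled state number changes.
static void splitByState(BasicBlock *BB, std::map<Instruction *, int> &state) {
  BasicBlock *Cur = BB;
  int curState = state[&Cur->front()];
  BasicBlock::iterator I = Cur->begin();
  while (!I->isTerminator()) {
    if (state[&*I] != curState) {
      // Everything from I onwards moves to a fresh block; splitBasicBlock
      // also adds an unconditional branch from Cur to the new block.
      Cur = Cur->splitBasicBlock(I, "state");
      curState = state[&*I];
      I = Cur->begin();
    }
    ++I;
  }
}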
— Andrew Canis 2009/12/08 02:30
I've decided to implement the state machines using the LLVM IR. Each state will be a separate basic block, and the terminating statement will be converted to the appropriate next_state Verilog code. I feel this will simplify things, and I don't see any benefit right now in adding a new data structure just for states.
How should we handle phi instructions? I think we should keep phi instructions around as long as possible, until the actual Verilog generation. To turn phi instructions into Verilog you have to push the assignment back into each state mentioned in the phi. There is some code in PHIElimination.cpp to do this. Can I reuse it? Also CBackend has the same issue, so there's probably code I can use in there.
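In other words, each incoming (predecessor, value) pair of the phi becomes a register assignment in that predecessor's state. A small sketch of gathering those copies (the writer hook that consumes them is hypothetical):
#include "llvm/BasicBlock.h"
#include "llvm/Instructions.h"
#include <vector>
using namespace llvm;

// One register copy that must be emitted in a predecessor's state.
struct PhiCopy {
  BasicBlock *pred;   // state whose final cycle performs the copy
  PHINode    *phi;    // destination register
  Value      *val;    // incoming value from that predecessor
};

// For every phi in BB, record the copy "phi <= value" that belongs in each
// predecessor's state.
static void collectPHICopies(BasicBlock *BB, std::vector<PhiCopy> &copies) {
  for (BasicBlock::iterator I = BB->begin(); isa<PHINode>(&*I); ++I) {
    PHINode *PN = cast<PHINode>(&*I);
    for (unsigned i = 0, e = PN->getNumIncomingValues(); i != e; ++i) {
      PhiCopy c = { PN->getIncomingBlock(i), PN, PN->getIncomingValue(i) };
      copies.push_back(c);
    }
  }
}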
Another idea I've had is to use Verilog blocking assignments within a state. This will be great because we won't have to worry about data dependencies within a state. One thing to make sure of is that a load from RAM can't be used within the same state.
I've read the GraphTraits class that defines the CFG between basic blocks. I've realized that this simply wraps what's already there: successors come from the block's terminator instruction, and predecessors come from the block's use list.
What's interesting is that even BasicBlocks have a use-def chain. In fact every Value has a use-def chain.
The main data structure of LLVM: doubly-linked lists. Instructions within basic blocks, basic blocks within functions, and functions within modules.
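A quick reference sketch for walking those lists and the CFG edges on top of them (nothing project-specific here; header paths follow the LLVM tree of this era):
#include "llvm/Module.h"
#include "llvm/Function.h"
#include "llvm/BasicBlock.h"
#include "llvm/Support/CFG.h"
#include "llvm/Support/raw_ostream.h"
using namespace llvm;

static void walk(Module &M) {
  for (Module::iterator F = M.begin(); F != M.end(); ++F)             // functions
    for (Function::iterator FI = F->begin(); FI != F->end(); ++FI) {  // basic blocks
      BasicBlock *BB = &*FI;
      for (BasicBlock::iterator I = BB->begin(); I != BB->end(); ++I) // instructions
        errs() << *I << "\n";
      // CFG edges come from llvm/Support/CFG.h:
      for (succ_iterator S = succ_begin(BB), SE = succ_end(BB); S != SE; ++S)
        errs() << BB->getName() << " -> " << (*S)->getName() << "\n";
    }
}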
— Andrew Canis 2009/12/06 09:26
To generate llvm doxygen:
llvm/$ cd docs
llvm/$ make doxygen
Look at: llvm/Support/CFG.h
Print out a dot graph of the CFG:
F.viewCFG(); F.viewCFGOnly();
Found a good STL c++ reference:
— Andrew Canis 2009/12/04 23:32
Cool dot graph at this site: http://compilers.cs.ucla.edu/fernando/projects/soc/proposal.html
So here's my issue: tags in gvim suck. For instance, let's say I want to find the class Value:
:ts Value # scroll through 50 lines, find that it's 20, hit q, then 20<enter>
Now I'm looking at the Value class, I notice a print() method. I try ctrl-]. Get back 100 matches, none of which are remotely correct.
So the problem here is that the prototype is defined as llvm::Value::print, but the declaration is simply Value::print (inside a using namespace llvm). So it can't be found… Is this a bug in ctags? Nope. Changed the declaration to be llvm::Value::print, still ctrl-] doesn't prioritize it.
Read this: http://cscope.sourceforge.net/cscope_vim_tutorial.html
To use cscope in vim. :h cscope-howtouse
$ cscope -b -q `getsrcs.sh`
Now just type <C-X> <C-O> to open auto complete dialog. :h omnicppcomplete
Added:
map <C-F12> :!ctags -R --c++-kinds=+p --fields=+iaS --extra=+q --languages=c++ .<CR>
Found a really cool vim plugin:
You have to build your ctags database with at least the following options:
--c++-kinds=+p : Adds prototypes in the database for C/C++ files.
--fields=+iaS : Adds inheritance (i), access (a) and function signatures (S) information.
--extra=+q : Adds context to the tag name. Note: Without this option, the script cannot get class members.
Just some commands to remember for browsing vim tags: type 'g]' to select a tag from the list (hit 'q' if there are too many); ctrl-t jumps back after using ctrl-].
Added 's' command in visual mode for adding code tags around selection:
:vmap s d<S-o><code><enter><\/code><esc>kp
Added the 'doku' command to my .bashrc:
alias doku='gvim -u ~/.vim/doku.vim'
And added the following to doku.vim:
source ~/.vim/dokuvimki.vim
:DWAuth
:DWEdit andrew_s_log
Wow, really really cool. I can edit this wiki using a vimplugin: http://www.stumbleupon.com/su/1b3ysr/www.chimeric.de/blog/2008/0314_dokuwiki_xml-rpc_g_vim_dokuvimki
:DWAuth
:DWEdit andrew_s_log  # edit
:DWSend
:help dokuvimki
Found dokuwiki shortcut keys: http://www.dokuwiki.org/accesskeys ctrl-shift e to edit. ctrl-shift s to save.
llvm printing is done in lib/VMCore/AsmWriter.cpp
Trying to figure out why LLVM uses a custom StringRef class instead of std::string. Turns out the reason is that std::string always stores the string value on the heap, which can cause problems in multithreaded environments where there is contention for the global heap (http://www.ddj.com/cpp/184405453). Same with Twine. Note: neither StringRef nor Twine stores the actual string data, so don't try to store them.
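A small illustration of that last point, just to remind myself why storing (or returning) a StringRef is dangerous:
#include "llvm/ADT/StringRef.h"
#include <string>

llvm::StringRef dangling() {
  std::string local = "legup";
  return llvm::StringRef(local);   // BUG: 'local' dies here, so the StringRef
                                   // now points at freed memory
}

llvm::StringRef fine(const std::string &keptAlive) {
  return llvm::StringRef(keptAlive);  // OK only while 'keptAlive' outlives the ref
}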
Blocked legup.org from google for now (robots.txt):
User-agent: *
Disallow: /
— Andrew Canis 2009/12/03 21:53
Had to add daemon=yes to my gitosis.conf file for git-daemon to work. And had to add this to gitweb.conf so gitweb=no works:
# Point to projects.list file generated by gitosis.
$projects_list = "/home/git/gitosis/projects.list";
There's a difference between git-daemon and gitosis access. To get read-only access using git-daemon run:
git clone git://legup.org:7326/legup legup
For push access (over ssh), add your ssh key to gitosis-admin and then run:
git clone git@legup.org:legup legup
Playing around with the gitosis permissions. It's helpful to look at /var/log/git-daemon. Getting the error: 'receive-pack': service not enabled. Okay, this is expected; git-daemon is read-only by default. So I still need to configure gitosis properly. Following: http://vafer.org/blog/20080115011413
acanis@acanis-desktop:~$ git clone git@legup.org:gitosis-admin gitosis-admin
Just installed the new VirtualBox (3.1). Adds support for live migration and branched snapshots
You can now run any previous snapshot and create branches from previous snapshots. This is a huge win, because we don't have to make a new VM anymore every time we want to virtualize another fresh Ubuntu environment. We can just keep around a single snapshot of a clean install and branch it as needed.
— Andrew Canis 2009/12/02 23:28
Okay it's working. Latency 11240/2=5620. Fmax=125
# At t= 11240000 clk=1 ret_result= 0
So it appears the default: case of the switch() statement is not handled properly. If you look at b30s in the code, you can see the 16 cases of the switch statement: 32'h 21 (ADDU), 32'h 23 (SUBU), etc. However, the default case actually tries to test every single value not included in the switch statement, and misses tons of cases. I replaced the default case with what it should have been (true only if no other case was true):
ni1148_suif_tmp39 = ~(ni1125_suif_tmp23 | ni1126_suif_tmp24 | ni1127_suif_tmp25 | ni1131_suif_tmp26 | ni1135_suif_tmp27 | ni1136_suif_tmp28 | ni1137_suif_tmp29 | ni1138_suif_tmp30 | ni1139_suif_tmp31 | ni1140_suif_tmp32 | ni1141_suif_tmp33 | ni1142_suif_tmp34 | ni1143_suif_tmp35 | ni1144_suif_tmp36 | ni1145_suif_tmp37 | ni1146_suif_tmp38 );
I seem to be getting stuck in state 48:
# At t= 11197001 clk=1 thisState= 48 p_return_value_en=0 stateEn=1
r_pc is not quite the actual pc. It seems to increment once too many every time there is a branch. But this matches the behaviour of r_n_instr which only increments once in these cases.
dmem is getting written with the right final result. Here's the final write to dmem[7]:
... # At t= 10766000 clk=1 we=1 w_addr= 7 r_addr=56 din=00000026 dout=xxxxxxxx
Didn't create module for cregister. The .mif files forgot the ';' after END.
Still doesn't work. The shl() doesn't seem to be implemented properly.
# ** Fatal: (vsim-3734) Index value 32 is out of range 31 downto 0.
# Time: 695 ns Iteration: 2 Process: /tb_mips/mips_0/mips/line__1398 File: Z:/mips/hw/lib/impack.vhd
# Fatal error in ForLoop loop at Z:/mips/hw/lib/impack.vhd line 424
#
# HDL call sequence:
# Stopped at Z:/mips/hw/lib/impack.vhd 424 ForLoop loop
# called from Z:/mips/hw/mips_comp.vhd 1398 Architecture rtl
Needed to replace:
architecture rtl of main is
...
ni659_reg <= shl(r_reg, r_shamt);
with
library ieee;
use ieee.numeric_std.all;
architecture rtl of main is
...
ni659_reg <= shl(r_reg, to_integer(unsigned(r_shamt)));
From Windows modelsim:
vsim 4> do setup.tcl vsim 4> r vsim 4> run 10us # Cannot continue because of fatal error. # HDL call sequence: # Stopped at C:/altera/91/modelsim_ase/win32aloem/../altera/vhdl/src/altera_mf/altera_mf.vhd 39474 Subprogram read_my_memory # called from C:/altera/91/modelsim_ase/win32aloem/../altera/vhdl/src/altera_mf/altera_mf.vhd 40968 Process memory #
ImpulseC can't generate testbenches in verilog. My modelsim on linux doesn't have a vhdl simulation license. To make a project from linux:
acanis@acanis-desktop:~/vmshare/mips$ icProj2make.pl mips.icProj
# Make targets supported: clean, build_exe, build, build_testbench, export_hardware, export_software
make -f _Makefile build
Interesting command called impulse_s2xml that converts the binary IR into an xml format (Fir51.xic)
ImpulseC license isn't working in my VirtualBox. The MAC address is different. Click gear icon beside Network→NAT to change MAC address. Works.
So there is no actual GUI for ImpulseC in linux: http://impulse-support.com/forums/index.php?showtopic=832&hl=linux
The documentation for the Linux release is somewhat minimal, but you can find a section entitled “Command Line Tools” that explains how to create Makefiles from .icProj files, which are then used to build various targets: desktop simulation executable, HDL generation, and exporting hardware and software files.
To create a project under Linux, create a .icProj file using a text editor. Modifying an existing file is a good way to start. The key/value pairs used for the project options correspond to the same fields shown in the CoDeveloper GUI environment under Windows, so you can refer to the CoDeveloper documentation for information on these options.
Remember to source: ~/impulsec/codeveloper/Impulse/CoDeveloper3/codeveloper-profile.sh
Getting ImpulseC to work. Also installing ImpulseC and Quartus on my Windows VirtualBox. Had to add:
SERVER acanis-desktop 00044b190164 <port number>
To license file. Still get an error when running:
acanis@acanis-desktop:~/impulsec/FLEXnet/v11.5/i86_re3$ sudo ./lmgrd -c /home/acanis/impulsec/codeveloper/Impulse/CoDeveloper3/bin/CoDeveloper.lic
....
16:56:04 (lmgrd) impulsed already running 27001@acanis-desktop
16:56:04 (lmgrd) The license server manager (lmgrd) is already serving all vendors, exiting.
— Andrew Canis 2009/12/01 16:45
Code for testing ac_fixed:
#include "ac_fixed.h" #include <iostream> using namespace std; int main() { const float num = 1.888888; ac_fixed<16,3,true> a = num; ac_fixed<16,10,true> b = num; ac_fixed<16,16,true> c = num; cout << a.to_string(AC_DEC) << "\n"; cout << b.to_string(AC_DEC) << "\n"; cout << c.to_string(AC_DEC) << "\n"; }
Found another xPilot bug. In the xPilot-prj.tcl file, set_cycle_time cannot be larger than 8 (otherwise the compiler gets stuck in an infinite loop). Also the “-unit ns” means nothing, xPilot always uses ns.
Just discovered how to use TbGen.tcl:
~/xpilot/xpilot-rel/samples/mips/auto_work/systemc$ TbGen.tcl main.tbgen.tcl
Found a great html validation site: http://www.htmlhelp.com/tools/validator/
mips.c error about printf():
Error: Cannot find the CDFG: printf, you may import a design first. Shell-4 ERROR: CDFG "printf" not imported
— Andrew Canis 2009/11/30 14:16
Installed Windows and Office '07 in a VirtualBox VM so I can create the powerpoint presentation for Wednesday.
CHStone benchmarks are on a Virtex 4. Slice: 2 4-LUTs, 2 2-input MUXes, 2 registers.
— Andrew Canis 2009/11/29 22:28
Trying dfadd. If I use a void pointer cast in ullong_to_double() instead of a union then xPilot synthesizes without an assertion. Strange: the printf() in main compiles fine! This didn't happen for mips.c.
— Andrew Canis 2009/11/28 22:09
Okay finally working. My comments:
always @(posedge clk) __main_134_done <= __main_134_start;
Best-case latency (# of clock cycles): 83
Average-case latency (# of clock cycles): 153
Worst-case latency (# of clock cycles): 178
- Synthesized datapath summary ------------------------------------------------------------
* Number of ports: 4
* Number of functional units: 165
* Number of mults: 1
* Number of addsubs: 10
* Others = 154
* Number of registers: 32
* Number of register files: 0
* Equivalent number of 2-to-1 multiplexers: 111
* Number of nets: 443
I made another mistake:
ram[address0] <= d0;
$writememh(write, ram);
Wasn't printing the change to ram (nonblocking assignment). Changed to:
ram[address0] = d0;
$writememh(write, ram);
dmem_q0 = ffff ffef, outdata_q0 = 0fff ffef, when they should be identical. Damn, that's my fault: negative values in outdata were not 32 bits wide, and $readmemh doesn't sign extend.
$readmemh is supported but using a parameter as the filename gives an error! So you have to set the file to a constant string literal.
Got the program counter of the verilog simulation matching the gcc mips.c version perfectly. There was a problem with Reg_239 being a signed 5 bit number, which meant when it was assigned to the 6 bit address it was sign extended, making 31 (011111) → 63 (111111).
But the result is still incorrect. I need to stop the simulation as soon as done=1. There are lots of instances of size truncation in the design; I'm wondering whether a similar problem is affecting the dmem RAM. The results are very close: dmem is correct except for the last entry, which is 0x16 instead of 0x26.
— Andrew Canis 2009/11/27 13:44
Cool, just read that Quartus supports the $readmemb and $readmemh system commands in Verilog to initialize memories with a text file.
Okay, I can get ret_result to -8 if I change the input data to match mips.c (I was using a different input vector). This is due to the fact that 3 is the same between input and output. So, the n_inst is also wrong. Can I print it? However, dmem is still identical to the input vector.
Setting a = outdata, I get ret_result = -1 (from n_inst)
Added some display lines to main.v, I'm always in the same state:
# At t= 431 clk=0 CS=1010001
parameter ST_81 = 7'b1010001;
Okay, it was a problem with my ram code. Now I get a result but it's wrong:
# At t= 458 clk=1 main_result_o=11111111111111111111111111110111 A_address0=1000 A_ce0=0 A_q0=00000000000000000000000000000001, imem_ce0=0 imem_address0=101011 imem_q0=00000011111000000000000000001000, outData_ce0=0 outData_address0=1000 outData_q0=00000000000000000000000000001000, ret_result=-9 done=1
Had to create my own ram in memory. Seems to be reading A and imem properly, but I never get a result (even after 500k clock cycles):
# At t= 939178000 clk=1 main_result_o=00000000000000000000000000000000 A_address0=1000 A_ce0=0 A_q0=00000000000000000000000000000001, imem_ce0=0 imem_address0=011100 imem_q0=00010001000000000000000000001011, outData_ce0=0 outData_address0=xxxx outData_q0=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, ret_result=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx done=0
Nope. It's not a xilinx vs altera thing. dmem and reg refer to arrays in my mips.c code…
Damn. Same prob with both verilog and vhdl:
Error (10481): VHDL Use Clause error at main.vhd(504): design library "work" does not contain primary unit "N_n_main
I think this might be a problem with using Altera/Stratix as the target. Let's switch back to Xilinx. Nope, that didn't work; switching back to Altera.
Can't find dmem.h reg.h N_n_main.h anywhere in the xPilot directory. So I'm just going to move to verilog for now.
Working on xPilot for meeting today. Can't get systemC to compile:
acanis@acanis-desktop:~/xpilot/xpilot-rel/samples/mips/auto_work/systemc$ g++ -I. -I/home/acanis/xpilot/xpilot-rel/tools/systemc-2.1.v1/src -o sim_main main.cpp -lsystemc -L/home/acanis/xpilot/xpilot-rel/tools/systemc-2.1.v1/lib
— Andrew Canis 2009/11/25 11:07
So, the makefile isn't detecting changes to llvm/tools. What about lib? Same thing: I touch a file in lib, run make from legup, and nothing happens. Then I run make from llvm-objects and the file is recompiled. It's probably because those makefiles aren't included in autoconf/configure.ac: AC_CONFIG_MAKEFILE(tools/legup/Makefile) etc.
Trying to figure out LLVM build system. I modify ../llvm/tools/llc.cpp from within legup and rerun make, nothing happens. But from llvm-object I get:
.. make[1]: Entering directory `/home/acanis/legup/llvm-objects/tools/llvm-config' llvm[1]: Regenerating LibDeps.txt.tmp make[2]: Entering directory `/home/acanis/legup/llvm-objects/tools/llc' llvm[2]: Compiling llc.cpp for Release build llvm[2]: Linking Release executable llc (without symbols) llvm[2]: ======= Finished Linking Release Executable llc (without symbols)
ImpulseC License is expired:
acanis@acanis-desktop:~/impulsec/FLEXnet/v11.5/i86_re3$ 13:12:29 (lmgrd) FLEXnet Licensing (v11.5.0.0 build 56285 i86_re3) started on acanis-desktop (linux) (11/24/2009) 13:12:29 (lmgrd) Copyright (c) 1988-2007 Macrovision Europe Ltd. and/or Macrovision Corporation. All Rights Reserved. 13:12:29 (lmgrd) US Patents 5,390,297 and 5,671,412. 13:12:29 (lmgrd) World Wide Web: http://www.macrovision.com 13:12:29 (lmgrd) License file(s): /home/acanis/impulsec/codeveloper/Impulse/CoDeveloper3/bin/CoDeveloper.lic 13:12:29 (lmgrd) lmgrd tcp-port 27001 13:12:29 (lmgrd) Starting vendor daemons ... 13:12:29 (lmgrd) Started impulsed (internet tcp_port 57454 pid 18294) 13:12:29 (impulsed) FLEXnet Licensing version v11.5.0.0 build 56285 i86_re3 13:12:29 (impulsed) EXPIRED: codeveloper 13:12:29 (impulsed) EXPIRED: appmonitor 13:12:29 (impulsed) EXPIRED: build_hdl 13:12:29 (impulsed) EXPIRED: stagemaster 13:12:29 (impulsed) EXPIRED: genvhdl 13:12:29 (impulsed) EXPIRED: genvlog 13:12:29 (impulsed) EXPIRED: smexplorer 13:12:29 (impulsed) EXPIRED: cyclesim 13:12:29 (impulsed) EXPIRED: covalidator 13:12:29 (impulsed) License server system started on acanis-desktop 13:12:29 (impulsed) No features to serve, exiting 13:12:29 (impulsed) EXITING DUE TO SIGNAL 36 Exit reason 4 13:12:29 (lmgrd) impulsed exited with status 36 (No features to serve) 13:12:29 (lmgrd) impulsed daemon found no features. Please correct 13:12:29 (lmgrd) license file and re-start daemons.
view symbols in static/shared library:
nm -g Release/lib/libLLVMVerilog.a
xPilot synthesis is done with an LLVM optimization pass:
cd xpilot-rel/samples/samples/enc83/auto_work
opt -load ../../../../../xpilot-rel/bin/libxPilotHwSynPass.so -scalarrepl -mem2reg -raise -disaggr -loop-dep -array-flatten -instcombine -gcse -simplifycfg -xpilot -tcl xPilot.work.tcl -legalize tmp/enc.o.opt.presyn.bc
./configure --with-llvmsrc=$PWD/../llvm --with-llvmobj=/home/acanis/legup/llvm-objects/
I’m a huge fan of the git source control system for many reasons. http://whygitisbetterthanx.com/
Basing the legup repository on clang.
— Andrew Canis 2009/11/23 13:41
xPilot chstone:
BWA-11 ERROR: Pointer exception in: %bpl_addr.0 = getelementptr int* %bpl, uint %inc.0.sum ; <int*> [#uses=1] Shell-4 ERROR: Unsynthesizable Pointer found.
BWA-11 ERROR: Pointer exception in: %tmp.6 = getelementptr short* %s, uint %indvar ; <short*> [#uses=1] Shell-4 ERROR: Unsynthesizable Pointer found.
opt: ArrayFlatten.cpp:683: bool pass::ArrayFlattenPass::flattenMultiDimArrayGEP(llvm::GetElementPtrInst*, llvm::Value*): Assertion `IsZeroConst(*oi)' failed.
opt: /curr/irene/project/xpilot_repos/pkg/llvm-1.7/../../pkg/llvm-1.7/llvm/lib/VMCore/Value.cpp:157: void llvm::Value::replaceAllUsesWith(llvm::Value*): Assertion `New->getType() == getType() && "replaceAllUses of value with new value of different type!"' failed.
BWA-11 ERROR: Pointer exception in: %tmp.17 = getelementptr int* %key, int %tmp.16 ; <int*> [#uses=1] Shell-4 ERROR: Unsynthesizable Pointer found.
BWA-11 ERROR: Pointer exception in: %p2.0 = getelementptr uint* %s2, uint %indvar ; <uint*> [#uses=1] Shell-4 ERROR: Unsynthesizable Pointer found.
BWA-11 ERROR: Pointer exception in: %buffer_addr.0 = getelementptr ubyte* %buffer, uint %tmp. ; <ubyte*> [#uses=2] Shell-4 ERROR: Unsynthesizable Pointer found.
Setting up git repo. I need two submodules: llvm, and llvm-gcc.
git submodule add git://github.com/earl/llvm-mirror.git llvm
git submodule add git://repo.or.cz/llvm-gcc-4.2.git llvm-gcc-4.2
Okay, I've decided not to put these in the repo. I'm going to setup the project like clang. Put it in llvm/tools.
I need a simple test suite that will run a testbench for the SRA circuit and confirm the result.
After that I can setup buildbot and nightly snapshots and then cleanup the website color scheme and I'm done.
After that:
— Andrew Canis 2009/11/22 09:27
Setup bugzilla: legup.org/bugs
Setting up the git repository: http://vafer.org/blog/20080115011320
Setup auto updates: “System” menu, then “Administration”, then “Software Sources”. Open up the “Updates” tab and select “Automatic updates”; also select “Install security updates without confirmation”.
Added subdomain http://lists.legup.org. Followed: http://blog.agdunn.net/?p=162 to setup the apache2 virtual site for the list subdomain in /etc/apache2/sites-enabled/020-mailman
Mail server setup. Created a subdomain for my MX record: smtp.legup.org, pointed at my fixed ip. Following:
To change mailman password: sudo mmsitepass -c <pass>
Setting up a webpage at legup.org. I need a website, git repo, mailing list, bugzilla, buildbot, and blog for now.
— Andrew Canis 2009/11/21 07:33
Looking into gcc now. The GIMPLE traversal is much messier!
A simple LLVM backend is done. Very impressed by the C++ api.
Very useful site: http://llvm.org/doxygen/
Needed to modify llc.cpp to change the file extension to .v for -march=v (verilog)
What does this mean while making?
llvm-config: unknown component name: veriloginfo
llvm-config: unknown component name: veriloginfo
Okay, it was a problem with my backend Makefiles.
To see actual commands while running llvm make:
make TOOL_VERBOSE=1
Figured out how the LLVMInitializeVerilogTargetInfo() gets called. Look in Target/TargetSelect.h:
#define LLVM_TARGET(TargetName) void LLVMInitialize##TargetName##TargetInfo();
I was accidentally calling the backend 'Verilog' in some places and 'VerilogBackend' in other places. Also take a look at: llvm-objects/include/llvm/Config/Targets.def
To compile only llc, use utils/makellvm. Actually utils has a lot of really useful scripts, for instance 'llvmgrep'.
acanis@acanis-desktop:~/work/llvm/llvm-svn/lib/Target/Verilog$ /home/acanis/work/llvm/llvm-svn/utils/makellvm -obj ~/work/llvm/llvm-objects/ llc
Added a new backend: lib/Target/Verilog. Had to modify autoconf/configure.ac to add Verilog to TARGETS_TO_BUILD and I changed the required autoconf and libtool version in AutoRegen.sh. To regenerate the configure file: cd autoconf; ./AutoRegen.sh
Nice, my patch was rolled into LLVM! Development is really active. I posted the bug, within 30 min it was marked as a duplicate. Within an hour and a half I had a patch. It was fixed along with another bug and checked in that same day!
— Andrew Canis 2009/11/17 23:54
Looking at just the output files, gcc (gimple) is much easier to read than the llvm output. gcc even has a .cfg file that breaks up the basic blocks into a control flow graph.
gcc -fdump-tree-all outputs a .vcg file (Visualization of Compiler Graphs): http://rw4.cs.uni-sb.de/~sander/html/gsvcg1.html
Wow! Just wasted a lot of time. C operator precedence killed me:
I just realized a flaw in the SRA algo from the book:
Alright starting the HLS on llvm. The goal for tonight:
— Andrew Canis 2009/11/17 01:13
Gajski's university of california irvine tool (SCE) seems to use SpecC as its input specification. No download, but you can contact them for a cdrom.
Playing around with the de2. To get the demonstrations: http://www.terasic.com/downloads/cd-rom/de2/DE2_System_v1.6.zip
CBackend.cpp has grown by over 2000 lines in 5 years (since llvm-1.3) to a total of 3696 lines. There is an incredible amount of complexity in there, some of which I think is attributed to making gcc correctly compile the resulting C file. I saw sse2 instructions in there too.
Okay, so after diffing the changes made by Trident to llvm-1.3, they are trivial. vhdl.cpp is almost identical to Writer.cpp (the old CBackend code generator), with a few minor changes to make the C output easier to parse (i.e. renaming the type “unsigned long long” to “ulong”). The Trident tool 'llv' is identical to 'llc' but hardcodes the optimization passes that are run (including some new Trident passes), and has a special case when you specify -march=v (vhdl) to rename the output file extension to .llv. So in terms of llvm modifications, there were basically none. They just had to implement an llvm parser on the Sea Cucumber Java side. Note: the CBackend.cpp file has changed substantially since 1.3. It would be very difficult to update Trident's vhdl.cpp.
So I think the best approach is to first get the trident llvm compiling under the latest llvm version. The directory trident/compiler/llvm-1.3 is setup as an LLVM project:
Useful article on basic blocks: http://gcc.gnu.org/onlinedocs/gccint/Basic-Blocks.html
Interesting, seems to be a guy, Elvis Dowson who tried to get trident working with llvm v1.5.
He quotes price ranges of $145000 to $170000 for a commercial HLS compiler license.
llvm 1.5 had a tool called llv. What did that correspond to? Okay no, trident implemented a new tool called llv ('v' probably stands for vhdl):
acanis@acanis-desktop:~/trident/compiler/llvm-1.3$ find . -name \*.cpp
./lib/float_passes/FloatLoopUnroll.cpp
./lib/float_passes/LowerPHI.cpp
./lib/float_passes/RenameDuplicateVars.cpp
./lib/vhdl/vhdl.cpp
./tools/llv/main.cpp
lib/vhdl/vhdl.cpp is the backend for generating vhdl. This could be useful.
Okay, so llvm 1.5 is hopelessly out of date. It can't even compile stdlib.h!!
acanis@acanis-desktop:~/trident/llvm-obj$ ../cfrontend/x86/llvm-gcc/bin/llvm-c++ test.cpp test.cpp:2:20: while reading precompiled header: No such file or directory In file included from test.cpp:2: stdlib.h:278: error: expected constructor, destructor, or type conversion stdlib.h:278: error: expected `,' or `;' stdlib.h:283: error: expected constructor, destructor, or type conversion stdlib.h:283: error: expected `,' or `;' stdlib.h:288: error: expected constructor, destructor, or type conversion stdlib.h:288: error: expected `,' or `;' stdlib.h:297: error: expected constructor, destructor, or type conversion stdlib.h:297: error: expected `,' or `;'
Damn. Still doesn't compile. There is a linker error. Probably because g++-3.4 isn't supported on this Ubuntu version; my binutils is probably out of date.
Trident doesn't work with the latest llvm. Had to download the older 1.5 versions (and compile with g++ 3.4):
tar xvzf llvm-1.5.tar.gz
tar xvzf cfrontend-1.5.i686-redhat-linux-gnu.tar.gz
cd cfrontend/x86/
./fixheaders
cd ../..
export PATH=$PWD/cfrontend/x86/llvm-gcc/bin/:$PATH
mkdir llvm-obj
cd llvm-obj
../llvm/configure CXX=g++-3.4
make
For some reason the header files were missing stdlib.h and string.h includes. Run this .sh script in llvm/include to add them to all the header files:
#!/bin/sh
find -maxdepth 2 -type f | while IFS= read vo
do
    echo "#include<stdlib.h>
#include<string.h>
" > .tmp
    cat "$vo" >> .tmp
    mv .tmp "$vo"
done
Tons of issues compiling. This is taking too long. Trying to compile with g++-3.4.
Compiling trident:
cd trident/compiler
export CLASSPATH=/usr/share/java/antlr.jar:/usr/share/java/gnu-getopt.jar
export LLVMGCCDIR=~/work/llvm/llvm-gcc-4.2-install
ant
Figured out how to get the “llvm-” prefix on llvm-gcc: configure option --program-prefix=llvm-
Just found a very interesting open source HLS tool called Trident: http://trident.sourceforge.net Created by the Los Alamos National Laboratory. The last commit was Nov 2006. It's written in Java but uses LLVM as the front-end. Based on Sea Cucumber: a synthesizing compiler mapping Java byte-code to FPGAs. Very interesting. This looks like the first GPL tool I've found. Too bad it's in Java. There are two papers:
The most interesting thing about Trident is it supports floating point algorithms.
Looking at lib/Target/CBackend/. We need to subclass TargetMachine and implement WantsWholeFile() and addPassesToEmitWholeFile().
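Roughly what that subclass could look like, modeled on CBackend's CTargetMachine. A sketch only: the exact base-class constructor and hook signatures depend on the LLVM revision, and createVerilogWriterPass() is a hypothetical ModulePass factory for our writer:
#include "llvm/PassManager.h"
#include "llvm/Target/TargetMachine.h"
#include "llvm/Support/FormattedStream.h"

namespace llvm {

// Hypothetical factory for the pass that prints the whole module as Verilog.
ModulePass *createVerilogWriterPass(formatted_raw_ostream &Out);

struct VerilogTargetMachine : public TargetMachine {
  explicit VerilogTargetMachine(const Target &T) : TargetMachine(T) {}

  // Ask llc to hand us the whole module instead of per-function codegen.
  virtual bool WantsWholeFile() const { return true; }

  virtual bool addPassesToEmitWholeFile(PassManager &PM,
                                        formatted_raw_ostream &Out,
                                        CodeGenFileType FileType,
                                        CodeGenOpt::Level OptLevel) {
    PM.add(createVerilogWriterPass(Out));
    return false; // false = no error
  }
};

} // end namespace llvm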
Due to the bug listed below, I've temporarily reverted to llvm revision 88706:
$ cd llvm-svn $ svn update -r 88706
Created legup.sourceforge.net just in case.
— Andrew Canis 2009/11/15 19:08
From the LLVM documentation on functions:
A function definition contains a list of basic blocks, forming the CFG (Control Flow Graph) for the function. Each basic block may optionally start with a label (giving the basic block a symbol table entry), contains a list of instructions, and ends with a terminator instruction (such as a branch or function return).
The first basic block in a function is special in two ways: it is immediately executed on entrance to the function, and it is not allowed to have predecessor basic blocks (i.e. there can not be any branches to the entry block of a function). Because the block can have no predecessors, it also cannot have any PHI nodes.
So the first question, should HLS be implemented as a backend or as an optimization pass? Do we have access to alias analysis etc. in the backend?
Damn, updated llvm from svn and now llvm-gcc won't compile. Filed a bug: http://llvm.org/bugs/show_bug.cgi?id=5497 At least I got to learn how to file a bug (http://llvm.org/docs/HowToSubmitABug.html) and use bugpoint. git bisected the error down to a commit a few days ago.
Dominators: in control flow graphs, a node d dominates a node n if every path from the start node to n must go through d.
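LLVM computes this for us; a pass can require the dominator tree and just query it. A sketch, assuming the Dominators analysis header of this LLVM version:
#include "llvm/Analysis/Dominators.h"
#include "llvm/BasicBlock.h"
using namespace llvm;

// True if every path from the function's entry block to B goes through A.
// DT comes from getAnalysis<DominatorTree>() in a pass that required it.
static bool blockDominates(DominatorTree &DT, BasicBlock *A, BasicBlock *B) {
  return DT.dominates(A, B);
}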
svn updated my llvm and llvm-gcc4.2 and recompiled. Looking over LLVM documentation:
LLVM's whole contribution is to be a “cleaner” IR. They use the gcc front-end to parse C/C++, convert GIMPLE to LLVM IR, then output assembly and let gcc compile the assembly. So you can't even call LLVM a compiler; it's simply an IR, or compiler infrastructure.
Done reading the book chapter. Great overview of HLS. I'm going to implement a simple time constrained scheduler for the square-root approximation in the book. The question still remains, gcc or llvm? Just implement it twice, first on llvm (probably will be easier) then on gcc. I'd like this to be in C++. Work from svn versions of both gcc and llvm.
— Andrew Canis 2009/11/15 02:17
Reading a book chapter on high level synthesis from Gajski et al., “Embedded System Design: Modeling, Synthesis and Verification”. Springer. August 24, 2009
The chapter looks at various examples of transforming C code to hardware.
— Andrew Canis 2009/11/14 05:46
I created a wordpress blog on my local machine: http://128.100.241.23/wordpress/
Okay, I'm going to hold off on the blog for now. I've created a google group for legup: http://groups.google.com/group/legup
I'm thinking it's a good idea to replace this wiki log with a blog. Why? Keep track of dates better? It'll be good for people to keep track of my progress (Desh, Jason, Steve, Mark). It'll help me write a bit better and organize my ideas. I can still keep a wiki like this if I want, but it's better to organize my writing a little more before posting it.
I've decided to use Google Code. It lacks git support but I can just use git-svn or self-host the git repo on eecg. It has support for blogs, and google groups (mailing lists). Damn, name is already taken. Ok, I'll use github. Setup: http://github.com/acanis/legup It even has an issue tracker!
Just found an interesting project page: SableCC.org. They use trac to keep a wiki. They also have a mailing list, bug tracker, and git repo. I need something like this.
Read about SIMPLE IR that GIMPLE was based on: L. Hendren, C. Donawa, M. Emami, G. Gao, Justiani, and B. Sridharan. Designing the McCAT compiler based on a family of structured intermediate representations. In Proceedings of the 5th International Workshop on Languages and Compilers for Parallel Computing, pages 406-420. Lecture Notes in Computer Science, no. 457, Springer-Verlag, August 1992.
— Andrew Canis 2009/11/13 01:15
Yes, this was the problem. The pass name “ssa” refers to pass_build_ssa, which is an early pass. What I want is for the plugin to run after “loopdone”.
Trying to see how the tree changes with auto-vectorization. Currently I don't see a change; I might be hooking in too early in the optimization passes.
../install/bin/gcc -fplugin=plugin.so -O2 vector.c -fdump-tree-all-details -ftree-vectorize -msse2
Where vector.c is:
#include <stdio.h>
int main() {
    int a[256], b[256], c[256];
    int i;
    for (i = 0; i < 256; i++) {
        a[i] = b[i] + c[i];
    }
    for (i = 0; i < 256; i++) {
        printf("%d %d %d\n", a[i], b[i], c[i]);
    }
}
To run high level synthesis plug-in:
cd ~/work/gcc/plugin
make
../install/bin/gcc -fplugin=plugin.so -O2 test.c -fdump-tree-all-details
cat *hls
Auto-vectorization is not trivial to implement. wc -l tree-vec* in gcc shows 20K lines of code.
So using the gcc plugin architecture is optional. But there are some big advantages listed here: http://gcc.gnu.org/wiki/GCC_Plugins The big one is you don't have to patch/bootstrap every new version of gcc, you can just grab the latest gcc binary and run your plugin without modification.
Good graph of different trees (GENERIC, GIMPLE, RTL): http://gcc.gnu.org/projects/tree-ssa/#ssa
Aside: Wow, I have been looking for this for years. Instrumenting cycle counts for a function: http://libplugin.sourceforge.net/tutorials/simple-gcc.html Okay this is similar to pfmon. Unfortunately my kernel doesn't support it.
Damn. Just realized that gcc has changed its optimization pass code to use “plug-ins”. See: http://ehren.wordpress.com/2009/11/04/creating-a-gcc-optimization-plugin/ I'm going to need the latest svn version of gcc. Recompiled svn gcc (had to run autoconf on configure.ac to get configure to work):
../gcc-svn/configure --prefix=$PWD/../install --enable-languages=c,c++ --enable-checking
Note --enable-checking. This was recommended when working with gcc trees.
p in passes.c is a linked list. Start it by setting p to an address, end it by setting p to NULL. These get executed in order by execute_pass_list(). Actually look right above init_optimization_passes() in passes.c for a good description of the flow.
Interesting discussion about merging gcc and llvm. Including a GIMPLE to LLVM translator…
Okay, this makes more sense now: if you look in llvm-gcc-4.2-svn you'll see ENABLE_LLVM define statements. There is a C++ file called llvm-convert.cpp that takes the GIMPLE AST and converts it to LLVM IR. The actual function is llvm_emit_code_for_current_function().
Unfortunately, high GIMPLE is converted to LLVM before any of the tree-ssa optimizations are run (which require low GIMPLE). So you lose gcc auto-vectorization.
Writing an optimization pass: http://gcc.gnu.org/wiki/WritingANewPass
tree.def and cp/operators.def describe the possible tree nodes in GIMPLE.
Simple optimization pass: http://gcc.gnu.org/ml/gcc-patches/2005-05/msg01061.html
SIMPLE: The IR that GIMPLE was based on: http://www.springerlink.com/content/7h14h08051754749/
— Andrew Canis 2009/11/12 00:38
Useful: http://gcc.gnu.org/readings.html
To print out GIMPLE c-like representation: gcc -fdump-tree-gimple. Handled by code in tree-dump.c
Used Simulink to generate a simple FIR filter and display the results in both the time and frequency domain. (work/simulink) There is a problem where my high pass filter (with fs=2nhz) is filtering out a 1khz signal. Can't figure out why. To design the filter use: >> fdatool. This tool can also generate Verilog with testbenches.
Downloaded Algorithmic C Datatypes from Mentor Graphics. Contains three templated datatypes: ac_int, ac_fixed, ac_complex.
LLVM auto-vectorization support is not finished. The last LLVM dev mailing list post from Andreas Bolka (the Google summer of code applicant that took on the project) was April 1, 2009.
To create a CDFG using GAUT run:
/home/acanis/GAUT_2_4/GautC/cdfgcompiler/bin/cdfgcompiler -S -c2dfg -O2 -I /home/acanis/GAUT_2_4/GautC/lib -I. test.c
— Andrew Canis 2009/11/10 13:09
Met with Karen Tam. She was confused about the bounding box intersection calculation used to find the location of each pseudo pin. Also explained how all forces are calculated using the new positions expected in the next iteration, not the current positions. Overall a productive meeting. The code is installed and running on her account.
Installed Ubuntu 9.04 32-bit version. I was having too many problems with 64-bit linux.
Compiled gcc-4.3:
$ sudo apt-get install gcc-4.3-source
$ cd /usr/src/gcc-4.3/
$ tar xvjf gcc-4.3.3.tar.bz2
$ mkdir obj
$ mkdir install
$ cd obj
$ ../gcc-4.3.3/configure --prefix=$PWD/../install --enable-languages=c,c++
$ make -j4
Emailed Philippe about GAUT, apparently he's just cleaning up the source and will release it.
Read this:
— Andrew Canis 2009/11/06 02:15
Found a free book on DSPs:
— Andrew Canis 2009/11/03 01:49
Installed VirtualBox. It's really slick: you can mount the host O/S home directory. I think it's better than VMware.
Turns out I can just recompile the libtcl8.4.so and libelf.so.1 files using apt-get source, adding -m32 to the configure script, and adding soft links in xpilot/tools/bin. So xPilot is working on my 64-bit machine.
— Andrew Canis 2009/10/30 16:56
SPARK looks very interesting. Currently reading Sumit Gupta's phD thesis from UC Irvine.
xPilot binaries don't work for 64-bit. Missing libelf.so.1. Installing VirtualBox to run a 32-bit version of Ubuntu:
Submitted a request for GAUT, another HLS tool
— Andrew Canis 2009/10/28 16:07
Something wrong with my DE2 nios, the JTAG debug module isn't appearing
> jtagconfig 1) USB-Blaster [USB 1-1.6] 020B40DD
Okay, working. jtag was messed up. Followed the directions on:
Compile LLVM (ELLCC verison with Nios2 target support):
cd ~/work/ellcc/llvm-build
make -j4 ENABLE_OPTIMIZED=1
make -j4 ENABLE_OPTIMIZED=1 install
Compile LLVM gcc front-end:
export LLVMOBJDIR=/home/acanis/work/ellcc/llvm-install
cd ~/work/llvm/
mkdir obj
mkdir install
cd obj
../llvm-gcc4.2-x.y.source/configure --prefix=`pwd`/../install --program-prefix=llvm- \
    --enable-llvm=$LLVMOBJDIR --enable-languages=c,c++$EXTRALANGS $TARGETOPTIONS
make -j4 $BUILDOPTIONS
make -j4 install
Not compiling. Downloaded latest svn LLVM and LLVM-gcc and following the LLVM build instructions in:
— Andrew Canis 2009/10/14 01:49
Played around with cetus. It's really cool, it can automatically parallelize loops by detecting data dependencies and inserting openMP pragmas.
— Andrew Canis 2009/10/12 22:31
Added VPR to repo. Done reading the first warp processor journal paper. I'm going to download the NetBench, MediaBench, EEMBC, and Powerstone benchmarks used in the paper. Also I'll write up a summary of my thoughts on the paper.
— Andrew Canis 2009/09/30 10:58
Added milestones page. Reading over everything on Jason's HW Compile website. In particular looking into the warp processor.
— Andrew Canis 2009/09/29 09:30
Fixed the jtag server:
$ sudo mount -t usbfs /dev/bus/usb/ /proc/bus/usb/
$ killall jtagd
$ sudo <quartus install path>/bin/jtagd
$ jtagconfig
To run jtagd I had to create a /bin/arch script containing:
#!/bin/bash
uname -m
DE2 is now working again. These issues were due to a ubuntu update.
— Andrew Canis 2009/09/28 11:35
Got the Quartus license working. The license file is /opt/license.dat. Uncommented the lines in .bashrc setting LM_LICENSE_FILE=/opt/license.dat. Need to run lmgrd at startup as non-root:
— Andrew Canis 2009/09/28 10:48
Setup wiki.
— Andrew Canis 2009/09/27 16:00