andrew_s_log

It's very easy to install dokuwiki. Basically just extract the tarball over the existing installation.

Andrew Canis 2009/10/18 16:00

There is a bug in the “make hybrid” flow:

dfadd.o: In function `main':
(_main_section+0x4c): undefined reference to `float64_add'
dfadd.o: In function `main':
(_main_section+0x68): undefined reference to `float64_add'
make: *** [hybrid] Error 1

Looking in dfadd.sw.ll:

  %4 = call i64 bitcast (i64 (i64, i64)* @legup_wrap_float64_add to i64 (i64, i64 (i64, i64)*)*)(i64 %3, i64 (i64, i64)* @float64_add)

There is a reference to float64_add that shouldn't be there. Breaking down this function call:

  %4 = call i64 
        bitcast (i64 (i64, i64)* @legup_wrap_float64_add to i64 (i64, i64 (i64, i64)*)*)
        (i64 %3, i64 (i64, i64)* @float64_add)

What is that strange bitcast? Before the llvm-ld this was:

  %4 = call i64 @legup_wrap_float64_add(i64 %3, i64 (i64, i64)* @float64_add)

Before the sw pass it was:

  %4 = tail call i64 @float64_add(i64 %2, i64 %3)

So something is going wrong in the sw pass. It's a bug in ReplaceCallWith() in utils.cpp. The call:

%4 = tail call i64 @float64_add(i64 %2, i64 %3)

Becomes:

%4 = call i64 @legup_wrap_float64_add(i64 %3, i64 (i64, i64)* @float64_add)

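The rewritten call drops the first argument (%2) and passes @float64_add itself as the second argument, which is where the stray reference comes from. Okay, fixed it.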

Seeing another problem with aes when accelerating aes_main:

acanis@acanis-desktop:~/git/legup/examples/chstone_hybrid/aes$ export LEGUP_ACCELERATOR_FILENAME=aes;         ../../../llvm/Release+Asserts/bin/opt -legup-config=config.tcl -load=../../../cloog/install/lib/libcloog-isl.so -load=../../../cloog/install/lib/libisl.so -load=../../../llvm/tools/polly/Release+Asserts/lib/LLVMPolly.so  -load=../../../llvm/Release+Asserts/lib//LLVMLegUp.so -legup-sw-only < aes.prelto.bc > aes.prelto.sw.bc
opt: SwOnly.cpp:205: virtual bool legup::SwOnly::runOnModule(llvm::Module&): Assertion `0 && "Accelerated function is never called or optimized away!\n"' failed.

Andrew Canis 2011/10/05 12:15

/*-----------------------------------------------*
*           CLooG configuration is OK           *
*-----------------------------------------------*/
It appears that your system is OK to start CLooG compilation. You need
now to type "make". After compilation, you should check CLooG by typing
"make check". If no problem occur, you can type "make uninstall" if
you are upgrading an old version. Lastly type "make install" to install
CLooG on your system (log as root if necessary).
make -C cloog
make[1]: Entering directory `/home/acanis/git/new/legup/cloog'
CDPATH="${ZSH_VERSION+.}:" && cd . && /bin/bash /home/acanis/git/new/legup/cloog/autoconf/missing --run aclocal-1.11 -I m4
/home/acanis/git/new/legup/cloog/autoconf/missing: line 54: aclocal-1.11: command not found
WARNING: `aclocal-1.11' is missing on your system.  You should only need it if
         you modified `acinclude.m4' or `configure.ac'.  You might want
         to install the `Automake' and `Perl' packages.  Grab them from
         any GNU archive site.
CDPATH="${ZSH_VERSION+.}:" && cd . && /bin/bash /home/acanis/git/new/legup/cloog/autoconf/missing --run autoconf
 cd . && /bin/bash /home/acanis/git/new/legup/cloog/autoconf/missing --run automake-1.11 --foreign
/home/acanis/git/new/legup/cloog/autoconf/missing: line 54: automake-1.11: command not found
WARNING: `automake-1.11' is missing on your system.  You should only need it if
         you modified `Makefile.am', `acinclude.m4' or `configure.ac'.
         You might want to install the `Automake' and `Perl' packages.
         Grab them from any GNU archive site.
configure.ac:59: error: possibly undefined macro: AM_INIT_AUTOMAKE
      If this token and others are legitimate, please use m4_pattern_allow.
      See the Autoconf documentation.
configure.ac:73: error: possibly undefined macro: AC_PROG_LIBTOOL
configure.ac:75: error: possibly undefined macro: AM_CONDITIONAL
make[1]: *** [configure] Error 1

To fix this I just added back in the ./autogen.sh in the cloog directory

Andrew Canis 2011/10/04 12:15

Looking into why the xor-xor pattern in blowfish is taking up more registers.

        strict (no sharing) -> strict off
reg:    6853                -> 7318
aluts:  6795                -> 6575

Just sharing the size 5 patterns:

	Pattern Size: 5 (contents: addi32, xori32, addi32, xori32, xori32, )
	Frequency: 15
	Number of Pairs: 7

I get an improvement from:

;     Combinational ALUTs           ; 8,579 / 58,080 ( 15 % )                        ;
; Total registers                   ; 8389                                           ;
; Logic utilization                                                                 ; 10,935 / 58,080 ( 19 % )     ;

To:

;     Combinational ALUTs           ; 6,945 / 58,080 ( 12 % )                        ;
; Total registers                   ; 7494                                           ;
; Logic utilization                                                                 ; 9,586 / 58,080 ( 17 % )      ;

So a reduction of 895 registers, 1634 ALUTs, and 1349 in logic utilization.
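
The subtraction can be double-checked with a few lines of Python. This is only a sketch: the dictionary names are mine, and the numbers are copied from the two Quartus summaries above (before and after sharing only the size 5 pattern).

```python
# Numbers copied from the two Quartus summaries above. The names "before" and
# "after" are illustrative and not part of any LegUp script.
before = {"aluts": 8579, "registers": 8389, "logic_utilization": 10935}
after = {"aluts": 6945, "registers": 7494, "logic_utilization": 9586}

# Absolute reduction for each metric.
reduction = {key: before[key] - after[key] for key in before}

for key, saved in reduction.items():
    percent = 100.0 * saved / before[key]
    print(f"{key}: {saved} fewer ({percent:.1f}% reduction)")
```

The same dictionaries could be refilled with the size 3 and size 1 results to compare all three experiments side by side.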

Just sharing size 3 patterns:

Function: BF_encrypt
	Pattern Size: 3 (contents: addi32, xori32, addi32, )
	Frequency: 16
	Number of Pairs: 7


Function: BF_cfb64_encrypt
	Pattern Size: 3 (contents: ori32, ori32, ori32, )
	Frequency: 2
	Number of Pairs: 1

I see:

;     Combinational ALUTs           ; 7,796 / 58,080 ( 13 % )                        ;
; Total registers                   ; 7446                                           ;
; Logic utilization                 ; 9,706 / 58,080 ( 17 % )           ;

Only size 1 pairs

Function: BF_encrypt
	Pattern Size: 1 (contents: xori32, )
	Frequency: 50
	Number of Pairs: 23

	Pattern Size: 1 (contents: addi32, )
	Frequency: 32
	Number of Pairs: 15
Function: BF_cfb64_encrypt
	Pattern Size: 1 (contents: ori32, )
	Frequency: 6
	Number of Pairs: 3
Function: main
	Pattern Size: 1 (contents: ori32, )
	Frequency: 3
	Number of Pairs: 1

	Pattern Size: 1 (contents: addi32, )
	Frequency: 3
	Number of Pairs: 1
;     Combinational ALUTs           ; 7,168 / 58,080 ( 12 % )                        ;
; Total registers                   ; 7336                                           ;
; Logic utilization                                                                 ; 9,711 / 58,080 ( 17 % )           ;

Andrew Canis 2011/09/22 12:15

Look into the restrict keyword for pointers

TODO:
  - make sure srem/sdiv share together.
  - srem is only a problem with aes
  - function inlining would save 2 dividers in jpeg, 1 in sha, 1 in aes (assuming div/rem sharing)

Damn. sdiv and srem are used in the same state.

  - binding aware scheduling would be crucial here

Side note: C2H has some useful benchmarks

Andrew Canis 2011/09/15 12:15

Stefan just noticed a large drop in LEs in the quality of results page. It occurs for these commits (which doesn't make sense):

Stefan Hadjis [Thu, 25 Aug 2011 16:17:22 +0000]
    Fixed compilation error
Stefan Hadjis [Thu, 25 Aug 2011 15:54:48 +0000]
    Removed include Signals.h
Stefan Hadjis [Thu, 25 Aug 2011 15:40:26 +0000]
    Merge branch 'master' of legup.org:legup
Stefan Hadjis [Thu, 25 Aug 2011 14:50:44 +0000]
    Binding changes for new LLVM version
    Made small changes to be compatible with the new version of LLVM.

Maybe that's when I changed the quartus version? Build before: Version 9.1 Build 350 03/24/2010

Cycle geomean: 14576.0837656035
Fmax geomean: 80.3956562368825
Latency geomean: 181.702363878993
cat benchmark.csv
name time cycles Fmax LEs regs comb mults membits 
chstone/adpcm 407 31523 77.38 24284 10585 21786 300 27072 
chstone/aes 209 15716 75.18 21590 11386 18041 0 36800 
chstone/blowfish 2811 197978 70.43 15967 8368 14198 0 150240 
chstone/dfadd 6 804 124.98 10113 3911 9564 0 17056 
chstone/dfdiv 29 2256 78.27 18079 12521 12256 48 12416 
chstone/dfmul 3 291 107.33 5095 2382 4545 32 12032 
chstone/dfsin 1010 64433 63.80 33363 18077 26105 86 12832 
chstone/gsm 84 5358 63.54 19058 5813 17477 70 10144 
chstone/jpeg 33949 1323338 38.98 46485 19051 42071 240 468784 
chstone/mips 52 5118 98.11 5042 2044 4492 16 4480 
chstone/motion 54 6379 117.52 5449 2406 5000 0 33312 
chstone/sha 2843 233875 82.25 17015 8563 14004 0 134368 
dhrystone 82 7424 90.93 6893 3737 5611 4 2256 
program finished with exit code 0
elapsedTime=25113.470937

Build after: Version 9.1 Build 350 03/24/2010

Cycle geomean: 14576.0837656035
Fmax geomean: 84.5261619961444
Latency geomean: 173.020465101761
cat benchmark.csv
name time cycles Fmax LEs regs comb mults membits 
chstone/adpcm 410 31523 76.91 24100 10173 21358 172 27072 
chstone/aes 190 15716 82.52 19730 10508 16113 0 36800 
chstone/blowfish 2724 197978 72.69 13367 7684 11544 0 150240 
chstone/dfadd 6 804 134.68 8531 3879 7965 0 17056 
chstone/dfdiv 25 2256 89.90 14962 10736 9430 48 12416 
chstone/dfmul 3 291 101.49 4451 2147 3965 32 12032 
chstone/dfsin 943 64433 68.32 30048 16602 23622 86 12832 
chstone/gsm 69 5358 77.93 11112 5864 9598 52 10144 
chstone/jpeg 33026 1323338 40.07 40614 19051 36075 172 468784 
chstone/mips 53 5118 95.83 4244 1718 3871 16 4480 
chstone/motion 51 6379 125.57 4726 2322 4273 0 33312 
chstone/sha 2738 233875 85.43 15692 8657 12623 0 134368 
dhrystone 82 7424 90.43 6566 3673 5338 4 2256 
program finished with exit code 0
elapsedTime=21144.868643

No. It's caused by a combination of 1) VerilogWriter fix 2) A few more dividers might have been shared.
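
For reference, the "Cycle geomean" line in both reports can be reproduced from the cycles column. The sketch below assumes a plain geometric mean over all 13 benchmarks (including dhrystone); I have not checked how the reporting script actually computes it, and the function name is mine.

```python
import math

def geomean(values):
    """Geometric mean, computed via logs to avoid a huge intermediate product."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Cycle counts from the benchmark.csv listings above (identical in both builds).
cycles = [31523, 15716, 197978, 804, 2256, 291, 64433, 5358,
          1323338, 5118, 6379, 233875, 7424]

cycle_geomean = geomean(cycles)
print(f"Cycle geomean: {cycle_geomean:.4f}")
```

The same helper can be applied to the Fmax or LEs columns to compare the two builds on one number each.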

Just installed quartus 10.1 sp1. Took 40 minutes to compile dfsin with no_dsps. So the new version seems to be working.

Adding stratix4 to the buildbot:

buildmaster@acanis-desktop:~/buildbot/public_html/perf$ generate_perf.py 

And modifying dashboard/overview.html and dashboard/perf.html. Also need to modify process_log.py. Then restart the buildbot.

Actually very easy! Wow. Just noticed I wasn't backing up my buildmaster stuff. Just added it to the backup system. Updating the quartus version on buildbot up to 10.1sp1. Do I have to do something with sdc files? Also I have to fix benchmark.pl to actually work properly. Do I need sdc files? Yep otherwise you have a critical warning.

Andrew Canis 2011/09/14 12:15

TODO:
  - make sure srem/sdiv share together.
  - why are sdiv/srem with constant inputs being instantiated?

Turns out sharing between functions is actually more complicated than I originally thought. You need to instantiate the bound functional unit in the main module and then set up a mux between each instantiated module.

Just set up a branch for this half-done function inlining code (in ~/git/legup):

git checkout -b inlining
git commit -a

Cases where there are two sext/zext operations feeding an adder occur in: dfdiv, dfmul, dfsin, gsm, mips, sha, dhrystone

Andrew Canis 2011/09/13 12:15

I'm trying to turn LegupPass into a ModulePass so we can do binding across function boundaries.

Very strange. It seems like LegupConfig is getting constructed twice…

So basically there are two versions. One is created by llc and I'm not sure about the other one. I think one of them is for function passes?

acanis@acanis-desktop:~/git/legup/examples/loop$ ../../llvm/Release+Asserts/bin/llc -legup-config=../../hwtest/CycloneII.tcl -march=v loop.bc -o loop.v --debug-pass=Details
Adding from llc
Constructing LegupConfig 0x9f13d40
Constructing LegupConfig 0x9f4b230
Pass Arguments:  -targetdata -legupconfig
Target Data Layout
Legup Configuration
  ModulePass Manager
    LegupPass backend
      Unnamed pass: implement Pass::getPassName()
Pass Arguments:  -no-aa -legupconfig -legup-LiveVariableAnalysis -memdep -legup scheduler DAG -sdc-sched -simple asap -meta-asap
No Alias Analysis (always returns 'may' alias)
Legup Configuration
  FunctionPass Manager
    LVA
    Memory Dependence Analysis
    Legup directed acyclic graph with dependency and other information
    SDC Scheduler -- use linear programming for scheduling
    ASAP scheduler without resource constraints
    Complete ASAP Scheduling
0x9f13c60   Executing Pass 'LegupPass backend' on Module 'loop.bc'...
0x9f47fc0     Required Analyses: LVA, Complete ASAP Scheduling, Legup Configuration
Starting doInitialization
op_name: signed_comp_lt_32 count: 155 this 0x9f13d40
op_name: signed_comp_lt_32 count: 155 this 0x9f13d40
Starting function: main
op_name: signed_comp_lt_32 count: 155 this 0x9f13d40
op_name: signed_comp_lt_32 count: 155 this 0x9f13d40
0x9f4a2d8   Executing Pass 'LVA' on Function 'main'...
0x9f4a2d8   Made Modification 'LVA' on Function 'main'...
0x9f4a2d8   Executing Pass 'Memory Dependence Analysis' on Function 'main'...
0x9f4ad90     Required Analyses: No Alias Analysis (always returns 'may' alias)
0x9f4a2d8   Executing Pass 'Legup directed acyclic graph with dependency and other information' on Function 'main'...
0x9f4aba8     Required Analyses: Memory Dependence Analysis, Legup Configuration
op_name: signed_comp_lt_32 count: 0 this 0x9f4b230
llc: /home/acanis/git/legup/llvm/include/llvm/LegupConfig.h:356: legup::Operation* legup::LegupConfig::getOperationRef(std::string): Assertion `Operations.find(op_name) != Operations.end()' failed.

Okay. So this doesn't happen if I make LegupPass a function pass. Strange. But isn't TargetData an immutable pass? Okay. So one example is MergeFunctions, which is a ModulePass which also uses the TargetData info. Okay - it never actually adds TargetData as a required analysis pass. And actually, _none_ of the passes ever add TargetData as a required pass.

Strange. So I can't even get the TargetData analysis from legupschedulerDAG. I'll have to just make LegupConfig a global variable for now.

So let's get divider sharing working. Aes has 11 dividers/remainders. Reduces to 4 after binding. I see at least one case where they aren't being shared across function boundaries.

TODO: make sure srem/sdiv share together.

Very strange. If I call the LVA pass twice

Andrew Canis 2011/09/12 12:15

I'm going to install the latest version of ubuntu to see if the roccc binaries work.

Error compiling gcc in the roccc installation. I had to install gcc-multilib. Okay. roccc works in the latest version of ubuntu!

Andrew Canis 2011/09/08 12:15

Installed roccc:

acanis@acanis-desktop:~/roccc/roccc-0.6-distribution$ ./rocccInstall.sh -t ~/roccc/roccc-0.6-install/

ROCCC INSTALLER
This process will install ROCCC 2.0 onto your system.  Warnings will
be recorded in the file warning.log

Some steps may take a while

The GUI requires Eclipse 3.5 or higher. Please download from www.eclipse.org.

Installing modified gcc 4.0.2 for Hi-CIRRF



Installing llvm-gcc for Lo-CIRRF
Compiling the roccc-compiler proper
ROCCC already installed
Floating point cores added to the database
All of ROCCC is set up!
When prompted by the GUI, please enter: /home/acanis/roccc/roccc-0.6-distribution  as the ROCCC distribution directory
All of ROCCC has been set up.
The binaries are located in  /home/acanis/roccc/roccc-0.6-distribution/Install

Installed eclipse in ~/eclipse.

Damn when I try to build in roccc I get the error:

/home/acanis/roccc/roccc-0.6-distribution//Install/roccc-compiler/src/../bin/parser:
symbol lookup error:
/home/acanis/roccc/roccc-0.6-distribution//Install/roccc-compiler//solib/libstdc++.so.6:
undefined symbol:
_ZNSt7num_getIcSt19istreambuf_iteratorIcSt11char_traitsIcEEE2idE, version
GLIBCXX_3.4
Compilation of FFT.c failed.

So I moved the c++ lib into a tmp directory:

acanis@acanis-desktop:~/roccc/roccc-0.6-distribution/Install/roccc-compiler/solib$ mv libstdc++.so.6* tmp/

Now the gui opens okay but I get a new error:

/home/acanis/roccc/roccc-0.6-distribution/Install/roccc-compiler/src/llvm-2.3//Release/bin/opt:
/usr/lib/libstdc++.so.6: version `GLIBCXX_3.4.14' not found (required by
/home/acanis/roccc/roccc-0.6-distribution/Install/roccc-compiler/src/llvm-2.3//Release/bin/opt)
/home/acanis/roccc/roccc-0.6-distribution/Install/roccc-compiler/src/llvm-2.3//Release/bin/opt:
/usr/lib/libstdc++.so.6: version `GLIBCXX_3.4.11' not found (required by
/home/acanis/roccc/roccc-0.6-distribution/Install/roccc-compiler/src/llvm-2.3//Release/bin/opt)
Compilation of FFT.c failed.

They're using quite an old version of llvm (2.3). Do I need to downgrade to libstdc++.so.5? Looking at an ldd:

 ldd /home/acanis/roccc/roccc-0.6-distribution/Install/roccc-compiler/src/llvm-2.3//Release/bin/opt
/home/acanis/roccc/roccc-0.6-distribution/Install/roccc-compiler/src/llvm-2.3//Release/bin/opt: /usr/lib/libstdc++.so.6: version `GLIBCXX_3.4.14' not found (required by /home/acanis/roccc/roccc-0.6-distribution/Install/roccc-compiler/src/llvm-2.3//Release/bin/opt)
/home/acanis/roccc/roccc-0.6-distribution/Install/roccc-compiler/src/llvm-2.3//Release/bin/opt: /usr/lib/libstdc++.so.6: version `GLIBCXX_3.4.11' not found (required by /home/acanis/roccc/roccc-0.6-distribution/Install/roccc-compiler/src/llvm-2.3//Release/bin/opt)
	linux-gate.so.1 =>  (0xb772c000)
	libsqlite3.so.0 => /usr/lib/libsqlite3.so.0 (0xb7696000)
	libpthread.so.0 => /lib/tls/i686/cmov/libpthread.so.0 (0xb767c000)
	libdl.so.2 => /lib/tls/i686/cmov/libdl.so.2 (0xb7678000)
	libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0xb7589000)
	libm.so.6 => /lib/tls/i686/cmov/libm.so.6 (0xb7563000)
	libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xb7554000)
	libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0xb73f1000)
	/lib/ld-linux.so.2 (0xb772d000)

Damn. I'm not sure what to do here…

Andrew Canis 2011/09/07 12:15

I need to look into roccc. Try to compile the chstone benchmarks.

Todo:
  1) The interface for the fsm needs to be changed. How do you know how many cycles an instruction takes?
  2) Makefile needs to have an option for debugging mode
  3) Gui - cleanup APIs
  4) forum for mailing list

Looking at the gantt chart for popcount is very interesting. So much has been inlined that the gantt chart is huge. There are 53 states.

Looking into the multi fmax bug in benchmark.pl. jpeg seems to have the bug:

Type           : Clock Setup: 'pll50MHz:pll50|altpll:altpll_component|_clk0'
Slack          : 1.303 ns
Required Time  : 50.00 MHz ( period = 20.000 ns )
Actual Time    : 57.49 MHz ( period = 17.394 ns )
From           : tiger:tiger_sopc|data_cache_0:the_data_cache_0|Cache:data_cache_0|dcacheMem:dcacheMemIns|altsyncram:altsyncram_component|altsyncram_9hd2:auto_generated|ram_block1a0~porta_address_reg8
To             : tiger:tiger_sopc|tiger_top_0:the_tiger_top_0|tiger_top:tiger_top_0|tiger_tiger:core|tiger_decode:de|always0~1_Duplicate_OTERM447_OTERM459
From Clock     : pll50MHz:pll50|altpll:altpll_component|_clk0
To Clock       : pll50MHz:pll50|altpll:altpll_component|_clk0
Failed Paths   : 0

Type           : Clock Setup: 'altera_internal_jtag~TCKUTAP'
Slack          : N/A
Required Time  : None
Actual Time    : 48.41 MHz ( period = 20.658 ns )
From           : tiger:tiger_sopc|tigers_jtag_uart_1:the_tigers_jtag_uart_1|vJTAGUart:tigers_jtag_uart_1|FIFO:DataOut|dcfifo:dcfifo_component|dcfifo_4sp1:auto_generated|altsyncram_vu11:fifo_ram|altsyncram_rd91:altsyncram14|ram_block15a0~porta_address_reg7
To             : sld_hub:auto_hub|tdo
From Clock     : altera_internal_jtag~TCKUTAP
To Clock       : altera_internal_jtag~TCKUTAP
Failed Paths   : 0
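
The report above contains one "Actual Time" per clock domain, so whatever reads it has to decide which clock's Fmax to keep. As a rough sketch (in Python rather than the Perl of benchmark.pl, with a hypothetical function name and a trimmed copy of the report as input), the entries could be collected per clock like this:

```python
import re

def parse_fmax(report: str) -> dict:
    """Map each clock name to its reported 'Actual Time' frequency in MHz.

    Hypothetical helper for illustration; it is not taken from benchmark.pl.
    """
    fmax = {}
    clock = None
    for line in report.splitlines():
        m = re.match(r"\s*Type\s*:\s*Clock Setup:\s*'(.+)'\s*$", line)
        if m:
            clock = m.group(1)
            continue
        m = re.match(r"\s*Actual Time\s*:\s*([0-9.]+)\s*MHz", line)
        if m and clock is not None:
            fmax[clock] = float(m.group(1))
            clock = None
    return fmax

# Trimmed copy of the two timing entries shown above.
sample = """\
Type           : Clock Setup: 'pll50MHz:pll50|altpll:altpll_component|_clk0'
Actual Time    : 57.49 MHz ( period = 17.394 ns )
Type           : Clock Setup: 'altera_internal_jtag~TCKUTAP'
Actual Time    : 48.41 MHz ( period = 20.658 ns )
"""

fmax = parse_fmax(sample)
for clock, mhz in fmax.items():
    print(f"{clock}: {mhz} MHz")
```

Keying the result by clock name makes it explicit that the JTAG clock (48.41 MHz) and the PLL clock (57.49 MHz) are separate numbers, so the caller has to choose one deliberately.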

I'm recompiling jpeg in quartus to double check this. Strange. jpeg doesn't compile for me. Very strange. I have a blank function that gets called 3 times:

declare void @mexit_spin(i32) noreturn

How did buildbot not catch this? Okay nm it was due to some new changes I've been making. Retesting with a fresh copy of the repository. Remember: to compile with quartus, use “make p” to set up the project, then “make f”.

Andrew Canis 2011/08/23 12:15

I should probably add a forum to the legup website. Actually what I really need to do is turn the mailing list into more of a forum. Like the nabble forum for llvm.

I think writing a gui is actually very useful. Because I'll be able to clean up the APIs.

How to inline everything? What does inline-threshold do? From the code:

InlineLimit("inline-threshold", cl::Hidden, cl::init(225), cl::ZeroOrMore,
        cl::desc("Control the amount of inlining to perform (default = 225)"));

I'm going to try to run opt on adpcm:

acanis@acanis-desktop:~/work/legup/examples/chstone/adpcm$ ../../../llvm/Debug+Asserts/bin/opt -debug -inline -inline-threshold=0 < adpcm.bc > adpcm.new.bc; ../../../llvm/Debug+Asserts/bin/llvm-dis adpcm.new.bc
Args: ../../../llvm/Debug+Asserts/bin/opt -debug -inline -inline-threshold=0 
Inliner visiting SCC: upzero: 0 call sites.
Inliner visiting SCC: INDIRECTNODE: 0 call sites.
Inliner visiting SCC: printf: 0 call sites.
Inliner visiting SCC: main: 4 call sites.
    NOT Inlining: cost=370, thres=0, Call:   tail call fastcc void @upzero(i32 %76, i32* getelementptr inbounds ([6 x i32]* @delay_dltx, i32 0, i32 0), i32* getelementptr inbounds ([6 x i32]* @delay_bpl, i32 0, i32 0)) nounwind
    NOT Inlining: cost=370, thres=0, Call:   tail call fastcc void @upzero(i32 %148, i32* getelementptr inbounds ([6 x i32]* @delay_dhx, i32 0, i32 0), i32* getelementptr inbounds ([6 x i32]* @delay_bph, i32 0, i32 0)) nounwind
    NOT Inlining: cost=370, thres=0, Call:   tail call fastcc void @upzero(i32 %258, i32* getelementptr inbounds ([6 x i32]* @dec_del_dltx, i32 0, i32 0), i32* getelementptr inbounds ([6 x i32]* @dec_del_bpl, i32 0, i32 0)) nounwind
    NOT Inlining: cost=370, thres=0, Call:   tail call fastcc void @upzero(i32 %333, i32* getelementptr inbounds ([6 x i32]* @dec_del_dhx, i32 0, i32 0), i32* getelementptr inbounds ([6 x i32]* @dec_del_bph, i32 0, i32 0)) nounwind
Inliner visiting SCC: INDIRECTNODE: 0 call sites.

So even though I've set the inline-threshold, nothing gets inlined. Oh wait, you have to set the threshold to be high. To inline all functions run:

acanis@acanis-desktop:~/work/legup/examples/chstone/adpcm$ ../../../llvm/Debug+Asserts/bin/opt -debug -inline-threshold=100000 -inline < adpcm.bc > adpcm.new.bc; ../../../llvm/Debug+Asserts/bin/llvm-dis adpcm.new.bc
Args: ../../../llvm/Debug+Asserts/bin/opt -debug -inline-threshold=100000 -inline 
Inliner visiting SCC: upzero: 0 call sites.
Inliner visiting SCC: INDIRECTNODE: 0 call sites.
Inliner visiting SCC: printf: 0 call sites.
Inliner visiting SCC: main: 4 call sites.
    Inlining: cost=370, thres=100000, Call:   tail call fastcc void @upzero(i32 %76, i32* getelementptr inbounds ([6 x i32]* @delay_dltx, i32 0, i32 0), i32* getelementptr inbounds ([6 x i32]* @delay_bpl, i32 0, i32 0)) nounwind
    Inlining: cost=370, thres=100000, Call:   tail call fastcc void @upzero(i32 %410, i32* getelementptr inbounds ([6 x i32]* @dec_del_dhx, i32 0, i32 0), i32* getelementptr inbounds ([6 x i32]* @dec_del_bph, i32 0, i32 0)) nounwind
    Inlining: cost=370, thres=100000, Call:   tail call fastcc void @upzero(i32 %335, i32* getelementptr inbounds ([6 x i32]* @dec_del_dltx, i32 0, i32 0), i32* getelementptr inbounds ([6 x i32]* @dec_del_bpl, i32 0, i32 0)) nounwind
    Inlining: cost=-14630, thres=100000, Call:   tail call fastcc void @upzero(i32 %225, i32* getelementptr inbounds ([6 x i32]* @delay_dhx, i32 0, i32 0), i32* getelementptr inbounds ([6 x i32]* @delay_bph, i32 0, i32 0)) nounwind
    -> Deleting dead function: upzero
CGSCCPASSMGR: Refreshing SCC with 1 nodes:
Call graph node for function: 'main'<<0x9d0da18>>  #uses=1
  CS<0x9d297a4> calls function 'printf'

CGSCCPASSMGR: SCC Refresh didn't change call graph.
Inliner visiting SCC: INDIRECTNODE: 0 call sites.
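
The two runs are consistent with a simple rule: a call site is inlined only when its computed cost is below the threshold. The sketch below encodes that rule as an assumption based on the debug output above; it is not copied from the LLVM inliner, and the function name is made up.

```python
def should_inline(cost: int, threshold: int) -> bool:
    """Assumed decision rule: inline only when the cost is below the threshold."""
    return cost < threshold

# Costs and thresholds taken from the two opt runs above.
for cost, threshold in [(370, 0), (370, 100000), (-14630, 100000)]:
    verdict = "Inlining" if should_inline(cost, threshold) else "NOT Inlining"
    print(f"{verdict}: cost={cost}, thres={threshold}")
```

So a threshold of 0 rejects every call site with a positive cost, while a very large threshold accepts everything the inliner considers.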

Andrew Canis 2011/08/23 12:15

The linker seems to be optimizing away everything in the sw/hw partitioning case.

To remove untracked files in git (be careful): this also removes untracked directories (-d) and ignored files (-x).

git clean -fdx 

Makefile dependencies are annoying. Dry run can help you see what's going on:

acanis@acanis-desktop:~/git/legup$ make -n
mkdir -p cloog/install
cd cloog && ./configure --prefix=/home/acanis/git/legup/cloog/install
make -C cloog
make install -C cloog
cd llvm && ./configure --with-cloog=/home/acanis/git/legup/cloog/install --with-isl=/home/acanis/git/legup/cloog/install
make -C mips-binutils
make -C llvm
make -C tiger/hybrid/processor
make -C tiger/processor
make clean -C tiger/linux_tools
make -C tiger/linux_tools
make clean -C examples/lib/llvm
make -C examples/lib/llvm

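Okay I figured out why -j doesn't propagate to recursive calls of make. I need to call $(MAKE) instead of 'make'. So make -j4 works fine now. The only problem is this screws up the nice clean “make -n” shown above, because lines containing $(MAKE) are still executed under -n, so the dry run now recurses into every sub-make.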

There's a slight dependency problem with the Transforms/LegUp makefile:

LDFLAGS = $(LLVM_OBJ_ROOT)/lib/CodeGen/$(BuildMode)/IntrinsicLowering.o

Sometimes CodeGen isn't built before this line is used.

To reproduce run:

rm -rf llvm/lib/Transforms/LegUp/Release+Asserts/ llvm/lib/CodeGen/Release+Asserts/

And you'll see:

llvm[4]: Linking Release+Asserts Loadable Module LLVMLegUp.so
g++: /home/acanis/git/legup/llvm/lib/CodeGen/Release+Asserts/IntrinsicLowering.o: No such file or directory

Andrew Canis 2011/08/23 12:15

Trying to move IterativeModuloScheduling into the Target/Verilog directory. Running into the same errors as before:

IterativeModuloScheduling.cpp:18:33: error: polly/LinkAllPasses.h: No such file or directory
IterativeModuloScheduling.cpp:19:37: error: polly/Support/GICHelper.h: No such file or directory
IterativeModuloScheduling.cpp:20:38: error: polly/Support/ScopHelper.h: No such file or directory
IterativeModuloScheduling.cpp:21:25: error: polly/Cloog.h: No such file or directory
IterativeModuloScheduling.cpp:22:31: error: polly/Dependences.h: No such file or directory
IterativeModuloScheduling.cpp:23:28: error: polly/ScopInfo.h: No such file or directory
IterativeModuloScheduling.cpp:24:32: error: polly/TempScopInfo.h: No such file or directory
IterativeModuloScheduling.cpp:39:25: error: cloog/cloog.h: No such file or directory
IterativeModuloScheduling.cpp:40:29: error: cloog/isl/cloog.h: No such file or directory

In tools/polly/lib/Makefile there is the line:

CPP.Flags += $(POLLY_INC)

Where POLLY_INC is defined in the polly Makefile.config file. I need to add this include path in the base makefile. Okay I can just add this to the Target/Verilog makefile:

CPP.Flags += -I$(LLVM_SRC_ROOT)/../cloog/install/include \
			 -I$(LLVM_SRC_ROOT)/tools/polly/include

Actually I'm going to move this to the Transforms/LegUp directory so I can run this as a prepass.

This is annoying:

../../llvm/Debug+Asserts/bin/opt -load=../../llvm/Debug+Asserts/lib/LLVMLegUp.so -legup-prelto < pipeline.prelto.linked.bc > pipeline.prelto.bc
Error opening '../../llvm/Debug+Asserts/lib/LLVMLegUp.so': ../../llvm/Debug+Asserts/lib/LLVMLegUp.so: undefined symbol: _ZNK5polly8ScopPass5printERN4llvm11raw_ostreamEPKNS1_6ModuleE

The polly shared library is stored in the tools directory now:

./tools/polly/Debug+Asserts/lib/LLVMPolly.so

This problem again:

Error opening '../../llvm/tools/polly/Debug+Asserts/lib/LLVMPolly.so': libisl.so.7: cannot open shared object file: No such file or directory

Okay, I can fix this by loading these shared libraries manually.

Missing the SchedulerDAG from the Target/Verilog:

Error opening '../../llvm/Debug+Asserts/lib/LLVMLegUp.so': ../../llvm/Debug+Asserts/lib/LLVMLegUp.so: undefined symbol: _ZN5legup17LegupSchedulerDAG2IDE 

And SchedulerPass:

../../llvm/Debug+Asserts/bin/opt: symbol lookup error: ../../llvm/Debug+Asserts/lib/LLVMLegUp.so: undefined symbol: _ZN5legup13SchedulerPass14canChainBeforeEPN4llvm11InstructionE

Had to add the .o files to the makefile:

          $(LLVM_OBJ_ROOT)/lib/Target/Verilog/$(BuildMode)/LegupSchedulerDAG.o \
          $(LLVM_OBJ_ROOT)/lib/Target/Verilog/$(BuildMode)/SchedulerPass.o

Okay. Seems to be working now. Let's just confirm I get the same results as before and I can commit what I have so far. I'm seeing 407 cycles for examples/pipeline:

# run 7000000000000000ns
# At t=              815000 cycles=                 407 clk=1 finish=1 return_val=         0
# ** Note: $finish    : pipeline.v(1390)

Interesting. Before the update I was seeing ~500 cycles. This might be due to the new sdc scheduler? Anyway. It still should be ~300 cycles so I need to fix the cross basic block latency issue. Oh wait. It's because I've modified pipeline.c to have a loop carried dependence. Nope. I changed it back and the cycle count doesn't change. Luckily I saved everything in examples/pipeline/ece1754/

I need to actually distribute the gantt.sty latex style.

Trying to debug this tiger issue. I'm running “make tigersim”. Here is what I see after a while: The runtest.log is stalled at:

Running ./../dejagnu/tiger_sim/dfadd.exp ...

Looking at htop:

 1899 acanis    20   0  3336  1008   784 S  0.0  0.0  0:00.00  |       |   `- make test_tiger_sim
 1901 acanis    20   0  8360  4792  1560 S  0.0  0.2  0:01.22  |       |       `- /usr/bin/expect -- /usr/share/dejagnu/runtest.exp -v -v -v -v --all --status=1 ../dejagnu/tiger_sim/adpcm.exp ../dejagnu/tiger_sim/aes.exp ../d
 2164 acanis    20   0  3468  1056   804 S  0.0  0.0  0:00.00  |       |           `- make tigersim
 2189 acanis    20   0  3944  1244  1044 S  0.0  0.0  0:00.00  |       |           |   `- /bin/bash -e -c cd /home/acanis/work/legup/examples/chstone/dfadd/../../../tiger/processor/tiger_DE2/tiger_sim && ./simulate
 2190 acanis    20   0  4460  1068   600 S  0.0  0.0  0:00.00  |       |           |       `- /bin/bash -e -c cd /home/acanis/work/legup/examples/chstone/dfadd/../../../tiger/processor/tiger_DE2/tiger_sim && ./simulate
 2197 acanis    20   0  3068   628   532 S  0.0  0.0  0:00.00  |       |           |           `- tee transcript.txt
 2196 acanis    20   0 17016  8416  3004 S  0.0  0.3  0:00.29  |       |           |           `- vish -- -vsim -c -do ../run_sim_nowave.tcl
 2212 acanis    20   0 71084 18480  3964 R 96.0  0.7  9:24.91  |       |           |               `- /opt/modelsim/install/modeltech/linux/vsimk -port 56352 -stdoutfilename /tmp/VSOUTwrXtYr -c -do ../run_sim_nowave.tcl
 2213 acanis    20   0 71084 18480  3964 S  0.0  0.7  0:00.00  |       |           |               |   `- /opt/modelsim/install/modeltech/linux/vsimk -port 56352 -stdoutfilename /tmp/VSOUTwrXtYr -c -do ../run_sim_nowave.tcl
 2205 acanis    20   0  3172   960   784 S  0.0  0.0  0:00.00  |       |           |               `- /opt/modelsim/install/modeltech/linux/vlm 1598714592 1226522872
 2206 acanis    20   0  4320  2784  1388 S  0.0  0.1  0:00.14  |       |           |                   `- /opt/modelsim/install/modeltech/linux/mgls/lib/mgls_asynch  -f6,10
 1925 acanis    20   0  8360  4792  1560 S  0.0  0.2  0:00.00  |       |           `- /usr/bin/expect -- /usr/share/dejagnu/runtest.exp -v -v -v -v --all --status=1 ../dejagnu/tiger_sim/adpcm.exp ../dejagnu/tiger_sim/aes.exp
 2236 acanis    20   0  6812  4056  1448 S  0.0  0.1  0:00.12  |   

Looking in /tmp/VSOUTwrXtYr all I see is:

...
Tap Controller State machine output error
Time: 0  Instance: test_bench.DUT.the_tiger_top_0.tiger_top_0.debug_controller.VJTInst.sld_virtual_jtag_component.jtag.output_logic
a_input=

Andrew Canis 2011/08/23 12:15

I need to add something to shrink the integer sizes down. There is a presentation here:

    llvm.org/pubs/2007-07-25-LLVM-2.0-and-Beyond.pdf

The llvm 2.0 release added arbitrary precision integers:

Primarily useful to EDA / hardware synthesis business:
  * An 11-bit multiplier is significantly cheaper/smaller than a 16-bit one
  * Can use LLVM analysis/optimization framework to shrink variable widths
  * Patch available that adds an attribute in llvm-gcc to get this
Implementation impact of arbitrary width integers:
  * Immediates, constant folding, intermediate arithmetic simplifications
  * New APInt class used internally to represent/manipulate these
  * Makes LLVM more portable, not using uint64_t everywhere for arithmetic

I need to get my hands on that patch. Can't seem to find it, so I'll have to implement this myself.

For instance, I think this was the case Stefan was looking at in mips:

  %6 = phi i32 [ %227, %226 ], [ 0, %.preheader ]
  %7 = lshr i32 %pc.0, 2
  %8 = and i32 %7, 63

63 is 0b111111: all zeros and then 6 ones, so only the low 6 bits of %7 can survive the mask. So the above code can be turned into:

  %pc.0 = phi i32 [ %pc.1, %226 ], [ 4194304, %.preheader ]
  %7 = lshr i32 %pc.0, 2
  %8 = trunc i32 %7 to i6
  %9 = and i6 %8, 63
  %10 = zext i6 %9 to i32
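The narrowing above can be sketched in a few lines. This is only an illustration of the arithmetic, not the LegUp pass itself; `min_width_after_and` and `narrowed_and` are hypothetical names, and the sketch assumes unsigned values and a constant mask.

```python
def min_width_after_and(mask: int, width: int = 32) -> int:
    """Smallest integer width that can hold (x & mask) for any width-bit x.

    Hypothetical helper for illustration; assumes an unsigned constant mask.
    """
    if mask < 0 or mask >= (1 << width):
        raise ValueError("mask must fit in the original width")
    # Every bit above the highest set bit of the mask is forced to zero,
    # so only mask.bit_length() bits carry information (at least 1 bit).
    return max(1, mask.bit_length())


def narrowed_and(x: int, mask: int, width: int = 32) -> int:
    """Model the trunc -> and -> zext sequence on the narrowed width."""
    narrow = min_width_after_and(mask, width)
    truncated = x & ((1 << narrow) - 1)  # trunc i32 -> iN
    return truncated & mask              # and iN; zext back is a no-op here
```

For the mask 63 this gives a 6-bit `and`, matching the `i6` in the rewritten IR, and the narrowed result equals the original 32-bit `and` for any input.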

Need to run this pass after link-time optimization. What is the impact of this change? It probably won't affect area because Quartus would have already made this optimization. Let's double-check. No change in cycles. Yep, no impact on area.

Andrew Canis 2011/08/22 12:15

Getting rid of the array initialization takes us down to:

73735 / 2 = 36867 cycles

So that saves exactly 1024 cycles. Just noticed that there are actually no stores happening in the code right now, so that's actually cheating. Where are all these cycles coming from? Is it roughly 32 * 1024 = 32768? Fully pipelined, you could do it in 4 * 1024 = 4096. I forgot about unrolling. Would that fix this?

Interesting. Run the command (note the -debug option):

opt -mem2reg -loops -loop-simplify -loop-unroll -unroll-threshold=192 -debug  

This fully unrolls the inner 32-iteration loop but leaves the bigger outer loop:

Loop Unroll: F[main] Loop %
  Loop Size = 105
  Too large to fully unroll with count: 1024 because size: 107520>192
  will not try to unroll partially because -unroll-allow-partial not given

Now we finish in 26631 / 2 = 13315 cycles. So quite a bit better, but still worse than c-to-verilog. Try to partially unroll the outer loop? Doesn't work. How about fully unrolling the outer loop by setting the threshold to 107520? Wow, that produces a lot of code. Turning off the -debug flag. Now llc is taking forever. Oh shit, this is stupid: llvm just optimizes everything away. All that's left is printf statements.

Interesting. -unroll-allow-partial works if I increase unroll-threshold to 512:

Loop Unroll: F[main] Loop %
  Loop Size = 105
  Too large to fully unroll with count: 1024 because size: 107520>512
  partially unrolling with count: 4
  Trip Count = 1024
UNROLLING loop % by 4 with a breakout at trip 0!

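I guess this is because four copies of the loop body (4 * 105 = 420) fit under the 512 threshold and 4 divides the trip count of 1024 evenly; with the threshold at 192 not even two copies (210) would fit. Cycles don't change at all though. So we can still get a 3x improvement by pipelining, which is expected because I think the inner loop has about 3 dependent operations.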
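The numbers in the two debug dumps can be reproduced with a small sketch. The selection rule below is an assumption reconstructed from the debug output (largest count that fits under the threshold and divides the trip count), not something read out of the unroller source, and `partial_unroll_count` is a hypothetical name.

```python
def partial_unroll_count(loop_size: int, trip_count: int, threshold: int) -> int:
    """Guess at the partial-unroll factor.

    Assumed rule: start from the largest count whose unrolled size fits
    under the threshold, then step down until it divides the trip count.
    A result of 0 or 1 means no useful partial unrolling is possible.
    """
    count = threshold // loop_size
    while count > 1 and trip_count % count != 0:
        count -= 1
    return count


LOOP_SIZE = 105    # "Loop Size = 105" in the debug output
TRIP_COUNT = 1024  # "Trip Count = 1024"

full_unroll_size = LOOP_SIZE * TRIP_COUNT
count_at_192 = partial_unroll_count(LOOP_SIZE, TRIP_COUNT, 192)
count_at_512 = partial_unroll_count(LOOP_SIZE, TRIP_COUNT, 512)
```

Under that assumption a threshold of 192 leaves room for only one copy of the body, while 512 allows four, which is the count reported by the second run.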

There's a bug with the new polly:

acanis@acanis-desktop:~/work/legup/examples/popcount$ ~/work/legup/llvm/Debug+Asserts/bin/opt -load /home/acanis/work/legup/llvm/tools/polly/Debug+Asserts/lib/LLVMPolly.so 
Error opening '/home/acanis/work/legup/llvm/tools/polly/Debug+Asserts/lib/LLVMPolly.so': libisl.so.7: cannot open shared object file: No such file or directory
  -load request ignored.

Damn. Can I statically link it in? Well for now I'll just do:

export LD_LIBRARY_PATH=/home/acanis/git/legup/cloog/install/lib/:$LD_LIBRARY_PATH

Getting an error with pollycc:

acanis@acanis-desktop:~/work/legup/examples/popcount$ ~/work/legup/llvm/tools/polly/utils/pollycc  popcount.c 
Polly support not available in opt

Looks like the python script parses the output of the opt help:

['opt', '-load', '/home/acanis/work/legup/llvm/tools/polly/Debug+Asserts/lib/LLVMPolly.so', '-help']

Wrong opt. Updating my PATH:

export PATH=/home/acanis/work/legup/llvm//Debug+Asserts/bin/:$PATH

Seems to be working now. pollycc produces an a.out file

Okay, my old code was in:

/home/acanis/work/legup/llvm/tools/polly_old/lib/IterativeModuloScheduling.cpp

Note: runOnScop() immediately returns false right now. Okay, I need to figure out how to move this file out of the polly directory…

Andrew Canis 2011/08/19 12:15

Getting a strange error from Quartus: “Word too long”. Okay, turns out my PATH is longer than 1024 characters.
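A quick way to spot this kind of problem is to look for environment variables over the limit and to strip repeated PATH entries. This is a minimal sketch; the 1024-character limit is taken from the note above, and both function names are hypothetical.

```python
import os


def oversized_vars(environ, limit=1024):
    """Return {name: length} for variables whose value exceeds limit characters."""
    return {name: len(value) for name, value in environ.items() if len(value) > limit}


def dedupe_path(path, sep=":"):
    """Drop repeated and empty PATH entries, keeping first occurrences in order.

    Note: an empty PATH entry means "current directory", so dropping it
    changes lookup behaviour slightly.
    """
    seen = set()
    kept = []
    for entry in path.split(sep):
        if entry and entry not in seen:
            seen.add(entry)
            kept.append(entry)
    return sep.join(kept)
```

Calling `oversized_vars(os.environ)` lists the offenders, and `dedupe_path(os.environ["PATH"])` is often enough to get back under the limit when the same directories were prepended repeatedly.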

Looking at the c-to-verilog example. I think Nadav had pipelining implemented. There is a testbench inside the code. It looks like the two array parameters come from two dual-ported BRAMs. There is some initialization of the arrays in the testbench:

   integer i;
   initial begin
     for (i = 0; i < (1<<(ADDRESS_WIDTH-1)); i = i + 1) begin
       mem[i] <= i;
     end
   end

Looks like the mem is just initialized to 0, 1, 2, …

Is it typical to pass arrays into the main module like this in c-to-verilog?

We take significantly longer: 75783/2 = 37891 cycles vs. c-to-verilog: 41050ns / 10 = 4105 cycles. So about 9x slower, as expected because we don't have pipelining.

There are memory accesses every 40 / 10 = 4 cycles:

#                40975w mem[ 1021] ==       1021; in=         9
#                41015w mem[ 1022] ==       1022; in=         9
#                41055w mem[ 1023] ==       1023; in=        10
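The cycle counts and the access spacing can be checked directly from the numbers quoted in this entry. The 10 ns clock period is read off the "/ 10" in the notes, and the division by 2 for the LegUp run is copied as written; both are assumptions about the simulation setup rather than verified facts.

```python
CLOCK_PERIOD_NS = 10  # assumed from the "/ 10" in the notes

legup_cycles = 75783 // 2                      # as computed in the notes
ctoverilog_cycles = 41050 // CLOCK_PERIOD_NS   # simulation time -> cycles
slowdown = legup_cycles / ctoverilog_cycles

# Timestamps (ns) of the last three memory writes in the c-to-verilog log.
timestamps_ns = [40975, 41015, 41055]
gaps_ns = [later - earlier for earlier, later in zip(timestamps_ns, timestamps_ns[1:])]
cycles_per_access = gaps_ns[0] // CLOCK_PERIOD_NS

# 1024 elements at one access every 4 cycles.
pipelined_estimate = 1024 * cycles_per_access
```

The estimate of 4096 cycles lines up with the measured 4105, which supports reading the c-to-verilog run as one memory access per element every 4 cycles.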

Having enough memory ports is crucial. Here there are actually 4 ports available. When we pipeline this we will only have 1…

Andrew Canis 2011/08/18 12:15

Okay. Still a few modelsim warnings on mips, fir, memset. Fixed.

Andrew Canis 2011/08/15 12:15

Recompiling llvm-gcc 2.8 on the eecg machines. First you need to compile llvm 2.8

acanis@navy:~/llvm-2.8$ ./configure
acanis@navy:~/llvm-2.8$ make -j 2 ENABLE_OPTIMIZED=1

Then compile llvm-gcc:

acanis@navy:~/llvm-gcc-4.2-2.8.source/obj$ ../configure --target=i686-pc-linux-gnu --with-tune=generic --with-arch=pentium4 --prefix=`pwd`/../install --program-prefix=llvm- --enable-llvm=/home/acanis/llvm-2.8/ --enable-languages=c,c++
acanis@navy:~/llvm-gcc-4.2-2.8.source/obj$ make -j2 LLVM_VERSION_INFO=2.8

Received the error:

/brown/r/r0/acanis/llvm-gcc-4.2-2.8.source/obj/./gcc/xgcc -B/brown/r/r0/acanis/llvm-gcc-4.2-2.8.source/obj/./gcc/ -B/brown/r/r0/acanis/llvm-gcc-4.2-2.8.source/obj/../install/i686-pc-linux-gnu/bin/ -B/brown/r/r0/acanis/llvm-gcc-4.2-2.8.source/obj/../install/i686-pc-linux-gnu/lib/ -isystem /brown/r/r0/acanis/llvm-gcc-4.2-2.8.source/obj/../install/i686-pc-linux-gnu/include -isystem /brown/r/r0/acanis/llvm-gcc-4.2-2.8.source/obj/../install/i686-pc-linux-gnu/sys-include -O2  -O2 -g -O2  -DIN_GCC -DCROSS_DIRECTORY_STRUCTURE   -W -Wall -Wwrite-strings -Wstrict-prototypes -Wmissing-prototypes -Wold-style-definition  -isystem ./include  -fPIC -g -DHAVE_GTHR_DEFAULT -DIN_LIBGCC2 -D__GCC_FLOAT_NOT_NEEDED -Dinhibit_libc -msse -c \
                ../../gcc/config/i386/crtfastmath.c \
                -o crtfastmath.o
/brown/r/r0/acanis/llvm-gcc-4.2-2.8.source/obj/./gcc/as: line 2: exec: -Q: invalid option

I'm going to try again but get rid of x86 specific target stuff:

acanis@navy:~/llvm-gcc-4.2-2.8.source/obj$ ../configure --prefix=`pwd`/../install --program-prefix=llvm- --enable-llvm=/home/acanis/llvm-2.8/ --enable-languages=c,c++
acanis@navy:~/llvm-gcc-4.2-2.8.source/obj$ make -j2 LLVM_VERSION_INFO=2.8
acanis@navy:~/llvm-gcc-4.2-2.8.source/obj$ make install

Okay that worked. The new version of llvm-gcc is in:

~/llvm-gcc-4.2-2.8.source/install/bin

Okay. So the mips bug was related to the fact we're still using llvm-gcc. I think we should move to clang. clang 2.9 works fine on mint.

Basic blocks don't have names in clang. This will make debugging more difficult. Clang has some warnings on the benchmarks: jpeg, malloc

memset fails with:

FAIL: memset
Dest Pointer: i8* %arr
Unknown pointer destination in intrinsic argument
UNREACHABLE executed at PreLTO.cpp:159!
0  opt          0x088d7379
1  opt          0x088d7a41
2               0x4001e400 __kernel_sigreturn + 0
3  libc.so.6    0x402ae098 abort + 392
4  opt          0x088c4788 llvm::report_fatal_error(llvm::Twine const&) + 0
5  LLVMLegUp.so 0x404175a5 legup::LegUp::getIntrinsicMemoryAlignment(llvm::CallInst*) + 255
6  LLVMLegUp.so 0x40417846 legup::LegUp::lowerLegupInstrinsic(llvm::CallInst*, llvm::Function*) + 278
7  LLVMLegUp.so 0x40417b54 legup::LegUp::lowerIfIntrinsic(llvm::CallInst*, llvm::Function*) + 286
8  LLVMLegUp.so 0x4041aa07 legup::LegUp::runOnFunction(llvm::Function&) + 207
9  opt          0x0885e70d llvm::FPPassManager::runOnFunction(llvm::Function&) + 343
10 opt          0x0885e8f6 llvm::FPPassManager::runOnModule(llvm::Module&) + 114
11 opt          0x0885e3cc llvm::MPPassManager::runOnModule(llvm::Module&) + 398
12 opt          0x0885fc15 llvm::PassManagerImpl::run(llvm::Module&) + 129
13 opt          0x0885fc7b llvm::PassManager::run(llvm::Module&) + 39
14 opt          0x083f9d19 main + 4778
15 libc.so.6    0x40297775 __libc_start_main + 229
16 opt          0x083ea0b1

struct fails with:

llc: Ram.cpp:155: void legup::RAM::visitConstant(const llvm::Constant*, uint64_t*, std::stack<const llvm::Constant*, std::deque<const llvm::Constant*, std::allocator<const llvm::Constant*> > >&, std::stack<unsigned int, std::deque<unsigned int, std::allocator<unsigned int> > >&, std::stack<unsigned int, std::deque<unsigned int, std::allocator<unsigned int> > >&, unsigned int&, unsigned int&): Assertion `isa<ConstantAggregateZero>(c) || isa<ConstantPointerNull>(c)' failed.
0  llc       0x09144595
1  llc       0x09144c5d
2            0x4001e400 __kernel_sigreturn + 0
3  libc.so.6 0x402ae098 abort + 392
4  libc.so.6 0x402a55ce __assert_fail + 238
5  llc       0x086a4755 legup::RAM::visitConstant(llvm::Constant const*, unsigned long long*, std::stack<llvm::Constant const*, std::deque<llvm::Constant const*, std::allocator<llvm::Constant const*> > >&, std::stack<unsigned int, std::deque<unsigned int, std::allocator<unsigned int> > >&, std::stack<unsigned int, std::deque<unsigned int, std::allocator<unsigned int> > >&, unsigned int&, unsigned int&) + 353
6  llc       0x086a4f05 legup::RAM::initializeStruct() + 595
7  llc       0x086a5077 legup::RAM::buildInitializer() + 111
8  llc       0x086a50fd legup::RAM::generateMIF() + 47
9  llc       0x08671518 legup::VerilogWriter::printMemoryController() + 128
10 llc       0x086741e2 legup::VerilogWriter::print() + 214
11 llc       0x08661480 legup::LegupPass::printVerilog(std::set<llvm::Function*, std::less<llvm::Function*>, std::allocator<llvm::Function*> >) + 112
12 llc       0x086616a2 legup::LegupPass::doFinalization(llvm::Module&) + 220
13 llc       0x09072031 llvm::FPPassManager::doFinalization(llvm::Module&) + 75
14 llc       0x09076902 llvm::FPPassManager::runOnModule(llvm::Module&) + 178
15 llc       0x09076398 llvm::MPPassManager::runOnModule(llvm::Module&) + 398
16 llc       0x09077be1 llvm::PassManagerImpl::run(llvm::Module&) + 129
17 llc       0x09077c47 llvm::PassManager::run(llvm::Module&) + 39
18 llc       0x085fd21f main + 2887
19 libc.so.6 0x40297775 __libc_start_main + 229
20 llc       0x085fb511

Andrew Canis 2011/08/12 12:15

Added cloog/isl into the repository.

git clone git://repo.or.cz/cloog.git
cd cloog
./get_submodules.sh
./autogen.sh
./configure --prefix=~/work/polly/cloog/install
make
make install

I needed to copy .gitmodules into the base legup folder and modify the path:

-path = isl
+path = cloog/isl

Then run:

acanis@acanis-desktop:~/work/legup$ cloog/get_submodules.sh
Submodule 'isl' (git://repo.or.cz/isl.git) registered for path 'cloog/isl'
Cloning into cloog/isl...
warning: templates not found /usr/local/share/git-core/templates
remote: Counting objects: 9585, done.
remote: Compressing objects: 100% (2180/2180), done.
remote: Total 9585 (delta 7127), reused 9585 (delta 7127)
Receiving objects: 100% (9585/9585), 2.05 MiB | 328 KiB/s, done.
Resolving deltas: 100% (7127/7127), done.
Submodule path 'cloog/isl': checked out '24e309472a53920bdf19130a12c9ccec320c1867'

Now I added the new folder:

git add cloog/isl

Whoops. That didn't work. Okay I don't think I can use submodules here. I just need to check out both paths.

Looking in cloog/.gitmodules the repo for isl is git://repo.or.cz/isl.git

cloog revision:

commit 225c2ed62fe37a4db22bf4b95c3731dab1a50dde
Author: Sven Verdoolaege <skimo@kotnet.org>
Date:   Sun Jul 10 09:27:24 2011 +0200

isl revision:

commit e536653cbc99d7349eafa5e1a9cba873db3135eb
Author: Sven Verdoolaege <skimo@kotnet.org>
Date:   Sat Aug 6 22:30:40 2011 +0200

Wait. This revision is different than the submodule one listed above… Doesn't matter.

Seeing an error for the hybrids:

acanis@acanis-desktop:~/work/legup/examples/chstone_hybrid/adpcm$ ./sim_all_functions
...
export LEGUP_ACCELERATOR_FILENAME=adpcm; \
../../../llvm/Debug+Asserts/bin/opt -legup-config=config.tcl -load=../../../llvm/Debug+Asserts/libLLVMLegUp.so -legup-sw-only < adpcm.prelto.bc > adpcm.prelto.sw.bc
LLVM ERROR: IO failure on output stream.

This error can't be debugged with gdb. Looking in raw_ostream.cpp:

  // If there are any pending errors, report them now. Clients wishing
  // to avoid report_fatal_error calls should check for errors with
  // has_error() and clear the error flag with clear_error() before
  // destructing raw_ostream objects which may have errors.
  if (has_error())
    report_fatal_error("IO failure on output stream.");

Andrew Canis 2011/08/10 12:15

llc goes into an infinite loop on gsm. Adding -debug segfaults. I'm going to have to recompile in debug mode.

./configure --disable-optimized --with-cloog=/home/acanis/work/polly/cloog/install/ --with-isl=/home/acanis/work/polly/cloog/install/

Seems to be something to do with the new SDC scheduler. Or could it just be taking a long time? I doubt it, gsm never took this long before. There are 30 recursive calls to the function:

(gdb) bt
#0  0xb744341d in memmove () from /lib/tls/i686/cmov/libc.so.6
#1  0x0916e8dd in mat_appendrow ()
#2  0x09161c57 in add_constraintex ()
#3  0x086a90db in legup::SDCScheduler::addTimingConstraints (this=0x97754b8, Root=0x978d6f8, Curr=0x978ac78, 
    PartialPathDelay=108.549011) at SDCScheduler.cpp:227
#4  0x086a9117 in legup::SDCScheduler::addTimingConstraints (this=0x97754b8, Root=0x978d6f8, Curr=0x978ad18, 
    PartialPathDelay=104.96801) at SDCScheduler.cpp:232
#5  0x086a9117 in legup::SDCScheduler::addTimingConstraints (this=0x97754b8, Root=0x978d6f8, Curr=0x978af98, 
    PartialPathDelay=101.168007) at SDCScheduler.cpp:232
...
#29 0x086a9117 in legup::SDCScheduler::addTimingConstraints (this=0x97754b8, Root=0x978d6f8, Curr=0x978d6f8, 
    PartialPathDelay=4.29199982) at SDCScheduler.cpp:232
---Type <return> to continue, or q <return> to quit---
#30 0x086a91f5 in legup::SDCScheduler::addTimingConstraints (this=0x97754b8, F=@0x970f0d8)
    at SDCScheduler.cpp:246
#31 0x086a98c6 in legup::SDCScheduler::runOnFunction (this=0x97754b8, F=@0x970f0d8) at SDCScheduler.cpp:430
#32 0x09065971 in llvm::FPPassManager::runOnFunction (this=0x9774d00, F=@0x970f0d8) at PassManager.cpp:1513
#33 0x09065b55 in llvm::FPPassManager::runOnModule (this=0x9774d00, M=@0x970dee0) at PassManager.cpp:1535
#34 0x09065630 in llvm::MPPassManager::runOnModule (this=0x970e378, M=@0x970dee0) at PassManager.cpp:1589
#35 0x09066e65 in llvm::PassManagerImpl::run (this=0x9713f00, M=@0x970dee0) at PassManager.cpp:1671
#36 0x09066ecb in llvm::PassManager::run (this=0xbfd39164, M=@0x970dee0) at PassManager.cpp:1715
#37 0x085f8f0f in main (argc=6, argv=0xbfd392a4) at llc.cpp:396

There are about 300 recursive calls to addTimingConstraints() for some reason. There's actually a huge basic block in gsm with about 100 instructions.

bb.nph.i.i.i:                                     ; preds = %bb17.i.i
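One possible reading of the backtrace, offered only as a guess: if addTimingConstraints() follows every dependency path from Root separately (the PartialPathDelay argument suggests a per-path accumulation), then the number of calls is driven by the number of paths in the block's dependency graph, not by the number of instructions. The sketch below counts paths in a hypothetical worst-case graph of stacked diamonds; it does not model SDCScheduler.cpp, and all names are made up for illustration.

```python
from functools import lru_cache


def count_paths(succs, root):
    """Number of distinct root-to-leaf paths in a DAG.

    This is how many leaves a naive per-path recursion would reach.
    Memoised here so that the count itself stays cheap to compute.
    """
    @lru_cache(maxsize=None)
    def paths_from(node):
        children = succs.get(node, ())
        if not children:
            return 1
        return sum(paths_from(child) for child in children)

    return paths_from(root)


def diamond_chain(n):
    """Hypothetical worst case: n stacked diamonds, 3 * n + 1 nodes in total."""
    succs = {}
    for i in range(n):
        top, left, right, bottom = f"t{i}", f"l{i}", f"r{i}", f"t{i + 1}"
        succs[top] = (left, right)
        succs[left] = (bottom,)
        succs[right] = (bottom,)
    return succs
```

With 30 diamonds (91 nodes, roughly the size of the big gsm block) there are 2**30 paths, so a per-path walk could look like an infinite loop even though it would eventually terminate. Whether the real dependency graph has this shape is not known from the notes.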

Seeing one last problem with make tiger:

../../mips-binutils/bin/mipsel-elf-ld -T ../../tiger/linux_tools/lib/prog_link.ld -e main struct.o ../../tiger/tool_source/lib/altera_avalon_performance_counter.o -o struct.elf -EL -L ../../tiger/linux_tools/lib -lgcc -lfloat -luart
struct.o: In function `main':
(_main_section+0x2c): undefined reference to `memcpy'
struct.o: In function `main':
(_main_section+0x6c): undefined reference to `memcpy'
make: *** [tiger] Error 1

I don't get this. Why didn't I run into this before? Don't we lower memcpys into legup instructions? I see the memcpy in the .s file:

main:
...
# BB#0:                                 # %entry
...
	jal	memcpy

I don't see a memcpy in the .ll (there is a legup_memcpy_4 though). Damn. This must be created in the MIPS backend? I might have to write a memcpy manually. Just like Mark had to write a printf.

Tiger libraries are stored in:

../../tiger/linux_tools/lib

Sources are in:

../../tiger/tool_source/lib

I can find memcpy inside libgcc.a. Which should be included. I see a mem.c file in the source directory. Compiles fine if I add:

../../tiger/tool_source/lib/mem.o

So this compiles okay. But now make emulwatch doesn't match:

acanis@acanis-desktop:~/work/legup/examples/struct$ diff -u lli.txt sim.txt
--- lli.txt     2011-08-04 16:04:13.000000000 -0400
+++ sim.txt     2011-08-04 16:04:15.000000000 -0400
@@ -69,7 +69,7 @@
   %exitcond=0
 legup_memcpy_4:bb
   %indvar=d
-  %3=cdcd1514
+  %3=1514
   %indvar.next=e
   %exitcond=1
 legup_memcpy_4:return
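The two values of %3 in the diff differ only in their upper 16 bits, which a short check makes explicit. The constants are copied from the diff; `split_halfwords` is a hypothetical helper.

```python
lli_value = 0xCDCD1514  # %3 in lli.txt
sim_value = 0x00001514  # %3 in sim.txt


def split_halfwords(word):
    """Return (upper 16 bits, lower 16 bits) of a 32-bit word."""
    return (word >> 16) & 0xFFFF, word & 0xFFFF


lli_hi, lli_lo = split_halfwords(lli_value)
sim_hi, sim_lo = split_halfwords(sim_value)
upper_bits = format(lli_hi, "016b")  # binary form of the missing half-word
```

The lower half-words agree and only the upper half-word (0xcdcd) is lost on the simulation side.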

The code looks like:

void legup_memcpy_4(uint32_t * d, const uint32_t * s, size_t n)
{
    uint32_t * dt = d;
    const uint32_t * st = s;
    n >>= 2;
    while (n--)
        *dt++ = *st++;
}

The .ll:

bb:                                               ; preds = %bb, %bb.nph
  %indvar = phi i32 [ 0, %bb.nph ], [ %indvar.next, %bb ]
  %st.04 = getelementptr i32* %s, i32 %indvar
  %dt.03 = getelementptr i32* %d, i32 %indvar
  %3 = load i32* %st.04, align 4
  store i32 %3, i32* %dt.03, align 4
  %indvar.next = add i32 %indvar, 1
  %exitcond = icmp eq i32 %indvar.next, %tmp
  %4 = call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([111 x i8]* @11, i32 0, i32 0), i32 %indvar, i32 %3, i32 %indvar.next, i1 %exitcond)
  br i1 %exitcond, label %return, label %bb

So it looks like the load doesn't match: lli reads cdcd1514 but the simulation reads 1514. cdcd = 1100 1101 1100 1101. For some reason this upper half-word is zeroed out in gxemul. Indvar = d = 13. Actually this even happens with “make watch”, but the final result is still correct. I completely removed the legup_memcpy_4 code (disabling the prelto pass) and it still fails. Very strange. Now the make emulwatch simulation just stops right after:

pointSum:return
  %retval1=11

Why would nothing else get printed? There are three calls to pointSum. gxemul only calls the function once.

Interesting. So it looks like the breakpoint never triggers at the return address of main. Instead it triggers at the end of the code:

../lib/gxemul.exp -E testmips -e R3000 struct.elf -p `../../tiger/linux_tools/lib/../find_ra struct.emul.src` -p 0xffffffff80000180 -q
exit at: pc = 0xffffffff80000180
reg: v0 = 0x0000000000000011

The return address of main is:

acanis@acanis-desktop:~/work/legup/examples/struct$ ../../tiger/linux_tools/lib/../find_ra struct.emul.src
0xffffffff800319f8

So the return address of pointSum is probably incorrect… If I comment out the pointSum calls everything works fine. Let's see what the return address is:

acanis@acanis-desktop:~/work/legup/examples/struct$ gxemul -E testmips -e R3000 struct.elf -p `../../tiger/linux_tools/lib/../find_ra struct.emul.src` -p 0xffffffff80000180 -q -p  0xffffffff8003002c
BREAKPOINT: pc = 0xffffffff8003002c
(The instruction has not yet executed.)
GXemul> s
ffffffff8003002c: 27bd0008      addiu   sp,sp,8
GXemul>
ffffffff80030030: 03e00008      jr      ra      <sum+0x110>
ffffffff80030034: 00000000 (d)  nop
GXemul>
ffffffff80030148: 00028021      addu    s0,zr,v0

Looks fine. It jumps back right after the pointSum call. Wait, what's this? I step a few more times and:

GXemul> 
ffffffff80030150: 8fa4003e      lw      a0,62(sp)       [0xffffffffa0007e6e]
[ exception ADEL vaddr=0xffffffffa0007e6e pc=0xffffffff80030150 <sum+0x118> ]
GXemul> 
ffffffff80000180: 00000000      nop
BREAKPOINT: pc = 0xffffffff80000180
(The instruction has not yet executed.)

There should really be another check in the test suite that we never break on the second breakpoint.

Looking up this exception:

4.8.9 Address Error Exception — Instruction Fetch/Data Access
An address error exception occurs on an instruction or data access when an attempt is made to execute one of the following:
       • Fetch an instruction, load a word, or store a word that is not aligned on a word boundary
       • Load or store a halfword that is not aligned on a halfword boundary
       • Reference the kernel address space from user mode

Note that in the case of an instruction fetch that is not aligned on a word boundary, PC is updated before the condition is detected. Therefore, both EPC and BadVAddr point to the unaligned instruction address. In the case of a data access the exception is taken if either an unaligned address or an address that was inaccessible in the current processor mode was referenced by a load or store instruction.
Cause Register ExcCode Value:
    ADEL: Reference was a load or an instruction fetch
    ADES: Reference was a store

The lw was trying to load a 32-bit word from address 62 + sp into reg a0. 62 = 0x3e added to the sp (0xffffffffa0007e30) is 0xffffffffa0007e6e (as shown by vaddr above). The last 4 bits are 1110, but the last two bits must be 0 for the address to be 32-bit aligned. I'm going to file a bug.

GXemul> reg
cpu0:    pc = 0xffffffff80000180    < no symbol >
...
cpu0:    a0 = 0x000000000b0a0908    s4 = 0x0000000000000004
...
cpu0:    t5 = 0x0000000000000000    sp = 0xffffffffa0007e30
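The address arithmetic behind the ADEL exception can be replayed from the register dump. The sp value and the offset come from the gxemul output above; the helper names are hypothetical.

```python
def effective_address(base, offset, bits=64):
    """Base register plus signed offset, wrapped to the register width."""
    return (base + offset) & ((1 << bits) - 1)


def is_aligned(addr, size):
    """True if addr is a multiple of size (size in bytes)."""
    return addr % size == 0


sp = 0xFFFFFFFFA0007E30  # sp from the register dump
offset = 62              # from "lw a0,62(sp)"

vaddr = effective_address(sp, offset)
low_nibble = vaddr & 0xF
word_aligned = is_aligned(vaddr, 4)
```

Since sp itself is word aligned and 62 is not a multiple of 4, this lw can never be aligned, which points at the offset generated by the compiler and not at the emulator.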

I'm just going to file a bug report. Okay no. It seems to be something to do with this line:

../../mips-binutils/bin/mipsel-elf-ld -T ../../tiger/linux_tools/lib/prog_link_sim.ld -e main struct.o -o struct.elf -EL -L ../../tiger/linux_tools/lib -lgcc -lfloat -luart_el_sim -lmem_el_sim

That's because nothing gets run unless we link in those libraries.

You can run the Mips test Victor submitted manually like this:

acanis@acanis-desktop:~/work/legup/llvm/test$ ../Debug+Asserts/bin/llvm-lit -v CodeGen/Mips/2010-07-20-Switch.ll 
-- Testing: 1 tests, 4 threads --
PASS: LLVM :: CodeGen/Mips/2010-07-20-Switch.ll (1 of 1)
Testing Time: 0.03s
  Expected Passes    : 1

Submitted an LLVM bug report: http://llvm.org/bugs/show_bug.cgi?id=10634

So let's just wait and see what Bruno Lopes has to say. Could it be an issue with llvm-gcc?

The following tests don't run gxemul:

./div_const/dg.exp
./overflow_intrinsic/dg.exp
./signeddiv/dg.exp
./phi/dg.exp
./unaligned/dg.exp
./cpp/dg.exp

Andrew Canis 2011/08/04 12:15

New LLVM version is almost working but I'm seeing the error:

[ 93%] Built target LLVMPolly
make -f tools/llvm-config/CMakeFiles/llvm-config.target.dir/build.make tools/llvm-config/CMakeFiles/llvm-config.target.dir/depend
make[2]: Entering directory `/home/acanis/work/legup/build'
cd /home/acanis/work/legup/build && /home/acanis/cmake-2.8.4-Linux-i386/bin/cmake -E cmake_depends "Unix Makefiles" /home/acanis/work/legup/llvm /home/acanis/work/legup/llvm/tools/llvm-config /home/acanis/work/legup/build /home/acanis/work/legup/build/tools/llvm-config /home/acanis/work/legup/build/tools/llvm-config/CMakeFiles/llvm-config.target.dir/DependInfo.cmake --color=
make[2]: Leaving directory `/home/acanis/work/legup/build'
make -f tools/llvm-config/CMakeFiles/llvm-config.target.dir/build.make tools/llvm-config/CMakeFiles/llvm-config.target.dir/build
make[2]: Entering directory `/home/acanis/work/legup/build'
/home/acanis/cmake-2.8.4-Linux-i386/bin/cmake -E cmake_progress_report /home/acanis/work/legup/build/CMakeFiles
[ 93%] Updating LibDeps.txt if necessary...
cd /home/acanis/work/legup/build/tools/llvm-config && /home/acanis/cmake-2.8.4-Linux-i386/bin/cmake -E copy_if_different LibDeps.txt.tmp LibDeps.txt
Error copying file (if different) from "LibDeps.txt.tmp" to "LibDeps.txt".
make[2]: *** [tools/llvm-config/LibDeps.txt] Error 1

I had to fix the llvm-config/CMakeLists.txt file as mentioned in:

    http://comments.gmane.org/gmane.comp.compilers.llvm.cvs/89287

Now I see:

CMakeFiles/llvm-mc.dir/llvm-mc.cpp.o: In function `llvm::InitializeAllTargetMCs()':
/home/acanis/work/legup/build/include/llvm/Config/Targets.def:41: undefined reference to `LLVMInitializeVerilogTargetMC'
collect2: ld returned 1 exit status

Okay, slight interface change in LLVM.

Linker error for llc:

Linking CXX executable ../../bin/llc
../../lib/libLLVMVerilog.a(SDCScheduler.cpp.o): In function `legup::SDCScheduler::scheduleAXAP(bool)':
/home/acanis/work/legup/llvm/lib/Target/Verilog/SDCScheduler.cpp:379: undefined reference to `set_obj_fnex'

The cmake cache is really annoying. Every time you modify the cmake files you have to run:

rm CMakeCache.txt

To figure out what's happening when you're making:

make VERBOSE=1

Seems like tcl is being added properly here but the lpsolve library isn't being added:

/usr/bin/c++    -fPIC -fno-rtti -g   CMakeFiles/llc.dir/llc.cpp.o  -o
../../bin/llc -rdynamic -ltcl8.5 ../../lib/libLLVMVerilog.a (SKIPPED) -ltcl8.5 -ldl
-lpthread 

If I manually rerun this with “-L/usr/lib/lp_solve -llpsolve55” added it works fine. Okay. This was a problem with the llvm/CMakeLists.txt file. Great. Everything compiles with cmake. Now let's try autoconf.

Compiler warnings:

LegupTcl.cpp: In function ‘int legup::set_accelerator_function(void*, Tcl_Interp*, int, const char**)’:
LegupTcl.cpp:23: warning: cast from type ‘const char*’ to type ‘char*’ casts away constness
LegupTcl.cpp: In function ‘int legup::set_operation_attributes(void*, Tcl_Interp*, int, const char**)’:
LegupTcl.cpp:35: warning: cast from type ‘const char*’ to type ‘char*’ casts away constness
...
LegupTcl.cpp:97: warning: cast from type ‘const char*’ to type ‘char*’ casts away constness
LegupTcl.cpp: In function ‘int legup::set_device_specs(void*, Tcl_Interp*, int, const char**)’:
LegupTcl.cpp:108: warning: cast from type ‘const char*’ to type ‘char*’ casts away constness
...
LegupTcl.cpp:136: warning: cast from type ‘const char*’ to type ‘char*’ casts away constness

Strange error:

make[2]: Entering directory `/home/acanis/work/legup/llvm/unittests/VMCore'
llvm[2]: Compiling DerivedTypesTest.cpp for Release+Asserts build
DerivedTypesTest.cpp: In function ‘void<unnamed>::PR7658()’:
DerivedTypesTest.cpp:24: error: ‘PATypeHolder’ was not declared in this scope

I think the file has been deleted. Yep.

It seems like our llvm-gcc is too old:

llvm-gcc array.c -emit-llvm -c -fno-builtin -m32 -malign-double -I ../lib/include/ -O0 -fno-inline-functions -o array.prelto.1.bc
# linking may produce llvm mem-family intrinsics
../../llvm/Release+Asserts/bin/llvm-ld -disable-inlining -disable-opt array.prelto.1.bc -b=array.prelto.linked.bc
llvm-ld: error: Cannot load file 'array.prelto.1.bc': Bitcode file 'array.prelto.1.bc' could not be loaded: Invalid ALLOCA record

Clang doesn't have this error. But there are warnings:

clang array.c -emit-llvm -c -fno-builtin -m32 -malign-double -I ../lib/include/ -O0 -fno-inline-functions -o array.prelto.1.bc
clang: warning: argument unused during compilation: '-malign-double'

And also verilog errors:

-- Compiling module fct
** Error: array.v(550): 'LEGUP1_F_fct_BB_' already declared in this scope.
...
** Error: array.v(566): 'LEGUP3_F_fct_BB_' already declared in this scope.
-- Compiling module main
** Error: array.v(1338): 'LEGUP1_F_main_BB_' already declared in this scope.
...
** Error: array.v(1377): 'LEGUP8_F_main_BB_' already declared in this scope.

Damn. I can't run llvm-gcc 2.9:

llvm-gcc: /lib/tls/i686/cmov/libc.so.6: version `GLIBC_2.11' not found (required by llvm-gcc)

Okay great. llvm-gcc 2.8 works.

Seeing some bugs:

../../../llvm/Release+Asserts/bin/opt: symbol lookup error: ../../../llvm/Release+Asserts/lib/LLVMLegUp.so: undefined symbol: _ZN4llvm17IntrinsicLowering18LowerIntrinsicCallEPNS_8CallInstE

I thought I already fixed this… Looks fine. I included the dependency in lib/Transforms/LegUp/Makefile:

USED_LIBS = LLVMCodeGen

Tried adding back in:

LDFLAGS = $(LLVM_OBJ_ROOT)/lib/CodeGen/$(BuildMode)/IntrinsicLowering.o

Strange. So that worked. Wow. dfmul is suddenly fixed! I'm guessing this was caused by the newer llvm-gcc version?

llc seems to be running into an infinite loop on gsm…

Andrew Canis 2011/08/03 12:15

Autoconf doesn't work for the latest git llvm and polly?

llvm[0]: Compiling ScheduleOptimizer.cpp for Debug+Asserts build (PIC)
ScheduleOptimizer.cpp:30:26: error: isl/schedule.h: No such file or directory

Trying make -n to show makefile commands. The schedule.h file doesn't actually exist… Is this a new header file that has been added in the past few months? Yep. Needed to update my cloog version. Okay this works now. So I actually need to distribute these header files and .so manually. Damn. This also means I need to update LLVM.

Updating to commit:

commit b4f4cbd199318901d12737ded05ebebd8cb21336
Author: David Greene <greened@obbligato.org>
Date:   Fri Jul 29 20:50:18 2011 +0000

Damn. The merge totally fails. I see a lot of “both added” conflicts.

git checkout --theirs -- Transforms/
git add -u Transforms/

Actually usually you can just manually merge the makefiles and files we changed (looking at git log), then just checkout/add the whole directory.

Testing the Mips backend again. I'm going to submit some bug reports.

Andrew Canis 2011/07/29 12:15

Adding polly to the repo. Cmake was working before. Trying to get autoconf working. In llvm, running ./configure --with-cloog=~/work/polly/cloog/install/ --with-isl=~/work/polly/cloog/install/ gives:

=== configuring in tools/polly (/home/acanis/work/legup/llvm/tools/polly)
...
checking for isl in inc_not_give_isl, lib_not_give_isl... configure: error: isl required but not found
configure: error: ./configure failed for tools/polly

Wow. You can't use ~ in the path! So annoying.

./configure --with-cloog=/home/acanis/work/polly/cloog/install/ --with-isl=/home/acanis/work/polly/cloog/install/
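The likely reason is that the shell only expands `~` at the start of a word (or in a variable assignment), so in `--with-cloog=~/...` the tilde reaches configure as a literal character. One way to avoid typing absolute paths by hand is to expand them before building the command line. A minimal sketch, where `configure_args` is a hypothetical helper and the option names are the ones used above:

```python
import os


def configure_args(cloog_prefix, isl_prefix):
    """Build ./configure arguments with ~ expanded to an absolute path up front."""
    cloog = os.path.abspath(os.path.expanduser(cloog_prefix))
    isl = os.path.abspath(os.path.expanduser(isl_prefix))
    return ["./configure", f"--with-cloog={cloog}", f"--with-isl={isl}"]
```

The resulting list can be passed to `subprocess.run()` from the llvm directory; because no shell is involved, nothing depends on tilde expansion at all.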

How to do live variable analysis in SSA? I see from LiveVariables.cpp:

It uses the dominance properties of SSA form to efficiently compute live
variables for virtual registers

What does this mean?

  // Calculate live variable information in depth first order on the CFG of the
  // function.  This guarantees that we will see the definition of a virtual
  // register before its uses due to dominance properties of SSA (except for PHI
  // nodes, which are treated as a special case).

Oh you can just do a depth first traversal of the CFG.
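The role of dominance can be illustrated with a simplified liveness computation on an SSA control-flow graph. This is not the algorithm in LiveVariables.cpp (which works on machine virtual registers); it is a sketch of the underlying idea, assuming each variable has exactly one defining block that dominates all its non-PHI uses. All names are hypothetical.

```python
def ssa_live_in(preds, def_block, uses):
    """Return {block: set of variables live on entry} for an SSA CFG.

    preds:     {block: [predecessor blocks]}
    def_block: {var: block that defines it}
    uses:      {var: [blocks with a non-PHI use of it]}

    Because the definition dominates every use, walking backwards from a
    use through predecessors must eventually hit the defining block, so
    the walk simply stops there. No fixed-point iteration is needed.
    """
    live_in = {block: set() for block in preds}
    for var, use_blocks in uses.items():
        home = def_block[var]
        # A use inside the defining block comes after the definition,
        # so it does not make the variable live on entry to that block.
        worklist = [block for block in use_blocks if block != home]
        while worklist:
            block = worklist.pop()
            if var in live_in[block]:
                continue  # already visited for this variable
            live_in[block].add(var)
            worklist.extend(p for p in preds[block] if p != home)
    return live_in
```

For a CFG `entry -> loop -> exit` with a back edge on `loop`, a value defined in `entry` and used in `exit` comes out live on entry to both `loop` and `exit`, and not live on entry to `entry`.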

Andrew Canis 2011/07/26 12:15

Testing the git subtree method locally. Wow. Ran into a really annoying bug with git merge subtree. Turns out you need to specify the directory location otherwise the merge won't work properly:

I see a lot of “both added” conflicts. I'm just going to take LLVM's version and then manually merge the autoconfig changes:

git checkout --theirs -- .
git add -u .

Andrew Canis 2011/07/25 12:15

What's the status on the LLVM update and loop pipelining integration?

There's still a bug with gxemul with the new LLVM mips backend:

Running ./chstone/dfmul/dg.exp ...
FAIL: gxemul simulation. Expected: reg: v0 = 0x0000000000000000

make emulwatch gives:

acanis@acanis-desktop:~/work/legup/examples/chstone/dfmul$ diff -u sim.txt lli.txt 
--- sim.txt	2011-07-07 14:55:20.000000000 -0400
+++ lli.txt	2011-07-07 14:55:18.000000000 -0400
@@ -160,12 +160,14 @@
   %87=ffff000000000000
   %88=1
 main:bb3.i26.i
-  %90=0
+  %90=1
+main:bb5.i27.i
+  %retval.i.i=ffff000000000000
 main:float64_mul.exit
-  %181=3ff8000000000000
+  %181=ffff000000000000
   %183=ffff000000000000
-  %184=1
-  %186=1
+  %184=0
+  %186=0
   %189=4
   %exitcond=1
 main:bb2

The value of %90 is wrong:

float64_is_signaling_nan.exit.i.i:                
  %84 = phi i32 [ %80, %bb.i.i.i ], [ %retval.i11.i.i, %float64_is_signaling_nan.exit14.i.i ], [ 0, %bb16.i.float64_is_signaling_nan.exit14.i.i_crit_edge ]

bb3.i26.i:                                        ; preds = %float64_is_signaling_nan.exit.i.i
  %90 = icmp eq i32 %84, 0

In both cases we're coming from main:bb16.i.float64_is_signaling_nan.exit14.i.i_crit_edge. So %84 = 0. Which is also correct in both traces in basic block main:float64_is_signaling_nan.exit.i.i

Where is this in the assembly? Looking at dfmul.s:

$BB0_40:                                # %bb3.i26.i
                                        #   in Loop: Header=BB0_1 Depth=1
	addiu	$19, $zero, 0
	lui	$16, %hi(__unnamed_24)
	xor	$19, $16, $19
	addiu	$4, $16, %lo(__unnamed_24)
	sltu	$5, $19, 1
	jal	mprintf
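Taking the listing literally, the four ALU instructions can be traced by hand. The sketch below interprets them on 32-bit registers; the value passed for `%hi(__unnamed_24)` is a made-up placeholder (any non-zero upper half gives the same outcome), and the function name is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def run_bb0_40(hi_unnamed_24):
    """Interpret the ALU instructions of $BB0_40; returns the register file."""
    reg = {}
    reg[19] = 0                               # addiu $19, $zero, 0
    reg[16] = (hi_unnamed_24 << 16) & MASK32  # lui   $16, %hi(__unnamed_24)
    reg[19] = reg[16] ^ reg[19]               # xor   $19, $16, $19
    reg[5] = 1 if reg[19] < 1 else 0          # sltu  $5, $19, 1  (unsigned)
    return reg


# Placeholder upper half of the string's address; hypothetical value.
regs = run_bb0_40(0x8003)
```

As written, $19 ends up holding the upper half of the string address and not a value derived from %84, so $5 comes out as 0 whenever that address is non-zero. That matches the 0 seen in the simulation trace, though it does not by itself say which instruction the backend got wrong.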

Copied from an earlier entry below; I already looked at this. If $19 < 1 then $5 = 1, else $5 = 0. If $19 represents %84 and $5 represents %90, then when $19 = 0 we get $5 = 1. The sim says $5 (%90) is 0 when it should be 1. I would like to step through this code in gxemul. “make emul” runs the following commands:

../../../mips-binutils/bin/mipsel-elf-ld -T ../../../tiger/linux_tools/lib/prog_link_sim.ld -e main dfmul.o -o dfmul.elf -EL -L ../../../tiger/linux_tools/lib -lgcc -lfloat -luart_el_sim
../../../mips-binutils/bin/mipsel-elf-objdump -d dfmul.elf > dfmul.emul.src
gxemul -E testmips -e R3000 dfmul.elf -p `../../../tiger/linux_tools/lib/../find_ra dfmul.emul.src` -p 0xffffffff80000180 -q

Before running “make emul” you need to compile the .s file with:

../../../mips-binutils/bin/mipsel-elf-as dfmul.s -mips1 -mabi=32 -o dfmul.o -EL
../../../mips-binutils/bin/mipsel-elf-ld -T ../../../tiger/linux_tools/lib/prog_link.ld -e main dfmul.o ../../../tiger/tool_source/lib/altera_avalon_performance_counter.o -o dfmul.elf -EL -L ../../../tiger/linux_tools/lib -lgcc -lfloat -luart
../../../mips-binutils/bin/mipsel-elf-objdump -D dfmul.elf > dfmul.src
../../../tiger/linux_tools/lib/../elf2sdram dfmul.elf sdram.dat

Doing an instruction trace with -i. The %90 is printed at line 305279.

Fixing lpsolve dependency. Makefile.config is generated by configure from Makefile.config.in. Useful guide: http://llvm.org/docs/MakefileGuide.html#Makefile.config

I had to add a new macro called AX_EXT_HAVE_LIB() because the lpsolve library isn't installed in /usr/lib but in /usr/lib/lp_solve/. The new macro adds the appropriate -L/usr/lib/lp_solve/ flag.

Note that both AX_EXT_HAVE_LIB and AC_SEARCH_LIBS modify the Makefile.config LIBS variable.

For some reason the makefile is broken. The LDFLAGS aren't being added properly:

/home/acanis/git/legup/llvm/Release/bin/tblgen: error while loading shared libraries: liblpsolve55.so: cannot open shared object file: No such file or directory

Okay I just changed this to use liblpsolve_pic.a which is compiled with -fPIC to allow shared linkage.

Andrew Canis 2011/07/07 12:15

Okay. “make test_tiger_sim” seems to be failing with my new changes to libuart. Looks like it's a problem with mprintf not working. “make tigersim” doesn't produce the expected output.

Alright so adding this back into uart.h (included from stdio.h):

#define printf mprintf

But I still don't get the right output from tiger modelsim. For mips:

# 1008533759

For aes:

# 1008533759

What does this number mean? gxemul is working fine… Looks like an uninitialized value. Strange, when I explicitly add:

      main_result = 0;
      printf ("%d\n", main_result);

I still get the same thing. mprintf() seems to be totally broken:

      printf ("---->%d %d %d %d\n", 0, 1, -1, main_result);

Gives:

# ---->1008533759 935190524 201326600 0

Where are these numbers coming from?? Is it some sort of bug in the llvm mips backend maybe? I should test mprintf with gxemul. For some reason printf isn't working with gxemul. Strange, because make emulwatch uses printf. I do see a little bit of magic in the emulwatch target:

sed -i "s/\tprintf/\tmprintf/g" mips.s
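Coming back to the three odd values that mprintf() printed: one way to get a handle on them is to look at them as 32-bit words and split out the MIPS instruction fields. This is only a sketch and was not part of the original debugging session; the helper names are hypothetical. The top six bits of a MIPS instruction word are the opcode (0x0f is lui, 0x0d is ori, 0x03 is jal).

```c
#include <stdint.h>

/* Hypothetical helpers: split a 32-bit word into MIPS I-type fields. */
uint32_t mips_opcode(uint32_t word) { return word >> 26; }           /* bits 31..26 */
uint32_t mips_rs(uint32_t word)     { return (word >> 21) & 0x1f; }  /* bits 25..21 */
uint32_t mips_rt(uint32_t word)     { return (word >> 16) & 0x1f; }  /* bits 20..16 */
uint32_t mips_imm16(uint32_t word)  { return word & 0xffff; }        /* bits 15..0  */
```

Applied to the output above, 1008533759 is 0x3c1d00ff (opcode 0x0f, rt = 29, immediate 0x00ff), 935190524 is 0x37bddffc (opcode 0x0d, rs = rt = 29) and 201326600 is 0x0c000008 (opcode 0x03). Register 29 is $sp, so all three decode as plausible instruction words. That would be consistent with mprintf() reading its arguments from the wrong place, but that interpretation is a guess.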

Oh shit. I need to run “make tiger” first _before_ running “make emul”. Okay, I think something is broken with mprintf. This:

      printf ("Start\n");
      printf ("---->'%d' '%d' '%d' '%d'\n", 0, 1, -1, main_result);
      printf ("End\n");

Doesn't simulate properly in gxemul:

$ make tiger;make emul
...
Start
---->'ffffffff80000180: 00000000        nop
BREAKPOINT: pc = 0xffffffff80000180
(The instruction has not yet executed.)

The code dies in the middle of mprintf(). In particular, the variable arguments seem to be failing:

va_arg(arg, int)

Seems to crash the whole program. I bet the mips backend doesn't support variable arguments… I see in the release notes for a newer LLVM version something about improved support for variable arguments in the mips backend.
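To isolate va_arg from the rest of mprintf(), a tiny standalone varargs function could be compiled with the same llc flags and run in gxemul. A minimal sketch, assuming nothing about mprintf() itself; sum_ints is a hypothetical name:

```c
#include <stdarg.h>

/* Hypothetical test helper: sums `count` int arguments using va_arg. */
int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;
    int i;

    va_start(ap, count);
    for (i = 0; i < count; i++)
        total += va_arg(ap, int);   /* the call that appears to fail on MIPS */
    va_end(ap);
    return total;
}
```

On a working target sum_ints(3, 0, 1, -1) returns 0 and sum_ints(4, 1, 2, 3, 4) returns 10. If the MIPS build crashes or returns garbage here, the problem is in variable-argument handling rather than in the UART code.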

I'm going to have to install the mips-gcc after all. Installing it with crosstool-ng (http://crosstool-ng.org/). I'm putting the mips gcc in ~/crosstool/gcc, and gcc gets installed in ~/x-tools/

I need to recompile with hardware-float to avoid this warning:

../../../mips-binutils/bin/mipsel-elf-ld: Warning: mips.elf uses hard float, ../../../tiger/linux_tools/lib/libuart.a(uart.o) uses soft float

Andrew Canis 2011/07/06 12:15

After Mark's push function_pointer seems to be failing:

make[1]: Leaving directory `/home/acanis/git/legup/examples/function_pointer'
function_pointer.c: In function ‘a’:
function_pointer.c:3: warning: ‘return’ with a value, in function returning void
function_pointer.c: In function ‘b’:
function_pointer.c:4: warning: ‘return’ with a value, in function returning void
llc: utils.cpp:48: llvm::Function* legup::getCalledFunction(llvm::CallInst*): Assertion `called' failed.

Of course. Because function pointers aren't supported by LegUp. I should make this error more user friendly. Added new test suite files for this.

llist is failing because NULL is undeclared. NULL is normally declared in stdio.h

Andrew Canis 2011/07/05 12:15

I just pulled Victor's fix to mprintf(). So it seems to work. The make emulwatch now doesn't show any differences.

Interesting. So when I run:

make emulwatch
make emultest 

The result is correct. But running make tiger; make emultest fails:

exit at: pc = 0xffffffff80031d4c
reg: v0 = 0x0000000000000002

To see the output from the dfmul printf, run “make emul”. I see two discrepancies:

a_input=7ff0000000000000 b_input=ffffffffffffffff expected=ffffffffffffffff output=7ff8000000000000
a_input=3ff0000000000000 b_input=ffff000000000000 expected=ffff000000000000 output=3ff8000000000000

But make emulwatch doesn't show any difference for these errors. How do I track down the problem? Well, this is because “make emulwatch” has the correct results. So by adding printfs everywhere the bug is removed.

Strange. When I diff the .src file between the watch version and the original the watch has a slightly different _i2h function:

--- dfmul.emul.src      2011-06-06 23:16:59.000000000 -0400
+++ watch.src   2011-06-06 23:16:53.000000000 -0400
@@ -125,7 +125,7 @@
 800301ac:      00000000        nop
 800301b0:      00021880        sll     v1,v0,0x2
 800301b4:      3c028003        lui     v0,0x8003
-800301b8:      244223a0        addiu   v0,v0,9120
+800301b8:      24423060        addiu   v0,v0,12384
 800301bc:      00621021        addu    v0,v1,v0
 800301c0:      8c420000        lw      v0,0(v0)
 800301c4:      00000000        nop

There are massive differences in the main function. Very strange. When I shrink the array size to 2 (the two errors) the results are correct. If I remove the bottom 10 elements I still see the error. If I reduce it to 5 elements, the first error goes away. What could cause this? Some kind of bug with the stack when calling a function? When I change N to be 10 when there are only 5 array elements I get the same bug. If I call float64_mul on the first element 10 times I don't see the problem. Is this function call related? Weird, when I comment out the printf I don't get the error. But can't be the printf because “make emulwatch” worked.

Specifically it appears to be the printing of a_input:

      // error:
	  printf ("a_input=%016llx\n", a_input[i]);

      // no error:
	  //printf ("z_output=%016llx\n", z_output[i]);
      
      // no error:
	  //printf ("results=%016llx\n", result);

      // no error
	  //printf ("a_input=%016llx\n", b_input[i]);

It's some kind of bug inside: propagateFloat64NaN(). But when I add a printf after each statement llc dies:

../../../build/bin/llc dfmul.bc -march=mipsel -relocation-model=static -mips-ssection-threshold=0 -mcpu=mips1 -o dfmul.s
llc: /home/acanis/work/legup/llvm/include/llvm/CodeGen/LiveInterval.h:355: llvm::SlotIndex llvm::LiveInterval::beginIndex() const: Assertion `!empty() && "Call to beginIndex() on empty interval."' failed.
Stack dump:
0.      Program arguments: ../../../build/bin/llc dfmul.bc -march=mipsel -relocation-model=static -mips-ssection-threshold=0 -mcpu=mips1 -o dfmul.s 
1.      Running pass 'Function Pass Manager' on module 'dfmul.bc'.
2.      Running pass 'Linear Scan Register Allocator' on function '@propagateFloat64NaN'

I'm not sure if this is related.

Something about printing “a” seems to fix the final errors. Actually if I add a printf for bIsNaN (which is 1) I also fix one of the errors.

  printf ("3: %d\n", bIsNaN);

Why would a printf fix anything?

It's like the if statement isn't working properly…

Okay. With my new modified code “make emulwatch” is now giving me this:

--- lli.txt     2011-06-07 02:02:10.000000000 -0400
+++ sim.txt     2011-06-07 02:02:12.000000000 -0400
@@ -160,14 +160,12 @@
   %87=ffff000000000000
   %88=1
 main:bb3.i26.i
-  %90=1
-main:bb5.i27.i
-  %retval.i.i=ffff000000000000
+  %90=0
 main:float64_mul.exit
-  %181=ffff000000000000
+  %181=3ff8000000000000
   %183=ffff000000000000
-  %184=0
-  %186=0
+  %184=1
+  %186=1
   %189=4
   %exitcond=1
 main:bb2

Looks like the sim.txt is missing a basic block: main:bb5.i27.i

The first difference is:

  %90 = icmp eq i32 %84, 0

Maybe the icmp is invalid in the mips assembly?

bb3.i26.i:                                        ; preds = %float64_is_signaling_nan.exit.i.i
  %90 = icmp eq i32 %84, 0
  %91 = call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([42 x i8]* @23, i32 0, i32 0), i1 %90)
  br i1 %90, label %bb5.i27.i, label %float64_mul.exit
$BB0_40:                                # %bb3.i26.i
                                        #   in Loop: Header=BB0_1 Depth=1
	addiu	$19, $zero, 0
	lui	$16, %hi(__unnamed_24)
	xor	$19, $16, $19
	addiu	$4, $16, %lo(__unnamed_24)
	sltu	$5, $19, 1
	jal	mprintf
	nop
	beq	$16, $zero, $BB0_42
	nop
# BB#41:                                #   in Loop: Header=BB0_1 Depth=1
	lw	$19, 296($sp)
	nop
	j	$BB0_76
	nop
$BB0_42:                                # %bb5.i27.i
                                        #   in Loop: Header=BB0_1 Depth=1
	lui	$2, %hi(__unnamed_25)
	lw	$19, 296($sp)
	nop
	beq	$17, $zero, $BB0_44
	nop
# BB#43:                                # %bb5.i27.i
                                        #   in Loop: Header=BB0_1 Depth=1
	lw	$19, 292($sp)
	nop
$BB0_44:                                # %bb5.i27.i
                                        #   in Loop: Header=BB0_1 Depth=1
	beq	$17, $zero, $BB0_46
	nop
# BB#45:                                # %bb5.i27.i
                                        #   in Loop: Header=BB0_1 Depth=1
	addu	$18, $zero, $21
$BB0_46:                                # %bb5.i27.i
                                        #   in Loop: Header=BB0_1 Depth=1
	addiu	$4, $2, %lo(__unnamed_25)
	addu	$5, $zero, $19
	addu	$6, $zero, $18
	jal	mprintf
	nop
	j	$BB0_76
	nop
...
$BB0_76:                                # %float64_mul.exit
                                        #   in Loop: Header=BB0_1 Depth=1

Why is bb5.i27.i split into so many different basic blocks?

Does this represent the icmp? Yes. If $19 < 1 then $5 = 1, else $5 = 0. If $19 represents %84, then $19 = 0 gives $5 = 1.

	sltu	$5, $19, 1
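A small C model of why “sltu …, 1” works as an equality-with-zero test. This is a sketch of the instruction semantics only, with hypothetical function names:

```c
#include <stdint.h>

/* sltu rd, rs, x: rd = 1 if rs < x when both are treated as unsigned, else 0. */
uint32_t sltu(uint32_t rs, uint32_t x) { return rs < x ? 1u : 0u; }

/* The only unsigned value below 1 is 0, so "sltu rd, rs, 1" computes rs == 0. */
uint32_t is_zero_via_sltu(uint32_t rs) { return sltu(rs, 1u); }
```

So the instruction itself matches the icmp as long as $19 really holds %84. In the listing above, lui writes $16 right before xor $19, $16, $19, and the later beq also tests $16, so it may be worth checking whether $16 still holds %84 at that point. That is only a reading of the listing, not something confirmed here.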

In make watch, %84 seems to be correct. When does %90=0 in gxemul?

It's very hard to correlate the .s file to the final disassembled .src file.

Can I use bugpoint to make this bug smaller?

Andrew Canis 2011/06/06 12:15

Cool tool for calculating lines of code: sloccount

Okay. There is a very strange bug:

int main () {
    volatile unsigned long long testing = 0x7FFFFFFFFFFFFFFFULL;
    printf ("testing=%016llx\n", testing);
}

When I run make emulwatch:

acanis@acanis-desktop:~/work/legup/examples/mips_bug$ make emulwatch
acanis@acanis-desktop:~/work/legup/examples/mips_bug$ diff -u lli.txt sim.txt 
--- lli.txt     2011-06-03 15:51:39.000000000 -0400
+++ sim.txt     2011-06-03 15:51:40.000000000 -0400
@@ -1,2 +1,2 @@
 main:entry
-  %0=7fffffffffffffff
+  %0=ffffffffffffffff

The gxemul emulator seems to sign extend the unsigned long long. What about if it's just a normal 32-bit long? Okay. That matches fine. No sign extend problem. Must be an issue with 64-bit integers. Let's compare the .s assembly with the old version of LLVM. Same problem… Wow. So this is a bug that hasn't been filed yet. The issue must be with something else. I'll file this bug right now.

I think it's just treating an unsigned number as a signed number. Victor mentioned a problem with the ldu instruction.

This works fine:

    volatile unsigned long long testing = 0x7FFFFFFFULL;

But this has the sign extend problem:

    volatile unsigned long long testing = 0xFFFFFFFFULL;

The diff between the above two snippets:

acanis@acanis-desktop:~/work/legup/examples/mips_bug$ diff -u bad.s good.s
--- bad.s       2011-06-03 16:44:36.000000000 -0400
+++ good.s      2011-06-03 16:45:03.000000000 -0400
@@ -18,8 +18,9 @@
        sw      $16, 20($sp)
        sw      $17, 16($sp)
        addiu   $2, $sp, 24
+       lui     $3, 32767
        ori     $2, $2, 4
-       addiu   $3, $zero, -1
+       ori     $3, $3, 65535
        sw      $3, 24($sp)
        sw      $zero, 0($2)
        lui     $3, %hi($.str)

MIPS reference:

LUI -- Load upper immediate
    Description: The immediate value is shifted left 16 bits and stored in the register. The lower 16 bits are zeroes.
    Operation: $t = (imm << 16); advance_pc (4);
ADDIU -- Add immediate unsigned (no overflow)
    Description: Adds a register and a sign-extended immediate value and stores the result in a register
    Operation: $t = $s + imm; advance_pc (4);

In the good case: (32767 << 16) + 65535 = 2147483647. Which is right. But in the bad case: -1 sign extended is all ones. But if $3 is actually 64-bits this will be wrong.
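The two instruction sequences from the diff can be modelled in a few lines of C. This is a sketch under the assumption that a 64-bit register file keeps 32-bit results sign-extended; the function names are hypothetical and only mirror the mnemonics.

```c
#include <stdint.h>

/* lui rt, imm: upper 16 bits = imm, lower 16 bits = 0 (32-bit view). */
uint32_t lui(uint16_t imm) { return (uint32_t)imm << 16; }

/* ori rt, rs, imm: bitwise OR with a zero-extended immediate. */
uint32_t ori(uint32_t rs, uint16_t imm) { return rs | imm; }

/* addiu rt, rs, imm: add a sign-extended immediate, wrapping at 32 bits. */
uint32_t addiu(uint32_t rs, int16_t imm) { return rs + (uint32_t)(int32_t)imm; }

/* Assumed behaviour of a 64-bit register: the 32-bit result is sign-extended. */
uint64_t widen_signed(uint32_t value) { return (uint64_t)(int64_t)(int32_t)value; }
```

Here ori(lui(32767), 65535) is 0x7fffffff and stays 0x000000007fffffff when widened, while addiu(0, -1) is 0xffffffff and widens to 0xffffffffffffffff, which matches the %0=ffffffffffffffff in sim.txt.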

Could it be a problem with the emulator not supporting 64-bit integers? I doubt it.

Looking on the LLVM release notes for 3.0:

Known problems with the MIPS back-end
     * 64-bit MIPS targets are not supported yet.

But shouldn't matter because Tiger MIPS is a 32-bit processor.

When I run 'make tigerwatch' I get:

acanis@acanis-desktop:~/work/legup/examples/mips_bug$ diff -u lli.txt sim.txt 
--- lli.txt     2011-06-03 16:50:21.000000000 -0400
+++ sim.txt     2011-06-03 16:50:27.000000000 -0400
@@ -1,2 +1,2 @@
 main:entry
-  %0=ffffffff
+  %0=0

How are 64-bit integers treated in a MIPS1 ISA? MIPS1 is a 32-bit ISA. So actually “addiu $3, $zero, -1” should be correct.

I'm going to try to step through the code in gxemul. The normal 'make emul' command:

gxemul -E testmips -e R3000 mips_bug.elf -p `../../tiger/linux_tools/lib/../find_ra mips_bug.emul.src`

find_ra finds the return address of the main() function so you know when to break the gxemul simulation. In this case it returns 0xffffffff80031400. When I look in mips_bug.emul.src I see:

80031400:	03e00008 	jr	ra

So you have to pad the breakpoint address with 1's (sign-extend the 32-bit address to 64 bits).

There is a slight difference between mips_bug.emul.src and mips_bug.s. The mips_bug.s file:

	sw	$17, 16($sp)
	addiu	$2, $sp, 24
	ori	$2, $2, 4
	addiu	$3, $zero, -1
	sw	$3, 24($sp)

The mips_bug.emul.src file:

8003138c:	afb10010 	sw	s1,16(sp)
80031390:	27a20018 	addiu	v0,sp,24
80031394:	34420004 	ori	v0,v0,0x4
80031398:	2403ffff 	li	v1,-1
8003139c:	afa30018 	sw	v1,24(sp)

The addiu became an li. Let's step through:

 gxemul -E testmips -e R3000 mips_bug.elf -p 0x80031394

Seems like the registers are actually 64-bit in this machine…

GXemul> s
ffffffff80031394: 34420004      ori     v0,v0,0x0004
GXemul> s
ffffffff80031398: 2403ffff      addiu   v1,zr,-1
GXemul> reg
cpu0:    pc = 0xffffffff8003139c    <main+0x1c>
...
cpu0:    v1 = 0xffffffffffffffff    s3 = 0x0000000000000000

It seems like gxemul is simulating a 64-bit little-endian machine:

GXemul> machine
serial nr: 1  (nr of NICs: 1)
memory: 32 MB
cpu0: 5KE, running
    64-bit Little-endian (MIPS64, revision 2), 48 TLB entries
    L1 I-cache: 32 KB, 32 bytes per line, 2-way
    L1 D-cache: 32 KB, 32 bytes per line, 2-way

Andrew Canis 2011/06/03 12:15

Did I even apply Victor's changes to lib/Target/Mips/MipsRegisterInfo.cpp?

No I didn't. The MIPS backend code exactly matches the git version. So I need to apply Victor's patches manually.

Okay, a bunch of the stack code has been moved into a new file:

MipsFrameLowering.cpp

Okay I've tried to reapply the patch. Only dfmul is failing now.

Seems to be some kind of sign extension problem? In most cases gxemul seems to be sign extending while lli isn't.

Andrew Canis 2011/06/02 12:15

There is a CallInst function called getArgOperand() which I should be using.

They finally fixed the llvm.vim syntax file.

Okay. I fixed the llvm.uadd.with.overflow.* intrinsic problem.

Now I'm down to some gxemul errors for dfmul, llist, loopbug, memset.

Could this be caused by Victor's MIPS changes? Maybe the MIPS backend has been fixed/broken?

Looking into loopbug benchmark, from git commit:

commit 8cdf9e016927d9361144260ccaf87d74e58ebaa8
Author: Andrew Canis <andrew.canis@gmail.com>
Date:   Tue Aug 24 22:20:04 2010 -0400

    Test case for LLVM MIPS backend bug.
    
    Expected:
    $ make
    $ lli loopbug.bc
    
    On MIPS (using gxemul emulator):
    $ make tiger
    $ make emul

Just double checked that simple backup is working again. Looks good.

Can I try this with the unmodified llvm version? Here's the command that produces the mips assembly:

../../build/bin/llc loopbug.bc -march=mipsel -relocation-model=static -mips-ssection-threshold=0 -mcpu=mips1 -o loopbug.s

I just installed the 2.9 binaries in ~/downloads/llvm-2.9-mingw32-i386. Same error with the newer version of llc.

I probably incorporated the mips backend changes incorrectly.

Andrew Canis 2011/06/01 12:15

Whoops, noticed that simple backup wasn't working (isis has gone down).

Mips, gsm fail:

# ** Error: gsm.v(1692): Module 'memset' is not defined.
# ** Error: mips.v(740): Module 'memset' is not defined.

This also causes gxemul to fail:

../../../mips-binutils/bin/mipsel-elf-ld -T ../../../tiger/linux_tools/lib/prog_link.ld -e main gsm.o ../../../tiger/tool_source/lib/altera_avalon_performance_counter.o -o gsm.elf -EL -L ../../../tiger/linux_tools/lib -lgcc -lfloat -luart
make[1]: Leaving directory `/home/acanis/work/legup/examples/chstone/gsm'
gsm.o: In function `main':
(_main_section+0x118): undefined reference to `memset'

Looking at the memset test. It looks like the legup versions aren't being linked properly.

../../build/bin/llvm-ld  memset.prelto.bc ../lib/llvm/liblegup.a -b=memset.premodulo.bc

First of all it seems like the intrinsic lowering pass is no longer working properly:

acanis@acanis-desktop:~/work/legup/examples/memset$ diff -u memset.prelto.ll memset.premodulo.ll 
...
-declare void @llvm.memcpy.p0i8.p0i8.i32(i8* nocapture, i8* nocapture, i32, i32, i1) nounwind
-
-declare void @llvm.memset.p0i8.i64(i8* nocapture, i8, i64, i32, i1) nounwind
-
 declare i8* @memcpy(i8*, i8*, i32)
 
 declare i8* @memset(i8*, i32, i32)

There are still two intrinsic calls in there. Rebuilding ../lib/llvm/liblegup.a doesn't help. There is some sort of problem. Basically memset() is not being linked into memset.premodulo.ll

So previously the prelto pass replaces:

  call void @llvm.memset.i64(i8* %arr_addr.04.1.i31, i8 0, i64 11, i32 1) nounwind

With:

  %16 = call i8* @legup_memset_1(i8* %arr_addr.04.1.i31, i8 0, i64 11) ; <i8*> [#uses=0]

The postfix “_1” indicates a 1 byte alignment. The type of the 3rd argument (length) is i64.

I see in the LLVM manual for the SVN head (http://llvm.org/docs/) that the function name has changed:

  declare void @llvm.memset.p0i8.i32(i8* <dest>, i8 <val>, i32 <len>, i32 <align>, i1 <isvolatile>)
  declare void @llvm.memset.p0i8.i64(i8* <dest>, i8 <val>, i64 <len>, i32 <align>, i1 <isvolatile>)

It's pretty amazing how fast LLVM changes. We're at the 2.7 release and 3.0 is coming out soon. The old 2.7 syntax (from http://llvm.org/releases/2.7/docs/LangRef.html#int_memset):

  declare void @llvm.memset.i8(i8 * <dest>, i8 <val>, i8 <len>, i32 <align>)
  declare void @llvm.memset.i16(i8 * <dest>, i8 <val>, i16 <len>, i32 <align>)
  declare void @llvm.memset.i32(i8 * <dest>, i8 <val>, i32 <len>, i32 <align>)
  declare void @llvm.memset.i64(i8 * <dest>, i8 <val>, i64 <len>, i32 <align>)

In release notes for 2.8:

 The memcpy, memmove, and memset intrinsics now take address space qualified
 pointers and a bit to indicate whether the transfer is "volatile" or not.

Our PreLTO seems to be failing. The intrinsics are successfully turned into memcpy calls but those should then be turned into legup_memset_* calls.

It's strange. The old version of the code doesn't lower anything. While the newer version prints (in -debug mode):

Lowering:   call void @llvm.memcpy.p0i8.p0i8.i32(i8* %carray1, i8* getelementptr inbounds ([12 x i8]* @C.17.1564, i32 0, i32 0), i32 12, i32 1, i1 false)
New instruction:   %0 = call i8* @memcpy(i8* %carray1, i8* getelementptr inbounds ([12 x i8]* @C.17.1564, i32 0, i32 0), i32 12)

Wow really strange. After touching the file PreLTO.cpp I now get this error: unknown instruction on intrinsic argument

UNREACHABLE executed at /home/acanis/work/legup/llvm/lib/Transforms/LegUp/PreLTO.cpp:164!
Stack dump:
0.      Program arguments: ../../build/bin/opt -load=../../build/lib/LLVMLegUp.so -legup-prelto
1.      Running pass 'Function Pass Manager' on module '<stdin>'.
2.      Running pass 'Pre-Link Time Optimization Pass to lower intrinsics' on function '@main'
/bin/bash: line 1: 15114 Aborted                 ../../build/bin/opt -load=../../build/lib/LLVMLegUp.so -legup-prelto < memset.prelto.linked.bc > memset.prelto.bc

Did the makefile not build this properly before?

Okay, so the code isn't handling getelementptr's properly. Actually I'm a little bit confused by the code. The legup_* suffix is determined by the destination pointer…

  call void @llvm.memcpy.i32(i8* %carray1, i8* getelementptr inbounds ([12 x i8]* @C.17.1564, i32 0, i32 0), i32 12, i32 1)
  call void @llvm.memcpy.i32(i8* %sarray2, i8* bitcast ([12 x i16]* @C.18.1565 to i8*), i32 24, i32 2)
  call void @llvm.memcpy.i32(i8* %array3, i8* bitcast ([12 x i32]* @C.19.1566 to i8*), i32 48, i32 4)
  call void @llvm.memcpy.i32(i8* %larray4, i8* bitcast ([12 x i64]* @C.20.1567 to i8*), i32 96, i32 8)

Turns into:

  %0 = call i8* @legup_memcpy_1(i8* %carray1, i8* getelementptr inbounds ([12 x i8]* @C.17.1564, i32 0, i32 0), i32 12) ; <i8*> [#uses=0]
  %1 = call i8* @legup_memcpy_2(i8* %sarray2, i8* bitcast ([12 x i16]* @C.18.1565 to i8*), i32 24) ; <i8*> [#uses=0]
  %2 = call i8* @legup_memcpy_4(i8* %array3, i8* bitcast ([12 x i32]* @C.19.1566 to i8*), i32 48) ; <i8*> [#uses=0]
  %3 = call i8* @legup_memcpy_8(i8* %larray4, i8* bitcast ([12 x i64]* @C.20.1567 to i8*), i32 96) ; <i8*> [#uses=0]

Why can't you just use the alignment parameter? For instance:

Lowering for LegUp:   call void @llvm.memset.p0i8.i64(i8* %16, i8 0, i64 96, i32 8, i1 false)

The destination is: %16 = bitcast [12 x i64]* %larray to i8*, which points to an array of i64's, so the alignment is calculated to be 8 (64/8).
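The suffix calculation described here is simple enough to sketch in C. Both helper names are hypothetical and only illustrate the arithmetic (element width in bits divided by 8); this is not the PreLTO code.

```c
#include <stdio.h>

/* Hypothetical helper: byte alignment implied by the element width in bits. */
unsigned align_from_elem_bits(unsigned elem_bits) { return elem_bits / 8; }

/* Hypothetical helper: writes e.g. "legup_memset_8" into buf.
 * Returns the snprintf result (length of the full name). */
int legup_name(char *buf, size_t len, const char *base, unsigned elem_bits)
{
    return snprintf(buf, len, "legup_%s_%u", base, align_from_elem_bits(elem_bits));
}
```

For a destination that points to i64 elements, legup_name(buf, sizeof buf, "memset", 64) produces "legup_memset_8". For i8 elements the suffix is _1, matching the legup_memcpy_1 call above.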

Damn, I just got hit by the new API change again: the CallInst operand order changed. The function is now stored as the last operand instead of the first.

Okay that worked. Down to 16 failures. I have a couple of unexplained gxemul simulation errors…

dfdiv, dfmul, dfsin, sha:

LLVM ERROR: Code generator does not support intrinsic function 'llvm.uadd.with.overflow.i64'!

The actual error comes from:

lib/CodeGen/IntrinsicLowering.cpp:353:    report_fatal_error("Code generator does not support intrinsic function '"+

From the LLVM docs:

The 'llvm.uadd.with.overflow' family of intrinsic functions perform an unsigned
addition of the two arguments, and indicate whether a carry occurred during the
unsigned summation.

So I get code looking like:

  %uadd.i = call %0 @llvm.uadd.with.overflow.i64(i64 %105, i64 %106) nounwind
  %108 = extractvalue %0 %uadd.i, 0
  %109 = extractvalue %0 %uadd.i, 1

Which could easily be converted to Verilog: {a, b} = c + d;

But what's the best way of handling this? I think the easiest way is to turn this into an i65 addition. And shift out the carry bit. Quartus should easily optimize this to the correct hardware. I'll just add this to the PreLTO pass.
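For reference, the semantics of the intrinsic can be written in plain C. C has no 65-bit integer type, so this sketch detects the carry by checking for wraparound instead of doing a literal i65 addition; the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Mirrors the { i64, i1 } pair returned by llvm.uadd.with.overflow.i64. */
struct uadd_result {
    uint64_t sum;    /* low 64 bits of a + b */
    unsigned carry;  /* 1 if the full sum does not fit in 64 bits */
};

struct uadd_result uadd_with_overflow_u64(uint64_t a, uint64_t b)
{
    struct uadd_result r;
    r.sum = a + b;          /* unsigned addition wraps modulo 2^64 */
    r.carry = r.sum < a;    /* the sum wrapped iff it is below an operand */
    return r;
}
```

The sum field corresponds to extractvalue index 0 and the carry field to index 1. In hardware the carry is just bit 64 of the 65-bit sum, which is what the i65 approach extracts.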

Andrew Canis 2011/05/31 12:15

mips intrinsic error with new llvm version:

../../../build/bin/opt: symbol lookup error: ../../../build/lib/LLVMLegUp.so:
undefined symbol: _ZN4llvm17IntrinsicLowering18LowerIntrinsicCallEPNS_8CallInst

This is a linker error I experienced previously. I fixed the autoconf makefile flow, now I need to fix cmake.

I need to include: LDFLAGS = $(LLVM_OBJ_ROOT)/lib/CodeGen/$(BuildMode)/IntrinsicLowering.o

How do I handle this in cmake?

What does this mean?

add_llvm_loadable_module( LLVMLegUp
..

This creates the build/lib/LLVMLegUp.so shared library. There are only a few other examples of this in the code. This doesn't fix it:

add_dependencies(LLVMLegUp LLVMCodeGen)

I need to actually link the LLVMCodeGen library into the LLVMLegUp.so library:

target_link_libraries(LLVMLegUp LLVMCodeGen)

Okay adding this to /llvm/lib/Transforms/LegUp/CMakeLists.txt works.

Andrew Canis 2011/05/30 15:05

There's an interesting discussion on the CBackend on the LLVM mailing list: http://lists.cs.uiuc.edu/pipermail/llvmdev/2010-November/036278.html

Chris suggests a full rewrite if anyone wants to work on the CBackend:

If anyone was really interested in this, I'd strongly suggest a complete
rewrite of the C backend: make use the existing target independent code
generator code (for legalization etc) and then just put out a weird ".s file"
at the end.
-Chris

So I've finished iterative modulo scheduling for a simple example with no recurrences:

    int a[N], b[N], c[N];
    for (i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }
    return a[N-1];

But it takes 989ns = ~500 cycles. The II=3 so it should take 300 cycles. The prologue and epilogue both require 2 basic blocks.
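The arithmetic behind those two numbers, as a sketch. It assumes a 2 ns clock period (the simulation times in this log are divided by 2) and N = 100 iterations; both function names are hypothetical.

```c
/* Cycle count from simulated time, assuming a fixed clock period in ns. */
unsigned cycles_from_ns(unsigned time_ns, unsigned period_ns)
{
    return time_ns / period_ns;   /* integer division rounds down */
}

/* Steady-state estimate for a pipelined loop: one new iteration every II
 * cycles. Prologue and epilogue cycles are ignored. */
unsigned pipelined_cycles(unsigned iterations, unsigned ii)
{
    return iterations * ii;
}
```

cycles_from_ns(989, 2) is 494 and pipelined_cycles(100, 3) is 300, so the measured run is roughly 200 cycles slower than the steady-state estimate.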

I need to fix the prologue to branch to the epilog depending on the loop bound. For instance if N=1 then the kernel should be skipped.

Is there an easy way to generate a gantt chart for the reservation table? PSTricks seems to have gantt chart generation for LaTeX. Okay, found a good sty here: http://www.martin-kumm.de/tex_gantt_package.php

Added a debug macro for legup. Use the option '-debug-only=legup' to only show debugging from LegUp.

I don't understand how this is executing in legup in 1000ns/2=~500 cycles. This means the loop body only takes 5 cycles when it should take 6. Seems like the getelementptr has been chained. It's weird though, because I see it gets scheduled in separate states at one point. Then gets chained in later. What is going on here? The chaining happens somewhere between SchedulerASAP::scheduleBasicBlock() and SchedulerPass::createFSMforBB()

I'm noticing that the scheduler needs to be completely revamped. There is tons of copy-pasted code all over the place. For instance, looking in the SchedulerMapping::createFSM() function, this looks like an exact copy of the ASAP scheduler code. And the schedulerPass has the exact same copied code too. Why does the DAG need its own custom asap code? And then this code is repeated again in simpleASAPScheduler.

Okay so I think the bug was this code in SimpleASAPScheduler::getSoonestStateRegUses():

if (depIn->getAsapDelay() + in->getDelay() > InstructionNode::getMaxDelay()) {

Should be this:

if (depIn->getAsapDelay() >= InstructionNode::getMaxDelay()
        || depIn->getAsapDelay() + in->getDelay() > InstructionNode::getMaxDelay()) {

Basically, you look at the predecessors of the current instruction (depIn). If they have an asapDelay that's equal or greater than the getMaxDelay then you _must_ be in the next state. Otherwise, _only_ if the asapDelay of the predecessor + the delay of the current instruction is _greater_ than the maxDelay would you need to be moved to the next state (there isn't enough room for you to be in the current state with the predecessor).
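The rule above, restated as a standalone C sketch so the old and new conditions can be compared side by side. The function names and plain double arguments are assumptions for illustration; the real code works on InstructionNode objects.

```c
/* New rule: returns 1 if the instruction must go in the state after its
 * predecessor. */
int needs_next_state(double pred_asap_delay, double inst_delay, double max_delay)
{
    /* The predecessor already fills the state, so nothing can be chained. */
    if (pred_asap_delay >= max_delay)
        return 1;
    /* Otherwise move only if chaining would exceed the delay budget. */
    return pred_asap_delay + inst_delay > max_delay;
}

/* Old rule, kept for comparison. */
int needs_next_state_old(double pred_asap_delay, double inst_delay, double max_delay)
{
    return pred_asap_delay + inst_delay > max_delay;
}
```

For non-negative delays the two rules differ only when the predecessor's delay equals the maximum and the current instruction has zero delay: the old rule chains it, the new rule moves it to the next state. That would fit the chained getelementptr seen earlier if getelementptr is modelled with zero delay, which is an assumption here.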

Andrew Canis 2011/04/25 15:05

Finished calculating heights of the dependence graph. I need to fix the schedulerDAG - why are reg/mem dependencies split up?

Also a bigger question: what is the delay of an instruction? It's actually determined by the schedule and depends on the chaining that happens. There might be an opportunity here to improve the algorithm.

For some reason, the loads are aliasing:

  %scevgep7 = getelementptr [100 x i32]* %b, i32 0, i32 %i.06
  %scevgep8 = getelementptr [100 x i32]* %c, i32 0, i32 %i.06
  %0 = load i32* %scevgep7, align 4
  %1 = load i32* %scevgep8, align 4

Even though %b and %c clearly don't alias…

Running -print-alias-sets:

../../build/bin/opt -legup-config=../../hwtest/CycloneII.tcl -load=../../build/lib/LLVMPolly.so -basicaa -print-alias-sets  -modulo-schedule pipeline.premodulo.bc > pipeline.bc 
Alias Set Tracker: 3 alias sets for 3 pointer values.
  AliasSet[0xa7957f0, 1] must alias, Ref       Pointers: (i32* %scevgep7, 4)
  AliasSet[0xa795820, 1] must alias, Ref       Pointers: (i32* %scevgep8, 4)
  AliasSet[0xa795850, 1] must alias, Mod       Pointers: (i32* %scevgep, 4)

The mem dependence uses are:

%0 = load i32* %scevgep7, align 4 uses: %1 = load i32* %scevgep8, align 4
%0 = load i32* %scevgep7, align 4 uses: store i32 %2, i32* %scevgep, align 4

%1 = load i32* %scevgep8, align 4 uses: store i32 %2, i32* %scevgep, align 4

It's like everything is considered aliased… Not sure why this is happening. Don't have time to fix it now.

So the heights look good. Except for the aliasing issue between the loads.

Height: 6:   %i.06 = phi i32 [ 0, %bb.nph ], [ %3, %bb ]
Height: 6:   %scevgep7 = getelementptr [100 x i32]* %b, i32 0, i32 %i.06
Height: 5:   %0 = load i32* %scevgep7, align 4
Height: 4:   %scevgep8 = getelementptr [100 x i32]* %c, i32 0, i32 %i.06
Height: 3:   %1 = load i32* %scevgep8, align 4
Height: 2:   %3 = add nsw i32 %i.06, 1
Height: 1:   %scevgep = getelementptr [100 x i32]* %a, i32 0, i32 %i.06
Height: 1:   %2 = add nsw i32 %1, %0
Height: 1:   %exitcond = icmp eq i32 %3, 100
Height: 0:   br i1 %exitcond, label %bb2, label %bb
Height: 0:   store i32 %2, i32* %scevgep, align 4
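
For reference, a small sketch of one way to compute heights like these. It is not the LegUp code: the Node struct, the function names and the per-instruction delays are assumptions. The rule used is "a sink has height 0, otherwise height = own delay + max height of the users".

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One node of a hypothetical dependence DAG. `succs` holds the indices of the
// nodes that depend on this one; `delay` is in cycles.
struct Node {
    int delay;
    std::vector<int> succs;
};

// Height of node n: 0 for a sink, otherwise the node's own delay plus the
// largest height among its successors. `memo` caches results (-1 = unknown).
int height(const std::vector<Node>& g, int n, std::vector<int>& memo) {
    assert(n >= 0 && n < static_cast<int>(g.size()));
    if (memo[n] >= 0) return memo[n];
    int best = -1;
    for (int s : g[n].succs) best = std::max(best, height(g, s, memo));
    memo[n] = (best < 0) ? 0 : best + g[n].delay;
    return memo[n];
}

// Heights for every node of the DAG (the graph must be acyclic, so the
// loop back edge into the phi is left out).
std::vector<int> allHeights(const std::vector<Node>& g) {
    std::vector<int> memo(g.size(), -1);
    for (int n = 0; n < static_cast<int>(g.size()); ++n) height(g, n, memo);
    return memo;
}
```

Assuming delays of 0 for the phi, 2 for a load and 1 for getelementptr/add/icmp, and including the spurious load-to-load edge, this rule gives the same numbers as the listing above (6, 6, 5, 4, 3, 2, 1, 1, 1, 0, 0).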

Andrew Canis 2011/04/22 15:05

Moving to cmake (for polly). Debug build:

acanis@acanis-desktop:~/work/legup/build$ cmake ../llvm/ -DCMAKE_BUILD_TYPE=Debug -DCMAKE_PREFIX_PATH=/home/acanis/work/polly/cloog/install/

Had to add Verilog as a target to llvm/CMakeLists.txt

With cmake you can just say:

make llc

I can't include polly as an analysis pass in the backend; it won't work with the build system. I need to make modulo scheduling a prepass anyway, so I'll just do all the development in the polly folder to simplify the build issues.

Created a simple example with no loop recurrences:

    for (i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }

Takes 1007ns/2 = 500 cycles. The .ll:

bb:                                               ; preds = %bb, %bb.nph
  %i.04 = phi i32 [ 0, %bb.nph ], [ %3, %bb ]
  %scevgep = getelementptr [100 x i32]* %a, i32 0, i32 %i.04
  %scevgep5 = getelementptr [100 x i32]* %b, i32 0, i32 %i.04
  %scevgep6 = getelementptr [100 x i32]* %c, i32 0, i32 %i.04
  %0 = volatile load i32* %scevgep5, align 4
  %1 = volatile load i32* %scevgep6, align 4
  %2 = add nsw i32 %1, %0
  volatile store i32 %2, i32* %scevgep, align 4
  %3 = add nsw i32 %i.04, 1
  %exitcond = icmp eq i32 %3, 100
  br i1 %exitcond, label %bb2, label %bb

I would expect this to take 5 cycles: the two loads can be pipelined sequentially, so 3 cycles. Then the add takes 1 cycle and the store takes 1 cycle. So 5 cycles * 100 = 500. Wait. But what about the getelementptr instructions? Shouldn't this take 6 cycles? What can I pipeline this to? The resource minimum initiation interval (ResMII) is limited by memory operations, of which there are 3. There is only a single memory port. Therefore ResMII = 3/1 = 3. So fully pipelined this loop should take 300 cycles. There are no recurrences.
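
The ResMII arithmetic as a tiny sketch. The function names are made up, and the cycle estimate ignores the prologue/epilogue needed to fill and drain the pipeline.

```cpp
#include <cassert>

// Resource-constrained minimum initiation interval for one resource class:
// ceil(operations per iteration / available units).
int resMII(int opsPerIteration, int units) {
    assert(opsPerIteration >= 0 && units > 0);
    return (opsPerIteration + units - 1) / units;
}

// Rough steady-state cycle count for a pipelined loop with initiation
// interval `ii`; pipeline fill and drain are not counted.
long pipelinedCycles(int ii, long iterations) {
    assert(ii > 0 && iterations >= 0);
    return static_cast<long>(ii) * iterations;
}
```

For this loop, resMII(3, 1) is 3 and pipelinedCycles(3, 100) is 300; a second memory port would bring ResMII down to 2.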

So you can't modify the LLVM IR when iterating over SCoPs:

// Because they operate on Polly IR, not the LLVM IR, ScopPasses are not allowed
// to modify the LLVM IR. Due to this limitation, the ScopPass class takes

All I want to do is detect and iterate over the SCoPs in the LLVM IR! What's the point of polly if you can't modify the LLVM?

Andrew Canis 2011/04/21 15:05

Okay. I finally figured out how to handle the fact we have an upstream branch of LLVM in our repository. git subtree merging.

git submodules don't work. They are too complex and don't fit our workflow. I don't really want to have to run 'git submodule update' every time I modify something in the llvm directory. Instead what I want is a branch of LLVM that tracks the upstream changes. Then occasionally I want to merge in the latest LLVM upstream changes into that branch and then merge it into mainline. The best thing about git subtree is it doesn't change the workflow of anyone else working with LegUp. It's just up to me to occasionally merge these changes.

git remote add llvm_remote http://llvm.org/git/llvm.git
git fetch llvm_remote

There are a bunch of releases:

 * [new branch]      master     -> llvm_remote/master
...
 * [new branch]      release_27 -> llvm_remote/release_27
 * [new branch]      release_28 -> llvm_remote/release_28
 * [new branch]      release_29 -> llvm_remote/release_29

We were at the LLVM 2.7 release.

git checkout -b llvm_2.7 llvm_remote/release_27
git checkout -b llvm_branch llvm_remote/master

Looking in the release27 branch, the last commit of 2.7 is:

commit 4bbf07421f101f00f4272927b60f7a8383b5cecf
Author: Tanya Lattner <tonic@nondot.org>
Date:   Tue Apr 27 06:53:59 2010 +0000

    Commit 2.7 release notes.
    Update getting started guide for 2.7
    
    
    git-svn-id: https://llvm.org/svn/llvm-project/llvm/branches/release_27@102412 91177308-0d34-0410-b5e6-96231b3b80d

Very interesting. This commit doesn't exist in the LLVM mainline. I guess this makes sense. They created an svn branch to track the release of 2.7.

Probably the best way to deal with this is forget about 2.7, just grab the latest git repo.

git read-tree --prefix=llvm-git/ -u llvm_branch

The llvm commit head is:

commit e5ff344fc03351eaf8bb3303d0fe359378c09684

Now when I git mv into the subtree I lose all the previous history. No, I just need to use:

git log --follow

git annotate still works. Okay so this is fine.

  Added LLVM git mainline as a git subtree
  
  Used the command:
  git read-tree --prefix=llvm-git/ -u llvm_branch
  
  The latest LLVM git commit in llvm_branch was:
  commit e5ff344fc03351eaf8bb3303d0fe359378c09684

Is there any way I can merge it into my existing folder? I don't really understand how the subtree feature works…

Okay, I tried a new strategy. Removed llvm-git and just ran:

git merge --squash -s subtree --no-commit llvm_branch

It somehow detected that llvm/ was where the merge should happen. git just completely removed the Verilog directory! Just manually go through and fix the merge.

I wonder if this would work better:

git read-tree --prefix=llvm/ -m -u llvm_branch

Nope. Doesn't work. Can't use the --prefix and -m options together.

I think Victor will have to merge in:

lib/Target/Mips/MipsRegisterInfo.cpp

Something is wrong with the MemoryDependenceAnalysis pass. It's giving me a seg fault. There's a new LLVM idiom for passes:

-static RegisterPass<GVN> X("gvn",
-                           "Global Value Numbering");
+INITIALIZE_PASS_BEGIN(GVN, "gvn", "Global Value Numbering", false, false)
+INITIALIZE_PASS_DEPENDENCY(MemoryDependenceAnalysis)
+INITIALIZE_PASS_DEPENDENCY(DominatorTree)
+INITIALIZE_AG_DEPENDENCY(AliasAnalysis)
+INITIALIZE_PASS_END(GVN, "gvn", "Global Value Numbering", false, false)
 

Another example:

// Register the default SparcV9 implementation...
-static RegisterPass<TargetData> X("targetdata", "Target Data Layout", false, 
-                                  true);
+INITIALIZE_PASS(TargetData, "targetdata", "Target Data Layout", false, true)
 char TargetData::ID = 0;
 

Wow. Really annoying compiler error:

SchedulerPass.cpp:50: error: ‘void llvm::initializeSchedulerASAPPass(llvm::PassRegistry&)’ should have been declared inside ‘llvm’

Fix by adding:

namespace llvm {
void initializeSchedulerASAPPass(llvm::PassRegistry&);
}

Where is initializeSchedulerASAPPass() being called from? There is a short note about the change here: http://lists.cs.uiuc.edu/pipermail/llvmdev/2010-July/033293.html Detailed note: http://permalink.gmane.org/gmane.comp.compilers.llvm.devel/35362

There was also an API change with CallInst operand order. The function is now stored as the last operand instead of the first.

    for (CallSite::arg_iterator AI = CI->op_begin()+1, AE = CI->op_end()-1; AI != AE; ++AI) {

PreLTO required IntrinsicLowering so I kept getting the error:

../../../llvm/Debug+Asserts/bin/opt: symbol lookup error: ../../../llvm/Debug+Asserts/lib/LLVMLegUp.so: undefined symbol: _ZN4llvm17IntrinsicLowering18LowerIntrinsicCallEPNS_8CallInstE

I finally figured out you need to add this to the Transforms/LegUp Makefile:

LDFLAGS = $(LLVM_OBJ_ROOT)/lib/CodeGen/$(BuildMode)/IntrinsicLowering.o 

I'm seeing this weird case with overflow intrinsics… llvm-ld seems to produce them.

../../../llvm/Debug+Asserts/bin/llc -march=c dfmul.prelto.linked.bc
LLVM ERROR: Code generator does not support intrinsic function 'llvm.uadd.with.overflow.i64'!

Even the CBackend doesn't support these intrinsics. But the CppBackend does. Okay, I'm reading in the release notes that the CBackend is no longer actively maintained. So I should just be looking at the CppBackend.

If I turn on NO_INLINE (to prevent link time optimization) I can prevent these intrinsics from being generated.

Tons of gxemul errors. PreLTO seems to be failing occasionally with:

make[1]: Leaving directory `/home/acanis/work/legup/examples/memset'
unknown instruction on intrinsic argument
UNREACHABLE executed at PreLTO.cpp:164!

Dhrystone has a strange error:

# ** Error: dhry.v(5687): Module 'legup_memcpy_' is not defined.

I don't really have time to look into this anymore.

git cloned polly in the tools directory. Had to run 'make clean' on my llvm because cmake didn't like the in-source build.

acanis@acanis-desktop:~/work/legup/build$ cmake ../llvm/ -DCMAKE_PREFIX_PATH=/home/acanis/work/polly/cloog/install/

Okay. Added a few temporary modifications to llc.cpp so I can load in the shared library. Appears to work:

acanis@acanis-desktop:~/work/legup/build$ bin/llc -load lib/LLVMPolly.so

Andrew Canis 2011/04/20 15:05

The changes I pushed yesterday improved the geomean Time by 8%. Geomean Fmax went up and cycles went down.

Looking into installing Polly in ~/work/polly

Polly:

git clone http://llvm.org/git/llvm.git
cd llvm/tools
git clone git://repo.or.cz/polly.git

ISL/CLooG:

git clone git://repo.or.cz/cloog.git
cd cloog
./get_submodules.sh
./autogen.sh
./configure --prefix=$HOME/work/polly/cloog/install
make
make install

Now building polly:

cd ~/work/polly
mkdir build
cd build
cmake ../llvm -DCMAKE_PREFIX_PATH=$HOME/work/polly/cloog/install
make

Great setup btw. I should make LegUp more like this! Especially the use of submodules. The thing is, we have made some modifications to LLVM to add tcl and fix the MIPS backend.

Now modifying my path:

export PATH=~/work/polly/build/bin/:$PATH

Looking at examples:

cd ~/work/polly/test
make

Polly can't seem to deal with constant integers. For instance, the dependencies are detected if I use:

#define N 1024

But not with:

const int N = 1024;

Polly seems to work. For this example:

for (i = 2; i < N; i++) {
    array[i] = array[i-2]+1;
}

Detects a distance of 2:

Printing analysis 'Polly - Calculate dependences for Scop' for region: '%2 => %7' in function 'main':
    Must dependences:
        { Stmt_3[i0] -> Stmt_3[2 + i0] : i0 >= 0 and i0 <= 95; Stmt_3[i0] -> FinalRead[0] : i0 >= 0 and i0 <= 97 }
    May dependences:
        {  }
    Must no source:
        { FinalRead[0] -> MemRef_array[o0] : o0 >= 100 or o0 <= 1; Stmt_3[i0] -> MemRef_array[i0] : i0 >= 0 and i0 <= 1 }
    May no source:
        {  }
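
As a sanity check on the must-dependence relation, here is a brute-force sketch that enumerates the read-after-write pairs for this loop. The function name is made up, and N = 100 is assumed to match the bounds in the output (i0 runs from 0 to 97).

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Enumerate flow (read-after-write) dependences of
//   for (i = 2; i < n; i++) array[i] = array[i-2] + 1;
// Iterations are numbered from 0 (i0 = i - 2), as in the Polly output.
// Returns (source iteration, sink iteration) pairs.
std::vector<std::pair<int, int>> flowDependences(int n) {
    assert(n >= 2);
    const int iterations = n - 2;
    std::vector<std::pair<int, int>> deps;
    for (int src = 0; src < iterations; ++src) {
        const int written = src + 2;      // element stored by iteration src
        for (int dst = src + 1; dst < iterations; ++dst) {
            const int read = dst;         // element loaded by iteration dst
            if (read == written) deps.emplace_back(src, dst);
        }
    }
    return deps;
}
```

For n = 100 this yields the pairs (0, 2) through (95, 97), all at distance 2, which agrees with Stmt_3[i0] -> Stmt_3[2 + i0] for 0 <= i0 <= 95.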

Probably the fastest way to get this up and running is to use the .json file export:

{
	"name": "%2 => %7",
	"context": "{ [] }",
	"statements": [{
		"name": "Stmt_3",
		"domain": "{ Stmt_3[i0] : i0 >= 0 and i0 <= 97 }",
		"schedule": "{ Stmt_3[i0] -> scattering[0, i0, 0] }",
		"accesses": [{
			"kind": "read",
			"relation": "{ Stmt_3[i0] -> MemRef_array[i0] }"
		},
		{
			"kind": "write",
			"relation": "{ Stmt_3[i0] -> MemRef_array[2 + i0] }"
		}]
	}]
}

But how do I map 'Stmt_3' to the actual LLVM IR instruction? Looking in ScopInfo.cpp. Looks like 3 is the name of the basic block (label %3) with % stripped off. Similarly with MemRef_array (%array with % stripped).
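
A small sketch of the reverse mapping, assuming the naming scheme really is just "Stmt_" plus the block name with the % removed (that is only my reading of ScopInfo.cpp here). The helper name is made up.

```cpp
#include <string>

// Hypothetical helper: turn a Polly statement name such as "Stmt_3" back into
// the LLVM label "%3". Returns an empty string if the prefix is missing.
std::string stmtToBlockLabel(const std::string& stmtName) {
    const std::string prefix = "Stmt_";
    if (stmtName.compare(0, prefix.size(), prefix) != 0) return "";
    return "%" + stmtName.substr(prefix.size());
}
```

The same idea would apply to MemRef_array and %array with a different prefix.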

json is missing all the dependencies. How can I actually integrate this? I need to use the dependencies analysis class:

CodeGeneration.cpp:    Dependences *DP = &getAnalysis<Dependences>();
Pocc.cpp:  Dependences *D = &getAnalysis<Dependences>();

What's the quickest way I can integrate this into LegUp? LegUp runs in the backend llc. Can llc load a library like opt can? llc gives an error when trying to load the Polly library:

Error opening '/home/acanis/work/polly/build/lib/LLVMPolly.so': /home/acanis/work/polly/build/lib/LLVMPolly.so: undefined symbol: _ZNK4llvm10RegionPass17createPrinterPassERNS_11raw_ostreamERKSs
  -load request ignored.

What is RegionPass::createPrinterPass? Do I just need to extend llc to include this? Added it manually to llc (based on opt). Now I get:

Error opening '/home/acanis/work/polly/build/lib/LLVMPolly.so': /home/acanis/work/polly/build/lib/LLVMPolly.so: undefined symbol: _ZTVN4llvm15AliasSetTracker13ASTCallbackVHE

Okay by moving these lines into llc.cpp:

  PM.add(new RegionPassPrinter(NULL, Out->os()));
  createStandardModulePasses(&PM, 3,
                             /*OptimizeSize=*/ false,
                             /*UnitAtATime=*/ true,
                             /*UnrollLoops=*/ true,
                             true,
                             /*HaveExceptions=*/ true,
                             NULL);

I can now load the Polly library.

Next step is the fact that Polly requires the latest git version of LLVM. What's the easiest way to do this? Compiling. Used configure instead of cmake, since cmake isn't configured for the tcl changes we made to legup. Errors: Wow. Their makefile completely doesn't work. So many bugs. Can I fix the tcl problem instead?

Needed to add the polly directory to tools/Makefile.

I think I got it:

INCLUDE(FindTCL)
if (TCL_FOUND)
    SET(CMAKE_CXX_FLAGS ${CMAKE_CXX_FLAGS} "-I ${TCL_INCLUDE_PATH}")
endif()

Now I get a link error:

../../lib/libLLVMTarget.a(LegupTcl.cpp.o): In function `legup::parseTclFile(std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, legup::LegupConfig*)':
LegupTcl.cpp:(.text+0x15): undefined reference to `Tcl_CreateInterp'

Okay, fixed cmake:

INCLUDE(FindTCL)
if (TCL_FOUND)
    # add in the Tcl values if found
    IF (TCL_INCLUDE_PATH)
        INCLUDE_DIRECTORIES(${TCL_INCLUDE_PATH})
    ENDIF (TCL_INCLUDE_PATH)
    IF (TCL_LIB_PATH)
        LINK_DIRECTORIES (${TCL_LIB_PATH})
    ENDIF (TCL_LIB_PATH)
    IF (TCL_LIBRARY)
        LINK_LIBRARIES (${TCL_LIBRARY})
    ENDIF (TCL_LIBRARY)
endif()

Andrew Canis 2011/04/19 15:05

Creating a simple example of pipelining in examples/pipeline. First just a simple, parallel loop with no loop carried dependencies:

do i = 1,100
    a(i) = 1

LegUp compiles this and it takes 811ns/2=406 cycles. I would expect this to take 2 cycles per store, which can be pipelined with incrementing the i induction variable. So 2*100=200 cycles. Where do the other 200 cycles come from? Okay, it's because I disabled all LLVM optimizations. Turned them back on. Now 607ns/2 = 304 cycles. Still off. The loop body:

bb:                                               ; preds = %bb, %bb.nph
  %i.04 = phi i32 [ 0, %bb.nph ], [ %1, %bb ]     ; <i32> [#uses=2]
  %tmp = shl i32 %i.04, 2                         ; <i32> [#uses=1]
  %scevgep = getelementptr [400 x i8]* %0, i32 0, i32 %tmp ; <i8*> [#uses=1]
  %scevgep5 = bitcast i8* %scevgep to i32*        ; <i32*> [#uses=1]
  store i32 1, i32* %scevgep5, align 4
  %1 = add nsw i32 %i.04, 1                       ; <i32> [#uses=2]
  %exitcond = icmp eq i32 %1, 100                 ; <i1> [#uses=1]
  br i1 %exitcond, label %bb2, label %bb

Phi gets removed. Okay, the array offset expression (shl) will take 1 cycle. Adding will take 1 cycle, then the exit cond will take another cycle. The store will take 2 cycles. Still seems like the loop should be taking 2 cycles. Unless the getelementptr and bitcast don't get chained.

Can I print out a dot graph of the dependency graph? Running llc with -debug option, I see:

ASAP: bb
State: 0  %i.04 = phi i32 [ 0, %bb.nph ], [ %1, %bb ]     ; <i32> [#uses=2]
State: 0  %tmp = shl i32 %i.04, 2                         ; <i32> [#uses=1]
State: 1  %scevgep = getelementptr [400 x i8]* %0, i32 0, i32 %tmp ; <i8*> [#uses=1]
State: 1  %scevgep5 = bitcast i8* %scevgep to i32*        ; <i32*> [#uses=1]
State: 2  store i32 1, i32* %scevgep5, align 4
State: 1  %1 = add nsw i32 %i.04, 1                       ; <i32> [#uses=2]
State: 2  %exitcond = icmp eq i32 %1, 100                 ; <i1> [#uses=1]
State: 3  br i1 %exitcond, label %bb2, label %bb

So unfortunately, it seems like the getelementptr actually takes an extra cycle. This is because you need to take %0 (the address of the array) and add the %tmp offset. You need the shl because this is an integer array. Why didn't LLVM do strength reduction? TODO: strength reduction (-loop-reduce) isn't run by default?

Also, why is %1 started at control step 1? So the shl is actually cheap to chain because it is a constant shift. Strangely, the phi is actually requiring one cycle. I think there is a bug in the scheduler. I don't think phis should require a cycle. Okay. Fixed this issue. I'm going to push it to double check. Cycle count is still 607ns/2=304 cycles:

State: 0  %i.04 = phi i32 [ 0, %bb.nph ], [ %1, %bb ]     ; <i32> [#uses=2]
State: 0  %tmp = shl i32 %i.04, 2                         ; <i32> [#uses=1]
State: 1  %scevgep = getelementptr [400 x i8]* %0, i32 0, i32 %tmp ; <i8*> [#uses=1]
State: 1  %scevgep5 = bitcast i8* %scevgep to i32*        ; <i32*> [#uses=1]
State: 2  store i32 1, i32* %scevgep5, align 4
State: 0  %1 = add nsw i32 %i.04, 1                       ; <i32> [#uses=2]
State: 1  %exitcond = icmp eq i32 %1, 100                 ; <i1> [#uses=1]
State: 2  br i1 %exitcond, label %bb2, label %bb

Okay, the other problem is that a branch is taking an extra cycle. Fixed. Final number of cycles: 407ns/2 = 204 cycles

cstep: 0  %i.04 = phi i32 [ 0, %bb.nph ], [ %1, %bb ]     ; <i32> [#uses=2]
cstep: 0  %tmp = shl i32 %i.04, 2                         ; <i32> [#uses=1]
cstep: 1  %scevgep = getelementptr [400 x i8]* %0, i32 0, i32 %tmp ; <i8*> [#uses=1]
cstep: 1  %scevgep5 = bitcast i8* %scevgep to i32*        ; <i32*> [#uses=1]
cstep: 2  store i32 1, i32* %scevgep5, align 4
cstep: 0  %1 = add nsw i32 %i.04, 1                       ; <i32> [#uses=2]
cstep: 1  %exitcond = icmp eq i32 %1, 100                 ; <i1> [#uses=1]
cstep: 1  br i1 %exitcond, label %bb2, label %bb

Weird. It looks like this should be 3 cycles. Oh no. 0 literally means not a cycle. The cstep represents when the instruction finishes. The branch looks strange here, finishing before the store. But I think the finite state machine generation will handle that. So wait. If cstep is the ending state then I was probably wrong to chain after PhiNodes… Just wait until buildbot finishes… Turning off chaining after phinode:

cstep: 0  %i.04 = phi i32 [ 0, %bb.nph ], [ %1, %bb ]     ; <i32> [#uses=2]
cstep: 0  %tmp = shl i32 %i.04, 2                         ; <i32> [#uses=1]
cstep: 1  %scevgep = getelementptr [400 x i8]* %0, i32 0, i32 %tmp ; <i8*> [#uses=1]
cstep: 1  %scevgep5 = bitcast i8* %scevgep to i32*        ; <i32*> [#uses=1]
cstep: 2  store i32 1, i32* %scevgep5, align 4
cstep: 1  %1 = add nsw i32 %i.04, 1                       ; <i32> [#uses=2]
cstep: 2  %exitcond = icmp eq i32 %1, 100                 ; <i1> [#uses=1]
cstep: 2  br i1 %exitcond, label %bb2, label %bb

This looks like what I want. But wait, doesn't the store require 2 cycles? So shouldn't the branch be forced to cstep 3? Increasing the load latency in the code doesn't change the branch.

Weird. If I change the lower loop bound to 1, I get 300 cycles:

cstep: 0  %indvar = phi i32 [ 0, %bb.nph ], [ %indvar.next, %bb ] ; <i32> [#uses=2]
cstep: 0  %tmp = shl i32 %indvar, 2                       ; <i32> [#uses=1]
cstep: 1  %tmp5 = add i32 %tmp, 4                         ; <i32> [#uses=1]
cstep: 2  %scevgep = getelementptr [400 x i8]* %0, i32 0, i32 %tmp5 ; <i8*> [#uses=1]
cstep: 2  %scevgep6 = bitcast i8* %scevgep to i32*        ; <i32*> [#uses=1]
cstep: 3  store i32 1, i32* %scevgep6, align 4
cstep: 0  %indvar.next = add i32 %indvar, 1               ; <i32> [#uses=2]
cstep: 1  %exitcond = icmp eq i32 %indvar.next, 99        ; <i1> [#uses=1]
cstep: 1  br i1 %exitcond, label %bb2, label %bb

This is a strength reduction issue. If I run:

opt -loop-reduce pipeline.bc > pipeline2.bc
../../llvm/Debug/bin/llc -legup-config=../../hwtest/CycloneII.tcl  -march=v pipeline2.bc -o pipeline.v -debug &> log

I get it back down to 200 cycles. So this proves that strength reduction isn't run by default for some reason…

cstep: 0  %lsr.iv3 = phi [400 x i8]* [ %tmp6, %bb ], [ %scevgep12, %bb.nph ] ; <[400 x i8]*> [#uses=2]
cstep: 0  %lsr.iv = phi i32 [ %lsr.iv.next, %bb ], [ 99, %bb.nph ] ; <i32> [#uses=1]
cstep: 0  %lsr.iv37 = bitcast [400 x i8]* %lsr.iv3 to i32* ; <i32*> [#uses=1]
cstep: 1  store i32 1, i32* %lsr.iv37, align 4
cstep: 0  %lsr.iv.next = add i32 %lsr.iv, -1              ; <i32> [#uses=2]
cstep: 0  %scevgep4 = getelementptr [400 x i8]* %lsr.iv3, i32 0, i32 4 ; <i8*> [#uses=1]
cstep: 0  %tmp6 = bitcast i8* %scevgep4 to [400 x i8]*    ; <[400 x i8]*> [#uses=1]
cstep: 1  %exitcond = icmp eq i32 %lsr.iv.next, 0         ; <i1> [#uses=1]
cstep: 1  br i1 %exitcond, label %bb2, label %bb
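
To make the effect concrete, here is a hand-written C++ sketch (not LLVM output) of roughly what -loop-reduce did to this loop: the per-iteration address computation (shift plus base add) is replaced by a running pointer and a down-counter compared against 0. Both function names are made up.

```cpp
#include <cassert>
#include <cstddef>

// Before: the store address is recomputed from the index every iteration
// (base + 4*i), which is the shl + getelementptr chain in the schedule.
void fillIndexed(int* a, std::size_t n) {
    assert(a != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i) a[i] = 1;
}

// After (by hand): keep a running pointer and count down to zero, so each
// iteration needs only a pointer bump, a decrement and a compare with 0.
void fillStrengthReduced(int* a, std::size_t n) {
    assert(a != nullptr || n == 0);
    int* p = a;
    for (std::size_t remaining = n; remaining != 0; --remaining) {
        *p = 1;
        ++p;
    }
}
```

Both versions write the same values; the second just shortens the dependence chain in front of the store, which matches the store moving from cstep 3 down to cstep 1 in the listings above.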

Okay, I think I'm going to do a loop carried dependency instead:

    array[0] = 1;
    for (i = 1; i < N; i++) {
        array[i] = array[i-1]+1;
    }
    return array[N-1];

300 cycles. With strength reduction 200 cycles. Made array volatile. Now up to 500 cycles. (volatile makes no difference on the previous loop: still 200 cycles with strength reduction.) New loop body:

bb:
  %indvar = phi i32 [ 0, %bb.nph ], [ %tmp, %bb ] ; <i32> [#uses=2]
  %tmp = add i32 %indvar, 1                       ; <i32> [#uses=3]
  %scevgep = getelementptr [100 x i32]* %0, i32 0, i32 %tmp ; <i32*> [#uses=1]
  %scevgep5 = getelementptr [100 x i32]* %0, i32 0, i32 %indvar ; <i32*> [#uses=1]
  %1 = volatile load i32* %scevgep5, align 4      ; <i32> [#uses=1]
  %2 = add nsw i32 %1, 1                          ; <i32> [#uses=1]
  volatile store i32 %2, i32* %scevgep, align 4
  %exitcond = icmp eq i32 %tmp, 99                ; <i1> [#uses=1]
  br i1 %exitcond, label %bb2, label %bb

Control steps:

cstep: 0  %indvar = phi i32 [ 0, %bb.nph ], [ %tmp, %bb ] ; <i32> [#uses=2]
cstep: 0  %tmp = add i32 %indvar, 1                       ; <i32> [#uses=3]
cstep: 1  %scevgep = getelementptr [100 x i32]* %0, i32 0, i32 %tmp ; <i32*> [#uses=1]
cstep: 0  %scevgep5 = getelementptr [100 x i32]* %0, i32 0, i32 %indvar ; <i32*> [#uses=1]
depend (state: 0):   %scevgep5 = getelementptr [100 x i32]* %0, i32 0, i32 %indvar ; <i32*> [#uses=1]
cstep: 1  %1 = volatile load i32* %scevgep5, align 4      ; <i32> [#uses=1]
cstep: 3  %2 = add nsw i32 %1, 1                          ; <i32> [#uses=1]
depend (state: 3):   %2 = add nsw i32 %1, 1                          ; <i32> [#uses=1]
cstep: 4  volatile store i32 %2, i32* %scevgep, align 4
cstep: 1  %exitcond = icmp eq i32 %tmp, 99                ; <i1> [#uses=1]
cstep: 1  br i1 %exitcond, label %bb2, label %bb

Trying this code:

    volatile int array_val;
    volatile int *array = &array_val;
    *array = 1;
    for (i = 1; i < N; i++) {
        *array = *array+1;
        //array[i] = 1;
    }
    return *array;

Also takes 500 cycles.

bb:                                               ; preds = %bb, %bb.nph
  %indvar = phi i32 [ 0, %bb.nph ], [ %indvar.next, %bb ] ; <i32> [#uses=1]
  %1 = phi i32 [ %0, %bb.nph ], [ %3, %bb ]       ; <i32> [#uses=1]
  %2 = add nsw i32 %1, 1                          ; <i32> [#uses=1]
  volatile store i32 %2, i32* %array_val, align 4
  %3 = volatile load i32* %array_val, align 4     ; <i32> [#uses=2]
  %indvar.next = add i32 %indvar, 1               ; <i32> [#uses=2]
  %exitcond = icmp eq i32 %indvar.next, 99        ; <i1> [#uses=1]
  br i1 %exitcond, label %bb2, label %bb

Control steps:

cstep: 0  %indvar = phi i32 [ 0, %bb.nph ], [ %indvar.next, %bb ] ; <i32> [#uses=1]
cstep: 0  %1 = phi i32 [ %0, %bb.nph ], [ %3, %bb ]       ; <i32> [#uses=1]
cstep: 0  %2 = add nsw i32 %1, 1                          ; <i32> [#uses=1]
depend (state: 0):   %2 = add nsw i32 %1, 1                          ; <i32> [#uses=1]
cstep: 1  volatile store i32 %2, i32* %array_val, align 4
depend (state: 1):   volatile store i32 %2, i32* %array_val, align 4
cstep: 2  %3 = volatile load i32* %array_val, align 4     ; <i32> [#uses=2]
cstep: 0  %indvar.next = add i32 %indvar, 1               ; <i32> [#uses=2]
cstep: 1  %exitcond = icmp eq i32 %indvar.next, 99        ; <i1> [#uses=1]
cstep: 1  br i1 %exitcond, label %bb2, label %bb

Note how the load is immediately performed after the store. We can do this because the memory controller is shared and we access it sequentially. I don't see how the control steps map to the number of cycles. Here the max cstep is 2, whereas the previous example had a max cstep of 4. Yet they both seem to take 5 steps to iterate over the loop. Strength reduction does nothing here.

The actual states are:

state: 2   %indvar = phi i32 [ 0, %bb.nph ], [ %indvar.next, %bb ] ; <i32> [#uses=1]
state: 2   %1 = phi i32 [ %0, %bb.nph ], [ %3, %bb ]       ; <i32> [#uses=1]
state: 2   %2 = add nsw i32 %1, 1                          ; <i32> [#uses=1]
state: 2   %indvar.next = add i32 %indvar, 1               ; <i32> [#uses=2]
state: 7   volatile store i32 %2, i32* %array_val, align 4
state: 7   %exitcond = icmp eq i32 %indvar.next, 99        ; <i1> [#uses=1]
state: 7   br i1 %exitcond, label %bb2, label %bb
state: 8   %3 = volatile load i32* %array_val, align 4     ; <i32> [#uses=2]

Note the states 2, 7, 8 are sequential. The branch actually happens at state 10, so 2 cycles after the load. So there is some code to handle the memory instruction latency for branches, which makes 3 + 2 = 5 cycles in the inner loop, as we see. Honestly, I need to make this a lot easier to visualize! The control steps are really only local (within a BB). I think it's actually done in SchedulerMapping::createFSM(). There are a few lines of code to expand the number of states in a basic block to ensure an instruction finishes:

// need to ensure multi-cycle instructions finish in the basic block
unsigned delayState = SchedulerPass::getNumInstructionCycles(I);
...

Nope. Not here.

Andrew Canis 2011/04/18 15:05

Fixed the problem with my samba printing. Was this bug in ubuntu: http://brainextender.blogspot.com/2009/01/ubuntu-intrepid-too-many-failed.html

Andrew Canis 2011/04/17 15:05

Added analytics code to blog and wiki (wiki/lib/tpl/default/main.php)

Andrew Canis 2011/04/15 15:05

Conditional gdb breakpoint:

b translate.cc:244 if !a

For the code:

Breakpoint 3, translate_source (r=0x99dc080) at translate.cc:244
244             if (!a) simple_error("temporary register used before defined");

Andrew Canis 2011/04/08 15:05

Added Geolocation info:

crontab -e
# Update geocity data on the 3rd of every month
0 0 3 * * /var/www/updategeocity.sh 2>&1 | mailx -s "update GeoCity" andrew.canis@utoronto.ca

Used the code from:

http://www.sequentiallogic.com/2009/05/29/maxmind-geolite-country-and-geolite-city-made-easy/

Sample code in geo.php

Andrew Canis 2011/02/16 15:05

Working on a user guide: ~/grad/legup/notes/

Andrew Canis 2011/02/04 15:05

Changed legup.org → legup.eecg.utoronto.ca

Even with new pipelined dividers aes is slow:

Clock Setup: 'clk'
--------------------------------------------------------------------------------------
Path Number                  : 1
Slack                        : -109.304 ns
Actual fmax (period)         : 8.38 MHz ( period = 119.304 ns )
From                         : decrypt:decrypt_inst|KeySchedule:KeySchedule_inst|KeySchedule_bb17_indvar_reg[0]
To                           : decrypt:decrypt_inst|KeySchedule:KeySchedule_inst|KeySchedule_bb17_var14_reg[29]
From Clock                   : clk
To Clock                     : clk
Required Setup Relationship  : 10.000 ns
Required Longest P2P Time    : 9.809 ns
Actual Longest P2P Time      : 119.113 ns

What is this path?

acanis@acanis-desktop:~/work/legup/tiger/hybrid/aes$ grep KeySchedule_bb17_indvar_reg aes.v
reg [31:0] KeySchedule_bb17_indvar_reg;
KeySchedule_bb17_indvar_reg <= KeySchedule_bb17_indvar;
KeySchedule_bb29_scevgep55 <= `TAG_g_word_a + 4 * KeySchedule_bb17_indvar_reg;
KeySchedule_bb29_scevgep55_1 <= `TAG_g_word_a + 4 * KeySchedule_bb17_indvar_reg + 480;
KeySchedule_bb29_scevgep55_2 <= `TAG_g_word_a + 4 * KeySchedule_bb17_indvar_reg + 960;
KeySchedule_bb29_scevgep55_3 <= `TAG_g_word_a + 4 * KeySchedule_bb17_indvar_reg + 1440;
KeySchedule_bb29_indvar_next <= KeySchedule_bb17_indvar_reg + 32'd1;
acanis@acanis-desktop:~/work/legup/tiger/hybrid/aes$ grep KeySchedule_bb17_var14_reg aes.v
reg  KeySchedule_bb17_var14_reg;
KeySchedule_bb17_var14_reg <= KeySchedule_bb17_var14;
                if (KeySchedule_bb17_var14_reg) begin

Here's the From gate:

always @(posedge clk) begin
/* KeySchedule: bb17*/
if (cur_state == 14) begin
/*   %indvar = phi i32 [ 0, %bb.nph45 ], [ %indvar.next, %bb29 ] ; <i32> [#uses=8]*/
KeySchedule_bb17_indvar_reg <= KeySchedule_bb17_indvar;
end
end

Here's the To gate:

always @(posedge clk) begin
/* KeySchedule: bb17*/
if (cur_state == 37) begin
/*   %13 = icmp eq i32 %12, 0                        ; <i1> [#uses=1]*/
KeySchedule_bb17_var14_reg <= KeySchedule_bb17_var14;
end

always @(*) begin
KeySchedule_bb17_var14 <= 0;
/* KeySchedule: bb17*/
if (cur_state == 37) begin
/*   %13 = icmp eq i32 %12, 0                        ; <i1> [#uses=1]*/
KeySchedule_bb17_var14 <= KeySchedule_bb17_var13_reg == 32'd0;
end

always @(posedge clk) begin
/* KeySchedule: bb17*/
if (cur_state == 36) begin
/*   %12 = srem i32 %j.137, %nk.095                  ; <i32> [#uses=2]*/
KeySchedule_bb17_var13_reg <= KeySchedule_bb17_var13;
end

I bet it's something to do with the 'srem'. I should pipeline that…

Why is adpcm so slow?

Info: Slack time is -20.841 ns for clock "clk" between source memory "main:main_inst|altsyncram:Mux3_rtl_0|altsyncram_3lu:auto_generated|ram_block1a0~porta_address_reg8" and destination register "main:main_inst|main_quantl_exit_i_n__i_i_reg[31]"
    Info: Fmax is 45.79 MHz (period= 21.841 ns)

Here's what happens in state 179!

always @(*) begin
if (cur_state == 179) begin
/*   %124 = load i32* getelementptr inbounds ([6 x i32]* @delay_bph, i32 0, i32 5), align 4 ; <i32> [#uses=1]*/
main_quantl_exit_i_var141 <= memory_controller_out;
/*   %126 = mul nsw i32 %124, %123                   ; <i32> [#uses=1]*/
main_quantl_exit_i_var144 <= main_quantl_exit_i_var141 * main_quantl_exit_i_var140_reg;
/*   %132 = ashr i32 %131, 14                        ; <i32> [#uses=2]*/
main_quantl_exit_i_var145 <= main_quantl_exit_i_var143_reg + main_quantl_exit_i_var144;
/*   %136 = add nsw i32 %135, %132                   ; <i32> [#uses=2]*/
main_quantl_exit_i_var146 <= $signed(main_quantl_exit_i_var145) >>> 32'd14 % 32;
/*   %131 = add nsw i32 %130, %126                   ; <i32> [#uses=1]*/
main_quantl_exit_i_var147 <= main_quantl_exit_i_var79_reg + main_quantl_exit_i_var146;
/*   %137 = sub nsw i32 %30, %136                    ; <i32> [#uses=3]*/
main_quantl_exit_i_var148 <= main_bb5_i_var30_reg - main_quantl_exit_i_var147;
/*   %138 = icmp sgt i32 %137, -1                    ; <i1> [#uses=2]*/
main_quantl_exit_i_var149 <= $signed(main_quantl_exit_i_var148) > $signed(-32'd1);
/*   %n..i.i = select i1 %138, i32 %137, i32 %141    ; <i32> [#uses=1]*/
main_quantl_exit_i_n__i_i <= main_quantl_exit_i_var149 ? main_quantl_exit_i_var148 : main_quantl_exit_i_var150;
/*   %n..i.i = select i1 %138, i32 %137, i32 %141    ; <i32> [#uses=1]*/
main_quantl_exit_i_n__i_i_reg <= main_quantl_exit_i_n__i_i;

I don't know how this was all put into one state. There must be a problem with the estimations.

Why is the multiplier count so high for adpcm? Because I stopped sharing 32-bit multipliers. Add that back in. Well, I can only add this back in if I pipeline the multipliers. Otherwise I get a huge hit in fmax, especially with adpcm.

Am I sharing the dividers properly?

ac215364@1637b:~/Downloads >grep "lpm_divide " --before-context=1 aes_main.v |grep %
/*   %12 = srem i32 %j.137, %nk.095                  ; <i32> [#uses=2]*/
/*   %19 = sdiv i32 %j.137, %nk.095                  ; <i32> [#uses=1]*/
/*   %15 = sdiv i32 %14, 16                          ; <i32> [#uses=1]*/
/*   %16 = srem i32 %14, 16                          ; <i32> [#uses=1]*/
/*   %25 = sdiv i32 %24, 16                          ; <i32> [#uses=1]*/
/*   %26 = srem i32 %24, 16                          ; <i32> [#uses=1]*/
/*   %30 = sdiv i32 %29, 16                          ; <i32> [#uses=1]*/
/*   %31 = srem i32 %29, 16                          ; <i32> [#uses=1]*/
/*   %35 = sdiv i32 %34, 16                          ; <i32> [#uses=1]*/
/*   %36 = srem i32 %34, 16                          ; <i32> [#uses=1]*/
/*   %48 = sdiv i32 %46, 16                          ; <i32> [#uses=1]*/
/*   %49 = srem i32 %46, 16                          ; <i32> [#uses=1]*/
/*   %52 = sdiv i32 %45, 16                          ; <i32> [#uses=1]*/
/*   %53 = srem i32 %45, 16                          ; <i32> [#uses=1]*/
/*   %56 = sdiv i32 %44, 16                          ; <i32> [#uses=1]*/
/*   %57 = srem i32 %44, 16                          ; <i32> [#uses=1]*/
/*   %60 = sdiv i32 %43, 16                          ; <i32> [#uses=1]*/
/*   %61 = srem i32 %43, 16                          ; <i32> [#uses=1]*/
/*   %27 = srem i32 %tmp84, 4                        ; <i32> [#uses=1]*/
/*   %42 = srem i32 %tmp85, 4                        ; <i32> [#uses=1]*/
/*   %57 = srem i32 %tmp86, 4                        ; <i32> [#uses=1]*/
/*   %36 = srem i32 %type, 1000                      ; <i32> [#uses=2]*/
/*   %38 = sdiv i32 %36, 8                           ; <i32> [#uses=2]*/
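
One way to check: lpm_divide produces both a quotient and a remainder, so an sdiv and an srem on the same operands could in principle come from a single divider instance. A minimal Python sketch that groups the grep output by operands; the helper name pair_div_rem and the regex are my own, not part of LegUp, and since value names like %57 repeat across functions the grouping is only meaningful within one function:

```python
import re
from collections import defaultdict

# Matches LLVM IR comment lines such as:
#   /*   %15 = sdiv i32 %14, 16    ; <i32> [#uses=1]*/
IR_DIV = re.compile(r"%(\S+) = (sdiv|srem|udiv|urem) i\d+ (\S+), (\S+)")

def pair_div_rem(lines):
    """Group divide/remainder ops by their (dividend, divisor) operands."""
    groups = defaultdict(list)
    for line in lines:
        m = IR_DIV.search(line)
        if m:
            result, op, lhs, rhs = m.groups()
            groups[(lhs, rhs)].append((op, "%" + result))
    return dict(groups)

sample = [
    "/*   %15 = sdiv i32 %14, 16                          ; <i32> [#uses=1]*/",
    "/*   %16 = srem i32 %14, 16                          ; <i32> [#uses=1]*/",
    "/*   %27 = srem i32 %tmp84, 4                        ; <i32> [#uses=1]*/",
]
groups = pair_div_rem(sample)
# Operand pairs used by more than one op are candidates for one shared divider.
shareable = {k: v for k, v in groups.items() if len(v) > 1}
print(shareable)
```

If the number of lpm_divide instances in the Verilog is larger than the number of distinct operand pairs per function, the pairs are probably not being merged.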

Andrew Canis 2010/09/20 8:00

So blowfish hasn't compiled properly with quartus since Sept 2:

acanis@navy:/autofs/seth.eecg/brown.r/r0/acanis/buildbot/linux_x86_64/build/examples/chstone/blowfish$ ll --sort=time
-rw-r--r--  1 acanis browngrp 574K Sep 16 11:52 bf.v
...
-rw-rw-r--  1 acanis browngrp   25 Sep  2 15:26 top.done

Getting a strange error:

Info: Found 1 design units, including 1 entities, in source file db/altsyncram_1b13.tdf
    Info: Found entity 1: altsyncram_1b13
    Info: Found 1 design units, including 1 entities, in source file db/mux_ujb.tdf
        Info: Found entity 1: mux_ujb
        Error: Current module quartus_map ended unexpectedly
        Error: Flow compile (for project /autofs/seth.eecg/brown.r/r0/acanis/buildbot/linux_x86_64/build/examples/chstone/blowfish/top) was not successful
        Error: ERROR: Error(s) found while running an executable. See report file(s) for error message(s). Message log indicates which executable was run last.

        make[2]: *** [f] Error 3
        make[2]: Leaving directory `/autofs/seth.eecg/brown.r/r0/acanis/buildbot/linux_x86_64/build/examples/chstone/blowfish'

Trying to just manually build blowfish:

quartus_sh --flow compile top

Trying to turn off parallel synthesis to see if that helps. Nope.

Trying:

make cleanall
make
quartus_sh --flow compile top

Andrew Canis 2010/09/16 8:00

Just had a conflict with a locally modified file:

git reset --hard
git pull    # again

Then I got 'xxxx would be overwritten by merge.' I tried the following to fix it. DON'T DO THIS: resetting to origin/master throws away unpushed local commits, and I just lost my last two!

git fetch
git reset --hard origin/master

Debugging issue with gsm make generate-wrapper:

../../../llvm/Debug/bin/opt -legup-config=config.tcl -load=../../../llvm/Debug/lib//LLVMLegUp.so -legup-hw-only < gsm.prelto.bc > gsm.prelto.hw.bc
Invalid user of intrinsic instruction!
i8* bitcast (void (i8*, i8, i64, i32)* @llvm.memset.i64 to i8*)
Broken module found, compilation aborted!

Verifier.cpp fails here:

680       // If this function is actually an intrinsic, verify that it is only used in
681       // direct call/invokes, never having its "address taken".
682       if (F.getIntrinsicID()) {
683         for (Value::use_iterator UI = F.use_begin(), E = F.use_end(); UI != E;++UI){
684           User *U = cast<User>(UI);
(gdb) l
685           if ((isa<CallInst>(U) || isa<InvokeInst>(U)) && UI.getOperandNo() == 0)
686             continue;  // Direct calls/invokes are ok.
687           
688           Assert1(0, "Invalid user of intrinsic instruction!", U); 
689         }
690       }

So the new code is trying to get the address of an intrinsic? How do I print out the bitcode before this failure? Okay need to run with -disable-verify flag:

../../../llvm/Debug/bin/opt -disable-verify -legup-config=config.tcl -load=../../../llvm/Debug/lib//LLVMLegUp.so -legup-hw-only < gsm.prelto.bc > gsm.prelto.hw.bc

That's weird, the error is coming from:

@llvm.used = appending global [5 x i8*] [i8* bitcast (void (i8*, i8, i64, i32)* @llvm.memset.i64 to i8*), i8* bitcast (void (i16*, i32*)* @Autocorrelation to i8*), i8* bitcast (void (i32*, i16*)* @Reflection_coefficients to i8*), i8* bit

What is @llvm.used?

If a global variable appears in the @llvm.used list, then the compiler, assembler, and linker are required to treat the symbol as if there is a reference to the global that it cannot see. 

Okay I've just removed the code that produces this llvm.used variable. The only difference I see is that the functions are now all marked as internal. Not sure if this will matter.

Andrew Canis 2010/09/14 8:00

Binding: sharing dividers has disappointing results. Saves ~1500 LEs on dfsin and dfdiv, so the geomean only drops ~1%.

Interesting synthesis options: settings→synthesis→more settings. Option to turn small rams into logic:

set_global_assignment -name AUTO_RAM_TO_LCELL_CONVERSION ON

One thing we are missing. We can't transform a simple branch to a mux. For instance:

if (a == 3)
    b = z;
else
    b = y;

This if statement will need multiple states. State 1: a == 3, state 2: b = z, state 3: b = y. Really we should be able to feed a == 3 into a mux.
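
The two forms compute the same value; the difference is that the second has no control flow, so it would not need extra states. LLVM IR expresses it as a select instruction, like the %n..i.i = select in the adpcm dump earlier in this log. A rough sketch in Python standing in for the C above (branch_form and mux_form are illustrative names):

```python
def branch_form(a, y, z):
    # Control flow: in the FSM this costs a compare state plus one state per arm.
    if a == 3:
        b = z
    else:
        b = y
    return b

def mux_form(a, y, z):
    # Data flow: the compare result is just the select input of a 2-to-1 mux.
    sel = (a == 3)
    return z if sel else y

print([mux_form(a, "y", "z") for a in range(5)])
```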

Andrew Canis 2010/09/02 8:00

Hybrid jpeg simulation took 34 hours. Rest of the tests took about an hour. I'm just going to disable jpeg for now.

Andrew Canis 2010/09/01 8:00

Made changes to buildbot for Tiger perf and generated new folders:

buildmaster@acanis-desktop:~/buildbot/public_html/perf$ git diff
diff --git a/buildbot/public_html/perf/generate_perf.py b/buildbot/p
index 10eaaac..66cd8cc 100644
--- a/buildbot/public_html/perf/generate_perf.py
+++ b/buildbot/public_html/perf/generate_perf.py
@@ -140,6 +140,8 @@ PerfTester_list = [
     PerfTester('linux_x86', 'Linux Perf'),
     PerfTester('linux_x86_64', 'Linux 64 Perf'),
     PerfTester('perf_test', 'Test Perf'),
+    PerfTester('linux_x86_tiger', 'Linux Perf'),
+    PerfTester('linux_x86_64_tiger', 'Linux 64 Perf'),
 #    PerfTester('xp-release-dual-core', 'XP Perf'),
 #    PerfTester('xp-release-single-core', 'XP Perf (single)'),
 #    PerfTester('vista-release-dual-core', 'Vista Perf'),
buildmaster@acanis-desktop:~/buildbot/public_html/perf$ python generate_perf.py 

Tiger simulation flow:

cd examples/sra
# compile for mips. Convert .elf to sdram.dat
make tiger
# $(PROC_DIR) = tiger/hybrid/processor/tiger_cache_on_avalon/tiger_sim
# copy sdram.dat into $(PROC_DIR) 
# run vsim: cd $(PROC_DIR) && vsim -c -do "../run_sim.tcl"
make tigersim

Interesting. It's not possible to screw up the history with git commit --amend. git won't let you push.

There are still modelsim warnings in Tiger. One is an incompatible clock port in the lpm_divide/lpm_mult modules. I don't understand why. One thing I noticed was removing the vsim flag: +acc=rn (display all registers and nets) gets rid of the warning.

Andrew Canis 2010/08/27 8:00

Interesting, shifters take a lot of area:

shift_ll_32            luts: 159    mux_2_32 luts: 32
shift_ll_64            luts: 410    mux_2_64 luts: 64
shift_rl_32            luts: 159    mux_2_32 luts: 32
shift_rl_64            luts: 410    mux_2_64 luts: 64
signed_comp_eq_mux_32  luts: 53     mux_2_32 luts: 32
signed_comp_eq_mux_64  luts: 106    mux_2_64 luts: 64
signed_multiply_64     luts: 169    mux_2_64 luts: 64
unsigned_divide_64     luts: 4285   mux_2_64 luts: 64

Old results for xpilot mips fast (recompiled for CycloneII):

xpilot_fast_mips/prj
------------
Fmax: 91.65 MHz
Latency: 7347 cycles
Latency: 80 us
Verilog: 48 LOC
Family : Cyclone II
Device : EP2C15AF484C6
Timing Models : Final
Total logic elements : 2,815 / 14,448 ( 19 % )
    Total combinational functions : 2,620 / 14,448 ( 18 % )
    Dedicated logic registers : 1,449 / 14,448 ( 10 % )
Total registers : 1449
Total pins : 213 / 315 ( 68 % )
Total virtual pins : 0
Total memory bits : 8,192 / 239,616 ( 3 % )
Embedded Multiplier 9-bit elements : 8 / 52 ( 15 % )

Andrew Canis 2010/08/26 8:00

Literature review for phd transfer. Areas:

  • Hardware/software partitioning
  • Are we good enough for real h/w designers?
  • High-level synthesis optimizations
  • Debugging and verification of hardware
  • System architecture
  • FPGA architecture-aware high-level synthesis
  • Floating point and fixed point support

What is the “state of the art”? Let's start with the easy one: High-level synthesis optimizations.

Andrew Canis 2010/08/23 8:00

Got 2 cycle load/store working. So basically the struct memory controller didn't work if I put the memory_controller_out assignment in an always @(posedge clk), but it did work if I created a new memory_controller_out_reg signal. Strange.

Looking into fmaxes on latest run:

Info: Slack time is -13.703 ns for clock "clk" between source register "main:main_inst|main_Proc_3_exit_i_tmp18_reg[5]" and destination memory "memory_controller:memory_controller_inst|ram_one_port:Arr_2_Glob|altsyncram:altsyncram_component|altsyncram_hae1:auto_generated|ram_block1a22~porta_address_reg10"
Info: Fmax is 68.01 MHz (period= 14.703 ns)

Info: Slack time is -23.143 ns for clock "clk" between source memory "memory_controller:memory_controller_inst|ram_one_port:delay_dltx|altsyncram:altsyncram_component|altsyncram_p8d1:auto_generated|ram_block1a14~porta_address_reg2" and destination memory "memory_controller:memory_controller_inst|ram_one_port:accumd|altsyncram:altsyncram_component|altsyncram_9sc1:auto_generated|ram_block1a0~porta_datain_reg8"
Info: Fmax is 41.42 MHz (period= 24.143 ns)

Info: Slack time is -26.296 ns for clock "clk" between source memory "memory_controller:memory_controller_inst|ram_one_port:_str|altsyncram:altsyncram_component|altsyncram_glc1:auto_generated|ram_block1a0~porta_address_reg4" and destination memory "memory_controller:memory_controller_inst|ram_one_port:_str|altsyncram:altsyncram_component|altsyncram_glc1:auto_generated|ram_block1a0~porta_datain_reg6"
Info: Fmax is 36.64 MHz (period= 27.296 ns)

Info: Slack time is -14.011 ns for clock "clk" between source memory "memory_controller:memory_controller_inst|ram_one_port:out_key|altsyncram:altsyncram_component|altsyncram_44d1:auto_generated|ram_block1a2~porta_address_reg11" and destination register "main:main_inst|main_bb12_i_check_025_i_phi_temp[31]"
Info: Fmax is 66.62 MHz (period= 15.011 ns)

Info: Slack time is -12.449 ns for clock "clk" between source memory "memory_controller:memory_controller_inst|ram_one_port:b_input|altsyncram:altsyncram_component|altsyncram_s0d1:auto_generated|ram_block1a32~porta_address_reg5" and destination register "main:main_inst|main_float64_add_exit_var155_reg[31]"
Info: Fmax is 74.35 MHz (period= 13.449 ns)

Info: Slack time is -17.152 ns for clock "clk" between source memory "main:main_inst|altsyncram:Mux2_rtl_0|altsyncram_2lu:auto_generated|ram_block1a0~porta_address_reg8" and destination register "main:main_inst|main_bb4_i32_i_var113_reg[31]"
Info: Fmax is 55.09 MHz (period= 18.152 ns)

Info: Slack time is -11.244 ns for clock "clk" between source register "main:main_inst|cur_state.0011111" and destination register "main:main_inst|main_bb22_i_var102_reg[63]"
Info: Fmax is 81.67 MHz (period= 12.244 ns)

Info: Slack time is -16.636 ns for clock "clk" between source memory "main:main_inst|altsyncram:Selector1_rtl_0|altsyncram_nnu:auto_generated|ram_block1a0~porta_address_reg9" and destination register "main:main_inst|main_bb24_i_i_var193_reg[63]"
Info: Fmax is 56.7 MHz (period= 17.636 ns)

Info: Slack time is -17.503 ns for clock "clk" between source memory "memory_controller:memory_controller_inst|ram_one_port:main_bb_nph19_so|altsyncram:altsyncram_component|altsyncram_psa1:auto_generated|ram_block1a0~porta_address_reg7" and destination register "main:main_inst|main_bb_nph35_i_var106_reg[30]"
Info: Fmax is 54.05 MHz (period= 18.503 ns)

Info: Slack time is -28.608 ns for clock "clk" between source memory "memory_controller:memory_controller_inst|ram_one_port:hana|altsyncram:altsyncram_component|altsyncram_7pc1:auto_generated|ram_block1a6~porta_address_reg11" and destination memory "memory_controller:memory_controller_inst|ram_one_port:JpegFileBuf|altsyncram:altsyncram_component|altsyncram_igd1:auto_generated|ram_block1a7~porta_address_reg8"
Info: Fmax is 33.77 MHz (period= 29.608 ns)

Info: Slack time is -15.996 ns for clock "clk" between source memory "memory_controller:memory_controller_inst|ram_one_port:_str|altsyncram:altsyncram_component|altsyncram_njc1:auto_generated|ram_block1a0~porta_address_reg0" and destination register "main:main_inst|main_bb45_Hi_0_phi_temp[30]"
Info: Fmax is 58.84 MHz (period= 16.996 ns)

Info: Slack time is -15.312 ns for clock "clk" between source memory "memory_controller:memory_controller_inst|ram_one_port:ld_Bfr|altsyncram:altsyncram_component|altsyncram_hqc1:auto_generated|ram_block1a0~porta_we_reg" and destination register "main:main_inst|main_bb18_var29_reg[31]"
Info: Fmax is 61.3 MHz (period= 16.312 ns)

Info: Slack time is -12.528 ns for clock "clk" between source register "main:main_inst|cur_state.0001010" and destination memory "memory_controller:memory_controller_inst|ram_one_port:indata|altsyncram:altsyncram_component|altsyncram_l1d1:auto_generated|ram_block1a16~porta_address_reg11"
Info: Fmax is 73.92 MHz (period= 13.528 ns)
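
The Fmax figures follow directly from the period (Fmax in MHz = 1000 / period in ns), and every period above equals the slack magnitude plus 1 ns, which suggests the clock constraint was 1 ns (my inference; the reports do not say so). A small Python sketch to pull the numbers out of such a report; parse_fmax is a hypothetical helper and the sample text is abbreviated:

```python
import re

SLACK = re.compile(r"Slack time is (-?\d+\.\d+) ns")
FMAX = re.compile(r"Fmax is (\d+(?:\.\d+)?) MHz \(period= (\d+\.\d+) ns\)")

def fmax_from_period(period_ns):
    """Fmax in MHz for a clock period given in ns."""
    return 1000.0 / period_ns

def parse_fmax(report):
    """Return (slack_ns, period_ns, fmax_mhz) tuples from timing messages."""
    slacks = [float(s) for s in SLACK.findall(report)]
    rows = FMAX.findall(report)
    return [(s, float(p), float(f)) for s, (f, p) in zip(slacks, rows)]

report = '''
Info: Slack time is -13.703 ns for clock "clk" between ...
Info: Fmax is 68.01 MHz (period= 14.703 ns)
Info: Slack time is -28.608 ns for clock "clk" between ...
Info: Fmax is 33.77 MHz (period= 29.608 ns)
'''
for slack, period, fmax in parse_fmax(report):
    # period + slack recovers the constraint the tool was timing against
    print(f"period {period} ns -> {fmax_from_period(period):.2f} MHz, "
          f"constraint {period + slack:.3f} ns")
```

Sorting the tuples by period makes it easy to see which benchmarks are furthest from the 100MHz target.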

Getting the fmax of every accelerator to 100MHz. Pipeline depth of udiv's is now equal to bitwidth.

Looking at jpeg:

Info: Slack time is -20.027 ns for clock "clk" between source memory 
"main:main_inst|altsyncram:Selector1_rtl_0|altsyncram_onu:auto_generated|ram_block1a0~porta_address_reg9" 
and destination memory 
"memory_controller:memory_controller_inst|ram_one_port:DecodeInfo_comps_info_quant_tbl_no|altsyncram:altsyncram_component|altsyncram_vlf1:auto_generated|ram_block1a0~porta_we_reg"
    Info: Fmax is 47.56 MHz (period= 21.027 ns)

Where is the source memory? Can't really tell from the description. There shouldn't be any altsyncrams in main_inst… Dest ram:

reg [1:0] DecodeInfo_comps_info_quant_tbl_no_address;
reg DecodeInfo_comps_info_quant_tbl_no_write_enable;
reg [7:0] DecodeInfo_comps_info_quant_tbl_no_in;
wire [7:0] DecodeInfo_comps_info_quant_tbl_no_out;

/* @DecodeInfo_comps_info_quant_tbl_no = internal global [3 x i8] zeroinitializer ; <[3 x i8]*> [#uses=2]
*/
ram_one_port DecodeInfo_comps_info_quant_tbl_no (
	.clk( clk ),
	.address( DecodeInfo_comps_info_quant_tbl_no_address ),
	.write_enable( DecodeInfo_comps_info_quant_tbl_no_write_enable ),
	.data( DecodeInfo_comps_info_quant_tbl_no_in ),
	.q( DecodeInfo_comps_info_quant_tbl_no_out )
);
defparam DecodeInfo_comps_info_quant_tbl_no.width_a = 8;
defparam DecodeInfo_comps_info_quant_tbl_no.widthad_a = 2;
defparam DecodeInfo_comps_info_quant_tbl_no.numwords_a = 3;
defparam DecodeInfo_comps_info_quant_tbl_no.init_file = "DecodeInfo_comps_info_quant_tbl_no.mif";

Seems like a small altsyncram.

Actually many of the circuits seem to have the altsyncram on their crit path. I'm just going to register the mem controller and use the waitrequest. Fastest circuit is dfadd:

Info: Slack time is -8.681 ns for clock "clk" between 
source memory 
"memory_controller:memory_controller_inst|ram_one_port:float_exception_flags|altsyncram:altsyncram_component|altsyncram_oce1:auto_generated|ram_block1a0~porta_we_reg" 
and destination register 
"main:main_inst|var146_reg[31]"
    Info: Fmax is 103.3 MHz (period= 9.681 ns)

I don't understand why we_reg would ever be driving something. Isn't that an input to the altsyncram? Oh of course. The altsyncram doesn't have output flops, just input registers. So we_reg is the input flop for the write enable. So the critical path is through the entire altsyncram.

Andrew Canis 2010/08/19 8:00

To pretty print cpp:

a2ps -o print.ps MetaScheduler.h MetaScheduler.cpp ConstraintScheduling.* Simple* LegUpSchedulerDAG.* Scheduler*

Andrew Canis 2010/08/17 8:00

To convert jpeg to raw text RGB (.ppm):

acanis@acanis-desktop:~/work/legup/examples/chstone/jpeg$ convert -compress None pic.jpeg pic.ppm

Should be 150×113 = 16,950 pixels. Assuming RGB, there should be 50,850 entries. However, printing from YuvToRgb I get 18,432 pixels.

Note: we are decoding a 4:1:1 image.

An ascii R G B .ppm file looks like (150×113):

P3
150 113
255
202 246 0
....

Looking at the code:

/* Transform from Yuv into RGB */
for (i = 0; i < 4; i++) {
    YuvToRgb (rgb_buf[i], IDCTBuff[i], IDCTBuff[4], IDCTBuff[5]);
}

for (i = 0; i < RGB_NUM; i++) {
    p_out_vpos = OutData_comp_vpos[i];
    p_out_hpos = OutData_comp_hpos[i];
    for (j = 0; j < BMP_OUT_SIZE; j++)
    {
        p_out_buf[j] = OutData_comp_buf[i][j];
    }
    Write4Blocks (rgb_buf[0][i],
            rgb_buf[1][i], rgb_buf[2][i], rgb_buf[3][i]);
}

So we have four rgb_buf entries:

#define DCTSIZE2           64
#define RGB_NUM  3
  int rgb_buf[4][RGB_NUM][DCTSIZE2];
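
Working the numbers through may help narrow this down. A quick Python check using only the sizes quoted above. It assumes, from my reading of the loop, that each pass converts four 8×8 luma blocks (one 16×16 group). That 18,432 factors as 144×128 is plain arithmetic; treating it as a cropped frame size is only a guess:

```python
WIDTH, HEIGHT = 150, 113
DCTSIZE2 = 64           # pixels per 8x8 block
RGB_NUM = 3
BLOCKS_PER_GROUP = 4    # the loop above calls YuvToRgb four times per pass

expected_pixels = WIDTH * HEIGHT                 # pixels in the image
expected_entries = expected_pixels * RGB_NUM     # R, G and B values

observed_pixels = 18432
blocks = observed_pixels // DCTSIZE2             # number of 8x8 blocks
groups = blocks // BLOCKS_PER_GROUP              # number of 16x16 groups

# Rounding the image up to whole 16x16 groups in both dimensions.
groups_rounded_up = ((WIDTH + 15) // 16) * ((HEIGHT + 15) // 16)

print(expected_pixels, expected_entries, blocks, groups, groups_rounded_up)
```

The observed count corresponds to 72 groups, while rounding 150×113 up to whole 16×16 groups would give 80, so simple round-up padding does not explain the difference by itself.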

I just cannot figure this code out. It makes no sense to me. I took a look at the original source from Stanford. They have made massive changes, to the point where this is basically not comparable to the original.

Andrew Canis 2010/08/09 8:00

Buildbot mailing list isn't working. I forgot to run:

sudo newaliases

Andrew Canis 2010/07/30 8:00

So the problem circuits are:

very high LEs - jpeg, dfsin
high LEs - adpcm, dfdiv

Looking at Ahmed's resource estimates:

acanis@acanis-desktop:~/work/legup/examples/chstone$ grep -i div `find . -name resources.summary`
./dfdiv/resources.summary:Operation "unsigned_divide_64" x 2
./dfdiv/resources.summary:Critical path contains operation: unsigned_divide_64
./dfsin/resources.summary:Operation "unsigned_divide_64" x 2
acanis@acanis-desktop:~/work/legup/examples/chstone$ grep -i mult `find . -name resources.summary`
./jpeg/resources.summary:Operation "signed_multiply_32" x 11 
./jpeg/resources.summary:Operation "signed_multiply_nodsp_32" x 32 
./adpcm/resources.summary:Operation "signed_multiply_32" x 11 
./adpcm/resources.summary:Operation "signed_multiply_nodsp_32" x 78 
./dfmul/resources.summary:Operation "signed_multiply_64" x 3 
./dfmul/resources.summary:Operation "signed_multiply_nodsp_64" x 1 
./dfdiv/resources.summary:Operation "signed_multiply_64" x 3 
./dfdiv/resources.summary:Operation "signed_multiply_nodsp_64" x 8 
./motion/resources.summary:Operation "signed_multiply_32" x 2 
./sha/resources.summary:Operation "signed_multiply_32" x 1 
./sha/resources.summary:Critical path contains operation: signed_multiply_32
./dfsin/resources.summary:Operation "signed_multiply_32" x 1 
./dfsin/resources.summary:Operation "signed_multiply_64" x 3 
./dfsin/resources.summary:Operation "signed_multiply_nodsp_64" x 12 
./mips/resources.summary:Operation "signed_multiply_64" x 2 
./mips/resources.summary:Critical path contains operation: signed_multiply_64
./gsm/resources.summary:Operation "signed_multiply_32" x 11 
./gsm/resources.summary:Operation "signed_multiply_nodsp_32" x 46 

Surprisingly gsm has a ton of multipliers: 57. But they are only 32-bit. jpeg has only 43 32-bit multipliers, but probably has more logic in the rest of the circuit. gsm is still fairly big; it's right after dfdiv in terms of LEs. So we can save up to:

jpeg: 42 32-bit mults (with a 43-1 mux)
dfsin: 14 64-bit mults (with a 15-1 mux). 1 64-bit div.
adpcm: 88 32-bit mults (with a 89-1 mux)

Why don't we try this incrementally. Just share half of the 'big' functional units (div/mul) with muxes for now.

If you look in hwtest/CycloneII.tcl, all the operation characteristics are given.

set_operation_attributes <name> <LUTs> <Registers> <LogicElements> <ALUTs> <DSPElements> 
set_operation_attributes signed_multiply_32 29 96 111 0 6 
set_operation_attributes signed_multiply_64 169 192 315 0 20 
set_operation_attributes signed_multiply_nodsp_32 694 96 758 0 0 
set_operation_attributes signed_multiply_nodsp_64 2748 192 2876 0 0 
set_operation_attributes signed_divide_32 1214 96 1278 0 0 
set_operation_attributes signed_divide_64 4509 192 4637 0 0 
set_operation_attributes signed_modulus_32 1277 96 1341 0 0 
set_operation_attributes signed_modulus_64 4604 192 4732 0 0 

So cutting down these operations will save a ton of DSPs. When there are no DSPs the LUT count just explodes. So why don't we share after we max out DSPs?

Just looking at nodsps:

./jpeg/resources.summary:Operation "signed_multiply_nodsp_32" x 32 
./adpcm/resources.summary:Operation "signed_multiply_nodsp_32" x 78 
./dfmul/resources.summary:Operation "signed_multiply_nodsp_64" x 1 
./dfdiv/resources.summary:Operation "signed_multiply_nodsp_64" x 8 
./dfsin/resources.summary:Operation "signed_multiply_nodsp_64" x 12 
./gsm/resources.summary:Operation "signed_multiply_nodsp_32" x 46 

We can save a ton here. How much do muxes cost?

set_operation_attributes <name> <LUTs> <Registers> <LogicElements> <ALUTs> <DSPElements> 
set_operation_attributes mux_2_32 32 97 97 0 0 
set_operation_attributes mux_2_64 64 193 193 0 0
set_operation_attributes mux_4_32 64 162 194 0 0 
set_operation_attributes mux_4_64 128 322 386 0 0 
set_operation_attributes mux_8_32 160 291 419 0 0 

Pretty cheap. For 2-1 muxes LUTs = bitwidth. Probably best to stick to 2-1 muxes for now. Though they appear to increase linearly for 4-1. 8-1 looks higher.
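
A back-of-the-envelope estimate using the LogicElements column above. This is a rough Python sketch under my own assumptions: merging two units costs one 2-1 mux per operand (two muxes in total), and the extra control logic and any fmax hit are ignored. The mux LE figures also appear to include registers, so the mux cost may be overstated:

```python
# LogicElements figures copied from the set_operation_attributes lines above.
LE = {
    "signed_multiply_nodsp_32": 758,
    "signed_multiply_nodsp_64": 2876,
    "signed_divide_64": 4637,
    "mux_2_32": 97,
    "mux_2_64": 193,
}

def net_le_saving(op, mux, operands=2):
    """LEs saved by merging two `op` instances behind 2-1 input muxes."""
    return LE[op] - operands * LE[mux]

for op, mux in [("signed_multiply_nodsp_32", "mux_2_32"),
                ("signed_multiply_nodsp_64", "mux_2_64"),
                ("signed_divide_64", "mux_2_64")]:
    print(op, net_le_saving(op, mux))
```

Under these assumptions every merged pair is a net win, and the 64-bit units gain the most per merge.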

Chrome buildbot changed:

acanis@acanis-desktop:~/buildbot/chrome_buildbot$ svn up
U    master.chromium.memory/master.cfg
U    scripts/common/chromium_utils.py
U    scripts/slave/chromium/sizes.py
A    scripts/slave/gsutil
U    scripts/slave/zip_build.py
D    scripts/master/unittests/master_utils_test.py
A    scripts/master/unittests/chromium_commands_test.py
U    scripts/master/unittests/runtests.py
U    scripts/master/factory/nacl_commands.py
U    scripts/master/factory/chromeos_commands.py
U    scripts/master/factory/commands.py
U    scripts/master/factory/chromium_commands.py
U    scripts/master/factory/nacl_factory.py
U    scripts/master/factory/chromeos_factory.py
U    scripts/master/factory/chromium_factory.py
U    scripts/master/factory/gclient_factory.py
U    scripts/master/chromium_status.py
U    master.nacl.sdk/public_html/announce.html
U    master.nacl.sdk/master.cfg
U    perf/dashboard/sizes.html
U    perf/dashboard/ui/js/plotter.js
U    perf/generate_perf.py
U    master.naclports/public_html/announce.html
U    master.naclports/master.cfg
U    pylibs/buildbot/README.chromium
U    pylibs/buildbot/status/web/console.py
U    master.chromium.fyi/master.cfg
U    master.chromium.fyi/slaves.cfg
U    master.nacl/public_html/announce.html
U    master.nacl/master.cfg
U    master.nacl/slaves.cfg
U    master.chromeos/public_html/announce.html
U    master.chromeos/master.cfg
U    master.chromeos/slaves.cfg
U    slave/run_slave.py
U    master.chromium/master.cfg
Updated to revision 54543.

Nice. So I merged in plotter.js and we got slightly nicer graphs (can now click a variable to highlight it)

Set up tiling window manager in gnome:

To set it up, run:

gconftool-2 -s /desktop/gnome/session/required_components/windowmanager xmonad --type string

Create the file:

$ cat /usr/share/applications/xmonad.desktop
[Desktop Entry]
Type=Application
Encoding=UTF-8
Name=Xmonad
Exec=xmonad
NoDisplay=true
X-GNOME-WMName=Xmonad
X-GNOME-Autostart-Phase=WindowManager
X-GNOME-Provides=windowmanager
X-GNOME-Autostart-Notify=false

Then create ~/.xmonad/xmonad.hs with:

import XMonad
import XMonad.Config.Gnome
 
main = xmonad gnomeConfig

Andrew Canis 2010/07/30 8:00

So it's sort of working. The graph doesn't work because git doesn't have numerical revisions. The important files:

buildmaster@acanis-desktop:~$ ll ./buildbot/public_html/perf/linux-release-hardy/moz/
-rwxr-xr-x 1 buildmaster buildmaster  60 2010-08-01 13:33 graphs.dat
-rw-r--r-- 1 buildmaster buildmaster 183 2010-08-01 13:33 total_byte_b-summary.dat

Inside:

buildmaster@acanis-desktop:~$ cat ./buildbot/public_html/perf/linux-release-hardy/moz/graphs.dat
[{"units": "kb", "important": true, "name": "total_byte_b"}]
buildmaster@acanis-desktop:~$ cat ./buildbot/public_html/perf/linux-release-hardy/moz/total_byte_b-summary.dat
{"traces": {"IO_b": ["5000.0", "0.0"]}, "rev": "a0d345abf9adf82074f0ad38ab6910b128c1147d"}
{"traces": {"IO_b": ["43457.0", "0.0"]}, "rev": "a0d345abf9adf82074f0ad38ab6910b128c1147d"}

Had to modify generic plotter to keep a map of # to gitrevision.
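
The idea, sketched in Python: give each git hash a sequential integer in the order it first appears, so the graph still gets a numeric x-axis. This is only an illustration; RevisionMap is a made-up name, and the actual change was made in the plotter, which is not shown here:

```python
import json

class RevisionMap:
    """Maps git hashes to sequential integers (and back)."""

    def __init__(self):
        self._to_num = {}
        self._to_rev = []

    def number_for(self, rev):
        # First sighting of a hash gets the next integer; repeats reuse it.
        if rev not in self._to_num:
            self._to_num[rev] = len(self._to_rev)
            self._to_rev.append(rev)
        return self._to_num[rev]

    def rev_for(self, number):
        return self._to_rev[number]

# Each line of a *-summary.dat file is one JSON object with a "rev" field.
lines = [
    '{"traces": {"IO_b": ["5000.0", "0.0"]}, "rev": "a0d345abf9adf82074f0ad38ab6910b128c1147d"}',
    '{"traces": {"IO_b": ["43457.0", "0.0"]}, "rev": "a0d345abf9adf82074f0ad38ab6910b128c1147d"}',
]
revmap = RevisionMap()
numbers = [revmap.number_for(json.loads(line)["rev"]) for line in lines]
print(numbers)
```

Both sample lines carry the same hash, so they land on the same x value; a new commit would get the next integer.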

I think I know what the issue is. Basically the chromium step's __init__ takes 2 arguments: self and log_processor. But buildbot calls the factory with keyword arguments only:

step = factory(**args)

Okay, so I had to actually modify factory.py in ~/buildbot/buildbot-0.8.1 and reinstall buildbot. Remember you have to restart buildbot if you make changes to the python classes.

diff --git a/buildbot-0.8.1/buildbot/process/factory.py b/buildbot-0.8.1/buildbot/process/factory.py
index 384feb2..e981239 100644
--- a/buildbot-0.8.1/buildbot/process/factory.py
+++ b/buildbot-0.8.1/buildbot/process/factory.py
@@ -60,11 +60,14 @@ class BuildFactory(util.ComparableMixin):
             if kwargs:
                 raise ArgumentsInTheWrongPlace()
             s = step_or_factory.getStepFactory()
-        elif type(step_or_factory) == type(BuildStep) and \
-                issubclass(step_or_factory, BuildStep):
-            s = (step_or_factory, dict(kwargs))
+        #elif type(step_or_factory) == type(BuildStep) and \
+        #        issubclass(step_or_factory, BuildStep):
+        #    s = (step_or_factory, dict(kwargs))
+        #else:
+        #    raise ValueError('%r is not a BuildStep nor BuildStep subclass' % step_or_factory)
+        # Fix needed for chrome perf
         else:
-            raise ValueError('%r is not a BuildStep nor BuildStep subclass' % step_or_factory)
+            s = (step_or_factory, dict(kwargs))
         self.steps.append(s)
 
     def addSteps(self, steps):

I've disabled perf expectations in master.cfg:

  'expectations': False,

Remember RESULTS must have a space after the equals!
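
For reference, the log processor looks for lines of the form <*>RESULT <graph_name>: <trace_name>= <value> <units>. A simplified Python pattern for the single-value form shows why the space matters. This regex is my own approximation, not the one used in process_log.py:

```python
import re

# Approximation of the single-value RESULT line format:
#   [*]RESULT <graph_name>: <trace_name>= <value> <units>
RESULT_LINE = re.compile(
    r"^(?P<important>\*)?RESULT (?P<graph>[^:]+): (?P<trace>[^=]+)= "
    r"(?P<value>[-\d.]+) (?P<units>\S+)$"
)

def parse_result(line):
    """Return the named fields of a RESULT line, or None if it does not match."""
    m = RESULT_LINE.match(line.strip())
    return m.groupdict() if m else None

print(parse_result("*RESULT vm_final_browser: OneTab= 8488 kb"))
print(parse_result("RESULT vm_final_browser: OneTab=8488 kb"))  # no space after '='
```

With a pattern like this, a line missing the space is silently skipped rather than reported as an error, which would explain an empty graph.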

Useful vim command: gF (goto line of file)

Andrew Canis 2010/07/30 8:00

Chrome buildbot templates:

acanis@acanis-desktop:~/buildbot/chrome_buildbot$ gvim pylibs/buildbot/status/web/index.html 

Console doesn't work because:

The console view is still in development. At this moment it supports only the source control managers that have an integer based revision id, like svn. 

Put chrome perf in buildbot/public_html. Ran:

python generate_perf.py

Generated a bunch of directories. I think make_expectations analyzes old data to calculate the max delta/variance and generates a 'perf_expectations.json' file, see:

In scripts/master/factory/chromium_commands.py:

  def AddUploadPerfExpectations(self, factory_properties=None):
    """Adds a step to the factory to upload perf_expectations.json to the
    master.
    """
    perf_id = factory_properties.get('perf_id')
    if not perf_id:
      logging.error("Error: cannot upload perf expectations: perf_id is unset")
      return
    slavesrc = "src/tools/perf_expectations/perf_expectations.json"
    masterdest = ("../scripts/master/log_parser/perf_expectations/%s.json" %
                  perf_id)

So that's what the 'perf_id' property is used for.

In log_parser/process_log.py there is: class PerformanceLogProcessor(object)

Seems to be a 'graphs.dat' file that I'm missing that is used in process_log.py

So in master/factory/chromium_factory.py we have:

    if R('page_cycler'):    f.AddPageCyclerTests(fp)

Then in master/factory/chromium_commands.py we have:

  def AddPageCyclerTests(self, factory_properties=None):
    """Adds a step to the factory to run the page-cycler tests."""

    tests = [
      {'name': 'moz'},
      {'name': 'morejs', 'http': False},
      {'name': 'intl1', 'http': False, 'target': 'Release'},
      {'name': 'intl2', 'http': False, 'target': 'Release'},
      {'name': 'bloat', 'http': True, 'target': 'Release'},
      {'name': 'dhtml', 'http': False, 'target': 'Release'},
      {'name': 'database', 'http': False, 'title': 'Database*'},
    ]

    for test in tests:
      # Set the different names for this test.
      test['command_name'] = test['name'].capitalize()
      test['perf_name'] = test['name']
      test['step_name'] = 'page_cycler_%s' % test['perf_name']

      # Derive the class from the factory, name, and log processor.
      test['class'] = self.GetPerfStepClass(
                          factory_properties, test['perf_name'],
                          process_log.GraphingPageCyclerLogProcessor)
      # Get the test's command.
      cmd = self.GetPageCyclerCommand(
                test.get('title', test['command_name']), http_page_cyclers)

      # Add the test step to the factory.
      self.AddTestStep(test['class'], test['step_name'], cmd)

These names match the graphs on the perf dashboard. A couple are missing - bloat and database.

In master/factory/commands.py:

  def GetPerfStepClass(self, factory_properties, test_name, log_processor_class, **kwargs):
    """Selects the right build step for the specified perf test."""
    perf_id = factory_properties.get('perf_id')
    show_results = factory_properties.get('show_perf_results')

    if show_results and self._target in self.PERF_TEST_MAPPINGS:
      mapping = self.PERF_TEST_MAPPINGS[self._target]
      perf_name = mapping.get(perf_id)
      if not perf_name:
        raise Exception, ('There is no mapping for identifier %s in %s' %
                            (perf_id, self._target))
      report_link = '%s/%s/%s/%s' % (self.PERF_BASE_URL, perf_name, test_name,
                                     self.PERF_REPORT_URL_SUFFIX)
      output_dir = '%s/%s/%s' % (self.PERF_OUTPUT_DIR, perf_name, test_name)

    return self._CreatePerformanceStepClass(log_processor_class,
               report_link=report_link, output_dir=output_dir,
               factory_properties=factory_properties, perf_name=perf_name,
               test_name=test_name)
  # --------------------------------------------------------------------------
  # PERF TEST SETTINGS
  # In each mapping below, the first key is the target and the second is the
  # perf_id. The value is the directory name in the results URL.

  # Configuration of most tests.
  PERF_TEST_MAPPINGS = {
    'Release': {
      'chromium-linux-targets': 'linux-targets',
      'chromium-rel-linux-hardy': 'linux-release-hardy',

  perf_base_url = 'http://build.chromium.org/buildbot/perf'
  perf_report_url_suffix = 'report.html?history=150'
  # Directory in which to save perf-test output data files.
  perf_output_dir = '~/www/perf'

Note in master.cfg we declare the perf_id:

f_cr_rel_linux_hardy_1 = F_LINUX('chromium-rel-linux-hardy',
                                 tests=['page_cycler', 'startup',
                                        'page_cycler_http'],
                                 options=['startup_tests', 'page_cycler_tests'],
                                 factory_properties={
                                     'show_perf_results': True,
                                     'expectations': True,
                                     'perf_id': 'chromium-rel-linux-hardy'})

In master/factory/commands.py:

  # Performance step utils.
  def _CreatePerformanceStepClass(
      self, log_processor_class, report_link=None, output_dir=None,
      factory_properties=None, perf_name=None, test_name=None,
      command_class=chromium_step.ProcessLogShellStep):
    """Returns ProcessLogShellStep class.

    Args:
      log_processor_class: class that will be used to process logs. Normally
        should be a subclass of process_log.PerformanceLogProcessor.
      report_link: URL that will be used as a link to results. If None,
        result won't be written into file.
      output_dir: directory where the log processor will write the results.
      command_class: command type to run for this step. Normally this will be
        chromium_step.ProcessLogShellStep.
    """
    # We create a log-processor class using
    # chromium_utils.InitializePartiallyWithArguments, which uses function
    # currying to create classes that have preset constructor arguments.
    # This serves two purposes:
    # 1. Allows the step to instantiate its log processor without any
    #    additional parameters;
    # 2. Creates a unique log processor class for each individual step, so
    # they can keep state that won't be shared between builds
    log_processor_class = chromium_utils.InitializePartiallyWithArguments(
        log_processor_class, report_link=report_link, output_dir=output_dir,
        factory_properties=factory_properties, perf_name=perf_name,
        test_name=test_name)
    # Similarly, we need to allow buildbot to create the step itself without
    # using additional parameters, so we create a step class that already
    # knows which log_processor to use.
    return chromium_utils.InitializePartiallyWithArguments(command_class,
                                                           log_processor_class)

So basically this just creates the step class, with its log processor already bound.
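A minimal sketch of the currying idea, not the actual chromium_utils code: it assumes InitializePartiallyWithArguments behaves roughly like functools.partial for classes, returning a fresh subclass with some constructor arguments preset. The helper name and the LogProcessor class below are hypothetical.

```python
def initialize_partially(cls, *args, **kwargs):
    """Return a new subclass of cls whose constructor has args/kwargs preset.

    Hypothetical stand-in for chromium_utils.InitializePartiallyWithArguments.
    """
    class Partial(cls):
        def __init__(self, *more_args, **more_kwargs):
            merged = dict(kwargs)
            merged.update(more_kwargs)
            super().__init__(*(args + more_args), **merged)

    Partial.__name__ = cls.__name__
    return Partial


class LogProcessor:
    """Hypothetical log processor taking two of the keyword arguments above."""
    def __init__(self, report_link=None, output_dir=None):
        self.report_link = report_link
        self.output_dir = output_dir


# Each call yields a distinct class, so per-step state is not shared.
StepA = initialize_partially(LogProcessor, report_link='a/report.html')
StepB = initialize_partially(LogProcessor, report_link='b/report.html')
processor = StepA()  # no extra arguments needed at instantiation time
```

Because every call builds a new class object, anything stored at class level stays private to that step, which matches purpose 2 in the comment above.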

In scripts/master/log_parser/process_log.py:

class GraphingPageCyclerLogProcessor(GraphingLogProcessor):
  """Handles additional processing for page-cycler timing data."""
  ...

class GraphingLogProcessor(PerformanceLogProcessor):
  """Parent class for any log processor expecting standard data to be graphed.

  The log will be parsed looking for any lines of the form
    <*>RESULT <graph_name>: <trace_name>= <value> <units>
  or
    <*>RESULT <graph_name>: <trace_name>= [<value>,value,value,...] <units>
  or
    <*>RESULT <graph_name>: <trace_name>= {<mean>, <std deviation>} <units>
  For example,
    *RESULT vm_final_browser: OneTab= 8488 kb
    RESULT startup: reference= [167.00,148.00,146.00,142.00] msec
  The leading * is optional; if it's present, the data from that line will be
  included in the waterfall display. If multiple values are given in [ ], their
  mean and (sample) standard deviation will be written; if only one value is
  given, that will be written. A trailing comma is permitted in the list of
  values.
  Any of the <fields> except <value> may be empty, in which case
  not-terribly-useful defaults will be used. The <graph_name> and <trace_name>
  should not contain any spaces, colons (:) nor equals-signs (=). Furthermore,
  the <trace_name> will be used on the waterfall display, so it should be kept
  short.  If the trace_name ends with '_ref', it will be interpreted as a
  reference value, and shown alongside the corresponding main value on the
  waterfall.
  """
  RESULTS_REGEX = re.compile(
      r'(?P<IMPORTANT>\*)?RESULT '
       '(?P<GRAPH>[^:]*): (?P<TRACE>[^=]*)= '
       '(?P<VALUE>[\{\[]?[\d\., ]+[\}\]]?)( ?(?P<UNITS>.+))?')
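A quick way to check my reading of the regex is to run it over the example lines from the docstring. This is only a sketch: parse_result_line and mean_and_stddev are hypothetical helpers, not functions from process_log.py, and the pattern is copied from above with every piece written as a raw string.

```python
import re
import statistics

# Same pattern as RESULTS_REGEX above, with every piece as a raw string.
RESULTS_REGEX = re.compile(
    r'(?P<IMPORTANT>\*)?RESULT '
    r'(?P<GRAPH>[^:]*): (?P<TRACE>[^=]*)= '
    r'(?P<VALUE>[\{\[]?[\d\., ]+[\}\]]?)( ?(?P<UNITS>.+))?')


def parse_result_line(line):
    """Return a dict describing one RESULT line, or None if it doesn't match."""
    match = RESULTS_REGEX.search(line)
    if match is None:
        return None
    return {
        'important': match.group('IMPORTANT') is not None,
        'graph': match.group('GRAPH'),
        'trace': match.group('TRACE'),
        # The value class includes spaces, so a trailing space can be captured.
        'value': match.group('VALUE').strip(),
        'units': match.group('UNITS'),
    }


def mean_and_stddev(value):
    """Reduce a '[a,b,c]' list to (mean, sample std dev); hypothetical helper."""
    numbers = [float(v) for v in value.strip('[]').split(',') if v.strip()]
    if len(numbers) == 1:
        return numbers[0], 0.0
    return statistics.mean(numbers), statistics.stdev(numbers)


single = parse_result_line('*RESULT vm_final_browser: OneTab= 8488 kb')
listed = parse_result_line(
    'RESULT startup: reference= [167.00,148.00,146.00,142.00] msec')
mean, stddev = mean_and_stddev(listed['value'])
```

Note that the VALUE group swallows the space before the units, which is why the sketch strips it.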

So basically GraphingPageCyclerLogProcessor will be the command_class in:

  def AddTestStep(self, command_class, test_name, test_command,
                  test_description='', timeout=600, workdir=None, env=None,
                  locks=None, halt_on_failure=False, do_step_if=True):
    """Adds a step to the factory to run a test.

    Args:
      command_class: the command type to run, such as shell.ShellCommand or
          gtest_command.GTestCommand
      test_name: a string describing the test, used to build its logfile name
          and its descriptions in the waterfall display
      timeout: the buildbot timeout for the test, in seconds.  If it doesn't
          produce any output to stdout or stderr for this many seconds,
          buildbot will cancel it and call it a failure.
      test_command: the command list to run
      test_description: an auxiliary description to be appended to the
        test_name in the buildbot display; for example, ' (single process)'

GraphingPageCyclerLogProcessor's top parent is:

class PerformanceLogProcessor(object):
  """ Parent class for performance log parsers. """
  def Process(self, revision, data):
    """Invoked by the step with data from log file once it completes.

    Each subclass needs to override this method to provide custom logic,
    which should include setting self._revision.
    Args:
      revision: changeset revision number that triggered the build.
      data: content of the log file that needs to be processed.

    Returns:
      A list of strings to be added to the waterfall display for this step.
    """
    self._revision = revision
    return []

Okay, so I get this. Basically GraphingPageCyclerLogProcessor is ultimately a Buildbot Step (like ShellCommand). It processes the output according to the regular expressions given above. So why don't I start by reusing one of these page cycler tests, but just change the command and see what happens?

Looking at Chrome's build log for a perf test:

# upload [uploading perf_expectations.json] [0 seconds]

# page_cycler_moz [page_cycler_moz
PERF_IMPROVE: total_op_b/IO_op_b IO_b: 43.5k (42.8k) IO_b_extcs1: 43.3k IO_op_b: 48.4k (53.6k) IO_op_b_extcs1: 53.6k IO_op_r: 28.2k (27.4k) IO_op_r_extcs1: 28.3k IO_r: 7.84k (7.55k) IO_r_extcs1: 7.68k t: 1.1k (1.09k) t_extcs1: 1.1k vm_pk_b: 15.0M (13.9M) vm_pk_b_extcs1: 16.4M vm_pk_r: 68.8M (83.1M) vm_pk_r_extcs1: 69.0M vm_spk_r: 68.8M (83.1M) vm_spk_r_extcs1: 69.0M ws_pk_b: 31.1M (28.8M) ws_pk_b_extcs1: 32.1M ws_pk_r: 65.7M (79.5M) ws_pk_r_extcs1: 65.5M ws_spk_r: 65.7M (79.5M) ws_spk_r_extcs1: 65.5M
] [70 seconds]

   1. stdio
   2. results

Looking at the stdio:

python_slave ..\..\..\scripts\slave\runtest.py --target Release --build-dir src/build page_cycler_tests.exe --gtest_filter=PageCycler*.MozFile
 in dir C:\b\slave\chromium-rel-xp-perf-1\build (timeout 600 secs)
C:\b\slave\chromium-rel-xp-perf-1\build\src\build\Release\page_cycler_tests.exe --gtest_filter=PageCycler*.MozFile
Note: Google Test filter = PageCycler*.MozFile
[==========] Running 3 tests from 3 test cases.
[----------] Global test environment set-up.
[----------] 1 test from PageCyclerTest
[ RUN      ] PageCyclerTest.MozFile
*RESULT vm_peak_b: vm_pk_b= 14979072 bytes
*RESULT ws_peak_b: ws_pk_b= 31059968 bytes
...
*RESULT total_byte_b: IO_b= 43457 kb
*RESULT total_op_b: IO_op_b= 48440 
RESULT other_byte_b: o_b= 413 kb
*RESULT total_byte_b: IO_b_extcs1= 43324 kb
...
*RESULT times: t= [31,55,33,...] ms
[       OK ] PageCyclerTest.MozFile (23093 ms)
[----------] 1 test from PageCyclerTest (23093 ms total)
[----------] 1 test from PageCyclerReferenceTest
[ RUN      ] PageCyclerReferenceTest.MozFile
*RESULT vm_peak_b: vm_pk_b_ref= 13942784 bytes
...
*RESULT total_byte_b: IO_b_ref= 42783 kb

Remember the format is:

    <*>RESULT <graph_name>: <trace_name>= <value> <units>

So this matches with the above, with reference in brackets. 'PERF_IMPROVE: total_op_b/IO_op_b' means that this perf metric is better than the var/delta in the expectations file. What are these reference tests? Does it mean running the tests with a previous version of chrome? Yes, it looks like they have an older version of chrome called the 'reference':

class PageCyclerReferenceTest : public PageCyclerTest {
 public:
  // override the browser directory that is used by UITest::SetUp to cause it
  // to use the reference build instead.
  void SetUp() {
    FilePath dir;
    PathService::Get(chrome::DIR_TEST_TOOLS, &dir);
    dir = dir.AppendASCII("reference_build");
    ...
    dir = dir.AppendASCII("chrome");
    browser_directory_ = dir;
    PageCyclerTest::SetUp();
  }

Yep, these are actually checked into the chrome svn:

src/chrome/tools/test/reference_build/chrome - Windows reference build for performance testing.

Also, the perf_expectations.json looks at the diff between current build and reference build to avoid glitches on the machine.

We don't need a reference because we will never see 'glitches'; our perf data is deterministic.

graphs.dat comes from this function in GraphingLogProcessor:

  def __SaveGraphInfo(self):
    """Keep a list of all graphs ever produced, for use by the plotter.

    Build a list of dictionaries:
      [{'name': graph_name, 'important': important, 'units': units},
       ...,
      ]
    sorted by importance (important graphs first) and then graph_name.
    Save this list into the GRAPH_LIST file for use by the plotting
    page. (We can't just use a plain dictionary with the names as keys,
    because dictionaries are inherently unordered.)
    """

Okay. Let's try this out.

Andrew Canis 2010/07/30 8:00

buildbot is leaving running vsim jobs if you do a 'force stop' or a build times out. I think the problem is that buildbot sends a SIGKILL instead of a SIGTERM.

Let's see what happens when I send a SIGKILL to runtest:

acanis@acanis-desktop:~$ pstree -p -a 16133
sh,16133 -c runtest\040../dejagnu/*.exp
  └─expect,16134 -- /usr/share/dejagnu/runtest.exp ../dejagnu/jpeg.exp
      ├─make,16185 v
      │   └─sh,16211 -c vsim\040-note\0402009\040-c\040-do\040"run\0407000000000000000ns;\040exit;"\040work.main_tb
      │       └─vish,16212 -- -vsim -note 2009 -c -do run\0407000000000000000ns;\040exit; work.main_tb
      │           ├─vlm,16220 652114806 814696814
      │           │   └─mgls_asynch,16221  -f6,10
      │           └─vsimk,16224 -port 40073 -stdoutfilename /tmp/VSOUTH6KLFK -note 2009 -c -do run\0407000000000000000ns;\040exit; work.main_tb
      └─{expect},16154

Ran:

acanis@acanis-desktop:~$ kill -9 16134

For some reason 'runtest' doesn't kill any of its children:

acanis@acanis-desktop:~$ pstree -p -a 16185
make,16185 v
  └─sh,16211 -c vsim\040-note\0402009\040-c\040-do\040"run\0407000000000000000ns;\040exit;"\040work.main_tb
      └─vish,16212 -- -vsim -note 2009 -c -do run\0407000000000000000ns;\040exit; work.main_tb
          ├─vlm,16220 652114806 814696814
          │   └─mgls_asynch,16221  -f6,10
          └─vsimk,16224 -port 40073 -stdoutfilename /tmp/VSOUTH6KLFK -note 2009 -c -do run\0407000000000000000ns;\040exit; work.main_tb

so kill -9 orphans all children. I've posted a buildbot bug:
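One way to avoid the orphans is to put the whole job in its own process group and signal the group instead of a single PID. This is a sketch assuming a POSIX system; it is not a description of what buildbot does, and the shell command is only a stand-in for the runtest, make, vsim chain.

```python
import os
import signal
import subprocess
import time


def start_in_own_group(command):
    """Start command as the leader of a new session and process group."""
    return subprocess.Popen(command, start_new_session=True)


def kill_process_group(process, sig=signal.SIGKILL):
    """Signal every process in the child's group, then reap the child."""
    os.killpg(os.getpgid(process.pid), sig)
    process.wait()
    return process.returncode


# Stand-in for the real job: a shell that forks a background child.
process = start_in_own_group(['sh', '-c', 'sleep 30 & wait'])
time.sleep(0.2)  # give the shell a moment to fork its child
returncode = kill_process_group(process)
```

SIGKILL can't be caught, so the parent never gets a chance to clean up; signalling the group reaches the descendants directly, as long as none of them has moved to a different group.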

Buildbot is slowing down my machine. So I'm lowering the priority on all buildslave processes:

sudo renice +20 --user buildslave

You can also permanently set the priority of buildslave to the lowest in /etc/security/limits.conf:

buildslave           hard     priority        19

See: http://tldp.org/HOWTO/Xterminals/advanced.html

Had to create a .profile to source .bashrc on navy.

Created buildslave:

buildslave create-slave -r --umask=022 buildbot legup.org:3462 navy password

Final .bashrc on navy:

# buildslave
export PATH=~/buildbot-slave-0.8.1/bin:$PATH
export PYTHONPATH=~/legup/python/lib/python/

export PATH=~/legup/bin/dejagnu:$PATH
export PATH=~/legup/bin:$PATH
export PATH=~zhangvi1/altera9.1/quartus/bin:$PATH
export PATH=~zhangvi1/modeltech/bin:$PATH

export MGLS_LICENSE_FILE=7326@picton.eecg.utoronto.ca
export LM_LICENSE_FILE=1802@ra.eecg.toronto.edu

export PATH=~/legup/llvm-gcc4.2-2.7-x86_64-linux/bin/:$PATH
export MIPS_PREFIX=mipsel-elf-

Installing buildbot python module locally in ~/legup/python:

acanis@navy:~/buildbot-slave-0.8.1$ python setup.py install --home=~/legup/python/

buildbot git update wasn't working. Had to modify /usr/share/buildbot/contrib/git_buildbot.py:

master = "legup.org:3462"

Is that the best place to put that script? No, I've moved it inside the git repo folder.

Andrew Canis 2010/07/28 8:00

Really cool top alternative: atop

Trying to figure out the chrome buildbot setup. So basically there are a bunch of builders: 'Chromium XP', 'Chromium Linux', 'XP Tests', 'XP Perf', 'Chromium Builder':

Looking at 'Chromium XP': unit test seems to save the results:

copying dashboard file gtest-results/gpu_unittests\results.json to \\chrome-web.jail.google.com\chrome-bot\www\gtest_results\chromium-rel-xp\gpu_unittests
saving results to \\chrome-web.jail.google.com\chrome-bot\www\gtest_results\chromium-rel-xp\gpu_unittests

JSON is a data-interchange format, similar in purpose to XML.

Andrew Canis 2010/07/23 8:00

Had to hack my git post-receive to get the code to work:

To run a full Quartus compile for all the benchmark circuits:
#!/bin/sh
data=$(cat)
echo "$data" | python /usr/share/buildbot/contrib/git_buildbot.py $1 $2 $3 | exit
echo "$data" | hooks/post-receive-email | exit

Because I want to generate a commit email and also notify the buildbot master.
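The trick in the hook is that stdin can only be read once, so it is saved and replayed to each consumer. The same idea as a Python sketch; fan_out is a hypothetical helper, and the two demo commands are stand-ins for git_buildbot.py and post-receive-email.

```python
import subprocess
import sys


def fan_out(data, commands):
    """Feed the same stdin bytes to each command in turn.

    Returns a list of (exit code, captured stdout) pairs.
    """
    results = []
    for command in commands:
        done = subprocess.run(command, input=data, stdout=subprocess.PIPE)
        results.append((done.returncode, done.stdout))
    return results


# Stand-in consumers: one upper-cases its input, one counts characters.
upper = [sys.executable, '-c',
         'import sys; sys.stdout.write(sys.stdin.read().upper())']
count = [sys.executable, '-c',
         'import sys; sys.stdout.write(str(len(sys.stdin.read())))']
results = fan_out(b'old new refs/heads/master\n', [upper, count])
```

In a real hook, data would come from sys.stdin.buffer.read() and the commands would be the two scripts above.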

Really cool command. Set all permissions to equal user permissions.

chmod -R a=u dir

Binding infrastructure done. Setting up buildbot. Not using the older ubuntu version. I'd rather install from scratch. Following installation guide on buildbot website. Needed to install python-dev

Adding a new user

sudo adduser --disabled-login --home /home/buildmaster buildmaster

To verify that there is no password look in /etc/shadow:

buildmaster:!:14812:0:99999:7:::

Both ! and * mean there is no valid password hash, so password login is disabled. A leading ! conventionally marks a locked password.

Creating master directory:

buildmaster@acanis-desktop:/home/buildmaster/master$ buildbot create-master .

setting up buildslave

sudo adduser --disabled-login --home /home/buildslave buildslave

git_buildbot.py is missing from the distribution. Had to manually download it from: http://github.com/buildbot/buildbot/raw/master/master/contrib/git_buildbot.py

Need to properly set up the environment for the buildmaster and buildslave:

sudo su buildslave
# modify .bashrc

Andrew Canis 2010/07/22 8:00

Getting an error with memset:

FAIL: gxemul simulation. Expected: reg: v0 = 0x0000000000000000

Seems related to:

acanis@acanis-desktop:~/work/legup/examples/memset$ diff ../lib/llvm/liblegup.a /home/acanis/work/fresh/legup/examples/lib/llvm/liblegup.a 
Binary files ../lib/llvm/liblegup.a and /home/acanis/work/fresh/legup/examples/lib/llvm/liblegup.a differ

What generates the liblegup.a file?

Oh it's probably because I haven't pulled in a while… I'll do that later.

Watch out! dhrystone 'make watch' shows differences due to pointers:

sim:
  %3=5a000000
lli:
  %3=bf9f5978

Debugging binding: I have a call that's not being taken in dfadd:

	299:
	begin
		/*   %356 = tail call fastcc i64 @roundAndPackFloat64(i32 %zSign_addr.0.i, i32 %355, i64 %353) nounwind ; <i64> [#uses=2]*/
	/* normalizeRoundAndPackFloat64.exit.i*/
	304:
	begin
		if (roundAndPackFloat64_finish_reg) begin
			cur_state = 305;
		end

Seems to have skipped over the function call. Maybe finish was left high? What's this:

if (cur_state == 302) begin
roundAndPackFloat64_finish_reg = roundAndPackFloat64_finish;

Wrong state. Wait, there are two calls:

	302:
		if (roundAndPackFloat64_finish) begin

Will have to special case finish to always be the wire.

Fresh repo available in ~/work/fresh

Andrew Canis 2010/07/21 8:00

Trying to merge in changes. Problem with chaining (mips.v):

4:
		/*   %pc.0 = phi i32 [ %pc.1, %bb45 ], [ 4194304, %bb7.preheader ] ; <i32> [#uses=6]*/
		/*   %9 = lshr i32 %pc.0, 2                          ; <i32> [#uses=2]*/

Produces:

56:
posedge clk:
    pc_0_phi_temp = 32'd4194304;   /* for PHI node */
    ...

always @(posedge clk) begin
if (cur_state == 4) begin
var5 = pc_0 >>> 32'd2 % 32;
end
end
always @(posedge clk) begin
if (cur_state == 4) begin
pc_0 = pc_0_phi_temp;
end
end

Andrew Canis 2010/07/20 8:00

Eventually could add this to binding:

    /*
    RTLBuilder *builder;
    RTLOp *finished = t->addCond("==", finish, RTLConst(1));
    RTLOp *display = builder.create("display", 
            "At t=%t clk=%b finish=%b return_val=%d",
            builder.create("$time"), clk, finish, return_val);
    finished.add(display);
    finished.add(builder.create("$finish"));

//    "initial \n" <<
//    "    clk = #0 0;\n" <<
    RTLOp *initial = builder.create("initial");
    initial.add(builder.create("=", clk, 0, "#0");

//    "always @(clk)\n" <<
//    "    clk <= #1 ~clk;\n" <<
    RTLOp *clk_neg = builder.create("~", clk);
    initial.add(builder.create("=", clk, clk_neg, "#1");

//    "initial begin\n" <<
//    "//$monitor(\"At t=%t clk=%b %b %b %b %d\", $time, clk, reset, start, finish, return_val);\n" <<
//    "@(negedge clk);\n" <<
//    "reset = 1;\n" <<
//    "@(negedge clk);\n" <<
//    "reset = 0;\n" <<
//    "start = 1;\n" <<
//    "\n" <<
//    "end\n" <<

    RTLOp *initial = builder.create("initial");
    initial.add(builder.create("@(negedge)", clk));
    initial.add(builder.create("=", reset, RTLConst(1)));
    initial.add(builder.create("@(negedge)", clk));
    initial.add(builder.create("=", reset, RTLConst(0)));
    initial.add(builder.create("=", start, RTLConst(1)));
    */

Andrew Canis 2010/07/19 8:00

Where do 'slots' (%2) get stored for instructions?

acanis@acanis-desktop:~/work/legup/examples/sra$ gdb llc
Breakpoint 1 at 0x85ec45a: file Verilog.cpp, line 1933.
(gdb) run -march=v sra.bc -o sra.v
llvm::operator<< (OS=@0x9173a40, V=@0xae73408) at /home/acanis/work/legup/llvm/include/llvm/Value.h:313
312	inline raw_ostream &operator<<(raw_ostream &OS, const Value &V) {
313	  V.print(OS);
314	  return OS;
315	}
(gdb) s
llvm::Value::print (this=0xae73408, ROS=@0x9173a40, AAW=0x0) at AsmWriter.cpp:2072

Okay found it:

llvm::Value::print (this=0xae73408, ROS=@0x9173a40, AAW=0x0) at AsmWriter.cpp:2072
2077	  if (const Instruction *I = dyn_cast<Instruction>(this)) {
2078	    const Function *F = I->getParent() ? I->getParent()->getParent() : 0;
2079	    SlotTracker SlotTable(F);
2080	    AssemblyWriter W(OS, SlotTable, getModuleFromVal(I), AAW);
2081	    W.printInstruction(*I);

Unfortunately you can't move this code out of AsmWriter.cpp. SlotTracker is defined in the cpp file.

Andrew Canis 2010/07/15 8:00

Debugging binding:

Andrew Canis 2010/07/14 8:00

Modified system→pref→sound capture to usb device.

Andrew Canis 2010/07/06 8:00

Install 'countperl' for cyclomatic complexity of perl programs:

sudo PERL_MM_USE_DEFAULT=1 perl -MCPAN -e 'install Perl::Metrics::Simple'

Had to run this command twice. The first time it failed for some reason.

Pidgin just crashed X11!

Jun 30 15:42:12 acanis-desktop kernel: [4059043.893782] pidgin[9256]: segfault at a98b000 ip b752c253 sp bf8004d0 error 6 in libX11.so.6.2.0[b74fd000+ea000]

Andrew Canis 2010/06/30 8:00

Need to fix h/w partition for James

Can't assign signals in separate always blocks:

Can't resolve multiple constant drivers for net...

Andrew Canis 2010/06/22 8:00

Working on the scheduler. Try setting every non-memory instruction to have a latency of 0. Problem:

		/*   %12 = load i32* %11, align 4                    ; <i32> [#uses=1]*/
		/*   %load_noop = add i32 %12, 0                     ; <i32> [#uses=19]*/
		load_noop = var8 + 32'd0;

Putting the load noop right after the load. That's wrong. Why do we have the load noop again? Okay, let's get rid of this load noop first. So my hypothesis is that instructions that depend on the load don't look at the 'end' cycle, just the 'start'.

okay, fixed that. Now there is an error with the comb logic:

5:
    /*   %11 = getelementptr inbounds [44 x i32]* @imem, i32 0, i32 %10 ; <i32*> [#uses=1]*/
    var7 = {`TAG_imem, 32'b0} + ((var6 + 44 * (32'd0)) << 2);
...
5:
    memory_controller_address = var7;

var7 is a register which isn't updated until the next state…

Andrew Canis 2010/06/21 8:00

I finally figured out the reason for the createXXXXXPass() function. Because there are no header files for any of the passes, you need this global function to create the object.

Posted some results on:

See files in:

work/legup/results.xls
work/legup/plot.m

git fetch only updates the remote-tracking branches (origin/*); it doesn't touch the working directory or the local branches:

acanis@acanis-desktop:~/work/legup$ git fetch
acanis@acanis-desktop:~/work/legup$ git diff --summary master..origin
 create mode 100644 examples/chstone/Makefile
 create mode 100644 examples/phi/phi.c
 create mode 100644 llvm/lib/Transforms/LegUp/WatchVariables.cpp
 create mode 100644 tiger/linux_tools/Makefile
 delete mode 100755 tiger/linux_tools/elf2sdram
 delete mode 100755 tiger/linux_tools/find_ra
 create mode 100644 tiger/linux_tools/lib/prog_link_sim.ld
 delete mode 100644 tiger/processor/tiger_mips/sdram.dat
 delete mode 100644 tiger/processor/tiger_mips/tiger.html
 delete mode 100644 tiger/processor/tiger_mips/tiger_sim/cacheMem.ver
 delete mode 100644 tiger/processor/tiger_mips/tiger_sim/sdram.dat
 delete mode 100644 tiger/processor/tiger_mips/tiger_sim/uart_0_log_module.txt
 delete mode 100644 tiger/processor/tiger_mips/tiger_top_hw.tcl~
 create mode 100644 tiger/tool_source/hack_jt.cpp
 create mode 100644 tiger/tool_source/lib/dev_cons.h
 create mode 100644 tiger/windows_tools/lib/prog_link_sim.ld

Andrew Canis 2010/06/18 8:00

Found new command: llvm-extract --func <function_name>

jpeg$ llvm-extract -S --func DecodeHuffMCU  main.ll

Does this command also keep functions called by the specified function? Trying with jpeg, DecodeHuffMCU calls DecodeHuffman and buf_getv. Nope:

declare fastcc i32 @buf_getv(i32) nounwind
declare fastcc i32 @DecodeHuffman() nounwind

So I just need to build up a list of functions and their called functions.
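Building that list is a plain reachability walk over the call graph. A sketch, assuming the call graph is already available as a dict from caller to callees; only the DecodeHuffMCU edges come from the output above, the rest of the graph is made up.

```python
from collections import deque


def functions_to_keep(call_graph, root):
    """Return root plus every function reachable from it through calls."""
    keep = {root}
    queue = deque([root])
    while queue:
        caller = queue.popleft()
        for callee in call_graph.get(caller, ()):
            if callee not in keep:
                keep.add(callee)
                queue.append(callee)
    return keep


# Hypothetical call graph; only the DecodeHuffMCU edges are from the log.
call_graph = {
    'main': ['DecodeHuffMCU'],
    'DecodeHuffMCU': ['DecodeHuffman', 'buf_getv'],
    'DecodeHuffman': ['buf_getv'],
}
keep = functions_to_keep(call_graph, 'DecodeHuffMCU')
```

The visited set also makes the walk safe for recursive or mutually recursive functions.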

Andrew Canis 2010/06/15 8:00

There are some transformation passes dealing with extracting Functions in the IPO folder. There's a takeName function!

          Function *New = Function::Create(I->getFunctionType(),
                                           GlobalValue::ExternalLinkage);
          New->copyAttributesFrom(I);

          // If it's not the named function, delete the body of the function
          I->dropAllReferences();

          M.getFunctionList().push_back(New);
          NewFunctions.push_back(New);
          New->takeName(I);

Interesting:

The members and base classes of a struct are public by default, while in class, they default to private. Note: you should make your base classes explicitly public, private, or protected, rather than relying on the defaults. 

From: http://www.parashift.com/c++-faq-lite/classes-and-objects.html#faq-7.9

Andrew Canis 2010/06/15 8:00

Confirmed again on llist, struct, dhrystone. Doesn't make any difference in area. Synthesis tools must be able to tell that byte enable means that those bytes don't matter. The code is uglier anyway.

I was right. The ram input doesn't depend on size. However, strangely enough this doesn't produce better synthesis results.

`define B0 8-1:0
`define B1 16-1:8
`define B2 24-1:16
`define B3 32-1:24
`define B4 40-1:32
`define B5 48-1:40
`define B6 56-1:48
`define B7 64-1:56


  node1_in[`B0] = memory_controller_in[`B0];

  case (memory_controller_address [0])
      // short/int/long - addr: 000
      0: node1_in[`B1] = memory_controller_in[`B1];
      // byte - addr: 001
      1: node1_in[`B1] = memory_controller_in[`B0];
  endcase

  case (memory_controller_address [1])
      // int/long - addr: 000
      0: node1_in[`B2] = memory_controller_in[`B2];
      // byte/short - addr: 010
      1: node1_in[`B2] = memory_controller_in[`B0];
  endcase

  case (memory_controller_address [1:0])
      // int/long - addr: 000
      0: node1_in[`B3] = memory_controller_in[`B3];
      // short - addr: 010 
      2: node1_in[`B3] = memory_controller_in[`B1];
      // byte - addr: 011
      3: node1_in[`B3] = memory_controller_in[`B0];
      default: node1_in[`B3] = 'bx;
  endcase

  case (memory_controller_address [2:1])
      // long - addr: 000
      0: node1_in[`B4] = memory_controller_in[`B4];
      // short - addr: 011 
      1: node1_in[`B4] = memory_controller_in[`B1];
      // byte/int - addr: 100
      2: node1_in[`B4] = memory_controller_in[`B0];
      default: node1_in[`B4] = 'bx;
  endcase

  case (memory_controller_address [2:0])
      // long - addr: 000
      0: node1_in[`B5] = memory_controller_in[`B5];
      // short/int - addr: 100
      4: node1_in[`B5] = memory_controller_in[`B1];
      // byte - addr: 101
      5: node1_in[`B5] = memory_controller_in[`B0];
      default: node1_in[`B5] = 'bx;
  endcase

  case (memory_controller_address [2:1])
      // long - addr: 000
      0: node1_in[`B6] = memory_controller_in[`B6];
      // int - addr: 100
      2: node1_in[`B6] = memory_controller_in[`B2];
      // byte/short - addr: 110
      3: node1_in[`B6] = memory_controller_in[`B0];
      default: node1_in[`B6] = 'bx;
  endcase

  case (memory_controller_address [2:0])
      // long - addr: 000
      0: node1_in[`B7] = memory_controller_in[`B7];
      // int - addr: 100
      4: node1_in[`B7] = memory_controller_in[`B3];
      // short - addr: 110
      6: node1_in[`B7] = memory_controller_in[`B1];
      // byte - addr: 111
      7: node1_in[`B7] = memory_controller_in[`B0];
      default: node1_in[`B7] = 'bx;
  endcase

Andrew Canis 2010/06/11 8:00

Found a great paper on Catapult C:

Some improvements can be made to the memory controller hardware. The only thing that depends on memory_controller_size is the byte-enable when you are writing to the ram. Otherwise the steering just depends on the addresses.

Andrew Canis 2010/06/10 8:00

The original paper on tcl. Very very well written introduction:

  • Tcl: An Embeddable Command Language

Install tcl:

sudo apt-get install tcl8.5-dev

This was dealt with in the CBackend by using a _phi_temp variable:

llvm_cbe_legup_memcpy_4_2e_exit:
  llvm_cbe_L_ACF_2e_2_2e_08__PHI_TEMPORARY = 0u;   /* for PHI node */
  ...
llvm_cbe_bb_2e_bb_crit_edge:
  llvm_cbe_L_ACF_2e_2_2e_08__PHI_TEMPORARY = llvm_cbe_tmp__2;   /* for PHI node */
  ...
llvm_cbe_bb:
  llvm_cbe_L_ACF_2e_2_2e_08 = llvm_cbe_L_ACF_2e_2_2e_08__PHI_TEMPORARY;

Modified verilog to use a temp variable. It works. Is there an easy way to add a new uniquename not tied to a Value*?

Andrew Canis 2010/06/07 8:00

The problem is with phi dependencies:

/*   %3 = phi i32 [ 1280, %legup_memcpy_4.exit ], [ %load_noop4, %bb.bb_crit_edge ] ; <i32> [#uses=2]*/
/*   %L_ACF.2.08 = phi i32 [ 0, %legup_memcpy_4.exit ], [ %3, %bb.bb_crit_edge ] ; <i32> [#uses=1]*/

In the above code, you MUST take the old value of %3 when evaluating the second phi.

I should _definitely_ have a test case for this.
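A tiny model of the problem, as a starting point for that test case. Phi nodes at the top of a block all read their inputs "at the same time", so evaluating them one after another goes wrong when one phi reads another's old value. The two phis are the ones above; the value 1857 for %load_noop4 is made up for illustration.

```python
def read(env, operand):
    """Constants are ints; SSA names are strings looked up in env."""
    return env[operand] if isinstance(operand, str) else operand


def eval_phis_in_order(env, phis, pred):
    """Buggy: each phi sees the results of the phis before it."""
    env = dict(env)
    for dest, incoming in phis:
        env[dest] = read(env, incoming[pred])
    return env


def eval_phis_with_temps(env, phis, pred):
    """Correct: read every input from the old values, then commit."""
    temps = {dest: read(env, incoming[pred]) for dest, incoming in phis}
    updated = dict(env)
    updated.update(temps)
    return updated


phis = [
    ('%3', {'legup_memcpy_4.exit': 1280, 'bb.bb_crit_edge': '%load_noop4'}),
    ('%L_ACF.2.08', {'legup_memcpy_4.exit': 0, 'bb.bb_crit_edge': '%3'}),
]
# State after arriving once from legup_memcpy_4.exit; 1857 is a made-up value.
env = {'%3': 1280, '%load_noop4': 1857, '%L_ACF.2.08': 0}
wrong = eval_phis_in_order(env, phis, 'bb.bb_crit_edge')
right = eval_phis_with_temps(env, phis, 'bb.bb_crit_edge')
```

Here wrong gives %L_ACF.2.08 the new %3 (1857), while right gives it the old %3 (1280), which is what the IR requires.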

There is an extra assignment to L_ACF[3] in gcc version:

L_ACF[3]=0
L_ACF[3]=1857
L_ACF[3]=1280 ---- extra
...

So initializing s by a .mif and leaving:

//printf("s:%d\n", s[6]);
s[6]=1280;
//printf("s:%d\n", s[6]);

gives an error. printf's don't fix it. Both return 1280

gsm bug command:

gcc test.c ; ./a.out > log; make; make v; sed -n '/# run/,$p' transcript |grep -v run > hw

Found a good program to hash integers:

Perfect hash of 32 functions:

./perfect -hps < functions

Hashing methods:

Tried some hashing code (in ~/hash). Results don't look that great:

hashtable size: 15, full: 4 (26.666667%), collisions: 193 (97.969543%)
hashtable size: 255, full: 34 (13.333333%), collisions: 163 (82.741117%)
hashtable size: 4095, full: 178 (4.346764%), collisions: 19 (9.644670%)
hashtable size: 65535, full: 197 (0.300603%), collisions: 0 (0.000000%)

Basically I need a massive table to avoid all collisions. I only have 197 elements in total. My hash function was probably inefficient: multiplicative hash by golden ratio of 2^32, then masking that down to the table size.
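A small sketch of the scheme described above: multiply by floor(2^32 / golden ratio) and mask down to the table size. The keys are hypothetical, since these notes don't record what was hashed. If the keys were word-aligned, the low two bits of the product are always zero, so masking can only ever reach a quarter of the slots. Taking the high bits of the product is a common variant and is included for comparison.

```python
GOLDEN_32 = 2654435769  # floor(2**32 / golden ratio) == 0x9E3779B9


def hash_low_bits(key, bits):
    """Multiplicative hash, keeping the low bits (the masking approach)."""
    return ((key * GOLDEN_32) & 0xFFFFFFFF) & ((1 << bits) - 1)


def hash_high_bits(key, bits):
    """Multiplicative hash, keeping the high bits of the 32-bit product."""
    return ((key * GOLDEN_32) & 0xFFFFFFFF) >> (32 - bits)


def count_collisions(keys, hash_fn, bits):
    """Return (occupied slots, colliding keys) for the given hash."""
    seen = set()
    collisions = 0
    for key in keys:
        slot = hash_fn(key, bits)
        if slot in seen:
            collisions += 1
        else:
            seen.add(slot)
    return len(seen), collisions


# Hypothetical keys: 197 word-aligned values (multiples of 4).
keys = range(0, 197 * 4, 4)
low = count_collisions(keys, hash_low_bits, 4)
high = count_collisions(keys, hash_high_bits, 4)
```

With these made-up keys the masked version fills only 4 slots of a 16-entry table, while the high-bits version spreads over more of them.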

I've isolated the gsm bug into about 500 lines of .bc

To make transcript:

sed -n '/# run/,$p' transcript |grep -v run > hw
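For reference, the same filter as a Python sketch: keep everything from the first line containing '# run' onwards, then drop any line that mentions 'run'. The transcript lines in the demo are made up.

```python
def extract_hw_log(lines):
    """Mimic: sed -n '/# run/,$p' transcript | grep -v run"""
    started = False
    kept = []
    for line in lines:
        if not started and '# run' in line:
            started = True  # sed starts printing at the first match
        if started and 'run' not in line:
            kept.append(line)  # grep -v drops every line containing 'run'
    return kept


# Made-up transcript, just to exercise the filter.
transcript = [
    '# Loading work.main_tb',
    '# run 7000000000000000ns',
    '# L_ACF[3]=0',
    '# L_ACF[3]=1857',
]
hw = extract_hw_log(transcript)
```

One caveat carried over from the shell version: any output line that happens to contain 'run' is dropped too.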

Andrew Canis 2010/06/07 8:00

Mips uses uninitialized data. The very first instruction

  0x8fa40000,			// [0x00400000]  lw $4, 0($29)                   ; 175: lw $a0 0($sp)               # argc

Loads a word from an arbitrary uninitialized portion of the data mem.

400004: 
reg[29] = dmem[63] = -163754450
63 = (2147479548) + 0 

2147479548 = 0x7fffeffc. In the code they manually assign the stack pointer $sp to this value:

		reg[29] = 0x7fffeffc;

For now I'm just going to 0 initialize dmem.
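The arithmetic in the trace can be checked with a short sketch. It assumes dmem has 64 words and is indexed by word address modulo its size; that is an inference from the trace above, not something stated in these notes.

```python
STACK_POINTER = 0x7fffeffc


def dmem_index(address, words=64):
    """Assumed mapping: word-align the address, then wrap into a small dmem."""
    return (address >> 2) % words


index = dmem_index(STACK_POINTER)
```

Under that assumption the initial $sp lands on the last word of dmem (index 63), which nothing has written yet.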

okay going back to the drawing board in gsm. I found a huge problem. So I print out the entire s[] array at the end of autocorrelation. In modelsim s[0..95] = x!

Can I tell modelsim to show me all x's? Maybe create waveforms? There should never be any x's. Actually can i just add an assertion after each instruction to make sure it isn't x? Nothing should ever be x. Wait, well there are many signals that just haven't been assigned yet. It's only after assignment that the signal shouldn't contain any x's (even in upper bits)

Basically I have to add the following after every assignment:

if (^A === 1'bx) $display("ERROR: unknown value in signal A");
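Adding that by hand after every assignment would be tedious, so the checks could be generated. A sketch with a hypothetical helper; it emits the case-equality operator ===, because a plain == comparison against x itself evaluates to x and the if would never fire. The signal names are taken from the gsm notes further down and are only examples.

```python
def x_check(signal):
    """Return a Verilog statement that reports any x bit in signal."""
    return ("if (^{0} === 1'bx) "
            '$display("ERROR: unknown value in signal {0}");'.format(signal))


# Example signal names; in practice these would come from the code generator.
checks = [x_check(name) for name in ['var149', 'load_noop22']]
```

The reduction XOR collapses the whole vector to one bit, and that bit is x whenever any bit of the signal is x, including the upper bits.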

To change password (generate a random password):

pwgen
sudo passwd <username>

Andrew Canis 2010/06/03 8:00

Fixing gsm bug (result=4). Putting a printf inside Autocorrelation() fixes the result. also disabling function inlining fixes the problem. the verilog code changes a lot. difficult to pin down the problem area.

Created two files. working.v and bad.v working.v has:

    STEP (3);
    printf("here");
    STEP (4);

Code changes significantly:

acanis@acanis-desktop:~/work/legup/examples/chstone/gsm$ diff working.v bad.v |diffstat
 unknown | 3737 ++++++++++++++++++++++++++++++----------------------------------
 1 file changed, 1809 insertions(+), 1928 deletions(-)

what changes immediately after the printf? So the only change that jumps out is: working:

	185:
	begin
		var149 = memory_controller_out[15:0];
        ...
		load_noop22 = var149 + 16'd0;

bad:

	185:
	begin
		var149 = memory_controller_out[31:0];
        ...
		load_noop22 = var149 + 32'd0;

so the bitwidth changes…why would this happen? also why does the 32 bit one not work. you don't lose anything going from 16→32.

In the next state of working:

		/*   %165 = sext i16 %load_noop22 to i32             ; <i32> [#uses=1]*/
		var163 = $signed(load_noop22);

seems like more data is being read in working.

oh i think different things are being read from memory? in bad

    27:
		var79 = {`TAG_so, 32'b0} + ((32'd7 + 160 * (32'd0)) << 1);
    183:
		memory_controller_address = var79;

'so' is being read, should be 16 bits. yep. okay. there is a state 184.

27:
		var80 = {`TAG_L_ACF_i, 32'b0} + ((32'd7 + 9 * (32'd0)) << 2);
	184:
		memory_controller_address = var80;

ya so looks fine actually. interestingly, in the working file there are two more reads from memory right after state 185:

	186:
	begin
		memory_controller_address = var67;
		memory_controller_write_enable = 0;
	end
	187:
	begin
		memory_controller_address = var80;
		memory_controller_write_enable = 0;
	end

it seems like in the bad file there is never another read from var71, var69, var67. But there is a read from var80. And also from var81 (happens later in working).

also note that the $write() is in the same state (27) that all these variables (var66-81) are defined.

comparing the original bytecode. it looks like in bad we have a structure of code that repeats: store, mul, add

  %152 = getelementptr inbounds [160 x i16]* %so, i32 0, i32 7 ; <i16*> [#uses=1]
  %153 = load i16* %152, align 2                  ; <i16> [#uses=2]
  %154 = sext i16 %153 to i32                     ; <i32> [#uses=9]
...
  store i32 %158, i32* %73, align 4
  %159 = mul i32 %118, %154                       ; <i32> [#uses=1]
  %160 = add nsw i32 %159, %141                   ; <i32> [#uses=2]
  store i32 %160, i32* %84, align 4
  %161 = mul i32 %103, %154                       ; <i32> [#uses=1]
  %162 = add nsw i32 %161, %143                   ; <i32> [#uses=2]
  store i32 %162, i32* %97, align 4
  %163 = mul i32 %90, %154                        ; <i32> [#uses=1]
  %164 = add nsw i32 %163, %145                   ; <i32> [#uses=2]

working matches until right after the printf. Which screws up the aliasing. llvm doesn't know if printf will modify memory so we can't keep using %154 in every multiply. no wait it's not %154. in bad:

  %88 = getelementptr inbounds [160 x i16]* %so, i32 0, i32 3 ; <i16*> [#uses=1]
  %89 = load i16* %88, align 2                    ; <i16> [#uses=2]
  %90 = sext i16 %89 to i32                       ; <i32> [#uses=9]
...
  %163 = mul i32 %90, %154                        ; <i32> [#uses=1]

in working

  %163 = call i32 (i8*, ...)* @printf(i8* noalias getelementptr inbounds ([5 x i8]* @.str, i32 0, i32 0)) nounwind ; <i32> [#uses=0]
  %164 = load i16* %88, align 2                   ; <i16> [#uses=2]
  %165 = sext i16 %164 to i32                     ; <i32> [#uses=1]
  %166 = mul i32 %165, %154                       ; <i32> [#uses=1]
  %167 = add nsw i32 %166, %145                   ; <i32> [#uses=2]

why does %88 have to be reloaded after the printf? why doesn't %154…

  %152 = getelementptr inbounds [160 x i16]* %so, i32 0, i32 7 ; <i16*> [#uses=2]
  %153 = load i16* %152, align 2                  ; <i16> [#uses=1]
  %154 = sext i16 %153 to i32                     ; <i32> [#uses=9]

%88 is from so[3], %154 is so[7]. doesn't make sense to me. maybe the optimizer is doing peephole optimization where it doesn't look across a function call?

okay i'm making modifications to working.ll and then recompiling. the funny thing is i can remove a bunch of stores in the code and the result stays 0!

acanis@acanis-desktop:~/work/legup/examples/chstone/gsm$ llc -march=v working.ll -o gsm.v

If I remove every store the result is x. okay. so I doubt that section is the problem if i can remove _every_ store and still get a 0 result. I can remove tons of stores and still get a result=0.

It's very troubling that you can remove instructions and still get the correct result. But running the modified bytecode through 'lli' gets a result of 0. The fundamental issue is: how do you know the h/w verilog matches the llvm bytecode?

  • it's difficult to match them due to parallelism and instruction reordering.
  • llvm-db doesn't currently support printing values!

Andrew Canis 2010/06/02 8:00

Does GAUT support chstone? I doubt it. The latest version fails to even compile sra:

acanis@acanis-desktop:~/work/legup/examples/sra$ /home/acanis/GAUT_2_4_3/GautC/cdfgcompiler/bin/cdfgcompiler -S -c2dfg -O2 -I /home/acanis/GAUT_2_4_3/GautC/lib -I. sra.c
Warning : Variable inData(1) is used but not defined (constant ?) !!!
Warning : Variable inData(0) is used but not defined (constant ?) !!!
sra.c: In function ‘int main()’:
sra.c:10: internal compiler error: Segmentation fault
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.

The fundamental problem with bugpoint right now is it depends on the linker. So to debug the mips pass, the code is split into two pieces <safe> and <test>. <safe> is compiled by gcc into a shared object (using the cbackend pass to convert .bc to .c). <test> is compiled by llc -march=mips. Then the exe is created by linking safe and test together. bugpoint minimizes the code in test.

LegUp can't use this flow because there is no concept of a linker. The flow would have to be: remove a function from the bytecode, run it with 'lli' to get the expected output, then simulate the bytecode in modelsim and compare the return values. Note: the program would need to be able to run without the removed function… In fact there would be no guarantee that you could remove any of the bytecode and still be able to reproduce the bug, because the program functionally changes at that point. Whereas in the original bugpoint flow the program never functionally changes, just portions of it are 'safe' and you can assume they will be compiled correctly.

So we need something similar to the 'crash debugger', which keeps removing code to find the minimum bytecode that triggers a segfault in the optimizer. Could add an assertion in the scheduler to run modelsim? I'm not convinced this would work, as the code will probably not compile after removing random instructions. How can you remove random portions of the code and still have the error occur?
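
The "keep removing code while the failure still reproduces" idea can be sketched in a few lines. This is only a greedy one-at-a-time reduction over an abstract list of items, not bugpoint's actual algorithm; `reduce`, `still_fails` and `needs_3_and_6` are hypothetical names, and for LegUp the oracle would have to wrap the lli-vs-modelsim comparison.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Greedy reduction: try dropping each item in turn and keep the drop
// whenever the failure predicate still reports the bug.
// `still_fails` is a hypothetical oracle (e.g. "lli and modelsim disagree").
std::vector<int> reduce(std::vector<int> items,
                        const std::function<bool(const std::vector<int>&)>& still_fails) {
    std::size_t i = 0;
    while (i < items.size()) {
        std::vector<int> candidate = items;
        candidate.erase(candidate.begin() + static_cast<std::ptrdiff_t>(i));
        if (still_fails(candidate)) {
            items = candidate;   // item i was not needed to trigger the bug
        } else {
            ++i;                 // item i is required, keep it and move on
        }
    }
    return items;
}

// Toy oracle for illustration only: the "bug" needs both 3 and 6 present.
bool needs_3_and_6(const std::vector<int>& v) {
    bool has3 = false, has6 = false;
    for (int x : v) {
        if (x == 3) has3 = true;
        if (x == 6) has6 = true;
    }
    return has3 && has6;
}
```

The concern above still applies: the oracle is only meaningful if the reduced program still compiles and the mismatch it reports is the same bug.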

We don't have the benefit of a golden reference (gcc) to test against.

Actually we could solve this. If we implement a SystemC backend then we are compatible with gcc and could link. So bugpoint would work fine and we could debug the scheduler… There are other advantages to having a SystemC backend:

  • faster simulation - useful for jpeg
  • better printf, file i/o support for testbenches
  • allows testing binding/alloc without scheduling
  • better readability for s/w designers

Shouldn't be that hard to implement. Would also be good to do this so we know that we can support vhdl in the future.

The only other options:

  • some kind of interface with verilog and C - very difficult
  • print out variable values at the end of every state in modelsim and compare to lli version
    • this would be very useful for pinpointing exactly where modelsim and C differ
    • basically how I've been debugging so far - adding random printf statements into the code
    • how do you deal with parallelism?
    • big problem: printf's affect scheduling - just like with gsm
    • I wonder if this will happen with bugpoint using SystemC?
      • probably it will…but will still narrow down problem
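
The "print values and compare to the lli version" option boils down to finding the first point where two traces disagree. A minimal sketch, assuming both sides have already been dumped as one "name=value" string per step in a common order (which the parallelism question above suggests is the hard part); `first_divergence` is a hypothetical helper, not an existing LegUp tool.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the index of the first entry where the two traces differ,
// or -1 if they are identical. A length mismatch counts as a divergence
// at the end of the shorter trace.
long first_divergence(const std::vector<std::string>& expected,
                      const std::vector<std::string>& actual) {
    std::size_t n = expected.size() < actual.size() ? expected.size() : actual.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (expected[i] != actual[i]) return static_cast<long>(i);
    }
    if (expected.size() != actual.size()) return static_cast<long>(n);
    return -1;
}
```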

No luck using bugpoint to debug gsm. There is an option to specify a command to execute the bitcode. However, there are external globals without initialization.

acanis@acanis-desktop:~/work/legup/examples/chstone/gsm$ rm bug.log; bugpoint gsm.bc -run-custom -exec-command ../../bugpoint.pl

Bugpoint gives a .bc filename as an argument to bugpoint.pl. However, even just running this through lli has problems:

running: lli bugpoint.test.bc-Jkwtbo 2>&1
Output: LLVM ERROR: Could not resolve external global address: inData
...
running: lli bugpoint.test.bc-QobI8c 2>&1
Output: 'main' function not found in module.

The patch to add this option was fairly recent from Pekka Jääskeläinen:

http://llvm.org/viewvc/llvm-project/llvm/trunk/tools/bugpoint/ExecutionDriver.cpp?r1=45421&r2=50373

Looked at C2H Verilog from James. To disable Quartus verilog warnings:

// turn off superfluous verilog processor warnings 
// altera message_level Level1 
// altera message_off 10034 10035 10036 10037 10230 10240 10030 

It's also cool how they specify simulation only (synthesis translate_off) and synthesis only (synthesis read_comments_as_HDL) code.

Andrew Canis 2010/06/01 8:00

Wow. Really hard to figure out this memory bug. Possible causes:

  • Scheduler object is deallocated because it is a FunctionPass

Tried modifying Scheduler to be a ModulePass; I still get an error:

0x085e1cda in std::less<llvm::Function*>::operator() (this=0x9d29b38, __x=@0x10, __y=@0xbf8789d8) at /usr/include/c++/4.3/bits/stl_function.h:230
230	      { return __x < __y; }
(gdb) p __x
$1 = (class llvm::Function * const&) @0x10: <error reading variable>

Coming from

#4  0x085f4e16 in legup::Scheduler::getFSM (this=0x9d29ad8, F=0x9d2af60) at Scheduler.h:46
46	            FSM[F] = new FiniteStateMachine();

Here's a full backtrace:

(gdb) bt
#0  0x085e1cda in std::less<llvm::Function*>::operator() (this=0x9d29b38, __x=@0x10, __y=@0xbf8789d8) at /usr/include/c++/4.3/bits/stl_function.h:230
#1  0x085f45e7 in std::_Rb_tree<llvm::Function*, std::pair<llvm::Function* const, legup::FiniteStateMachine*>, std::_Select1st<std::pair<llvm::Function* const, legup::FiniteStateMachine*> >, std::less<llvm::Function*>, std::allocator<std::pair<llvm::Function* const, legup::FiniteStateMachine*> > >::_M_insert_unique_ (this=0x9d29b38, __position={_M_node = 0x9d29b3c}, __v=@0xbf8789d8) at /usr/include/c++/4.3/bits/stl_tree.h:1183
#2  0x085f49a5 in std::map<llvm::Function*, legup::FiniteStateMachine*, std::less<llvm::Function*>, std::allocator<std::pair<llvm::Function* const, legup::FiniteStateMachine*> > >::insert (this=0x9d29b38, __position={_M_node = 0x9d29b3c}, __x=@0xbf8789d8) at /usr/include/c++/4.3/bits/stl_map.h:496
#3  0x085f4a9c in std::map<llvm::Function*, legup::FiniteStateMachine*, std::less<llvm::Function*>, std::allocator<std::pair<llvm::Function* const, legup::FiniteStateMachine*> > >::operator[] (this=0x9d29b38, __k=@0xbf878a44) at /usr/include/c++/4.3/bits/stl_map.h:419
#4  0x085f4e16 in legup::Scheduler::getFSM (this=0x9d29ad8, F=0x9d2af60) at Scheduler.h:46
#5  0x085e0b98 in legup::LLVMLegUpPass::runOnModule (this=0x9d2a0a0, M=@0x9d1c128) at Verilog.cpp:259
#6  0x08d7358e in llvm::MPPassManager::runOnModule (this=0x9d2ae48, M=@0x9d1c128) at PassManager.cpp:1424
#7  0x08d754ca in llvm::PassManagerImpl::run (this=0x9d28c60, M=@0x9d1c128) at PassManager.cpp:1506
#8  0x08d7552f in llvm::PassManager::run (this=0xbf878bc8, M=@0x9d1c128) at PassManager.cpp:1535
#9  0x0856fc74 in main (argc=3, argv=0xbf878d04) at llc.cpp:342

So llc calls PM→run(). Finds the implementation of the pass. Runs module pass for LLVMLegUpPass runOnModule(). Tries to lookup the FSM. Segfaults in std::map because of an invalid pointer in the tree.

This segfault occurs as we schedule 'main'. The first function. In fact the only function in 'sra'.

$8 = (class llvm::Function * const&) @0x10: <error reading variable>

Is the pointer overloaded? Maybe it's just getting overwritten. Because it's a weird value 0x10. It's not NULL.

Why is the main function being added to the fsm map twice?

INSERTING main
Scheduling Function: main
INSERTING main

Debug which passes are run:

llc -march=v --debug-pass=Structure sra.bc 
Pass Arguments:  -preverify -domtree -verify -asap
Target Data Layout
Basic Alias Analysis (default AA impl)
  ModulePass Manager
    FunctionPass Manager
      Preliminary module verification
      Dominator Tree Construction
      Module Verifier
    ASAP Scheduling
      Unnamed pass: implement Pass::getPassName()
    LLVMLegUpPass backend
      Unnamed pass: implement Pass::getPassName()
Pass Arguments:  -memdep
Basic Alias Analysis (default AA impl)
  FunctionPass Manager
    Memory Dependence Analysis
Pass Arguments:  -memdep
Basic Alias Analysis (default AA impl)
  FunctionPass Manager
    Memory Dependence Analysis
INSERTING main
Scheduling Function: main
INSERTING main

So already something really weird is going on.

Reverted back to FunctionPass version. I get a slightly different segfault.

Pass Arguments:  -preverify -domtree -verify -memdep -asap
Target Data Layout
Basic Alias Analysis (default AA impl)
  FunctionPass Manager
    Preliminary module verification
    Dominator Tree Construction
    Module Verifier
    Memory Dependence Analysis
    As Soon As Possible Scheduling
    LLVMLegUpPass backend
[New Thread 0xb74356d0 (LWP 9392)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb74356d0 (LWP 9392)]
0x085e88a7 in __gnu_cxx::new_allocator<std::pair<llvm::Function* const, legup::FiniteStateMachine*> >::construct (this=0xbfe6c59f, __p=0xadf8828, 
    __val=@0x12) at /usr/include/c++/4.3/ext/new_allocator.h:108
108	      { ::new((void *)__p) _Tp(__val); }

Error is in the copy assignment operator (operator=) of Scheduler:

#7  0x085d2d3e in HwModule (this=0xadf88a0, LegUpPass=0xadf00a0, F=0xadf0f60) at Verilog.cpp:332
332	    sched = LegUpPass->getAnalysis<Scheduler>();
(gdb) do
#6  0x085efa14 in legup::Scheduler::operator= (this=0xadf892c) at Scheduler.h:26
26	class Scheduler {

Again there's a problem in that map<Function*, FiniteStateMachine*>. Some sort of invalid dangling pointer stored in there. I just don't understand why that would happen. In what order are passes actually being run/destructed?

Starting program: /home/acanis/work/legup/llvm/Debug/bin/llc -march=v sra.bc --debug-pass=Details
[Thread debugging using libthread_db enabled]
 -- 'LLVMLegUpPass backend' is not preserving 'As Soon As Possible Scheduling'
 -- 'LLVMLegUpPass backend' is not preserving 'As Soon As Possible Scheduling'
 -- 'LLVMLegUpPass backend' is not preserving 'Memory Dependence Analysis'
 -- 'LLVMLegUpPass backend' is not preserving 'Dominator Tree Construction'
 -- 'LLVMLegUpPass backend' is not preserving 'Preliminary module verification'
 -- 'LLVMLegUpPass backend' is not preserving 'Module Verifier'
Pass Arguments:  -preverify -domtree -verify -memdep -asap
Target Data Layout
Basic Alias Analysis (default AA impl)
  FunctionPass Manager
    Preliminary module verification
    Dominator Tree Construction
    Module Verifier
    Memory Dependence Analysis
    As Soon As Possible Scheduling
    LLVMLegUpPass backend
0x9d3ee58   Executing Pass 'Preliminary module verification' on Function 'main'...
0x9d3ee58   Executing Pass 'Dominator Tree Construction' on Function 'main'...
0x9d3ee58   Executing Pass 'Module Verifier' on Function 'main'...
0x9d3d0f8     Required Analyses: Preliminary module verification, Dominator Tree Construction
 -*- 'Module Verifier' is the last user of following pass instances. Free these instances
0x9d3ee58    Freeing Pass 'Dominator Tree Construction' on Function 'main'...
0x9d3ee58    Freeing Pass 'Module Verifier' on Function 'main'...
0x9d3ee58    Freeing Pass 'Preliminary module verification' on Function 'main'...
0x9d3ee58   Executing Pass 'Memory Dependence Analysis' on Function 'main'...
0x9d3d8f8     Required Analyses: No Alias Analysis (always returns 'may' alias)
0x9d3ee58   Executing Pass 'As Soon As Possible Scheduling' on Function 'main'...
0x9d3da58     Required Analyses: Memory Dependence Analysis
0x9d3ee58   Executing Pass 'LLVMLegUpPass backend' on Function 'main'...
0x9d3e0a0     Required Analyses: Memory Dependence Analysis, No Alias Analysis (always returns 'may' alias), Scheduler

why is scheduler there twice at the beginning? so we are running all the function passes on 'main' only. note: the ASAP Schedule pass is never deallocated.

Okay here is a definite problem. I removed the map<Function*, FiniteStateMachine*> and now I get:

0xb16fe58   Executing Pass 'LLVMLegUpPass backend' on Function 'main'...
0xb16f0a0     Required Analyses: Memory Dependence Analysis, No Alias Analysis (always returns 'may' alias), Scheduler
 -- 'LLVMLegUpPass backend' is not preserving 'As Soon As Possible Scheduling'
 -- 'LLVMLegUpPass backend' is not preserving 'As Soon As Possible Scheduling'
 -- 'LLVMLegUpPass backend' is not preserving 'Memory Dependence Analysis'
 -*- 'LLVMLegUpPass backend' is the last user of following pass instances. Free these instances
0xb16fe58    Freeing Pass 'As Soon As Possible Scheduling' on Function 'main'...
0xb16fe58    Freeing Pass 'LLVMLegUpPass backend' on Function 'main'...
0xb16fe58    Freeing Pass 'Memory Dependence Analysis' on Function 'main'...
[New Thread 0xb75a46d0 (LWP 6133)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb75a46d0 (LWP 6133)]
0xb77d52f5 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string () from /usr/lib/libstdc++.so.6
(gdb) up
#1  0x085ebc78 in legup::State::getName (this=0x0) at State.h:56
56	    string getName() { return name; }
(gdb) 
#2  0x085d0051 in printFunctionHandshaking (fsm=0xb16f740, Out=@0xbfcd14c4) at Verilog.cpp:2264
2264	    Out << indent << "\t\tcur_state = " << firstState->getName() << ";\n"; 
(gdb) 
#3  0x085ddebe in legup::HwModule::printDatapath (this=0xb177850, Out=@0xbfcd14c4) at Verilog.cpp:2359
2359	    printFunctionHandshaking(fsm, Out);
(gdb) 
#4  0x085de76f in legup::HwModule::printVerilog (this=0xb177850, Out=@0xbfcd150c) at Verilog.cpp:1741
1741	    printDatapath(Datapath);
(gdb) 
#5  0x085dea8a in legup::LLVMLegUpPass::doFinalization (this=0xb16f0a0, M=@0xb161128) at Verilog.cpp:304
304	        HW->printVerilog(SS);

So I use doFinalization() to print out the Verilog. This is called AFTER the scheduler is deallocated! The InstList in State will definitely be destroyed by the destructor.
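
A small sketch of this lifetime problem, outside of LLVM. `SchedulerLike` and `Fsm` are hypothetical stand-ins, and the LLVM pass manager does not hand out `shared_ptr`s; the point is only that a non-owning view dies with its owner, while data the backend co-owns (or copies) before the owner is freed stays valid.

```cpp
#include <memory>
#include <string>

// Stand-ins for the real classes; only the ownership pattern matters here.
struct Fsm {
    std::string first_state = "state_0";
};

struct SchedulerLike {
    std::shared_ptr<Fsm> fsm = std::make_shared<Fsm>();
};

// Returns true if an FSM observed only through a weak reference is gone
// once the scheduler that owns it has been freed.
bool weak_view_expires() {
    auto sched = std::make_unique<SchedulerLike>();
    std::weak_ptr<Fsm> view = sched->fsm;   // non-owning, like a raw pointer
    sched.reset();                          // owner is freed here
    return view.expired();
}

// Returns the first state name read after the scheduler is freed, which is
// safe here because shared ownership was taken beforehand.
std::string shared_copy_survives() {
    auto sched = std::make_unique<SchedulerLike>();
    std::shared_ptr<Fsm> kept = sched->fsm; // co-owns the FSM
    sched.reset();
    return kept->first_state;
}
```

The other obvious option is to do the printing while the analysis is still alive instead of deferring it to the finalization step.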

“A class's destructor (whether or not you explicitly define one) automagically invokes the destructors for member objects. They are destroyed in the reverse order they appear within the declaration for the class. ”

Also: “A derived class's destructor (whether or not you explicitly define one) automagically invokes the destructors for base class subobjects. Base classes are destructed after member objects. In the event of multiple inheritance, direct base classes are destructed in the reverse order of their appearance in the inheritance list.”

See: http://www.parashift.com/c++-faq-lite/dtors.html
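
The destruction order described in the two quotes can be checked with a toy class hierarchy. This is a sketch with made-up names (`Tracker`, `Base`, `Derived`, `destruction_order`); each destructor just records that it ran.

```cpp
#include <string>
#include <utility>
#include <vector>

// Shared log so each destructor can record when it runs.
static std::vector<std::string> g_log;

struct Tracker {
    std::string name;
    explicit Tracker(std::string n) : name(std::move(n)) {}
    ~Tracker() { g_log.push_back(name); }
};

struct Base {
    ~Base() { g_log.push_back("Base"); }
};

struct Derived : Base {
    Tracker first{"first"};    // declared first, destroyed last among members
    Tracker second{"second"};  // declared second, destroyed first
    ~Derived() { g_log.push_back("Derived body"); }
};

// Destroys one Derived object and returns the order the destructors ran in.
std::vector<std::string> destruction_order() {
    g_log.clear();
    { Derived d; }
    return g_log;
}
```

The recorded order is: the derived destructor body, then the members in reverse declaration order, then the base class.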

Andrew Canis 2010/05/28 8:00

Required fields for a bug:

    • product (string) Required - The name of the product the bug is being filed against.
    • component (string) Required - The name of a component in the product above.
    • summary (string) Required - A brief description of the bug being filed.
    • version (string) Required - A version of the product above; the version the bug was found in.

Andrew Canis 2010/05/27 8:00

Options for config file:

  • command line options - gets clumsy for many options
  • xml - not great if user must modify
  • tcl - used in many EDA tools (quartus, xilinx). Might be overkill
  • apache config files - is it powerful enough?
  • custom format - like .td files. Not worth it
  • boost program options: can read commandline options from a file. Only have to write code once
    • similar to apache config files
        optimization = 1
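
For comparison, a hand-rolled reader for the `key = value` style above is short. This is a sketch only: `parse_config` and `trim` are hypothetical helpers, and treating `#` as a comment marker is an assumption, not something any of the listed formats requires.

```cpp
#include <map>
#include <sstream>
#include <string>

// Strips leading and trailing spaces, tabs and carriage returns.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r";
    std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    std::size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Parses "key = value" lines; blank lines and lines starting with '#'
// (assumed comment syntax) are skipped.
std::map<std::string, std::string> parse_config(const std::string& text) {
    std::map<std::string, std::string> out;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        line = trim(line);
        if (line.empty() || line[0] == '#') continue;
        std::size_t eq = line.find('=');
        if (eq == std::string::npos) continue;  // not a key/value line
        out[trim(line.substr(0, eq))] = trim(line.substr(eq + 1));
    }
    return out;
}
```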
  

I think going with tcl is the best option for now. Tcl is the standard in EDA and it is very powerful. Doesn't seem too bad to parse: Tcl_EvalFile() from libtcl8.4. One big disadvantage: it requires the tcl C library. Interesting thread on tcl topic:

LLVM actually has a few different lex/parsers:

./lib/AsmParser/LLLexer.cpp
./lib/MC/MCAsmLexer.cpp
./utils/TableGen/TGLexer.cpp
./tools/llvm-mc/AsmLexer.cpp

Including one for the custom language for TableGen .td files.

Does LLVM support pragmas? pragmas appear to be preprocessor directives. So they will not show up in the LLVM IR… One test-suite example has OpenMP. LLVM supports OpenMP (llvm-gcc -fopenmp). LLVM IR has calls to (which must be linked in later):

declare void @GOMP_parallel_end() nounwind
declare i32 @omp_get_thread_num()
declare void @GOMP_barrier() nounwind
declare i32 @omp_get_num_threads()

So we might have to use clang to support pragmas? btw clang does not support openMP. A bit of discussion on how to add a new pragma to clang. Basically you use the pragma to set metadata in the LLVM IR (new feature in 2.6)

To reassign a bug, send email to bugzilla-daemon with [Bug 6] in the subject and in the body:

@assigned_to = stetorvs@gmail.com

Andrew Canis 2010/05/26 8:00

Created legup-bugs mailing list. Added to cc: for all new bugs

Completely reinstalled bugzilla 3.6 in /var/www/legup.org/bugs. After every install run (to set permissions):

./checksetup.pl

Added bugzilla email support. Add the following to /etc/aliases:

bugzilla-daemon:         "|/var/www/legup.org/bugs/email_in.pl"

Remember to run:

sudo newaliases

Then change bugzilla permissions (do AFTER running checksetup.pl)

sudo chown -R nobody:www-data bugs/

postfix runs /etc/aliases tasks as nobody:nogroup

So you can now reply to bugzilla-daemon emails with comments. Added a slight fix to email_in.pl to filter out gmail reply comments:

our $gmail = qr/^On .* wrote:$/;

To fix a bug, send an email with:

@status = resolved
@resolution = fixed

To declare bug 4 a duplicate of bug 5:

to: bugzilla-daemon@legup.org
subject: [bug 4]
@dup_id = 5

Ran doxygen for all of LLVM: http://legup.org/doxygen/

For legup namespace: http://legup.org/doxygen/namespacelegup.html

To regenerate (takes a long time):

cd llvm/docs
make doxygen

Also from the Target/Verilog folder you can run:

doxygen

But this will only create html for legup files. There will be no links to LLVM classes.

jpeg works:

 # At t=         17621933000 clk=1 finish=1 return_val=         0
# ** Note: $finish    : main.v(17100)
#    Time: 17621933 ns  Iteration: 2  Instance: /main_tb

real    69m12.512s
user    69m8.219s
sys     0m0.396s

Andrew Canis 2010/05/20 8:00

Load isn't being given an extra state?

	57:
	begin
		/*   %7 = getelementptr inbounds [44 x i32]* @imem, i32 0, i32 %6 ; <i32*> [#uses=1]*/
		var8 = {`TAG_imem, 32'b0} + ((var7 + 44*(32'd0)) << 2);
		cur_state = 58;
	end
	59:
	begin
		/*   %8 = load i32* %7, align 4                      ; <i32> [#uses=18]*/
		var9 = memory_controller_out[31:0];
		/*   %10 = lshr i32 %8, 26                           ; <i32> [#uses=2]*/
		var10 = var9 >>> (32'd26 % 32);
		cur_state = 60;
	end

Strange, 59 isn't put right after 58… Okay fixed small bug in Verilog.cpp

Why did I remove load_noop? makes it really hard to diff the changes… okay put them back and mips works.

The memory_controller_out isn't always at the start of the state. Fixed.

Why is there a ret void in the middle of a random basic block?

		/*   %64 = load i32* %dlti, align 4                  ; <i32> [#uses=1]*/
		/*   ret void*/
		finish = 1;
		cur_state = 60;

finish = 1 is in the wrong place…

Fixed. All return instructions moved to last state. adpcm works.

Everything works except for gsm. gsm has a segfault. Very strange, there is a GlobalValue addr=0x11, not sure what's causing this… so the instruction I must have been deleted. that's the only explanation. okay so one problem with the analysis pass is it modifies the code to add the load_noop. Can I move this to the non-analysis pass? moved insertLoadNoop() into main legup pass. didn't fix anything. moved insertLoadNoop() back into scheduler so i remember to remove it

so some instr is getting removed? happens right on this instruction:

STARTING: 92

call void @llvm.memset.i64(i8* %scevgep90.796, i8 0, i64 32, i32 4)
store i32 0, i32* %L_ACF, align 4

caused by the memset. gsm was the only benchmark that had this intrinsic.

Maybe I should git pull? No. I'd rather not have a segfault in this case. Why does this happen?

Interesting. So originally in gsm.ll the instr looks like:

  call void @llvm.memset.i64(i8* %scevgep90.796, i8 0, i64 32, i32 4)

Then when I print the basic blocks inside I see this instruction:

bb.nph47:                                         ; preds = %gsm_mult_r.exit, %gsm_mult_r.exit.us, %bb7
  %62 = load i16* %s, align 2                     ; <i16> [#uses=1]
  %scevgep90.7 = getelementptr i32* %L_ACF, i32 1 ; <i32*> [#uses=11]
  %scevgep90.796 = bitcast i32* %scevgep90.7 to i8* ; <i8*> [#uses=1]
  %63 = call i8* @memset(i8* %scevgep90.796, i32 0, i32 32) ; <i8*> [#uses=0]

Why is that memset() not in the original? The code has been subtly modified…why is uses=0? Also I noticed this in the diffs… the registers were slightly misnamed… when do i ever modify the instruction?? It's lowerIntrinsics that does the slight modification:

acanis@acanis-desktop:~/work/legup/examples/chstone/gsm$ make &> log; grep memset log
  call void @llvm.memset.i64(i8* %scevgep90.796, i8 0, i64 32, i32 4)
  call void @llvm.memset.i64(i8* %r_addr.075.1111, i8 0, i64 14, i32 2)
define i8* @memset(i8* %m, i32 %c, i32 %n) nounwind {
acanis@acanis-desktop:~/work/legup/examples/chstone/gsm$ make &> log; grep memset log
  %63 = call i8* @memset(i8* %scevgep90.796, i32 0, i32 32) ; <i8*> [#uses=0]
  %134 = call i8* @memset(i8* %r_addr.075.1111, i32 0, i32 14) ; <i8*> [#uses=0]
define i8* @memset(i8* %m, i32 %c, i32 %n) nounwind {

So if we added the original instruction before lowerIntrinsics, we'll fail because it has been modified. So we must call scheduler AFTER lower intrinsics.

So moved lowerIntrinsics to doInitialization() and made sure to return true (modified)
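
Why the ordering matters, as a toy model: if the schedule is keyed by instruction identity and lowering replaces an instruction with a new one, a schedule built before lowering no longer covers the code. Everything here (`Instr`, `lower_intrinsics`, `schedule`, `covers`) is a hypothetical stand-in; integer ids play the role of the Instruction pointers.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy instruction: `id` plays the role of the instruction's identity.
struct Instr {
    int id;
    std::string text;
};

static int g_next_id = 100;

// Replaces every "llvm.memset" intrinsic with a plain call, giving the
// replacement a fresh identity (as creating a new instruction would).
void lower_intrinsics(std::vector<Instr>& body) {
    for (Instr& i : body) {
        if (i.text.find("llvm.memset") != std::string::npos) {
            i = Instr{g_next_id++, "call memset"};
        }
    }
}

// Assigns each instruction a state number, keyed by identity.
std::map<int, int> schedule(const std::vector<Instr>& body) {
    std::map<int, int> state_of;
    int state = 0;
    for (const Instr& i : body) state_of[i.id] = state++;
    return state_of;
}

// Returns true if every instruction in `body` has an entry in `state_of`.
bool covers(const std::map<int, int>& state_of, const std::vector<Instr>& body) {
    for (const Instr& i : body) {
        if (state_of.count(i.id) == 0) return false;
    }
    return true;
}
```

A schedule computed before `lower_intrinsics` misses the replacement call; one computed afterwards covers every instruction.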

Running without function inlining. Everything works but aes. aes mif initialization has a problem. Fixed. Caused by a one-bit global variable.

Okay everything works. Running jpeg with function inlining.

To add new users to git:

cd ~/gitosis-admin/keydir
# open id_rsa.pub and get user/host name from the end of the line
cp /new/user/id_rsa.pub user@host.pub
# add user@host to gitosis.conf
git add user@host.pub
git commit -a -v
git push

Andrew Canis 2010/05/19 12:00

Export to EMF from xfig for the image to show up properly in MS office.

Setup a new mailing list for legup-commits. Grabbed my old post-receive script from cashstream. Needed to run the following:

legup.git/hooks$ sudo git config hooks.mailinglist "legup-commits@legup.org"
legup.git/hooks$ sudo git config hooks.envelopesender "legup-commits@legup.org"

Envelopesender makes the email look like it was sent from legup-commits@legup.org

Andrew Canis 2010/05/14 12:00

Both Victor and Ahmed ran into some issues with adpcm returning 1. Victor fixed it by adding a random printf. So this is probably a scheduling issue.

Had to configure my username to make my xml-rpc work

I backed up the old verilog files in examples/backupVerilog

Andrew Canis 2010/05/10 03:00

Upgraded wiki to latest version and moved to legup.org/wiki

Andrew Canis 2010/05/05 11:00

So I was noticing a massive mismatch between 'df -h' and 'disk usage analyzer'. Basically my whole hard drive was completely full even though I should have had 200GB left. Turns out this was caused by rescue time:

acanis@acanis-desktop:/home$ lsof -s |grep deleted
...
/usr/bin/  5517     acanis   34w      REG        8,3 259150815232  16190085 /home/acanis/.rescuetime/tmp/notifier.debuglog (deleted)
...

That's 259 150 815 232 bytes = 241.353004 gigabytes!
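
The conversion above divides by 2^30, so strictly the figure is in binary gigabytes (GiB). A one-function sketch of the arithmetic (`bytes_to_gib` is a made-up name):

```cpp
#include <cstdint>

// Converts a byte count to binary gigabytes (GiB, 2^30 bytes each).
double bytes_to_gib(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0);
}
```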

To get a thesis:

library.utoronto.ca
e-resources
search: dissertation
Dissertations & Theses: Full Text

Andrew Canis 2010/05/21 12:59

Okay works now:

acanis@acanis-desktop:~/work/legup/examples/sra$ llc -march=v --debug-pass=Structure sra.bc 
Pass Arguments:  -preverify -domtree -verify -memdep -asap
Target Data Layout
Basic Alias Analysis (default AA impl)
  ModulePass Manager
    FunctionPass Manager
      Preliminary module verification
      Dominator Tree Construction
      Module Verifier
      Memory Dependence Analysis
      As Soon As Possible Scheduling
      LLVMLegupPass backend

Oh shit. I was compiling with 'make' instead of 'makellvm llc'!

Trying to add the scheduler analysis pass. Don't see it for some reason:

acanis@acanis-desktop:~/work/legup/examples/sra$ llc -march=v --debug-pass=Structure sra.bc 
Pass Arguments:  -preverify -domtree -verify -memdep
Target Data Layout
Basic Alias Analysis (default AA impl)
  ModulePass Manager
    FunctionPass Manager
      Preliminary module verification
      Dominator Tree Construction
      Module Verifier
      Memory Dependence Analysis
      LLVMLegupPass backend

Andrew Canis 2010/05/09 19:26

Passes can definitely be run in order based on how they are added to the pass manager: PM.add(pass)

  • opt provides the ability to run any of LLVM's optimization or analysis passes in any order. The -help option lists all the passes available. The order in which the options occur on the command line are the order in which they are executed (within pass constraints).

Check out Support/StandardPasses.h for -O3 optimizations added to the pass manager

To run -O3 on a bitcode:

opt -O3 -time-passes sra.bc  > /dev/null

There is an LLVM wiki:

Andrew Canis 2010/04/08 09:46

Pretty cool command: lastlog

Andrew Canis 2010/04/01 07:00

The time required to run quartus on all the chstone benchmarks: 5.5h

real    327m22.206s
user    328m59.682s
sys     2m31.025s

Probably should look into how to speed this up. Parallelism? right now I run everything serially. It was also slowed because I ran quartus_map before the full compile to initialize each project.

Andrew Canis 2010/03/30 22:40

Before the shift changes to motion.c, -O3 fails (after LTO) but with no optimization the result is correct.

acanis@acanis-desktop:~/work/legup/examples/chstone/motion$ llvm-gcc mpeg2.c --emit-llvm -c -O3 -o mpeg2.prelto.bc
acanis@acanis-desktop:~/work/legup/examples/chstone/motion$ llvm-ld  mpeg2.prelto.bc -b=mpeg2.bc
acanis@acanis-desktop:~/work/legup/examples/chstone/motion$ lli mpeg2.bc 
2
acanis@acanis-desktop:~/work/legup/examples/chstone/motion$ llvm-gcc mpeg2.c --emit-llvm -c -o mpeg2.prelto.bc
acanis@acanis-desktop:~/work/legup/examples/chstone/motion$ llvm-ld  mpeg2.prelto.bc -b=mpeg2.bc
acanis@acanis-desktop:~/work/legup/examples/chstone/motion$ lli mpeg2.bc 
0

Result is 2 until both changes have been made:

diff --git a/examples/chstone/motion/getbits.c b/examples/chstone/motion/getbits.c
index bf3219d..aeacf1d 100755
--- a/examples/chstone/motion/getbits.c
+++ b/examples/chstone/motion/getbits.c
@@ -110,7 +110,7 @@ unsigned int
 Show_Bits (N)
      int N;
 {
-  return ld_Bfr >> (32 - N);
+  return ld_Bfr >> (unsigned)(32-N)%32;
 }
 
 
diff --git a/examples/chstone/motion/motion.c b/examples/chstone/motion/motion.c
index b2f7278..8d490a0 100755
--- a/examples/chstone/motion/motion.c
+++ b/examples/chstone/motion/motion.c
@@ -152,6 +152,7 @@ decode_motion_vector (pred, r_size, motion_code, motion_residual,
 {
   int lim, vec;
 
+  r_size = r_size % 32;
   lim = 16 << r_size;
   vec = full_pel_vector ? (*pred >> 1) : (*pred);
 

Andrew Canis 2010/03/29 22:40

So gcc thinks that:

4042285200 >> 4294967128 = 240

not true BUT

4042285200 >> (4294967128 % 32) = 240
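
In C and C++, shifting a 32-bit value by 32 or more is undefined behaviour, and x86 shift instructions only look at the low five bits of the count, which would explain the gcc result. A minimal sketch that makes the masking explicit so the result is defined for any count (`shr_mod32` is a hypothetical helper, not code from CHStone):

```cpp
#include <cstdint>

// Right shift with the count reduced modulo 32, so the result is defined
// for any count. Mirrors the `% 32` arithmetic in the line above.
std::uint32_t shr_mod32(std::uint32_t value, std::uint32_t count) {
    return value >> (count % 32u);
}
```

Here 4294967128 % 32 is 24, so the call with the numbers above shifts by 24 bits.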

One problem with motion is inside:

  decode_motion_vector (&PMV[0], h_r_size, motion_code, motion_residual, full_pel_vector);

To match GCC we need to add:

  r_size = r_size%32;

This is a legitimate discrepancy between LLVM and GCC:

  acanis@acanis-desktop:~/work/legup/examples/chstone/motion$ lli mpeg2.bc 
PMV[0][0][0] = 45
PMV[0][0][0] = 286
2
acanis@acanis-desktop:~/work/legup/examples/chstone/motion$ gcc -fmudflap -lmudflap mpeg2.c; ./a.out 
PMV[0][0][0] = 45
PMV[0][0][0] = 1566
0

Again in Show_Bits() there is another shift problem:

  volatile int tmp = (32-N);
  return ld_Bfr >> tmp;

Which LLVM also gets wrong:

acanis@acanis-desktop:~/work/legup/examples/chstone/motion$ lli mpeg2.bc 
PMV[0][0][0] = 45
PMV[0][0][0] = 1326
2
acanis@acanis-desktop:~/work/legup/examples/chstone/motion$ gcc -fmudflap -lmudflap mpeg2.c; ./a.out 
PMV[0][0][0] = 45
PMV[0][0][0] = 1566
0

After fixing the 2 above problems lli performs correctly. However LegUp doesn't work without an extra printf statement in Get_Bits(). There is one final scheduling error to deal with.

jpeg also works without any warnings/errors. I'm leaving it out of the test suite because it takes so long to run (45min).

# At t=         17621931000 clk=1 finish=1 return_val=         0
# ** Note: $finish    : main.v(19128)
#    Time: 17621931 ns  Iteration: 2  Instance: /main_tb

real    44m57.348s
user    44m50.964s
sys     0m0.292s

Unfortunately I still get a warning with gsm even after the fix…

# addr:000000000
# memset cur_state:  0
# main cur_state:  127
# addr:800000140
# memset cur_state:  0
# main cur_state:  128
# main_tb.main_inst.memory_controller_inst.so.altsyncram_component Warning : Address pointed at port A is out of bound!
# main_tb.main_inst.memory_controller_inst.so.altsyncram_component Warning : Address pointed at port A is out of bound!

Found the problem: another off-by-one error that mudflap didn't find

  /*   Rescaling of the array s[0..159]
   */
  if (scalauto > 0)
    for (k = 160; k >= 0; k--)
      *s++ <<= scalauto;

160 should be changed to 159.
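
A quick way to see the off-by-one: a loop of this shape runs its body start + 1 times, so starting at 160 advances `s` 161 times over a 160-element array. Sketch with a made-up helper name:

```cpp
// Counts how many times a loop of the form
//   for (k = start; k >= 0; k--)
// executes its body.
int iterations_counting_down_from(int start) {
    int count = 0;
    for (int k = start; k >= 0; k--) ++count;
    return count;
}
```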

Checked all other benchmarks. None have errors with mudflap except for unaligned (as expected)

Running ./unaligned/dg.exp ...
FAIL: unaligned
Failed with exit(2)
gcc -g -fmudflap -lmudflap unaligned.c; ./a.out
Two integers: aabbccdd eeff0011
Byte 0: aabbccdd
Byte 1: 11aabbcc
Byte 2: 11aabb
Byte 3: ff0011aa
Byte 4: eeff0011
Byte 5: 65eeff00
Byte 6: fa66eeff
Byte 7: a8fa67ee
*******
mudflap violation 1 (check/read): time=1269831641.070330 ptr=0xbfa8fa65 size=4
pc=0x4003a8ed location=`unaligned.c:16:9 (main)'
      /usr/lib/libmudflap.so.0(__mf_check+0x3d) [0x4003a8ed]
      ./a.out(main+0x2d4) [0x8048a98]
      /usr/lib/libmudflap.so.0(__wrap_main+0x4f) [0x4003b34f]
Nearby object 1: checked region begins 5B into and ends 1B after
mudflap object 0x95d7b38: name=`unaligned.c:7:18 (main) a'
bounds=[0xbfa8fa60,0xbfa8fa67] size=8 area=stack check=4r/1w liveness=5
alloc time=1269831641.070284 pc=0x4003b2ed
number of nearby objects: 1
*******
mudflap violation 2 (check/read): time=1269831641.070605 ptr=0xbfa8fa66 size=4
pc=0x4003a8ed location=`unaligned.c:16:9 (main)'
      /usr/lib/libmudflap.so.0(__mf_check+0x3d) [0x4003a8ed]
      ./a.out(main+0x2d4) [0x8048a98]
      /usr/lib/libmudflap.so.0(__wrap_main+0x4f) [0x4003b34f]
Nearby object 1: checked region begins 6B into and ends 2B after
mudflap object 0x95d7b38: name=`unaligned.c:7:18 (main) a'
number of nearby objects: 1
*******
mudflap violation 3 (check/read): time=1269831641.070673 ptr=0xbfa8fa67 size=4
pc=0x4003a8ed location=`unaligned.c:16:9 (main)'
      /usr/lib/libmudflap.so.0(__mf_check+0x3d) [0x4003a8ed]
      ./a.out(main+0x2d4) [0x8048a98]
      /usr/lib/libmudflap.so.0(__wrap_main+0x4f) [0x4003b34f]
Nearby object 1: checked region begins 7B into and ends 3B after
mudflap object 0x95d7b38: name=`unaligned.c:7:18 (main) a'
number of nearby objects: 1
make: *** [all] Error 176

There is definitely some sort of array overrun in gsm. gcc -fbounds-check doesn't work. Apparently there are two other options: mudflap and ssp

Okay found something. There is a violation in Autocorrelation.

acanis@acanis-desktop:~/work/legup/examples/chstone/gsm$ gcc -g -fmudflap -lmudflap gsm.c
acanis@acanis-desktop:~/work/legup/examples/chstone/gsm$ ./a.out
*******
mudflap violation 1 (check/write): time=1269830619.922753 ptr=0xbf918c18 size=4
pc=0xb7e5d8ed location=`lpc.c:88:7 (Autocorrelation)'
      /usr/lib/libmudflap.so.0(__mf_check+0x3d) [0xb7e5d8ed]
      ./a.out(Autocorrelation+0x3e7) [0x80492e7]
      ./a.out(Gsm_LPC_Analysis+0x3b) [0x8051df4]
Nearby object 1: checked region begins 1B after and ends 4B after
mudflap object 0x90d9c68: name=`lpc.c:317:12 (Gsm_LPC_Analysis) L_ACF'
bounds=[0xbf918bf4,0xbf918c17] size=36 area=stack check=0r/0w liveness=0
alloc time=1269830619.922702 pc=0xb7e5e2ed
number of nearby objects: 1
*******
mudflap violation 2 (check/read): time=1269830619.923481 ptr=0xbf918c18 size=4
pc=0xb7e5d8ed location=`lpc.c:151:7 (Autocorrelation)'
      /usr/lib/libmudflap.so.0(__mf_check+0x3d) [0xb7e5d8ed]
      ./a.out(Autocorrelation+0x6762) [0x804f662]
      ./a.out(Gsm_LPC_Analysis+0x3b) [0x8051df4]
Nearby object 1: checked region begins 1B after and ends 4B after
mudflap object 0x90d9c68: name=`lpc.c:317:12 (Gsm_LPC_Analysis) L_ACF'
number of nearby objects: 1
*******
mudflap violation 3 (check/write): time=1269830619.923933 ptr=0xbf918c18 size=4
pc=0xb7e5d8ed location=`lpc.c:151:7 (Autocorrelation)'
      /usr/lib/libmudflap.so.0(__mf_check+0x3d) [0xb7e5d8ed]
      ./a.out(Autocorrelation+0x6806) [0x804f706]
      ./a.out(Gsm_LPC_Analysis+0x3b) [0x8051df4]
Nearby object 1: checked region begins 1B after and ends 4B after
mudflap object 0x90d9c68: name=`lpc.c:317:12 (Gsm_LPC_Analysis) L_ACF'
number of nearby objects: 1
Segmentation fault

Nice! Easy patch:

diff --git a/examples/chstone/gsm/lpc.c b/examples/chstone/gsm/lpc.c
index baa99d0..dfe126c 100755
--- a/examples/chstone/gsm/lpc.c
+++ b/examples/chstone/gsm/lpc.c
@@ -84,7 +84,7 @@ Autocorrelation (word * s /* [0..159]     IN/OUT  */ ,
 #define STEP(k)         L_ACF[k] += ((longword)sl * sp[ -(k) ]);
 
 #define NEXTI   sl = *++sp
-    for (k = 9; k >= 0; k--)
+    for (k = 8; k >= 0; k--)
       L_ACF[k] = 0;
 
     STEP (0);
@@ -147,7 +147,7 @@ Autocorrelation (word * s /* [0..159]     IN/OUT  */ ,
        STEP (8);
       }
 
-    for (k = 9; k >= 0; k--)
+    for (k = 8; k >= 0; k--)
       L_ACF[k] <<= 1;
 
   }
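The mudflap numbers line up with this patch. A small Python check, using only the addresses from the report above (the element size of 4 is taken from "size=4" in the violation lines; the helper name is made up):

```python
# Values copied from the mudflap report for L_ACF.
LOW = 0xbf918bf4        # first byte of L_ACF
HIGH = 0xbf918c17       # last byte of L_ACF
ELEM_SIZE = 4           # bytes per element, from "size=4"

def element_address(k):
    """Address of L_ACF[k] under the assumptions above."""
    return LOW + k * ELEM_SIZE

size_bytes = HIGH - LOW + 1
num_elements = size_bytes // ELEM_SIZE

print(size_bytes, num_elements)                      # 36 9
print(hex(element_address(9)))                       # 0xbf918c18
print(element_address(8) + ELEM_SIZE - 1 <= HIGH)    # True
```

So L_ACF holds 9 elements (indices 0..8), and index 9 lands exactly on the ptr=0xbf918c18 that mudflap flagged.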

Andrew Canis 2010/03/28 22:40

Figured out a solution to the memset() issue. Create a .bc for memset and then link it in with llvm-ld -disable-opt.

Looking into gsm segfault:

acanis@acanis-desktop:~/work/legup/examples/chstone/gsm$ valgrind ./a.out
==27584== Memcheck, a memory error detector.
==27584== Copyright (C) 2002-2008, and GNU GPL'd, by Julian Seward et al.
==27584== Using LibVEX rev 1884, a library for dynamic binary translation.
==27584== Copyright (C) 2004-2008, and GNU GPL'd, by OpenWorks LLP.
==27584== Using valgrind-3.4.1-Debian, a dynamic binary instrumentation framework.
==27584== Copyright (C) 2000-2008, and GNU GPL'd, by Julian Seward et al.
==27584== For more details, rerun with: -v
==27584== 
==27584== Invalid write of size 4
==27584==    at 0x8049647: main (gsm.c:103)
==27584==  Address 0xfffffff8 is not stack'd, malloc'd or (recently) free'd
==27584== 
==27584== Process terminating with default action of signal 11 (SIGSEGV)
==27584==  Access not within mapped region at address 0xFFFFFFF8
==27584==    at 0x8049647: main (gsm.c:103)
==27584==  If you believe this happened as a result of a stack overflow in your
==27584==  program's main thread (unlikely but possible), you can try to increase
==27584==  the size of the main thread stack using the --main-stacksize= flag.
==27584==  The main thread stack size used in this run was 8388608.
==27584== 
==27584== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 13 from 1)
==27584== malloc/free: in use at exit: 0 bytes in 0 blocks.
==27584== malloc/free: 0 allocs, 0 frees, 0 bytes allocated.
==27584== For counts of detected errors, rerun with: -v
==27584== All heap blocks were freed -- no leaks are possible.
Segmentation fault

If I add an int i declaration to Gsm_LPC_Analysis, the segfault disappears. If I comment out Autocorrelation, the segfault also disappears.

I don't understand how this line could ever segfault:

      for (i = 0; i < N; i++)

There must be a buffer overrun which overwrites the instruction?

mmmm. so gsm doesn't segfault when -O3 is enabled:

acanis@acanis-desktop:~/work/legup/examples/chstone/gsm$ gcc -O3 -g gsm.c; ./a.out
0

Really strange. I don't think this is worth fixing.

This I should fix though:

# main_tb.main_inst.memory_controller_inst.so.altsyncram_component Warning : Address pointed at port A is out of bound!
# main_tb.main_inst.memory_controller_inst.so.altsyncram_component Warning : Address pointed at port A is out of bound!
# main_tb.main_inst.memory_controller_inst.so.altsyncram_component Warning : Address pointed at port A is out of bound!
# main_tb.main_inst.memory_controller_inst.so.altsyncram_component Warning : Address pointed at port A is out of bound!
#          0
# At t=            21583000 clk=1 finish=1 return_val=         0
# ** Note: $finish    : gsm.v(6986)

For some reason I'm not seeing any latency improvement after enabling LTO. llvm-ld definitely performs optimizations:

acanis@acanis-desktop:~/work/legup/examples/chstone/dfsin$ llvm-ld dfsin.bc -b=dfsin.lto.bc -stats
===-------------------------------------------------------------------------===
                          ... Statistics Collected ...
===-------------------------------------------------------------------------===

   4 globalopt        - Number of functions converted to fastcc
  18 globalopt        - Number of functions deleted
   2 globalsmodref-aa - Number of functions without address taken
   4 globalsmodref-aa - Number of global vars without address taken
   2 gvn              - Number of instructions deleted
   1 gvn              - Number of loads deleted
   3 inline           - Number of functions deleted because all callers found
   3 inline           - Number of functions inlined
   4 instcombine      - Number of insts combined
  22 internalize      - Number of functions internalized
   4 internalize      - Number of global vars internalized
   1 loopsimplify     - Number of pre-header or exit blocks inserted
   2 memdep           - Number of block queries that were completely cached
1180 memdep           - Number of fully cached non-local ptr responses
 527 memdep           - Number of uncached non-local ptr responses
  58 sccp             - Number of basic blocks unreachable
   1 sccp             - Number of globals found to be constant by IPSCCP
 180 sccp             - Number of instructions removed
  10 sccp             - Number of instructions removed by IPSCCP

Definitely differences in bitcode:

acanis@acanis-desktop:~/work/legup/examples/chstone/dfsin$ wc -l dfsin.lto.ll
1678 dfsin.lto.ll
acanis@acanis-desktop:~/work/legup/examples/chstone/dfsin$ wc -l dfsin.ll
2122 dfsin.ll
acanis@acanis-desktop:~/work/legup/examples/chstone/dfsin$ ll dfsin.*bc
-rw-r--r-- 1 acanis acanis 17K 2010-03-28 02:42 dfsin.bc
-rwxr-xr-x 1 acanis acanis 14K 2010-03-28 02:42 dfsin.lto.bc

ohh shit. It's calling the verilog file something different!

-rw-r--r--  1 acanis acanis 286K 2010-03-28 02:45 dfsin.lto.v

pre-LTO:

# At t=           126617000 clk=1 finish=1 return_val=         0
# ** Note: $finish    : dfsin.v(13344)

post-LTO:

# At t=           114743000 clk=1 finish=1 return_val=         0
# ** Note: $finish    : dfsin.v(9599)

About 9% latency improvement (cycle counts before and after LTO):

; 63309-57372
	5937
; 5937/63309
	~0.09377813581007439701

Trying to figure out how to enable link time optimization. Enabling -flto does nothing as long as -c or -S is given - we get out the same LLVM bitcode. When you leave out -c and -S you get the error:

acanis@acanis-desktop:~/work/legup/examples/chstone/gsm$ make
llvm-gcc gsm.c --emit-llvm -O3 -flto  -fno-builtin -o gsm.bc
/tmp/cc7Icg04.o: file not recognized: File format not recognized
collect2: ld returned 1 exit status

This is basically saying the linker doesn't recognize LLVM bitcode. So I have to set this up now.

Okay, figured it out. Very simple actually. Note -c instead of -S: llvm-ld can't take textual LLVM assembly (.ll) as input.

llvm-gcc gsm.c --emit-llvm -O3 -c -fno-builtin -o gsm.bc
# produces binary bitcode: gsm.bc
llvm-ld gsm.bc -b=gsm.bc.lto
# produces a.out shell script (to call lli) and gsm.bc.lto binary bitcode
llvm-dis gsm.bc.lto
# produces textual bitcode: gsm.bc.lto.ll

Andrew Canis 2010/03/28 02:09

There should be an IR to represent Verilog so errors like this are not possible:

-- Compiling module ByRef
** Error: functions.v(225): Bounds of part-select into 'memory_controller_out' are reversed.
** Error: functions.v(225): MSB of part-select into 'memory_controller_out' is out of bounds.

Andrew Canis 2010/03/26 21:30

Found SPARK benchmark circuits MPEG-1 and adpcm:

New classes to be implemented:

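// Forward declarations so this outline compiles on its own.
#include <cstddef>
namespace llvm { class Instruction; }
using llvm::Instruction;
class HwModule;

// Output writers
class Writer { };
class VerilogWriter : public Writer { };
class VHDLWriter : public Writer { };
// Targets: devices and tools
class Target { };
class TargetDevice : public Target { };
class StratixIV : public TargetDevice { };
class TargetTool : public Target { };
class Altera : public TargetTool { };
class Generic : public TargetTool { };
class Binding { };
class FSM { };
// Hardware resources
class Resource { };
class FunctionalUnit : public Resource { };
class AddSub : public FunctionalUnit { };
class Multiplier : public FunctionalUnit { };
class Shifter : public FunctionalUnit { };
class MemoryUnit : public Resource { };
class Register : public MemoryUnit { };
class Constraints {
    public:
        Resource* getResource(Instruction* I) { return NULL; }
};
// Scheduling: maps each instruction to a state
class Schedule {
    public:
        virtual ~Schedule() { }
        virtual unsigned getState(HwModule *M, Instruction *I) = 0;
};
class ScheduleASAP : public Schedule { };
class ScheduleALAP : public Schedule { };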

Andrew Canis 2010/03/10 08:15

A good page with a lot of benchmark links:

Looking around for the '92 and '95 HLS benchmarks. Found a paper for '95.

Both of these links are broken:

Andrew Canis 2010/03/10 16:45

So it's basically an unaligned access due to memset. Can I get rid of the need for memset? Trying -ffreestanding. Doesn't work:

-fno-builtin also doesn't work

Running into a warning on gsm:

# addr:800000024
# memset cur_state: 12
# Autocorrelation cur_state:  95
# memset cur_state:  0
# Reflection_coefficients cur_state:   0
# Quantization_and_coding cur_state:   0
# main cur_state:  9
# main_tb.main_inst.memory_controller_inst.L_ACF_i.altsyncram_component Warning : Address pointed at port A is out of bound!

Andrew Canis 2010/03/09 02:31

Really really cool. There is an x86 instruction called 'rdtsc' that reads the time-stamp counter, which increments every clock cycle!

Running into alignment issues with memory controller. I think I'm going to make it a requirement that all accesses be aligned.

  • fully aligned (i.e. an N-byte integer starts at an address that is a multiple of N) numbers
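The alignment rule in the bullet above is just a modulo check. A minimal Python sketch (the function name and the example addresses are made up for illustration, they are not from the memory controller):

```python
def is_aligned(address, size_bytes):
    """True if an access of size_bytes starting at address is fully aligned."""
    return address % size_bytes == 0

print(is_aligned(0x1000, 4))   # True:  4-byte access at a multiple of 4
print(is_aligned(0x1002, 4))   # False: 4-byte access straddles a word boundary
print(is_aligned(0x1002, 2))   # True:  2-byte access at a multiple of 2
```

For power-of-two sizes the same test can be done in hardware by checking that the low log2(N) address bits are zero.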

Good description:

Andrew Canis 2010/03/03 04:31

Got modelsim working. Added the following to /etc/rc.local (to run at startup):

ssh -L 7325:128.100.10.141:7326 isis.eecg.utoronto.ca &
ssh -L 7327:128.100.10.141:7327 isis.eecg.utoronto.ca &

And added this to .bashrc:

export MGLS_LICENSE_FILE=7325@localhost
export PATH=/home/acanis/modelsim/install/modeltech/linux:$PATH

To synthesize a verilog file:

quartus_map dfadd --analyze_file=../dfadd.v

Andrew Canis 2010/02/16 08:40

Good verilog reference:

gnuplot:

Andrew Canis 2010/02/10 22:29

Example of dominator tree:

Andrew Canis 2010/02/02 16:04

Tip: if make is taking too long take out the -debug flag (speeds up code 100x)

Andrew Canis 2010/02/01 16:25

To create .mif soft links:

acanis@acanis-desktop:~/work/legup/examples/chstone/aes/testbench$ for i in ../*.mif
> do
> ln -s $i
> done

Note: you need the line breaks or you can do this one-liner:

for i in ../*.mif; do ln -s $i; done

Andrew Canis 2010/01/28 14:20

Multidimensional arrays are stored in row-major order in C (from Wikipedia). The offset of A[row][column] can then be computed as: offset = row*NUMCOLS + column
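A small Python check of the row-major formula, comparing it against a flattened nested list (the 3x4 array and the function name are just an example):

```python
def row_major_offset(row, column, num_cols):
    """Offset of A[row][column] in a flat row-major buffer."""
    return row * num_cols + column

NUM_ROWS, NUM_COLS = 3, 4
A = [[10 * r + c for c in range(NUM_COLS)] for r in range(NUM_ROWS)]
flat = [value for row in A for value in row]   # how C lays the array out

print(row_major_offset(1, 2, NUM_COLS))        # 6
print(flat[row_major_offset(1, 2, NUM_COLS)])  # 12
print(A[1][2])                                 # 12
```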

Andrew Canis 2010/01/26 14:24

Benchmark problems:

  • adpcm: dynamic pointers
  • aes: multi-dimensional array
  • blowfish: dynamic pointers (like adpcm)
    %10 = load i8* %d.1, align 1                    ; <i8> [#uses=1]
    addr:   %d.1 = select i1 %8, i8* %7, i8* %data          ; <i8*> [#uses=2]
    getRam:   %d.1 = select i1 %8, i8* %7, i8* %data          ; <i8*> [#uses=2]
    getRam:   %8 = icmp ult i8* %7, %3                        ; <i1> [#uses=1]
    LLVM ERROR: Cannot find ram!
  • gsm: LLVM intrinsic function memset
  • jpeg: global pointer, i.e. a ram needs to hold a pointer. Need to support pointer arithmetic: *ptr++
  • motion: same as jpeg
  • sha: multi-dimensional array

Andrew Canis 2010/01/22 16:44

adpcm problem. Dynamic pointers:

getRam:   %ril.0.in = getelementptr inbounds [31 x i32]* %quant26bt_pos.pn, i32 0, i32 %5 ; <i32*> [#uses=1]
getRam:   %quant26bt_pos.pn = select i1 %abscond, [31 x i32]* @quant26bt_pos, [31 x i32]* @quant26bt_neg ; <[31 x i32]*> [#uses=1]
LLVM ERROR: Cannot find ram!

Andrew Canis 2010/01/21 17:16

All softfloat benchmarks work (dfadd, dfsin, dfmul, dfdiv). Unfortunately, each function call adds an additional cycle of latency to a load/store, so for now I've set the latency of a load/store to 10. This will have to be fixed.

Stepping through dfadd. Double precision floating point:

sign (1 bit) | exponent (11 bits) | fraction (52 bits)
-1 = 0xbff0000000000000
-1 = 1 | 01111111111 | 00000000000000000000000000...
-1 = (-1)^1 x 2^ (1023 - 1023) X (1 + 0)
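The field breakdown above can be double-checked in Python. This is a sketch that only handles normalized numbers (no zero, denormals, infinity or NaN); the helper names are made up:

```python
import struct

def decode_double(bits):
    """Split a 64-bit pattern into (sign, exponent, fraction) fields."""
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7ff
    fraction = bits & ((1 << 52) - 1)
    return sign, exponent, fraction

def bits_to_float(bits):
    """Reinterpret the 64-bit pattern as an IEEE 754 double."""
    return struct.unpack(">d", struct.pack(">Q", bits))[0]

sign, exponent, fraction = decode_double(0xbff0000000000000)
value = (-1) ** sign * 2.0 ** (exponent - 1023) * (1 + fraction / 2 ** 52)

print(sign, exponent, fraction)            # 1 1023 0
print(value)                               # -1.0
print(bits_to_float(0xbff0000000000000))   # -1.0
```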

To print hex in gdb:

p/x b

Some old comments I found (single precision, 32-bit floating point):

// IEEE 754-1985 FP representation
// sign (1 bit) | exponent (8 bits) | fractions (23 bits)
// one = 0 | 01111111 | 00000000000000000000000;
// one = (-1)^0 x 2 ^ (127 - 127) x (1 + 0)
// two = 0 | 10000000 | 00000000000000000000000
// two = (-1)^0 x 2 ^ (128 - 127) x (1 + 0)
//
// matlab:
// dec2ieee754(1, 'single')

Andrew Canis 2010/01/20 9:45

Working: DFMUL, DFDIV. DFADD only 38/46 of tests work. DFSIN only 1/36 tests work.

Andrew Canis 2010/01/19 18:39

So I've noticed something. You can't simply call value.abs() on an APInt, because if the value is the most negative value you can't actually represent the absolute value, i.e. an 8-bit signed integer ranges from -128 to 127. So if you take abs(-128) you'll get -128 back out, not 128 as expected, because 128 is too big to fit.
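The wrap-around is easy to reproduce. This Python sketch models a fixed-width two's complement integer by hand; it is not the APInt API, and the helper names are made up:

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):
        return value - (1 << bits)
    return value

def wrapped_abs(value, bits):
    """abs() as a fixed-width integer would compute it (result truncated to bits)."""
    return to_signed(abs(value), bits)

print(wrapped_abs(-127, 8))   # 127
print(wrapped_abs(-128, 8))   # -128: +128 does not fit in 8 signed bits
```

One way around it is to widen the value by one bit before taking the absolute value, so the result always fits.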

Andrew Canis 2010/01/18 15:13

Add a “-64'd” prefix to fix the error: (note the negative sign needs to be in front of this prefix)

# ** Error: ../dfadd.v(800): near "9221120237041090560": Numeric value exceeds 32-bit capacity.
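A sketch of the literal formatting in Python, assuming constants are emitted as sized decimal literals with the sign in front of the width (the function name is made up; this is not the actual Verilog writer code):

```python
def verilog_decimal(value, width=64):
    """Format an integer as a sized Verilog decimal literal, sign first."""
    magnitude = abs(value)
    if magnitude >= (1 << width):
        raise ValueError("value does not fit in %d bits" % width)
    sign = "-" if value < 0 else ""
    return f"{sign}{width}'d{magnitude}"

# The constant from the error message (it is the bit pattern 0x7FF8000000000000).
print(verilog_decimal(9221120237041090560))   # 64'd9221120237041090560
print(verilog_decimal(-5))                    # -64'd5
```

Without the size prefix the simulator treats the number as a 32-bit value, which is where the "exceeds 32-bit capacity" error comes from.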

To handle global variables:

  • Every function has inputs/outputs for every global variable. So every global value is passed from main() down to every other function.

dfadd is almost working.

Andrew Canis 2010/01/17 18:26

Found a good introduction to alias analysis in a master's thesis: http://lenherr.name/~thomas/ma/introduction.page

Andrew Canis 2010/01/16 06:35

Found a bug. Dependencies between load/store/call instructions are not accounted for. These instructions are not connected by a use-def chain, so we need alias analysis. LLVM seems to have an analysis pass called Memory Dependency Analysis:

Andrew Canis 2010/01/14 15:58

Todo: fix dealing with reference parameters to functions.

Andrew Canis 2010/01/13 07:43

To avoid function inlining:

llvm-gcc -fno-inline-functions 

Other issues:

  • global variables - names need to be available to each h/w module - but only initialized once

Andrew Canis 2010/01/12 02:47

dfadd issues:

  • functions
  • pointers passed as function parameters.
    • Use a register file, pass address
    • Make the parameter an 'output' and also an 'input' if used. Pass the value pointed to, and then modify all variables when the function returns. I think this is best.
  • void return

Really cool, type 'wh' in gdb.

Debug build: ./configure --disable-optimized

Andrew Canis 2010/01/08 05:36

There is still one Quartus warning:

"can't check case statement for completeness because the case expression has too many possible states"

This warning occurs when the variable used in switch() is 32 bits wide, which is too big for Quartus to check all 2^32 possibilities. I don't think this is a big deal to leave in for now. In the future we could downcast the variable to be just wide enough for the biggest case value, which will be much smaller than 32 bits.
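How narrow the case expression could get is a one-liner. A Python sketch, assuming all case values are non-negative (the function name and the example case values are made up):

```python
def case_width(case_values):
    """Minimum number of bits needed to represent the largest case value."""
    return max(max(case_values).bit_length(), 1)

cases = [0x00, 0x21, 0x23, 0x2a]   # hypothetical switch cases
width = case_width(cases)

print(width)        # 6
print(2 ** width)   # 64 possible states to check instead of 2**32
```

Values outside the narrowed range would still need to fall through to the default case, so the upper bits have to be tested separately.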

Mips is working! PC incremented properly. dmem is getting out the correct sorted values, and the return value is correct.

# pc=00400020, state=232, dmem_out=ffffffef, dmem_address= 1, dmem_write_enable=0, dmem_in=00000026,
# pc=00400020, state=233, dmem_out=fffffff7, dmem_address= 2, dmem_write_enable=0, dmem_in=00000026,
# pc=00400020, state=234, dmem_out=00000000, dmem_address= 3, dmem_write_enable=0, dmem_in=00000026,
# pc=00400020, state=235, dmem_out=00000003, dmem_address= 4, dmem_write_enable=0, dmem_in=00000026,
# pc=00400020, state=236, dmem_out=00000005, dmem_address= 5, dmem_write_enable=0, dmem_in=00000026,
# pc=00400020, state=237, dmem_out=0000000b, dmem_address= 6, dmem_write_enable=0, dmem_in=00000026,
# pc=00400020, state=238, dmem_out=00000016, dmem_address= 7, dmem_write_enable=0, dmem_in=00000026,
# pc=00400020, state=239, dmem_out=00000026, dmem_address= 7, dmem_write_enable=0, dmem_in=00000026,
# pc=00400020, state=240, dmem_out=00000026, dmem_address= 7, dmem_write_enable=0, dmem_in=00000026,
# pc=00400020, state=241, dmem_out=00000026, dmem_address= 7, dmem_write_enable=0, dmem_in=00000026,
# pc=00400020, state=242, dmem_out=00000026, dmem_address= 7, dmem_write_enable=0, dmem_in=00000026,
# pc=00400020, state=243, dmem_out=00000026, dmem_address= 7, dmem_write_enable=0, dmem_in=00000026,
# At t=            25126000 clk=1 return_val=         0
#  exit

Initial stats (Stratix IV Auto): latency 25126/2=12563 cycles. Fmax=234.63MHz (slow corner). The latency is 1/234.63e6*25126/2=53.54us
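The same arithmetic in Python, in case the numbers need to be redone for another benchmark. The divide-by-2 follows the calculation in the line above (simulation time over a clock period of 2 time units); the variable names are made up:

```python
finish_time = 25126       # t=25126000 from the simulation log, scaled as in the note
sim_clock_period = 2      # simulation time units per clock cycle
fmax_hz = 234.63e6        # Fmax reported by Quartus (slow corner)

cycles = finish_time // sim_clock_period
latency_us = cycles / fmax_hz * 1e6

print(cycles)                 # 12563
print(f"{latency_us:.2f}")    # 53.54
```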

Overall not that bad for a first cut. xPilot had a latency of 42us (fast constraint) and 30us (slow constraint).

Area:

  • 244 states
  • 1952 ALUTs
  • 3192 registers
  • 4512 BRAM bits
  • 8 DSPs

Added a noop add instruction after each load to handle the RAM latency. This was really needed because phi instructions were being pushed back into the same state as the load, which is wrong, they need to be 2 cycles afterwards. Adding the noop solves this.

Andrew Canis 2009/12/21 14:54

From wikipedia:

Verilog-2001 is the dominant flavor of Verilog supported by the majority of commercial EDA software packages.

So we can use the arithmetic shift right (>>>) operator

I'm adding some assertions to the code to fail when we don't support instructions (floating point etc).

Andrew Canis 2009/12/18 17:17

Running into an issue with signed/unsigned numbers I think. After the instruction:

  0x1100000b,			// [pc=0x00400070]  beq $8, $0, 44 [L2-0x00400070]  ; 23: beq  $t0,$zero,L2

The pc jumps to 0x4440009c instead of:

  0x8fbf0008,			// [pc=0x0040009c]  lw $31, 8($29)                  ; 34: lw   $ra,8($sp)     ; L2
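A possible explanation, checked only against the numbers above and not confirmed in the generated hardware: the bad pc is exactly what you get if the whole 32-bit instruction word is shifted and added, instead of just its 16-bit immediate. The Python sketch below follows the annotation in the listing (target = pc + offset, with the offset relative to 0x00400070); the helper names are made up:

```python
MASK32 = 0xffffffff

def sign_extend16(value):
    """Take the low 16 bits and interpret them as a signed number."""
    value &= 0xffff
    return value - 0x10000 if value & 0x8000 else value

def branch_target(pc, instr):
    """Expected target: only the 16-bit immediate contributes."""
    return (pc + (sign_extend16(instr) << 2)) & MASK32

def branch_target_unmasked(pc, instr):
    """Hypothetical bug: the upper instruction bits leak into the offset."""
    return (pc + (instr << 2)) & MASK32

pc, instr = 0x00400070, 0x1100000b
print(hex(branch_target(pc, instr)))            # 0x40009c
print(hex(branch_target_unmasked(pc, instr)))   # 0x4440009c
```

If that is what is happening, the fix is in how the 16-bit (short) immediate is truncated and sign-extended before the shift.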

Todo: fix ram to handle signed integers. Right now it assumes unsigned.

Note, the LLVM primitive types (i8, i32, etc) do not have a sign. Instead certain instructions have two versions: ashr, lshr. See: http://nondot.org/sabre/LLVMNotes/TypeSystemChanges.txt
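To make the ashr/lshr difference concrete, here is a Python model of both shifts on a fixed-width bit pattern. It only mimics the semantics for 0 <= shift < bits and is not LLVM code; the helper names are made up:

```python
def lshr(value, shift, bits):
    """Logical shift right: zeros are shifted in from the left."""
    mask = (1 << bits) - 1
    return (value & mask) >> shift

def ashr(value, shift, bits):
    """Arithmetic shift right: the sign bit is replicated from the left."""
    mask = (1 << bits) - 1
    value &= mask
    result = value >> shift
    if value >> (bits - 1):                 # sign bit set
        result |= mask & ~(mask >> shift)   # fill the vacated bits with ones
    return result

# 0xf0 is -16 when read as a signed 8-bit value.
print(hex(lshr(0xf0, 2, 8)))   # 0x3c (60)
print(hex(ashr(0xf0, 2, 8)))   # 0xfc (-4 as a signed 8-bit value)
```

The bits are the same type in both cases; only the instruction decides whether they are treated as signed.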

.mif files don't support negative decimal numbers

I've updated the square root approx example to include a global array and a printf. It's working fine. Working on mips right now, the issues still to deal with:

  • Switch statements
  • Different types: short, long long

Had an architecture meeting yesterday. The slides were a little bit technical - focusing too much on C++ class details. But generally looked good. The goal now is just to push the rest of the chstone through and get a test suite. Then we can focus on QoR for a bit, including some basic resource constraints. When we're at roughly the same performance as xPilot we can make a public release.

Required reading for using LLVM: http://llvm.org/docs/ProgrammersManual.html#coreclasses

Andrew Canis 2009/12/17 09:48

To generate doxygen documentation for Verilog.cpp:

doxygen Doxyfile

Played around with Spark a little more. Doesn't support any c++ features (const, c++ comments). Also doesn't seem to support arrays, for instance this snippet gives a segfault in common subexpression elimination:

int main () {
    int main_result;
    int i;
    int A[2];
    A[0] = 1;
    A[1] = 2;
    for (i = 0; i < 2; i++) {
        main_result = A[i];
    }
    return main_result;
}

Tried running spark on mips.c, I get a segfault:

acanis@acanis-desktop:~/spark/spark-linux-1.2/tutorial/mips$ spark -m -hli -hcs -hcp -hdc -hs -hcc -hvf -hb -hec mips.c 
Copyright (C) 2000-2003 The Regents of the University of California. All Rights Reserved.

SPARK Version 1.2 (built on Feb  4 2004 17:31:10) is initializing ... Done!

-- Start initializing IR (and dependences) for routine: mips_c_main()
  -- Start initializing routine: mips_c_main()
At end of source: internal error: assertion failed: dump_expr: bad expr node
          kind (c_gen_be.c, line 4497)


 : init_stmt : 
-- Start Build Dominator Tree on mips_c_main
-- Done Build Dominator Tree on mips_c_main
-- Start Lowering Exprs of routine: mips_c_main()
Got false from isVarConstField for OrigLeftHandOperand 
Got false from isVarConstField for OrigLeftHandOperand 
Got false from isVarConstField for OrigLeftHandOperand 
Got false from isVarConstField for OrigLeftHandOperand 
Got false from isVarConstField for OrigLeftHandOperand 
-- Done Lowering Exprs of routine: mips_c_main()

-- Start DataDependence on mips_c_main
-- Done DataDependence on mips_c_main
-- Done initializing IR (and dependences) for routine: mips_c_main()
WARNING: not computing loop bounds for WhileLoopNode just yet
WARNING: not computing loop bounds for WhileLoopNode just yet

-- Start Doing Loop Invariant Code Motion in  routine: mips_c_main()

-- Start LoopInvariantCM on mips_c_main
-- Done LoopInvariantCM on mips_c_main
-- Done Doing Loop Invariant Code Motion in  routine: mips_c_main()

-- Start Unrolling Loops in  routine: mips_c_main()
-- Done Unrolling Loops in  routine: mips_c_main()

-- Start Finding Common SubExprs (CSE) in  routine: mips_c_main()
Segmentation fault

Andrew Canis 2009/12/15 13:36

Using the visitor pattern is actually really easy. Just inherit from InstVisitor, override the necessary visitXXX methods and then call visit(instr). I can use this to avoid a few of the case statements in my code.

Andrew Canis 2009/12/10 18:00

The volatile keyword is handled in llvm by accessing the variables using 'load/store volatile'. So you can assume normal registers are never volatile.

Looking at mips.c. There are a few issues I immediately foresee:

  • Global variables
  • Ram initialization
  • Switch statements
  • Different types: short, long long
  • printf()

Andrew Canis 2009/12/10 00:19

Note: there is a df_iterator for depth first iteration of the CFG.

Added support for phi instructions. Used a function called printPHICopiesForSuccessor() from CBackend.

Creating a new user:

sudo adduser --home /home/xxxx xxxx

I can now split up basic blocks based on data dependencies. First I tried cloning instructions but I noticed the names were different, so I changed the code to move instructions to the new basic block instead. I made use of splitBasicBlock().

Now I'm going to get loop.c working. I'll have to fix up my ram address code.

Andrew Canis 2009/12/09 01:54

Mark came up with a good idea of how to handle breaking basic blocks up based on data dependencies. First find which state each instruction should be in, then just sort the instructions based on their state, which lets you use splitBasicBlock() in LLVM.
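A rough Python sketch of the idea, with instructions as plain strings and a made-up state assignment. It assumes the schedule already respects data dependencies, so a stable sort by state is safe:

```python
from itertools import groupby

def split_by_state(instructions, state_of):
    """Sort instructions by state (stable) and cut the list at each state change."""
    ordered = sorted(instructions, key=state_of)
    return [list(group) for _, group in groupby(ordered, key=state_of)]

# Hypothetical basic block and schedule.
block = ["load a", "add", "load b", "mul", "store"]
states = {"load a": 0, "add": 1, "load b": 0, "mul": 1, "store": 2}

for state_block in split_by_state(block, states.get):
    print(state_block)
# ['load a', 'load b']
# ['add', 'mul']
# ['store']
```

Each cut point in the sorted list corresponds to one splitBasicBlock() call in the real pass.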

Andrew Canis 2009/12/08 02:30

I've decided to implement the state machines using the llvm IR. Each state will be a separate basic block, and the terminating statement will be converted to the appropriate next_state verilog code. I feel this will simplify things, and I don't see any benefit right now for adding a new data structure just for states.

How should we handle phi instructions? I think we should keep phi instructions around as long as possible until the actual Verilog generation. To turn phi instructions into verilog you have to push the assignment back into each state mentioned in the phi. There is some code in PHIElimination.cpp to do this. Can I reuse it? Also CBackend has the same issue, there's probably code I can use in there.

Another idea I've had is to use verilog blocking assignments within a state. This will be great because we won't have to worry about data dependencies within a state. One thing to make sure of is that a load from RAM can't be used within the same state.

I've read the GraphTraits class that defines the CFG between basic blocks. I've realized that this simply wraps the following:

  • For BasicBlock successors: BasicBlock->getTerminator()->getSuccessor(i)
  • For BasicBlock predecessors: iterate over the uses of the BasicBlock, which are either terminator or phi instructions. Ignore the phi instructions and call terminator->getParent()
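The two bullets above can be modelled in a few lines of Python. This is a toy model with made-up class and function names, not the LLVM classes: each block keeps a list of its users, and predecessors are recovered by keeping only the terminator users.

```python
class Block:
    def __init__(self, name):
        self.name = name
        self.successors = []   # targets of this block's terminator
        self.users = []        # (kind, parent_block) pairs that reference this block

def add_branch(source, *targets):
    """Give source a terminator that branches to targets."""
    source.successors = list(targets)
    for target in targets:
        target.users.append(("terminator", source))

def add_phi_use(incoming, phi_parent):
    """Record that a phi inside phi_parent names incoming as an incoming block."""
    incoming.users.append(("phi", phi_parent))

def predecessors(block):
    """Keep only terminator users and return their parent blocks."""
    return [parent for kind, parent in block.users if kind == "terminator"]

entry, loop, done = Block("entry"), Block("loop"), Block("done")
add_branch(entry, loop)
add_branch(loop, loop, done)
add_phi_use(entry, loop)       # phi in loop with an incoming value from entry

print([b.name for b in loop.successors])       # ['loop', 'done']
print([b.name for b in predecessors(loop)])    # ['entry', 'loop']
print([b.name for b in predecessors(entry)])   # []
```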

What's interesting is that even BasicBlocks have a use-def chain. In fact every Value has a use-def chain.

The main data structure of LLVM: doubly-linked lists. Either instructions within basic blocks, basic blocks within functions, functions within modules.

Andrew Canis 2009/12/06 09:26

To generate llvm doxygen:

llvm/$ cd docs
llvm/docs$ make doxygen

Look at: llvm/Support/CFG.h

Print out a dot graph of the CFG:

F.viewCFG();
F.viewCFGOnly();

Found a good STL c++ reference:

Andrew Canis 2009/12/04 23:32

Cool dot graph at this site: http://compilers.cs.ucla.edu/fernando/projects/soc/proposal.html

So here's my issue. tags in gvim suck. For instance, let's say I want to find the class Value:

:ts Value
# scroll through 50 lines, find that it's 20, hit q, then 20<enter>

Now I'm looking at the Value class, I notice a print() method. I try ctrl-]. Get back 100 matches, none of which are remotely correct.

So the problem here is that the prototype is defined as llvm::Value::print, but the declaration is simply Value::print (inside a using namespace llvm). So it can't be found… Is this a bug in ctags? Nope. Changed the declaration to be llvm::Value::print, still ctrl-] doesn't prioritize it.

Read this: http://cscope.sourceforge.net/cscope_vim_tutorial.html

To use cscope in vim. :h cscope-howtouse

$ cscope -b -q `getsrcs.sh`

Now just type <C-X> <C-O> to open auto complete dialog. :h omnicppcomplete

Added:

map <C-F12> :!ctags -R --c++-kinds=+p --fields=+iaS --extra=+q --languages=c++ .<CR>

Found a really cool vim plugin:

You have to build your ctags database with at least the following options:

        --c++-kinds=+p  : Adds prototypes in the database for C/C++ files.
        --fields=+iaS   : Adds inheritance (i), access (a) and function 
                          signatures (S) information.
        --extra=+q      : Adds context to the tag name. Note: Without this
                          option, the script cannot get class members.

Just some commands to remember for browsing vim tags: type 'g]' to select one tag from a list (hit 'q' if too many); ctrl-t jumps back after using ctrl-]

Added 's' command in visual mode for adding code tags around selection:

:vmap s d<S-o><code><enter><\/code><esc>kp

Added the 'doku' command to my .bashrc:

alias doku='gvim -u ~/.vim/doku.vim'

And added the following to doku.vim:

source ~/.vim/dokuvimki.vim
:DWAuth
:DWEdit andrew_s_log

Wow, really really cool. I can edit this wiki using a vimplugin: http://www.stumbleupon.com/su/1b3ysr/www.chimeric.de/blog/2008/0314_dokuwiki_xml-rpc_g_vim_dokuvimki

:DWAuth
:DWEdit andrew_s_log
# edit
:DWSend
:help dokuvimki

Found dokuwiki shortcut keys: http://www.dokuwiki.org/accesskeys ctrl-shift e to edit. ctrl-shift s to save.

llvm printing is done in lib/VMCore/AsmWriter.cpp

Trying to figure out why llvm uses a custom StringRef class instead of std::string. Turns out the reason is std::string always stores the string value in the heap which can cause problems in multithreaded environments where there is contention for the global heap (http://www.ddj.com/cpp/184405453). Same with Twine. Note: neither StringRef nor Twine stores the actual string data, so don't try to store them.

Blocked legup.org from google for now (robots.txt):

User-agent: *
Disallow: /

Andrew Canis 2009/12/03 21:53

Had to add daemon=yes to my gitosis.conf file for git-daemon to work. And had to add this to gitweb.conf so gitweb=no works:

# Point to projects.list file generated by gitosis.
$projects_list = "/home/git/gitosis/projects.list";

There's a difference between git-daemon and gitosis access. To get read-only access using git-daemon run:

git clone git://legup.org:7326/legup legup 

For push access (over ssh), add your ssh key to gitosis-admin and then run:

git clone git@legup.org:legup legup

Playing around with the gitosis permissions. It's helpful to look at /var/log/git-daemon. Getting the error: 'receive-pack': service not enabled. Okay, this is expected, git-daemon is read-only by default. So I still need to configure gitosis properly. Following: http://vafer.org/blog/20080115011413

acanis@acanis-desktop:~$ git clone git@legup.org:gitosis-admin gitosis-admin

Just installed the new VirtualBox (3.1). Adds support for live migration and branched snapshots

You can now run any previous snapshot and create branches from previous snapshots. This is a huge win, because we don't have to make a new VM anymore every time we want to virtualize another fresh Ubuntu environment. We can just keep around a single snapshot of a clean install and branch it as needed.

Andrew Canis 2009/12/02 23:28

Okay it's working. Latency 11240/2=5620. Fmax=125

# At t=            11240000 clk=1 ret_result=         0

So it appears the default: case of the switch() statement is not handled properly. If you look at b30s in the code, you can see the 16 cases of the switch statement: 32'h 21 (ADDU), 32'h 23 (SUBU), etc. However, the default case actually tries to test every single value not included in the switch statement, and misses tons of cases. I replaced the default case with what it should have been (true only if no other case was true):

ni1148_suif_tmp39 = ~(ni1125_suif_tmp23 | ni1126_suif_tmp24 | ni1127_suif_tmp25 | ni1131_suif_tmp26 | ni1135_suif_tmp27 | ni1136_suif_tmp28 | ni1137_suif_tmp29 | ni1138_suif_tmp30 | ni1139_suif_tmp31 | ni1140_suif_tmp32 | ni1141_suif_tmp33 | ni1142_suif_tmp34 | ni1143_suif_tmp35 | ni1144_suif_tmp36 | ni1145_suif_tmp37 | ni1146_suif_tmp38 );
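The replacement above is just "default = NOR of all the other case matches". A Python sketch of that logic, using the two opcodes named above plus one made-up case value (the function names are made up too):

```python
def case_matches(value, cases):
    """One match flag per explicit case, like the ni11xx_suif_tmp signals."""
    return [value == case for case in cases]

def default_taken(value, cases):
    """Default fires only if no explicit case matched."""
    return not any(case_matches(value, cases))

cases = [0x21, 0x23, 0x2a]   # ADDU, SUBU, plus a hypothetical third case

print(default_taken(0x21, cases))   # False
print(default_taken(0x99, cases))   # True
```

Written this way, exactly one of the explicit cases or the default is active for every input value, which the hand-enumerated default did not guarantee.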

I seem to be getting stuck in state 48:

# At t=            11197001 clk=1 thisState=         48 p_return_value_en=0 stateEn=1

r_pc is not quite the actual pc. It seems to increment once too many every time there is a branch. But this matches the behaviour of r_n_instr which only increments once in these cases.

dmem is getting written with the right final result. Here's the final write to dmem[7]:

...
# At t=            10766000 clk=1 we=1 w_addr= 7 r_addr=56 din=00000026 dout=xxxxxxxx

Didn't create module for cregister. The .mif files forgot the ';' after END.

Still doesn't work. The shl() doesn't seem to be implemented properly.

# ** Fatal: (vsim-3734) Index value 32 is out of range 31 downto 0.
#    Time: 695 ns  Iteration: 2  Process: /tb_mips/mips_0/mips/line__1398 File: Z:/mips/hw/lib/impack.vhd
# Fatal error in ForLoop loop at Z:/mips/hw/lib/impack.vhd line 424
# 
# HDL call sequence:
# Stopped at Z:/mips/hw/lib/impack.vhd 424 ForLoop loop
# called from  Z:/mips/hw/mips_comp.vhd 1398 Architecture rtl

Needed to replace:

architecture rtl of main is
...
  ni659_reg <= shl(r_reg, r_shamt);

with

library ieee;
use ieee.numeric_std.all;
architecture rtl of main is
...
  ni659_reg <= shl(r_reg, to_integer(unsigned(r_shamt)));

From Windows modelsim:

vsim 4> do setup.tcl
vsim 4> r
vsim 4> run 10us
# Cannot continue because of fatal error.
# HDL call sequence:
# Stopped at C:/altera/91/modelsim_ase/win32aloem/../altera/vhdl/src/altera_mf/altera_mf.vhd 39474 Subprogram read_my_memory
# called from  C:/altera/91/modelsim_ase/win32aloem/../altera/vhdl/src/altera_mf/altera_mf.vhd 40968 Process memory
# 

ImpulseC can't generate testbenches in verilog. My modelsim on linux doesn't have a vhdl simulation license. To make a project from linux:

acanis@acanis-desktop:~/vmshare/mips$ icProj2make.pl mips.icProj 
# Make targets supported: clean, build_exe, build, build_testbench, export_hardware, export_software
make -f _Makefile build

Interesting command called impulse_s2xml that converts the binary IR into an xml format (Fir51.xic)

ImpulseC license isn't working in my VirtualBox. The MAC address is different. Click gear icon beside Network→NAT to change MAC address. Works.

So there is no actual GUI for ImpulseC in linux: http://impulse-support.com/forums/index.php?showtopic=832&hl=linux

The documentation for the Linux release is somewhat minimal, but you can find a section entitled “Command Line Tools” that explains how to create Makefiles from .icProj files, which are then used to build various targets: desktop simulation executable, HDL generation, and exporting hardware and software files.
To create a project under Linux, create a .icProj file using a text editor. Modifying an existing file is a good way to start. The key/value pairs used for the project options correspond to the same fields shown in the CoDeveloper GUI environment under Windows, so you can refer to the CoDeveloper documentation for information on these options.

Remember to source: ~/impulsec/codeveloper/Impulse/CoDeveloper3/codeveloper-profile.sh

Getting ImpulseC to work. Also installing ImpulseC and Quartus on my Windows VirtualBox. Had to add:

SERVER acanis-desktop 00044b190164 <port number>

To license file. Still get an error when running:

acanis@acanis-desktop:~/impulsec/FLEXnet/v11.5/i86_re3$ sudo ./lmgrd -c /home/acanis/impulsec/codeveloper/Impulse/CoDeveloper3/bin/CoDeveloper.lic 
....
16:56:04 (lmgrd) impulsed already running 27001@acanis-desktop
16:56:04 (lmgrd) The license server manager (lmgrd) is already serving all vendors, exiting.

Andrew Canis 2009/12/01 16:45

Code for testing ac_fixed:

#include "ac_fixed.h"
#include <iostream>
using namespace std;

int main() {
    const float num = 1.888888;
    ac_fixed<16,3,true> a = num;
    ac_fixed<16,10,true> b = num;
    ac_fixed<16,16,true> c = num;

    cout << a.to_string(AC_DEC) << "\n";
    cout << b.to_string(AC_DEC) << "\n";
    cout << c.to_string(AC_DEC) << "\n";
}
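For comparison, a small Python model of what the three assignments should do. It assumes ac_fixed<W,I,true> has W total bits with I integer bits (sign included), and that the default modes truncate toward minus infinity and wrap on overflow; the function name is mine, not part of the AC library:

```python
import math


def quantize(value, width, int_bits):
    """Model a signed ac_fixed<width, int_bits, true> assignment,
    assuming truncation toward -infinity and wrap-around on overflow."""
    frac_bits = width - int_bits
    raw = math.floor(value * 2 ** frac_bits)   # truncate extra fraction bits
    raw &= (1 << width) - 1                    # keep `width` bits (wrap)
    if raw >= 1 << (width - 1):                # reinterpret as signed
        raw -= 1 << width
    return raw / 2 ** frac_bits


for int_bits in (3, 10, 16):
    print(int_bits, quantize(1.888888, 16, int_bits))
```

With 10 integer bits only 6 fraction bits remain, so 1.888888 becomes 1.875; with 16 integer bits there is no fraction at all and the value becomes 1.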

Found another xPilot bug. In the xPilot-prj.tcl file, set_cycle_time cannot be larger than 8 (otherwise the compiler gets stuck in an infinite loop). Also the “-unit ns” means nothing, xPilot always uses ns.

Just discovered how to use TbGen.tcl:

~/xpilot/xpilot-rel/samples/mips/auto_work/systemc$ TbGen.tcl main.tbgen.tcl

Found a great html validation site: http://www.htmlhelp.com/tools/validator/

mips.c error about printf():

Error: Cannot find the CDFG: printf, you may import a design first.
Shell-4 ERROR: CDFG "printf" not imported

Andrew Canis 2009/11/30 14:16

Installed Windows and Office '07 in a VirtualBox VM so I can create the powerpoint presentation for Wednesday.

CHStone benchmarks are on a Virtex 4. Slice: 2 4-LUTs, 2 2-input MUXes, 2 registers.

Andrew Canis 2009/11/29 22:28

Trying dfadd. If I use a void pointer cast in ullong_to_double() instead of a union then xPilot synthesizes without an assertion. Strange: the printf() in main compiles fine! This didn't happen for mips.c.

Andrew Canis 2009/11/28 22:09

Okay finally working. My comments:

  • Need a 1-1 mapping (or close to) between C variable names and verilog variable names for easier debugging.
    • reports/main.verbose.rpt has useful info though
  • sign extend causes lots of problems. Should try to always assign signals of equal bit length. xPilot made a mistake here.
  • Don't make user manually create rams and initialize them. It's error prone.
  • dumping out rams to files can be a useful debugging tool.
  • wire/reg issues when assigning ports on rams. An input to a ram must be a wire because rams can't be instantiated inside an always block.
  • strange N_n_main module instance. I'm not sure what it does but to get the code working I had to replace it with:
always @(posedge clk)
    __main_134_done <= __main_134_start;
  • Finished at t=9570, so 9570/2 = 4785 clock cycles.
  • states=98
  • I don't understand how they come up with this. How can you determine worst-case latency when the input is arbitrary?
Best-case latency (# of clock cycles): 83
Averge-case latency (# of clock cycles): 153
Worst-case latency (# of clock cycles): 178
 
  • Resource usage (for clock period=8ns)
- Synthesized datapath summary
------------------------------------------------------------
* Number of ports: 4
* Number of functional units: 165
    * Number of mults: 1
    * Number of addsubs: 10
    * Others = 154
* Number of registers: 32
* Number of register files: 0
* Equivalent number of 2-to-1 multiplexers: 111
* Number of nets: 443

I made another mistake:

ram[address0] <= d0;
$writememh(write, ram);

Wasn't printing the change to ram (nonblocking assignment). Changed to:

ram[address0] = d0;
$writememh(write, ram);
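Why the dump was stale: a nonblocking assignment only takes effect at the end of the current time step, after the dump has already run. A toy Python model of one time step (the class and method names are illustrative, not simulator APIs):

```python
class Ram:
    """Toy model of one simulation time step over a small memory."""

    def __init__(self, size):
        self.cells = [0] * size
        self.pending = []              # updates scheduled by '<='

    def nonblocking_write(self, addr, data):
        """Like `ram[addr] <= data`: takes effect at the end of the step."""
        self.pending.append((addr, data))

    def blocking_write(self, addr, data):
        """Like `ram[addr] = data`: takes effect immediately."""
        self.cells[addr] = data

    def dump(self):
        """Like $writememh: snapshot of what is stored right now."""
        return list(self.cells)

    def end_of_step(self):
        """Apply the scheduled nonblocking updates."""
        for addr, data in self.pending:
            self.cells[addr] = data
        self.pending.clear()


ram = Ram(4)
ram.nonblocking_write(1, 0x26)
stale = ram.dump()        # still shows the old value at address 1
ram.end_of_step()
ram.blocking_write(2, 0x16)
fresh = ram.dump()        # shows both writes
print(stale, fresh)
```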

dmem_q0 = ffff ffef, outdata_q0 = 0fff ffef, when they should be identical. Damn, that's my fault: negative values in outdata were not 32 bits wide, and $readmemh doesn't sign extend.
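One way to avoid this when generating the memory file is to always write full-width two's-complement words. A sketch in Python; both helper names are hypothetical, and the second one only mimics the zero-padding behaviour described above:

```python
def to_hex32(value):
    """Format a signed integer as 8 hex digits of 32-bit two's complement."""
    return format(value & 0xFFFFFFFF, "08x")


def readmemh_word(text):
    """Mimic how a hex word is loaded: missing high digits become zero."""
    return int(text, 16)


print(to_hex32(-17))                      # full-width word
print(hex(readmemh_word("fffffef")))      # 7 digits: top nibble reads as 0
```

Here -17 is the value behind ffffffef; dropping one leading digit reproduces the 0fffffef seen on outdata_q0.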

$readmemh is supported but using a parameter as the filename gives an error! So you have to set the file to a constant string literal.

Got the program counter of the verilog simulation matching the gcc mips.c version perfectly. There was a problem with Reg_239 being a signed 5 bit number, which meant when it was assigned to the 6 bit address it was sign extended, making 31 (011111) → 63 (111111).
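The 31 → 63 corruption is exactly what sign extension of a 5-bit value into 6 bits does when the top bit is set. A small Python illustration (function names are mine):

```python
def sign_extend(value, from_bits, to_bits):
    """Widen a from_bits-wide two's-complement value by copying its MSB."""
    value &= (1 << from_bits) - 1
    if value & (1 << (from_bits - 1)):            # MSB set: negative
        value |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return value


def zero_extend(value, from_bits, to_bits):
    """Widen an unsigned value: the new high bits stay zero."""
    return value & ((1 << from_bits) - 1)


print(format(sign_extend(31, 5, 6), "06b"), format(zero_extend(31, 5, 6), "06b"))
```

An address should be zero extended, which is what declaring the register unsigned (or padding it explicitly) achieves.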

But the result is still incorrect. I need to stop the simulation as soon as done=1. There are lots of instances of size truncation in the design, so I'm wondering whether a similar problem is affecting the dmem ram. The results are very close: dmem is correct except for the last entry, which is 0x16 instead of 0x26.

Andrew Canis 2009/11/27 13:44

Cool, just read that Quartus supports the $readmemb and $readmemh system commands in Verilog to initialize memories with a text file.

Okay, I can get ret_result to -8 if I change the input data to match mips.c (I was using a different input vector). This is due to the fact that 3 is the same between input and output. So, the n_inst is also wrong. Can I print it? However, dmem is still identical to the input vector.

Setting a = outdata, I get ret_result = -1 (from n_inst)

Added some display lines to main.v, I'm always in the same state:

# At t=                 431 clk=0 CS=1010001
parameter    ST_81 = 7'b1010001;

Okay, it was a problem with my ram code. Now I get a result but it's wrong:

# At t=                 458 clk=1 main_result_o=11111111111111111111111111110111 A_address0=1000 A_ce0=0 A_q0=00000000000000000000000000000001, imem_ce0=0 imem_address0=101011 imem_q0=00000011111000000000000000001000, outData_ce0=0 outData_address0=1000 outData_q0=00000000000000000000000000001000, ret_result=-9 done=1 

Had to create my own ram in memory. Seems to be reading A and imem properly, but I never get a result (even after 500k clock cycles):

# At t=           939178000 clk=1 main_result_o=00000000000000000000000000000000 A_address0=1000 A_ce0=0 A_q0=00000000000000000000000000000001, imem_ce0=0 imem_address0=011100 imem_q0=00010001000000000000000000001011, outData_ce0=0 outData_address0=xxxx outData_q0=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, ret_result=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx done=0 

Nope. It's not a xilinx vs altera thing. dmem and reg refer to arrays in my mips.c code…

Damn. Same prob with both verilog and vhdl:

Error (10481): VHDL Use Clause error at main.vhd(504): design library "work" does not contain primary unit "N_n_main

I think this might be a problem with using Altera/Stratix as the target. Let's switch back to Xilinx. Nope, didn't work, switching back to Altera.

Can't find dmem.h reg.h N_n_main.h anywhere in the xPilot directory. So I'm just going to move to verilog for now.

Working on xPilot for meeting today. Can't get systemC to compile:

acanis@acanis-desktop:~/xpilot/xpilot-rel/samples/mips/auto_work/systemc$ g++ -I. -I/home/acanis/xpilot/xpilot-rel/tools/systemc-2.1.v1/src  -o sim_main main.cpp   -lsystemc -L/home/acanis/xpilot/xpilot-rel/tools/systemc-2.1.v1/lib

Andrew Canis 2009/11/25 11:07

So, the makefile isn't detecting changes to llvm/tools. What about lib? Same thing, I touch a file in lib, make from legup, nothing. Then make from llvm-objects and the file is recompiled. It's probably because those makefiles aren't included in autoconf/configure.ac: AC_CONFIG_MAKEFILE(tools/legup/Makefile) etc.

Trying to figure out LLVM build system. I modify ../llvm/tools/llc.cpp from within legup and rerun make, nothing happens. But from llvm-object I get:

..
make[1]: Entering directory `/home/acanis/legup/llvm-objects/tools/llvm-config'
llvm[1]: Regenerating LibDeps.txt.tmp
make[2]: Entering directory `/home/acanis/legup/llvm-objects/tools/llc'
llvm[2]: Compiling llc.cpp for Release build
llvm[2]: Linking Release executable llc (without symbols)
llvm[2]: ======= Finished Linking Release Executable llc (without symbols)

ImpulseC License is expired:

acanis@acanis-desktop:~/impulsec/FLEXnet/v11.5/i86_re3$ 13:12:29 (lmgrd) FLEXnet Licensing (v11.5.0.0 build 56285 i86_re3) started on acanis-desktop (linux) (11/24/2009)
13:12:29 (lmgrd) Copyright (c) 1988-2007 Macrovision Europe Ltd. and/or Macrovision Corporation. All Rights Reserved.
13:12:29 (lmgrd) US Patents 5,390,297 and 5,671,412.
13:12:29 (lmgrd) World Wide Web:  http://www.macrovision.com
13:12:29 (lmgrd) License file(s): /home/acanis/impulsec/codeveloper/Impulse/CoDeveloper3/bin/CoDeveloper.lic
13:12:29 (lmgrd) lmgrd tcp-port 27001
13:12:29 (lmgrd) Starting vendor daemons ... 
13:12:29 (lmgrd) Started impulsed (internet tcp_port 57454 pid 18294)
13:12:29 (impulsed) FLEXnet Licensing version v11.5.0.0 build 56285 i86_re3
13:12:29 (impulsed) EXPIRED: codeveloper
13:12:29 (impulsed) EXPIRED: appmonitor
13:12:29 (impulsed) EXPIRED: build_hdl
13:12:29 (impulsed) EXPIRED: stagemaster
13:12:29 (impulsed) EXPIRED: genvhdl
13:12:29 (impulsed) EXPIRED: genvlog
13:12:29 (impulsed) EXPIRED: smexplorer
13:12:29 (impulsed) EXPIRED: cyclesim
13:12:29 (impulsed) EXPIRED: covalidator
13:12:29 (impulsed) License server system started on acanis-desktop
13:12:29 (impulsed) No features to serve, exiting
13:12:29 (impulsed) EXITING DUE TO SIGNAL 36 Exit reason 4
13:12:29 (lmgrd) impulsed exited with status 36 (No features to serve)
13:12:29 (lmgrd) impulsed daemon found no features.  Please correct
13:12:29 (lmgrd) license file and re-start daemons.

view symbols in static/shared library:

nm -g Release/lib/libLLVMVerilog.a

xPilot synthesis is done with an LLVM optimization pass:

cd xpilot-rel/samples/samples/enc83/auto_work
opt -load ../../../../../xpilot-rel/bin/libxPilotHwSynPass.so -scalarrepl -mem2reg -raise -disaggr -loop-dep -array-flatten -instcombine -gcse -simplifycfg -xpilot -tcl xPilot.work.tcl -legalize tmp/enc.o.opt.presyn.bc

./configure --with-llvmsrc=$PWD/../llvm --with-llvmobj=/home/acanis/legup/llvm-objects/

I’m a huge fan of the git source control system for many reasons. http://whygitisbetterthanx.com/

Basing the legup repository on clang.

Andrew Canis 2009/11/23 13:41

xPilot chstone:

  • mips - commented printf to synthesize. Number of states: 98. Much higher than chstone paper.
  • dfadd - assertion failure: Assertion `use_begin() == use_end() && "Uses remain when a value is destroyed!"' failed.
  • dfdiv - same assertion
  • dfmul - same assertion
  • dfsin - same assertion
  • adpcm
BWA-11 ERROR: Pointer exception in:
	%bpl_addr.0 = getelementptr int* %bpl, uint %inc.0.sum		; <int*> [#uses=1]
Shell-4 ERROR: Unsynthesizable Pointer found.
  • gsm:
BWA-11 ERROR: Pointer exception in:
	%tmp.6 = getelementptr short* %s, uint %indvar		; <short*> [#uses=1]
Shell-4 ERROR: Unsynthesizable Pointer found.
  • jpeg
opt: ArrayFlatten.cpp:683: bool pass::ArrayFlattenPass::flattenMultiDimArrayGEP(llvm::GetElementPtrInst*, llvm::Value*): Assertion `IsZeroConst(*oi)' failed.
  • motion
opt: /curr/irene/project/xpilot_repos/pkg/llvm-1.7/../../pkg/llvm-1.7/llvm/lib/VMCore/Value.cpp:157: void llvm::Value::replaceAllUsesWith(llvm::Value*): Assertion `New->getType() == getType() && "replaceAllUses of value with new value of different type!"' failed.
 
  • aes
BWA-11 ERROR: Pointer exception in:
	%tmp.17 = getelementptr int* %key, int %tmp.16		; <int*> [#uses=1]
Shell-4 ERROR: Unsynthesizable Pointer found.
  • blowfish
BWA-11 ERROR: Pointer exception in:
	%p2.0 = getelementptr uint* %s2, uint %indvar		; <uint*> [#uses=1]
Shell-4 ERROR: Unsynthesizable Pointer found.
  • sha
BWA-11 ERROR: Pointer exception in:
	%buffer_addr.0 = getelementptr ubyte* %buffer, uint %tmp.		; <ubyte*> [#uses=2]
Shell-4 ERROR: Unsynthesizable Pointer found.

Setting up git repo. I need two submodules: llvm, and llvm-gcc.

git submodule add git://github.com/earl/llvm-mirror.git llvm
git submodule add git://repo.or.cz/llvm-gcc-4.2.git llvm-gcc-4.2

Okay, I've decided not to put these in the repo. I'm going to setup the project like clang. Put it in llvm/tools.

I need a simple test suite that will run a testbench for the SRA circuit and confirm the result.

After that I can setup buildbot and nightly snapshots and then cleanup the website color scheme and I'm done.

After that:

  • Put all the website code in git
  • Open source license with U of T
  • ImpulseC - just requested an evaluation license
  • Forte - Doesn't have an evaluation request on the website
  • Start going through CHStone with xPilot

Andrew Canis 2009/11/22 09:27

Setup bugzilla: legup.org/bugs

Setting up the git repository: http://vafer.org/blog/20080115011320

  • Restart git-daemon: sudo sv restart git-daemon
  • Changed /etc/sv/git-daemon to: exec chpst -ugit git daemon --port=7326 --verbose --base-path=/home/git/repositories/
  • Logfile: /var/log/git-daemon

Setup auto updates: “System” menu, then “Administration”, then “Software Sources”. Open up the “Updates” tab and select “Automatic updates”, also select “Install security updates without confirmation”.

Added subdomain http://lists.legup.org. Followed: http://blog.agdunn.net/?p=162 to setup the apache2 virtual site for the list subdomain in /etc/apache2/sites-enabled/020-mailman

Mail server setup. Created a subdomain for my MX record: smtp.legup.org, pointed at my fixed ip. Following:

To change mailman password: sudo mmsitepass -c <pass>

Setting up a webpage at legup.org. I need a website, git repo, mailing list, bugzilla, buildbot, and blog for now.

Andrew Canis 2009/11/21 07:33

Looking into gcc now. The GIMPLE traversal is much messier!

A simple LLVM backend is done. Very impressed by the C++ api.

Very useful site: http://llvm.org/doxygen/

Needed to modify llc.cpp to change the file extension to .v for -march=v (verilog)

What does this mean while making?

llvm-config: unknown component name: veriloginfo
llvm-config: unknown component name: veriloginfo

Okay, it was a problem with my backend Makefiles.

To see actual commands while running llvm make:

make TOOL_VERBOSE=1

Figured out how the LLVMInitializeVerilogTargetInfo() gets called. Look in Target/TargetSelect.h:

#define LLVM_TARGET(TargetName) void LLVMInitialize##TargetName##TargetInfo();

I was accidentally calling the backend 'Verilog' in some places and 'VerilogBackend' in other places. Also take a look at: llvm-objects/include/llvm/Config/Targets.def

To compile only llc, use utils/makellvm. Actually utils has a lot of really useful scripts, for instance 'llvmgrep'.

acanis@acanis-desktop:~/work/llvm/llvm-svn/lib/Target/Verilog$ /home/acanis/work/llvm/llvm-svn/utils/makellvm -obj ~/work/llvm/llvm-objects/ llc

Added a new backend: lib/Target/Verilog. Had to modify autoconf/configure.ac to add Verilog to TARGETS_TO_BUILD and I changed the required autoconf and libtool version in AutoRegen.sh. To regenerate the configure file: cd autoconf; ./AutoRegen.sh

Nice, my patch was rolled into LLVM! Development is really active. I posted the bug, within 30 min it was marked as a duplicate. Within an hour and a half I had a patch. It was fixed along with another bug and checked in that same day!

Andrew Canis 2009/11/17 23:54

Looking at just the output files, gcc (gimple) is much easier to read than the llvm output. gcc even has a .cfg file that breaks up the basic blocks into a control flow graph.

gcc -fdump-tree-all outputs a .vcg file (Visualization of Compiler Graphs): http://rw4.cs.uni-sb.de/~sander/html/gsvcg1.html

  • xvcg -psoutput graph.ps sra.c.006t.vcg

Wow! Just wasted a lot of time. C operator precedence killed me:

  • x - x>>3 != x - (x>>3)
  • the expression on the left parses as (x - x)>>3, so it evaluates to 0! No wonder the llvm optimizations looked so weird.
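A quick check of the precedence trap. Python is used here because its '-' also binds tighter than '>>', the same relative precedence as in C; the function names are just for illustration:

```python
def unparenthesized(x):
    """Parsed as (x - x) >> 3 because '-' binds tighter than '>>'."""
    return x - x >> 3


def intended(x):
    """What the SRA code meant: subtract one eighth of x."""
    return x - (x >> 3)


print(unparenthesized(64), intended(64))
```

The first form is 0 for every input, which explains why the optimizer folded the whole expression away.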

I just realized a flaw in the SRA algo from the book:

  • 0.875x != x - x >> 3. This is an approximation.
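How far off the shift version is can be measured directly. In this sketch the shift is parenthesized as x - (x >> 3), which is the intended reading; the shift drops the low three bits of x, so the two sides only agree when x is a multiple of 8:

```python
def sra_term(x):
    """Shift-and-subtract stand-in for 0.875 * x on non-negative integers."""
    return x - (x >> 3)


def error(x):
    """How far the integer version sits above the exact product."""
    return sra_term(x) - 0.875 * x


worst = max(error(x) for x in range(256))
print(worst)
```

Writing x = 8q + r, the error works out to r/8, so it stays below 1 for any non-negative x.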

Alright starting the HLS on llvm. The goal for tonight:

  • output verilog for the simple square root approx (SRA) from Gajski's book
  • use ASAP scheduling. Don't optimize for area or delay. No pipelining. No functions, just a single BB.
  • if I finish this, try to do the same thing as a gcc plugin.
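A sketch of what ASAP scheduling means for a single basic block: each operation goes into the earliest state after all of its inputs. The data-flow graph below is my own guess at how the SRA could be broken into operations (the log doesn't spell out the formula or the graph), and every operation is assumed to take one state:

```python
# Hypothetical data-flow graph: operation -> operations it depends on.
# The node names are illustrative, not taken from the book or the log.
DEPS = {
    "abs_a": [],
    "abs_b": [],
    "max_xy": ["abs_a", "abs_b"],
    "min_xy": ["abs_a", "abs_b"],
    "x_shr3": ["max_xy"],
    "y_shr1": ["min_xy"],
    "sub": ["max_xy", "x_shr3"],
    "add": ["sub", "y_shr1"],
    "result": ["add", "max_xy"],
}


def asap(deps):
    """Give each operation the earliest state in which all inputs are ready.

    Every operation is assumed to take exactly one state, with no chaining.
    """
    state = {}

    def visit(op):
        if op not in state:
            state[op] = 1 + max((visit(d) for d in deps[op]), default=0)
        return state[op]

    for op in deps:
        visit(op)
    return state


schedule = asap(DEPS)
for op, s in sorted(schedule.items(), key=lambda item: item[1]):
    print(f"state {s}: {op}")
```

Each state number maps directly to one FSM state in the generated Verilog, which is all that is needed when area and delay are not being optimized.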

Andrew Canis 2009/11/17 01:13

Gajski's University of California, Irvine tool (SCE) seems to use SpecC as its input specification. No download, but you can contact them for a cdrom.

Playing around with the de2. To get the demonstrations: http://www.terasic.com/downloads/cd-rom/de2/DE2_System_v1.6.zip

CBackend.cpp has grown by over 2000 lines in 5 years (since llvm-1.3) to a total of 3696 lines. There is an incredible amount of complexity in there, some of which I think is attributed to making gcc correctly compile the resulting C file. I saw sse2 instructions in there too.

Okay, so after diffing the changes made by trident to llvm-1.3, they are trivial. vhdl.cpp is almost identical to Writer.cpp (the old CBackend code generator), with a few minor changes to help make the C output easier to parse (e.g. renaming type “unsigned long long” to “ulong”). The trident tool 'llv' is identical to 'llc' but hardcodes the optimization passes that are run (including some new trident passes), and has a special case that renames the output file extension to .llv if you specify -march=v (vhdl). So in terms of llvm modifications, there were basically none. They just had to implement an llvm parser on the sea cucumber java side. Note: the CBackend.cpp file has changed substantially since 1.3. It would be very difficult to update trident's vhdl.cpp.

So I think the best approach is to first get the trident llvm compiling under the latest llvm version. The directory trident/compiler/llvm-1.3 is setup as an LLVM project:

Useful article on basic blocks: http://gcc.gnu.org/onlinedocs/gccint/Basic-Blocks.html

Interesting, there seems to be a guy, Elvis Dowson, who tried to get trident working with llvm v1.5.

He quotes price ranges of $145000 to $170000 for a commercial HLS compiler license.

llvm 1.5 had a tool called llv. What did that correspond to? Okay no, trident implemented a new tool called llv ('v' probably stands for vhdl):

acanis@acanis-desktop:~/trident/compiler/llvm-1.3$ find . -name \*.cpp
./lib/float_passes/FloatLoopUnroll.cpp
./lib/float_passes/LowerPHI.cpp
./lib/float_passes/RenameDuplicateVars.cpp
./lib/vhdl/vhdl.cpp
./tools/llv/main.cpp

lib/vhdl/vhdl.cpp is the backend for generating vhdl. This could be useful.

Okay, so llvm 1.5 is hopelessly out of date. It can't even compile stdlib.h!!

acanis@acanis-desktop:~/trident/llvm-obj$ ../cfrontend/x86/llvm-gcc/bin/llvm-c++ test.cpp 
test.cpp:2:20: while reading precompiled header: No such file or directory
In file included from test.cpp:2:
stdlib.h:278: error: expected constructor, destructor, or type conversion
stdlib.h:278: error: expected `,' or `;'
stdlib.h:283: error: expected constructor, destructor, or type conversion
stdlib.h:283: error: expected `,' or `;'
stdlib.h:288: error: expected constructor, destructor, or type conversion
stdlib.h:288: error: expected `,' or `;'
stdlib.h:297: error: expected constructor, destructor, or type conversion
stdlib.h:297: error: expected `,' or `;'

Damn. Still doesn't compile. There is a linker error. Probably because g++-3.4 isn't supported on this ubuntu version, my binutils is probably out of date.

Trident doesn't work with the latest llvm. Had to download the older 1.5 versions (and compile with g++ 3.4):

tar xvzf llvm-1.5.tar.gz
tar xvzf cfrontend-1.5.i686-redhat-linux-gnu.tar.gz
cd cfrontend/x86/
./fixheaders
cd ../..
export PATH=$PWD/cfrontend/x86/llvm-gcc/bin/:$PATH
mkdir llvm-obj
cd llvm-obj
../llvm/configure CXX=g++-3.4
make

For some reason header files were missing stdlib.h and string.h: Run this .sh script in llvm/include to add them to all the header files:

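#!/bin/sh
# Prepend the two missing includes to every file up to two levels deep.
# "." is given explicitly, and the scratch file is excluded from the search.
find . -maxdepth 2 -type f ! -name .tmp | while IFS= read -r vo
do
# Write the includes first, then append the original contents.
echo "#include<stdlib.h>
#include<string.h>
" > .tmp
cat "$vo" >> .tmp
mv .tmp "$vo"
done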

Tons of issues compiling. This is taking too long. Trying to compile with g++-3.4

Compiling trident:

cd trident/compiler
export CLASSPATH=/usr/share/java/antlr.jar:/usr/share/java/gnu-getopt.jar
export LLVMGCCDIR=~/work/llvm/llvm-gcc-4.2-install
ant

Figured out how to get “llvm-” prefix on llvm-gcc: configure option --program-prefix=llvm-

Just found a very interesting open source HLS tool called Trident: http://trident.sourceforge.net Created by the Los Alamos National Laboratory. The last commit was Nov 2006. It's written in Java but uses LLVM as the front-end. Based on Sea Cucumber: a synthesizing compiler mapping Java byte-code to FPGAs. Very interesting. This looks like the first GPL tool I've found. Too bad it's in Java. There are two papers:

The most interesting thing about Trident is it supports floating point algorithms.

Looking at lib/Target/CBackend/. We need to subclass TargetMachine and implement: WantsWholeFile() and addPassesToEmitWholeFile()

Due to the bug listed below, I've temporarily reverted to llvm revision 88706:

$ cd llvm-svn
$ svn update -r 88706

Created legup.sourceforge.net just in case.

Andrew Canis 2009/11/15 19:08

From the LLVM documentation on functions:

A function definition contains a list of basic blocks, forming the CFG (Control Flow Graph) for the function. Each basic block may optionally start with a label (giving the basic block a symbol table entry), contains a list of instructions, and ends with a terminator instruction (such as a branch or function return).
The first basic block in a function is special in two ways: it is immediately executed on entrance to the function, and it is not allowed to have predecessor basic blocks (i.e. there can not be any branches to the entry block of a function). Because the block can have no predecessors, it also cannot have any PHI nodes.

So the first question: should HLS be implemented as a backend or as an optimization pass? Do we have access to alias analysis etc. in the backend?

Damn, updated llvm from svn and now llvm-gcc won't compile. Filed a bug: http://llvm.org/bugs/show_bug.cgi?id=5497 At least I got to learn how to file a bug (http://llvm.org/docs/HowToSubmitABug.html) and use bugpoint. git bisected the error down to a commit a few days ago.

Dominators: in control flow graphs, a node d dominates a node n if every path from the start node to n must go through d.
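The definition turns into a simple fixed-point computation: a node's dominators are itself plus whatever dominates all of its predecessors. A minimal Python sketch of that iterative scheme, on a made-up diamond-shaped CFG:

```python
def dominators(cfg, entry):
    """Iteratively compute the dominator set of every node.

    cfg maps each node to its list of successors; every node is assumed
    to be reachable from entry.
    """
    nodes = set(cfg)
    preds = {n: set() for n in nodes}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].add(n)

    dom = {n: set(nodes) for n in nodes}   # start from "everything dominates"
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            # n is dominated by itself plus whatever dominates all its preds
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


# Hypothetical diamond-shaped CFG: entry -> (then | else) -> exit
cfg = {"entry": ["then", "else"], "then": ["exit"], "else": ["exit"], "exit": []}
print(sorted(dominators(cfg, "entry")["exit"]))
```

Neither branch dominates the exit block, because each can be bypassed through the other.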

svn updated my llvm and llvm-gcc4.2 and recompiled. Looking over LLVM documentation:

  • llvm/lib/Target/CBackend which implements a LLVM-to-C converter.
    • llc is the LLVM backend compiler, which translates LLVM bitcode to a native code assembly file or to C code (with the -march=c option).
  • See llvm/projects/sample for an example of how to set up your own project.
  • llvm/utils/vim: vim syntax files

LLVM's whole contribution is to be a “cleaner” IR. They use the gcc front-end to parse C/C++, convert GIMPLE to LLVM IR, then output assembly and let gcc compile the assembly. So you can't even call LLVM a compiler. It's simply an IR, or compiler infrastructure.

Done reading the book chapter. Great overview of HLS. I'm going to implement a simple time constrained scheduler for the square-root approximation in the book. The question still remains, gcc or llvm? Just implement it twice, first on llvm (probably will be easier) then on gcc. I'd like this to be in C++. Work from svn versions of both gcc and llvm.

Andrew Canis 2009/11/15 02:17

Reading a book chapter on high level synthesis from Gajski et al., “Embedded System Design: Modeling, Synthesis and Verification”. Springer. August 24, 2009

The chapter looks at various examples of transforming C code to hardware.

  • C is converted into a control data flow graph (CDFG), which is a control flow graph connecting basic blocks.
  • CDFG is then converted to a finite state machine with data (FSMD) graph, which shows a FSM with operations conducted in each state giving a cycle accurate view of the algorithm.
  • An architecture must be allocated for the FSMD.
  • Three tasks to simplify the architecture: register/memory sharing, functional unit sharing, and bus sharing.
    • Each step consists of grouping variables, operations, or connections using a compatibility graph.
    • Each node in the graph is a var/op/conn where two nodes can be connected by an incompatibility edge or priority edge.
    • An incompatibility edge means the nodes cannot be combined, the priority edge gives a weight on the benefit of combining two nodes.
    • The nodes in the compatibility graph are then combined using a graph-partitioning algorithm until only supernodes connected with incompatibility edges remain (ie. no more combinations can be made).
    • Merging registers into register files.
  • Chaining fast functional units to improve performance, and using multi-cycle units to reduce cost.
  • Pipelining functional units, datapath, and control to improve performance
  • Scheduling: converting a CDFG → FSMD
    • As soon as possible (ASAP) and as late as possible (ALAP) to determine critical operations in the datapath
    • Resource constrained: specify functional/memory available
    • Time constrained: specify clock cycle
  • External Bus Interfaces: Such as Advanced Microcontroller Bus Architecture
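The compatibility-graph merging from the sharing steps above can be sketched as a greedy loop: keep merging the pair of groups with the highest priority weight, as long as no incompatibility edge runs between them. This is a simplification of the graph partitioning in the chapter, and the variable names and weights are invented:

```python
def share_registers(variables, incompatible, priority):
    """Greedily merge variables into shared registers.

    incompatible: set of frozenset pairs that must not share a register.
    priority: dict mapping frozenset pairs to the benefit of merging them.
    Returns a list of sets, one per register.
    """
    groups = [{v} for v in variables]

    def conflict(g1, g2):
        return any(frozenset((a, b)) in incompatible for a in g1 for b in g2)

    def gain(g1, g2):
        return sum(priority.get(frozenset((a, b)), 0) for a in g1 for b in g2)

    while True:
        candidates = [
            (gain(g1, g2), i, j)
            for i, g1 in enumerate(groups)
            for j, g2 in enumerate(groups)
            if i < j and not conflict(g1, g2)
        ]
        if not candidates:
            return groups          # only incompatible supernodes remain
        _, i, j = max(candidates)
        groups[i] |= groups[j]
        del groups[j]


# Hypothetical example: a and b are live at the same time, c is not.
regs = share_registers(
    ["a", "b", "c"],
    incompatible={frozenset(("a", "b"))},
    priority={frozenset(("a", "c")): 2, frozenset(("b", "c")): 1},
)
print(regs)
```

The same loop would apply to functional units or buses by changing what the nodes and edges represent.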

Andrew Canis 2009/11/14 05:46

I created a wordpress blog on my local machine: http://128.100.241.23/wordpress/

Okay, I'm going to hold off on the blog for now. I've created a google group for legup: http://groups.google.com/group/legup

I'm thinking it's a good idea to replace this wiki log with a blog. Why? Keep track of dates better? It'll be good for people to keep track of my progress (Desh, Jason, Steve, Mark). It'll help me write a bit better and organize my ideas. I can still keep a wiki like this if I want, but it's better to organize my writing a little more before posting it.

I've decided to use Google Code. It lacks git support but I can just use git-svn or self-host the git repo on eecg. It has support for blogs, and google groups (mailing lists). Damn, name is already taken. Ok, I'll use github. Setup: http://github.com/acanis/legup It even has an issue tracker!

Just found an interesting project page: SableCC.org. They use trac to keep a wiki. They also have a mailing list, bug tracker, and git repo. I need something like this.

Read about SIMPLE IR that GIMPLE was based on: L. Hendren, C. Donawa, M. Emami, G. Gao, Justiani, and B. Sridharan. Designing the McCAT compiler based on a family of structured intermediate representations. In Proceedings of the 5th International Workshop on Languages and Compilers for Parallel Computing, pages 406-420. Lecture Notes in Computer Science, no. 457, Springer-Verlag, August 1992.

  • There are 15 supported operations. It is quite easy to perform alias analysis on SIMPLE.

Andrew Canis 2009/11/13 01:15

Yes, this was the problem. pass name “ssa” refers to pass_build_ssa which is an early pass. What I want is the plugin to run after “loopdone”

Trying to see how the tree changes with auto-vectorization. Currently I don't see a change, I might be too early in the optimization passes.

../install/bin/gcc -fplugin=plugin.so -O2 vector.c -fdump-tree-all-details -ftree-vectorize -msse2

Where vector.c is:

#include<stdio.h>
int main() {
        int a[256], b[256], c[256];
        int i;

        for (i=0; i<256; i++){
                a[i] = b[i] + c[i];
        }

        for (i=0; i<256; i++){
                printf("%d %d %d\n", a[i], b[i], c[i]);
        }
}

To run high level synthesis plug-in:

cd ~/work/gcc/plugin
make
../install/bin/gcc -fplugin=plugin.so -O2 test.c -fdump-tree-all-details
cat *hls

Auto-vectorization is not trivial to implement. wc -l tree-vec* in gcc shows 20K lines of code.

So using the gcc plugin architecture is optional. But there are some big advantages listed here: http://gcc.gnu.org/wiki/GCC_Plugins The big one is you don't have to patch/bootstrap every new version of gcc, you can just grab the latest gcc binary and run your plugin without modification.

Good graph of different trees (GENERIC, GIMPLE, RTL): http://gcc.gnu.org/projects/tree-ssa/#ssa

Aside: Wow, I have been looking for this for years. Instrumenting cycle counts for a function: http://libplugin.sourceforge.net/tutorials/simple-gcc.html Okay this is similar to pfmon. Unfortunately my kernel doesn't support it.

Damn. Just realized that gcc has changed its optimization pass code to use “plug-ins”. See: http://ehren.wordpress.com/2009/11/04/creating-a-gcc-optimization-plugin/ I'm going to need the latest svn version of gcc. Recompiled svn gcc (had to autoconf configure.ac to get configure to work)

../gcc-svn/configure --prefix=$PWD/../install --enable-languages=c,c++ --enable-checking

Note --enable-checking. This was recommended when working with gcc trees.

p in passes.c is a linked list. Start it by setting p to an address, end it by setting p to NULL. These get executed in order by execute_pass_list(). Actually look right above init_optimization_passes() in passes.c for a good description of the flow.
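A toy Python model of that structure, to keep the flow straight: passes chained through a next pointer and run in order until the pointer is NULL. This is not gcc's code; the class, the helper names and the pass names are all illustrative:

```python
class Pass:
    """Minimal stand-in for a pass descriptor with a `next` pointer."""

    def __init__(self, name, execute):
        self.name = name
        self.execute = execute
        self.next = None


def chain(passes):
    """Link the passes in order and return the head of the list."""
    for current, following in zip(passes, passes[1:]):
        current.next = following
    passes[-1].next = None          # the list ends at None (NULL in C)
    return passes[0]


def execute_pass_list(head, log):
    """Walk the list from head, running each pass in order."""
    p = head
    while p is not None:
        p.execute(log)
        p = p.next


# Hypothetical pass names, used only to show the ordering.
log = []
names = ("lower", "ssa", "hls")
head = chain([Pass(n, lambda out, n=n: out.append(n)) for n in names])
execute_pass_list(head, log)
print(log)
```

Where a new pass is linked into the chain decides which form of the tree it sees, which is why the position matters for a plugin.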

Interesting discussion about merging gcc and llvm. Including a GIMPLE to LLVM translator…

Okay, this makes more sense now, if you look in the llvm-gcc-4.2-svn you'll see ENABLE_LLVM define statements. There is a c++ file called llvm-convert.cpp that takes GIMPLE AST and converts it to LLVM IR. The actual function is llvm_emit_code_for_current_function().

Unfortunately, high GIMPLE is converted to LLVM before any of the tree-ssa optimizations are run (which require low GIMPLE). So you lose gcc auto-vectorization.

Writing an optimization pass: http://gcc.gnu.org/wiki/WritingANewPass

tree.def and cp/operators.def describe the possible tree nodes in GIMPLE.

Simple optimization pass: http://gcc.gnu.org/ml/gcc-patches/2005-05/msg01061.html

SIMPLE: The IR that GIMPLE was based on: http://www.springerlink.com/content/7h14h08051754749/

Andrew Canis 2009/11/12 00:38

Useful: http://gcc.gnu.org/readings.html

To print out GIMPLE c-like representation: gcc -fdump-tree-gimple. Handled by code in tree-dump.c

Used Simulink to generate a simple FIR filter and display the results in both the time and frequency domain (work/simulink). There is a problem where my high pass filter (with fs=2nhz) is filtering out a 1khz signal. Can't figure out why. To design the filter use: >> fdatool. This tool can also generate Verilog with testbenches.
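
One way to sanity-check a filter like this outside Simulink is to evaluate the magnitude response of the FIR taps directly at the frequency of interest. The sketch below is a minimal Python version under stated assumptions: the two-tap filter and the 8 kHz sample rate are made up for illustration and are not the design from this entry. The gain depends only on the ratio of the signal frequency to the sample rate, so the same taps pass or block 1 kHz depending on fs.

```python
import cmath
import math


def fir_gain(taps, freq_hz, fs_hz):
    """Magnitude of the FIR frequency response at freq_hz for sample rate fs_hz."""
    omega = 2.0 * math.pi * freq_hz / fs_hz   # radians per sample
    # H(e^{j*omega}) = sum over n of b[n] * e^{-j*omega*n}
    h = sum(b * cmath.exp(-1j * omega * n) for n, b in enumerate(taps))
    return abs(h)


# Hypothetical two-tap high-pass (a scaled first difference), not the Simulink design.
taps = [0.5, -0.5]
for f in (0.0, 1000.0, 4000.0):
    print(f, round(fir_gain(taps, f, 8000.0), 3))
```

For these taps the gain is 0 at DC and 1 at fs/2. Plugging in the coefficients exported from fdatool and the real sample rate would show whether 1 kHz lands in the stop band of the designed filter.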

Downloaded Algorithmic C Datatypes from Mentor Graphics. Contains three templated datatypes: ac_int, ac_fixed, ac_complex.

LLVM auto-vectorization support is not finished. The last LLVM dev mailing list post from Andreas Bolka (the Google summer of code applicant that took on the project) was April 1, 2009.

To create a CDFG using GAUT run:

/home/acanis/GAUT_2_4/GautC/cdfgcompiler/bin/cdfgcompiler -S -c2dfg -O2 -I /home/acanis/GAUT_2_4/GautC/lib -I. test.c

Andrew Canis 2009/11/10 13:09

Met with Karen Tam. She was confused about the bounding box intersection calculation used to find the location of each pseudo pin. Also explained how all forces are calculated using the new positions expected in the next iteration, not the current positions. Overall a productive meeting. The code is installed and running on her account.

Installed Ubuntu 9.04 32-bit version. I was having too many problems with 64-bit linux.

Compiled gcc-4.3:

$ sudo apt-get install gcc-4.3-source
$ cd /usr/src/gcc-4.3/
$ tar xvjf gcc-4.3.3.tar.bz2
$ mkdir obj
$ mkdir install
$ cd obj
$ ../gcc-4.3.3/configure --prefix=$PWD/../install --enable-languages=c,c++
$ make -j4

Emailed Philippe about GAUT, apparently he's just cleaning up the source and will release it.

Read this:

Andrew Canis 2009/11/06 02:15

Found a free book on DSPs:

Andrew Canis 2009/11/03 01:49

Installed VirtualBox, it's really slick, you can mount the host O/S home directory. I think it's better than VMware.

Turns out I can just recompile the libtcl8.4.so and libelf.so.1 files by using apt-get source, adding -m32 to the configure script, and adding soft links in xpilot/tools/bin. So xPilot is now working on my 64-bit machine.

Andrew Canis 2009/10/30 16:56

SPARK looks very interesting. Currently reading Sumit Gupta's PhD thesis from UC Irvine.

xPilot binaries don't work for 64-bit. Missing libelf.so.1. Installing VirtualBox to run a 32-bit version of Ubuntu:

Submitted a request for GAUT, another HLS tool

Andrew Canis 2009/10/28 16:07

Something is wrong with my DE2 Nios: the JTAG debug module isn't appearing.

> jtagconfig
1) USB-Blaster [USB 1-1.6]
  020B40DD   

Okay, working. jtag was messed up. Followed the directions on:

Compile LLVM (ELLCC version with Nios2 target support):

cd ~/work/ellcc/llvm-build
make -j4 ENABLE_OPTIMIZED=1
make -j4 ENABLE_OPTIMIZED=1 install

Compile LLVM gcc front-end:

export LLVMOBJDIR=/home/acanis/work/ellcc/llvm-install
cd ~/work/llvm/
mkdir obj
mkdir install
cd obj
../llvm-gcc4.2-x.y.source/configure --prefix=`pwd`/../install --program-prefix=llvm- \
  --enable-llvm=$LLVMOBJDIR --enable-languages=c,c++$EXTRALANGS $TARGETOPTIONS
make -j4 $BUILDOPTIONS
make -j4 install

Not compiling. Downloaded latest svn LLVM and LLVM-gcc and following the LLVM build instructions in:

Andrew Canis 2009/10/14 01:49

Played around with Cetus. It's really cool: it can automatically parallelize loops by detecting data dependencies and inserting OpenMP pragmas.
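
To make the idea concrete, here is a toy dependence check in Python. It is not Cetus's actual analysis: it assumes a single loop of the form a[i + write_offset] = f(a[i + read_offset]) with constant offsets, and the function names are hypothetical. Two different iterations touch the same element only when the offset difference is non-zero and smaller than the trip count; otherwise the loop can safely get an OpenMP pragma.

```python
def carried_distance(write_offset, read_offset, trip_count):
    """Loop-carried dependence distance for the loop

        for i in range(trip_count): a[i + write_offset] = f(a[i + read_offset])

    or None when no two different iterations touch the same element.
    """
    distance = write_offset - read_offset
    if distance == 0 or abs(distance) >= trip_count:
        return None
    return distance


def annotate(loop_source, write_offset, read_offset, trip_count):
    """Prefix the loop with an OpenMP pragma only when iterations are independent."""
    if carried_distance(write_offset, read_offset, trip_count) is None:
        return "#pragma omp parallel for\n" + loop_source
    return loop_source
```

With these assumptions, a[i] = a[i] + 1 gets the pragma, while a[i + 1] = a[i] + 1 is left alone because each iteration reads the value written by the previous one.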

Andrew Canis 2009/10/12 22:31

Added VPR to repo. Done reading the first warp processor journal paper. I'm going to download the NetBench, MediaBench, EEMBC, and Powerstone benchmarks used in the paper. Also I'll write up a summary of my thoughts on the paper.

Andrew Canis 2009/09/30 10:58

Added milestones page. Reading over everything on Jason's HW Compile website. In particular looking into the warp processor.

Andrew Canis 2009/09/29 09:30

Fixed the jtag server:

$ sudo mount -t usbfs /dev/bus/usb/ /proc/bus/usb/
$ killall jtagd
$ sudo <quartus install path>/bin/jtagd
$ jtagconfig

To run jtagd I had to create a /bin/arch script containing:

#!/bin/bash
uname -m

DE2 is now working again. These issues were due to an Ubuntu update.

Andrew Canis 2009/09/28 11:35

Got Quartus license working. The license file is: /opt/license.dat. Uncommented lines in .bashrc setting LM_LICENSE_FILE=/opt/license.dat. Need to run lmgrd at startup as non-root:

  • Added a softlink: /etc/init.d/lmgrd → /opt/linux-qii71/lmgrd
  • Dug up a line from ~/linux/log to add this binary to my startup: sudo update-rc.d lmgrd defaults
  • Had to make one more softlink: /usr/local/flexlm/licenses/license.dat → /opt/license.dat

Andrew Canis 2009/09/28 10:48

Set up wiki.

Andrew Canis 2009/09/27 16:00

andrew_s_log.txt · Last modified: 2011/10/18 17:07 by acanis