Original from http://lists.legup.org/pipermail/legup-dev/2010-March/000011.html:
benchmark   cycles
adpcm       41934
aes         10571
blowfish    345519
dfadd       1898
dfdiv       1140
dfmul       680
dfsin       57372
gsm         10784
jpeg        8810967
mips        11374
motion      10652
sha         387458
My computer with llvm-gcc:
benchmark   cycles   change
adpcm       41934    (same)
aes         10571    (same)
blowfish    345649   (.04% increase)
dfadd       1898     (same)
dfdiv       1140     (same)
dfmul       680      (same)
dfsin       57372    (same)
gsm         10713    (.66% decrease due to llvm.memset align 2 being optimized)
mips        11374    (same)
motion      10655    (.03% increase)
sha         fail
My computer with clang:
benchmark   cycles   change
adpcm       fail
aes         10571    (same)
blowfish    345649   (.04% increase)
dfadd       1846     (2.82% decrease)
dfdiv       1139     (.09% decrease)
dfmul       676      (.90% decrease)
dfsin       57431    (.10% increase)
gsm         fail
mips        11407    (.29% increase)
motion      10667    (.11% increase)
sha         fail
clang seems to do better on the 64-bit double-precision floating-point benchmarks, but is otherwise slightly slower than llvm-gcc.
Ahmed's computer with ModelSim 6.6a:
benchmark   cycles   change
adpcm       fail
aes         9841     (7.42% decrease)
blowfish    344924   (.17% decrease)
dfadd       1878     (5.91% decrease)
dfdiv       1132     (.71% decrease)
dfmul       680      (same)
dfsin       56390    (1.74% decrease)
gsm         10784    (same)
mips        11374    (same)
motion      10605    (.44% decrease)
sha         375122   (3.29% decrease)
Definitely faster than ModelSim 6.6, but not by a great deal
Some big differences in performance (cycles on Andrew's machine):
benchmark   LLVM 2.6svn   LLVM 2.7   Diff
adpcm       41934         43284      +1350
aes         10571         10571      0
blowfish    345519        345587     +68
dfadd       1898          1804       -94
dfdiv       1140          1490       +350
dfmul       680           703        +23
dfsin       57372         66144      +8772
gsm         10784         10761      -23
jpeg        8810967       8810969    +2
mips        11374         11374      0
motion      10652         10651      -1
sha         387458        387456     -2
This should be slower since I'm no longer inlining, but I have modified scheduling to chain fast operations “freely” with no latency. The only fast operations now are bitshift by a constant and casting (zext, sext). This is using LLVM 2.7:
benchmark   chained   non-inlined   % diff (non-optimized / optimized - 1)
adpcm       46566     48121         3.3
aes         11432     12181         6.6
blowfish    323527    348131        7.6
dfadd       2028      2186          7.8
dfdiv       1222      1361          11.4
dfmul       771       817           6.0
dfsin       57442     64145         11.7
gsm         15085     17852         18.3
jpeg        n/a       n/a
mips        10233     11647         13.8
motion      10696     10733         0.3
sha         365166    377712        3.4
dhrystone   12818     12864         0.4
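The "% diff" column follows the formula in the table header, non-optimized / optimized - 1, expressed as a percentage. A minimal sketch of that calculation, assuming the non-inlined column is the "non-optimized" count and the chained column is the "optimized" count (the function name is hypothetical, not part of LegUp):

```python
def percent_diff(non_optimized_cycles, optimized_cycles):
    """Return (non_optimized / optimized - 1) as a percentage."""
    return (non_optimized_cycles / optimized_cycles - 1) * 100.0


# adpcm row from the table above: chained = 46566, non-inlined = 48121
adpcm = percent_diff(48121, 46566)
print(f"adpcm: {adpcm:.1f}%")  # prints "adpcm: 3.3%"
```

A positive value means the non-optimized run needed more cycles than the optimized one.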
Adding chaining after loads (var = memory_controller_out[7:0] is very fast, so chain after that, but can't chain loads and stores until binding is complete):
benchmark   chained   chained (2)   % diff (non-optimized / optimized - 1)
adpcm       46566     43797         6.3
aes         11432     10758         6.3
blowfish    323527    293219        10.3
dfadd       2028      1906          6.4
dfdiv       1222      1161          5.3
dfmul       771       729           5.8
dfsin       57442     55369         3.7
gsm         15085     14719         2.5
jpeg        n/a       n/a
mips        10233     9743          5.0
motion      10696     10680         0.1
sha         365166    348713        4.7
dhrystone   12818     11569         10.8
Cycle counts alone won't give an accurate comparison without also accounting for the difference in fmax, though that difference should be minimal.
dhrystone (does not have any shifts to chain):
non-chained: fmax = 167.00 MHz
chained: fmax = 177.49 MHz (critical path changes)
chained again: fmax = 172.47 MHz (2.9% slower than chained, same critical path)
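To compare the configurations on wall-clock time rather than cycles alone, the cycle count can be divided by fmax. The sketch below does this for dhrystone. The pairing of cycle counts with fmax values is an assumption: it takes "non-chained" to be the non-inlined column (12864), "chained" to be the chained column (12818), and "chained again" to be chained (2) (11569). The helper name is hypothetical.

```python
def runtime_us(cycles, fmax_mhz):
    """Execution time in microseconds: cycles / (fmax in MHz)."""
    return cycles / fmax_mhz


# (cycles, fmax in MHz); the pairing is an assumption, see the note above
dhrystone_runs = {
    "non-chained": (12864, 167.00),
    "chained": (12818, 177.49),
    "chained again": (11569, 172.47),
}

for name, (cycles, fmax_mhz) in dhrystone_runs.items():
    print(f"{name}: {runtime_us(cycles, fmax_mhz):.2f} us")
```

Under that pairing, a lower fmax can still give a shorter run time if the cycle count drops by a larger fraction.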