These results were obtained with the L1 instruction cache and branch prediction enabled. The L2 cache, MMU, and L1 data cache were all disabled.
Benchmark | MIPS SW Cycles | MIPS SW Freq. (MHz) | MIPS SW Time (us) | ARM No Cache -O0 Cycles | ARM No Cache -O0 Freq. (MHz) | ARM No Cache -O0 Time (us) | ARM No Cache -O3 Cycles | ARM No Cache -O3 Freq. (MHz) | ARM No Cache -O3 Time (us) | ARM No Cache LLVM Cycles | ARM No Cache LLVM Freq. (MHz) | ARM No Cache LLVM Time (us)
---|---|---|---|---|---|---|---|---|---|---|---|---
adpcm | 193607 | 74.26 | 2607 | 7017469 | 800 | 8772 | 1893766 | 800 | 2367 | 2046954 | 800 | 2559
aes | 73777 | 74.26 | 993 | 1655516 | 800 | 2069 | 780486 | 800 | 976 | 411924 | 800 | 515
blowfish | 954563 | 74.26 | 12854 | 23927868 | 800 | 29910 | 13667297 | 800 | 17084 | 13515078 | 800 | 16894
dfadd | 16496 | 74.26 | 222 | 307817 | 800 | 385 | 91531 | 800 | 114 | 47603 | 800 | 60
dfdiv | 71507 | 74.26 | 963 | 263314 | 800 | 329 | 91800 | 800 | 115 | | 800 |
dfmul | 6796 | 74.26 | 92 | 127975 | 800 | 160 | 27054 | 800 | 34 | 20119 | 800 | 25
dfsin | 2993369 | 74.26 | 40309 | 10375208 | 800 | 12969 | 2792847 | 800 | 3491 | | 800 |
gsm | 39108 | 74.26 | 527 | 1068140 | 800 | 1335 | 254839 | 800 | 319 | 274252 | 800 | 343
jpeg | 29802639 | 74.26 | 401328 | 116196808 | 800 | 145246 | 37729565 | 800 | 47162 | 36918091 | 800 | 46148
mips | 43384 | 74.26 | 584 | 847272 | 800 | 1059 | 225037 | 800 | 281 | 216350 | 800 | 270
motion | 36753 | 74.26 | 495 | 1229726 | 800 | 1537 | 90479 | 800 | 113 | 309910 | 800 | 387
sha | 1209523 | 74.26 | 16288 | 41578084 | 800 | 51973 | 6796182 | 800 | 8495 | 7678854 | 800 | 9599
Geomean | 173331.98 | 74.26 | 2335.02 | 2715618.02 | 800 | 3394.52 | 712316.75 | 800 | 890.40 | 750768.45 | 800 | 938.46
Ratio | 1 | 1 | 1 | 15.67 | 10.77 | 1.45 | 4.11 | 10.77 | 0.38 | 4.33 | 10.77 | 0.40
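The Time (us) columns follow directly from the cycle counts and clock frequencies. The sketch below reproduces the adpcm row, assuming the Freq. column is in MHz (the table values are consistent with that reading). The helper name `time_us` is made up for illustration and is not part of LegUp.

```python
def time_us(cycles, freq_mhz):
    """Execution time in microseconds for a cycle count at a clock given in MHz."""
    return cycles / freq_mhz

# adpcm row from the table above
mips_time = time_us(193607, 74.26)     # MIPS soft processor
arm_o0_time = time_us(7017469, 800)    # ARM, -O0
arm_o3_time = time_us(1893766, 800)    # ARM, -O3

print(round(mips_time), round(arm_o0_time), round(arm_o3_time))
# prints: 2607 8772 2367
```

This shows why the ARM can need several times more cycles and still finish sooner: its clock is roughly 10.77 times faster than the MIPS soft processor's.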
The following results were obtained after enabling the MMU and L2 cache. So far, the best results come with all caches and branch prediction enabled. Several optimizations have been tested, but few produce noticeable improvements.
Configuration: L1 I & D caches, L2 cache, MMU, and branch prediction enabled.
Benchmark | MIPS SW Cycles | MIPS SW Freq. (MHz) | MIPS SW Time (us) | ARM Cycles | ARM Freq. (MHz) | ARM Time (us)
---|---|---|---|---|---|---
chstone/adpcm | 193607 | 74.26 | 2607 | 150968 | 800 | 189
chstone/aes | 73777 | 74.26 | 993 | 236462 | 800 | 296
chstone/blowfish | 954563 | 74.26 | 12854 | 1645745 | 800 | 2057
chstone/dfadd | 16496 | 74.26 | 222 | 34420 | 800 | 43
chstone/dfdiv | 71507 | 74.26 | 963 | 47052 | 800 | 59
chstone/dfmul | 6796 | 74.26 | 92 | 13193 | 800 | 16
chstone/dfsin | 2993369 | 74.26 | 40309 | 1418380 | 800 | 1773
chstone/gsm | 39108 | 74.26 | 527 | 144940 | 800 | 181
chstone/jpeg | 29802639 | 74.26 | 401328 | 10146576 | 800 | 12683
chstone/mips | 43384 | 74.26 | 584 | 55917 | 800 | 70
chstone/motion | 36753 | 74.26 | 495 | 5519 | 800 | 7
chstone/sha | 1209523 | 74.26 | 16288 | 2360086 | 800 | 2950
dhrystone | 28855 | 74.26 | 389 | 76182 | 800 | 95
mandelbrot | 45868987 | 74.26 | 617681 | 44271778 | 800 | 55340
Geomean | 227146.49 | 74.26 | 3060.05 | 259945.76 | 800 | 324.93
Ratio | 1 | 1 | 1 | 1.14 | 10.77 | 0.11
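The Ratio row divides each ARM geomean by the matching MIPS geomean, so a time ratio below 1 means the ARM is faster. A minimal sketch of that arithmetic, using the Geomean row above; the dictionary keys and the `ratio_row` helper are illustrative names, not anything defined by LegUp:

```python
def ratio_row(baseline, other):
    """Divide each metric of `other` by the same metric of `baseline`."""
    return {key: other[key] / baseline[key] for key in baseline}

# Geomean row from the table above
mips = {"cycles": 227146.49, "freq_mhz": 74.26, "time_us": 3060.05}
arm = {"cycles": 259945.76, "freq_mhz": 800.0, "time_us": 324.93}

ratios = ratio_row(mips, arm)
speedup = 1.0 / ratios["time_us"]  # how many times faster the ARM runs

print({key: round(value, 2) for key, value in ratios.items()})
print(round(speedup, 1))
```

With these numbers the ARM uses about 1.14 times as many cycles but runs at about 10.77 times the clock rate, which works out to roughly a 9.4x reduction in geomean execution time.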
More detailed results can be found here: arm_vs_mips.pdf
The following things were learned when performing benchmarking:
Details on setting up the caches on the ARM Cortex-A9 MPCore can be found here: Using ARM Caches
Currently LegUp uses LLVM to generate ARM assembly, then assembles and links it with gcc. If you compile a benchmark with the command:
make sw ARM_PROFILE=1
you get instrumented ARM code that counts the cycles for each function, but stops the cycle counters when printf() is called.
make run_on_board
will run the code on the ARM on the Cyclone V, and show the profiling data. Profiling events 0x60 and 0x61 show the cycles spent waiting for the ICache and DCache, respectively.
benchmark | total cycles | i cache miss stall cycles | d cache miss stall cycles | i+d cache miss stall cycles | % cache miss cycles |
---|---|---|---|---|---|
chstone/adpcm | 77,578 | 3,563 | 2,723 | 6,286 | 8% |
chstone/aes | 32,625 | 1,552 | 2,808 | 4,360 | 13% |
chstone/blowfish | 697,692 | 1,041 | 11,320 | 12,361 | 2% |
chstone/dfadd | 8,011 | 1,327 | 1,333 | 2,660 | 33% |
chstone/dfdiv | 11,493 | 2,269 | 663 | 2,932 | 26% |
chstone/dfmul | 3,499 | 701 | 782 | 1,483 | 42% |
chstone/dfsin | 402,183 | 4,041 | 2,043 | 6,084 | 2% |
chstone/gsm | 23,175 | 2,433 | 1,541 | 3,974 | 17% |
chstone/jpeg | 2,535,123 | 3,251 | 21,014 | 24,265 | 1% |
chstone/mips | 24,808 | 909 | 535 | 1,444 | 6% |
chstone/motion | 10,056 | 119 | 1,017 | 1,136 | 11% |
chstone/sha | 483,699 | 103 | 13,545 | 13,648 | 3% |
dhrystone | 15,002 | 1,142 | 196 | 1,338 | 9% |
mandelbrot | 22,106,098 | 222 | 222 | 0% | |
geomean | 81,159 | 1,007 | 1,945 | 3,260 | 4% |
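The last two columns of this table can be derived from the first three: the combined stall count is the sum of the ICache and DCache stall cycles, and the percentage is that sum over the total cycles. A small sketch, assuming the percentages are rounded to the nearest whole number; the function name is hypothetical:

```python
def cache_stall_percent(total_cycles, icache_stall, dcache_stall):
    """Return (combined stall cycles, percent of total cycles spent stalled)."""
    stalls = icache_stall + dcache_stall
    return stalls, 100.0 * stalls / total_cycles

# chstone/adpcm and chstone/dfmul rows from the table above
adpcm_stalls, adpcm_pct = cache_stall_percent(77578, 3563, 2723)
dfmul_stalls, dfmul_pct = cache_stall_percent(3499, 701, 782)

print(adpcm_stalls, round(adpcm_pct))  # prints: 6286 8
print(dfmul_stalls, round(dfmul_pct))  # prints: 1483 42
```

Short benchmarks such as dfmul show the highest percentages because a fixed number of cold misses is spread over very few total cycles.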
These results are generally 2x-10x better than the older results, shown above.
These results differ from the older ones for a couple of reasons:
LegUp wasn't producing NEON code for ARM by default. With NEON, we get slightly better performance on these benchmarks:
benchmark | total cycles | i cache miss stall cycles | d cache miss stall cycles | i+d cache miss stall cycles | % cache miss cycles |
---|---|---|---|---|---|
chstone/adpcm | 69,451 | 3,638 | 2,303 | 5,941 | 9% |
chstone/aes | 30,318 | 1,152 | 2,504 | 3,656 | 12% |
chstone/blowfish | 626,288 | 933 | 12,183 | 13,116 | 2% |
chstone/dfadd | 7,582 | 1,345 | 1,298 | 2,643 | 35% |
chstone/dfdiv | 12,346 | 2,303 | 901 | 3,204 | 26% |
chstone/dfmul | 3,341 | 977 | 527 | 1,504 | 45% |
chstone/dfsin | 404,446 | 3,976 | 2,061 | 6,037 | 1% |
chstone/gsm | 17,335 | 1,611 | 696 | 2,307 | 13% |
chstone/jpeg | 2,301,332 | 2,093 | 23,730 | 25,823 | 1% |
chstone/mips | 22,166 | 458 | 305 | 763 | 3% |
chstone/motion | 9,824 | 23 | 1,388 | 1,411 | 14% |
chstone/sha | 434,235 | 125 | 12,400 | 12,525 | 3% |
dhrystone | 9,939 | 437 | 274 | 711 | 7% |
mandelbrot | 22,907,027 | ||||
geomean | 73,840 | 832 | 1,802 | 3,515 | 7% |
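To see how much NEON helps on a given benchmark, divide the total cycles without NEON by the total cycles with NEON. The sketch below does this for a few rows copied from the two profiling tables. It is only an illustration on a subset of the data, and the variable names are invented for the example.

```python
# Total cycles for a subset of benchmarks, copied from the two tables
without_neon = {
    "chstone/adpcm": 77578,
    "chstone/dfdiv": 11493,
    "chstone/gsm": 23175,
    "dhrystone": 15002,
}
with_neon = {
    "chstone/adpcm": 69451,
    "chstone/dfdiv": 12346,
    "chstone/gsm": 17335,
    "dhrystone": 9939,
}

# Speedup > 1 means the NEON build needs fewer cycles
speedup = {name: without_neon[name] / with_neon[name] for name in without_neon}
slower = sorted(name for name, s in speedup.items() if s < 1.0)

for name, s in speedup.items():
    print(f"{name}: {s:.2f}x")
print("slower with NEON:", slower)
```

On this subset the gain ranges from about 1.12x (adpcm) to about 1.51x (dhrystone), while dfdiv is slightly slower with NEON, so the improvement is not uniform across benchmarks.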