ARM Benchmark Results

Initial Results

These results were obtained with the L1 instruction cache and branch prediction enabled. The L2 cache, MMU, and L1 data cache were all disabled.

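| Benchmark | MIPS SW | | | ARM – No Cache -O0 | | | ARM – No Cache -O3 | | | ARM – No Cache LLVM | | |
| | Cycles | Freq. (MHz) | Time (us) | Cycles | Freq. (MHz) | Time (us) | Cycles | Freq. (MHz) | Time (us) | Cycles | Freq. (MHz) | Time (us) |
| adpcm | 193607 | 74.26 | 2607 | 7017469 | 800 | 8772 | 1893766 | 800 | 2367 | 2046954 | 800 | 2559 |
| aes | 73777 | 74.26 | 993 | 1655516 | 800 | 2069 | 780486 | 800 | 976 | 411924 | 800 | 515 |
| blowfish | 954563 | 74.26 | 12854 | 23927868 | 800 | 29910 | 13667297 | 800 | 17084 | 13515078 | 800 | 16894 |
| dfadd | 16496 | 74.26 | 222 | 307817 | 800 | 385 | 91531 | 800 | 114 | 47603 | 800 | 60 |
| dfdiv | 71507 | 74.26 | 963 | 263314 | 800 | 329 | 91800 | 800 | 115 | | 800 | |
| dfmul | 6796 | 74.26 | 92 | 127975 | 800 | 160 | 27054 | 800 | 34 | 20119 | 800 | 25 |
| dfsin | 2993369 | 74.26 | 40309 | 10375208 | 800 | 12969 | 2792847 | 800 | 3491 | | 800 | |
| gsm | 39108 | 74.26 | 527 | 1068140 | 800 | 1335 | 254839 | 800 | 319 | 274252 | 800 | 343 |
| jpeg | 29802639 | 74.26 | 401328 | 116196808 | 800 | 145246 | 37729565 | 800 | 47162 | 36918091 | 800 | 46148 |
| mips | 43384 | 74.26 | 584 | 847272 | 800 | 1059 | 225037 | 800 | 281 | 216350 | 800 | 270 |
| motion | 36753 | 74.26 | 495 | 1229726 | 800 | 1537 | 90479 | 800 | 113 | 309910 | 800 | 9599 |
| sha | 1209523 | 74.26 | 16288 | 41578084 | 800 | 51973 | 6796182 | 800 | 8495 | 7678854 | 800 | 9599 |
| Geomean | 173331.98 | 74.26 | 2335.02 | 2715618.02 | 800 | 3394.52 | 712316.75 | 800 | 890.40 | 750768.45 | 800 | 1293.67 |
| Ratio | 1 | 1 | 1 | 15.67 | 10.77 | 1.45 | 4.11 | 10.77 | 0.38 | 4.33 | 10.77 | 0.55 |

Note: the motion time under ARM – No Cache LLVM is listed as 9599 us, which matches the sha row rather than 309910 cycles at 800 MHz (about 387 us). The LLVM time Geomean and Ratio are consistent with the value as listed.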
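
The Time and Ratio entries follow from the cycle counts and clock frequencies. The sketch below reproduces them for the adpcm row, assuming Freq. is in MHz (so cycles divided by frequency gives microseconds) and that each Ratio entry is the ARM value divided by the MIPS SW value. The function names are illustrative and not part of LegUp.

```python
def time_us(cycles, freq_mhz):
    """Execution time in microseconds for a cycle count at a clock given in MHz."""
    return cycles / freq_mhz


def ratio(arm_value, mips_value):
    """ARM value relative to the MIPS SW baseline (below 1 means ARM is lower)."""
    return arm_value / mips_value


# adpcm row: MIPS SW at 74.26 MHz vs. ARM -O0 at 800 MHz
mips_t = time_us(193607, 74.26)
arm_t = time_us(7017469, 800)

print(f"MIPS SW: {mips_t:.0f} us, ARM -O0: {arm_t:.0f} us")
print(f"cycle ratio: {ratio(7017469, 193607):.2f}")
print(f"time ratio:  {ratio(arm_t, mips_t):.2f}")
```

Because the ARM clock is about 10.77 times faster (800 / 74.26), a benchmark needs fewer than 10.77 times as many ARM cycles to finish sooner than on the MIPS soft processor.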

Results with MMU and L2 Cache Enabled

The following results were obtained after enabling the MMU and L2 cache. So far the best results are with all caches and branch prediction enabled. Several optimizations have been tested, but few of them produce noticeable improvements.

| Benchmark | MIPS SW | | | ARM (L1 I & D Cache, L2 Cache, MMU, B. Predict) | | |
| | Cycles | Freq. (MHz) | Time (us) | Cycles | Freq. (MHz) | Time (us) |
| chstone/adpcm | 193607 | 74.26 | 2607 | 150968 | 800 | 189 |
| chstone/aes | 73777 | 74.26 | 993 | 236462 | 800 | 296 |
| chstone/blowfish | 954563 | 74.26 | 12854 | 1645745 | 800 | 2057 |
| chstone/dfadd | 16496 | 74.26 | 222 | 34420 | 800 | 43 |
| chstone/dfdiv | 71507 | 74.26 | 963 | 47052 | 800 | 59 |
| chstone/dfmul | 6796 | 74.26 | 92 | 13193 | 800 | 16 |
| chstone/dfsin | 2993369 | 74.26 | 40309 | 1418380 | 800 | 1773 |
| chstone/gsm | 39108 | 74.26 | 527 | 144940 | 800 | 181 |
| chstone/jpeg | 29802639 | 74.26 | 401328 | 10146576 | 800 | 12683 |
| chstone/mips | 43384 | 74.26 | 584 | 55917 | 800 | 70 |
| chstone/motion | 36753 | 74.26 | 495 | 5519 | 800 | 7 |
| chstone/sha | 1209523 | 74.26 | 16288 | 2360086 | 800 | 2950 |
| dhrystone | 28855 | 74.26 | 389 | 76182 | 800 | 95 |
| mandelbrot | 45868987 | 74.26 | 617681 | 44271778 | 800 | 55340 |
| Geomean | 227146.49 | 74.26 | 3060.05 | 259945.76 | 800 | 324.93 |
| Ratio | 1 | 1 | 1 | 1.14 | 10.77 | 0.11 |
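
The Geomean and Ratio rows can be recomputed from the per-benchmark cycle counts. The sketch below assumes the Geomean row is a plain geometric mean over all 14 benchmarks and that the Ratio row divides the ARM geomean by the MIPS SW geomean. The `geomean` helper is illustrative, not a LegUp function.

```python
import math


def geomean(values):
    """Geometric mean via logarithms, which avoids overflow on large products."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Cycle counts from the table above, in row order (adpcm ... mandelbrot)
mips_cycles = [193607, 73777, 954563, 16496, 71507, 6796, 2993369,
               39108, 29802639, 43384, 36753, 1209523, 28855, 45868987]
arm_cycles = [150968, 236462, 1645745, 34420, 47052, 13193, 1418380,
              144940, 10146576, 55917, 5519, 2360086, 76182, 44271778]

mips_gm = geomean(mips_cycles)
arm_gm = geomean(arm_cycles)

print(f"MIPS SW geomean cycles: {mips_gm:.2f}")
print(f"ARM geomean cycles:     {arm_gm:.2f}")
print(f"cycle ratio: {arm_gm / mips_gm:.2f}")
print(f"time ratio:  {(arm_gm / 800) / (mips_gm / 74.26):.2f}")
```

The printed values should land close to the Geomean and Ratio rows. The ARM needs slightly more cycles than the MIPS soft processor on average, but its 800 MHz clock makes it roughly nine times faster in wall-clock time.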

More detailed results can be found here: arm_vs_mips.pdf

Summary of Benchmark Results

Benchmarking showed the following:

  • Branch prediction is very important
  • L1 instruction cache is very important (especially for compute-limited benchmarks)
  • L1 data cache provides modest improvements
  • L1 data prefetch provides modest improvements
  • L2 cache is very important (especially for memory bandwidth-limited benchmarks)
  • MMU is very important because it allows caches to be used to their full potential
  • Normal memory should be marked as cacheable, inner and outer write-back in translation table entries
  • Memory should be marked as non-shareable in translation table entries
  • L2 cache controller read, write, and hold delays should be set to their minimums

Details on setting up the caches on the ARM Cortex-A9 MPCore can be found here: Using ARM Caches
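
As an illustration of the translation table attributes listed above, the sketch below builds a first-level section entry. It assumes the ARMv7-A short-descriptor format with 1 MB sections and TEX remap disabled, where TEX=0b000 (or 0b001 for write-allocate) with C=1 and B=1 selects Normal memory that is inner and outer write-back, and S=0 marks it non-shareable. The function name, the full-access AP value, and the domain number are hypothetical choices for this example. The page linked above remains the reference for the setup actually used.

```python
def section_descriptor(base_addr, tex=0b000, c=1, b=1, shareable=0,
                       ap=0b011, domain=0, xn=0):
    """Build an ARMv7-A short-descriptor first-level section entry (1 MB)."""
    assert base_addr % (1 << 20) == 0, "section base must be 1 MB aligned"
    ap_low = ap & 0b11         # AP[1:0] go to bits [11:10]
    ap_high = (ap >> 2) & 1    # AP[2] goes to bit 15
    return (base_addr
            | (shareable << 16)    # S bit: 0 = non-shareable
            | (ap_high << 15)
            | (tex << 12)          # TEX[2:0]
            | (ap_low << 10)
            | (domain << 5)
            | (xn << 4)            # execute-never
            | (c << 3)             # C bit
            | (b << 2)             # B bit
            | 0b10)                # descriptor type: section


# Write-back, no write-allocate, non-shareable section at 1 MB
print(hex(section_descriptor(0x00100000)))
# Same section with write-allocate (TEX=0b001)
print(hex(section_descriptor(0x00100000, tex=0b001)))
```

Setting `shareable=1` only flips bit 16, which makes it easy to compare the two settings on a benchmark.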

Updated Results - August 5, 2016

Currently LegUp uses LLVM to generate ARM assembly, then assembles and links it with gcc. If you compile a benchmark with the command:

 make sw ARM_PROFILE=1 

you get instrumented ARM code that counts the cycles for each function, but stops the cycle counters when printf() is called.

 make run_on_board 

will run the code on the ARM on the Cyclone V, and show the profiling data. Profiling events 0x60 and 0x61 show the cycles spent waiting for the ICache and DCache, respectively.

| Benchmark | Total cycles | I-cache miss stall cycles | D-cache miss stall cycles | I+D cache miss stall cycles | % cache miss cycles |
| chstone/adpcm | 77,578 | 3,563 | 2,723 | 6,286 | 8% |
| chstone/aes | 32,625 | 1,552 | 2,808 | 4,360 | 13% |
| chstone/blowfish | 697,692 | 1,041 | 11,320 | 12,361 | 2% |
| chstone/dfadd | 8,011 | 1,327 | 1,333 | 2,660 | 33% |
| chstone/dfdiv | 11,493 | 2,269 | 663 | 2,932 | 26% |
| chstone/dfmul | 3,499 | 701 | 782 | 1,483 | 42% |
| chstone/dfsin | 402,183 | 4,041 | 2,043 | 6,084 | 2% |
| chstone/gsm | 23,175 | 2,433 | 1,541 | 3,974 | 17% |
| chstone/jpeg | 2,535,123 | 3,251 | 21,014 | 24,265 | 1% |
| chstone/mips | 24,808 | 909 | 535 | 1,444 | 6% |
| chstone/motion | 10,056 | 119 | 1,017 | 1,136 | 11% |
| chstone/sha | 483,699 | 103 | 13,545 | 13,648 | 3% |
| dhrystone | 15,002 | 1,142 | 196 | 1,338 | 9% |
| mandelbrot | 22,106,098 | 222 | | 222 | 0% |
| Geomean | 81,159 | 1,007 | 1,945 | 3,260 | 4% |
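
The last two columns are derived from the two stall counters. The sketch below assumes the percentage is the combined I-cache and D-cache stall cycles divided by total cycles. The function name is illustrative.

```python
def cache_miss_percent(total_cycles, icache_stall, dcache_stall):
    """Return (combined stall cycles, stall cycles as a percentage of total)."""
    stall = icache_stall + dcache_stall
    return stall, 100.0 * stall / total_cycles


# (total cycles, I-cache stall cycles, D-cache stall cycles) from the table above
rows = {
    "chstone/adpcm": (77578, 3563, 2723),
    "chstone/dfmul": (3499, 701, 782),
    "chstone/jpeg": (2535123, 3251, 21014),
}

for name, (total, i_stall, d_stall) in rows.items():
    stall, pct = cache_miss_percent(total, i_stall, d_stall)
    print(f"{name}: {stall} stall cycles, {pct:.0f}% of total")
```

Short benchmarks such as dfmul spend a large share of their cycles on cache misses, while long-running ones such as jpeg amortize them.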

These cycle counts are generally 2x-10x lower than the older results shown above. The exception is motion, which went from 5,519 to 10,056 cycles.

These results differ for several reasons:

  • We now run slightly different LLVM passes on the IR before it is compiled to software, which enables different optimizations
  • These results do not include printf cycles
  • The ARM startup code has been modified slightly, so cache misses now incur fewer idle cycles
  • The previous tests may have been run with NO_OPT, NO_INLINE, or similar flags
  • Allowing clang to produce vectorized code (removing -fno-vectorize -fno-slp-vectorize from CLANG_FLAG) and/or using NEON can also have an impact: some benchmarks get slower, and some faster

Updated Results (with NEON) - August 11, 2016

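LegUp wasn't producing NEON code for ARM by default. With NEON enabled, most of these benchmarks run slightly faster: the geomean of total cycles drops from 81,159 to 73,840 (about 9%), although dfdiv, dfsin, and mandelbrot get slightly slower.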

| Benchmark | Total cycles | I-cache miss stall cycles | D-cache miss stall cycles | I+D cache miss stall cycles | % cache miss cycles |
| chstone/adpcm | 69,451 | 3,638 | 2,303 | 5,941 | 9% |
| chstone/aes | 30,318 | 1,152 | 2,504 | 3,656 | 12% |
| chstone/blowfish | 626,288 | 933 | 12,183 | 13,116 | 2% |
| chstone/dfadd | 7,582 | 1,345 | 1,298 | 2,643 | 35% |
| chstone/dfdiv | 12,346 | 2,303 | 901 | 3,204 | 26% |
| chstone/dfmul | 3,341 | 977 | 527 | 1,504 | 45% |
| chstone/dfsin | 404,446 | 3,976 | 2,061 | 6,037 | 1% |
| chstone/gsm | 17,335 | 1,611 | 696 | 2,307 | 13% |
| chstone/jpeg | 2,301,332 | 2,093 | 23,730 | 25,823 | 1% |
| chstone/mips | 22,166 | 458 | 305 | 763 | 3% |
| chstone/motion | 9,824 | 23 | 1,388 | 1,411 | 14% |
| chstone/sha | 434,235 | 125 | 12,400 | 12,525 | 3% |
| dhrystone | 9,939 | 437 | 274 | 711 | 7% |
| mandelbrot | 22,907,027 | | | | |
| Geomean | 73,840 | 832 | 1,802 | 3,515 | 7% |
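
To see which benchmarks benefit from NEON, the total cycle counts from the August 5 and August 11 tables can be compared directly. The sketch below is a hypothetical helper script, not part of LegUp. A negative change means fewer cycles with NEON.

```python
# Total cycles without NEON (August 5) and with NEON (August 11)
baseline = {
    "chstone/adpcm": 77578, "chstone/aes": 32625, "chstone/blowfish": 697692,
    "chstone/dfadd": 8011, "chstone/dfdiv": 11493, "chstone/dfmul": 3499,
    "chstone/dfsin": 402183, "chstone/gsm": 23175, "chstone/jpeg": 2535123,
    "chstone/mips": 24808, "chstone/motion": 10056, "chstone/sha": 483699,
    "dhrystone": 15002, "mandelbrot": 22106098,
}
neon = {
    "chstone/adpcm": 69451, "chstone/aes": 30318, "chstone/blowfish": 626288,
    "chstone/dfadd": 7582, "chstone/dfdiv": 12346, "chstone/dfmul": 3341,
    "chstone/dfsin": 404446, "chstone/gsm": 17335, "chstone/jpeg": 2301332,
    "chstone/mips": 22166, "chstone/motion": 9824, "chstone/sha": 434235,
    "dhrystone": 9939, "mandelbrot": 22907027,
}


def percent_change(before, after):
    """Relative change in percent; negative means fewer cycles afterwards."""
    return 100.0 * (after - before) / before


changes = {name: percent_change(baseline[name], neon[name]) for name in baseline}
slower = sorted(name for name, pct in changes.items() if pct > 0)

for name, pct in changes.items():
    print(f"{name:18s} {pct:+6.1f}%")
print("slower with NEON:", ", ".join(slower))
```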
arm_chstone_benchmark_results.txt · Last modified: 2016/08/11 16:48 by bain