Warp Processor
Journal Papers
F. Vahid, G. Stitt, R. Lysecky. Warp Processing: Dynamic Translation of Binaries to FPGA Circuits. IEEE Computer, Vol. 41, No. 7, pp. 40-46, July 2008.
pdf
An impressive amount of work has been done:
Complete FPGA CAD flow: ROCCAD (100K LOC)
Custom FPGA + MAC: W-FPGA (852K gates)
Decompiler that can convert a binary into control and data flow graphs
Synthesis tool can recover ifs/loops from the flow graphs and then synthesize into hardware
Hardware profiler that can determine the most frequent backwards branches of running code (2000 gates)
Binary updater: replaces the loop's branch with hardware initialization code and puts the main processor to sleep until the W-FPGA is done
15 benchmarks comparing speedup/energy of W-FPGA vs Virtex-E
Energy consumption comparison with 9 other processors (ARM, XScale, Pentium)
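To make the profiler item above concrete, here is a minimal software sketch of the idea of finding the most frequently taken backward branches. It is not the paper's 2000-gate hardware design. The trace format (pairs of branch address and target address for taken branches) and the function name are assumptions made up for this illustration.

```python
from collections import Counter


def hottest_backward_branches(trace, top_n=3):
    """Return the top_n most frequently taken backward branches.

    trace: iterable of (branch_addr, target_addr) pairs, one per taken
    branch. This trace format is hypothetical, chosen for the sketch.
    """
    counts = Counter()
    for branch_addr, target_addr in trace:
        # A taken branch to a lower-or-equal address usually closes a loop
        # iteration, so it marks a candidate critical region.
        if target_addr <= branch_addr:
            counts[(branch_addr, target_addr)] += 1
    return counts.most_common(top_n)


# Hypothetical trace: a hot inner loop, a cooler outer loop, and a
# forward branch that the profiler should ignore.
trace = [(0x120, 0x100)] * 50 + [(0x200, 0x0F0)] * 5 + [(0x130, 0x180)] * 7
print(hottest_backward_branches(trace, top_n=2))
```

A hardware version cannot keep an unbounded table like `Counter`, so a real profiler has to get by with a small, fixed number of counters. The sketch ignores that constraint.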
The fast CAD flow is accomplished by simplifying the FPGA architecture. Giving the tool fewer routing resources to choose from makes place and route faster. The same idea could speed up CAD for complicated architectures: restrict the tool to a subset of the architecture and expand that subset only if the circuit is highly congested.
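The "restrict, then expand on congestion" idea can be sketched with a toy channel router. This is only an illustration, not ROCCAD's algorithm: the nets are modeled as (start, end) column intervals in a single routing channel, tracks are assigned with a greedy left-edge-style pass, and the doubling policy and all names are assumptions of this sketch.

```python
def assign_tracks(nets, num_tracks):
    """Greedy left-edge-style assignment of nets to routing tracks.

    nets: list of (start, end) column intervals with non-negative columns.
    Returns {net_index: track} or None if num_tracks is too few.
    """
    last_end = [-1] * num_tracks  # rightmost occupied column on each track
    assignment = {}
    for idx in sorted(range(len(nets)), key=lambda i: nets[i]):
        start, end = nets[idx]
        for track in range(num_tracks):
            if last_end[track] < start:  # the track is free at this column
                last_end[track] = end
                assignment[idx] = track
                break
        else:
            return None  # congested: no allowed track can hold this net
    return assignment


def route_with_expanding_subset(nets, total_tracks, initial_tracks=1):
    """Route with a small subset of tracks, doubling it on congestion.

    Assumes 1 <= initial_tracks and 1 <= total_tracks.
    Returns (tracks_used_by_tool, assignment) or None if even the full
    architecture is too congested.
    """
    allowed = min(initial_tracks, total_tracks)
    while True:
        assignment = assign_tracks(nets, allowed)
        if assignment is not None:
            return allowed, assignment
        if allowed == total_tracks:
            return None
        allowed = min(allowed * 2, total_tracks)


# Hypothetical nets: three overlap, so one or two tracks are not enough.
nets = [(0, 3), (1, 4), (2, 5), (6, 8)]
print(route_with_expanding_subset(nets, total_tracks=8))
```

In this toy the router only ever considers 4 of the 8 available tracks. Uncongested circuits would finish with an even smaller search space, which is where the speedup would come from.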
Unlike traditional software/hardware partitioning, the warp processor does not require a special compiler. This is a big plus.
They never cited any Java JIT papers. There might be some interesting work in that area, especially on targeting Java bytecode to FPGAs.
The work was aimed at embedded applications, and speedups were relative to a low-power ARM7. What about desktop processors? Maybe there are game physics engine applications?
One improvement would be to move ROCCAD from the on-chip co-processor onto the main processor.
Interesting observation that the clock frequency remains roughly the same between the FPGA and on-chip processor at each generation.
Don't see how they achieved a 1.5X higher clock frequency with 25% less power than the Xilinx Virtex-E. Are the densities the same?
The power is understandable, but I would have expected the Virtex-E to allow a higher clock frequency than the simple W-FPGA architecture.
An interesting observation was that the multiply-accumulate (MAC) operation was frequently used in critical regions, so a dedicated MAC was implemented. How much does this MAC contribute to the overall speedup?
Many papers are cited for the claim that hardware/software partitioning can provide speedups of 100X-1000X. These are for applications that involve extensive bit-level manipulation (bit reversal) or that are highly parallelizable (FIR filter).
A paper is cited showing that high-level languages are five times more productive than VHDL.
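One thing to keep in mind when reading the 100X-1000X figures: a kernel speedup only translates into an application speedup in proportion to the time spent in that kernel (Amdahl's law). The sketch below works through the arithmetic. The 90% kernel fraction is a made-up value for illustration, not a number from the papers.

```python
def overall_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law: whole-application speedup when only the kernel,
    which takes kernel_fraction of the original run time, is accelerated
    by kernel_speedup."""
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


# Hypothetical application that spends 90% of its time in the kernel.
for s in (100, 1000):
    print(s, round(overall_speedup(0.9, s), 2))
```

With 90% of the time in the kernel, even a 1000X kernel speedup gives just under 10X overall, so application-level numbers depend heavily on how dominant the critical region is.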
Look into:
Benchmarks
MEMIK, G., MANGIONE-SMITH, W., AND HU, W. 2001. NetBench: A benchmarking suite for network processors. In Proceedings of the International Conference on Computer-Aided Design (ICCAD), 39–42.
LEE, C., POTKONJAK, M., AND MANGIONE-SMITH, W. 1997. MediaBench: A tool for evaluating and synthesizing multimedia and communications systems. In Proceedings of the International Symposium on Microarchitecture (MICRO), 330–335.
PowerStone: MALIK, A., MOYER, B., AND CERMAK, D. 2000. A low power unified cache architecture providing power and performance flexibility. In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), 241–243.
R. Lysecky, F. Vahid. Design and Implementation of a MicroBlaze-based Warp Processor. ACM Transactions on Embedded Computing Systems (TECS), April, 2009, 22 pages.
pdf
Other Papers
Lysecky, R. and Vahid, F. 2004. A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning. In Proceedings of the Conference on Design, Automation and Test in Europe - Volume 1 (February 16 - 20, 2004). Design, Automation, and Test in Europe. IEEE Computer Society, Washington, DC, 10480.
pdf
G. Stitt, F. Vahid, New decompilation techniques for binary-level co-processor generation, Proceedings of the 2005 IEEE/ACM International conference on Computer-aided design, p.547-554, November 06-10, 2005, San Jose, CA
pdf
LYSECKY, R. AND VAHID, F. 2003. On-Chip logic minimization. In Proceedings of the Design Automation Conference (DAC), 334–337.
pdf
Hardware Profiler
GORDON-ROSS, A. AND VAHID, F. 2003. Frequent loop detection using efficient non-intrusive on-chip hardware. In Proceedings of the Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), 117–124.
pdf
Patents
Vahid, F., Lysecky, R., and Stitt, G. Warp processor for dynamic hardware/software partitioning, US Patent 7,356,672, 2008.
pdf