Warp Processor

Journal Papers

  • F. Vahid, G. Stitt, R. Lysecky. Warp Processing: Dynamic Translation of Binaries to FPGA Circuits. IEEE Computer, Vol. 41, No. 7, pp. 40-46, July 2008. pdf

An impressive amount of work has been done:

  • Complete FPGA CAD flow: ROCCAD (100K LOC)
    • Targeted at an on-chip ARM7
  • Custom FPGA + MAC: W-FPGA (852K gates)
    • Synthesized with Synopsys Design Compiler
    • Gate level sims of power/performance using 0.18um UMC standard cell library
  • Decompiler that can convert a binary into control and data flow graphs
  • Synthesis tool can recover ifs/loops from the flow graphs and then synthesize into hardware
  • Hardware profiler that can determine the most frequent backwards branches of running code (2000 gates)
  • Binary update: replace the branch with h/w init code and sleep the main processor until the W-FPGA is done
  • 15 benchmarks comparing speedup/energy of W-FPGA vs Virtex-E
  • Energy consumption comparison with 9 other processors (ARM, XScale, Pentium)
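
The profiler bullet above can be illustrated with a small software model. This is only a sketch of the idea (count taken backward branches and report the hottest ones), not the hardware design from the papers; the trace format and the function name `frequent_backward_branches` are assumptions made for this example.

```python
from collections import Counter


def frequent_backward_branches(trace, top_n=3):
    """Return the top_n most frequently taken backward branches.

    trace: iterable of (branch_pc, target_pc) pairs, one per taken branch.
    This trace format is an assumption for illustration only.
    """
    counts = Counter()
    for branch_pc, target_pc in trace:
        # A taken branch to the same or a lower address usually closes a loop.
        if target_pc <= branch_pc:
            counts[(branch_pc, target_pc)] += 1
    return counts.most_common(top_n)


# Hypothetical trace: one hot loop, one forward branch, one cooler loop.
trace = [(0x120, 0x100)] * 50 + [(0x200, 0x240)] * 5 + [(0x180, 0x160)] * 8
hot = frequent_backward_branches(trace)
```

Here `hot[0]` identifies the loop that would be the first candidate for moving into hardware. Forward branches are ignored because they do not close loops.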

The fast CAD flow is accomplished by simplifying the FPGA architecture: giving the tool fewer routing resources to choose from allows faster place and route. The same idea could speed up tools for more complicated architectures by restricting the tool to a subset of the architecture and expanding this subset only if the circuit is highly congested.
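
A toy sketch of the "start small, expand on congestion" idea follows. Everything in it is an assumption for illustration: the fabric is modeled as a square grid of logic blocks, and "congested" is approximated by a simple utilization threshold rather than by a real place-and-route run.

```python
def smallest_fitting_region(num_blocks, full_size, start_size=2, max_util_percent=80):
    """Return the side length of the smallest square region that fits.

    num_blocks: logic blocks the circuit needs (hypothetical input).
    full_size: side length of the whole fabric.
    max_util_percent: toy congestion threshold; above it the region is
    treated as too congested and is expanded.
    Returns None if even the full fabric is too congested.
    """
    size = start_size
    while size <= full_size:
        capacity = size * size
        # Integer arithmetic avoids floating-point edge cases at the threshold.
        if num_blocks * 100 <= max_util_percent * capacity:
            return size
        size += 1  # expand the allowed subset and try again
    return None


region = smallest_fitting_region(18, full_size=8)
too_big = smallest_fitting_region(100, full_size=8)
```

In a real flow the utilization check would be replaced by an actual routing attempt, and the payoff would come from the tool searching a much smaller space in the common, uncongested case.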

The warp processor does not require a special compiler, unlike traditional hardware/software partitioning, which is a big plus. They never cited any Java JIT papers; there might be some interesting work in that area, especially on targeting Java bytecode to FPGAs. The work was aimed at embedded applications, and speedups were relative to a low-power ARM7. What about desktop processors? Maybe there are game physics engine applications? One improvement would be to move ROCCAD from the on-chip co-processor onto the main processor. It is an interesting observation that the clock frequency remains roughly the same between the FPGA and the on-chip processor at each generation. I don't see how they achieved a 1.5x higher clock frequency with 25% less power than the Xilinx Virtex-E. Are the densities the same? The power result is understandable, but I would have expected the Virtex-E to allow a higher clock frequency than the simple W-FPGA architecture. Another interesting observation was that the multiply-accumulate (MAC) operation was frequently used in critical regions, so a dedicated MAC was implemented. How much does this MAC contribute to the overall speedup?

Many papers are cited for the claim that hardware/software partitioning can provide speedups of 100X-1000X. These are for applications that involve extensive bit-level manipulation (e.g., bit reversal) or that are highly parallelizable (e.g., FIR filters). A paper is also cited showing that high-level languages are five times more productive than VHDL.
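
To make the two example kernels concrete, here is a minimal software version of each. These are generic textbook formulations written for this note, not code from the cited papers. In software, bit reversal costs a loop iteration per bit, whereas in hardware it is just wiring; the FIR inner loop is a chain of multiply-accumulate steps whose multiplies are independent and could run in parallel in hardware.

```python
def bit_reverse(value, width=32):
    """Reverse the lowest `width` bits of value, one bit per iteration."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)  # shift in the next low bit
        value >>= 1
    return result


def fir(samples, coeffs):
    """Direct-form FIR filter; outputs start once a full window is available."""
    out = []
    for n in range(len(coeffs) - 1, len(samples)):
        acc = 0
        for k, c in enumerate(coeffs):
            acc += c * samples[n - k]  # one multiply-accumulate (MAC) per tap
        out.append(acc)
    return out


reversed_nibble = bit_reverse(0b0001, width=4)
filtered = fir([1, 2, 3, 4], [1, 1])
```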

Look into:

  • Critical Blue - hardware/software partitioning tool
    • Company seems to be focused on multicore.
    • Cascade is an automated coprocessor synthesis solution. Mentions FPGA support.
  • Tensilica - custom coprocessors
    • XPRES compiler, automatically generate processors from standard C code.
  • Decompilation techniques
    • CIFUENTES, C., SIMON, D., AND FRABOULET, A. 1998. Assembly to high-level language translation. Department of Computer Science and Electrical Engineering, University of Queensland. Tech. Rep. 439. pdf
    • CIFUENTES, C., VAN EMMERIK, M., UNG, D., SIMON, D., AND WADDINGTON, T. 1999. Preliminary experiences with the use of the UQBT binary translation framework. In Proceedings of the Workshop on Binary Translation, 12–22. pdf
    • CIFUENTES, C. 1996. Structuring decompiled graphs. In Proceedings of the International Conference on Compiler Construction. Lecture Notes in Computer Science, vol. 1060, 91–105. pdf
  • Dynamic Profiling
    • LYSECKY, R., COTTERELL, S., AND VAHID, F. 2004a. A fast on-chip profiler memory using a pipelined binary tree. IEEE Trans. Very Large Scale Integration (TVLSI) 12, 1, 120–122. pdf
  • Vertex Coloring:
    • BRELAZ, D. 1979. New methods to color the vertices of a graph. Commun. ACM 22, 251–256 pdf
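
For the vertex coloring item, the heuristic from the Brélaz paper is commonly known as DSATUR: repeatedly pick the uncolored vertex whose neighbors already use the most distinct colors, then give it the smallest free color. Below is a minimal sketch under stated assumptions: the adjacency-dict graph format and the tie-breaking rule (most uncolored neighbors) are choices made for this example.

```python
def dsatur(graph):
    """Greedy DSATUR coloring.

    graph: dict mapping each vertex to the set of its neighbors (undirected).
    Returns a dict mapping each vertex to a color index starting at 0.
    """
    colors = {}

    def priority(v):
        # Saturation: number of distinct colors already used by neighbors.
        saturation = len({colors[u] for u in graph[v] if u in colors})
        uncolored_degree = sum(1 for u in graph[v] if u not in colors)
        return (saturation, uncolored_degree)

    while len(colors) < len(graph):
        v = max((u for u in graph if u not in colors), key=priority)
        used = {colors[u] for u in graph[v] if u in colors}
        color = 0
        while color in used:  # smallest color not used by any neighbor
            color += 1
        colors[v] = color
    return colors


# Triangle a-b-c with a pendant vertex d attached to c.
graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
coloring = dsatur(graph)
```

The result is a proper coloring (no edge joins two vertices of the same color). Like any greedy heuristic, it is not guaranteed to use the minimum number of colors on every graph.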

Benchmarks

  • MEMIK, G., MANGIONE-SMITH, W., AND HU, W. 2001. NetBench: A benchmarking suite for network processors. In Proceedings of the International Conference on Computer-Aided Design (ICCAD), 39–42.
  • LEE, C., POTKONJAK, M., AND MANGIONE-SMITH, W. 1997. MediaBench: A tool for evaluating and synthesizing multimedia and communications systems. In Proceedings of the International Symposium on Microarchitecture (MICRO), 330–335.
  • EEMBC. 2005. The Embedded Microprocessor Benchmark Consortium. http://www.eembc.org.
  • PowerStone: MALIK, A., MOYER, B., AND CERMAK, D. 2000. A low power unified cache architecture providing power and performance flexibility. In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), 241–243.

  • R. Lysecky, F. Vahid. Design and Implementation of a MicroBlaze-based Warp Processor. ACM Transactions on Embedded Computing Systems (TECS), April, 2009, 22 pages. pdf

Other Papers

  • Lysecky, R. and Vahid, F. 2004. A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning. In Proceedings of the Conference on Design, Automation and Test in Europe - Volume 1 (February 16 - 20, 2004). Design, Automation, and Test in Europe. IEEE Computer Society, Washington, DC, 10480. pdf
  • G. Stitt, F. Vahid. New decompilation techniques for binary-level co-processor generation. Proceedings of the 2005 IEEE/ACM International Conference on Computer-Aided Design, pp. 547-554, November 06-10, 2005, San Jose, CA. pdf
  • LYSECKY, R. AND VAHID, F. 2003. On-Chip logic minimization. In Proceedings of the Design Automation Conference (DAC), 334–337. pdf
  • Hardware Profiler
    • GORDON-ROSS, A. AND VAHID, F. 2003. Frequent loop detection using efficient non-intrusive on-chip hardware. In Proceedings of the Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), 117–124. pdf

Patents

  • Vahid, F., Lysecky, R., and Stitt, G. Warp processor for dynamic hardware/software partitioning, US Patent 7,356,672, 2008. pdf
warp_processor.txt · Last modified: 2010/12/15 15:53 (external edit)