High-Level Synthesis Survey
AutoESL/Xilinx
People
Jason Cong - UCLA
Juanjo Noguera - Xilinx research
Stephen Neuendorffer - Xilinx research
Kees Vissers - Xilinx research
Zhiru Zhang - AutoESL/Xilinx
Bin Liu - PhD UCLA, AutoESL
Sven van Haastregt - PhD at Leiden University, Xilinx
Jesus Barba - PhD graduate at UCLM, Spain
Chris Dick - DSP chief architect, Xilinx
Papers
2011 - J Noguera, S Neuendorffer, S Van Haastregt. Implementation of sphere decoder for MIMO-OFDM on FPGAs using high-level synthesis tools. Integrated Circuits and …, 2011 - Springer
Implemented a DSP application using AutoPilot (version 2010.07.ft) and compared it with a hand-written RTL version. Target: Virtex-5 at 225 MHz. Sphere decoding is used for WiMAX mobile wireless networks.
Started with a MATLAB reference implementation (from Dick 2010)
Used Xilinx System Generator to create Verilog
Converted MATLAB to C++ and used HLS
Uses a systolic array. Input data rate is 1 input sample/clock. Clock cycles per channel = 64.
Used C++ template classes for arbitrary precision integer types and template functions for parameterized blocks.
HLS constraints specify the target FPGA family and the target clock frequency.
Pragmas for loop unrolling, and to specify which FPGA resource implements an array.
SystemC adapters are generated so that the C++ testbenches can be reused to test the final RTL.
Reference C++ code (derived from MATLAB): ~2000 lines, fixed point.
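The template-based coding style noted above (arbitrary-precision integer types plus template functions for parameterized blocks) might look roughly like the sketch below. `WrapInt<W>` is a hypothetical stand-in for the tool's arbitrary-precision integer class, written so the example compiles with an ordinary C++ compiler. `dot_product` and its parameters are also illustrative and are not taken from the paper.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for an HLS arbitrary-precision signed integer:
// holds a value wrapped to W bits in two's complement.
template <int W>
struct WrapInt {
    static_assert(W >= 2 && W <= 32, "this sketch supports 2..32 bits");
    std::int64_t v;

    WrapInt(std::int64_t x = 0) : v(wrap(x)) {}

    static std::int64_t wrap(std::int64_t x) {
        const std::int64_t mod = std::int64_t(1) << W;
        std::int64_t r = x % mod;      // r is in (-mod, mod)
        if (r < 0) r += mod;           // r is in [0, mod)
        if (r >= mod / 2) r -= mod;    // upper half maps to negative values
        return r;
    }
};

// Parameterized block (illustrative): N-element dot product with W-bit
// inputs and an ACC_W-bit accumulator. Widths and length are compile-time
// parameters, so each instantiation describes differently sized hardware.
template <int W, int ACC_W, std::size_t N>
WrapInt<ACC_W> dot_product(const std::array<WrapInt<W>, N>& a,
                           const std::array<WrapInt<W>, N>& b) {
    WrapInt<ACC_W> acc(0);
    for (std::size_t i = 0; i < N; ++i) {
        acc = WrapInt<ACC_W>(acc.v + a[i].v * b[i].v);
    }
    return acc;
}
```

A call such as `dot_product<18, 32>(a, b)` fixes the operand and accumulator widths and lets the compiler deduce the length from the array arguments.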
Refactoring
Macro-architecture: split the code into functions representing hardware blocks that communicate through arrays. Each function represents a pipeline stage. Arrays are translated into ping-pong buffers to allow parallel execution.
Parameterization: use C++ template parameters. The initiation interval can have a big impact on resource sharing (the paper presents an example of this at the end of section 6).
Time-division multiplexing: in pipelines with feedback loops, registers cannot be inserted freely without introducing pipeline stalls, so these recurrences (feedback loops) limit throughput. The inner loop had a 15-cycle recurrence, so C-slowing (time-division multiplexing) over 15 separate data sets accommodated the recurrence without any pipeline stalls. The HLS tool reports recurrences to the designer.
FPGA optimizations: bit-width optimization (18-bit fixed point using C++ template classes) and efficient use of DSP48 blocks. A template-parameterized function that performs a multiplication followed by a subtraction can be mapped onto a single DSP48.
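The C-slowing idea from the refactoring notes can be illustrated with a small software model. This is a sketch under stated assumptions, not the paper's code: `step` is a made-up loop-carried update standing in for the real inner-loop body, and the cycle-count helpers assume a simplified model in which a dependent operation must wait `kDepth` cycles for its predecessor.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kDepth = 15;  // recurrence length in cycles (from the notes)

using Streams = std::vector<std::vector<std::uint32_t>>;

// Hypothetical loop-carried update: the new state depends on the old state.
inline std::uint32_t step(std::uint32_t state, std::uint32_t in) {
    return state * 1664525u + in;
}

// Reference: process each data set on its own, one after another.
std::vector<std::uint32_t> run_sequential(const Streams& inputs) {
    std::vector<std::uint32_t> out;
    for (const auto& stream : inputs) {
        std::uint32_t s = 0;
        for (std::uint32_t x : stream) s = step(s, x);
        out.push_back(s);
    }
    return out;
}

// C-slowed order: one operation issues per cycle, rotating over kDepth
// independent data sets. Data set k is revisited every kDepth cycles, which
// is when its previous result becomes available, so no stall is needed.
std::vector<std::uint32_t> run_cslowed(const Streams& inputs) {
    assert(inputs.size() == kDepth);
    const std::size_t len = inputs[0].size();
    for (const auto& s : inputs) assert(s.size() == len);

    std::array<std::uint32_t, kDepth> state{};  // one running state per data set
    for (std::size_t n = 0; n < len; ++n) {
        for (std::size_t k = 0; k < kDepth; ++k) {  // one issue slot per cycle
            state[k] = step(state[k], inputs[k][n]);
        }
    }
    return std::vector<std::uint32_t>(state.begin(), state.end());
}

// Approximate cycle counts under the simplified model described above.
constexpr std::size_t cycles_stalled(std::size_t sets, std::size_t len) {
    return sets * len * kDepth;  // every operation waits for its predecessor
}
constexpr std::size_t cycles_cslowed(std::size_t len) {
    return len * kDepth + kDepth;  // one issue per cycle plus pipeline drain
}
```

Both orderings compute the same final states. Only the issue order changes, which is what lets the pipeline stay full despite the feedback loop.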
pragmas:
ARRAY_STREAM: implement the array as a stream for dataflow computation instead of in BRAM.
ARRAY_PARTITION:
PIPELINE II = MM_II: pipeline the loop with initiation interval set by template parameter MM_II
LATENCY max=2: use a maximum of two cycles to schedule this function
INTERFACE ap_none port=return register: use a register for the return value
Fixed point datatypes are used: ap_int<18>
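As a rough illustration of how such pragmas sit in the source, the sketch below applies PIPELINE and LATENCY to a templated multiply-subtract block. The `#pragma AP` spelling, the function names and the use of `int32_t` in place of an 18-bit type are all assumptions made for this example. An ordinary C++ compiler ignores the pragmas (possibly with a warning), so the code can still be compiled and tested in software.

```cpp
#include <cassert>
#include <cstdint>

// Multiply followed by subtract: c - a * b. Keeping this as one small
// function gives an HLS tool the chance to map it onto a single DSP48.
// (Hypothetical helper; int32_t stands in for an 18-bit type.)
template <typename T>
T mult_sub(T a, T b, T c) {
#pragma AP LATENCY max=2
    return c - a * b;
}

// Element-wise block whose loop is pipelined with an initiation interval
// supplied as the template parameter MM_II, as in the pragma notes above.
template <int MM_II, int N>
void mult_sub_block(const std::int32_t a[N], const std::int32_t b[N],
                    const std::int32_t c[N], std::int32_t out[N]) {
    static_assert(MM_II >= 1, "initiation interval must be at least 1");
    for (int i = 0; i < N; ++i) {
#pragma AP PIPELINE II=MM_II
        out[i] = mult_sub(a[i], b[i], c[i]);
    }
}
```

Instantiating `mult_sub_block<1, 4>` requests one new iteration per cycle. A larger `MM_II` would let the tool share multipliers across iterations at the cost of throughput.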
All verification can be done in C, which is much faster than RTL simulation.
Final QoR, AutoESL relative to hand-written RTL: LUTs +4%, registers -26%, DSP48s -15%, 18K BRAMs -28%.
Same Fmax and throughput.
CAIRN research group at IRISA, Rennes France
People
Steven Derrien: assistant professor at University of Rennes 1, France. Former student of P. Quinton
Sanjay Rajopadhye: now an associate professor at Colorado State University. Previously at IRISA
Patrice Quinton: professor at University of Rennes 1
Tanguy Risset: professor at INSA Lyon, France
Papers
High-Level Synthesis of Loops Using the Polyhedral Model. S Derrien, S Rajopadhye, P Quinton… - High-Level Synthesis, Springer 2008
PICO uses partial loop unrolling with software pipelining. A way is needed to specify loop transformations (tiling, fusion, interchange) in the source code. Programming is done in the ALPHA language, which is based on the polyhedral model, using the MMAlpha software. Mentions the WraPit project. Loop nests are written as recurrence equations.
Future work: scheduling with a multi-dimensional time function to support loop tiling.
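To make "loop nests written as recurrence equations" concrete, the sketch below expresses a matrix-vector product both ways. It is plain C++ rather than ALPHA, and the function names are hypothetical. The point is only that the recurrence form defines each value once over an index domain, leaving the execution order (the schedule) open.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<long>>;

// Imperative form: a loop nest that repeatedly overwrites y[i].
std::vector<long> matvec_loops(const Matrix& A, const std::vector<long>& x) {
    std::vector<long> y(A.size(), 0);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j)
            y[i] += A[i][j] * x[j];
    return y;
}

// Recurrence form over the domain {(i, j) | 0 <= i < M, 0 <= j <= N}:
//   Y(i, 0) = 0
//   Y(i, j) = Y(i, j-1) + A(i, j-1) * x(j-1)   for j >= 1
// Each Y(i, j) is defined exactly once; the result is y(i) = Y(i, N).
long Y(const Matrix& A, const std::vector<long>& x,
       std::size_t i, std::size_t j) {
    if (j == 0) return 0;
    return Y(A, x, i, j - 1) + A[i][j - 1] * x[j - 1];
}

std::vector<long> matvec_recurrence(const Matrix& A, const std::vector<long>& x) {
    std::vector<long> y;
    for (std::size_t i = 0; i < A.size(); ++i)
        y.push_back(Y(A, x, i, x.size()));
    return y;
}
```

Because the recurrence only states dependences between points of the domain, a tool is free to choose any order that respects them, which is where transformations such as tiling or interchange come in.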