Closely Coupled Co-processors for Algorithmic Acceleration

Introduction

In the early days of field programmable gate arrays, FPGAs were used primarily as glue logic for chip-to-chip communication or for other bridging functionality. Current FPGA devices encompass complete embedded processors and processing subsystems. Designers continue to face demands for higher processing performance, new feature content, and reduced cost. As performance requirements keep rising, the inherent parallelism of FPGAs makes it possible to achieve significant acceleration in hardware through closely coupled co-processors. This is particularly important for compute-intensive applications. Systems that can benefit from this type of algorithmic acceleration include wireless communications and image processing platforms, such as those in medical, video, and other graphics applications that require a large degree of signal processing.

Mechanics of Acceleration

There are several ways of achieving hardware acceleration in today’s FPGAs. One method that is gaining wide acceptance is hardware acceleration through custom co-processing engines tightly coupled to processor pipelines and register files. The co-processing function is called either by a user-defined instruction (UDI) or by a pre-defined instruction that is part of the co-processing instruction set.

Another method that is attracting interest is dynamic partial reconfigurability. This capability lets designers satisfy both performance acceleration and design flexibility requirements by dynamically modifying the architecture or FPGA functionality during system operation. This flexibility avoids the limitations inherent in a general purpose processor. While dynamic partial reconfigurability is a developing and important attribute of FPGAs, it is beyond the scope of this article and will be discussed at a later time.
This paper focuses on the implementation of co-processing instructions, where dedicated hardware is used as a co-processing engine to achieve algorithmic acceleration. Developing a co-processing engine that is optimized for both performance and area can require highly focused engineering resources. ESL (Electronic System-Level) design automation tools with system analysis, code profiling, and C-to-HDL (RTL or EDIF) translation capability can reduce the amount of effort involved. These ESL tools accept high-level descriptions written in a language familiar to embedded software programmers. C and C++ are the standard programming languages for embedded systems design, with others starting to appear in a variety of applications. Figure 1 highlights the main steps in a generic conversion flow.
Figure 1. C-to-HDL translation flow

Many companies are vying for system designers’ time to evaluate their latest ESL offerings. Among the more recent ones supporting FPGAs with soft or hard integrated processors are Impulse Accelerated Technologies’ Co-Developer tools, Poseidon Technology’s Triton tools, and Celoxica’s latest DK tool suite. Celoxica’s DK tool suite lets users mix C/C++, used for algorithm specification and functional testbenches, with Handel-C source code, used to describe parallel algorithms, and provides a direct path to EDIF or RTL. Mentor Graphics’ Catapult C synthesizes untimed, system-level C/C++ source code and, according to user feedback, requires only slight modifications to the resulting synthesizable RTL.

A growing trend in the ESL space is tight integration with current synthesis and place-and-route tools. One example is Impulse Accelerated Technologies’ Co-Developer suite, which enables a complete “system on programmable logic” to be developed and targeted to specific FPGAs and development boards. Figure 2 showcases an integrated flow using Co-Developer together with the Xilinx Platform Studio embedded tool suite. The Impulse Co-Developer tools simplify the conversion of C subroutines to FPGA logic and create the necessary software-to-hardware communications.

The Triton tool suite from Poseidon Technology tackles the hardware/software partitioning problem with a system analyzer that identifies architecture bottlenecks in an existing system and suggests changes to improve performance and resource utilization. The analysis component is paired with conversion tools that translate targeted portions of software into hardware.

ESL tools aim to give system architects and embedded engineers the capability to optimize their architecture and the performance of their software algorithms through detailed analysis of their system and other dedicated architecture analysis tools.
Ideally, the end result is an efficient path from a high-level language abstraction, or “executable specification”, to either an FPGA-ready netlist or synthesizable RTL.
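As a rough illustration of the kind of C source that tends to suit C-to-HDL translation, the sketch below computes one output sample of a small FIR filter. It assumes, as a general rule of thumb rather than a documented requirement of any tool named above, that conversion works best when loop bounds and storage sizes are known at compile time and the code avoids dynamic allocation and recursion. The names `fir_sample`, `NUM_TAPS`, and `coeff`, and the coefficient values, are hypothetical.

```c
#include <stdint.h>

#define NUM_TAPS 4  /* fixed at compile time so the loop bound is static */

/* Hypothetical coefficients, chosen only for illustration. */
static const int16_t coeff[NUM_TAPS] = { 1, 2, 3, 4 };

/* Compute one FIR output sample from the NUM_TAPS most recent inputs.
 * Integer-only arithmetic, a fixed-size array, and a constant loop bound:
 * no malloc, no recursion, no data-dependent loop exit. */
int32_t fir_sample(const int16_t window[NUM_TAPS])
{
    int32_t acc = 0;
    for (int i = 0; i < NUM_TAPS; i++) {
        /* Widen before multiplying so the product cannot overflow 16 bits. */
        acc += (int32_t)coeff[i] * (int32_t)window[i];
    }
    return acc;
}
```

Because the four multiply-accumulate steps share nothing except the accumulator, a translation tool is in principle free to unroll the loop and evaluate the products in parallel, which is the form of parallelism the flow is meant to expose.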
Figure 2. Impulse Co-Developer tool operating in conjunction with the Xilinx Platform Studio tool suite

All of these ESL solutions require some fine tuning, as no fully automatic, push-button solution is yet available. Designers should expect some manual adjustment when using these ESL tools. Achieving optimal system-level hardware/software partitioning means taking several key criteria into account. These include performance, resource (area) utilization, the time required to reach the desired level of optimization, and the minor restructuring of code needed to suit hardware conversion. Leveraging these tools is the first step towards achieving hardware acceleration.

The following sections discuss the pros and cons of accelerating designs with co-processing engines and provide a real-world implementation demonstrating the resulting benefits.

Why a hardware co-processor?

Besides achieving a performance speed-up, other motivations might be to reduce the workload of the processor or to implement a feature that is not present in software. For example, if a processor has no floating point support and emulates floating point operations in software, a hardware floating point unit acting as a co-processor can offload the processor, resulting in a faster, more efficient system.

These benefits come with costs that the designer must weigh:

1) Evaluation and profiling effort to decide partitioning
Effort should be focused on offloading only the critical parts of the computation to hardware. Otherwise the performance speed-up could be minimal, and more engineering hours will be needed to re-evaluate the system.

2) Hardware resource usage
In a hardware resource-constrained environment, implementing co-processing functions in hardware takes away FPGA resources that could be required for other functions. The designer therefore has to spend time carefully allocating hardware real estate.
For example, if adding a hardware floating point co-processor requires a substantial amount of FPGA resources, it may prove difficult to get the design through synthesis and place-and-route.

3) Synchronization between hardware and software
When portions of the overall computation are done in hardware and others in software, extra logic may be needed to synchronize operations between the two domains, resulting in additional complexity and increased latency.

How the solution works

The following steps summarize a general methodology for hardware acceleration of an existing system:

1) Profile the present application to find out where the performance bottlenecks lie. For instance, if the application is written in C, freely available profiling tools such as gprof can point out which functions take up most of the processing cycles.

2) Next, consider which bottlenecks will yield the most performance speed-up at the least cost.
3) Use design automation tools (such as the ESL tools described earlier) or engineering hours to implement the hardware acceleration function(s).
4) Because decisions at each intermediate step can significantly affect the resulting implementation, designers should go through multiple iterations of the first three steps to fine-tune their hardware-accelerated designs.

Implementation of Software to Hardware Comparison System

An example of a real-world implementation is hardware acceleration of an Inverse Discrete Cosine Transform (IDCT) function in a Xilinx Virtex™-4 FPGA. Figure 3 is a block diagram of the primary modules implemented in the design. System-level data flow is as follows:

First: Image data is loaded into DDR memory.
Next: DCT and IDCT operations are performed on the image data, and the output is stored in memory locations reserved for TFT display buffers.
Last: The TFT display controller reads output image data from DDR memory and sends it through the VGA port for display on a monitor.
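For readers unfamiliar with the transform being accelerated, the following is a minimal sketch of an 8x8 inverse DCT written directly from the textbook definition. It is a reference model only: it is not the xine routine used in the software system, nor a description of the hardware module, both of which would use much faster factored or fixed-point forms. The function names and the row-major `coef[v*8 + u]` layout are assumptions made for this example.

```c
#define IDCT_N 8

/* cos(k * pi / 16) for any k >= 0, from a 9-entry table plus symmetry. */
static double cos16(int k)
{
    static const double c[9] = {
        1.0,
        0.9807852804032304, 0.9238795325112867, 0.8314696123025452,
        0.7071067811865476, 0.5555702330196022, 0.3826834323650898,
        0.1950903220161283, 0.0
    };
    k %= 32;                      /* period of 2*pi                     */
    if (k > 16) k = 32 - k;       /* cos(2*pi - t) == cos(t)            */
    if (k > 8)  return -c[16 - k];/* cos(pi - t)   == -cos(t)           */
    return c[k];
}

/* Direct O(N^4) inverse DCT of one 8x8 block.
 * coef[v*8 + u] holds the frequency coefficient F(u, v);
 * out[y*8 + x] receives the reconstructed sample f(x, y). */
void idct8x8_reference(const double coef[64], double out[64])
{
    const double inv_sqrt2 = 0.7071067811865476;
    for (int y = 0; y < IDCT_N; y++) {
        for (int x = 0; x < IDCT_N; x++) {
            double sum = 0.0;
            for (int v = 0; v < IDCT_N; v++) {
                for (int u = 0; u < IDCT_N; u++) {
                    double cu = (u == 0) ? inv_sqrt2 : 1.0;
                    double cv = (v == 0) ? inv_sqrt2 : 1.0;
                    sum += cu * cv * coef[v * IDCT_N + u]
                         * cos16((2 * x + 1) * u)
                         * cos16((2 * y + 1) * v);
                }
            }
            out[y * IDCT_N + x] = 0.25 * sum;
        }
    }
}
```

Each of the 64 output samples is an independent sum of 64 multiply-accumulate terms, which is why the function maps naturally onto parallel DSP resources and shows up as a bottleneck when executed serially on a processor.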
Figure 3. Block diagram of the example system described above for design validation.

Processor role
After reading input pixel data from memory, the embedded PowerPC™ 405 (PPC405) processor uses either a software or a hardware IDCT routine to process the pixels and then stores the output in display memory.

Software IDCT
In the software implementation, IDCT operations are coded in C, drawing on xine, an open source video player for Linux.

Hardware IDCT
In the hardware implementation, all IDCT operations are implemented using the built-in XtremeDSP™ slices of the FPGA. The System Generator™ for DSP tools from Xilinx are used to generate the IDCT module in Verilog. Custom co-processing instructions, decoded by the integrated Auxiliary Processor Unit (APU) controller, replace the C code of the software IDCT. The theoretical throughput utilizing the capability w/ APU quad-word transfer

System Overview
The system consists of an integrated PPC405 core and APU, hardware accelerator modules for the IDCT, memory, and a display controller. The PPC405 processor offloads computations to hardware modules in the FPGA through the APU co-processor interface. Using custom instructions for hardware acceleration, the processor is able to send data to and receive data from the hardware acceleration modules via the APU, as shown in Figure 4. In the hardware-accelerated IDCT system, the data flow follows the sequence shown in Figure 4.
Figure 4. IDCT engine content and data flow

From the data in Table 1, it is clear that offloading the processor to a custom IDCT co-processing engine both accelerates execution and reduces code size. The final performance metric, frames per second displayed on a 480x480-pixel VGA monitor, takes into account data transfer latency as well as IDCT execution latency.
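The interplay between execution latency and data transfer latency can be approximated with Amdahl's law extended by a transfer-overhead term. The sketch below is a back-of-the-envelope model under that assumption, not the method used to obtain the results in this article. The function name `overall_speedup` and its parameters are hypothetical, and all quantities are expressed as fractions of the original, software-only run time.

```c
/* Estimated whole-application speed-up after offloading one function.
 *
 * fraction          share of the original run time spent in the function
 *                   being offloaded (0.0 to 1.0), e.g. from a profiler
 * kernel_speedup    how many times faster the hardware computes it
 * transfer_overhead added time for moving data to and from the hardware,
 *                   as a share of the original run time
 */
double overall_speedup(double fraction, double kernel_speedup,
                       double transfer_overhead)
{
    /* New run time = untouched software part + accelerated part + transfer. */
    double new_time = (1.0 - fraction)
                    + fraction / kernel_speedup
                    + transfer_overhead;
    return 1.0 / new_time;
}
```

With purely illustrative inputs, a function that accounts for 80% of run time and runs 10 times faster in hardware gives an overall speed-up of about 3.6 when transfers are free, but only 2.5 when transfers cost 12% of the original run time. This is why a frames-per-second figure that includes transfer latency is a more honest measure than kernel speed-up alone.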
Table 1. Comparison of key metrics for video system example

Maximizing Acceleration

It is important to note that co-processing hardware accelerators tightly coupled to processors may not be the best solution for every system. For an application whose execution time is dominated by compute latency rather than data transfer latency, a peripheral connected to the processor via a system bus will likely yield performance improvements similar to those of a co-processor.

The areas that provide the greatest acceleration are: 1) the most frequently executed functions; 2) the most time-consuming functions; and 3) functions for which a more efficient hardware implementation is available, e.g., a hardware floating point unit instead of software floating point emulation. Less advantageous candidates for acceleration are algorithms that are not compute-intensive and implementations that consume a large amount of hardware resources.

Conclusion

This article captures some of the key considerations for algorithmic acceleration through hardware co-processing on an FPGA system. ESL design automation tools that automate this process are actively being developed. However, because of the tradeoffs between performance, resource utilization, and cost, it is still difficult to produce a true push-button solution. Besides tackling the complexity of partitioning resources between software and hardware, ESL tool developers must also continue to work with FPGA vendors to achieve tight integration with synthesis and place-and-route flows.

The general methodology for implementing co-processing hardware acceleration, illustrated here with an image processing application, demonstrates that offloading the processor can significantly increase overall performance, depending on the nature of the algorithm or function offloaded.
by Harn Hua Ng, Senior Systems Design Engineer, Xilinx, Inc. and August 30, 2005
All
material on this site copyright © 2006 techfocus media, inc.
All rights reserved.