


Closely Coupled Co-processors for Algorithmic Acceleration
by Harn Hua Ng, Senior Systems Design Engineer, Xilinx, Inc. and
Dan Isaacs, Director of Embedded Processor Marketing, Xilinx, Inc.

Introduction

In the early days of field programmable gate arrays (FPGAs), the devices were used primarily as glue logic for chip-to-chip communication or for other bridging functions. Current FPGA devices encompass complete embedded processors and processing subsystems. Designers continue to face demands for higher processing performance, new features, and reduced cost. The inherent parallelism of FPGAs offers a way to meet the performance demand: closely coupled co-processors that accelerate algorithms in hardware. This is particularly important for compute-intensive applications. Systems that can benefit from this type of algorithmic acceleration include wireless communications and image processing platforms, such as those in medical, video, and other graphics applications that require a large degree of signal processing.

Mechanics of Acceleration

There are several ways of achieving hardware acceleration in today’s FPGAs. One method that is gaining wide acceptance is hardware acceleration through custom co-processing engines tightly coupled to processor pipelines and register files. The co-processing function is invoked either by a user-defined instruction (UDI) or by a pre-defined instruction that is part of the co-processing instruction set. Another method attracting interest is dynamic partial reconfigurability, which lets designers satisfy both performance and design flexibility requirements by modifying the architecture or FPGA functionality during system operation. This flexibility avoids the limitations inherent in a general-purpose processor. While dynamic partial reconfigurability is an important and developing capability in FPGAs, it is beyond the scope of this article and will be discussed at a later time. This article focuses on the implementation of co-processing instructions, where dedicated hardware is used as a co-processing engine to achieve algorithmic acceleration.

Developing a performance- and area-optimized co-processing engine can require highly focused engineering resources. ESL (Electronic System-Level) design automation tools with system analysis, code profiling, and C-to-HDL (RTL or EDIF) translation capability can reduce the effort involved. These ESL tools accept high-level descriptions written in a language familiar to embedded software programmers. The standard programming languages for embedded systems design are C and C++, with others starting to appear in a variety of applications. Figure 1 highlights the main steps in a generic conversion flow.

Figure 1. C-to-HDL Translation flow

Many companies are vying for system designers’ time to evaluate their latest ESL offerings. Among the more recent ones supporting FPGAs with soft or hard integrated processors are Impulse Accelerated Technologies’ Co-Developer tools, Poseidon Technology’s Triton tools, and Celoxica’s latest DK tool suite. Celoxica’s DK tool suite lets users mix C/C++, used for algorithm specification and functional testbenches, with Handel-C source code, used to describe parallel algorithms, to provide a direct path to EDIF or RTL. Mentor Graphics’ Catapult C synthesizes untimed, system-level C/C++ source code and, according to user feedback, requires only slight modifications to the resulting synthesizable RTL.

A growing trend in the ESL space is tight integration with existing synthesis and place-and-route tools. One example is Impulse Accelerated Technologies’ Co-Developer suite, which enables a complete “system on programmable logic” to be developed and targeted to specific FPGAs and development boards. Figure 2 shows an integrated flow using Co-Developer together with the Xilinx Platform Studio embedded tool suite. The Impulse Co-Developer tools simplify the conversion of C subroutines to FPGA logic and create the necessary software-to-hardware communications. The Triton tool suite from Poseidon Technology tackles the hardware/software partitioning problem with a system analyzer that identifies architecture bottlenecks in an existing system and suggests changes to improve performance or resource utilization. The analysis component is paired with conversion tools that translate targeted portions of software into hardware.

ESL tools aim to give system architects and embedded engineers the capability to optimize their architecture and the performance of their software algorithms through detailed system analysis and other dedicated architecture analysis tools. Ideally, the end result is an efficient path from a high-level language abstraction or “executable specification” to either a netlist that goes directly to the FPGA or synthesizable RTL.

Figure 2. Impulse Co-Developer Tool operating in conjunction with Xilinx Platform Studio tool suite

All of these ESL solutions require some fine-tuning, as no fully automatic, push-button solution is yet available. Designers should expect some manual adjustment when using these tools. To achieve optimal system-level hardware/software partitioning, several key criteria must be taken into account: performance, area utilization of resources, the time required to reach the desired level of optimization, and the minor restructuring of code needed to fit the desired hardware conversion.

Leveraging these tools is a first step toward achieving hardware acceleration. The following sections discuss the pros and cons of accelerating designs with co-processing engines and present a real-world implementation that demonstrates the resulting benefits.

Why a hardware co-processor?

Besides achieving a performance speed-up, additional motivations might be to reduce the workload of the processor or to implement a feature that is not present in software. For example, if a processor has no floating point support and emulates floating point operations in software, a hardware floating point unit acting as a co-processor can offload the processor, resulting in a faster, more efficient system.

Considerations

1) Evaluation and profiling effort to decide partitioning

Efforts should focus on offloading only the critical parts of the computation to hardware; otherwise the performance speed-up could be minimal and more engineering hours would be needed to re-evaluate the system.

2) Consumption of hardware resources

In a hardware resource-constrained environment, implementing co-processing functions in hardware takes away FPGA resources that could be required for other functions. The designer therefore has to allocate hardware real estate carefully. For example, if adding a hardware floating point co-processor requires a substantial amount of FPGA resources, it may prove difficult to get the design through synthesis and place-and-route.

3) Synchronization between hardware and software

When some portions of the overall computation are done in hardware and others in software, extra logic may be needed to synchronize operations between the two domains, resulting in additional complexity and increased latency.

How the solution works:

The following steps summarize a general methodology for hardware acceleration on an existing system:

1) Profile the present application to find out where performance bottlenecks lie. For instance, if the application is written in C, freely-available profiling tools such as gprof can point out which functions take up most of the processing cycles.

2) Next, consider which bottlenecks will yield the most performance speed-up at the least cost.

a) As Amdahl's Law implies, the overall speed-up of an application due to an optimization depends on the percentage of execution time that the optimized portion consumes. Naturally, a designer wants to improve the portion of the system that yields the most benefits — be it a reduction in execution time or resource usage.

b) Other factors to consider are the amount of hardware resources needed, the complexity and time required to implement the hardware acceleration, and the ease of integration with the rest of the system. The ideal algorithms for acceleration are compute intensive, do not require continuously updated inputs from data sources, and need only limited memory accesses to complete their computations.

c) In addition, the data flow requirements need to be assessed, along with the impact that any changes may have on other system components such as memory, bus accesses and loading.

3) Use design automation tools (such as the aforementioned ESL tools) or engineering hours to implement the hardware acceleration function(s).

4) Because decisions at each intermediate step can significantly affect the resultant implementation, designers should go through multiple iterations of the first three steps to fine-tune their hardware-accelerated designs.
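As a rough illustration of step 2a, Amdahl's Law can be evaluated for each candidate function before committing engineering time. The sketch below is not from the reference design; the function names and execution-time shares in `candidates` are hypothetical profiler results, and the 10x local speed-up is an assumed figure.

```python
def amdahl_speedup(fraction_accelerated: float, local_speedup: float) -> float:
    """Overall speed-up when a fraction of run time is made local_speedup times faster."""
    if not 0.0 <= fraction_accelerated <= 1.0:
        raise ValueError("fraction_accelerated must be between 0 and 1")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    remaining = 1.0 - fraction_accelerated
    return 1.0 / (remaining + fraction_accelerated / local_speedup)


# Hypothetical profile: share of total execution time per function (assumed values).
candidates = {"idct": 0.80, "color_convert": 0.15, "file_io": 0.05}

for name, share in candidates.items():
    overall = amdahl_speedup(share, local_speedup=10.0)
    print(f"{name:14s} share={share:.0%}  overall speed-up={overall:.2f}x")
```

With these assumed numbers, a 10x accelerator for the function that takes 80% of the run time gives roughly 3.6x overall, while the same effort spent on a 5% function gives under 1.05x, which is why profiling comes first.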

Implementation of Software to Hardware Comparison System

An example of a real-world implementation is hardware acceleration of an Inverse Discrete Cosine Transform (IDCT) function in a Xilinx Virtex™-4 FPGA.

As mentioned previously, portions of software applications can run faster by moving the implementation into hardware. IDCT is one of the most compute-intensive functions in image encoding and decoding. Therefore, a fast and optimized DCT/IDCT implementation is essential in improving the performance of both the video encoder and decoder. This example compares the execution time of software IDCT versus that of IDCT performed in hardware.

Figure 3 is a block diagram of the primary modules implemented in the design. System-level data flow is described as follows:

First: Image data is loaded into DDR memory.

Next: DCT and IDCT operations are performed on the image data, and the output is stored in memory locations reserved for TFT display buffers.

Last: The TFT display controller reads output image data from DDR memory and sends it through the VGA port for display on a monitor.

Figure 3. Block diagram of the example system described above for design validation.

Processor role: After reading input pixel data from memory, the embedded PowerPC™ 405 (PPC405) processor uses either a software or a hardware IDCT routine to process the pixels and then stores the output in display memory.

Software IDCT

In the software IDCT implementation, IDCT operations are coded in C utilizing a Linux open source video player known as xine.
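To give a sense of why the IDCT is compute intensive, the following is a minimal reference sketch of a 2-D IDCT using the textbook row-column method. It is not the xine code used in the design, and the 8x8 block size is an assumption (it is the size commonly used in image and video codecs). A matching forward DCT is included only so the round trip can be checked.

```python
import math

N = 8  # block size; assumed, not stated in the article


def _scale(k: int) -> float:
    """Orthonormal DCT scaling factor for coefficient index k."""
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


def idct_1d(coeffs):
    """Inverse DCT of N coefficients: N*N multiply-accumulate steps."""
    return [
        sum(_scale(k) * coeffs[k] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
            for k in range(N))
        for n in range(N)
    ]


def dct_1d(samples):
    """Forward DCT of N samples, the inverse of idct_1d."""
    return [
        _scale(k) * sum(samples[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                        for n in range(N))
        for k in range(N)
    ]


def _apply_2d(block, transform_1d):
    """Apply a 1-D transform to every row, then to every column."""
    rows = [transform_1d(row) for row in block]
    cols = [transform_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # transpose back to row-major


def idct_2d(block):
    return _apply_2d(block, idct_1d)


def dct_2d(block):
    return _apply_2d(block, dct_1d)
```

In this form each block needs 16 one-dimensional passes of 64 multiply-accumulate steps each, on the order of 1,000 per 8x8 block, and every frame contains many such blocks. Production software implementations use faster factorizations, but the work per block remains substantial.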

Hardware IDCT

In the hardware implementation, all IDCT operations are implemented using built-in XTremeDSP™ slices of the FPGA. The System Generator™ tools for DSP from Xilinx are used to generate the IDCT module in Verilog. Custom co-processing instructions, decoded by the integrated Auxiliary Processor Unit (APU) controller, replace the C code of the software IDCT.

Theoretical throughput using APU quad-word transfers:
     - Assuming 6 cycles per quad-word (16-byte) transaction and a 100 MHz clock, the peak rate is
       (16 bytes) * (100,000,000 / 6 transactions/sec) / (1024 * 1024 bytes/MB) ≈ 254 MB/sec
     - This rate is not achievable in practice because of memory transfer and IDCT computation latency.
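The estimate above can be reproduced with a few lines of arithmetic. The sketch below uses only the figures quoted in the text (16 bytes per quad-word, 6 cycles per transaction, 100 MHz clock); the 12-cycle case at the end is a hypothetical value added to show how quickly extra latency erodes the peak.

```python
BYTES_PER_QUAD_WORD = 16
CLOCK_HZ = 100_000_000
CYCLES_PER_TRANSACTION = 6
BYTES_PER_MB = 1024 * 1024  # the article's figure uses binary megabytes


def peak_throughput_mb_s(bytes_per_transaction: int,
                         clock_hz: int,
                         cycles_per_transaction: int) -> float:
    """Peak transfer rate in MB/s if every transaction takes a fixed cycle count."""
    transactions_per_second = clock_hz / cycles_per_transaction
    return bytes_per_transaction * transactions_per_second / BYTES_PER_MB


peak = peak_throughput_mb_s(BYTES_PER_QUAD_WORD, CLOCK_HZ, CYCLES_PER_TRANSACTION)
print(f"Theoretical peak: {peak:.1f} MB/s")

# Hypothetical: if added latency doubled the cycles per transaction to 12,
# the achievable rate would be halved.
degraded = peak_throughput_mb_s(BYTES_PER_QUAD_WORD, CLOCK_HZ, 12)
print(f"With 12 cycles per transaction: {degraded:.1f} MB/s")
```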

System Overview

The system consists of an integrated PPC405 core and APU, hardware accelerator modules for the IDCT, memory, and a display controller. The PPC405 processor offloads computations to hardware modules in the FPGA through the APU co-processor interface. Using custom instructions for hardware acceleration, the processor sends data to and receives data from the hardware acceleration modules through the APU, as shown in Figure 4.

In the hardware-accelerated IDCT system, data flows in the following sequence:

1. An IDCT operation begins with the processor forwarding a load instruction for IDCT input data to the APU.

2. The APU passes the instruction to a hardware accelerator block in the FPGA that waits for data from memory.

3. When the data arrives, the hardware accelerator block sends it to the IDCT module.

4. Meanwhile, the processor forwards a store instruction to the APU in anticipation of the IDCT output.

5. Eventually, the IDCT module returns IDCT results to the processor via the APU. This data is then written back to memory.

Figure 4. IDCT Engine content and data flow

From the data in Table 1, it is clear that offloading the processor to a custom IDCT co-processing engine both accelerates the application and reduces the code size. The final performance metric, frames per second displayed on a 480x480-pixel VGA monitor, takes into account data transfer latency as well as IDCT execution latency.

Metric              | Without Acceleration | With Hardware Acceleration
Frames / Second     | 4                    | 28
Lines of Code       | 160                  | 8
Acceleration        | base                 | 7X
Operating Frequency | CPU 300 MHz          | CPU 300 MHz, APU I/F 100 MHz, Co-processing Engine 100 MHz

Table 1. Comparison of key metrics for video system example
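The figures in Table 1 can be cross-checked with simple arithmetic. The sketch below derives the 7X acceleration from the two frame rates and then applies Amdahl's Law in reverse as a rough bound. That last step is an inference rather than a measurement from the design: it assumes the frame time splits cleanly into an accelerated portion and an unchanged remainder.

```python
# Values taken from Table 1.
fps_software = 4
fps_hardware = 28
lines_software = 160
lines_hardware = 8

speedup = fps_hardware / fps_software            # overall acceleration
code_reduction = lines_software / lines_hardware  # reduction in lines of code

frame_time_software_ms = 1000.0 / fps_software
frame_time_hardware_ms = 1000.0 / fps_hardware

# Amdahl's Law caps the overall speed-up at 1 / (1 - p), where p is the
# fraction of original run time that was accelerated. Rearranged, an observed
# speed-up S implies p >= 1 - 1/S.
min_accelerated_fraction = 1.0 - 1.0 / speedup

print(f"Overall speed-up:        {speedup:.0f}X")
print(f"Code size reduction:     {code_reduction:.0f}X")
print(f"Frame time:              {frame_time_software_ms:.0f} ms -> {frame_time_hardware_ms:.1f} ms")
print(f"Accelerated share >=     {min_accelerated_fraction:.1%} of original frame time")
```

Under that assumption, at least about 86% of the original per-frame time must have been spent in the work that moved to hardware, which is consistent with the IDCT being the dominant bottleneck in this application.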

Maximizing Acceleration

It is important to note that co-processing hardware accelerators tightly coupled to processors may not be the best solution for every system. For an application whose execution time is dominated by compute latency rather than data transfer latency, a peripheral connected to the processor via a system bus will likely yield performance improvements similar to those of a co-processor.
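A simple latency model makes this trade-off concrete. The sketch below is illustrative only: it assumes transfer and compute do not overlap, and every cycle count is a hypothetical value chosen to show the two regimes, not a measurement of the APU or of any bus.

```python
def cycles_per_block(transfer_cycles: int, compute_cycles: int) -> int:
    """Total cycles for one block, assuming transfer and compute do not overlap."""
    return transfer_cycles + compute_cycles


def coprocessor_advantage(compute_cycles: int,
                          bus_transfer_cycles: int,
                          coprocessor_transfer_cycles: int) -> float:
    """How many times faster the tightly coupled path is than the bus peripheral."""
    bus_total = cycles_per_block(bus_transfer_cycles, compute_cycles)
    coprocessor_total = cycles_per_block(coprocessor_transfer_cycles, compute_cycles)
    return bus_total / coprocessor_total


# Hypothetical transfer costs per block of data.
BUS_TRANSFER = 200
COPROCESSOR_TRANSFER = 50

# Compute-dominated workload: the interface choice barely matters.
compute_bound = coprocessor_advantage(10_000, BUS_TRANSFER, COPROCESSOR_TRANSFER)

# Transfer-dominated workload: the tightly coupled interface pays off.
transfer_bound = coprocessor_advantage(100, BUS_TRANSFER, COPROCESSOR_TRANSFER)

print(f"Compute-dominated:  co-processor is {compute_bound:.2f}x faster")
print(f"Transfer-dominated: co-processor is {transfer_bound:.2f}x faster")
```

With these assumed numbers, the tightly coupled path is only about 1% faster when compute dominates but twice as fast when transfers dominate.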

The areas that provide the greatest acceleration are: 1) the most frequently executed functions; 2) the most time-consuming functions; and 3) functions for which a more efficient hardware implementation is available, e.g., a hardware floating point unit instead of software floating point emulation. Less advantageous candidates are algorithms that are not compute intensive and implementations that are hardware resource intensive.

Conclusion

This article captures some of the key considerations for algorithmic acceleration through hardware co-processing on an FPGA system. ESL design automation tools that automate this process are actively being developed. However, due to tradeoffs between performance, resource utilization and cost, it is still difficult to produce a true push-button solution. Besides tackling the complexity of partitioning resources between software and hardware, ESL tool developers must also continue to work with FPGA vendors to achieve tight integration with synthesis and place-and-route flows.

The general methodology for co-processing hardware acceleration, illustrated here with an image processing application, demonstrates that offloading the processor can yield a significant increase in overall performance, depending on the nature of the algorithm or function offloaded.



August 30, 2005

 

About the Authors

Harn Hua Ng is a Senior Systems Design Engineer for the Advanced Products Division at Xilinx. His responsibilities at Xilinx include PowerPC embedded hardware and software. Harn Hua develops evaluation platforms and reference designs, and explores new ways of deploying FPGA-based systems.

Dan Isaacs is Director of Embedded Processor Marketing for the Advanced Products Division at Xilinx. His responsibilities at Xilinx involve PowerPC embedded infrastructure. Dan has over 20 years of experience in all aspects of engineering, including hardware and software design as well as systems engineering. Recent engineering experience includes Ford Motor, NEC Electronics and LSI Logic.



All material on this site copyright © 2006 techfocus media, inc. All rights reserved.
FPGA and Structured ASIC Journal