FPGAs Provide Acceleration for Software Algorithms

The latest generation of FPGAs featuring embedded processors offers a compelling platform for hardware acceleration of computationally intensive software algorithms. Design teams taking advantage of these platforms are finding FPGAs to be low-cost, low-risk platforms for application prototyping as well as for use in high-performance end products. Although hardware-savvy engineers have been quick to embrace embedded-processor-based FPGAs (or Platform FPGAs), the lack of adequate design methods and a general unfamiliarity with hardware design concepts have kept traditional software and embedded application developers from seriously considering these platforms.

Until recently, FPGA-based applications were the exclusive domain of the hardware designer. Advances in design tools supporting software-oriented design techniques for programmable hardware platforms have changed this. These tools have brought the power of FPGA-based application design directly to the software programmer. At the same time, system designers (of both hardware and software disciplines) have been given the means to evaluate hardware/software tradeoffs before committing a particular application component (or algorithm) to hardware and going through the lengthy process needed to implement such a component using traditional hardware design methods.

Evaluating the Benefits of a Mixed Hardware/Software Approach

For embedded system designers considering a mixed software/hardware implementation, an important step is to evaluate the relative merits of various hardware and software targets for specific elements of the overall system. Often this evaluation is based on the designer’s past experience, on “back of the envelope” calculations of raw instruction cycle counts for key calculations, or on an analysis of data throughput requirements.
One specific step in such an evaluation may be to implement the same algorithm (which typically represents one component or software process within a larger hardware/software application) both on an embedded processor and in hardware (possibly within an FPGA) to analyze the relative merits of a software versus hardware solution. In doing so, the designer must evaluate not only the resulting performance of the core algorithm but also the relative overhead of setting up and managing the necessary software/hardware interfaces.

A critical factor in such an evaluation is the ability to compile algorithms (represented using a software programming language such as C) to both a traditional microprocessor and to hardware, creating a test environment that allows both software and hardware versions to be tested and measured using identical inputs. Such an environment supports software-based experimentation and allows a high degree of creativity on the part of the designer. The software developer is free to quickly evaluate radically different ways of partitioning, describing and implementing a mixed hardware/software application, without needing to write low-level hardware descriptions for the portions destined for hardware or to make tedious hand calculations to determine relative performance numbers.

In algorithms involving streams of data (which represent a large percentage of the required processing in such domains as image processing and communications) there is a critical tradeoff to be considered: the amount of computation required versus the data transfer overhead. More specifically, any evaluation of the merits of a hardware-based approach must consider (through direct measurement if possible) the cost of moving data between the software components of the system (running on a traditional processor) and the dedicated hardware. Every application evaluated in this way will yield different cost/benefit results.
The greatest benefits of hardware-based approaches are found in algorithms that are not unduly constrained by I/O, are computationally intensive, and include opportunities for parallelism that can be exploited either at a low level (by scheduling statements within loops, for example) or at the level of pipelined or parallel processes. The ability to compile software algorithms directly to the FPGA makes such cost/benefit analysis a more efficient, less risky process.

Example: Software vs. Hardware Encryption

To demonstrate how such an evaluation can be performed, consider the problem of data encryption, in which a stream of incoming data must be processed very quickly against a specified set of values (the key) to generate a resulting encrypted or decrypted data stream. Such a problem may involve substantial computation but is also bandwidth-intensive: the final implementation must not compromise data throughput in order to increase overall performance.

To evaluate software versus hardware implementations of data encryption, we began with publicly available source code for the triple DES encryption algorithm written by Phil Karn (Qualcomm) and based on the algorithm described in Applied Cryptography (written by Bruce Schneier and published in 1995 by John Wiley & Sons). This original source code was written in standard C and was not optimized for any specific processor target, nor was it written to take advantage of algorithm-level parallelism. We used CoDeveloper™ (available from Impulse Accelerated Technologies) to convert the original C code to a version suitable for hardware compilation in the selected FPGA target and to perform the required C-to-hardware compilation.
Because our goal in this evaluation was to quickly evaluate the relative performance and tradeoffs of hardware versus software implementations for one specific algorithm (the triple DES encryption function, represented by approximately 180 lines of C source code), we decided at the outset that we would make the minimum changes necessary to allow efficient hardware compilation, and would refrain from making non-obvious changes to the algorithm as a whole. The changes made to the encryption function in support of hardware compilation were as follows:

1. We modified the encryption function (using the Impulse C™ library functions provided with CoDeveloper) so that it operates on a stream of data rather than on a static global array. This change better reflects a real-world application, and also reflects the preferred programming model for hardware/software interfaces.
2. We created an additional configuration data stream as an input that accepts the encryption key (the key schedule) as well as the “SP box” static data specified by the encryption algorithm. (In the legacy C version these values were also accessed via global arrays.)
3. We created top-level “producer” and “consumer” processes (also written in C, and again described using the Impulse C libraries) that serve as a test bench for the algorithm. These allow us to stream random text characters into both the original, legacy C algorithm (which is compiled along with the test producer and consumer processes into native executable code on the embedded processor) and the hardware version, which is compiled directly to hardware and operates on the FPGA alongside the embedded processor.

As we’ll describe in a moment, we also created a more comprehensive test application (developed using Microsoft® Visual Studio™) that exercises the encryption algorithm in a desktop simulation environment.
This test application combines the two encryption functions (the legacy C version and the Impulse C version) along with corresponding decryption algorithms to verify the functional correctness of the application using various text inputs. This test was set up and run, and results verified before going to the next step and compiling to the target FPGA platform. The original (legacy) encryption function and the corresponding Impulse C process are summarized below (the C algorithm itself has been omitted for brevity):
Performing Software Simulation

Before going through the process of choosing an FPGA-based platform target and compiling/synthesizing the encryption algorithm to that target, we first used a standard C development environment (Microsoft Visual Studio) to verify that the application, including both the legacy C code and the modified Impulse C version of the code, was correct in terms of the computations being performed. Because the Impulse C libraries are compatible with most popular C development environments, we were able to duplicate this test using two versions of Visual Studio (version 6 and .NET) as well as Metrowerks® CodeWarrior™ and the freely available, GCC-based Dev-C++ tools.

By using a standard IDE in conjunction with the Application Monitor provided with CoDeveloper, we were able to make use of standard C debugging techniques (including source-level debugging) while at the same time observing how data moved between the various processes in the system (which now included two versions of both the encryption and decryption processes, plus the test consumer and producer processes). During this process we noted that the performance of the encryption algorithm could be improved somewhat by increasing the size of the data buffer feeding the encryption process.

After simulating its functionality using standard desktop tools, we were ready to implement the application on a mixed FPGA/processor target. We chose a Xilinx® MicroBlaze™-based FPGA target for this test, selecting the Virtex-II MicroBlaze Development Kit (available from Memec Design, a division of Insight Electronics) as our reference system. The Memec kit includes a hardware reference board populated with a Virtex-II FPGA and various peripheral interfaces, as well as all development tools needed to compile and synthesize hardware and software applications (consisting of HDL source files for hardware and C source files for software) to the FPGA target. When combined with Impulse CoDeveloper, this kit provided us with everything needed to compile and execute the test application from our C language source files. The design flow using CoDeveloper in conjunction with the Xilinx-provided tools is illustrated below:
The following detailed steps were required to compile the encryption test application to the target:
A 36X Performance Increase Over the Software-Only Approach

The results of the test (expressed as the computation times for a specified number of data blocks) were generated using timers available on the MicroBlaze processor and invoked from within the C language test application. The relative performance of the MicroBlaze (software-only) version of the application and the combined MicroBlaze and FPGA application (the hardware version) is shown below.
Software versus hardware performance (1000 blocks of text data)

The results demonstrated that, for this algorithm running on the Virtex-II, a hardware implementation would result in significantly faster performance (a 36X speedup) than a software-only solution, even with the modest overhead of data communication between the processor and the FPGA-based encryption algorithm. This is due in part to the extremely low data communication overhead introduced by the Xilinx FSL bus, and in part to the CoDeveloper compiler’s ability to find and exploit low-level parallelism within the inner code loop of the algorithm.

While 36X is substantial (and suggests that a hardware implementation of this algorithm may be appropriate), it is actually on the low end of what is possible when implementing software algorithms in programmable hardware. For this algorithm, further performance increases (as well as reductions in gate count requirements) could likely be obtained by optimizing the algorithm itself, for example by reordering statements to better enable pipelining or by invoking the three stages of the triple DES algorithm in parallel. In this case, however, such changes would likely not result in a large increase in speed because of the data bandwidth requirements: the amount of data being moved into and out of the processor is relatively large compared with the amount of actual computation being performed. For other applications, such as image processing and similarly compute-intensive applications, the potential performance increases are far more dramatic and the decision to introduce dedicated or programmable hardware (such as an FPGA) is more obvious.

This sample application has demonstrated how an algorithm that is a candidate for hardware acceleration can, with minimal work, be implemented on a mixed hardware/software platform for the purpose of performance evaluation.
CoDeveloper’s C-to-hardware compilation capabilities, coupled with readily available FPGA tools and reference hardware, simplify and speed the creation of mixed hardware/software solutions.

David Pellerin, CTO, Impulse Accelerated Technologies, Inc.
Milan Saini, Technical Marketing Manager, Xilinx, Inc.

Comments on this article? Send them to comments@fpgajournal.com
All material on this site copyright © 2006 techfocus media, inc. All rights reserved.