SPONSORED WHITE PAPER

Teja Provides a Faster Path to Packet Processing Performance

When designing a packet processing subsystem, an engineer's primary goal is to make sure that the design can process packets as quickly as they arrive. At gigabit line rates, this is no small task. Designing systems that process packets is straightforward; designing them to do it quickly is where the magic lies. The biggest bottleneck in the rush to market is the time required to ensure that the design achieves line-rate performance.
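To put "no small task" into numbers, the following C sketch estimates the worst-case packet rate on an Ethernet link and the processor cycles available per packet. The frame figures (64-byte minimum frame, 8-byte preamble, 12-byte inter-frame gap) are standard Ethernet values; the 100 MHz processor clock used in the note below is purely an illustrative assumption, not a figure from Teja or Xilinx, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Wire footprint of a minimum-size Ethernet frame, in bytes:
 * 64-byte frame + 8-byte preamble/SFD + 12-byte inter-frame gap. */
#define MIN_FRAME_WIRE_BYTES (64u + 8u + 12u)

/* Worst-case packet arrival rate for a line rate given in bits per second. */
static uint64_t max_packets_per_second(uint64_t line_rate_bps)
{
    return line_rate_bps / (MIN_FRAME_WIRE_BYTES * 8u);
}

/* Processor cycles available per packet when `engines` processors,
 * each clocked at `clock_hz`, share the load evenly (an idealization). */
static uint64_t cycles_per_packet(uint64_t line_rate_bps,
                                  uint64_t clock_hz,
                                  unsigned engines)
{
    uint64_t pps = max_packets_per_second(line_rate_bps);
    assert(engines > 0 && pps > 0);
    return (clock_hz * engines) / pps;
}
```

At 1 Gb/s this works out to roughly 1.49 million minimum-size packets per second. Under the assumed 100 MHz clock, one processor would have about 67 cycles per packet and four processors about 268, which is why the per-packet code path has to be tuned so carefully.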

Gigabit-class packet processing is implemented on a variety of multi-core hardware platforms. Some, like the Intel IXP network processor family, consist of specialized processors in a tailored architecture. Others, like the Broadcom SiByte family, are multiple symmetric RISC processors that can be used to accelerate the packet processing load. Programming this kind of system can be a significant amount of work, and has historically involved a lot of low-level programming and tuning. Teja Technologies has eased this effort significantly by providing applications and tools that raise the level of programming to C and that automate many of the more tedious tasks associated with managing multi-core processing, without compromising performance.

Teja has also given designers the ability to run “functional simulation”: executing code not on the target hardware (which might not even be ready in the early stages of code development), but on the host computer. Such high-level simulation executes much more quickly than the more detailed cycle-accurate simulators while still processing line network traffic, and provides a quick way to tackle most of the logic bugs in the algorithm. As obvious a capability as this might seem, it is not available from any other provider.

This combination of higher-level programming and functional simulation allows designers to achieve line rate more quickly on the platforms that Teja supports. By making it easier for engineers to try different approaches and compile them quickly, optimization moves more quickly, and a market-ready solution is in place sooner. However, even with this approach, there remains an inflexible element to the process: the architecture of the underlying hardware. Significant time can be spent making sure that the software conforms to the requirements of the architecture, whether that means using registers efficiently, fitting the code into the fixed code store allocation, or accommodating any other hardware element. If the hardware architecture itself could be adjusted to make the software fit more easily, weeks or months of optimization could be eliminated, further accelerating convergence to line rate.

A Flexible Architecture

The first thing that comes to mind when thinking about flexible hardware is an FPGA. But a couple of issues have kept FPGAs from participating significantly as core packet processors. For one, FPGAs are traditionally designed using a hardware methodology, typically RTL. In those cases where C is used, it is converted to RTL, so even though it starts out looking like software, it’s still a hardware approach. But packet processing is most commonly done in software for reasons of flexibility, history, and accessibility to a wide range of software engineers. There is a strong base of legacy code implementing packet processing algorithms, so moving all of that to a hardware model is an onerous proposition.

What we need is a way to process packets in software on an FPGA. Processor cores are available on FPGAs; two examples are the PowerPC and MicroBlaze cores in the Xilinx Virtex family. The PowerPCs are hard-instantiated in silicon on the Virtex 2 Pro and Virtex 4 FX families. The MicroBlaze is a soft 32-bit RISC processor core that is built out of the logic fabric in a Xilinx FPGA. The Xilinx Embedded Development Kit (EDK) provides a level of abstraction above the usual RTL view to allow designers to assemble processor-oriented systems. It deals with such elements as busses, memories, and peripherals, and takes care of the details associated with translating that down to the gate level.

Packet processing at the gigabit level, however, requires multiple processors. There are two challenges with this. The most significant is that designing an effective multi-core fabric from scratch is non-trivial. Issues of scheduling and resource sharing for high performance require a lot of attention, and any project using this approach would need to dedicate a significant amount of time simply to designing the fabric; the actual application work would be additional. The second challenge is the fact that assembling such a system manually in the EDK can be a lot of work simply due to project size and complexity. What is required is a pre-designed multi-core fabric and tools to simplify the creation of multi-core projects for the EDK.

If we can solve these two problems, then we get access to a methodology for creating a multi-core architecture that is flexible. That means that we can shape the hardware to the needs of the software just as we shape the software to the needs of the hardware. By working both ends towards the middle, we can converge on a high-throughput solution more quickly.

Building a Multi-Core Fabric

Teja has built a scalable multi-core fabric for use in packet processing applications. Because the MicroBlaze processor is a soft core, it can be instantiated as many or as few times as needed to build a custom-sized pipeline. Teja has combined this processor with local memory to create a processing engine (Figure 1) that can be replicated as needed.

Figure 1. MicroBlaze-based processing engine

Packet processing algorithms are well suited to pipeline structures. Teja has built dedicated IP for hooking these engines together into a parallel pipeline of arbitrary size and configuration. The IP consists of:

  • Access into and out of the pipeline
  • Communication blocks for passing tasks along the pipeline
  • A means of attaching hardware accelerators – offloads – to the engine

The complete pipeline is shown in Figure 2.

Figure 2. Basic multi-core pipeline
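As a software analogy for the pipeline structure described above, the following C sketch connects two processing stages through small FIFOs, so that each stage only pulls a task, does its work, and hands the task on. It is not Teja's IP or API: the task fields, the FIFO depth, the stage functions, and the toy forwarding rule are all hypothetical, chosen only to illustrate tasks being passed along a pipeline.

```c
#include <assert.h>
#include <stdint.h>

#define FIFO_DEPTH 8u   /* hypothetical depth; must be a power of two */

/* A task travelling down the pipeline (hypothetical fields). */
typedef struct {
    uint32_t dst_ip;
    uint8_t  ttl;
    int      out_port;
} task_t;

/* Ring buffer standing in for a communication block between stages. */
typedef struct {
    task_t   slot[FIFO_DEPTH];
    unsigned head;   /* count of tasks read so far */
    unsigned tail;   /* count of tasks written so far */
} fifo_t;

static int fifo_full(const fifo_t *f)  { return f->tail - f->head == FIFO_DEPTH; }
static int fifo_empty(const fifo_t *f) { return f->tail == f->head; }

static int fifo_push(fifo_t *f, const task_t *t)
{
    if (fifo_full(f)) return 0;
    f->slot[f->tail % FIFO_DEPTH] = *t;
    f->tail++;
    return 1;
}

static int fifo_pop(fifo_t *f, task_t *out)
{
    if (fifo_empty(f)) return 0;
    *out = f->slot[f->head % FIFO_DEPTH];
    f->head++;
    return 1;
}

/* Stage 1: drop packets whose TTL has expired, decrement it otherwise. */
static void stage_ttl(fifo_t *in, fifo_t *out)
{
    task_t t;
    while (!fifo_full(out) && fifo_pop(in, &t)) {
        if (t.ttl <= 1) continue;          /* expired: drop the packet */
        t.ttl--;
        (void)fifo_push(out, &t);          /* cannot fail: checked above */
    }
}

/* Stage 2: toy forwarding decision based on the low address bits. */
static void stage_lookup(fifo_t *in, fifo_t *out)
{
    task_t t;
    while (!fifo_full(out) && fifo_pop(in, &t)) {
        t.out_port = (int)(t.dst_ip & 0x3u);
        (void)fifo_push(out, &t);
    }
}
```

In the FPGA fabric each stage would run on its own processing engine and the hand-off would be done by hardware; here the stages are simply called in turn, for example stage_ttl(&rx, &mid) followed by stage_lookup(&mid, &tx).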

The pipeline access blocks interface to the port hardware. In the first instance, they are attached to the Ethernet Media Access Controller; they could similarly be connected to an ATM or other port.

The communication blocks take on the responsibility for moving the task and any associated data from one stage to the next. This allows the processors to stay focused on the actual application code rather than worrying about internal bookkeeping. These blocks also provide load balancing, to prevent one slow packet from becoming a bottleneck in the entire system. They also allow the kind of irregular pipeline illustrated in Figure 3.

Figure 3. Irregular multi-core pipeline
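The article does not say which load-balancing policy the communication blocks use. As one plausible illustration, the C sketch below hands each new task to the engine with the fewest tasks outstanding, so an engine stuck on a slow packet is bypassed until it catches up. The least-loaded policy, the engine_t type, and the function names are assumptions made for this example only.

```c
#include <assert.h>
#include <stddef.h>

/* Per-engine bookkeeping: how many tasks are queued or in flight. */
typedef struct {
    unsigned pending;
} engine_t;

/* Return the index of the engine with the fewest pending tasks.
 * Ties go to the lowest index. */
static size_t pick_engine(const engine_t *engines, size_t n)
{
    size_t best = 0;
    assert(n > 0);
    for (size_t i = 1; i < n; i++) {
        if (engines[i].pending < engines[best].pending)
            best = i;
    }
    return best;
}

/* Hand a new task to the least-loaded engine and return its index. */
static size_t dispatch(engine_t *engines, size_t n)
{
    size_t e = pick_engine(engines, n);
    engines[e].pending++;
    return e;
}

/* Called when an engine finishes a task. */
static void complete(engine_t *e)
{
    assert(e->pending > 0);
    e->pending--;
}
```

With three engines in a stage, if engine 0 holds two unfinished tasks while engines 1 and 2 keep completing theirs, every new task goes to engine 1 or 2 until engine 0 drains its backlog.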

Offloads are a key capability of this infrastructure. They allow compute-intensive or high-latency functions to be executed in hardware rather than software. A well-defined interface allows them to be attached to a processing engine; existing logic can be wrapped to match the interface requirements. Alternatively, Teja provides a utility that can create an offload that meets the interface requirements, along with a testbench for validating the offload.

The offloads can be synchronous or asynchronous (Figure 4). In a synchronous offload, the processor hands off the task to the offload and awaits the result before processing further. For tasks that take a trivial number of clock cycles when accelerated, this is an effective approach. For longer tasks, however, it is better to create asynchronous offloads. After handing a task to an asynchronous offload, the processor will start working on a different packet while the offload is executing. This keeps the processor busy, minimizing stall time. Once the offload is finished, it puts its result back into the processor’s queue, and when the processor pulls it from the queue it will continue processing. Teja manages the state information required to ensure that processing proceeds from the correct point.

Figure 4. Synchronous and Asynchronous Offloads
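The difference between the two offload styles can be modelled in a few lines of C. In this sketch each packet context records the point at which processing should resume, the processor pulls contexts from a ready queue, and a finished offload puts the context back on that queue. This is only a software model under stated assumptions: the types, the queue depth, and the stand-in offload function are hypothetical, and Teja's actual state management is not described in the article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_DEPTH 4u   /* hypothetical; must be a power of two */

typedef enum { STEP_START, STEP_AFTER_OFFLOAD, STEP_DONE } step_t;

/* Saved state for one packet, including where to resume. */
typedef struct {
    uint32_t payload;
    uint32_t offload_result;
    step_t   resume_at;
} packet_ctx_t;

/* The processor's queue of packets that are ready to run. */
typedef struct {
    packet_ctx_t *slot[QUEUE_DEPTH];
    unsigned head, tail;
} ready_queue_t;

static int queue_put(ready_queue_t *q, packet_ctx_t *p)
{
    if (q->tail - q->head == QUEUE_DEPTH) return 0;   /* full */
    q->slot[q->tail++ % QUEUE_DEPTH] = p;
    return 1;
}

static packet_ctx_t *queue_get(ready_queue_t *q)
{
    if (q->tail == q->head) return NULL;              /* empty */
    return q->slot[q->head++ % QUEUE_DEPTH];
}

/* Stand-in for a hardware accelerator (here a trivial bit inversion). */
static uint32_t offload_compute(uint32_t x) { return x ^ 0xFFFFFFFFu; }

/* Synchronous style: the processor waits for the result, then finishes. */
static void process_sync(packet_ctx_t *p)
{
    p->offload_result = offload_compute(p->payload);
    p->resume_at = STEP_DONE;
}

/* Asynchronous style: run the next ready packet until it either hands
 * work to the offload or completes. Returns the packet, or NULL if idle. */
static packet_ctx_t *processor_step(ready_queue_t *q)
{
    packet_ctx_t *p = queue_get(q);
    if (p == NULL) return NULL;
    if (p->resume_at == STEP_START)
        p->resume_at = STEP_AFTER_OFFLOAD;  /* offload started; save resume point */
    else if (p->resume_at == STEP_AFTER_OFFLOAD)
        p->resume_at = STEP_DONE;           /* result available; finish packet */
    return p;
}

/* The offload finishes: store the result and re-queue the packet. */
static int offload_finish(ready_queue_t *q, packet_ctx_t *p)
{
    p->offload_result = offload_compute(p->payload);
    return queue_put(q, p);
}
```

With two packets queued, the processor starts the first, moves straight on to the second while the first packet's offload is still outstanding, and picks the first one up again from its saved resume point once offload_finish has re-queued it.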

Building a Packet Processing Subsystem

A complete subsystem consists not only of the pipeline for processing typical packets, but also of a means for handling the control plane and any “exception” packets in the packet stream. In the Xilinx Virtex 4 FX family, the built-in PowerPC can be used for this. In other families, an external processor can be used. All of the elements of the required subsystem are shown in Figure 5.

Figure 5. Complete packet processing subsystem

Configuring such a system in a typical design process is straightforward (Figure 6). Four inputs are fed into Teja’s TejaCC tool: a definition of the hardware architecture (most of which is boilerplate; an API makes the pipeline definition easy), a definition of the software architecture (also mostly boilerplate), the application code, and a mapping (variables to memory; functions to thread “mains”). TejaCC creates a project for the EDK; the EDK then processes the project to produce an FPGA bitstream and any external control plane object code.

Figure 6. Creating an EDK project

All of the key elements for creating this subsystem are provided by Teja, specifically:

  • The FPGA IP required to implement the key building blocks of the multi-core structure
  • APIs for architecture and system configuration, for profiling, for multi-core state management, for interfacing between the control and data planes, etc.
  • Tools to simplify configuration and create the EDK project

In addition, high-level application source code from Teja can jump-start a project by providing line-rate performance on key common code that does not have to be re-invented, but can be customized at the source level.

Together these elements provide the first configurable multi-core engine on an FPGA. They give network OEMs a means to achieve line rate more quickly by exploiting both software and hardware flexibility. More information on this technology is available at www.teja.com/xilinx or by contacting bmoyer@teja.com.

March 30, 2006


Comments on this article? Send them to comments@fpgajournal.com

 
All material on this site copyright © 2006 techfocus media, inc. All rights reserved.
FPGA and Structured ASIC Journal