Introduction

Orthogonal Frequency Division Multiplexing (OFDM) transceivers are widely used in wireless applications including ETSI DVB-T/H digital terrestrial television transmission and IEEE network standards such as 802.11 (“WiFi”), 802.16 (“WiMAX”) and 802.20 (proposed PHY). Such transceivers have large arithmetic processing requirements which can become prohibitive if implemented in software on a DSP processor. However, the highly pipelined nature of much of the processing lends itself well to a hardware implementation. A flexible solution such as an FPGA implementation has the added advantage of allowing late modifications in response to “real world” performance evaluation, or for requirement changes if the initial design is based on a draft specification.

This paper describes an FPGA implementation of an OFDM transceiver design based on the core physical layer (PHY) requirements of the “WiMAX” 802.16-2004 OFDM specification. The use of IP cores for already well-defined DSP functions can help to reduce the development time. Incorporation of multiple such DSP IP cores into both the design and the design flow is outlined. Various architectural design considerations are discussed, together with how features of the structure of the FPGA can be taken advantage of. Details are given of the final implemented design, the resulting performance and the total FPGA resources used.

Background: OFDM Overview

Orthogonal Frequency Division Multiplexing (OFDM) is a form of multi-carrier transmission. Multi-carrier methods have been in use since 1957 and OFDM was patented at Bell Labs in 1966. The first multi-carrier radios used a bank of filters to separate the signals but Fast Fourier Transforms (FFTs) have been in use for this purpose since 1971. With the advent of low cost FPGAs with DSP capabilities like the Lattice ECP family, it has become possible to implement the physical layer (PHY) of a complete flexible transceiver design in a single programmable device.
Consider a set of “sub-carriers” at equally spaced frequencies Fspc*n, where Fspc is the sub-carrier spacing, n = 0..(Nused-1) and Nused is the total number of sub-carriers used. If Nused is selected to be just less than or equal to a power of 2, such as 256, the sub-carriers can be efficiently generated in a transmitter using an inverse FFT (IFFT). The receiver then uses an FFT to separate out the sub-carriers. The sub-carrier spacing Fspc would then be defined as:

Fspc = Fs / NFFT

where Fs is the baseband sampling rate at the FFT and NFFT is the number of points in the FFT (NFFT >= Nused). In OFDM, the phase and amplitude of each sub-carrier is held at a constant value for a whole “symbol” period (NFFT / Fs), which is equal to the time required to fill the FFT with samples. This is illustrated in figure 1, which shows the real or imaginary component of four sub-carriers with frequencies of Fspc, Fspc*2, Fspc*3 and Fspc*4 and their sum (bottom curve), over the duration of one symbol period.

Figure 1: Four orthogonal carriers combined into an OFDM signal (time domain plots)

The term “orthogonal” in OFDM refers to the fact that because these sub-carriers have a constant modulation value for the whole FFT period and are spaced at the FFT’s natural frequency “bin” spacing, there will be no inter-carrier interference (ICI) – in other words, the FFT can perfectly separate the sub-carriers from the sum signal.

The IEEE 802.16-2004 OFDM standard defines a PHY that uses 256 sub-carriers which are each modulated with Binary Phase Shift Keying (BPSK), Quadrature Phase Shift Keying (QPSK), or Quadrature Amplitude Modulation with 16 and 64 point constellations (QAM16, QAM64). This means 1, 2, 4 or 6 bits of information can be transmitted over 1 symbol period on each sub-carrier. Eight sub-carriers are allocated for pilot signals, and the 27 highest-frequency and 28 lowest-frequency sub-carriers are null.
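The orthogonality property described above can be checked numerically. The following is a minimal sketch, not taken from the design itself: it uses a direct DFT in plain Python, a 16-point transform as a small stand-in for the 256-point FFT, and four arbitrarily chosen QPSK-like values on sub-carriers 1 to 4 (mirroring figure 1). The transmitter side is an inverse transform and the receiver side a forward transform.

```python
import cmath

def dft(x, inverse=False):
    """Direct O(N^2) DFT; adequate for a small illustrative transform size."""
    n = len(x)
    sign = 1 if inverse else -1
    scale = 1.0 / n if inverse else 1.0
    return [
        scale * sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                    for t in range(n))
        for k in range(n)
    ]

NFFT = 16  # assumed small stand-in for the 256-point FFT

# One constellation value per sub-carrier; unused (null) bins stay at zero.
tx_bins = [0j] * NFFT
tx_bins[1] = 1 + 1j
tx_bins[2] = -1 + 1j
tx_bins[3] = 1 - 1j
tx_bins[4] = -1 - 1j

symbol = dft(tx_bins, inverse=True)  # transmitter: IFFT gives the summed time signal
rx_bins = dft(symbol)                # receiver: FFT separates the sub-carriers again

max_err = max(abs(a - b) for a, b in zip(tx_bins, rx_bins))
print(f"largest recovery error: {max_err:.2e}")
```

The recovery error is at floating-point rounding level, which is the numerical counterpart of "no inter-carrier interference" for an ideal channel and a correctly aligned FFT window.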
The pilots are used to aid channel estimation as they are known, a priori, at the receiver. The nulls implement frequency guard bands. The sub-carrier with frequency zero is also null, so there is no DC component in the signal. One of the key channel distortions experienced in non line of sight (NLOS) transmissions is multi-path distortion where the receiver will see several delayed and phase-rotated versions of the transmitted signal. To combat this in OFDM systems, a cyclic prefix (CP) is added before the start of each transmitted symbol to act as a guard period preventing inter-symbol interference (ISI), provided that the delay spread in the channel is less than the guard period. This guard period is specified in terms of the fraction of the number of samples that make up a symbol so 1/8 cyclic prefix means 256/8 = 32 samples. The cyclic prefix contains a copy of the end of the forthcoming symbol – as illustrated in figure 2, which shows the symbol summation from figure 1 with the added cyclic prefix. Figure 2: Adding the Cyclic Prefix to the Symbol Because the cyclic prefix is a repeat of the end of the symbol, even if the start of the FFT symbol period is slightly advanced into the cyclic prefix due to timing uncertainty, then orthogonality will be retained – no inter-carrier interference will be introduced. The cyclic prefix can also be useful for both detecting the start of a symbol and estimating the frequency offset between the transmitter and the receiver. OFDM transmission can either be a continuous stream of symbols (as in the DVB-T system), or sent as a burst of a fixed length of symbols. The transmission in IEEE 802.16 is done in bursts, with each burst consisting of a preamble of one or two symbols followed by a number of data symbols. 
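The claim that an FFT window started slightly early, inside the cyclic prefix, keeps orthogonality can be illustrated with a small sketch. The assumptions here are illustrative only: a 16-point transform instead of 256, a 1/8 prefix, three arbitrary sub-carrier values and an ideal channel. Starting early is equivalent to a circular shift of the symbol, so each used sub-carrier is only phase-rotated and nothing leaks into the other bins.

```python
import cmath

def dft(x, inverse=False):
    """Direct O(N^2) DFT; adequate for a small illustrative transform size."""
    n = len(x)
    sign = 1 if inverse else -1
    scale = 1.0 / n if inverse else 1.0
    return [
        scale * sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                    for t in range(n))
        for k in range(n)
    ]

NFFT = 16          # assumed small stand-in for the 256-point FFT
CP = NFFT // 8     # 1/8 cyclic prefix, as in the 256/8 = 32 example

used = {1: 1 + 1j, 3: -1 + 1j, 5: 1 - 1j}   # arbitrary test constellation values
bins = [used.get(k, 0j) for k in range(NFFT)]

symbol = dft(bins, inverse=True)
tx = symbol[-CP:] + symbol        # prefix is a copy of the end of the symbol

early = 1                         # FFT window starts 1 sample early, inside the prefix
window = tx[CP - early: CP - early + NFFT]
rx = dft(window)

# Each used bin should equal the sent value times a known phase rotation.
rotation_err = max(
    abs(rx[k] - v * cmath.exp(-2j * cmath.pi * k * early / NFFT))
    for k, v in used.items()
)
# Unused bins should stay empty: no inter-carrier interference.
leak = max(abs(rx[k]) for k in range(NFFT) if k not in used)
print(f"rotation error {rotation_err:.2e}, leakage {leak:.2e}")
```

The per-carrier rotation grows linearly with the sub-carrier index, which is why a receiver can measure and correct a window offset from known carriers.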
The preamble has well-defined contents and is used for burst detection and timing estimation, channel estimation and carrier frequency offset estimation. It is suited to these tasks because every other carrier is a null, which gives it a repeated pattern in the time domain.

System Overview

Figure 3: OFDM Transceiver System
Figure 3 illustrates the scope of this OFDM base station PHY transceiver FPGA implementation within the larger system, where Fs is the OFDM sample rate at the FFT, which is up to 11.424 MHz for the maximum supported nominal channel bandwidth of 10 MHz. Various configurations of transmitter and receiver radio sub-system are possible, but the discussion of these is beyond the scope of this paper. For this implementation, a single ADC (real-only) low-IF receiver input was chosen to avoid the quadrature gain and phase mismatches associated with two (quadrature) ADC inputs. For the transmitter, quadrature DAC outputs to drive a direct conversion (or “zero-IF”) radio were implemented, although a low-IF single DAC output could be implemented with the inclusion of an additional mixer and fixed rate interpolating filter within the FPGA design.

A common control and data interface to the MAC layer processing was implemented using a "Wishbone" bus (an open source SoC bus standard) to demonstrate how all the transmit, receive and control data could be handled using a single interface. For a practical PHY transceiver implementation, an interface should be chosen that is most appropriate to embed the transceiver in the desired system, for example to enable communication via a backplane interface, or to a supporting DSP processor or microprocessor.

Development of a Design

The Lattice ispLEVER (Windows) software suite was used to perform all the required tasks from HDL simulation through synthesis, mapping, place and route to final FPGA PROM programming file generation. Of the selection of third party vendor tools included in the package, Mentor Graphics Modelsim was used for Verilog HDL simulation and Synplicity Synplify was chosen for synthesis. Development of commonly occurring functions was avoided by using a number of IP cores in the HDL development, including Reed-Solomon encoder/decoder, Viterbi decoder, FFT and FIR filters.
These IP cores were incorporated into the HDL by simple module instantiation. For simulation, pre-compiled verilog libraries for these cores were just linked in. For mapping, place and route operations, pre-synthesized “NGO” files were supplied for each of the IP cores. Although the entire build flow from synthesis through to FPGA PROM programming file generation could have been run from the ispLEVER GUI, TCL script-driven flow was chosen instead to simplify repeated operation and allow for fine tuning. The required TCL commands for the scripts were generated automatically by switching on the “TCL recorder” within the tool before using the ispLEVER GUI to drive the first run - thus logging to file all the commands that the GUI was running. Additionally, the Synplify synthesis tool output constraint files for the mapping, place and route stages. Exactly the same scripts were used to build the design on both the Windows and UNIX (Solaris) platform versions of the ispLEVER tool suite. Developing and verifying a complex design requires a realistic channel model and a flexible design reference model allowing rapid experimentation with different design approaches and algorithms, but also allowing a route for mapping to HDL and subsequent verification. Matlab (including use of fixed-point quantization) was chosen for this purpose. Where IP cores were to be used for the HDL implementation, these were simulated in Matlab using equivalently-configured functions from the Matlab Communications and Filter Design toolboxes. Fixed point quantization was used in Matlab to decide on appropriate bit-width precisions and scaling for the HDL implementation. By selectively turning quantization on and off in different blocks, associated performance losses could be investigated. The resulting quantized reference model gave results close to that achieved in the HDL implementation. 
The Matlab model was used to generate extensive test vectors for HDL block-level testing, HDL functional verification simulations and overall performance simulations (run over many more symbols than could be practically simulated in an HDL simulator).

Design Overview

Design Project Features:
Physical layer base station transceiver specification implemented:
The structure of the transceiver is illustrated in figure 4, with blocks containing IP cores highlighted. Figure 4: OFDM Transceiver FPGA PHY Sub-system structure
Transmitter

Transmit byte-wide data is loaded into the transmitter FIFO buffer over the Wishbone interface. The Randomizer serializes this data, then performs an “exclusive-OR” of the data with a pseudo-random binary sequence (PRBS). Data is then forward error correction encoded by the Convolutional Encoder and Reed-Solomon Encoder blocks. The Interleaver re-arranges the bit sequence within each OFDM symbol. The Mapper block modulates the bitstream onto constellations for each sub-carrier: for 64-QAM, 6 bits are modulated onto each constellation. The Carrier Combiner block completes the generation of frequency domain data for each symbol by adding pilot carriers, preambles and null (guard) carriers, then handshakes with the FFT/IFFT block to schedule transforming each OFDM symbol to the time domain. The FFT/IFFT block includes arbitration logic to share the (Inverse) Fast Fourier Transform core between the receiver and transmitter pipelines. The OFDM cyclic prefix is then added by the Cyclic Prefix Inserter before the complex baseband data is output to the zero-IF DACs.

Receiver

In the Coarse Mixer, real-only samples from the ADC input are mixed with a fixed-rate oscillator with complex (cos, sin) output signals to produce a near-baseband complex signal. A fixed-rate Decimator then down-samples the signal by 2. The start of each burst is detected by the Preamble Sync block, using the repeated pattern of 128 samples length in the preamble. This block also calculates an initial estimate of the carrier frequency offset and passes this to the next block. The Fine Mixer block uses this, plus a signal from the Phase Tracker, to remove the residual carrier frequency offset. The FFT Window Control block uses an initial timing signal from the Preamble Sync block, plus adjustments from the Phase Tracker, to locate the start of the symbols and handshakes with the FFT/IFFT block to schedule transforming each OFDM symbol to the frequency domain.
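The way a repeated 128-sample pattern yields a carrier frequency offset estimate can be sketched with a lag correlation in the style of Schmidl & Cox. This is a simplified, noiseless illustration and not the Preamble Sync block's actual implementation: the preamble contents, the 0.3 sub-carrier offset, the 40-sample lead-in and the function name `correlate` are all hypothetical. A frequency offset rotates the second copy of the pattern relative to the first, so the phase of the correlation gives the offset.

```python
import cmath
import random

L = 128        # length of the repeated pattern, as stated in the article
N = 2 * L      # two repeats span one 256-sample symbol

rng = random.Random(1)
half = [complex(rng.choice((-1, 1)), rng.choice((-1, 1))) for _ in range(L)]
preamble = half + half     # hypothetical stand-in for the real preamble contents

true_offset = 0.3          # assumed carrier offset, in units of sub-carrier spacing
start = 40                 # burst begins after 40 samples of silence
rx = [0j] * start + [
    s * cmath.exp(2j * cmath.pi * true_offset * n / N)
    for n, s in enumerate(preamble)
]

def correlate(r, d):
    """Return (P, R): lag-L correlation and second-half energy at position d."""
    p = sum(r[d + m].conjugate() * r[d + m + L] for m in range(L))
    e = sum(abs(r[d + m + L]) ** 2 for m in range(L))
    return p, e

p, e = correlate(rx, start)
metric = abs(p) ** 2 / e ** 2            # close to 1 when aligned with the repeat
est_offset = cmath.phase(p) / cmath.pi   # phase of P equals pi * offset for lag N/2
print(f"timing metric {metric:.3f}, estimated offset {est_offset:.3f}")
```

With a lag of half a symbol, the estimate is unambiguous only for offsets smaller than one sub-carrier spacing in magnitude; larger offsets would need an additional coarse search.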
The Phase Tracker uses the pilot sub-carriers in the OFDM symbol to estimate and remove the effect of any residual carrier frequency offset. It also detects any drift in the FFT window position due to sample rate errors, corrects for this and feeds data back to the FFT Window Control block to track this. The Channel Corrector block uses the preamble symbol data to estimate the channel frequency response and then corrects the effect of this on the data symbols. The Demapper block converts each resulting sub-carrier constellation to a number of “soft” bits associated with the modulation scheme. Each soft bit is a 4-bit value indicating the confidence that the received bit is a one or a zero. The De-interleaver re-arranges the bit sequence to reverse the interleaving process in the encoder. Forward error correction is performed on the data by the Viterbi Decoder and Reed-Solomon Decoder blocks. The transmitted data is finally recovered by the De-randomizer block, which reverses the effect of the Randomizer used in the encoder. Data is then passed to the receiver FIFO buffer where it can be read out over the Wishbone interface. This implementation is described in further detail in the following sections and in [4].

Making full use of ECP FPGA features

Multiple clock domains: two of the on-chip PLLs were used to generate and align four separate clock domains for this design. Signals could then be passed between clock domains simply by holding the signal for the clock period of the slower clock. A single global reset signal was used to define on which cycles the clocks would align – allowing each block to know the relationship of its clock to any blocks it was interfacing to. This allowed blocks to use the most natural clock for the function required: typically a sample rate clock, a bit-rate data clock, a byte-rate data clock, or a higher speed clock for compute-intensive functions (e.g. decimator, channel corrector).
Using a clock not much faster than is required for each function helps reduce power consumption and makes for easier, less-pipelined design without unnecessary enable signals. A fifth externally sourced (and independent) clock domain was used for the Wishbone interface clock. Embedded Block RAMs (EBR): 9 kbit RAMs, with variable aspect ratios from 9k x 1 bit to 256 x 36 bit. These were used throughout the design to provide fast and efficient storage for large quantities of data, typically complete “data blocks” for each OFDM symbol at various stages in the pipeline. This allowed easy isolation of each pipeline processing block from the next, thereby simplifying block design specification. Distributed RAMs: small 16 x 2 bit RAMs that can be used as an alternative to a single logic slice (effectively replacing 2 LUT4 elements). These were used extensively in the design for very small storage (they are much more efficient than flip-flops). In places, Distributed RAMs were ganged together to create moderate RAM sizes – such a technique allows trade-offs to be made between use of block RAMs or logic slices to optimise the design for the amount of remaining FPGA resources available. DSP Modules: dedicated multipliers, adders and accumulator logic. These are particularly effective at performing complex multiplies using 2 multipliers in a multiply-add/sub configuration over 2 successive clock cycles. Because the adder is built into the DSP module rather than being constructed out of LUTs, multiply-add operations can be run at high speed for very high throughput (this technique is used in the Preamble Sync block discussed in the next section). Table 1: Total Resource Usage for this design mapped to a Lattice ECP33 FPGA
Clock domains: Wishbone interface (ran at ~23 MHz in this design), Fs, 2*Fs, 8*Fs and 12*Fs. With Fs set to 11.424 MHz for a 10 MHz nominal bandwidth, the fastest clock (12*Fs) runs at approximately 137 MHz. Suitably pipelined designs on ECP devices can run at clock speeds in excess of 200 MHz, but this was not required for this design.

Architectural Design Considerations

In deciding on the overall transceiver architecture for this design, many aspects of the requirement specifications and implementation options were considered. Some of these are outlined below.

Fixed down-sampler ratio : it is much easier to implement a down-sampler with a fixed decimation ratio than a variable one. The 802.16-2004 specification states that the base station receiver and transmitter symbol clocks and carrier frequencies should all be locked to the same reference oscillator. It also states that the subscriber station transmit symbol rate should be within 2% of the received symbol rate. Hence, any sampling rate error incurred by using a fixed sample rate in the receiver will only cause a very small distortion over one symbol period. However, the error will cause a drift in the symbol start position over several symbols. This is detected in the Phase Tracker (see below) and corrected for in the FFT Window Control block.

FIR Filters (in Decimator and Channel Corrector blocks) : the (LUT-based) “Distributed Arithmetic” (DA) FIR filter IP core was used in this design to leave as many DSP blocks as possible available for other functions. If LUT fabric had been in short supply, or a high operating frequency was required, the general FIR filter IP core (which uses the DSP block multipliers) might offer a more appropriate alternative.
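To show why a Distributed Arithmetic FIR filter needs look-up tables rather than multipliers, here is a minimal bit-serial sketch of the general DA technique. It is not a model of the Lattice IP core's architecture: the tap values, the 4-bit sample width and the function names `build_lut` and `da_fir_output` are hypothetical. The LUT holds every possible sum of coefficients, and one output is built by addressing it with one bit-plane of the input samples at a time, then shifting and accumulating.

```python
def build_lut(coeffs):
    """Entry `addr` holds the sum of the coefficients whose address bit is set."""
    taps = len(coeffs)
    return [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << taps)
    ]

def da_fir_output(samples, lut, bits):
    """Compute one FIR output from `bits`-bit two's-complement samples."""
    acc = 0
    for b in range(bits):
        # Gather bit b of every sample in the delay line into one LUT address.
        addr = 0
        for i, x in enumerate(samples):
            addr |= ((x >> b) & 1) << i
        partial = lut[addr] << b
        # The most significant bit of a two's-complement word has negative weight.
        acc += -partial if b == bits - 1 else partial
    return acc

coeffs = [3, -1, 4, 2]     # hypothetical integer filter taps
window = [5, -3, 7, -8]    # current delay-line contents, 4-bit signed values
BITS = 4

lut = build_lut(coeffs)
y = da_fir_output(window, lut, BITS)
direct = sum(c * x for c, x in zip(coeffs, window))
print(y, direct)
```

The LUT size doubles with every extra tap, so practical DA filters split long filters into several small tables and add their outputs; that trade of LUT fabric against multipliers is the one the paragraph above describes.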
Down-mixing : receiver carrier frequency down-mixing is split into two stages: as the input signal to the ADC is a low-IF signal centred at a nominal frequency of Fs/4, an initial fixed “coarse” down-mix by Fs/4 was trivially implemented by multiplying ADC samples by the repetition of the rotating complex vector sequence { +1, +j, -1, -j }. A variable “fine” mixer placed just before the FFT was then used to correct for the remaining carrier frequency error as estimated by the Preamble Sync and Phase Tracker blocks. Such a scheme halves the number of calculations required to be done by the Decimator and “fine” NCO/mixer blocks and reduces the latency of changes in fine frequency adjustment. The “fine” NCO was implemented quite simply, using an EBR RAM as a direct sin/cos look up table. Where greater precision is required, an NCO IP core is available which provides significant optimisations to increase precision with only limited increase in resource usage. It also offers additional features such as phase dithering. Initial carrier frequency error and burst start estimation ("Preamble Sync(hronization)"): Although other methods were investigated, this was finally implemented based on a method proposed in the paper by Schmidl & Cox [2]. The large number of different calculations required were achieved by time-slicing part of a DSP block, with complex multiplies being efficiently implemented in 1 cycle using the DSP block multiply-add/sub function. Further details of this design are described in a separate white paper [1]. FFT / IFFT functions : these were implemented using a combined FFT/IFFT IP core time-shared between the 2 functions using simple request/grant handshaking control signals. Sufficient buffering either side of the FFT/IFFT was used to allow the receiver and transmitter to operate in full duplex mode without any timing dependency on the other. 
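The "trivially implemented" coarse mixer from the down-mixing paragraph above can be sketched as follows. Multiplying by the repeating sequence { +1, +j, -1, -j } only routes each real sample, with or without a sign change, to the real or imaginary output, so no multipliers are needed. The sketch takes the sequence exactly as printed, which corresponds to exp(+j*pi*n/2); which sideband of the real input this brings to baseband depends on the sign convention in use, and that is not settled here. The ADC sample values are arbitrary test data.

```python
import cmath

def coarse_mix(samples):
    """Multiply real samples by the repeating sequence {+1, +j, -1, -j}."""
    out = []
    for n, x in enumerate(samples):
        step = n % 4
        if step == 0:
            out.append(complex(x, 0))    # times +1
        elif step == 1:
            out.append(complex(0, x))    # times +j
        elif step == 2:
            out.append(complex(-x, 0))   # times -1
        else:
            out.append(complex(0, -x))   # times -j
    return out

adc = [3, -1, 4, 1, -5, 9, 2, -6]        # arbitrary real-only ADC samples
mixed = coarse_mix(adc)

# Reference: the same operation done with a general complex oscillator.
reference = [x * cmath.exp(1j * cmath.pi * n / 2) for n, x in enumerate(adc)]
err = max(abs(a - b) for a, b in zip(mixed, reference))
print(f"difference from full complex multiply: {err:.2e}")
```

Because alternate outputs are purely real or purely imaginary, half of the values entering the following filter are zero, which is consistent with the reduced calculation count noted in the text.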
Phase Tracking : the Phase Tracker block tracks how each of the 8 pilot sub-carriers has been phase rotated since the start of each received burst. From this, estimates are made of the carrier phase error (for correcting the current symbol), the residual carrier frequency (fed back to the Fine Mixer block) and any timing drift in the symbol start position due to sample rate error (fed back to the FFT Window Control block). More robust and accurate estimates could be achieved by selectively emphasizing or suppressing the influence of pilots based on their received amplitude, and/or using a "decision-directed" method to compare received data sub-carriers with sliced versions (as fed back from the demapper) to give a larger number of estimates to average over.

Channel Correction : an initial channel estimate is made by using the known reference signal of the received preamble, together with DA-FIR filter IP cores to calculate in-between sub-carrier values and to reduce the effects of noise. Fast calculation of magnitude and phase measurements was required, for which the openCore parallel CORDIC IP core was used to minimize development time. For a fixed link, the channel should only change slowly over the length of each burst of symbols. However, the design is structured so it could be expanded to introduce adaptive updating of the channel correction coefficients on subsequent symbols in the burst.

Power Amplifier Non-Linearity : although beyond the scope of this design project, distortion in the signal caused by non-linearities in the transmitter power amplifier should be considered. The effects of this distortion can be reduced using various pre-distortion and crest factor reduction schemes in the transmitter signal path.

Coding and Modulation : the design is structured to support the full range of 802.16-2004 OFDM coding and modulation schemes, though only the 64-QAM-3/4 mode was implemented in the HDL design source code.
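The separation of a common carrier phase error from a timing drift, as described in the Phase Tracking paragraph above, can be sketched as a straight-line fit of pilot phase against sub-carrier index. This is an illustrative model rather than the Phase Tracker's actual algorithm. Its assumptions: the pilot indices listed in the code, reference pilots all equal to 1, no noise, phases small enough not to wrap, and a sign convention in which a window that is early by d samples rotates sub-carrier k by -2*pi*k*d/NFFT. The intercept of the fit gives the common phase error and the slope gives the window offset.

```python
import cmath

NFFT = 256
PILOTS = (-88, -63, -38, -13, 13, 38, 63, 88)   # assumed pilot sub-carrier indices

def track(received, reference):
    """Least-squares fit of phase = common + slope * k across the pilots."""
    phases = [cmath.phase(r * ref.conjugate())
              for r, ref in zip(received, reference)]
    n = len(PILOTS)
    mean_k = sum(PILOTS) / n
    mean_p = sum(phases) / n
    slope = (sum((k - mean_k) * (p - mean_p) for k, p in zip(PILOTS, phases))
             / sum((k - mean_k) ** 2 for k in PILOTS))
    common = mean_p - slope * mean_k
    timing = -slope * NFFT / (2 * cmath.pi)   # window offset in samples
    return common, timing

# Synthetic test case: 0.2 rad common phase error, window early by 0.5 sample.
true_phase, true_timing = 0.2, 0.5
reference = [1 + 0j] * len(PILOTS)              # hypothetical known pilot values
received = [
    ref * cmath.exp(1j * (true_phase - 2 * cmath.pi * k * true_timing / NFFT))
    for k, ref in zip(PILOTS, reference)
]

common, timing = track(received, reference)
print(f"common phase {common:.3f} rad, timing offset {timing:.3f} samples")
```

Weighting each pilot by its received amplitude, as the text suggests, would amount to replacing the plain sums in the fit with weighted sums.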
Design Simulation & Results

The quantized Matlab reference transmitter model was used both to validate the performance of the HDL transmitter design and also to generate vectors to test the receiver design with. These vectors would first be processed by a Matlab channel model which simulated the following features:
The IEEE 802.16-2004 specification limits the allowable carrier frequency and sample rate errors at the base station receiver to no more than +/-2%. The 13% value modeled here is equivalent to Doppler shift caused by a transmitter or receiver traveling at 50 km/h with a nominal channel bandwidth of 1.75 MHz (a smaller channel bandwidth makes the effect of the Doppler shift more pronounced). The HDL test bench used ASCII text command files for each test to describe a sequence of operations to open various Matlab-generated vector files for data input stimulus and output data comparison, and to read or write (via the Wishbone interface) memory-mapped control registers and receiver/transmitter data FIFOs. The test bench is described in more detail in [3]. The following groups of simulations were run:
For the 64-QAM-3/4 modulation and coding scheme, the receiver sensitivity test specified in IEEE 802.16-2004 (section 8.3.11.1) requires a final bit error rate (BER) of less than 10^-6 at an AWGN SNR of 24.4 dB. Figure 5 shows bit error rate and packet error rate (PER) measurements at the output of various stages of the receiver. Note that plot lines stop where there are zero errors over the total simulation time (4000 symbols, 3.4 million bits) as these points cannot be plotted – thus indicating that the design meets the sensitivity requirement. (“Demapper Sliced” is a measurement of BER at the input to the FEC if the data were to be binary-sliced with a simple threshold slicer.)

Figure 5: Receiver Performance (64-QAM-3/4) in AWGN channel

Conclusions

This paper outlines some of the key components and features required for an OFDM transceiver physical layer (PHY) design. It outlines how a complete transceiver design was constructed using multiple IP cores to reduce development time. The design demonstrates the feasibility of using the Lattice ECP family of FPGAs for wireless OFDM PHY design in general (and with the performance requirements of the “WiMAX” OFDM PHY in particular). This includes the provision of a suitably DSP-oriented FPGA fabric, generation and support of multiple clocks and the availability of key DSP IP cores. The methodology discussed shows how the ispLEVER tool suite running on a PC contains all the tools required to take a large and complex FPGA design from HDL simulation through to final device programming. This included the straightforward transition between GUI-driven and TCL-scripted design flows and the portability that allows all the same operations (and scripts) to be run under the UNIX or Linux version of ispLEVER.
References

[1] “Implementing WiMAX OFDM Timing and Frequency Offset Estimation in Lattice FPGAs”, Lattice Semiconductor white paper, 2005
[2] “Robust Frequency and Timing Synchronization for OFDM”, T. M. Schmidl, D. C. Cox, IEEE Transactions on Communications, 1997
[3] “OFDM Transceiver HDL Testbench”, Lattice Semiconductor OFDM Transceiver design package, 2005
[4] “OFDM Transceiver Reference Design”, Lattice Semiconductor OFDM Transceiver design package, 2005
Revision A.6 (18 August 2005)