Upset with Neutrons

The cosmos are conspiring against your FPGA. From millions of light years away, they are seeking out our semiconductors, targeting our technology with high-energy particles that can silently flip our flip-flops, reconfigure our routing, and tamper with our look-up tables.

Galactic Cosmic Rays (GCRs) are the highest-energy particle radiation to reach earth. Most are deflected by the earth’s magnetic field. However, some particles still penetrate this protection. Upon entering the atmosphere, they collide with atmospheric gases, producing a wide variety of subatomic particles, including a significant quantity of high-energy neutrons. The greatest density of these neutrons occurs at an altitude of about 60,000 feet. Below that level, they are gradually attenuated by the atmosphere, so there is a much lower density at the earth’s surface. Even at ground level, however, neutrons sometimes strike semiconductor devices. When one of these hits a configuration bit of your SRAM FPGA, it can change the routing or alter the behavior of a look-up table, producing incorrect logic until the device is reconfigured.

After reading the new 79-page report entitled “Radiation Results of the SER Test of Actel, Xilinx and Altera FPGA Instances,” recently published by iRoC Technologies, you may be tempted to run to the kitchen to fashion yourself a protective tinfoil hat (as well as a tiny one for your smartphone). If you consult your copy of “The WORST-CASE SCENARIO Survival Handbook,” the table of contents offers little hope: “How to escape from quicksand,” “How to fend off a shark,” “How to wrestle free from an alligator…” Unfortunately, there’s no chapter on “How to avoid neutron-induced single-event-upsets in your SRAM FPGA.”

Before you bolt down the door to your bunker, let’s check in with a recognized authority on things that happen to electronics in space.
Ken LaBel from NASA Goddard Space Flight Center makes a career of studying radiation effects on semiconductor devices. “In space,” LaBel comments, “we worry a lot about this stuff. The most difficult to deal with are these single particle effects such as single event upsets (SEUs).” A single particle can strike a memory element, causing it to change state randomly. In radiation-intense environments like the upper atmosphere and space, these events are common, and systems must be designed to compensate for them. On the ground, LaBel says, the problem is far less troublesome. The relatively small number of neutrons reaching devices at sea level makes the likelihood of an SEU-induced error much lower.

For the record, however, there are numerous SRAM FPGAs operating successfully, even in the harshest space environments. “The Mars Rovers relied on Xilinx devices to control the descent, including rockets and parachutes,” says Xilinx’s Peter Alfke, “and they also have one Virtex device in each wheel, where they (along with other devices) measure torque that relates to the surface condition.” Systems such as these are typically designed using techniques such as configuration read-back and refresh and triple-module redundancy (TMR) to take SEUs into account, along with the myriad other difficulties faced by designers developing electronics to work in space. Designers of terrestrial applications may not want to resort to these techniques, but they should exercise the same engineering rigor, scaled to the demands of their application.

While there are a number of radiation-induced hazards that affect all semiconductors, SEUs are in a category known as “soft errors,” meaning they don’t do permanent damage to the device; they merely effect a temporary change of state. Other types of errors include hard errors in memory cells, such as gate or dielectric rupture, and latchup, which can damage or destroy the device.
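The triple-redundancy technique mentioned above can be illustrated with a minimal software sketch. The idea is that three copies of the same logic feed a majority voter, so an upset in any single copy is outvoted. The function names here (`majority_vote`, `flip_bit`) are hypothetical and chosen for illustration only; a real design would implement the voter in the FPGA fabric rather than in Python.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority across three replica outputs."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(word: int, bit: int) -> int:
    """Model a single event upset by inverting one bit of a word."""
    return word ^ (1 << bit)


# Hypothetical 8-bit output that all three replicas should produce.
good = 0b1011_0010

# One replica suffers an upset in bit 5; the voter masks it.
upset = flip_bit(good, 5)
voted = majority_vote(good, upset, good)
print(f"replica with upset: {upset:#010b}, voted output: {voted:#010b}")
```

The voter only masks errors while at most one replica is wrong in any given bit position, which is why it is usually paired with read-back and refresh: the corrupted copy has to be repaired before a second upset accumulates.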
In an FPGA, not every SEU generates a logic error, since any given design uses only a small number of the configuration cells in a device. Nor does every neutron hit produce an upset. FPGA manufacturers use configuration latches that trade access speed for stability. Since the time to configure an FPGA is inconsequential in most applications, vendors make efforts to design configuration memory to minimize the potential for SEUs.

For our discussion, we’ll focus on the effects at ground level. Those of you designing aerospace systems already know this stuff cold, right? (In fact, if you’re designing avionics and you learn anything new and important from this article, you might want to consider another line of work.) Your products function in an environment where SEUs are a major issue. For the rest of us, designing things that stay here on earth, we’ll try to sort out the engineering and marketing realities of the phenomenon and examine how it should affect our design approach.

Since memory cells and flip-flops are susceptible to these events, SRAM-based FPGAs would seem, on the surface, to be highly vulnerable. The vast majority of the SRAM in an FPGA is used for configuration, so the likelihood is that most errors will occur there. Non-volatile FPGAs, such as flash and antifuse devices, are less susceptible because their configuration is not vulnerable, so only memory elements such as registers and RAM can be affected. Working in our favor, however, is the fact that probably 90% of configuration SRAM is unused in any given design. Even for configuration that is used, an upset does not always result in a functional error. Since FPGA vendors estimate actual usage at around 10%, one would expect only 1 in 10 SEUs to result in an actual logic error. For some devices in the iRoC test, this ratio holds true, while other architectures show a somewhat higher rate.
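The 1-in-10 reasoning above amounts to a simple derating: multiply the raw configuration-upset rate by the fraction of configuration bits whose state actually matters to the design. The sketch below assumes that simple linear model; the function name and the example upset rate are hypothetical, and only the 10% usage figure comes from the text.

```python
def functional_error_rate(upset_rate: float, critical_fraction: float) -> float:
    """Derate a raw configuration-upset rate to a functional-error rate.

    upset_rate        -- configuration upsets per unit time (any unit)
    critical_fraction -- fraction of configuration bits whose state
                         affects the design (0.0 to 1.0)
    """
    if not 0.0 <= critical_fraction <= 1.0:
        raise ValueError("critical_fraction must be between 0 and 1")
    return upset_rate * critical_fraction


# Hypothetical raw rate: 1,000 configuration upsets per billion hours.
raw_upsets = 1000.0
# Vendor estimate quoted in the article: about 10% of bits are in use.
errors = functional_error_rate(raw_upsets, 0.10)
print(f"{errors:.0f} functional errors per billion hours")
```

Architectures whose measured ratio is higher than 1 in 10 would simply use a larger `critical_fraction` in this model.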
As engineers, any time we design something, we should identify potential points of failure, and then estimate the likelihood and consequences of failure at each point. The product of these two gives us an idea how important it is to design around that potential problem. In any semiconductor device with volatile memory, SEUs are a potential failure mechanism. To determine how important SEUs are on the ground in our application, we need to know the actual probability that the device we’re using will experience such a failure. We can then apply what we know about the consequences of wrong logic, and decide what precautions (if any) are called for in our design.

There have been many attempts to quantify the failure in time (FIT) rate of various types of FPGAs. FIT rates are given in terms of failures per 10^9 hours of operation. These studies (and the dissemination of their results) represent both an engineering and a marketing battle: the engineering battle, to determine what precautions one should take when designing with various types of programmable logic; the marketing battle, to convince us that a particular device family represents the best engineering trade-off after taking SEUs into account.

The latest iRoC study, commissioned by Actel, is designed to measure FIT rates for Actel, Altera, and Xilinx FPGAs. In brief, the devices under test were configured with a combinational circuit and exposed to high densities of neutron radiation in an accelerated environment. This accelerated environment was provided by the neutron test facility at the Los Alamos Neutron Science Center (LANSCE). The device configurations were periodically read back to look for configuration errors, and the output vectors were monitored to measure how often a configuration error resulted in a logic error visible at the I/O pins.
The failure rates were then adjusted in an attempt to predict the FIT rates for each device at various altitudes/elevations based on standardized estimates of neutron densities at those altitudes. The results of this study showed FIT rates somewhat higher than similar published data for some of the same device families. (Although, given the number of assumptions required in such measurements, the rates were probably not significantly higher.) As one would expect, the SRAM-based devices showed a measurable FIT rate, while the non-volatile FPGAs showed a FIT rate near zero. Since the design-under-test was purely combinational, however, failures in memory elements would not be observed. Because flash and antifuse devices are susceptible to soft errors only in volatile memory cells, their failure rate in this test would be lower than in a practical design, by an amount that depends on how much memory and how many registers the design uses. The design of this test measured only configuration errors, to which those technologies are virtually immune.

“It is useful to set FIT rates into context,” says Ken O’Neill, Director of Marketing for Military and Aerospace products at Actel. “Most components engineers will seek components with FIT rates less than 100 for typical commercial applications such as telecom, storage, and networking. For high-reliability applications such as medical or military, components engineers seek components with FIT rates in the 10 to 20 range.”

Xilinx has taken this issue seriously, and has a series of ongoing experiments it calls “Rosetta,” in which boards each containing 100 XC2V6000s are monitored for long periods of time at various locations of differing elevations. These non-accelerated experiments are meant to give a more realistic result than those extrapolated from high neutron density testing.
“When you make your living on devices that depend on memory,” says Xilinx’s Principal Engineer, Austin Lesea, “you’d better understand any issue that could affect the viability of your technology. We’ve been running tests on SEUs for years and have made significant improvements to our devices as a result.” Xilinx claims that their 90nm FPGAs show approximately 15% better results than previous technologies, despite predictions that 90nm would be more susceptible than larger geometries to SEU problems.
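The extrapolation from an accelerated beam test to a field FIT rate, described earlier, is commonly done by computing a per-device cross-section (errors observed divided by the neutron fluence delivered) and multiplying it by the natural neutron flux at the location of interest. The sketch below shows that arithmetic under stated assumptions; it is not the procedure iRoC used. The default flux of 13 neutrons/cm²/hour is an assumed sea-level reference value, the `altitude_factor` parameter is a hypothetical multiplier for other elevations, and the example error count and fluence are invented for illustration.

```python
# Assumed sea-level reference flux in neutrons / cm^2 / hour.
# Treat this as a placeholder and substitute the value from the
# standard or site data you actually rely on.
SEA_LEVEL_FLUX = 13.0


def cross_section(errors: int, fluence: float) -> float:
    """Per-device cross-section in cm^2: errors per unit fluence."""
    if fluence <= 0:
        raise ValueError("fluence must be positive")
    return errors / fluence


def fit_from_beam_test(errors: int, fluence: float,
                       flux: float = SEA_LEVEL_FLUX,
                       altitude_factor: float = 1.0) -> float:
    """Estimate failures per 10^9 device-hours from a beam test.

    errors          -- errors counted during the exposure
    fluence         -- total neutrons delivered per cm^2
    flux            -- natural flux at the reference location
    altitude_factor -- hypothetical multiplier for other elevations
    """
    sigma = cross_section(errors, fluence)
    failures_per_hour = sigma * flux * altitude_factor
    return failures_per_hour * 1e9


# Invented example: 50 errors after 1e10 neutrons/cm^2 of exposure.
estimate = fit_from_beam_test(errors=50, fluence=1e10)
print(f"estimated sea-level FIT: {estimate:.1f}")
```

Every input to this calculation carries its own uncertainty, which is one reason accelerated results and long-running field measurements can disagree.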
The difficulty with real-environment testing, of course, is that the events are so rare that it takes a large number of devices monitored over a long period of time to come up with statistically significant data. It’s a little like estimating the probability of getting hit by lightning based on the number of times you’ve been struck on your evening walks. You’ll need to interview more than a few friends and figure in their results to get a reasonable answer. By putting 100 of their largest devices in a test environment for hundreds of days, Xilinx is attempting to do just that.

What do the results mean? An FPGA with a sea-level FIT rate of 500 (which would be about the middle of the pack in the iRoC study) would experience 500 failures per 10^9 hours. That translates into about 1 failure every 2 million hours, or roughly 230 years of operation. Previous studies, such as Xilinx’s Rosetta experiments, have estimated the FIT rate for similar devices at what would be 1 failure in 240-560 years, so the debate over the rate spans a factor somewhere between parity and one-to-four. Remember also that FIT rates scale with density. A device with 10,000 logic elements might be expected to have roughly ten times the FIT rate of a device with 1,000 if all other variables are equal.

Why are the results different? The variables are many. First, the test is obviously sensitive to the type of circuit being used. Any test using a contrived design is speculative by nature and may not accurately represent real-world situations. Second, accelerated tests are based on standardized estimates of neutron densities at various altitudes, while real-world testing is based on real-world densities, which are difficult to measure. If a solar flare passes by during your test, for example, your results may show a measurable impact. Finally, the conditions in accelerated testing do not exactly mirror real-world environments.
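The FIT arithmetic above is easy to check with a few lines of code. This sketch converts a FIT rate into an average time between failures, scales it linearly with logic density as the text suggests, and extends the same arithmetic to a group of devices. The function names are hypothetical, the year is taken as 8,760 hours, and the 1,000-device fleet is a size chosen purely for illustration.

```python
HOURS_PER_YEAR = 24 * 365   # 8,760 hours, ignoring leap years
FIT_HOURS = 1e9             # FIT = failures per 10^9 device-hours


def fit_to_years(fit: float) -> float:
    """Average years between failures for one device."""
    if fit <= 0:
        raise ValueError("fit must be positive")
    return FIT_HOURS / fit / HOURS_PER_YEAR


def scale_fit(fit: float, from_elements: int, to_elements: int) -> float:
    """Scale a FIT rate linearly with logic-element count,
    assuming all other variables are equal."""
    return fit * to_elements / from_elements


def fleet_failures_per_year(fit: float, devices: int) -> float:
    """Expected failures per year across a group of identical devices."""
    return fit * devices * HOURS_PER_YEAR / FIT_HOURS


print(f"500 FIT -> one failure per {fit_to_years(500):.0f} years")
print(f"50 FIT at 1,000 elements -> "
      f"{scale_fit(50, 1000, 10000):.0f} FIT at 10,000 elements")
print(f"1,000 devices at 500 FIT -> "
      f"{fleet_failures_per_year(500, 1000):.2f} failures per year")
```

Under these assumptions a single 500-FIT device fails about once in 228 years, while a thousand of them together see a little over four failures a year, which is why fleet size matters as much as the per-device rate.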
“Neutron beams are different from the real world in several important ways such as particle energy and rates of exposure,” says NASA’s LaBel. “It’s hard to correlate the results from acceleration with the natural environment without lots of irradiation testing at various energies and correlating experiments like the Rosetta.”

Regardless of the variations in FIT rate, how much should those of us designing ground-based systems care? It depends what you’re designing. “People designing ground-based systems with 1,000 FPGAs should be caring right now,” says LaBel. If your system contains 1,000 SRAM FPGAs and could run a year continuously between configuration cycles, you should expect somewhere in the range of one to a dozen failures per year (depending on which test results you believe and the size of your devices) and you need to do something about it. You should analyze the results of a logic failure in any one FPGA and design your system to compensate, either by redundancy, by reconfiguring the devices periodically, or by some other means appropriate to your application.

With millions of SRAM-based devices in the field, Altera does not seem unduly concerned. “SEUs in SRAM devices have been a well-understood phenomena for at least a couple of decades now,” says Tim Colleran, VP of Product Marketing at Altera. “We have millions of devices operating successfully in numerous very-high-reliability applications.” Altera says continuous engineering improvement, fault-tolerant design practices, and designer awareness have kept the SEU issue from ever becoming a major concern for their customers.

The marketing side of the SEU issue is perhaps as interesting as the science. Manufacturers of non-volatile FPGAs obviously want to highlight the advantages offered by their technology, and manufacturers of SRAM-based FPGAs are determined to defend the reliability of their products.
The devices being compared, however, are significantly different in their capabilities, and one would guess they seldom, if ever, compete for the same sockets. Density, I/O configuration, IP, speed, power, and start-up characteristics are so different between, say, Xilinx’s Virtex II family and Actel’s ProASIC Plus family that it isn’t clear why they’d be arguing on the subject. Nonetheless, research, testing, analysis, and press releases continue, and the outcome will hopefully be a better informed, if not somewhat confused, engineering community.

Designing robust circuits to tolerate neutron-induced SEU effects is just one of the many challenges of advanced IC technology. The alternative, of course, is to stay with lower complexity technology, but that comes with other reliability concerns such as increased chip count, larger boards, and more interconnects.

What will happen next? Well, those galactic cosmic rays will just keep on coming. With so many neutrons flying around, we’ll want to understand what to expect when they pass by (and through) future generations of programmable logic. It appears that the answer is a mixed bag. On the bright side, smaller and smaller targets make collisions proportionally less likely as device geometries shrink. In addition, architectural changes are being incorporated into new generation devices that make them more immune to upset. On the negative page of the ledger, smaller geometries usually mean higher densities, and more targets mean a proportionally higher probability of an error somewhere on the device. Lower supply voltages and corresponding lower charge densities also make upsets more likely. Finally, for transient events, higher operating frequencies mean events that might have passed as spikes at lower frequencies now register as incorrect signals.

The parties best qualified to find a solution (and the ones with the most to gain) are clearly the IC vendors themselves.
It is in their best interest to quantify the phenomena as accurately as possible, take practical measures at the IC design level, and use system design techniques as required to maintain the practical reliability of their products. By supplying designers with the best possible technology, information, and collateral, they can enable their customers to reach their required system objectives.

Renowned physicist Richard Feynman closed a paper on the reliability of space shuttles with: “For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.” So, push past the marketing hype. Practice good engineering. Analyze all known failure mechanisms and design your system to be as fault-tolerant as your application demands. While you’re at it, take off that not-so-stylish foil hat. The neutrons will fly right through it anyway.

Kevin Morris, FPGA and Programmable Logic Journal
April 20, 2004

Comments on this article? Send them to comments@fpgajournal.com