15 Reasons to Avoid FPGAs

I first got into computers around 1982, buying a magazine called Micro Sistemas.

I didn't have a computer, and the only programmable device around was my sister's TI-59, so I resorted to writing my BASIC programs on paper and executing them in my head. It wasn't until 1985 that I got my Apple IIc. I spent countless nights programming 6502 assembly on a 40-column display on a TV (I only got my first green monitor years later).

Years went by and, by 2001, one master's and a doctorate later, I found myself facing an interesting device called an FPGA while interviewing for the software security company Sophos in Milton Keynes, UK. They told me that this strange chip could become almost anything: a video processing core, an Apple CPU, an antivirus engine. Amazing. Since then I have owned several FPGAs myself and worked on commercial projects that used them. Enough to know what they are good for and what they are not. In this article I'll focus on the latter: why NOT to use FPGAs.

I have to make a disclaimer: this article has been written with the general public in mind, including managers, business developers and decision makers who do not necessarily have the expert technical background to follow all the intricacies the subject requires. So I allow myself the equivalent of poetic licence and cut corners, sometimes violating some strict rules as a trade-off for clarity.

1. 99% of new FPGA projects are naive

Over the past 5-10 years there has been booming interest in FPGAs in many areas, sometimes driven by the need to execute many operations in parallel, as in Monte Carlo simulations, or to achieve lower latencies than those obtained with userspace-driver NICs.

More often than I am comfortable with, I have seen projects, entire two-year projects, heading for the trash can. Invariably the main culprit has been a firm's naive desire to be on the bleeding edge of technology. With poor or no planning at all, teams are hired and budgets are allocated, but the project goal is typically just to translate an existing piece of an application into FPGA fabric.

Straight translation is a bad idea. I have seen friends' careers crushed by failures they were not responsible for, money thrown in the trash, too much damage in general. I really hate to see that happen again, and I have been vocal about this sort of blind effort.

2. The top shops are not using them

This alone should be enough to weed out most FPGA-experts-to-be: anecdotal evidence shows that firms making millions daily in many fields are not really doing it because of the speed of custom FPGA hardware.

Most are deploying a no-nonsense combination of statistics, machine learning and expert use of the x86_64 platform. Simple as that.

3. Some FPGA appliances boast very low latency numbers – but using them remains a problem to be solved

Appliance companies will sell ready-to-run boxes with FPGA solutions inside, or even run a custom app inside the switch itself. They will parse the data, normalize it and present it to you in a canonical form that you can use in your applications. This is tremendous added value, and from a technical perspective I am always in awe of what these companies have accomplished.

From a user's perspective, though, the work of figuring out how to use that data remains. The latency numbers are incredible, but getting this data into your application AND acting on it is a completely different problem. You might have a sub-1.0us data processing engine, but if your application is not prepared to consume that data and adds another 5us before acting on it, what is the point?

4. FPGAs were built as ASIC prototypes

The original use for FPGAs, in the US military, was as ASIC prototypes, and that remains their best application. The use of FPGAs as standalone solutions is a newer animal. Although this is not an explicit reason against their use, I list it here because it is important to keep in mind that FPGAs were created as a tool within the process of designing and testing new ASIC cores.

5. FPGAs are slow

FPGAs are in fact built on top of generic logic-emulation blocks. There is no universal design or name for these basic components (each vendor has its own), but together with flip-flops and the routing matrix they form what we call an FPGA.

The software tools will try to translate every line of HDL into a sequence of these basic logic blocks and link them together through the routing matrix. Sometimes a logic block will be fully used, sometimes only partially. There is therefore an inherent waste in the process, and this waste comes not only in the form of silicon space – it comes in the form of time (and latency) as well.

There is no free lunch: every digital circuit is a blob of the basic electronic components: capacitors, resistors, transistors and, to a lesser extent, inductors. Every bit (as in a byte) can be seen as a capacitor that is charged or discharged. Capacitors take time to charge or discharge, and the more capacitors are placed in series, the longer it takes to charge them all. (As you remember from Physics 101, the longer the period, the lower the frequency.)
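
To make that concrete with made-up numbers: if the longest chain of logic between two flip-flops needs 6ns to settle, the clock period cannot be shorter than 6ns, so the maximum frequency is 1/6ns ≈ 166MHz – no matter what the silicon is nominally rated for.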

Top-of-the-line FPGAs are built to run at a maximum of 800MHz, but typically, for the reasons listed above, you will find yourself settling for the 100-200MHz range.

6. FPGAs are expensive

This one is a slam dunk. Decent FPGA cards for use in 1/10G networks cost upwards of $3,000. Even low-grade development kits start at $600. Apples to apples, compared with the fastest NIC cards, available for under $700, there is a 4x+ factor that is hard to ignore economically, especially if the deployment comprises hundreds of machines.

7. FPGAs are less efficient than ASICs

On this item I will argue by example. FPGAs found an unusual application years ago when bitcoin miners looked for faster ways to perform the crypto calculations necessary to create new bitcoins (aka mining). After a period of relative success, the economics of producing dedicated silicon started to make sense and custom ASIC chips appeared. When the list below was first put together, I remember most of the appliances were FPGAs and there were one or two ASIC solutions. Nowadays FPGAs barely make the cut and have been pushed to a separate list at the end of the page.

https://en.bitcoin.it/wiki/Mining_hardware_comparison

The difference in performance is staggering: several orders of magnitude.

8. FPGA development requires a skilled team

The entry barrier to developing successful FPGA solutions is way higher than that of C++, for example. Not only are the tools still amateurish (imho) in comparison with the Intel ecosystem, but there is also far more to learn than just the language itself.

One of the best-known examples is that of computing X^3, or X to the power of three. This typical example is discussed in many Verilog/VHDL textbooks and, as it turns out, there are many ways to do it! One can compute X*X*X, but that sequence of operations is too long to fit in a single clock cycle and blows the frequency requirements. So you can compute X*X, then use the result to calculate (X*X)*X in the next cycle, which relaxes the frequency limits and increases the operation length in cycles – but then you have to keep track of the cycle in which the result becomes available. You get the picture.
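
For illustration only, here is the same idea modeled in C++ rather than HDL (the struct fields play the role of the pipeline flip-flops; all the names are mine):

    #include <cstdint>
    #include <iostream>

    // Two-stage pipeline for x^3: stage 1 computes x*x and latches x,
    // stage 2 multiplies last cycle's x*x by last cycle's x.
    struct CubePipeline {
        uint32_t x1  = 0;  // pipeline register: latched input
        uint64_t xx  = 0;  // pipeline register: x*x from stage 1
        uint64_t out = 0;  // pipeline register: (x*x)*x from stage 2

        void tick(uint32_t x_in) {
            out = xx * x1;                // stage 2 uses values latched last tick
            xx  = uint64_t(x_in) * x_in;  // stage 1: x*x
            x1  = x_in;                   // carry x along for stage 2
        }
    };

    int main() {
        CubePipeline p;
        for (uint32_t x : {2u, 3u, 4u, 5u}) {
            p.tick(x);
            // 'out' holds the cube of the *previous* tick's input:
            std::cout << p.out << '\n';   // prints 0, 8, 27, 64
        }
    }

Each call to tick() is one clock cycle: the answer for a given input only appears a cycle later, which is exactly the bookkeeping burden described above.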

In some cases, like the expression below, which is trivially compiled by GCC/ICC/Clang, synthesis is simply not possible in Verilog, i.e. the tools cannot generate a physically feasible sequence of logic blocks that computes the final result.

    double y = 0.1+0.75*(1.0/(1.0+exp((x[i]+40.5)/6.0)));

Thus even a simple calculation in C/C++ can become the object of days of research and tweaking, while in the x86 world a near-optimal solution is always available.

A simple operation like a 16-bit addition A+B can produce 80 gates with a 0.8ns maximum delay, or 400 gates if constrained to a maximum delay of 0.5ns. And the number of bits matters: an eight-bit addition will usually compute much faster than the equivalent sixteen-bit version. Very often a Verilog developer will ask you for the maximum value a given quantity can take – they are trying to figure out how many bits to allocate in the design, most likely because it is blowing the maximum delay limits. Again, the lengthiest calculation drives the frequency (remember, frequency is one over period).

In reality, what we see in the market is a very broad mix of professionals who "know" one HDL, a much smaller set who have relevant experience, an even smaller set who actually know how to produce optimized code, and a number of technologists I can count on one hand who can produce effective, integrated solutions.

It is essential for any firm that wants to get into this field to hire professional counsel before pulling the trigger on a multi-year project. Such failures do nobody any favors. They are bad for the firm, bad for the professionals and bad for the economy.

9. FPGAs can be plagued with metastability issues

I could not find an easy way to explain this concept but here’s an excerpt from Altera’s literature:

“Metastability is a phenomenon that can cause system failure in digital devices, including FPGAs, when a signal is transferred between circuitry in unrelated or asynchronous clock domains.”

C++ programmers will find this analogous to multithreaded programming, where multiple threads access the same piece of data without synchronization. Suffice it to say the outcome is usually an insidious sequence of events that is extremely hard to pinpoint.
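
To make the analogy concrete, here is a minimal C++ sketch (my own, not from Altera's literature) of the software equivalent: a data race whose failures, like metastability, are intermittent and timing-dependent.

    #include <iostream>
    #include <thread>

    int main() {
        long counter = 0;  // shared, with no atomic or mutex: a data race
        auto work = [&counter] {
            for (int i = 0; i < 1000000; ++i)
                ++counter;  // read-modify-write races with the other thread
        };
        std::thread t1(work), t2(work);
        t1.join();
        t2.join();
        // Rarely prints 2000000; how much is lost varies from run to run.
        std::cout << counter << '\n';
    }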

10. High-level languages constrain the positives of the FPGA

Some companies have produced high-level language translators for FPGAs. Nowadays you can program an FPGA with graphical blocks, in a Java-like language or in C, and more recently OpenCL has been getting a lot of attention.

OpenCL (and the others) is awesome in the sense that it abstracts the whole programming environment and frees development resources from mundane tasks such as coding the DMA engines on the application side. However, that has a cost, much the same one we get hit with when moving from C++ to Java or Python: lack of control. You lose control over how your algorithm is being built; you lose access to super-efficient ways of computing a given piece (which can be an edge). So these languages are not a panacea; each comes with a cost.

11. Verilog (HDL in general) is not flexible

Much of the argument for this item has already been made above, but I am repeating it here to stress the point. Coding in Verilog is no stroll in the park. Coding even simple statements can take days if not weeks and halt a project.

This is where experience comes in to help – in this environment, knowing what not to do is more important than knowing how to do it. Instead of dwelling for weeks on how to perform a floating-point operation, can we represent our variables as fixed point instead, as in the sketch below? Can an approximation of a factorial calculation be used instead of the recursive formula? And all of this applies to VHDL as well.
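
Here is a minimal C++ sketch of the fixed-point idea (the Q16.16 format and all names are my own illustration): fractional values are stored as integers scaled by 2^16, so multiplication becomes an integer multiply plus a shift – exactly the kind of operation FPGA fabric handles well.

    #include <cstdint>
    #include <iostream>

    using q16 = int32_t;  // Q16.16: 16 integer bits, 16 fractional bits
    constexpr q16 to_q16(double v)   { return q16(v * 65536.0); }
    constexpr double from_q16(q16 v) { return v / 65536.0; }
    constexpr q16 mul_q16(q16 a, q16 b) {
        return q16((int64_t(a) * b) >> 16);  // widen, multiply, rescale
    }

    int main() {
        q16 x = to_q16(1.5), y = to_q16(0.75);
        std::cout << from_q16(mul_q16(x, y)) << '\n';  // prints 1.125
    }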

12. FPGA development is a long term commitment

FPGA development is an ecosystem as much as C++ or Java is. If a firm is committed to getting into the space, it will have to invest in tools, personnel and hardware for the long term. Hiring a couple of people and spending $10k on tools is not a recipe for success.

13. QDR memory is still scarce (and DDR memory has bottlenecks)

I remember when I wrote my first book-building algorithm on a Spartan-3 development board (book building is the act of sorting trading orders by price and time in memory), I quickly hit a bottleneck: where to store the individual orders? It turns out (DDR) memory has an interface such that, to store a byte, you have to request permission to write, wait for a flag, enable the write and wait again for confirmation. This is not a one-cycle task, and every write has to follow the same protocol, in sequence.
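
For readers unfamiliar with the term, here is a minimal C++ sketch of what a book-building structure looks like in software (my illustration, not the Spartan-3 design): a map from price to a time-ordered queue of orders.

    #include <cstdint>
    #include <deque>
    #include <functional>
    #include <iostream>
    #include <map>

    struct Order { uint64_t id; uint32_t qty; };

    // Bids sorted best (highest) price first; each level keeps arrival order.
    using BidBook = std::map<uint32_t, std::deque<Order>, std::greater<uint32_t>>;

    int main() {
        BidBook bids;
        bids[10050].push_back({1, 100});  // price in ticks
        bids[10055].push_back({2, 300});
        bids[10050].push_back({3, 200});  // queued behind order 1 at 10050

        const auto& [price, level] = *bids.begin();
        std::cout << "best bid " << price << ", first in queue: order "
                  << level.front().id << '\n';  // best bid 10055, order 2
    }

On an FPGA, every node of a structure like this has to live somewhere, and with DDR that somewhere costs you the full write protocol every time.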

The result is that if you have an application that is sequential in nature, memory will be your bottleneck.

Of course one can resort to block RAM within the FPGA, but the quantities are far from the minimum necessary to store anything significant. And QDR memory? Just a few megabytes are available in the top hardware.

The solution to the memory bottleneck is to create fast caches. Yes, you can write your own caches, but that is a whole complex development on top of what you already have to do. And you will never get anywhere near what the main companies have already done and made available, even on a $20, sub-1W silicon chip like the Intel Edison.

Which leads us to:

14. Intel processors are amazing devices

Perhaps the most compelling argument for staying away from this developing field, at least until a breakthrough arrives, is that Intel processors are just amazing pieces of technology. The amount of R&D time and money already poured into the x86 line (or the ARM line, for that matter) is enough to wipe out any shortcomings of the technology.

Not only can they run natively at stratospheric 5GHz+ frequencies, the L1/L2 cache policies can be steered by careful C++/assembly programming. One 64-bit fetch from main memory typically takes around 240 cycles according to the Intel manuals. With careful cache usage, exploiting locality, this drops to 20 cycles from L1 cache, which is the equivalent of running at 50GHz+. Add on top of that the superscalar behaviour (pipelining) these amazing pieces of silicon have had since the 386, and the bar is set way higher than an FPGA could reach on any given day.
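
As a small illustration of what "exploiting locality" means in practice (my example, with arbitrary sizes): summing a matrix row by row walks memory sequentially and stays in cache, while summing it column by column strides through memory and keeps missing the cache.

    #include <chrono>
    #include <iostream>
    #include <vector>

    int main() {
        const std::size_t N = 4096;
        std::vector<double> m(N * N, 1.0);

        auto time_sum = [&](bool row_major) {
            auto t0 = std::chrono::steady_clock::now();
            double sum = 0.0;
            for (std::size_t i = 0; i < N; ++i)
                for (std::size_t j = 0; j < N; ++j)
                    sum += row_major ? m[i * N + j]   // sequential, cache-friendly
                                     : m[j * N + i];  // strided, cache-hostile
            auto t1 = std::chrono::steady_clock::now();
            std::cout << sum << " computed in "
                      << std::chrono::duration<double, std::milli>(t1 - t0).count()
                      << " ms\n";
        };

        time_sum(true);   // typically several times faster
        time_sum(false);  // than this strided traversal
    }

Same arithmetic, same data, wildly different wall-clock time – and all it took was C++ and an awareness of the cache.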

All that said, do not get the impression that we think FPGAs are bad animals. There are many applications they are well suited to, especially ones that can be broken down into smaller pieces or that exploit application-specific logic. Which brings us to the main reason to embrace FPGAs: at HBQuant we write such custom applications and also advise clients on how to develop them and get the most out of them while avoiding the deadly pitfalls.

Henrique is the owner of HBQuant, a technology startup, and is obsessed with all things data and, more specifically, FPGA-based applications, among other things geeks are routinely obsessed with. He can be reached at henry-at-vitorian.com
