David Durst's Blog

TL;DR Let's look at a type-directed approach for implementing image processing applications on FPGAs. This approach will enable us to statically verify a design's throughput. If you like this post, check out my PLDI 2020 paper Aetherling, which provides a formal definition of the types and uses them to make trade-offs automatically during the FPGA design process.

Hardware designs on FPGAs offer a different set of performance trade-offs compared to software running on a CPU. Greater parallelism can be achieved on an FPGA by allocating more resources, like adders and multipliers. This is unlike a CPU, which has a fixed number of cores and vector lanes per core. However, FPGAs have limited resources, so trade-offs must be made between parallelism (or throughput) and resource utilization. In this blog post, we're going to look at the throughput-resource utilization trade-offs for FPGA implementations of image processing operations like image blurring. There are many ways to blur images on an FPGA. Some designs have higher throughput, and others require fewer resources. I'm going to demonstrate a type system for expressing designs with different throughput-resource trade-offs that enables you to statically determine a design's throughput. (See the appendix below for a refresher on the types of resources available on an FPGA.)

FPGA designs for image blurring with different throughput-resources trade-offs

Before we get to the type system, let's look at a few different ways to blur an image on an FPGA. These designs will have different throughput-resource trade-offs. The video below shows one design that processes one pixel per clock cycle. It does so by streaming over the input pixels, accepting one at a time, storing the two prior input pixels in registers, and computing the average of the current input and the prior two.
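The streaming design above can be sketched in a few lines of Python. This is my own illustrative model, not code from Aetherling: each call to `clock` plays the role of one clock cycle, the two instance variables play the role of the two pixel registers, and the three-way average plays the role of the adder and divider circuits.

```python
# One-pixel-per-clock 1D blur: two registers hold the prior inputs, and each
# clock cycle emits the average of the current input and those two registers.
class Blur1D:
    def __init__(self):
        self.prev1 = 0  # register holding the previous input pixel
        self.prev2 = 0  # register holding the input pixel before that

    def clock(self, pixel):
        """Accept one pixel this cycle and emit one blurred pixel."""
        out = (self.prev2 + self.prev1 + pixel) // 3
        self.prev2, self.prev1 = self.prev1, pixel  # shift the registers
        return out

blur = Blur1D()
print([blur.clock(p) for p in [30, 60, 90, 90, 90]])
# prints [10, 30, 60, 80, 90]
```

One pixel enters and one pixel leaves on every call, which is exactly the one pixel per clock throughput the animation shows.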

Design a in the image below shows the same one pixel per clock design as the above animation. Designs b and c show two other implementations with different throughput-resource utilization trade-offs. Design b processes two elements per clock, doubling the throughput relative to design a and the animation above. Design b requires double the adders and dividers compared to design a in order to achieve this throughput. Design c saves on adders but can only process one pixel every third clock cycle.

Design a
Design b
Design c
Three different implementations of a blur. The gray nodes accept and emit data every clock. The red nodes accept and emit data every third clock. For simplicity, these designs only do a blur on a 1D image.
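Design c's adder sharing can also be sketched. The schedule below is my own guess at one way to time-multiplex a single adder over three clocks; the real circuit in the figure may differ, but the interface behavior matches: data is accepted and emitted only every third clock, with `None` standing in for the invalid cycles.

```python
# Design c sketch: one adder is reused across three clocks, so a new pixel is
# accepted only on the first clock of each three-clock period and a blurred
# pixel is emitted only on the third.
class SharedBlur1D:
    def __init__(self):
        self.prev1 = 0
        self.prev2 = 0
        self.acc = 0
        self.cur = 0
        self.phase = 0  # which clock of the three-clock schedule we're on

    def clock(self, pixel=None):
        if self.phase == 0:
            # clock 0: accept a new pixel, do the first addition
            self.acc = self.prev2 + self.prev1
            self.prev2, self.prev1 = self.prev1, pixel
            self.cur = pixel
            out = None
        elif self.phase == 1:
            # clock 1: the second addition reuses the same adder
            self.acc += self.cur
            out = None
        else:
            # clock 2: divide and emit the blurred pixel
            out = self.acc // 3
        self.phase = (self.phase + 1) % 3
        return out

c = SharedBlur1D()
print([c.clock(p) for p in [30, None, None, 60, None, None, 90, None, None]])
# prints [None, None, 10, None, None, 30, None, None, 60]
```

Same blurred values as the one-pixel-per-clock design, but spread over three times as many cycles with a third of the adders active at once.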

You could write all of these designs in Verilog. Assuming you wrote correct code, they would synthesize to a bitstream, run on an FPGA, and provide the expected throughput. But it would be challenging to statically verify that the designs achieved the correct performance.

Space-time types encode throughput

Now that we've seen some different designs, we're ready for the types that encode their throughputs. Since we're building application-specific hardware, we can encode the throughput of the designs down to the clock cycle, without the complex performance issues, like multi-threading, that are common in software. I'm going to introduce these types using a simpler example than blur: map add10. This code adds 10 to each element in an input sequence. When implementing a map on an FPGA, there are two basic implementations. The first implementation, map_s add10 (or map in space), is fully parallel and duplicates add10. If the map_s is applied to a sequence of four elements, then four elements arrive on one clock cycle, map_s adds 10 to every element, and the four elements depart on the same clock cycle. The second implementation, map_t add10 (or map in time), is fully sequential and requires only one copy of the add10 circuit. The four input elements arrive on four separate clocks and map_t adds 10 to one element at a time.
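The two schedules can be contrasted with a small sketch (mine, not Aetherling's), where a "clock" is modeled as a list of the values present on the interface during that cycle:

```python
def add10(x):
    return x + 10

# map_s add10: one clock carrying four ints in, one clock carrying four ints
# out; the list comprehension plays the role of four parallel add10 circuits.
def map_s_add10(clocks):
    return [[add10(x) for x in clock] for clock in clocks]

# map_t add10: four clocks carrying one int each; a single add10 circuit is
# reused on every clock.
def map_t_add10(clocks):
    return [[add10(clock[0])] for clock in clocks]

print(map_s_add10([[1, 2, 3, 4]]))        # prints [[11, 12, 13, 14]]
print(map_t_add10([[1], [2], [3], [4]]))  # prints [[11], [12], [13], [14]]
```

Both produce the same values; what differs is how many clock cycles the values are spread over, and that is what the types will capture.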

map_s add10
map_t add10
Two different implementations of a map on an FPGA.

The types of map_s and map_t encode their throughputs. map_s add10 :: SSeq 4 int -> SSeq 4 int is a function on two space sequences. Each space sequence (SSeq 4 int) encodes both the length of the sequence and the fact that all the elements occur on the same clock cycle. The type signature of map_s add10 encodes that it's a function which accepts four ints on one clock and emits them on one clock. This is a throughput of four ints per clock. map_t add10 :: TSeq 4 0 int -> TSeq 4 0 int is a function on two time sequences. Each time sequence (TSeq 4 0 int) encodes that all the elements occur on different clocks, one after another. (We'll get to TSeq's second parameter, the 0, in the next section.) The type signature of map_t add10 encodes that it's a function which accepts four ints on separate clocks and emits them on separate clocks. This is a throughput of one int per clock. The fully sequential throughput, like the fully parallel one of map_s add10, is easily expressed using the space-time types.
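To make "the type encodes the throughput" concrete, here's a small Python model (my own sketch, not Aetherling's implementation) that computes a type's throughput as a pair: how many leaf elements cross the interface, over how many clock cycles.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Int:
    pass

@dataclass
class SSeq:
    n: int                 # elements, all on the same clock
    elem: 'SpaceTimeType'

@dataclass
class TSeq:
    n: int                 # valid clocks, one element per clock
    i: int                 # extra invalid clocks (the second parameter)
    elem: 'SpaceTimeType'

SpaceTimeType = Union[Int, SSeq, TSeq]

def throughput(t):
    """Return (elements, clocks) for a space-time type."""
    if isinstance(t, Int):
        return (1, 1)                   # one element in one clock
    elems, clocks = throughput(t.elem)
    if isinstance(t, SSeq):
        return (t.n * elems, clocks)    # more elements, same clocks
    return (t.n * elems, (t.n + t.i) * clocks)  # more elements AND more clocks

print(throughput(SSeq(4, Int())))     # prints (4, 1): four ints per clock
print(throughput(TSeq(4, 0, Int())))  # prints (4, 4): one int per clock
```

Dividing elements by clocks recovers exactly the four-ints-per-clock and one-int-per-clock throughputs described above, with no simulation required.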

Space-time types enable expressing more complex throughputs than just fully parallel or sequential

The prior map examples only had fully parallel or fully sequential throughputs. We need to support other throughputs in order to handle all of the image blur designs. We can nest the space-time types and modify their parameters to express a wider range of throughputs. The two pixel per clock blur design b is partially parallel: it processes more than one pixel per clock but doesn't do the entire image at once. We can encode partially parallel throughputs by composing the TSeq and SSeq types. map_t (map_s add10) :: TSeq 2 0 (SSeq 2 int) -> TSeq 2 0 (SSeq 2 int) has a type signature which encodes a partially parallel throughput. The design accepts and emits two ints per clock for two clocks. The one-third pixel per clock blur design c requires underutilization: it only accepts or emits pixels on some clock cycles. We can encode underutilization using the second parameter of the TSeq type, which counts invalid clocks. map_t add10 :: TSeq 4 1 int -> TSeq 4 1 int has a type signature which encodes underutilization. The design accepts and emits one int per clock for four clock cycles, but then its input and output interfaces are invalid for a fifth clock cycle.
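The arithmetic behind these two throughputs is simple enough to check by hand; the lines below just spell it out (the helper name `tseq_clocks` is my own, not Aetherling's):

```python
def tseq_clocks(n, i):
    """A TSeq n i occupies n valid clocks plus i invalid ones."""
    return n + i

# map_t (map_s add10) :: TSeq 2 0 (SSeq 2 int) -> TSeq 2 0 (SSeq 2 int)
elems = 2 * 2               # 2 clocks * 2 ints per clock
clocks = tseq_clocks(2, 0)  # 2 clocks, none invalid
print(elems / clocks)       # prints 2.0: partially parallel

# map_t add10 :: TSeq 4 1 int -> TSeq 4 1 int
elems = 4
clocks = tseq_clocks(4, 1)  # 5 clocks, one of them invalid
print(elems / clocks)       # prints 0.8: underutilized
```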

map_t (map_s add10)
map_t add10
Partially parallel and underutilized implementations of a map on an FPGA.

Let's apply the space-time types to the main example: blur

So now that we understand space-time types, let's use them to encode the throughputs of the blur designs. I'll assume that the designs are processing a roughly 4MP image, or 2560x1440 pixels. Design a has a type signature of TSeq (2560*1440) 0 int -> TSeq (2560*1440) 0 int, encoding a throughput of one pixel per clock for 2560*1440 clock cycles. With a clock rate of 250MHz, this design will process the whole image in about 15ms. That's fast enough to process 4MP images at 60fps, pretty snazzy! Design b has a type signature of TSeq (2560*1440/2) 0 (SSeq 2 int) -> TSeq (2560*1440/2) 0 (SSeq 2 int), encoding a throughput of two pixels per clock for 2560*1440/2 clock cycles. This design can process a whole image in about 7.5ms, assuming a clock rate of 250MHz. If we need to process 4MP images at 120fps, we'll use design b. Finally, if we need to save resources, design c has a type signature of TSeq (2560*1440) 0 (TSeq 1 2 int) -> TSeq (2560*1440) 0 (TSeq 1 2 int), encoding a throughput of one pixel every third clock for 2560*1440*3 clock cycles. We nested the TSeq types in order to create a repeating pattern of one valid clock followed by two invalid clocks, where no data is accepted or emitted. This design can process a whole image in about 44ms, assuming a clock rate of 250MHz. That's roughly 23fps, still not too bad.
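The frame-time arithmetic is worth sanity-checking, since the throughput comes straight off the types: pixels divided by pixels-per-clock gives cycles, and cycles divided by the clock rate gives time.

```python
# Frame times for the three blur designs at a 250MHz clock.
CLOCK_HZ = 250e6
PIXELS = 2560 * 1440

def frame_ms(pixels_per_clock):
    cycles = PIXELS / pixels_per_clock
    return cycles / CLOCK_HZ * 1000

print(frame_ms(1))      # design a: ~14.7ms, enough for 60fps
print(frame_ms(2))      # design b: ~7.4ms, enough for 120fps
print(frame_ms(1 / 3))  # design c: ~44.2ms, roughly 23fps
```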

Wanna know which design to pick? Check out Aetherling.

It's up to you to pick a design. You could demand 120fps and take the two pixel per clock design. Or, you could save on area and live with roughly 23fps. I suggest a more principled approach: decide how many FPGA resources you can devote to the blur, then pick the fastest design that fits. To find out more about this approach, check out Aetherling. As an added bonus, Aetherling also demonstrates how to write code once in a high-level, data-parallel DSL called the Sequence Language, and then automatically compile it to various Space-Time IR designs with different throughput-resource trade-offs!

Appendix: types of resources on an FPGA

Processors are implemented using transistors. Non-reconfigurable processors, like CPUs, GPUs, or ASICs, use the transistors to implement fixed cores and caches. These compute and memory resources are connected via fixed networks-on-chip. FPGAs use the transistors to provide less efficient but more flexible compute and memory resources that can be configured to implement different designs, such as the image blurring examples above.

The image below shows the resources on a Xilinx FPGA and how they are connected. Each Configurable Logic Block (CLB) is a collection of (1) registers and (2) lookup tables (LUTs) that accept a bit vector and emit a single bit depending on the value of that vector. The LUTs are used to implement general purpose compute operations, like addition or bit shifting. The registers store small values and serve as pipeline registers for improving clock rate. Each Block RAM (BRAM) stores 36Kb of data. BRAMs store larger amounts of data than registers for longer periods of time. The digital signal processing blocks (DSPs) implement multiply-add computations at a higher clock rate than is possible using the LUTs. These resources (CLBs, BRAMs, and DSPs) are connected via a reconfigurable network that can be changed to suit the application. For example, a group of CLBs and BRAMs could be connected to implement a custom memory and memory controller for one of the image blur designs.
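A LUT is nothing more than a programmable truth table, which a few lines of Python can illustrate. This is a toy 3-input LUT of my own (real Xilinx LUTs have six inputs, but the principle is identical): the configuration bitstream fills in the table, and evaluation is just an indexed read.

```python
# A 3-input LUT: the truth table has 2^3 = 8 entries, one output bit per
# possible input combination.
def make_lut(truth_table):
    """truth_table[i] is the output bit when the input bits encode integer i."""
    def lut(*bits):
        index = 0
        for b in bits:
            index = (index << 1) | b
        return truth_table[index]
    return lut

# inputs:     000 001 010 011 100 101 110 111
majority = make_lut([0,  0,  0,  1,  0,  1,  1,  1])  # 3-input majority vote
print(majority(1, 0, 1))  # prints 1
print(majority(0, 0, 1))  # prints 0
```

Reconfiguring the FPGA amounts to loading different truth tables (and routing) rather than fabricating different transistors, which is the flexibility-for-efficiency trade described above.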

A Xilinx FPGA has three types of resources: CLBs, BRAMs, and DSPs. The resources, and their connecting network, can be reconfigured depending on the application.