Hardware designs on FPGAs offer a different set of performance trade-offs compared to software running on a CPU. Greater parallelism can be achieved on an FPGA by allocating more resources, like adders and multipliers. This is unlike a CPU, which has a fixed number of cores and vector lanes per core. However, FPGAs have limited resources, so trade-offs must be made between parallelism (or throughput) and resource utilization. In this blog post, we're going to look at the throughput-resource utilization trade-offs for FPGA implementations of image processing operations like image blurring. There are many ways to blur images on an FPGA. Some designs have higher throughput, and others require fewer resources. I'm going to demonstrate a type system for expressing designs with different throughput-resource trade-offs that enables you to statically determine a design's throughput. (See below for a refresher on the types of resources available on an FPGA.)
FPGA designs for image blurring with different throughput-resource trade-offs
Before we get to the type system, let's look at a few different ways to blur an image on an FPGA. These designs will have different throughput-resource trade-offs. The below video shows one design that processes one pixel per clock cycle. It does so by streaming over the input pixels, accepting one at a time, storing two prior input pixels in registers, and computing the average of the current input and the prior two.
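To make the schedule concrete, here is a small Python model of that one-pixel-per-clock design (a software sketch, not generated hardware; the name BlurStream is illustrative, not Aetherling's API). Each call to step represents one clock cycle: two registers hold the prior two input pixels, and the output is the average of the current input and those two.

```python
class BlurStream:
    """Model of the streaming blur: one pixel in, one pixel out, per clock."""

    def __init__(self):
        self.r1 = 0  # register: pixel from one clock ago
        self.r2 = 0  # register: pixel from two clocks ago

    def step(self, pixel):
        # One adder chain and one divider: average of current + two prior pixels.
        out = (pixel + self.r1 + self.r2) // 3
        # Shift the register chain for the next clock.
        self.r2, self.r1 = self.r1, pixel
        return out

blur = BlurStream()
outputs = [blur.step(p) for p in [30, 60, 90, 120]]  # one pixel per clock
```

Each input pixel produces one output pixel on the same clock it arrives, so four pixels take four clocks.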
Design a in the image below shows the same one-pixel-per-clock design as the above animation. Designs b and c show two other implementations with different throughput-resource utilization trade-offs. Design b processes two elements per clock, doubling the throughput relative to design a and the animation above. Design b requires double the adders and dividers compared to design a in order to achieve this throughput. Design c saves on adders but can only process one pixel every third clock cycle.
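Design b's schedule can be sketched the same way (again a hypothetical Python model, not the hardware itself): two pixels enter per clock and two blurred pixels leave per clock, at the cost of two parallel adder/divider chains. The registers still carry the last two pixels across clock boundaries.

```python
class Blur2PerClock:
    """Model of design b: two pixels in, two pixels out, per clock."""

    def __init__(self):
        self.r1 = 0  # pixel from one clock ago
        self.r2 = 0  # pixel from two clocks ago

    def step(self, p0, p1):
        # Two parallel average units -> double the adders and dividers.
        out0 = (p0 + self.r1 + self.r2) // 3
        out1 = (p1 + p0 + self.r1) // 3
        # The last two pixels of this clock feed the next clock.
        self.r2, self.r1 = p0, p1
        return out0, out1

b = Blur2PerClock()
pair0 = b.step(30, 60)    # first clock: two results
pair1 = b.step(90, 120)   # second clock: two more results
```

Feeding the same stream as the one-pixel design, this produces the same output values in half the clock cycles.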
You could write up all these different designs in Verilog. Assuming you wrote correct code, they would synthesize to a bitstream, run on an FPGA, and provide the expected throughput. But, it would be challenging to statically verify that the designs achieved the correct performance.
Space-time types encode throughput
Now that we've seen some different designs, we're ready for the types that encode their throughputs. Since we're building application-specific hardware, we can encode the throughput of the designs down to the clock cycle, without the complex performance issues common in software like multi-threading. I'm going to introduce these types using a simpler example than blur: map add10. This code adds 10 to each element in an input sequence. There are two basic ways to implement a map on an FPGA. The first implementation, map_s add10 (or map in space), is fully parallel and duplicates add10. If the map_s is applied to a sequence of four elements, then four elements arrive on one clock cycle, map_s adds 10 to every element, and the four elements depart on the same clock cycle. The second implementation, map_t add10 (or map in time), is fully sequential and requires only one copy of the add10 circuit. The four input elements arrive on four separate clocks and map_t adds 10 to one element at a time.
The types of map_s and map_t encode their throughputs. map_s add10 :: SSeq 4 int -> SSeq 4 int is a function on two space sequences. Each space sequence (SSeq 4 int) encodes both the length of the sequence and that all the elements occur on the same clock cycle. The type signature of map_s add10 encodes that it's a function which accepts four ints on one clock and emits them on one clock. This is a throughput of four ints per clock. map_t add10 :: TSeq 4 0 int -> TSeq 4 0 int is a function on two time sequences. Each time sequence (TSeq 4 0 int) encodes that all the elements occur on different clocks, one after another. The type signature of map_t add10 encodes that it's a function which accepts four ints on separate clocks and emits them on separate clocks. This is a throughput of one int per clock. The fully sequential throughput, like the fully parallel one for map_s add10, is easily expressed using the space-time types.
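The difference between the two schedules can be modeled in a few lines of Python (an illustrative sketch; the real map_s and map_t are typed Aetherling circuits, not Python functions). Here each inner list represents one clock cycle's worth of data.

```python
def add10(x):
    return x + 10

def map_s(f, elems):
    # Models SSeq 4 int -> SSeq 4 int: four copies of f run in parallel,
    # so all four results occupy a single clock cycle.
    return [[f(x) for x in elems]]       # 1 clock, 4 results

def map_t(f, elems):
    # Models TSeq 4 0 int -> TSeq 4 0 int: one copy of f is reused,
    # so each result occupies its own clock cycle.
    return [[f(x)] for x in elems]       # 4 clocks, 1 result each

spatial = map_s(add10, [1, 2, 3, 4])     # throughput: 4 ints per clock
temporal = map_t(add10, [1, 2, 3, 4])    # throughput: 1 int per clock
```

Both compute the same values; the types only differ in how those values are laid out across clock cycles, which is exactly what determines throughput and resource usage.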
Space-time types enable expressing more complex throughputs than just fully parallel or sequential
The prior map examples only had fully parallel or sequential throughputs. We need to support other types of throughputs in order to handle all of the image blur designs. We can nest and modify the parameters of the space-time types to express a wider range of throughputs. The two-pixel-per-clock blur design b is partially parallel: it processes more than one pixel per clock but doesn't do the entire image at once. We can encode partially parallel throughputs by composing the TSeq and SSeq types. map_t (map_s add10) :: TSeq 2 0 (SSeq 2 int) -> TSeq 2 0 (SSeq 2 int) has a type signature which encodes a partially parallel throughput. The design accepts and emits two ints per clock for two clocks. The one-third-pixel-per-clock blur design c requires underutilization: it only accepts or emits pixels on some clock cycles. We can encode underutilization by using the second parameter of the TSeq type. map_t add10 :: TSeq 4 1 int -> TSeq 4 1 int has a type signature which encodes underutilization. The design accepts and emits one element per clock over four clock cycles, but then its input and output interfaces are invalid for the fifth clock cycle.
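These two throughput patterns can be sketched with the same clock-schedule model as before (hypothetical Python, not the DSL). Each inner list is one clock cycle, and None marks a clock on which the interface is invalid.

```python
def add10(x):
    return x + 10

def map_t_map_s(f, elems, width):
    # Models TSeq 2 0 (SSeq 2 int): `width` elements per clock over
    # len(elems) / width clocks -- partially parallel.
    return [[f(x) for x in elems[i:i + width]]
            for i in range(0, len(elems), width)]

def map_t_underutilized(f, elems, idle):
    # Models TSeq 4 1 int: one element per clock for len(elems) clocks,
    # then `idle` clocks on which the interface carries no valid data.
    return [[f(x)] for x in elems] + [None] * idle

partially_parallel = map_t_map_s(add10, [1, 2, 3, 4], 2)
# -> 2 clocks, each carrying 2 ints
underutilized = map_t_underutilized(add10, [1, 2, 3, 4], 1)
# -> 4 valid clocks followed by 1 invalid clock
```

The first parameter of TSeq counts the valid clocks and the second counts the trailing invalid ones, so the total schedule length is their sum.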
Let's apply the space-time types to the main example: blur
So now that we understand space-time types, let's use them to encode the throughputs of the blur designs. I'll assume that the designs are processing a 4MP image, or 2560x1440 pixels. Design a has a type signature of TSeq (2560*1440) 0 int -> TSeq (2560*1440) 0 int, encoding a throughput of one pixel per clock for 2560*1440 clock cycles. With a clock rate of 250MHz, this design will process the whole image in 15ms. That's fast enough to process 4MP images at 60fps, pretty snazzy! Design b has a type signature of TSeq (2560*1440/2) 0 (SSeq 2 int) -> TSeq (2560*1440/2) 0 (SSeq 2 int), encoding a throughput of two pixels per clock for (2560*1440/2) clock cycles. This design can process a whole image in 7.5ms, assuming a clock rate of 250MHz. If we need to process 4MP images at 120fps, we'll use design b. Finally, if we need to save resources, design c has a type signature of TSeq (2560*1440) 0 (TSeq 1 2 int) -> TSeq (2560*1440) 0 (TSeq 1 2 int), encoding a throughput of one pixel every third clock for (2560*1440*3) clock cycles. We nested the TSeq types in order to create a repeating pattern of valid and invalid clocks, where no data is accepted or emitted. This design can process a whole image in about 44ms, assuming a clock rate of 250MHz. That's roughly 23fps, still not too bad.
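The frame times above are straightforward back-of-the-envelope arithmetic; here it is spelled out in Python, with the cycle counts read directly off each design's type signature.

```python
CLOCK_HZ = 250e6          # assumed 250 MHz clock, as in the text
PIXELS = 2560 * 1440      # 2560x1440 image

def frame_ms(cycles):
    """Time to process one frame, in milliseconds, at CLOCK_HZ."""
    return cycles / CLOCK_HZ * 1e3

design_a_ms = frame_ms(PIXELS)          # 1 pixel per clock
design_b_ms = frame_ms(PIXELS // 2)     # 2 pixels per clock
design_c_ms = frame_ms(PIXELS * 3)      # 1 pixel every 3 clocks
```

This gives roughly 14.7ms, 7.4ms, and 44.2ms per frame for designs a, b, and c respectively, i.e. about 68fps, 136fps, and 23fps.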
Wanna know which design to pick? Check out Aetherling.
It's up to you to pick a design. You could demand 120fps and take the two-pixel-per-clock design. Or, you could save on area and live with a lower frame rate. I suggest a more principled approach: decide how many FPGA resources you can devote to the blur, then pick the fastest design that fits. To find out more about this approach, check out Aetherling. As an added bonus, Aetherling also demonstrates how to write code once in a high-level, data-parallel DSL called the Sequence Language, and then automatically compile it to various Space-Time IR designs with different throughput-resource trade-offs!