Bambu: A Free Framework for the High-Level Synthesis of Complex Applications
Bambu is a free framework aimed at assisting the designer during the high-level synthesis of complex applications, supporting most of the C constructs (e.g., function calls and sharing of the modules, pointer arithmetic and dynamic resolution of memory accesses, accesses to array and structs, parameter passing either by reference or copy, …). Bambu is developed for Linux systems, it is written in C++11, and it can be freely downloaded under GPL license.
Bambu receives as input a behavioral description of the specification, written in C/C++ language, and generates the HDL description of the corresponding RTL implementation as output, which is compatible with commercial RTL synthesis tools, along with a test-bench for the simulation and validation of the behavior. Bambu is designed in an extremely modular way, implementing the different tasks of the HLS process, and specific algorithms, in distinct C++ classes which work on different IRs depending on the synthesis stage.
The whole HLS flow is quite similar to a software compilation flow: it starts from a high-level specification and produces low-level code after a sequence of analysis and optimization steps.
As well as software compilation flow has, three different phases can be identified in the High-Level Synthesis flow: front-end, middle-end and back-end. In the front-end, the input code is parsed and translated in an intermediate representation which will be used in the following parts of the flow. In the middle-end target-independent analyses and optimizations are performed.
Bambu interfaces with compilers: GCC and Clang. The GNU Compiler Collection (GCC) (version 4.5, 4.6, 4.7, 4.8, 4.9, 5, 6, and 8 are currently supported) by means of GCC plugins to extract its internal representation in Static Single Assignment form of the initial C/C++ code. The IR used is the one obtained by GCC after all the target and language-independent optimizations. Similarly, Clang/LLVM has been integrated into Bambu by exploiting a plugin that lowers the LLVM IR into the one used by Bambu. Current Clang versions supported are 4, 5, 6, 7. The data exchanged between the plugins and Bambu is based on an ASCII file easy to parse with flex and bison.
Note however that not all the software code optimizations are profitable when the target is a hardware accelerator. For example, the effects of transformations like function inlining and loop unrolling can impact much more on resource utilization than the same transformation done when a processor is considered.
Starting from the intermediate representation extracted from GCC/Clang, Bambu performs further analyses and builds additional internal representations, such as Call Graph, Control Flow Graphs, Data Flow Graphs and Program Dependence Graphs.
Next, it applies a set of analyses and transformations independently from the target device.
Some of these steps are the same applied in a software compilation flow (e.g., data flow analysis, loop recognition, dead code elimination, constant propagation, etc.).
One relevant specific optimizations performed by Bambu during this phase is the optimization of multiplications and divisions by a constant.
These operations are typically transformed into operations that use only shifts and adds to improve area and timing.
Another analysis performed at this stage is the Bitwidth Analysis that aims to reduce the number of bits required by datapath operators.
This is a very important optimization because it impacts all non-functional requirements (e.g. performance, area, power) of a design, without affecting its behavior.
Differently from general-purpose processor compilers, which are designed to target a processor with a fixed-sized datapath (usually 32 or 64 bits), a hardware compiler can exploit specialization by generating custom-size operators (i.e. functional units) and registers. As a direct consequence, we can select the minimal number of bits required for an operation and/or storage of the specific algorithm, which in turn leads to minimal space used for registers, smaller functional units that translate into less area, less power, and shorter critical paths.
However, this analysis cannot be usually completely automated since it often requires specific knowledge of the algorithm and the input datasets.
Bambu implements the methodology describes in budiu-tr00.pdf integrated with the Value Range information computed by the GCC/Clang compiler.
In this phase the actual High-Level Synthesis of the specification is performed.
Even if the same HDL language can be used to describe architectures implemented for different families of devices, the HLS flow is not target-independent but takes into account information about the target device. Moreover, FPGAs do not have a fixed operating frequency, but this can be decided by the designer or forced by devices (e.g., sensors or actuators) connected to it.
The synthesis process acts on each function separately. The resulting architecture is modular, reflecting the structure of the call graph.
The modules implementing the single functions include two different parts: the control logic and the data-path.
The control logic is modeled as a Finite State Machine which handles the routing of the data within the data-path and the execution of the single operations.
The generated data-path is a custom mux-based architecture optimized on the dimension of the data types to reduce the number of flip-flops and bit-level multiplexers.
It implements all the operations that have to be executed and stores their input and output.
The back-end phase generates the actual hardware architecture by performing the following steps:
Functions Allocation defines the hierarchy of the modules implementing the functions of the specification built.
Bambu is currently able to use and integrate functions described at low-level in Verilog or in VHDL with functions described at high-level in C/C++.
Memories Allocation defines the memories used to store aggregate variables (arrays and structures), global variables, and how the dynamic memory allocation is implemented.
Bambu adopts a novel architecture for memory accesses: it builds a hierarchical data-path directly connected to a dual-port BRAM whenever a local aggregated or a global scalar/aggregate data type is used by the code specified and whenever the accesses can be determined at compile time.
In this case, multiple memory accesses can be performed in parallel.
Otherwise, the memories are interconnected so that it is also possible to support the dynamic resolution of the addresses.
Indeed, the same memory infrastructure can be natively connected to external components (e.g. a local scratch-pad memory or cache) or directly to the bus to access off-chip memory.
Resource allocation associates operations in the specification to Functional Units (FUs) in the resource library. During the middle-end phase the specification is inspected, and operations characteristics identified. Such characteristics include the kind of operation (e.g. addition, multiplication, …), and input/output value types (e.g. integer, float, …).
Floating-point operations are supported through the High-Level Synthesis of a soft-float library containing basic soft-float operations or through FloPoCo, a generator of arithmetic Floating-Point Cores. The allocation step maps them on the set of available FUs: their characterization includes information, such as latency, area, and a number of pipeline stages. Usually, more operation/FU matchings are feasible: in this case, the selection of a proper FU is driven by design constraints. In addition to FUs, also memory resources are allocated. Local data in fact, maybe bound to local memories.
The library of functional units used by Bambu is quite rich and in some cases, it includes several implementations for the same single operation.
Moreover, the library contains functional units that are expressed as templates in a standard hardware description language (i.e. Verilog or VHDL). These templates can be retargeted and customized on the basis of the characteristics of the target technology. In this case, the underlying logic synthesis tool can determine which is the best architecture to implement each function. For example, multipliers can be mapped either on dedicated DSP blocks or implemented with LUTs. To perform aggressive optimizations, each component of the library is annotated with information useful during the entire HLS process, such as resource occupation and latency for executing the operations. Bambu adopts a pre-characterization approach. That is, the performance estimation considers a generic template of the functional unit, which can be parametric with respect to the bitwidths and pipeline stages. Latency and resource occupation are then obtained by synthesizing each configuration and storing the results in the library.
Scheduling of operations is performed by default through a LIST-based algorithm, which is constrained by resource availability. In its basic formulation, the LIST algorithm associates to each operation a priority, according to particular metrics. For example, priority may reflect operations mobility with respect to the critical path. Operations belonging to the critical path have zero-mobility: delaying their execution usually results in an increase of the overall circuit latency. Critical path and mobilities can be obtained analyzing As Soon As Possible (ASAP) and As Late As Possible (ALAP) schedules. The LIST approach proceeds iteratively associating to each control step, operations to be executed. Ready operations (e.g. whose dependencies have been satisfied in previous iterations of the algorithm) are scheduled in the current control step considering resource availability: if multiple ready operations compete for a resource than the one having a higher priority is scheduled. Alternatively, a Speculative scheduling algorithm based on System of Difference Constraints (see Code Transformations Based on Speculative SDC Scheduling paper) is available: this algorithm builds an integer linear programming formulation of the scheduling problem, allowing code motions and speculations of operations into different basic blocks. The solution produced by the ILP solver is then implemented by applying the code motions and the speculations suggested by the ILP solution, then the rest of the High-Level Synthesis flow can be implemented. After the scheduling task, it is possible to build State Transition Graph (STG) accordingly: the STG is adopted for further analysis and to build the final Finite State Machine implementation for the controller.
Operations that execute concurrently, according to the computed schedule, are not allowed to share the same FU instance, thus avoiding resource conflicts. In Bambu, binding is performed through a clique covering algorithm on a weighted compatibility graph. The compatibility graph is built by analyzing the schedule: operations scheduled on different control steps are compatible. Weights express how much is profitable for two operations to share the same hardware resource. They are computed taking into account area/delay trade-offs as a result of sharing; for example, FUs that demand a large area will be more likely shared. Weights computation also considers the cost of interconnections for introducing steering logic, both in terms of area and frequency. Bambu offers several algorithms also for solving the covering problem on generic compatibility/conflict graphs.
Register binding associates storage values to registers, and requires a preliminary analysis step, the Liveness Analysis (LA). LA analyzes the scheduled function and identifies the life intervals of each variable, i.e. the sequence of control steps in which a temporary needs to be stored. Storage values with non-overlapping life intervals may share the same register. In default settings, the Bambu flow computes liveness information through a non-iterative SSA liveness analysis algorithm (see Non-Iterative SSA liveness analysis paper). Register assignment is then reduced to the problem of coloring a conflict graph. The nodes of the graph are storage values, edges represent the conflict relation. Algorithms for a weighted clique covering compatibility graph solving the register binding problem are also available.
Interconnections are bound according to the previous steps: if a resource is shared, then the algorithm introduces steering logic on its inputs. It also identifies the relation between control signals and different operations: such signals are then set by the controller.
During the synthesis process, the final architecture is represented through a hyper-graph, which also highlights the interconnection between modules.
The netlist generation step translates such representation in a Verilog or VHDL description. The process access the resource library, which embeds the Verilog or the VHDL implementation of each allocated module.
Generation of Synthesis and Simulation Scripts
Bambu provides the automatic generation of synthesis and simulation scripts which can be customized by means of XML configuration files. This feature allows the automatic characterization of the resource library, providing technology-aware details during the High-Level Synthesis.
The tools for RTL-synthesis currently supported are:
- Xilinx ISE,
- Xilinx VIVADO
- Altera Quartus
- Lattice Diamond
while the supported simulators are:
- Mentor Modelsim,
- Xilinx ISIM
- Xilinx XSIM
- Verilog Icarus