Designing heterogeneous systems with specialized hardware accelerators, such as FPGAs and ASICs, requires balancing performance and energy efficiency. Early estimation of these metrics is crucial for optimal design decisions before physical prototypes are available. While tools exist for per-component energy modeling, they fall short of providing full-system insights. Combining SimBricks Virtual Prototypes for capturing high-quality workload information with existing energy models makes it possible to estimate full-system energy usage early in development.

Early Performance and Energy Estimates Allow for Optimal Design Decisions

We build increasingly heterogeneous systems that use specialized hardware such as FPGAs and ASICs to offload computations from the general-purpose processor. The key metrics we aim to improve while doing so are full-system performance, such as request throughput or latency, and full-system energy usage, such as the energy consumed per request. Introducing specialized hardware accelerators can provide orders-of-magnitude improvements in both of these metrics.

While developing these systems, including the hardware accelerator, we face a vast space of system- and component-level design choices, many of which influence full-system performance and energy. Further, designs optimized for performance differ from those optimized for energy usage. Because of this performance-energy trade-off, making the best design choices for a concrete use case requires estimating both performance and energy with confidence early on, before physical prototypes are available. While SimBricks Virtual Prototypes enable measuring full-system performance early on, no comparable tool does the same for full-system energy.

Current Prototyping Tools Do Not Provide Full-System Energy Estimates

Power estimation models already exist, for example, the built-in models of commercial tools such as Synopsys PrimePower [1], Cadence Genus [2], and AMD Vivado [3], or open-source tools like OpenROAD OpenSTA [4], McPAT [5], and GemStone [6]. Note that power is not the same as energy, but energy is easy to calculate from power once the workload duration is known. However, all of these models cover single system components only. Even for a simple heterogeneous system consisting of a host and a PCIe-attached hardware accelerator, we would like full-system energy estimates spanning the host CPU, its caches, host memory, the accelerator, and accelerator memory (if present).
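To make the step from power to energy concrete, here is the relationship written out. The symbols are introduced here for illustration only: P(t) is the power draw at time t, T is the workload duration, and N is the number of requests served during that time.

```latex
% Energy is the time integral of power over the workload duration T.
E = \int_{0}^{T} P(t)\,\mathrm{d}t
% With an average power draw \bar{P}, this simplifies to:
E = \bar{P} \cdot T
% Example: \bar{P} = 50\,\mathrm{W} for T = 2\,\mathrm{s} gives E = 100\,\mathrm{J}.
% Energy per request for N requests served in that time:
E_{\mathrm{req}} = \frac{E}{N}
```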

Further, these existing power estimation models require high-quality, component-level workload information for accurate results. There is a good reason for this: a considerable portion of a chip's power draw is dynamic and grows as transistors switch more frequently. The workload information has to be component-level, though, i.e., it needs to represent how the workload interacts with the component whose power we are estimating. Suppose, for example, we wrote the RTL code for our hardware accelerator, ran synthesis up to place and route, and are now using OpenSTA for power estimation. This concrete power estimation model requires information on how frequently every circuit signal switches. Traditionally, to collect this, engineers simulate the RTL in a testbench and try to stimulate the hardware accelerator's inputs as closely as possible to what the workload software running in the real system would do. Writing such a testbench is laborious because it has to reproduce low-level interactions between hardware components, and it can easily introduce modeling errors.
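Why switching frequency matters can be seen from the textbook first-order model of dynamic power in CMOS circuits. This is a general approximation and not necessarily the exact model any of the tools above implement; the symbols are introduced here for illustration.

```latex
% First-order dynamic power of a CMOS circuit:
%   \alpha   activity factor (fraction of clock cycles in which a signal switches)
%   C        switched capacitance
%   V_{DD}   supply voltage
%   f        clock frequency
P_{\mathrm{dyn}} = \alpha \, C \, V_{DD}^{2} \, f
% Total power adds the static (leakage) part, which is drawn regardless of activity:
P = P_{\mathrm{dyn}} + P_{\mathrm{static}}
```

The activity factor is the part that depends on the workload, which is why the quality of the collected switching information directly affects the accuracy of the estimate.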

Combining Existing Per-Component Energy Models and SimBricks Virtual Prototypes for Full-System Energy Estimates

With SimBricks Virtual Prototypes, we do not have to model the workload since we can directly run the workload’s entire software stack, including all interactions between hardware components. Since we are using simulation, we can easily collect high-quality, component-level workload information without influencing the simulated system. This naturally led me to explore the idea of combining existing power estimation models and SimBricks Virtual Prototypes in my Master’s thesis. For this, I am making heavy use of SimBricks’ modularity in combining different simulators.

Let us go through how to do this step-by-step. The figure below also visualizes the process. First, we pick the power estimation models followed by the simulators for realizing the Virtual Prototype. When picking the simulators, it is important to ensure that they can supply the level of detail in workload information required by the power estimation models. Next, we configure the SimBricks Virtual Prototype to represent our system and run the workload’s software stack using the SimBricks orchestration framework.

[Figure: overview of the energy estimation process]

With this, we are ready to run the SimBricks simulation. We instruct simulators to periodically log the workload information, yielding what I call workload information samples. We need these to capture how the power draw dynamically changes over time due to how the workload software interacts with the system’s hardware. Feeding these workload information samples into the per-component power models in the next step effectively yields per-component power time series, i.e., power consumed by a single component over time.
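As a minimal sketch of this step, the snippet below turns periodic workload information samples into a per-component power time series. Everything in it is an assumption for illustration: the `WorkloadSample` format, the `power_series` helper, and the toy power model (static power plus a fixed energy cost per counted event) are hypothetical and stand in for a real per-component model such as the ones listed earlier.

```python
from dataclasses import dataclass


@dataclass
class WorkloadSample:
    """One periodic log entry from a simulator (hypothetical format)."""
    t_end_s: float  # simulated time at which the sample was logged
    events: int     # activity since the previous sample, e.g. cache accesses


def power_series(samples, static_w, energy_per_event_j, t_start_s=0.0):
    """Turn workload samples into (interval_start_s, average_power_w) pairs.

    Assumed toy model: P = static power + events * energy per event / interval.
    """
    series = []
    prev_t = t_start_s
    for s in samples:
        dt = s.t_end_s - prev_t
        if dt <= 0:
            raise ValueError("samples must have strictly increasing timestamps")
        dynamic_w = s.events * energy_per_event_j / dt
        series.append((prev_t, static_w + dynamic_w))
        prev_t = s.t_end_s
    return series


# Three samples logged every millisecond of simulated time.
samples = [
    WorkloadSample(0.001, 1000),
    WorkloadSample(0.002, 4000),
    WorkloadSample(0.003, 0),
]
cache_power = power_series(samples, static_w=0.5, energy_per_event_j=1e-6)
```

In this example, the busy second interval shows up as a spike in power, while the idle third interval falls back to static power only. A shorter sampling period captures such changes at finer granularity at the cost of more log data.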

In a final post-processing step, we sum the per-component power time series into a single full-system power time series, which we also use to compute full-system energy. Under the hood, we need a few additional things, such as piece-wise constant interpolation, because the timestamps of the workload information samples might not align across simulators. However, I am omitting the details here. If you are interested, I recommend taking a look at my Master’s thesis [7].
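The post-processing step can be sketched as follows. This is a simplified illustration rather than the implementation from the thesis: the helper names `power_at` and `full_system`, the data layout, and the choice that a component draws nothing before its first sample are all assumptions made here. Each series is a list of (timestamp in seconds, power in watts) pairs, and a value holds until the next timestamp.

```python
def power_at(series, t):
    """Piece-wise constant lookup: power of the latest sample at or before t."""
    value = 0.0  # assumption: a component draws nothing before its first sample
    for ts, p in series:
        if ts > t:
            break
        value = p
    return value


def full_system(series_by_component, t_end):
    """Sum per-component series on the union of their timestamps."""
    times = sorted(
        {ts for s in series_by_component.values() for ts, _ in s if ts < t_end}
    )
    total = [
        (t, sum(power_at(s, t) for s in series_by_component.values()))
        for t in times
    ]
    # Energy is the integral of a step function: power times interval length.
    bounds = times[1:] + [t_end]
    energy_j = sum(p * (t1 - t0) for (t0, p), t1 in zip(total, bounds))
    return total, energy_j


# Two components whose samples do not share the same timestamps.
series = {
    "host_cpu": [(0.0, 10.0), (2.0, 20.0)],
    "accelerator": [(0.0, 1.0), (1.0, 5.0), (3.0, 2.0)],
}
total, energy = full_system(series, t_end=4.0)
# total  == [(0.0, 11.0), (1.0, 15.0), (2.0, 25.0), (3.0, 22.0)]
# energy == 73.0 (joules)
```

Evaluating every series on the union of all timestamps is what makes the sum well defined even though the two simulators logged at different times.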
