Efficient message passing is one of the core building blocks of SimBricks. As we touched on in our technical overview, SimBricks builds full-system simulations by loosely coupling multiple component simulators through message passing. In particular, SimBricks uses message passing to exchange data at component boundaries (e.g., network packets or PCIe transactions), but also to synchronize simulators so that simulations yield meaningful performance results. This post dives into how SimBricks implements efficient message passing.
Key Goal: Minimize Overhead in Bottleneck Simulator
Many simulators are notoriously slow even when running individually. Thus, one key goal for the message passing mechanism in SimBricks is to avoid further slowing down these already slow component simulators. This is particularly important for synchronized SimBricks simulations, which by construction run only as fast as their slowest component. Additionally, SimBricks needs a message passing mechanism that scales to simulations with many communicating components and is easy to integrate into a diverse range of existing simulators.
Message Passing Through Shared Memory Queues
SimBricks meets these goals by implementing message passing via optimized, polled shared-memory queues. Simply put, each pair of communicating simulators creates a shared memory region and places a pair of circular queues in it, one for each direction. A simulator then sends a message to its peer simply by writing it into the next free slot of its outgoing circular queue. Each simulator periodically checks its incoming queues for new messages and schedules them for processing at the timestamp indicated in the message (remember our synchronization). Note that a simulator only needs to poll for a new message once it is ready to process the next one. Since messages are timestamped and their processing is scheduled accordingly, this typically means polling for the next message right after processing of the previous message finishes.
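To make this concrete, here is a minimal sketch of what such a polled shared-memory queue could look like in C. This is an illustration under simplifying assumptions, not the actual SimBricks implementation: the struct layout, the slot size, and the owner-flag handshake are all hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_SLOTS 1024 /* hypothetical queue capacity */

/* Slot ownership: the flag tells whether a slot currently holds a
 * message for the receiver or is free for the sender to reuse. */
enum { OWNER_PRODUCER = 0, OWNER_CONSUMER = 1 };

/* One fixed-size message slot, placed directly in the queue. */
typedef struct {
    _Atomic uint8_t owner;   /* who may access this slot next */
    uint8_t msg_type;        /* e.g. network packet or PCIe transaction */
    uint16_t len;            /* payload length in bytes */
    uint64_t timestamp;      /* when the receiver should process it */
    uint8_t payload[48];     /* message data lives directly in the slot */
} slot_t;

/* One direction of communication: a circular array of slots. The
 * producer and consumer positions are shown inline for brevity; in
 * practice each side keeps its own position in private memory. */
typedef struct {
    slot_t slots[NUM_SLOTS]; /* lives in the shared memory region */
    size_t prod_pos;         /* sender's next slot */
    size_t cons_pos;         /* receiver's next slot */
} queue_t;

/* Send: write the message into the next free slot, then hand the slot
 * to the consumer by flipping the owner flag. The release store makes
 * the payload visible before the flag. */
bool queue_send(queue_t *q, uint64_t ts, const void *data, uint16_t len) {
    slot_t *s = &q->slots[q->prod_pos];
    if (len > sizeof(s->payload))
        return false; /* message does not fit into a fixed-size slot */
    if (atomic_load_explicit(&s->owner, memory_order_acquire) != OWNER_PRODUCER)
        return false; /* queue full; a bottleneck simulator never sees this */
    s->timestamp = ts;
    s->len = len;
    memcpy(s->payload, data, len);
    atomic_store_explicit(&s->owner, OWNER_CONSUMER, memory_order_release);
    q->prod_pos = (q->prod_pos + 1) % NUM_SLOTS;
    return true;
}

/* Poll: return the next message if one is available, NULL otherwise. */
slot_t *queue_poll(queue_t *q) {
    slot_t *s = &q->slots[q->cons_pos];
    if (atomic_load_explicit(&s->owner, memory_order_acquire) != OWNER_CONSUMER)
        return NULL; /* no new message yet */
    return s;
}

/* Release: after processing a message, hand its slot back to the sender. */
void queue_release(queue_t *q, slot_t *s) {
    atomic_store_explicit(&s->owner, OWNER_PRODUCER, memory_order_release);
    q->cons_pos = (q->cons_pos + 1) % NUM_SLOTS;
}
```

In this sketch, ownership of each slot alternates between producer and consumer: the release store on the owner flag publishes the message, and the matching acquire load on the other side ensures the payload is visible before it is read. A simulator would map the shared region (e.g., with mmap) during setup and then operate entirely in user space: sending and polling are just loads and stores.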
Polled Shared Memory Message Passing Minimizes Overhead at Bottlenecks
We designed and optimized SimBricks message passing to meet these specific design goals. In particular, our approach minimizes communication overheads between simulators running as separate parallel processes, typically on separate cores. Specifically, polled shared-memory message passing minimizes overheads where it counts: in bottleneck simulators. While each simulator has to actively poll for messages, the slowest simulator in a system never has to wait for messages from others. Whenever the bottleneck simulator finishes processing the previous message, it checks the next queue slot, finds the next message already there, and immediately schedules it for processing. The same holds for sending: a bottleneck simulator never has to wait for empty slots in its outgoing queue. Other, faster simulators may sometimes have to poll repeatedly and wait for messages, but by construction this only happens when they are not the bottleneck and thus does not slow down the simulation. Polled shared-memory message passing also requires no expensive operating system involvement for sending and receiving messages, only during initial setup.
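The following sketch, building on the hypothetical queue_poll and queue_release functions above, shows how this plays out in a simulator's receive path; schedule_event is likewise a hypothetical stand-in for the simulator's event scheduler.

```c
/* Hypothetical hook into the simulator's event scheduler. */
extern void schedule_event(uint64_t timestamp, slot_t *msg);

/* Receive side of a simplified simulator main loop. A bottleneck
 * simulator succeeds on the first poll every time; only faster,
 * synchronized simulators end up spinning in the while loop, and by
 * construction that never slows down the overall simulation. */
void fetch_next_message(queue_t *q) {
    slot_t *s;
    while ((s = queue_poll(q)) == NULL) {
        /* busy-wait: the peer has not produced the next message yet */
    }
    schedule_event(s->timestamp, s);
    /* queue_release(q, s) is called once the event has been processed */
}
```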
Deep Dive: Minimal Cache Coherence Overheads
Polled shared-memory message passing also matches today’s multicore architectures very well and, when carefully implemented, minimizes cache coherence overhead. Specifically, optimized polled queues communicate only the absolute minimum of information between cores. A key design choice here is to place messages in fixed-size slots directly in the queue.
Here we focus on receiving messages. There are two cases to consider: either the sender has already placed a message in the slot, or it has not yet. Either way, the receiver’s first access incurs a cache miss (albeit to a known location, so prefetching can avoid stalling on a demand miss). If the message is already there, that miss immediately pulls in the first cache line of the message, including the critical metadata for further processing. If not, the polling receiver incurs a single miss on the first access to learn that the slot is empty (required information), but subsequent unsuccessful poll attempts hit in the local cache and incur no further misses. Only when the sender writes to the cache line (incurring a miss on its side) to place the new message is the receiver’s local copy invalidated, and the receiver’s next access then pulls in the required data.
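One way to achieve this behavior, sketched below assuming 64-byte cache lines, is to align each slot to a cache line so that the owner flag and the critical message header share a single line; the receiver’s one poll miss then fetches both at once. Again, this layout is illustrative rather than the actual SimBricks structure.

```c
#include <stdatomic.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumed cache line size */

/* Each slot occupies exactly one cache line: an unsuccessful poll
 * caches the line locally until the sender's write invalidates it,
 * and a successful poll's single miss also brings in the header. */
typedef struct __attribute__((aligned(CACHE_LINE))) {
    _Atomic uint8_t owner;            /* polled flag, first byte of the line */
    uint8_t msg_type;
    uint16_t len;
    uint64_t timestamp;               /* critical metadata, same line */
    uint8_t payload[CACHE_LINE - 16]; /* remainder of the line */
} aligned_slot_t;

_Static_assert(sizeof(aligned_slot_t) == CACHE_LINE,
               "slot must occupy exactly one cache line");
```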
Teaser: Connecting Simulators Across Physical Machines
This post presented our approach to connecting simulators running on a single physical machine. In an upcoming post, we look forward to discussing how we build on this to efficiently connect simulators running on different physical machines, enabling large-scale simulations.
Until then, if you have questions or would like to learn more: