Debugging a Hanging Simulator
A common symptom of a failing simulation is that it appears to
hang. Debugging this is especially daunting in FireSim because it is not immediately
obvious whether the bug is in the target or somewhere in the host. To make it easier to
identify the problem, the simulation driver includes a polling watchdog that
tracks simulation progress and periodically updates an output file,
heartbeat.csv, with a target cycle count and a timestamp. When debugging
these issues, we encourage using metasimulation to try to
reproduce the failure where possible. The three common cases are outlined
below.
Case 1: Target hang.
Symptoms: There is no output from the target (e.g., the uartlog
might cease), but simulated time continues to advance (heartbeat.csv will
be periodically updated). Simulator instrumentation (TracerV, printf) may
continue to produce new output.
Causes: Typically, a bug in the target RTL. However, bridge bugs leading to erroneous token values will also produce this behavior.
Next steps: You can deploy the full suite of FireSim’s debugging tools for failures of this nature, since assertion synthesis, printf synthesis, and other target-side features still function. Assume there is a bug in the target RTL and trace back the failure to a bridge if applicable.
Case 2: Simulator hang due to FPGA-side token starvation.
Symptoms: The driver’s main loop spins freely, as no bridge gets new work to do. As a result, the polling interval quickly elapses and the simulation is torn down due to a lack of forward progress.
Causes: Generally, a bug in a bridge implementation (e.g., the BridgeModule has accidentally dequeued a token without producing a new output token, or the BridgeModule is waiting on a driver interaction that never occurs).
Next steps: These are the trickiest to solve. Try to identify the bridge that’s responsible by removing unnecessary ones, using an AutoILA, and adding printfs to BridgeDriver sources. Target-side debugging utilities may be used to identify problematic target behavior, but tend not to be useful for identifying the root cause.
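To make the first cause concrete, here is a toy model of token exchange between a target and a single bridge. It is only a sketch and does not use any FireSim code: the function name, the queues, and the choice of "cycle 3" for the dropped token are all hypothetical. It shows how a single dequeued-but-unanswered token is enough to stop the target from ever advancing again.

```python
from collections import deque


def run_toy_simulation(buggy_bridge, max_host_iterations=20):
    """Toy model of token exchange between a target and one bridge.

    The target advances one cycle only when an input token from the bridge is
    available; for each cycle it advances, it emits one output token.
    Returns the number of target cycles completed.
    """
    to_target = deque([0])   # tokens from bridge to target (one initial token)
    from_target = deque()    # tokens from target to bridge
    target_cycle = 0

    for _ in range(max_host_iterations):
        # Target side: consume one input token, advance, emit one output token.
        if to_target:
            to_target.popleft()
            target_cycle += 1
            from_target.append(target_cycle)

        # Bridge side: dequeue an output token and reply with a new input token.
        if from_target:
            token = from_target.popleft()
            # Hypothetical bug: the token dequeued at cycle 3 gets no reply.
            if not (buggy_bridge and token == 3):
                to_target.append(token)

    return target_cycle


if __name__ == "__main__":
    print("healthy bridge:", run_toy_simulation(buggy_bridge=False), "cycles")
    print("buggy bridge:  ", run_toy_simulation(buggy_bridge=True), "cycles")
```

In the buggy run both queues end up empty, so every later host iteration finds no work to do. This mirrors the freely spinning main loop described in the symptoms.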
Case 3: Simulator hang due to driver-side deadlock.
Symptoms: All output is lost; notably, heartbeat.csv ceases to be updated.
Causes: Generally, a bridge driver bug. For example, the driver may be busy waiting on some output from the FPGA, but the FPGA-hosted part of the simulator has stalled waiting for tokens.
Next steps: Identify the buggy driver by adding printfs or by attaching to the running simulator with GDB.
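The three cases above can be roughly told apart by looking at heartbeat.csv, as in the following sketch. It rests on several assumptions that are not stated in this document: that the first column of each row is the target cycle count, that any header row is non-numeric, and that 60 seconds without an update counts as "stale". The helper names and the driver_running flag (which you would determine yourself, for example with ps) are hypothetical. Treat the result as a hint for where to look first, not as a diagnosis.

```python
import csv
import os
import time


def read_cycles(path):
    """Return the target cycle counts recorded in a heartbeat file.

    Assumes the cycle count is the first column; rows whose first field is
    not an integer (blank lines, a header) are skipped.
    """
    cycles = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            try:
                cycles.append(int(row[0].strip()))
            except (IndexError, ValueError):
                continue
    return cycles


def triage(path, driver_running, stale_after_s=60.0, now=None):
    """Suggest which hang case to investigate first."""
    now = time.time() if now is None else now
    cycles = read_cycles(path)
    stale = (now - os.path.getmtime(path)) > stale_after_s

    if not stale and len(cycles) >= 2 and cycles[-1] > cycles[-2]:
        # Simulated time still advances; if target output has stopped,
        # suspect the target RTL.
        return "Case 1 candidate: heartbeat is advancing"
    if stale and driver_running:
        # The driver is alive but no longer updates the file.
        return "Case 3 candidate: heartbeat stopped while the driver is running"
    if stale:
        # The driver has exited; check its log for a lack-of-progress teardown.
        return "Case 2 candidate: driver exited and heartbeat is no longer updated"
    return "Inconclusive: wait for more heartbeat samples"
```

A call such as triage("heartbeat.csv", driver_running=True) is meant to be repeated while the simulation is live; the now parameter exists only so that the staleness check can be exercised deterministically.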
Simulator Heartbeat PlusArgs
+heartbeat-polling-interval=<int>: Specifies the number of round trips through
the simulator main loop before polling the FPGA’s target cycle counter. Disable
the heartbeat by setting this to -1.
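The meaning of the polling interval can be illustrated with a small model of a watchdog that is ticked once per trip through a main loop. This is a sketch of the idea only, not FireSim's driver code: the class name, the read_target_cycle callback, and the use of an exception to signal lack of progress are assumptions made for illustration.

```python
class HeartbeatWatchdog:
    """Toy model of a polling watchdog driven from a simulator main loop."""

    def __init__(self, polling_interval, read_target_cycle):
        self.polling_interval = polling_interval    # -1 disables the heartbeat
        self.read_target_cycle = read_target_cycle  # hypothetical callback
        self.trips = 0          # main-loop trips since the last poll
        self.last_cycle = None  # cycle count seen at the previous poll
        self.samples = []       # stands in for rows appended to heartbeat.csv

    def tick(self):
        """Call once per trip through the main loop."""
        if self.polling_interval == -1:
            return
        self.trips += 1
        if self.trips < self.polling_interval:
            return
        self.trips = 0
        cycle = self.read_target_cycle()
        if self.last_cycle is not None and cycle == self.last_cycle:
            # No progress between two consecutive polls.
            raise RuntimeError(f"no forward progress since target cycle {cycle}")
        self.last_cycle = cycle
        self.samples.append(cycle)
```

In this model, a smaller interval detects a stall after fewer trips through the loop but reads the cycle counter more often. When no bridge has work to do, the loop spins quickly and the interval elapses in very little wall-clock time, which is why a starved simulation is torn down promptly.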