GainSight with Accel-Sim GPU Simulator Backend
A fork of the Accel-Sim simulator provides cycle-accurate simulation of NVIDIA GPU workloads. The fork is modified to dump more verbose memory access traces at the interfaces between the streaming multiprocessor (SM) and the two levels of local and shared memory caches.
There are two main ways to interact with the Accel-Sim backend for profiling GPU workloads:
- Direct execution via the Python script `accel_sim.py`
- Automated execution through provided shell scripts (with Slurm integration) that wrap the Python script
The Python Script: accel_sim.py
The main entry point for profiling with the GPU simulation backend is backend/python-scripts/accel_sim.py.
Command-Line Arguments
The main script, accel_sim.py, supports the following arguments:
| Argument | Description | Passed By Slurm Scripts |
|---|---|---|
| `program` | The program to profile | All |
| `args` | Arguments to pass to the program | All |
| `--sample` | Run with kernel sampling using PKS | `gpu_trace.sh`, `gpu_replay.sh` |
| `--arch` | Architecture to simulate (default: "SM90_H100") | All (optional) |
| `--delete` | Delete the traces directory before running | All (optional) |
| `--verbose` | Store verbose output from Accel-Sim | All (optional) |
| `--histogram` | Plot histograms of the simulation results | All (optional) |
| `--trace-only` | Only generate traces, do not simulate | `gpu_trace.sh` |
| `--replay-only` | Only run simulation on existing traces | `gpu_replay.sh` |
| `--post-process` | Only run post-processing on simulation output | All (optional) |
| `--no-write-allocate` | Disable write allocate for the cache | All (optional) |
| `--ncu-file` | Use a specific Nsight Compute report for sampling | All (optional) |
| `--rename` | Specify a custom name for output files | All (optional) |
| `--sample-delete` | Delete traces that are not used for sampling | All (optional) |
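As a rough illustration of the interface in the table, the sketch below declares the same options with `argparse`. It is not the actual source of `accel_sim.py`: which options are boolean switches and which take a value is inferred from the descriptions, and `./vecadd` is a hypothetical program.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the documented options (not the real source)."""
    parser = argparse.ArgumentParser(prog="accel_sim.py")
    parser.add_argument("program", help="The program to profile")
    parser.add_argument("args", nargs="*", help="Arguments to pass to the program")
    parser.add_argument("--arch", default="SM90_H100", help="Architecture to simulate")
    # Assumption: these two options take a value.
    parser.add_argument("--ncu-file", help="Nsight Compute report to use for sampling")
    parser.add_argument("--rename", help="Custom name for output files")
    # Assumption: every remaining documented option is a boolean switch.
    for flag in ("--sample", "--delete", "--verbose", "--histogram", "--trace-only",
                 "--replay-only", "--post-process", "--no-write-allocate",
                 "--sample-delete"):
        parser.add_argument(flag, action="store_true")
    return parser


# Hypothetical invocation: trace a program called ./vecadd with one argument.
opts = build_parser().parse_args(["--trace-only", "--sample", "./vecadd", "1024"])
print(opts.program, opts.args, opts.arch, opts.trace_only, opts.replay_only)
```

Options that are not given fall back to their defaults, so `opts.arch` is `"SM90_H100"` here, matching the default listed in the table.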
Profiling Workflow
The entire profiling process can also be run in a single command, which handles trace generation, kernel sampling, and simulation automatically. The workflow consists of the following stages:
- Trace Generation (`--trace-only`): The target program runs with the NVBit tracer (`tracer_tool.so`) to capture SASS instructions. If sampling (`--sample`) is enabled, Nsight Compute (`nsight_nvbit.py`) may run first to gather kernel statistics.

  ```bash
  python3 backend/python-scripts/accel_sim.py --trace-only --sample [options] <program> <args>
  ```

- Optional Sampling (`--sample`): If `--sample` is used during trace generation, Principal Kernel Selection (`pks.py`) runs after tracing (or after Nsight Compute if `--ncu-file` wasn't provided) to select representative kernels based on runtime statistics. It modifies `kernelslist.g` to include only selected kernels. The `--sample` flag must also be passed during replay.

- Simulation (`--replay-only`): Accel-Sim (`accel-sim.out`) processes the traces (`traces/kernelslist.g`) to simulate detailed memory behavior according to the specified architecture (`--arch`) and cache configuration (`--no-write-allocate`).

  ```bash
  python3 backend/python-scripts/accel_sim.py --replay-only --sample [options] <program> <args>
  ```

- Post-processing (automatic after simulation, or via `--post-process`): The simulation output logs are parsed (`accel_sim_parser.py`) and analyzed by the analytical frontend (`frontend/gain_cell_frontend.py`) to extract statistics and generate summary files (CSVs, JSON).

  ```bash
  # Usually runs automatically after the replay step
  # Manual post-processing (example):
  python3 backend/python-scripts/accel_sim.py --post-process logs/<program>/<program>_<timestamp>.sim_cache.log
  ```
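Because `--sample` has to appear in both the trace and the replay invocation, it can help to build the two command lines from one place. The following is a minimal sketch of that idea; `stage_commands` and the `./vecadd` program are hypothetical names used only for illustration, and the script path is assumed to be relative to the project root.

```python
SCRIPT = "backend/python-scripts/accel_sim.py"


def stage_commands(program, program_args=(), sample=False, extra=()):
    """Return the argv lists for the trace stage and the replay stage."""
    shared = list(extra)
    if sample:
        # The sampling flag has to be present in both stages.
        shared.append("--sample")
    tail = [program, *program_args]
    trace = ["python3", SCRIPT, "--trace-only", *shared, *tail]
    replay = ["python3", SCRIPT, "--replay-only", *shared, *tail]
    return trace, replay


# Hypothetical workload: ./vecadd with a single argument.
trace_cmd, replay_cmd = stage_commands("./vecadd", ["1024"], sample=True)
print(" ".join(trace_cmd))
print(" ".join(replay_cmd))
```

Each list can then be passed to `subprocess.run(cmd, check=True)`, trace first and replay second.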
The Automated Slurm-Compatible Scripts
For cluster environments using the Slurm workload manager, we provide scripts that handle job submission with appropriate resource requests and all necessary arguments for accel_sim.py.
You can also run the scripts directly using bash instead of sbatch.
To run a full profiling workflow, you need to run the gpu_trace.sh script first, followed by the gpu_replay.sh script. The scripts will automatically handle the necessary arguments and resource allocation.
One example of a complete workflow is provided in the `scripts/pipe_cleaner.sh` script, which performs the following steps:
- Cleans previous traces for the example workload.
- Generates traces using `gpu_trace.sh` (submits to Slurm if `sbatch` is available).
- Runs simulation and post-processing using `gpu_replay.sh`.
- Cleans up intermediate output files from the example workload.
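The "use `sbatch` if available, otherwise `bash`" behavior described above can be sketched in a few lines of Python. This is only an illustration of the selection logic, not the contents of `pipe_cleaner.sh`; `launch_command` is a hypothetical helper, and the positional `<program>` and `<args>` placeholders are an assumption about how the scripts receive the workload.

```python
import shutil


def launch_command(script, script_args=(), which=shutil.which):
    """Prefix a script with sbatch when Slurm is installed, otherwise bash."""
    launcher = "sbatch" if which("sbatch") else "bash"
    return [launcher, script, *script_args]


# Trace first, then replay, as the workflow requires.
for script in ("scripts/gpu_trace.sh", "scripts/gpu_replay.sh"):
    print(" ".join(launch_command(script, ["<program>", "<args>"])))
```

Note that `sbatch` returns as soon as the job is queued, so a real pipeline has to wait for the trace job to finish before starting the replay job.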
Output Files
All outputs are saved under $PROJECT_ROOT/logs/<program_name>/ (or $PROJECT_ROOT/logs/<rename_value>/ if --rename is used):
- `traces/`: Directory containing generated SASS traces (`*.traceg`)
- `traces/kernelslist.g`: List of kernels to be simulated (original or sampled)
- `traces/kernelslist.old.g`: Original list of kernels (if sampling was performed)
- `<name>_<timestamp>.ncu.out`: Nsight Compute raw output (if sampling without `--ncu-file`)
- `<name>_<timestamp>.ncu.csv`: Nsight Compute processed statistics (if sampling without `--ncu-file`)
- `<name>_<timestamp>.pks.log`: Principal Kernel Selection log (if sampling)
- `<name>_<timestamp>.sim_cache.log`: Detailed cache event log from simulation
- `<name>_<timestamp>.sim.log`: Main simulator output log
- `<name>_<timestamp>.sim_verbose.log`: Verbose simulator output (if `--verbose`)
- `<name>_<timestamp>.sim.csv`: Kernel-level simulation statistics summary
- `<name>_<timestamp>.sim_l1.csv`: L1 cache lifetime statistics
- `<name>_<timestamp>.sim_l2.csv`: L2 cache lifetime statistics
- `<name>_<timestamp>.sim_l1.png`: L1 cache lifetime histogram (if `--histogram`)
- `<name>_<timestamp>.sim_l2.png`: L2 cache lifetime histogram (if `--histogram`)
- `<name>_<timestamp>.frontend.json`: Analytical frontend JSON output
- `<name>_<timestamp>.postprocess.log`: Frontend processing log
- `<name>_<timestamp>.sim_cmd.log`: Record of the program and arguments used
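Since every run adds another set of timestamped files, a small helper for picking the newest file of a given kind can be handy. The sketch below relies only on the `<name>_<timestamp>.<suffix>` pattern listed above. Because the timestamp format is not specified here, it orders files by modification time instead of parsing the timestamp. `latest_output` is a hypothetical helper, not part of the project.

```python
from pathlib import Path


def latest_output(log_dir, name, suffix):
    """Return the most recently modified `<name>_<timestamp>.<suffix>` file, or None."""
    matches = sorted(
        Path(log_dir).glob(f"{name}_*.{suffix}"),
        key=lambda path: path.stat().st_mtime,  # oldest first, newest last
    )
    return matches[-1] if matches else None
```

For example, `latest_output("logs/vecadd", "vecadd", "sim_l1.csv")` would return the newest L1 lifetime CSV for a hypothetical `vecadd` workload, or `None` if no such file exists.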
Implementation Details
The complete workflow is orchestrated by multiple components:
- `scripts/gpu_trace.sh`, `scripts/gpu_replay.sh`: Slurm submission scripts.
- `scripts/pipe_cleaner.sh`: Example end-to-end workflow script.
- `backend/python-scripts/accel_sim.py`: Main Python driver script coordinating all steps.
  - Calls `NsightNVBitRunner` (`nsight_nvbit.py`) for kernel statistics if sampling.
  - Calls NVBit tracer (`tracer_tool.so`) via subprocess.
  - Calls `PrincipalKernelSelector` (`pks.py`) for kernel sampling.
  - Calls Accel-Sim simulator (`accel-sim.out`) via subprocess.
  - Calls `accel_sim_parser.py` for parsing simulation logs.
  - Calls `gain_cell_frontend.py` for analytical post-processing.
- `backend/accel-sim/`: Contains the modified Accel-Sim simulator source and build.
- `frontend/gain_cell_frontend.py`: Performs analytical assessment of memory behavior from parsed simulation data.
The GPU simulation backend is particularly useful for:
- Detailed memory access pattern analysis at the SASS instruction level.
- Cache line lifetime profiling for L1 and L2 caches.
- Simulating different cache configurations (e.g., write-allocate policy).
- Identifying optimization opportunities for memory subsystems without needing hardware access during the long simulation phase.
See the source code and documentation in backend/python-scripts/ and frontend/ directories for implementation details.
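To make the write-allocate setting concrete, here is a toy model of what the policy changes. It is deliberately simplistic (an unbounded cache that tracks only which lines are present) and is not how Accel-Sim models its caches. It only shows why toggling `--no-write-allocate` can change hit counts.

```python
def count_hits(accesses, write_allocate=True):
    """Count hits for (op, line) accesses in an unbounded toy cache.

    op is "R" for a read or "W" for a write; line is a cache line address.
    """
    cached = set()
    hits = 0
    for op, line in accesses:
        if line in cached:
            hits += 1
        elif op == "R" or write_allocate:
            # Read misses always fill the line; write misses fill it
            # only under a write-allocate policy.
            cached.add(line)
    return hits


trace = [("W", 0), ("R", 0), ("R", 1), ("W", 1)]
print(count_hits(trace, write_allocate=True))   # 2
print(count_hits(trace, write_allocate=False))  # 1
```

With write allocation, the initial write to line 0 brings the line into the cache, so the following read hits. Without it, that read misses and fills the line itself.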
Troubleshooting
If you encounter issues with the profiling process, consider the following:
- Deleting stale traces:
  - Use the `--delete` option to remove old traces before running a new profiling session.
  - Ensure that the `logs/<program_name>/traces/` directory is empty before starting a new run.
- Deleting old simulation results:
  - Use the `--delete` option to remove old simulation results before running a new profiling session.
  - Use `rm -rf logs/<program_name>` to delete all logs for a specific program.
- Other potential fixes:
  - Ensure that the `/tmp` directory has permissions set to `777` for the NVBit tracer to work correctly.
  - Ensure that there is enough space in the `/logs` directory for the generated traces and simulation results.
  - Ensure that you have a compatible NVIDIA GPU with no MPS server or other blocking processes running.
  - Use the `--verbose` option to enable verbose output from the simulator for debugging purposes.
  - If you encounter issues with the `--sample` option, ensure that Nsight Compute is installed and properly configured on your system. The `--ncu-file` option allows you to specify a custom Nsight Compute report for sampling.
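Two of the checks above, the `/tmp` permissions and the free disk space, are easy to automate before a long run. The helpers below are a minimal sketch for Linux using only the Python standard library. The function names are hypothetical, and the amount of free space to require is left to the caller.

```python
import os
import shutil
import stat


def tmp_is_world_writable(path="/tmp"):
    """True when the rwx bits are set for user, group, and others (mode 777)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # Mask out the sticky/setuid bits so that 1777 also counts as 777.
    return (mode & 0o777) == 0o777


def has_free_space(path, min_free_bytes):
    """True when the filesystem holding `path` has at least `min_free_bytes` free."""
    return shutil.disk_usage(path).free >= min_free_bytes
```

For instance, `tmp_is_world_writable()` checks `/tmp`, and `has_free_space("logs", 10 * 1024**3)` checks for 10 GiB under an existing `logs` directory. That threshold is arbitrary and should be sized to the expected traces.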
Further Information
For more details on the Accel-Sim simulator itself, refer to the original authors of the simulator and the following resources:
- The Accel-Sim Framework repository
- The GPGPU-Sim Distribution repository