GainSight with Accel-Sim GPU Simulator Backend

A fork of the Accel-Sim simulator provides cycle-accurate simulation of NVIDIA GPU workloads. The fork is modified to dump more verbose memory access traces at the interfaces between the streaming multiprocessor (SM) and the two levels of cache (L1 and L2).

There are two main ways to interact with the Accel-Sim backend for profiling GPU workloads:

Direct execution via the Python script accel_sim.py
Automated execution through provided shell scripts (with Slurm integration) that wrap the Python script

The Python Script: `accel_sim.py`

The main entry point for profiling with the GPU simulation backend is backend/python-scripts/accel_sim.py.

# Ensure PROJECT_ROOT is set and conda env is activated
# source setup.sh  # only run outside of Docker container
conda activate gainsight
python3 backend/python-scripts/accel_sim.py [options] <program> <args>

Command-Line Arguments

The main script, accel_sim.py, supports the following arguments:

Argument	Description	Passed By Slurm Scripts
`program`	The program to profile	All
`args`	Arguments to pass to the program	All
`--sample`	Run with kernel sampling using PKS	`gpu_trace.sh`, `gpu_replay.sh`
`--arch`	Architecture to simulate (default: "SM90_H100")	All (optional)
`--delete`	Delete the traces directory before running	All (optional)
`--verbose`	Store verbose output from Accel-Sim	All (optional)
`--histogram`	Plot histograms of the simulation results	All (optional)
`--trace-only`	Only generate traces, do not simulate	`gpu_trace.sh`
`--replay-only`	Only run simulation on existing traces	`gpu_replay.sh`
`--post-process`	Only run post-processing on simulation output	All (optional)
`--no-write-allocate`	Disable write allocate for the cache	All (optional)
`--ncu-file`	Use a specific Nsight Compute report for sampling	All (optional)
`--rename`	Specify a custom name for output files	All (optional)
`--sample-delete`	Delete traces that are not used for sampling	All (optional)
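
The flags in the table could be declared with Python's argparse roughly as follows. This is an illustrative sketch, not the parser used in accel_sim.py: the `build_parser` name, the option types, and the `./my_app` workload are assumptions, and only the flag names and the default architecture come from the table above.

```python
import argparse

# Boolean switches listed in the table (assumed here to take no value).
SWITCHES = (
    "--sample", "--delete", "--verbose", "--histogram", "--trace-only",
    "--replay-only", "--post-process", "--no-write-allocate", "--sample-delete",
)


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(prog="accel_sim.py")
    parser.add_argument("--arch", default="SM90_H100")  # default from the table
    parser.add_argument("--ncu-file", default=None)
    parser.add_argument("--rename", default=None)
    for switch in SWITCHES:
        parser.add_argument(switch, action="store_true")
    # Everything after the program name is forwarded to the program untouched.
    parser.add_argument("program")
    parser.add_argument("args", nargs=argparse.REMAINDER)
    return parser


if __name__ == "__main__":
    # "./my_app" and its arguments are placeholders for a real workload.
    ns = build_parser().parse_args(["--sample", "./my_app", "--size", "1024"])
    print(ns.program, ns.args, ns.arch)
```

Using `argparse.REMAINDER` for `args` keeps options meant for the profiled program (such as `--size` here) from being interpreted as profiler options, provided they appear after the program name.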

Profiling Workflow

The entire profiling process can be run with a single command that handles all the necessary steps. For example:

python3 backend/python-scripts/accel_sim.py --sample <program> <args>

This command runs trace generation, kernel sampling, simulation, and post-processing in sequence. The individual stages, which can also be run separately, are:

Trace Generation (--trace-only): The target program runs with the NVBit tracer (tracer_tool.so) to capture SASS instructions. If sampling (--sample) is enabled, Nsight Compute (nsight_nvbit.py) may run first to gather kernel statistics.
python3 backend/python-scripts/accel_sim.py --trace-only --sample [options] <program> <args>
Optional Sampling (--sample): If --sample is used during trace generation, Principal Kernel Selection (pks.py) runs after tracing (or after Nsight Compute if --ncu-file wasn't provided) to select representative kernels based on runtime statistics. It modifies kernelslist.g to include only selected kernels. The --sample flag must also be passed during replay.
Simulation (--replay-only): Accel-Sim (accel-sim.out) processes the traces (traces/kernelslist.g) to simulate detailed memory behavior according to the specified architecture (--arch) and cache configuration (--no-write-allocate).
python3 backend/python-scripts/accel_sim.py --replay-only --sample [options] <program> <args>
Post-processing (Automatic after simulation, or via --post-process): The simulation output logs are parsed (accel_sim_parser.py) and analyzed by the analytical frontend (frontend/gain_cell_frontend.py) to extract statistics and generate summary files (CSVs, JSON).
# Usually runs automatically after replay step
# Manual post-processing (example):
python3 backend/python-scripts/accel_sim.py --post-process logs/<program>/<program>_<timestamp>.sim_cache.log
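
Because --sample has to be given to both the trace step and the replay step, it can help to build the two command lines from one shared list of options. The sketch below does that in Python. `phase_commands` and the `./my_app` workload are hypothetical names, and the function only builds argument lists; it does not run anything.

```python
import shlex

# Script path as given in this README, relative to the project root.
SCRIPT = "backend/python-scripts/accel_sim.py"


def phase_commands(program, args=(), sample=True, extra=()):
    """Return (trace_cmd, replay_cmd) argument lists for the two-step workflow.

    The same options are placed in both commands so that --sample
    (and anything in `extra`) cannot drift between the two steps.
    """
    common = list(extra) + (["--sample"] if sample else [])
    tail = [program, *args]
    trace = ["python3", SCRIPT, "--trace-only", *common, *tail]
    replay = ["python3", SCRIPT, "--replay-only", *common, *tail]
    return trace, replay


if __name__ == "__main__":
    # "./my_app --size 1024" is a placeholder workload.
    for cmd in phase_commands("./my_app", ["--size", "1024"]):
        print(shlex.join(cmd))
```

The returned lists can be passed directly to `subprocess.run` from the project root with the conda environment active.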

The Automated Slurm-Compatible Scripts

For cluster environments using the Slurm workload manager, we provide scripts that handle job submission with appropriate resource requests and all necessary arguments for accel_sim.py.

# Trace generation (requires GPU)
# Submits a Slurm job using scripts/gpu_trace.sh
# The script always passes --trace-only and --sample to accel_sim.py
sbatch scripts/gpu_trace.sh [accel_sim.py options] <program> <args>

# Simulation and analysis (CPU only)
# Submits a Slurm job using scripts/gpu_replay.sh
# The script always passes --replay-only and --sample to accel_sim.py
sbatch scripts/gpu_replay.sh [accel_sim.py options] <program> <args>

You can also run the scripts directly using bash instead of sbatch.

To run a full profiling workflow, run gpu_trace.sh first, then gpu_replay.sh. Both scripts handle the necessary arguments and resource allocation automatically. The scripts/pipe_cleaner.sh script provides an example of a complete workflow. It performs the following steps:

Cleans previous traces for the example workload.
Generates traces using gpu_trace.sh (submits to Slurm if sbatch is available).
Runs simulation and post-processing using gpu_replay.sh.
Cleans up intermediate output files from the example workload.

# Example complete workflow script (runs locally or submits Slurm jobs if sbatch is available)
bash scripts/pipe_cleaner.sh
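
The "sbatch if available, otherwise bash" choice described above can be sketched in Python as follows. `submit_command` is a hypothetical helper, not part of the repository, and it only builds the command line rather than launching it.

```python
import shutil


def submit_command(script, script_args=(), have_sbatch=None):
    """Build the argv for a wrapper script: sbatch when available, else bash.

    `have_sbatch` can be forced to True or False; by default the function
    checks whether an `sbatch` executable is on PATH.
    """
    if have_sbatch is None:
        have_sbatch = shutil.which("sbatch") is not None
    runner = "sbatch" if have_sbatch else "bash"
    return [runner, script, *script_args]


if __name__ == "__main__":
    # "./my_app" is a placeholder workload.
    print(submit_command("scripts/gpu_trace.sh", ["./my_app"]))
    print(submit_command("scripts/gpu_replay.sh", ["./my_app"]))
```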

Output Files

All outputs are saved under $PROJECT_ROOT/logs/<program_name>/ (or $PROJECT_ROOT/logs/<rename_value>/ if --rename is used):

traces/: Directory containing generated SASS traces (*.traceg)
traces/kernelslist.g: List of kernels to be simulated (original or sampled)
traces/kernelslist.old.g: Original list of kernels (if sampling was performed)
<name>_<timestamp>.ncu.out: Nsight Compute raw output (if sampling without --ncu-file)
<name>_<timestamp>.ncu.csv: Nsight Compute processed statistics (if sampling without --ncu-file)
<name>_<timestamp>.pks.log: Principal Kernel Selection log (if sampling)
<name>_<timestamp>.sim_cache.log: Detailed cache event log from simulation
<name>_<timestamp>.sim.log: Main simulator output log
<name>_<timestamp>.sim_verbose.log: Verbose simulator output (if --verbose)
<name>_<timestamp>.sim.csv: Kernel-level simulation statistics summary
<name>_<timestamp>.sim_l1.csv: L1 cache lifetime statistics
<name>_<timestamp>.sim_l2.csv: L2 cache lifetime statistics
<name>_<timestamp>.sim_l1.png: L1 cache lifetime histogram (if --histogram)
<name>_<timestamp>.sim_l2.png: L2 cache lifetime histogram (if --histogram)
<name>_<timestamp>.frontend.json: Analytical frontend JSON output
<name>_<timestamp>.postprocess.log: Frontend processing log
<name>_<timestamp>.sim_cmd.log: Record of the program and arguments used
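
Since each run adds a new set of timestamped files, a small helper can pick out the newest file of a given kind. The sketch below is an assumption-laden example: `latest_output` is a hypothetical name, and because the timestamp format is not specified here, it orders files by modification time rather than by parsing the file name.

```python
from pathlib import Path


def latest_output(log_dir, suffix):
    """Return the most recently modified file in log_dir whose name ends
    with `suffix` (for example ".sim.csv"), or None if there is none."""
    candidates = [
        p for p in Path(log_dir).iterdir()
        if p.is_file() and p.name.endswith(suffix)
    ]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


if __name__ == "__main__":
    # "logs/my_app" is a placeholder for $PROJECT_ROOT/logs/<program_name>.
    log_dir = Path("logs/my_app")
    if log_dir.is_dir():
        print(latest_output(log_dir, ".sim.csv"))
```

Matching on the full suffix keeps `.sim.csv` distinct from `.sim_l1.csv` and `.sim_l2.csv`.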

Implementation Details

The complete workflow is orchestrated by multiple components:

scripts/gpu_trace.sh, scripts/gpu_replay.sh: Slurm submission scripts.
scripts/pipe_cleaner.sh: Example end-to-end workflow script.
backend/python-scripts/accel_sim.py: Main Python driver script coordinating all steps.
- Calls NsightNVBitRunner (nsight_nvbit.py) for kernel statistics if sampling.
- Calls NVBit tracer (tracer_tool.so) via subprocess.
- Calls PrincipalKernelSelector (pks.py) for kernel sampling.
- Calls Accel-Sim simulator (accel-sim.out) via subprocess.
- Calls accel_sim_parser.py for parsing simulation logs.
- Calls gain_cell_frontend.py for analytical post-processing.
backend/accel-sim/: Contains the modified Accel-Sim simulator source and build.
frontend/gain_cell_frontend.py: Performs analytical assessment of memory behavior from parsed simulation data.
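
To make the relationship between flags and components concrete, the sketch below lists which stages a given flag combination would trigger, following the workflow description in this document. It is illustrative only: `planned_stages`, the stage labels, and the precedence between mutually exclusive flags are assumptions, not code from accel_sim.py.

```python
def planned_stages(trace_only=False, replay_only=False, post_process=False,
                   sample=False, ncu_file=None):
    """List the pipeline stages implied by a flag combination."""
    tracing = []
    if sample and ncu_file is None:
        # Without --ncu-file, Nsight Compute gathers kernel statistics first.
        tracing.append("nsight_compute")
    tracing.append("nvbit_trace")
    if sample:
        # Principal Kernel Selection trims kernelslist.g after tracing.
        tracing.append("principal_kernel_selection")

    analysis = ["parse_logs", "frontend"]
    if post_process:
        return analysis
    if trace_only:
        return tracing
    if replay_only:
        return ["simulate"] + analysis
    return tracing + ["simulate"] + analysis


if __name__ == "__main__":
    print(planned_stages(sample=True))
```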

The GPU simulation backend is particularly useful for:

Detailed memory access pattern analysis at the SASS instruction level.
Cache line lifetime profiling for L1 and L2 caches.
Simulating different cache configurations (e.g., write-allocate policy).
Identifying optimization opportunities for memory subsystems without needing hardware access during the long simulation phase.

See the source code and documentation in the backend/python-scripts/ and frontend/ directories for implementation details.

Troubleshooting

If you encounter issues with the profiling process, consider the following:

Deleting stale traces:
- Use the --delete option to remove old traces before running a new profiling session.
- Ensure that the logs/<program_name>/traces/ directory is empty before starting a new run.
Deleting old simulation results:
- Use the --delete option to remove old simulation results before running a new profiling session.
- Use rm -rf logs/<program_name> to delete all logs for a specific program.
Other potential fixes:
- Ensure that the /tmp directory has permissions set to 777 for the NVBit tracer to work correctly.
- Ensure that there is enough space in the $PROJECT_ROOT/logs directory for the generated traces and simulation results.
- Ensure that you have a compatible NVIDIA GPU with no MPS server or other blocking processes running.
- Use the --verbose option to enable verbose output from the simulator for debugging purposes.
- If you encounter issues with the --sample option, ensure that Nsight Compute is installed and properly configured on your system. Alternatively, use the --ncu-file option to supply an existing Nsight Compute report for sampling.
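
Some of these checks can be automated before a long run. The following is a minimal sketch under stated assumptions: `preflight` is a hypothetical helper, the 10 GB free-space threshold is a placeholder rather than a measured requirement, and `ncu` is assumed to be the Nsight Compute command-line executable on PATH.

```python
import os
import shutil
import stat


def preflight(tmp_dir="/tmp", log_dir=".", min_free_gb=10.0, need_ncu=False):
    """Return a list of warnings; an empty list means every check passed."""
    problems = []

    # The NVBit tracer needs read/write/execute on tmp_dir for all users.
    mode = stat.S_IMODE(os.stat(tmp_dir).st_mode)
    if (mode & 0o777) != 0o777:
        problems.append(f"{tmp_dir} has mode {oct(mode)}; expected 777")

    # Traces can be large, so check free space where logs are written.
    free_gb = shutil.disk_usage(log_dir).free / 1e9
    if free_gb < min_free_gb:
        problems.append(f"only {free_gb:.1f} GB free in {log_dir}")

    # Sampling without --ncu-file relies on Nsight Compute being installed.
    if need_ncu and shutil.which("ncu") is None:
        problems.append("Nsight Compute CLI (ncu) not found on PATH")

    return problems


if __name__ == "__main__":
    for warning in preflight(need_ncu=True):
        print("WARNING:", warning)
```

Pass the logs directory as `log_dir` so the free-space check targets the filesystem that will actually hold the traces.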

Further Information

For more details on the Accel-Sim simulator itself, refer to the original authors of the simulator and the following resources:
- The Accel-Sim Framework repository
- The GPGPU-Sim Distribution repository