Kernel Sampling for Efficient Backend Simulation in GainSight

Simulating every kernel in large GPU workloads can be computationally prohibitive, especially for detailed, cycle-accurate studies. To address this, GainSight employs a kernel sampling strategy based on Principal Kernel Selection (PKS), adapted from Principal Kernel Analysis (PKA) by Avalos Baddouh et al. (MICRO 2021).

Principal Kernel Selection (PKS) Methodology

Principal Kernel Selection reduces simulation cost by grouping similar kernels and simulating only a small, representative subset. The process can be summarized as follows:

  1. Profiling: Each application is profiled on real hardware using NVIDIA Nsight Compute, collecting microarchitecture-agnostic features such as global loads, stores, and atomic operations. These metrics are chosen to be robust across GPU generations.
  2. Dimensionality Reduction: Principal Component Analysis (PCA) is applied to the collected metrics to reduce dimensionality, focusing on the principal axes of variance.
  3. Clustering: K-means clustering is performed on the PCA-transformed data to group similar kernels. The number of clusters (K) is swept (typically 1–20), and for each K, the error in projected total runtime (or another key metric) is computed. The smallest K that meets a user-specified error threshold (e.g., 5%) is selected, balancing accuracy and simulation speedup.
  4. Representative Selection: For each cluster, a single representative kernel is chosen. The original PKA work found that selecting the first chronological kernel or the one closest to the cluster centroid yields similar accuracy; GainSight selects the kernel closest to the centroid. The representative's runtime is multiplied by the number of kernels in its cluster, and the per-cluster products are summed to project the total runtime of the workload.
  5. Simulation: Only the representative kernels are simulated in detail, with their results scaled to represent the full workload. This can reduce the number of simulated kernels from thousands to a handful, with minimal error in key metrics.
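Steps 3 and 4 above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GainSight implementation. The function names are hypothetical, the feature vectors are assumed to be PCA-reduced already, and a tiny Lloyd's k-means with farthest-point initialization stands in for a library clustering routine so that the sketch needs only the standard library. Runtimes are assumed to be positive and the kernel list non-empty.

```python
def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50):
    """Tiny Lloyd's k-means; farthest-point init keeps it deterministic."""
    centroids = [tuple(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(dist2(p, c) for c in centroids))
        centroids.append(tuple(far))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(p, centroids[j]))
                  for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(c) / len(members)
                                     for c in zip(*members)))
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


def representatives(points, labels, centroids):
    """Map each non-empty cluster to the index of the kernel nearest its centroid."""
    reps = {}
    for j, c in enumerate(centroids):
        members = [i for i, lab in enumerate(labels) if lab == j]
        if members:
            reps[j] = min(members, key=lambda i: dist2(points[i], c))
    return reps


def projected_total(runtimes, labels, reps):
    """Scale each representative's runtime by the size of its cluster."""
    return sum(runtimes[rep] * labels.count(j) for j, rep in reps.items())


def select_principal_kernels(points, runtimes, max_k=20, threshold=0.05):
    """Sweep K upward and stop at the first K whose projection error is small enough.

    If no K meets the threshold, the result for the largest K tried is returned.
    """
    actual = sum(runtimes)
    for k in range(1, min(max_k, len(points)) + 1):
        labels, centroids = kmeans(points, k)
        reps = representatives(points, labels, centroids)
        error = abs(projected_total(runtimes, labels, reps) - actual) / actual
        if error <= threshold:
            break
    return k, sorted(reps.values()), error


# Toy workload: three short kernels and two long ones, in a 2-D reduced space.
points = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05), (5.0, 5.1), (5.1, 5.0)]
runtimes = [10.0, 10.0, 10.0, 100.0, 100.0]
k, reps, error = select_principal_kernels(points, runtimes)
print(k, reps, error)
```

On this toy input a single cluster badly underestimates the total runtime, while two clusters reproduce it, so the sweep stops at K = 2 with one short and one long kernel as representatives.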

PKS is scalable to large workloads and robust across GPU generations. It enables over 100× speedup in simulation for large AI workloads, while maintaining accuracy for memory and performance metrics. The method is also extensible to two-level profiling, where only a subset of kernels is profiled in detail and the rest are classified using lightweight features and machine learning.

Implementation in GainSight (pks.py and Integration)

The file gainsight/backend/python-scripts/pks.py implements the PKS algorithm as follows:

  • Metric Loading: Reads a configuration file specifying which kernel metrics to use.
  • Data Preparation: Loads kernel profiling data from an Nsight Compute report and constructs a DataFrame of metrics for all kernels.
  • PCA and Clustering: Performs PCA on the metrics and applies K-means clustering to group kernels. The number of clusters is automatically selected by minimizing error in a key metric (e.g., L2 write count), similar to the original PKA approach.
  • Centroid Identification: For each cluster, identifies the kernel closest to the centroid in PCA space as the representative.
  • Output Generation: Produces a list of selected kernels (kernelslist.g) and CSV files summarizing the clustering and selection results. Optionally, non-representative kernel trace files can be deleted to save space.
  • Bypass for Small Workloads: If the workload contains few kernels, all are selected without clustering.
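The bypass and output-generation steps listed above might look roughly like the following sketch. Everything here is illustrative and not taken from pks.py: the threshold value, the function names, and the CSV column names are assumptions, and the clustering routine is passed in as a callable so the sketch stays self-contained.

```python
import csv
import io

# Hypothetical cutoff; the real threshold used by pks.py is not stated here.
MIN_KERNELS_FOR_CLUSTERING = 20


def choose_kernels(kernel_names, cluster_fn,
                   min_kernels=MIN_KERNELS_FOR_CLUSTERING):
    """Return (labels, representative indices), bypassing clustering if small.

    cluster_fn is any callable that takes the kernel names and returns
    (labels, representative indices); it is only called for large workloads.
    """
    n = len(kernel_names)
    if n < min_kernels:
        # Bypass: every kernel forms its own cluster and represents itself.
        return list(range(n)), list(range(n))
    return cluster_fn(kernel_names)


def summary_csv(kernel_names, labels, reps):
    """Build a CSV summary of the clustering and selection as a string."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["kernel", "cluster", "cluster_size", "is_representative"])
    rep_set = set(reps)
    for i, (name, label) in enumerate(zip(kernel_names, labels)):
        writer.writerow([name, label, labels.count(label), int(i in rep_set)])
    return out.getvalue()


# A three-kernel workload falls below the cutoff, so all kernels are kept.
names = ["kernel_a", "kernel_b", "kernel_c"]
labels, reps = choose_kernels(names, cluster_fn=None)
print(summary_csv(names, labels, reps))
```

Returning the same (labels, representatives) shape from both paths means the downstream output code does not need to know whether clustering actually ran.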

Integration with the simulation backend is handled in accel_sim.py, which invokes PKS as part of the simulation workflow. When kernel sampling is enabled, accel_sim.py runs Nsight Compute to generate a profiling report, then calls pks.py to select representative kernels and update the trace list (kernelslist.g). Only these kernels are then simulated by Accel-Sim, greatly reducing simulation time while preserving accuracy for key metrics such as memory lifetimes and write frequencies.

This implementation closely follows the methodology of Avalos Baddouh et al. (MICRO 2021), enabling scalable and accurate simulation of large GPU workloads in the GainSight framework.

References

Cesar Avalos Baddouh, Mahmoud Khairy, Roland N. Green, Mathias Payer, and Timothy G. Rogers. 2021. Principal Kernel Analysis: A Tractable Methodology to Simulate Scaled GPU Workloads. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '21). Association for Computing Machinery, New York, NY, USA, 724–737. https://doi.org/10.1145/3466752.3480100