GainSight Accel-Sim Backend Python API Documentation

This page documents the Python scripts in gainsight/backend/python-scripts/ that support the Accel-Sim backend, generated with mkdocstrings. Refer to the backend wiki for a summary of implementation details and usage instructions.


Accel-Sim Runner Script

The accel_sim.py script is the main entry point for running the Accel-Sim simulator. It handles command-line arguments, initializes the simulation environment, and manages the execution of the simulation.

Simulation-based, cache line-level analysis of GPU programs using Accel-Sim.

This script runs the Accel-Sim simulator on a given program with the specified arguments: it first generates SASS traces using the NVBit tracer and then runs the simulator on those traces. It also offers kernel sampling via principal kernel selection (PKS) to reduce the number of kernels in the traces.

TODO: Output redirection to log files, error handling, and more detailed documentation.

Usage

python3 accel_sim.py <program> [args ...] [--sample] [--arch <arch>] [--delete] [--verbose]

Pre-requisites
  • The program must exist and be executable.
  • The program must be run within its desired working directory.
  • The Accel-Sim tools must be compiled and available in the environment.
  • The PROJECT_ROOT and CUDA_INSTALL_PATH environment variables must be set to the project root directory and the CUDA installation path, respectively.
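These preconditions can be sanity-checked up front. The following is a minimal, hypothetical helper (not part of accel_sim.py) that uses only the standard library; the launcher names and environment variable names are taken from the prerequisites above:

```python
import os
import shutil

def check_prerequisites(program):
    """Return a list of human-readable problems; an empty list means ready.

    Hypothetical helper for illustration; not part of accel_sim.py.
    """
    problems = []
    # Interpreter launchers are resolved via PATH; anything else
    # must be an existing executable file.
    if program in ("python", "python3", "torchrun"):
        if shutil.which(program) is None:
            problems.append(f"{program} not found on PATH")
    elif not (os.path.isfile(program) and os.access(program, os.X_OK)):
        problems.append(f"{program} is not an executable file")
    for var in ("PROJECT_ROOT", "CUDA_INSTALL_PATH"):
        if not os.getenv(var):
            problems.append(f"environment variable {var} is not set")
    return problems
```

Running this before invoking the script surfaces missing environment variables early instead of failing mid-trace.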

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| program | str | The program to profile. | required |
| args | List[str] | The arguments to pass to the program. | required |
| --sample | bool | Run the program with kernel sampling using PKS. | False |
| --arch | str | The architecture to simulate. | "SM90_H100" |
| --delete | bool | Delete the traces directory. | False |
| --verbose | bool | Store verbose output from Accel-Sim. | False |
Output

The log files are saved under $PROJECT_ROOT/logs/ and contain the following files:

- traces: The directory containing the generated SASS traces.
- traces/kernelslist.g: The list of kernels in the traces; if kernel sampling is used, this file is updated with the selected kernels.
- .sim_cache.log: The cache log file containing detailed memory access information.
- .sim.log: The main log file containing the simulator output with key simulation runtime information.
- .sim.csv: The kernel-level simulation statistics in tabular (CSV) format.
- .sim_l1.csv: The L1 cache lifetime statistics in tabular format.
- .sim_l2.csv: The L2 cache lifetime statistics in tabular format.
- .frontend.json: JSON output from the frontend post-processing script.
- .postprocess.log: Log output from the frontend post-processing script.
- .sim_cmd.log: The command used to run the simulation.
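Since the kernel-level statistics land in CSV files, they can be inspected with the standard csv module. The column names below are placeholders standing in for a real .sim.csv file (the actual schema comes from the parser), so this only sketches the access pattern:

```python
import csv
import io

# Placeholder data standing in for a <name>.sim.csv file; the real
# column names depend on the parser and may differ.
sample = io.StringIO(
    "kernel_name,gpu_sim_cycle,gpu_ipc\n"
    "vecadd_kernel,12345,0.87\n"
    "reduce_kernel,6789,1.02\n"
)
rows = list(csv.DictReader(sample))
# Find the kernel with the most simulated cycles.
slowest = max(rows, key=lambda r: int(r["gpu_sim_cycle"]))
print(slowest["kernel_name"])  # prints: vecadd_kernel
```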

AccelSimRunner

Manages the Accel-Sim workflow: tracing, sampling, simulation, and post-processing.

This class encapsulates the steps required to profile a GPU program using Accel-Sim:

1. Optionally run Nsight Compute and Principal Kernel Selection (PKS) for sampling.
2. Run the NVBit tracer to generate SASS instruction traces.
3. Run the Accel-Sim GPGPU simulator on the generated (or sampled) traces.
4. Run post-processing scripts to analyze simulation output and generate reports.
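To make the control flow concrete, here is a stand-in with the same method names and signatures as AccelSimRunner that records the stage order instead of launching external tools; argument values and the class itself are illustrative only:

```python
class FakeRunner:
    """Stand-in mirroring AccelSimRunner's method names; records the
    stage order instead of running NCU, NVBit, or the simulator."""

    def __init__(self):
        self.stages = []

    def run_sampling(self, sample_delete=False):
        self.stages.append("sampling")         # 1. NCU + PKS (optional)

    def run_tracer(self, sample=False):
        self.stages.append("tracing")          # 2. NVBit SASS tracing

    def run_accel_sim(self, no_write_allocate=False):
        self.stages.append("simulation")       # 3. Accel-Sim on the traces

    def run_post_processing(self, config_file, sample=False):
        self.stages.append("post-processing")  # 4. CSV/JSON reports

r = FakeRunner()
r.run_sampling()
r.run_tracer(sample=True)
r.run_accel_sim()
r.run_post_processing("gpgpusim.config")
print(r.stages)  # ['sampling', 'tracing', 'simulation', 'post-processing']
```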

Attributes:

| Name | Type | Description |
|---|---|---|
| program | str | Absolute path to the executable program or Python interpreter. |
| program_args | List[str] | List of arguments to pass to the program. |
| arch | str | Target GPU architecture for simulation (e.g., "SM90_H100"). |
| verbose | bool | Flag to enable verbose logging during simulation. |
| delete | bool | Flag to delete existing trace directory before tracing. |
| pwd | str | The original working directory from where the script was invoked. |
| original_cwd | str | The original current working directory (same as pwd). |
| log_file_name | str | Base name for log files (program_timestamp). |
| log_file_path | str | Path to the directory where logs are stored. |
| kernelslist_g | str | Path to the kernelslist.g file within the trace directory. |
| ncu_file | str | Path to an existing NCU report file to use for sampling. |

Source code in python-scripts/accel_sim.py
class AccelSimRunner:
    """Manages the Accel-Sim workflow: tracing, sampling, simulation, and post-processing.

    This class encapsulates the steps required to profile a GPU program using Accel-Sim:
    1. Optionally run Nsight Compute and Principal Kernel Selection (PKS) for sampling.
    2. Run the NVBit tracer to generate SASS instruction traces.
    3. Run the Accel-Sim GPGPU simulator on the generated (or sampled) traces.
    4. Run post-processing scripts to analyze simulation output and generate reports.

    Attributes:
        program (str): Absolute path to the executable program or Python interpreter.
        program_args (List[str]): List of arguments to pass to the program.
        arch (str): Target GPU architecture for simulation (e.g., "SM90_H100").
        verbose (bool): Flag to enable verbose logging during simulation.
        delete (bool): Flag to delete existing trace directory before tracing.
        pwd (str): The original working directory from where the script was invoked.
        original_cwd (str): The original current working directory (same as pwd).
        log_file_name (str): Base name for log files (program_timestamp).
        log_file_path (str): Path to the directory where logs are stored.
        kernelslist_g (str): Path to the kernelslist.g file within the trace directory.
        ncu_file (str): Path to an existing NCU report file to use for sampling.
    """
    def __init__(self, program, args, arch="SM90_H100", verbose=False, delete=False, rename=None, ncu_file=None):
        """Initialize the AccelSimRunner.

        Sets up paths, log naming, and execution environment based on input parameters.

        Args:
            program (str): Path or name of the program to profile.
            args (List[str]): Arguments to pass to the program.
            arch (str): GPU architecture (default: "SM90_H100").
            verbose (bool): Enable verbose simulator logging.
            delete (bool): Remove existing traces directory.
            rename (str): Custom base name for log files.
            ncu_file (str): Existing Nsight Compute report for sampling.

        Returns:
            None: Initializes runner state and prepares directories.
        """
        # Convert program to absolute path if it's a relative path
        self.program = os.path.abspath(program) if program not in [
            "python", "python3", "torchrun"] else program
        self.program_args = args
        self.arch = arch
        self.verbose = verbose
        self.delete = delete
        self.pwd = os.getenv('PWD', os.getcwd())

        # Store original working directory
        self.original_cwd = os.getcwd()

        output_program_name = os.path.basename(program)
        # If the program is python3, we need to use sys.executable
        program_basename = os.path.basename(program)
        if program_basename == "python" or program_basename.startswith("python3") or program_basename == "torchrun":
            output_program_name = ".".join(
                os.path.basename(args[0]).split('.')[:-1])
            self.program = sys.executable
            self.program_args[0] = os.path.abspath(args[0])
        # If the program is a .py file, we need to use sys.executable
        if program.endswith(".py"):
            output_program_name = ".".join(
                os.path.basename(program).split('.')[:-1])
            self.program_args = [os.path.abspath(program)] + args
            self.program = sys.executable
        if rename:
            output_program_name = rename

        # Generate a timestamp for the log file name
        timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        self.log_file_name = f"{output_program_name}_{timestamp}"
        self.log_file_path = os.path.join(
            os.getenv('PROJECT_ROOT', '.'), 'logs', output_program_name)
        os.makedirs(self.log_file_path, exist_ok=True)

        self.kernelslist_g = os.path.join(
            self.log_file_path, 'traces', 'kernelslist.g')

        # Dump program and arguments to a file
        with open(os.path.join(self.log_file_path, f"{self.log_file_name}.sim_cmd.log"), 'w') as f:
            f.write(f"{self.program} {' '.join(self.program_args)}")

        self.ncu_file = ncu_file

    def run_tracer(self, sample=False):
        """Run the NVBit tracer to generate SASS instruction traces for the program.

        Sets up the environment for the NVBit tracer tool, executes the target program
        under the tracer, and then runs the post-processing script to convert raw
        traces into the format required by Accel-Sim (`.traceg` files). Optionally
        deletes the intermediate `.trace` files.

        Args:
            sample (bool): Enable kernel sampling via environment variable. Defaults to False.

        Returns:
            str: Path to the directory containing processed traces.

        Raises:
            FileNotFoundError: If the program executable is not found.
        """
        trace_path = os.path.join(self.log_file_path, 'traces')
        project_root = os.getenv('PROJECT_ROOT', '/gainsight')
        tracer_tool_path = os.getenv('TRACER_TOOLS', os.path.join(
            project_root, 'backend', 'accel-sim', 'util', 'tracer_nvbit', 'tracer_tool'))

        # Print kernelslist_g if it exists
        if os.path.exists(self.kernelslist_g):
            print(f"Kernels list file already exists: {self.kernelslist_g}")
            with open(self.kernelslist_g, 'r') as f:
                print(f.read())

        # Clear the traces directory
        if os.path.exists(os.path.join(trace_path, 'stats.csv')):
            print(
                f"Processed traces from {self.program} have already been generated and saved to {trace_path}")
            return trace_path
        os.makedirs(trace_path, exist_ok=True)

        # Run the tracer
        env = os.environ.copy()
        env['CUDA_INJECTION64_PATH'] = os.path.join(
            tracer_tool_path, 'tracer_tool.so')
        env['SAMPLE'] = '1' if sample else '0'
        env['KERNELSLIST'] = self.kernelslist_g
        env['USER_DEFINED_FOLDERS'] = '1'
        env['TRACES_FOLDER'] = trace_path
        env['TRACE_FILE_COMPRESS'] = '0'
        # Only trace the first 1000 kernels
        env['DYNAMIC_KERNEL_LIMIT_START'] = '0'
        env['DYNAMIC_KERNEL_LIMIT_END'] = '1000'
        env['INSTR_END'] = '200000000'

        # Check if the program exists
        if not os.path.exists(self.program) and self.program not in ["python", "python3", "torchrun"]:
            raise FileNotFoundError(f"Program not found: {self.program}")

        # Use original working directory instead of log file path
        # only run this if kernelslist does not exist
        if not os.path.exists(os.path.join(trace_path, 'kernelslist')):
            print("Running the tracer...")
            subprocess.run(
                [self.program] + self.program_args,
                env=env,
                cwd=self.pwd
                # cwd=self.log_file_path
            )
        else:
            print(
                f"Traces from {self.program} have already been generated and saved to {trace_path}")

        # Process the traces
        print("Processing traces...")
        subprocess.run([
            os.path.join(tracer_tool_path, 'traces-processing',
                         'post-traces-processing'),
            os.path.join(trace_path, 'kernelslist')
        ])

        # Delete the .trace files to save space
        for trace_file in os.listdir(trace_path):
            if trace_file.endswith('.trace') or trace_file.endswith('.trace.xz'):
                os.remove(os.path.join(trace_path, trace_file))
        print(
            f"Traces from {self.program} consisting of {len(os.listdir(trace_path)) - 3} kernel(s) have been saved to {trace_path}")
        return trace_path

    def run_accel_sim(self, no_write_allocate=False):
        """
        Run the Accel-Sim GPGPU simulator on the generated traces.

        Sets up the environment for Accel-Sim, constructs the command line with
        appropriate configuration files (including handling the write-allocate setting),
        and executes the simulator. It captures and redirects the simulator's output
        to different log files (`.sim_cache.log`, `.sim.log`, `.sim_verbose.log`)
        based on regex patterns.

        Args:
            no_write_allocate (bool): Disable cache write-allocate configuration. Defaults to False.

        Returns:
            None: Executes the simulation and writes log files.
        """
        # Get and set the environment variables
        env = os.environ.copy()
        project_root = os.getenv('PROJECT_ROOT', '/gainsight')
        ld_library_path = env.get('LD_LIBRARY_PATH', '')
        accel_sim_root = os.path.join(
            project_root, 'backend', 'accel-sim', 'gpu-simulator')
        env['ACCELSIM_ROOT'] = accel_sim_root
        gpgpusim_root = os.path.join(accel_sim_root, 'gpgpu-sim')
        gpgpusim_config = env.get(
            'GPGPUSIM_CONFIG', 'gcc-11.4.0/cuda-11080/release')
        ld_library_path = re.sub(
            rf'{re.escape(gpgpusim_root)}/lib/[0-9]+/(debug|release):', '', ld_library_path)
        ld_library_path = f"{gpgpusim_root}/lib/{gpgpusim_config}:{ld_library_path}"
        env['LD_LIBRARY_PATH'] = ld_library_path

        # Set arguments for the simulator
        accel_sim_exec = os.path.join(
            accel_sim_root, 'bin', 'release', 'accel-sim.out')
        trace_path = os.path.join(
            self.log_file_path, 'traces', 'kernelslist.g')
        # look for the config files under the config directory of the current script
        if no_write_allocate:
            gpgpu_sim_config = os.path.join(
                os.path.dirname(os.path.abspath(__file__)), 'configs', 'no_write_allocate.config')
        else:
            gpgpu_sim_config = os.path.join(
                os.path.dirname(os.path.abspath(__file__)), 'configs', 'gpgpusim.config')
        accel_sim_config = os.path.join(
            os.path.dirname(os.path.abspath(__file__)), 'configs', 'trace.config')
        cmd = [
            accel_sim_exec,
            '-trace', trace_path,
            '-config', gpgpu_sim_config,
            '-config', accel_sim_config,
            '-gpgpu_max_insn 10000000000'
        ]
        cmd_str = ' '.join(cmd)

        cache_log_path = os.path.join(
            self.log_file_path, f"{self.log_file_name}.sim_cache.log")
        main_log_path = os.path.join(
            self.log_file_path, f"{self.log_file_name}.sim.log")
        verbose_log_path = os.path.join(
            self.log_file_path, f"{self.log_file_name}.sim_verbose.log")

        cache_log_file = open(cache_log_path, 'w')
        main_log_file = open(main_log_path, 'w')
        verbose_log_file = open(
            verbose_log_path, 'w') if self.verbose else None

        # Define the regex patterns
        cache_pattern = r"L1|L2|Processing kernel|kernel name|kernel id|kernel command:CACHE_EVICT_WB:CACHE_WB"

        log_pattern = r"GPGPU-Sim:|gpgpu_simulation_time =|gpgpu_simulation_rate =|gpgpu_silicon_slowdown =|CPU Runtime:|GPU Runtime:|Processing kernel|kernel name|kernel id|kernel command|gpu_sim_cycle|gpu_sim_insn|gpu_ipc|gpu_tot_sim_cycle|gpu_tot_sim_insn|gpu_total_sim_rate|kernel_name"

        env['CMD_STR'] = cmd_str
        env['LOG_FILE_NAME'] = self.log_file_name
        env['LOG_FILE_PATH'] = self.log_file_path
        if self.verbose:
            env['VERBOSE'] = '1'

        # Run the simulator
        script = """#!/usr/bin/env bash
        source $ACCELSIM_ROOT/setup_environment.sh;
        stdbuf -oL -eL $CMD_STR
        """

        process = subprocess.Popen(
            ['bash', '-c', script],
            env=env,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            bufsize=1,
            executable='/bin/bash'
        )
        # Process the output line by line
        for line in iter(process.stdout.readline, ''):
            # Write to verbose log if enabled
            if self.verbose:
                verbose_log_file.write(line)
                verbose_log_file.flush()

            # Check for cache log pattern
            if re.search(cache_pattern, line):
                cache_log_file.write(line)
                cache_log_file.flush()

            # Check for sim log pattern
            if re.search(log_pattern, line):
                main_log_file.write(line)
                main_log_file.flush()
                # Print this line to stdout
                print(line, end='')

            sys.stdout.flush()
        # Wait for process to complete
        process.wait()
        # Close the log files
        cache_log_file.close()
        main_log_file.close()
        if self.verbose:
            verbose_log_file.close()

    def run_sampling(self, sample_delete=False):
        """Perform kernel sampling using Nsight Compute and Principal Kernel Selection (PKS).

        If an NCU report (`self.ncu_file`) is not provided, it first runs Nsight Compute
        to generate one. Then, it runs the `PrincipalKernelSelector` on the NCU report
        to identify representative kernels. It renames the original `kernelslist.g`
        (if it exists) to `kernelslist.old.g` and generates a new `kernelslist.g`
        containing only the selected kernels. Optionally deletes trace files not
        corresponding to the selected kernels.

        Args:
            sample_delete (bool): Remove traces not selected by PKS. Defaults to False.

        Returns:
            str: Path to the directory with updated `kernelslist.g`.

        Raises:
            Exception: Propagates errors from NCU or PKS execution.
        """
        trace_path = os.path.join(
            self.log_file_path, 'traces', 'kernelslist.g')
        old_trace_path = os.path.join(
            self.log_file_path, 'traces', 'kernelslist.old.g')

        if os.path.exists(trace_path) and self.delete:
            shutil.rmtree(os.path.dirname(trace_path))

        # Create parent directory if it doesn't exist
        os.makedirs(os.path.join(
            self.log_file_path, 'traces'), exist_ok=True)

        # If old_trace_path exists, then the program has already been run with sampling
        if os.path.exists(old_trace_path) and os.path.exists(trace_path):
            print(
                f"Samples from {self.program} have already been generated and saved to {trace_path}")
            if self.delete:
                os.remove(old_trace_path)
            else:
                return trace_path

        if not self.ncu_file or not os.path.exists(self.ncu_file):
            runner = NsightNVBitRunner()
            runner.init_from_params(
                program_name=self.program,
                program_args=self.program_args,
                log_file_name=self.log_file_name,
                log_file_path=self.log_file_path
            )
            ncu_file = runner.run_ncu(dry_run=False)
        else:
            ncu_file = self.ncu_file
        try:
            pks = PrincipalKernelSelector(
                ncu_input_file=ncu_file,
                output_dir=os.path.join(self.log_file_path, 'traces')
            )
            # Copy the existing kernelslist.g file to kernelslist.old.g
            if os.path.exists(trace_path):
                os.rename(trace_path, os.path.join(
                    self.log_file_path, 'traces', 'kernelslist.old.g'))
            pks.run_pks(trace_path, delete=sample_delete)
        except Exception as e:
            print(f"Error: {e}")
        finally:
            return trace_path

    def run_post_processing(self, config_file, sample=False):
        """Post-process simulation outputs into CSV and JSON reports.

        First, runs the `accel_sim_parser.run_parser` function to generate CSV summaries
        (`*.sim.csv`, `*.sim_l1.csv`, `*.sim_l2.csv`) from the simulation cache log.
        Second, runs the `gain_cell_frontend.py` script to generate a JSON file
        (`*.frontend.json`) for visualization or further analysis.

        Args:
            config_file (str): Path to GPGPU-Sim configuration used during simulation.
            sample (bool): Indicates if sampling was used. Defaults to False.

        Returns:
            None: Outputs `.sim.csv`, `.sim_l1.csv`, `.sim_l2.csv` and `.frontend.json`.
        """
        cache_log_path = os.path.join(
            self.log_file_path, f"{self.log_file_name}.sim.csv")
        run_parser(self.log_file_path, self.log_file_name, config_file)

        project_root = os.getenv('PROJECT_ROOT', '/gainsight')
        frontend_path = os.path.join(project_root, 'frontend')
        frontend_scripts_path = os.path.join(
            frontend_path, 'gain_cell_frontend.py')
        sample_string = "--sample" if sample else ""
        postprocess_log_path = os.path.join(
            self.log_file_path, f"{self.log_file_name}.postprocess.log")
        postprocess_log_file = open(postprocess_log_path, 'w')
        # Run the frontend script with cache_log_path as argument in the working directory of frontend_path
        process = subprocess.Popen(
            [sys.executable, frontend_scripts_path, sample_string, cache_log_path],
            cwd=frontend_path,
            stdout=postprocess_log_file,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            bufsize=1
        )
        if process.stdout is not None:
            for line in iter(process.stdout.readline, ''):
                postprocess_log_file.write(line)
                postprocess_log_file.flush()
                # Print this line to stdout
                print(line, end='')
                sys.stdout.flush()
        # Wait for process to complete
        process.wait()
        postprocess_log_file.close()

        # Check if the output JSON file exists
        output_json_path = os.path.join(
            self.log_file_path, f"{self.log_file_name}.frontend.json")
        if os.path.exists(output_json_path):
            print(
                f"Frontend JSON file generated: {output_json_path}")
        else:
            print(
                "Error: Frontend JSON file not generated. Please check the frontend script for errors.")
        return
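The log naming scheme from __init__ (program stem plus timestamp) can be reproduced compactly; os.path.splitext is an equivalent, slightly more idiomatic way to drop the extension than the manual split('.') used in the source. This helper is illustrative and not part of accel_sim.py:

```python
import os
from datetime import datetime

def log_base_name(program, rename=None):
    """Mirror the naming in AccelSimRunner.__init__: <stem>_<timestamp>.

    Illustrative helper, not part of accel_sim.py.
    """
    # splitext drops only the final extension, like the split('.') logic.
    stem = rename or os.path.splitext(os.path.basename(program))[0]
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    return f"{stem}_{timestamp}"

print(log_base_name("/workloads/vectoradd.py"))  # e.g. vectoradd_2025-01-01_12-00-00
```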

__init__(program, args, arch='SM90_H100', verbose=False, delete=False, rename=None, ncu_file=None)

Initialize the AccelSimRunner.

Sets up paths, log naming, and execution environment based on input parameters.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| program | str | Path or name of the program to profile. | required |
| args | List[str] | Arguments to pass to the program. | required |
| arch | str | GPU architecture. | "SM90_H100" |
| verbose | bool | Enable verbose simulator logging. | False |
| delete | bool | Remove existing traces directory. | False |
| rename | str | Custom base name for log files. | None |
| ncu_file | str | Existing Nsight Compute report for sampling. | None |

Returns:

| Type | Description |
|---|---|
| None | Initializes runner state and prepares directories. |

Source code in python-scripts/accel_sim.py
def __init__(self, program, args, arch="SM90_H100", verbose=False, delete=False, rename=None, ncu_file=None):
    """Initialize the AccelSimRunner.

    Sets up paths, log naming, and execution environment based on input parameters.

    Args:
        program (str): Path or name of the program to profile.
        args (List[str]): Arguments to pass to the program.
        arch (str): GPU architecture (default: "SM90_H100").
        verbose (bool): Enable verbose simulator logging.
        delete (bool): Remove existing traces directory.
        rename (str): Custom base name for log files.
        ncu_file (str): Existing Nsight Compute report for sampling.

    Returns:
        None: Initializes runner state and prepares directories.
    """
    # Convert program to absolute path if it's a relative path
    self.program = os.path.abspath(program) if program not in [
        "python", "python3", "torchrun"] else program
    self.program_args = args
    self.arch = arch
    self.verbose = verbose
    self.delete = delete
    self.pwd = os.getenv('PWD', os.getcwd())

    # Store original working directory
    self.original_cwd = os.getcwd()

    output_program_name = os.path.basename(program)
    # If the program is python3, we need to use sys.executable
    program_basename = os.path.basename(program)
    if program_basename == "python" or program_basename.startswith("python3") or program_basename == "torchrun":
        output_program_name = ".".join(
            os.path.basename(args[0]).split('.')[:-1])
        self.program = sys.executable
        self.program_args[0] = os.path.abspath(args[0])
    # If the program is a .py file, we need to use sys.executable
    if program.endswith(".py"):
        output_program_name = ".".join(
            os.path.basename(program).split('.')[:-1])
        self.program_args = [os.path.abspath(program)] + args
        self.program = sys.executable
    if rename:
        output_program_name = rename

    # Generate a timestamp for the log file name
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    self.log_file_name = f"{output_program_name}_{timestamp}"
    self.log_file_path = os.path.join(
        os.getenv('PROJECT_ROOT', '.'), 'logs', output_program_name)
    os.makedirs(self.log_file_path, exist_ok=True)

    self.kernelslist_g = os.path.join(
        self.log_file_path, 'traces', 'kernelslist.g')

    # Dump program and arguments to a file
    with open(os.path.join(self.log_file_path, f"{self.log_file_name}.sim_cmd.log"), 'w') as f:
        f.write(f"{self.program} {' '.join(self.program_args)}")

    self.ncu_file = ncu_file

run_accel_sim(no_write_allocate=False)

Run the Accel-Sim GPGPU simulator on the generated traces.

Sets up the environment for Accel-Sim, constructs the command line with appropriate configuration files (including handling the write-allocate setting), and executes the simulator. It captures and redirects the simulator's output to different log files (.sim_cache.log, .sim.log, .sim_verbose.log) based on regex patterns.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| no_write_allocate | bool | Disable cache write-allocate configuration. | False |

Returns:

| Type | Description |
|---|---|
| None | Executes the simulation and writes log files. |

Source code in python-scripts/accel_sim.py
def run_accel_sim(self, no_write_allocate=False):
    """
    Run the Accel-Sim GPGPU simulator on the generated traces.

    Sets up the environment for Accel-Sim, constructs the command line with
    appropriate configuration files (including handling the write-allocate setting),
    and executes the simulator. It captures and redirects the simulator's output
    to different log files (`.sim_cache.log`, `.sim.log`, `.sim_verbose.log`)
    based on regex patterns.

    Args:
        no_write_allocate (bool): Disable cache write-allocate configuration. Defaults to False.

    Returns:
        None: Executes the simulation and writes log files.
    """
    # Get and set the environment variables
    env = os.environ.copy()
    project_root = os.getenv('PROJECT_ROOT', '/gainsight')
    ld_library_path = env.get('LD_LIBRARY_PATH', '')
    accel_sim_root = os.path.join(
        project_root, 'backend', 'accel-sim', 'gpu-simulator')
    env['ACCELSIM_ROOT'] = accel_sim_root
    gpgpusim_root = os.path.join(accel_sim_root, 'gpgpu-sim')
    gpgpusim_config = env.get(
        'GPGPUSIM_CONFIG', 'gcc-11.4.0/cuda-11080/release')
    ld_library_path = re.sub(
        rf'{re.escape(gpgpusim_root)}/lib/[0-9]+/(debug|release):', '', ld_library_path)
    ld_library_path = f"{gpgpusim_root}/lib/{gpgpusim_config}:{ld_library_path}"
    env['LD_LIBRARY_PATH'] = ld_library_path

    # Set arguments for the simulator
    accel_sim_exec = os.path.join(
        accel_sim_root, 'bin', 'release', 'accel-sim.out')
    trace_path = os.path.join(
        self.log_file_path, 'traces', 'kernelslist.g')
    # look for the config files under the config directory of the current script
    if no_write_allocate:
        gpgpu_sim_config = os.path.join(
            os.path.dirname(os.path.abspath(__file__)), 'configs', 'no_write_allocate.config')
    else:
        gpgpu_sim_config = os.path.join(
            os.path.dirname(os.path.abspath(__file__)), 'configs', 'gpgpusim.config')
    accel_sim_config = os.path.join(
        os.path.dirname(os.path.abspath(__file__)), 'configs', 'trace.config')
    cmd = [
        accel_sim_exec,
        '-trace', trace_path,
        '-config', gpgpu_sim_config,
        '-config', accel_sim_config,
        '-gpgpu_max_insn 10000000000'
    ]
    cmd_str = ' '.join(cmd)

    cache_log_path = os.path.join(
        self.log_file_path, f"{self.log_file_name}.sim_cache.log")
    main_log_path = os.path.join(
        self.log_file_path, f"{self.log_file_name}.sim.log")
    verbose_log_path = os.path.join(
        self.log_file_path, f"{self.log_file_name}.sim_verbose.log")

    cache_log_file = open(cache_log_path, 'w')
    main_log_file = open(main_log_path, 'w')
    verbose_log_file = open(
        verbose_log_path, 'w') if self.verbose else None

    # Define the regex patterns
    cache_pattern = r"L1|L2|Processing kernel|kernel name|kernel id|kernel command:CACHE_EVICT_WB:CACHE_WB"

    log_pattern = r"GPGPU-Sim:|gpgpu_simulation_time =|gpgpu_simulation_rate =|gpgpu_silicon_slowdown =|CPU Runtime:|GPU Runtime:|Processing kernel|kernel name|kernel id|kernel command|gpu_sim_cycle|gpu_sim_insn|gpu_ipc|gpu_tot_sim_cycle|gpu_tot_sim_insn|gpu_total_sim_rate|kernel_name"

    env['CMD_STR'] = cmd_str
    env['LOG_FILE_NAME'] = self.log_file_name
    env['LOG_FILE_PATH'] = self.log_file_path
    if self.verbose:
        env['VERBOSE'] = '1'

    # Run the simulator
    script = """#!/usr/bin/env bash
    source $ACCELSIM_ROOT/setup_environment.sh;
    stdbuf -oL -eL $CMD_STR
    """

    process = subprocess.Popen(
        ['bash', '-c', script],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        bufsize=1,
        executable='/bin/bash'
    )
    # Process the output line by line
    for line in iter(process.stdout.readline, ''):
        # Write to verbose log if enabled
        if self.verbose:
            verbose_log_file.write(line)
            verbose_log_file.flush()

        # Check for cache log pattern
        if re.search(cache_pattern, line):
            cache_log_file.write(line)
            cache_log_file.flush()

        # Check for sim log pattern
        if re.search(log_pattern, line):
            main_log_file.write(line)
            main_log_file.flush()
            # Print this line to stdout
            print(line, end='')

        sys.stdout.flush()
    # Wait for process to complete
    process.wait()
    # Close the log files
    cache_log_file.close()
    main_log_file.close()
    if self.verbose:
        verbose_log_file.close()

run_post_processing(config_file, sample=False)

Post-process simulation outputs into CSV and JSON reports.

First, runs the accel_sim_parser.run_parser function to generate CSV summaries (*.sim.csv, *.sim_l1.csv, *.sim_l2.csv) from the simulation cache log. Second, runs the gain_cell_frontend.py script to generate a JSON file (*.frontend.json) for visualization or further analysis.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `config_file` | `str` | Path to the GPGPU-Sim configuration used during simulation. | required |
| `sample` | `bool` | Indicates if sampling was used. Defaults to False. | `False` |

Returns:

| Type | Description |
| --- | --- |
| `None` | Outputs `.sim.csv`, `.sim_l1.csv`, `.sim_l2.csv`, and `.frontend.json`. |
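The line-by-line "tee" used when launching the frontend script can be sketched in isolation. This is a minimal, self-contained sketch: the child command below is a stand-in for `gain_cell_frontend.py`, and the log path is a throwaway temp file rather than the real `.postprocess.log`:

```python
import os
import subprocess
import sys
import tempfile

# Stream a child process's output to both a log file and stdout,
# mirroring how the post-processing log is captured.
log_path = os.path.join(tempfile.mkdtemp(), "postprocess.log")
child = [sys.executable, "-c", "print('parsing csv'); print('writing json')"]

with open(log_path, "w") as log_file:
    process = subprocess.Popen(
        child,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        bufsize=1,
    )
    for line in iter(process.stdout.readline, ""):
        log_file.write(line)   # persist to the log file
        print(line, end="")    # echo to the console as it arrives
    process.wait()
```

Reading the child's stdout through a pipe (rather than redirecting it straight to the file) is what makes the simultaneous console echo possible.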

Source code in python-scripts/accel_sim.py
def run_post_processing(self, config_file, sample=False):
    """Post-process simulation outputs into CSV and JSON reports.

    First, runs the `accel_sim_parser.run_parser` function to generate CSV summaries
    (`*.sim.csv`, `*.sim_l1.csv`, `*.sim_l2.csv`) from the simulation cache log.
    Second, runs the `gain_cell_frontend.py` script to generate a JSON file
    (`*.frontend.json`) for visualization or further analysis.

    Args:
        config_file (str): Path to GPGPU-Sim configuration used during simulation.
        sample (bool): Indicates if sampling was used. Defaults to False.

    Returns:
        None: Outputs `.sim.csv`, `.sim_l1.csv`, `.sim_l2.csv` and `.frontend.json`.
    """
    # Note: despite the name, this path points to the parsed CSV summary,
    # which the frontend script consumes
    cache_log_path = os.path.join(
        self.log_file_path, f"{self.log_file_name}.sim.csv")
    run_parser(self.log_file_path, self.log_file_name, config_file)

    project_root = os.getenv('PROJECT_ROOT', '/gainsight')
    frontend_path = os.path.join(project_root, 'frontend')
    frontend_scripts_path = os.path.join(
        frontend_path, 'gain_cell_frontend.py')
    postprocess_log_path = os.path.join(
        self.log_file_path, f"{self.log_file_name}.postprocess.log")
    postprocess_log_file = open(postprocess_log_path, 'w')
    # Build the frontend arguments; omit --sample entirely when sampling is
    # off so the script does not receive an empty positional argument
    frontend_cmd = [sys.executable, frontend_scripts_path]
    if sample:
        frontend_cmd.append("--sample")
    frontend_cmd.append(cache_log_path)
    # Run the frontend script in the working directory of frontend_path,
    # teeing its output to both the post-processing log and stdout
    process = subprocess.Popen(
        frontend_cmd,
        cwd=frontend_path,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        bufsize=1
    )
    for line in iter(process.stdout.readline, ''):
        postprocess_log_file.write(line)
        postprocess_log_file.flush()
        # Print this line to stdout as well
        print(line, end='')
        sys.stdout.flush()
    # Wait for process to complete
    process.wait()
    postprocess_log_file.close()

    # Check if the output JSON file exists
    output_json_path = os.path.join(
        self.log_file_path, f"{self.log_file_name}.frontend.json")
    if os.path.exists(output_json_path):
        print(
            f"Frontend JSON file generated: {output_json_path}")
    else:
        print(
            "Error: Frontend JSON file not generated. Please check the frontend script for errors.")
    return

run_sampling(sample_delete=False)

Perform kernel sampling using Nsight Compute and Principal Kernel Selection (PKS).

If an NCU report (self.ncu_file) is not provided, it first runs Nsight Compute to generate one. Then, it runs the PrincipalKernelSelector on the NCU report to identify representative kernels. It renames the original kernelslist.g (if it exists) to kernelslist.old.g and generates a new kernelslist.g containing only the selected kernels. Optionally deletes trace files not corresponding to the selected kernels.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `sample_delete` | `bool` | Remove traces not selected by PKS. Defaults to False. | `False` |

Returns:

| Type | Description |
| --- | --- |
| `str` | Path to the directory with the updated `kernelslist.g`. |

Raises:

| Type | Description |
| --- | --- |
| `Exception` | Propagates errors from NCU or PKS execution. |
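The `kernelslist.g` bookkeeping described above (preserve the original list, then write a reduced one) can be sketched with a throwaway directory. The selected kernel name below is a hypothetical stand-in for real PKS output:

```python
import os
import tempfile

traces = tempfile.mkdtemp()
trace_path = os.path.join(traces, "kernelslist.g")
old_trace_path = os.path.join(traces, "kernelslist.old.g")

# Pretend the tracer already produced a full kernel list
with open(trace_path, "w") as f:
    f.write("kernel-1.traceg\nkernel-2.traceg\nkernel-3.traceg\n")

# Preserve the original list, then write only the selected kernels
os.rename(trace_path, old_trace_path)
with open(trace_path, "w") as f:
    f.write("kernel-2.traceg\n")  # hypothetical PKS selection

print(sorted(os.listdir(traces)))
```

Keeping `kernelslist.old.g` alongside the reduced list is what lets a later run detect that sampling has already been performed.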

Source code in python-scripts/accel_sim.py
def run_sampling(self, sample_delete=False):
    """Perform kernel sampling using Nsight Compute and Principal Kernel Selection (PKS).

    If an NCU report (`self.ncu_file`) is not provided, it first runs Nsight Compute
    to generate one. Then, it runs the `PrincipalKernelSelector` on the NCU report
    to identify representative kernels. It renames the original `kernelslist.g`
    (if it exists) to `kernelslist.old.g` and generates a new `kernelslist.g`
    containing only the selected kernels. Optionally deletes trace files not
    corresponding to the selected kernels.

    Args:
        sample_delete (bool): Remove traces not selected by PKS. Defaults to False.

    Returns:
        str: Path to the directory with updated `kernelslist.g`.

    Raises:
        Exception: Propagates errors from NCU or PKS execution.
    """
    trace_path = os.path.join(
        self.log_file_path, 'traces', 'kernelslist.g')
    old_trace_path = os.path.join(
        self.log_file_path, 'traces', 'kernelslist.old.g')

    if os.path.exists(trace_path) and self.delete:
        shutil.rmtree(os.path.dirname(trace_path))

    # Create parent directory if it doesn't exist
    os.makedirs(os.path.join(
        self.log_file_path, 'traces'), exist_ok=True)

    # If old_trace_path exists, then the program has already been run with sampling
    if os.path.exists(old_trace_path) and os.path.exists(trace_path):
        print(
            f"Samples from {self.program} have already been generated and saved to {trace_path}")
        if self.delete:
            os.remove(old_trace_path)
        else:
            return trace_path

    if not self.ncu_file or not os.path.exists(self.ncu_file):
        runner = NsightNVBitRunner()
        runner.init_from_params(
            program_name=self.program,
            program_args=self.program_args,
            log_file_name=self.log_file_name,
            log_file_path=self.log_file_path
        )
        ncu_file = runner.run_ncu(dry_run=False)
    else:
        ncu_file = self.ncu_file
    try:
        pks = PrincipalKernelSelector(
            ncu_input_file=ncu_file,
            output_dir=os.path.join(self.log_file_path, 'traces')
        )
        # Rename the existing kernelslist.g file to kernelslist.old.g
        if os.path.exists(trace_path):
            os.rename(trace_path, old_trace_path)
        pks.run_pks(trace_path, delete=sample_delete)
    except Exception as e:
        # Log and re-raise so NCU/PKS failures propagate, as documented
        print(f"Error: {e}")
        raise
    return trace_path

run_tracer(sample=False)

Run the NVBit tracer to generate SASS instruction traces for the program.

Sets up the environment for the NVBit tracer tool, executes the target program under the tracer, and then runs the post-processing script to convert raw traces into the format required by Accel-Sim (.traceg files). Optionally deletes the intermediate .trace files.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `sample` | `bool` | Enable kernel sampling via environment variable. Defaults to False. | `False` |

Returns:

| Type | Description |
| --- | --- |
| `str` | Path to the directory containing processed traces. |

Raises:

| Type | Description |
| --- | --- |
| `FileNotFoundError` | If the program executable is not found. |

Source code in python-scripts/accel_sim.py
def run_tracer(self, sample=False):
    """Run the NVBit tracer to generate SASS instruction traces for the program.

    Sets up the environment for the NVBit tracer tool, executes the target program
    under the tracer, and then runs the post-processing script to convert raw
    traces into the format required by Accel-Sim (`.traceg` files). Optionally
    deletes the intermediate `.trace` files.

    Args:
        sample (bool): Enable kernel sampling via environment variable. Defaults to False.

    Returns:
        str: Path to the directory containing processed traces.

    Raises:
        FileNotFoundError: If the program executable is not found.
    """
    trace_path = os.path.join(self.log_file_path, 'traces')
    project_root = os.getenv('PROJECT_ROOT', '/gainsight')
    tracer_tool_path = os.getenv('TRACER_TOOLS', os.path.join(
        project_root, 'backend', 'accel-sim', 'util', 'tracer_nvbit', 'tracer_tool'))

    # Print kernelslist_g if it exists
    if os.path.exists(self.kernelslist_g):
        print(f"Kernels list file already exists: {self.kernelslist_g}")
        with open(self.kernelslist_g, 'r') as f:
            print(f.read())

    # Clear the traces directory
    if os.path.exists(os.path.join(trace_path, 'stats.csv')):
        print(
            f"Processed traces from {self.program} have already been generated and saved to {trace_path}")
        return trace_path
    os.makedirs(trace_path, exist_ok=True)

    # Run the tracer
    env = os.environ.copy()
    env['CUDA_INJECTION64_PATH'] = os.path.join(
        tracer_tool_path, 'tracer_tool.so')
    env['SAMPLE'] = '1' if sample else '0'
    env['KERNELSLIST'] = self.kernelslist_g
    env['USER_DEFINED_FOLDERS'] = '1'
    env['TRACES_FOLDER'] = trace_path
    env['TRACE_FILE_COMPRESS'] = '0'
    # Only trace the first 1000 kernels
    env['DYNAMIC_KERNEL_LIMIT_START'] = '0'
    env['DYNAMIC_KERNEL_LIMIT_END'] = '1000'
    env['INSTR_END'] = '200000000'

    # Check that the program exists (interpreter launchers are allowed through)
    if not os.path.exists(self.program) and self.program not in ["python", "python3", "torchrun"]:
        raise FileNotFoundError(f"Program not found: {self.program}")

    # Use original working directory instead of log file path
    # only run this if kernelslist does not exist
    if not os.path.exists(os.path.join(trace_path, 'kernelslist')):
        print("Running the tracer...")
        subprocess.run(
            [self.program] + self.program_args,
            env=env,
            cwd=self.pwd
            # cwd=self.log_file_path
        )
    else:
        print(
            f"Traces from {self.program} have already been generated and saved to {trace_path}")

    # Process the traces
    print("Processing traces...")
    subprocess.run([
        os.path.join(tracer_tool_path, 'traces-processing',
                     'post-traces-processing'),
        os.path.join(trace_path, 'kernelslist')
    ])

    # Delete the .trace files to save space
    for trace_file in os.listdir(trace_path):
        if trace_file.endswith('.trace') or trace_file.endswith('.trace.xz'):
            os.remove(os.path.join(trace_path, trace_file))
    print(
        f"Traces from {self.program} consisting of {len(os.listdir(trace_path)) - 3} kernel(s) have been saved to {trace_path}")
    return trace_path

parse_args()

Parse command-line arguments for the Accel-Sim runner script.

Defines and parses arguments related to program execution, simulation options, tracing, sampling, and post-processing.

Returns:

| Type | Description |
| --- | --- |
| `argparse.Namespace` | An object containing the parsed command-line arguments, including attributes such as `program`, `args`, `sample`, `arch`, `delete`, `verbose`, `rename`, `trace_only`, `replay_only`, `post_process`, `sample_delete`, `no_write_allocate`, and `ncu_file`. |
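One detail worth noting is the use of `argparse.REMAINDER` for the trailing program arguments: everything after the program name is forwarded untouched, even tokens that look like options. A minimal sketch with only a couple of the flags above (`./my_app` and `--foo` are made-up examples):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--sample", action="store_true")
parser.add_argument("program", type=str)
# REMAINDER swallows everything after the program name, even
# option-like tokens, so they reach the target program unparsed
parser.add_argument("args", nargs=argparse.REMAINDER)

ns = parser.parse_args(["--sample", "./my_app", "--foo", "bar"])
print(ns.sample, ns.program, ns.args)
```

Without `REMAINDER`, argparse would reject `--foo` as an unrecognized option instead of passing it through to the profiled program.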

Source code in python-scripts/accel_sim.py
def parse_args():
    """Parse command-line arguments for the Accel-Sim runner script.

    Defines and parses arguments related to program execution, simulation options,
    tracing, sampling, and post-processing.

    Returns:
        argparse.Namespace: An object containing the parsed command-line arguments.
            Includes attributes like `program`, `args`, `sample`, `arch`, `delete`,
            `verbose`, `rename`, `trace_only`, `replay_only`, `post_process`,
            `sample_delete`, `no_write_allocate`, `ncu_file`.
    """
    # echo "Usage: ./generate_traces.sh [--sample] <program> <args>"
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--sample",
        help="Run the program with kernel sampling",
        action="store_true",
        default=False
    )
    parser.add_argument(
        "--arch",
        help="The architecture to simulate",
        type=str,
        default="SM90_H100"
    )
    parser.add_argument(
        "--delete",
        help="Delete the traces directory before running the tracer",
        action="store_true",
        default=False
    )
    parser.add_argument(
        "--verbose",
        help="Store verbose output from Accel-Sim",
        action="store_true",
        default=False
    )
    parser.add_argument(
        "--rename",
        help="Rename the log file",
        type=str,
        default=None
    )
    parser.add_argument(
        "--trace-only",
        help="Only run the tracer and exit",
        action="store_true",
        default=False
    )
    parser.add_argument(
        "--replay-only",
        help="Only run the replay and exit",
        action="store_true",
        default=False
    )
    parser.add_argument(
        "--post-process",
        help="Run the post-processing step to generate the simulation results",
        type=str,
        default=None
    )
    parser.add_argument(
        "--sample-delete",
        help="Delete traces that are not used for sampling",
        action="store_true",
        default=False
    )
    parser.add_argument(
        "--no-write-allocate",
        help="Disable write allocate for the cache",
        action="store_true",
        default=False
    )
    parser.add_argument(
        "--ncu-file",
        help="The Nsight Compute report to use for kernel sampling",
        type=str,
        default=None
    )
    parser.add_argument(
        "program",
        help="The program to profile",
        type=str
    )
    parser.add_argument(
        "args",
        help="Arguments to pass to the program",
        type=str,
        nargs=argparse.REMAINDER
    )
    args = parser.parse_args()
    return args

Sampling via Principal Kernel Selection (PKS)

The pks.py script implements the Principal Kernel Selection (PKS) algorithm for sampling in the Accel-Sim simulator. It provides functions to select a subset of kernels based on their execution characteristics and performance metrics.

Principal Kernel Selection implementation for CUDA kernel profiling.

This module implements the Principal Kernel Selection (PKS) algorithm, which uses PCA and K-means clustering to identify representative CUDA kernels from an NVIDIA Nsight Compute report. This helps reduce simulation time by focusing on a smaller set of representative kernels.

The functionality of this module is an adaptation and reproduction of the following work: Cesar Avalos Baddouh, Mahmoud Khairy, Roland N. Green, Mathias Payer, and Timothy G. Rogers. 2021. Principal Kernel Analysis: A Tractable Methodology to Simulate Scaled GPU Workloads. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '21). Association for Computing Machinery, New York, NY, USA, 724–737. https://doi.org/10.1145/3466752.3480100

TODO: Implement real-time monitoring of read/write frequencies, bursts, and lifetime trends.

Typical usage

```python
pks = PrincipalKernelSelector(
    ncu_input_file="path/to/report.ncu-rep",
    output_dir="path/to/traces_dir"
)
pks.run_pks()
```
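At its core, PKS picks, for each K-means cluster, the kernel whose PCA coordinates lie closest to the cluster centroid. This is a self-contained sketch of that selection step using toy coordinates and hand-picked cluster assignments, not real kernel data:

```python
import numpy as np

# Toy PCA-space coordinates for 6 kernels (rows) in 2 components
points = np.array([
    [0.0, 0.0], [0.1, 0.1], [0.2, 0.0],   # cluster near the origin
    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],   # cluster near (5, 5)
])
labels = np.array([0, 0, 0, 1, 1, 1])      # K-means assignments
centers = np.array([[0.1, 0.05], [5.0, 5.03]])

# For each cluster, take the kernel closest to its centroid
representatives = []
for cid, center in enumerate(centers):
    idx = np.where(labels == cid)[0]
    dists = np.linalg.norm(points[idx] - center, axis=1)
    representatives.append(int(idx[np.argmin(dists)]))

print(representatives)
```

Each representative kernel then stands in for every kernel in its cluster during simulation, weighted by the cluster size.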

PrincipalKernelSelector

Analyzes CUDA kernel metrics to identify representative kernels.

Implements the Principal Kernel Selection (PKS) algorithm using PCA and K-means clustering to select representative CUDA kernels for simulation.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `ncu_input_file` | `str` | Path to the NCU report file to analyze. | required |
| `output_dir` | `str` | Directory for saving output files. Defaults to the script directory. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `None` | |

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `metrics` | `List[str]` | List of collected NCU metrics for analysis. |
| `ncu_data` | `DataFrame` | DataFrame containing raw kernel metrics. |
| `pca_df` | `DataFrame` | DataFrame with PCA-transformed kernel data. |
| `cluster_count` | `int` | Optimal number of clusters determined. |
| `kernel_df` | `DataFrame` | DataFrame with kernel details and cluster assignments. |
| `cluster_df` | `DataFrame` | DataFrame with cluster details and centroids. |
| `output_dir` | `str` | Directory for saving output files. |
| `bypass` | `bool` | Flag indicating if PKS should be bypassed. |
| `sum_lts_t_sectors_op_write` | `float` | Sum of `lts__t_sectors_op_write.sum` across all kernels. |
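The PCA step keeps just enough principal components to explain 95% of the variance in the kernel metrics. A NumPy-only sketch of that idea on synthetic, deliberately redundant metrics (the class itself uses scikit-learn's `PCA(n_components=0.95)`, which does the equivalent internally):

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(6, 2))
X = np.hstack([base, base * 2.0])   # 4 "metrics", only 2 independent

# Standardize, then rank components by explained variance via SVD
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
explained = (S ** 2) / (S ** 2).sum()

# Smallest k whose cumulative explained variance reaches 95%
k = int(np.searchsorted(np.cumsum(explained), 0.95) + 1)
pcs = Xs @ Vt[:k].T                 # PCA-transformed kernel data

print(pcs.shape[0], k <= 2)
```

Because two of the four columns are exact multiples of the others, the data has rank 2, so at most two components are ever needed to hit the threshold.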

Source code in python-scripts/pks.py
class PrincipalKernelSelector:
    """Analyzes CUDA kernel metrics to identify representative kernels.

    Implements the Principal Kernel Selection (PKS) algorithm using PCA and
    K-means clustering to select representative CUDA kernels for simulation.

    Args:
        ncu_input_file (str): Path to the NCU report file to analyze.
        output_dir (str, optional): Directory for saving output files. Defaults to script directory.

    Returns:
        None

    Attributes:
        metrics (List[str]): List of collected NCU metrics for analysis.
        ncu_data (pd.DataFrame): DataFrame containing raw kernel metrics.
        pca_df (pd.DataFrame): DataFrame with PCA-transformed kernel data.
        cluster_count (int): Optimal number of clusters determined.
        kernel_df (pd.DataFrame): DataFrame with kernel details and cluster assignments.
        cluster_df (pd.DataFrame): DataFrame with cluster details and centroids.
        output_dir (str): Directory for saving output files.
        bypass (bool): Flag indicating if PKS should be bypassed.
        sum_lts_t_sectors_op_write (float): Sum of 'lts__t_sectors_op_write.sum' for all kernels.
    """

    def __init__(self, ncu_input_file: str, output_dir: str = None):
        """Initialize the Principal Kernel Selector with an NCU report.

        Args:
            ncu_input_file (str): Path to the NCU report file to analyze.
            output_dir (str, optional): Directory for saving output files. Should be the directory
                containing trace files (kernel-1.traceg, etc.). Defaults to the
                directory of this script.

        Raises:
            ValueError: If the input file contains fewer than 2 kernels (unless bypass is triggered).
            FileNotFoundError: If the metrics_list.json file is not found.
            ImportError: If the ncu_report module cannot be imported.
        """
        # Class member variables for main-function state
        self.metrics = None
        self.ncu_data = None
        self.pca_df = None

        # Path to output directory for generated files
        # Defaults to path of this script if not provided or invalid
        self.output_dir = output_dir if output_dir and os.path.exists(output_dir) \
            else os.path.dirname(os.path.realpath(__file__))

        # Number of clusters
        self.cluster_count = None

        # DataFrame containing kernel details and cluster assignments
        self.kernel_df = None

        # DataFrame containing cluster details and centroids
        self.cluster_df = None

        # Load metrics from configuration file
        metrics_path = os.path.join(os.path.dirname(
            os.path.realpath(__file__)), "metrics_list.json")
        with open(metrics_path, "r") as metrics_file:
            metrics = json.load(metrics_file)

        # Flatten metrics dictionary into a list
        concatenated_metrics = []
        for key, value in metrics.items():
            concatenated_metrics.extend(value)
        self.metrics = concatenated_metrics

        # Load NCU report data
        ncu_context = ncu_report.load_report(ncu_input_file)
        ncu_range = ncu_context[0]  # Use first range in report

        if ncu_context.num_ranges() > 1:
            print("Warning: Multiple ranges found in the input file. "
                  "Using the first range.")

        print(
            f"Loaded {ncu_range.num_actions()} kernels from {ncu_input_file}")

        # Bypass PKS entirely when the workload has 20 or fewer kernels,
        # since clustering such a small population is not worthwhile
        self.bypass = ncu_range.num_actions() <= 20

        # Create initial dataframe with kernel information
        kernel_ids = range(ncu_range.num_actions())
        kernel_names = [action.name() for action in ncu_range]

        data = {
            "Kernel Name": kernel_names,
            "Kernel ID": list(kernel_ids),
        }

        # Add metrics data to dataframe
        for metric in self.metrics:
            data[metric] = [action[metric].value() for action in ncu_range]

        self.ncu_data = pd.DataFrame(data)

        # Initialize kernel_df with basic kernel information
        self.kernel_df = self.ncu_data[["Kernel ID", "Kernel Name"]].copy()

        self.sum_lts_t_sectors_op_write = self.ncu_data["lts__t_sectors_op_write.sum"].sum()

    def pca(self, data: pd.DataFrame, var_threshold: float = 0.95) -> Tuple[pd.DataFrame, np.ndarray]:
        """Perform Principal Component Analysis on kernel metrics.

        Args:
            data (pd.DataFrame): DataFrame containing kernel metrics.
            var_threshold (float): Variance threshold for PCA dimensionality reduction. Defaults to 0.95.

        Returns:
            Tuple[pd.DataFrame, np.ndarray]: A tuple containing:
                - pd.DataFrame: DataFrame with transformed data, columns named 'PC0', 'PC1', etc.
                - np.ndarray: Raw numpy array of the transformed data.
        """
        # Create a copy to avoid modifying original data
        data_copy = data.copy()

        # Remove non-metric columns
        data_copy.drop(columns=["Kernel Name", "Kernel ID"], inplace=True)

        # Standardize the data
        data_copy = StandardScaler().fit_transform(data_copy)

        # Apply PCA
        pca_model = PCA(n_components=var_threshold)
        transformed_data = pca_model.fit_transform(data_copy)

        # Create a dataframe with the transformed data
        transformed_df = pd.DataFrame(
            transformed_data,
            columns=[f"PC{i}" for i in range(pca_model.n_components_)]
        )

        return transformed_df, transformed_data

    def kmeans(self, data: pd.DataFrame, n_clusters: int = 3) -> Tuple[np.ndarray, np.ndarray, float]:
        """Perform K-means clustering with a specified number of clusters.

        Calculates cluster assignments, centroids, and a custom score based on the
        difference between the sum of 'lts__t_sectors_op_write.sum' for centroid-representative
        kernels (weighted by cluster size) and the total sum for all kernels.

        Args:
            data (pd.DataFrame): Data to cluster (typically PCA-transformed).
            n_clusters (int): Number of clusters to form. Defaults to 3.

        Returns:
            Tuple[np.ndarray, np.ndarray, float]: A tuple containing:
                - np.ndarray: Array of cluster labels for each data point.
                - np.ndarray: Array of cluster centers (centroids).
                - float: Custom score representing the relative error in 'lts__t_sectors_op_write.sum'.
        """
        # Apply K-means clustering
        kmeans_result = KMeans(
            n_clusters=n_clusters,
            random_state=12,  # For reproducibility
            n_init="auto"  # Use default initialization
        ).fit(data)

        labels = kmeans_result.labels_
        centers = kmeans_result.cluster_centers_

        cluster_lts_t_sectors_op_write = 0

        for i, center in enumerate(centers):
            # Number of kernels assigned to this cluster
            num_kernels = np.sum(labels == i)
            # Indices of the points in this cluster
            indices = np.where(labels == i)[0]
            # Find the kernel closest to the cluster center in PCA space
            min_distance = np.inf
            closest_index = None
            for index in indices:
                # get the principal component values for the point
                point = self.pca_df.iloc[index].values
                # calculate the distance from the point to the center of the cluster
                distance = np.linalg.norm(point - center)
                if distance < min_distance:
                    min_distance = distance
                    closest_index = index
            # get the original data for the closest point
            original_data = self.ncu_data.iloc[closest_index]
            # get lts__t_sectors_op_write.sum
            lts_t_sectors_op_write = original_data["lts__t_sectors_op_write.sum"]
            cluster_lts_t_sectors_op_write += lts_t_sectors_op_write * num_kernels

        # Calculate custom score based on write count difference
        score = np.abs(cluster_lts_t_sectors_op_write -
                       self.sum_lts_t_sectors_op_write) / self.sum_lts_t_sectors_op_write

        # Return labels, centers, and the custom score
        return labels, centers, score

    def kmeans_scan(self, data: pd.DataFrame, lower_bound: int = 2, upper_bound: int = 20) -> None:
        """Find optimal number of clusters by scanning a range of values.

        Tries K-means with different numbers of clusters (from lower_bound to upper_bound).
        Selects the number of clusters corresponding to the lowest custom score calculated by `kmeans`.
        If multiple cluster counts yield scores within 5% of the minimum, the one with the
        fewest clusters is chosen. Updates class attributes `cluster_count`, `kernel_df`,
        and `cluster_df` with the results of the best clustering.

        Args:
            data (pd.DataFrame): Data to cluster (typically PCA-transformed).
            lower_bound (int): Minimum number of clusters to try. Defaults to 2.
            upper_bound (int): Maximum number of clusters to try. Defaults to 20.

        Returns:
            None: Updates class attributes with clustering results.
        """
        scores = []
        centers_list = []
        kmeans_labels_list = []

        # Try different numbers of clusters
        for i in range(lower_bound, upper_bound + 1):
            labels, centers, score = self.kmeans(data, i)
            print(f"Number of clusters: {i}, Write count error: {score}")
            scores.append(score)
            centers_list.append(centers)
            kmeans_labels_list.append(labels)

        # Find minimum score
        min_score = min(scores)

        # Use the first clustering within 5% of minimum score
        for i, score in enumerate(scores):
            if score <= 1.05 * min_score:
                self.cluster_count = i + lower_bound

                # Update kernel_df with cluster assignments
                self.kernel_df["Cluster ID"] = kmeans_labels_list[i]

                # Create cluster_df with centers and counts
                centers = centers_list[i]
                cluster_ids = range(self.cluster_count)
                counts = np.bincount(kmeans_labels_list[i])

                # Create dataframe with cluster details
                self.cluster_df = pd.DataFrame({
                    "Cluster ID": cluster_ids,
                    "Kernel Count": counts
                })

                # Ensure integer data types
                self.cluster_df["Cluster ID"] = pd.to_numeric(
                    self.cluster_df["Cluster ID"], errors='coerce').fillna(-1).astype(int)
                self.cluster_df["Kernel Count"] = pd.to_numeric(
                    self.cluster_df["Kernel Count"], errors='coerce').fillna(-1).astype(int)

                # Add center coordinates as separate columns
                for j in range(centers.shape[1]):
                    self.cluster_df[f"Center_PC{j}"] = [
                        centers[k, j] for k in range(self.cluster_count)]
                break

    def select_centroid(self) -> None:
        """Select representative kernels by finding points closest to cluster centers.

        For each cluster identified by K-means, this method finds the kernel
        (data point) in the PCA space that is closest to the cluster's center
        (centroid) using Euclidean distance. It marks this kernel as the
        representative for that cluster. Updates both `kernel_df` and `cluster_df`
        with centroid information (Kernel ID and Name).

        Returns:
            None: Updates `kernel_df` and `cluster_df` with centroid information.
        """
        # For each cluster, find the kernel closest to the center
        centroid_kernel_ids = []

        for cluster_id in range(self.cluster_count):
            # Get indices of kernels in this cluster
            cluster_mask = self.kernel_df["Cluster ID"] == cluster_id
            cluster_kernel_indices = self.kernel_df.index[cluster_mask].tolist()

            # Handle empty clusters
            if not cluster_kernel_indices:
                print(f"Warning: No kernels found in cluster {cluster_id}")
                centroid_kernel_ids.append(None)
                continue

            # Get PCA coordinates for kernels in this cluster
            cluster_data = self.pca_df.iloc[cluster_kernel_indices]

            # Get center coordinates for this cluster
            center_coords = []
            for j in range(self.pca_df.shape[1]):
                center_coords.append(
                    self.cluster_df.loc[cluster_id, f"Center_PC{j}"])

            # Calculate Euclidean distances to center
            distances = np.linalg.norm(
                cluster_data.values - center_coords, axis=1)

            # Find the kernel closest to the center
            closest_idx = cluster_kernel_indices[np.argmin(distances)]
            kernel_id = int(self.kernel_df.loc[closest_idx, "Kernel ID"])
            centroid_kernel_ids.append(kernel_id)

            # Add centroid information to cluster_df
            self.cluster_df.at[cluster_id, "Centroid Kernel ID"] = kernel_id
            self.cluster_df.at[cluster_id, "Centroid Kernel Name"] = \
                self.kernel_df.loc[closest_idx, "Kernel Name"]

        # Update kernel_df with centroid assignments
        for idx, row in self.kernel_df.iterrows():
            cluster_id = row["Cluster ID"]
            centroid_id = self.cluster_df.loc[cluster_id, "Centroid Kernel ID"]
            self.kernel_df.at[idx, "Centroid Kernel ID"] = centroid_id

        # Ensure integer data types throughout
        self.kernel_df["Cluster ID"] = self.kernel_df["Cluster ID"].astype(int)
        self.kernel_df["Centroid Kernel ID"] = pd.to_numeric(
            self.kernel_df["Centroid Kernel ID"], errors='coerce').fillna(-1).astype(int)
        self.kernel_df["Kernel ID"] = self.kernel_df["Kernel ID"].astype(int)
        self.cluster_df["Centroid Kernel ID"] = pd.to_numeric(
            self.cluster_df["Centroid Kernel ID"], errors='coerce').fillna(-1).astype(int)

    def generate_kernelslist(self, delete: bool = False) -> None:
        """Generate a list file containing the selected representative kernels.

        Creates a `kernelslist.g` file in the `output_dir`. This file lists the
        trace file names (e.g., `kernel-1.traceg`, `kernel-5.traceg`) corresponding
        to the selected centroid kernels. Optionally, deletes trace files in the
        `output_dir` that do not correspond to the selected centroids.

        Args:
            delete (bool): If True, deletes trace files (`*.traceg`) in the `output_dir`
                           that are not selected as centroids. Defaults to False.

        Returns:
            None: Writes `kernelslist.g` to the output directory and potentially deletes files.
        """
        # Get unique centroid kernel IDs (already 1-indexed from run_pks)
        centroid_ids = self.cluster_df["Centroid Kernel ID"].unique()
        centroid_ids = sorted([cid for cid in centroid_ids if cid >= 0])

        # Write kernelslist.g file
        with open(os.path.join(self.output_dir, "kernelslist.g"), "w") as f:
            for index in centroid_ids:
                f.write(f"kernel-{index}.traceg\n")

        # Optionally delete non-centroid trace files
        if delete:
            # iterate over all files in the output directory
            for filename in os.listdir(self.output_dir):
                # check if the file is a trace file and not in the centroid list
                if filename.startswith("kernel-") and (filename.endswith(".traceg") or filename.endswith(".traceg.xz")):
                    kernel_id = int(filename.split("-")[1].split(".")[0])
                    if kernel_id not in centroid_ids:
                        os.remove(os.path.join(self.output_dir, filename))
                        print(f"Deleted {filename}")

    def generate_kernels_csv(self) -> None:
        """Generate CSV files containing the kernel and cluster data.

        Creates two CSV files in the `output_dir`:
        - `kernels.csv`: Contains information about all kernels, including their
          original ID, name, assigned cluster ID, and the ID of the centroid
          representing their cluster.
        - `clusters.csv`: Contains information about each cluster, including its ID,
          the number of kernels it contains, and the ID and name of its
          representative centroid kernel.

        Returns:
            None: Writes `kernels.csv` and `clusters.csv` to the output directory.
        """
        # Export kernel_df to CSV
        self.kernel_df.to_csv(os.path.join(
            self.output_dir, "kernels.csv"), index=False)

        # Export cluster_df for reference (without PCA center coordinates)
        cluster_cols = ["Cluster ID", "Kernel Count",
                        "Centroid Kernel ID", "Centroid Kernel Name"]
        cluster_cols_available = [
            col for col in cluster_cols if col in self.cluster_df.columns]
        cluster_df_to_export = self.cluster_df[cluster_cols_available].copy()

        cluster_df_to_export.to_csv(os.path.join(
            self.output_dir, "clusters.csv"), index=False)

    def run_pks(self, output_file: str = None, delete: bool = False) -> None:
        """Execute the complete Principal Kernel Selection workflow.

        Performs PCA on the kernel metrics, finds the optimal number of clusters
        using K-means scanning, selects representative centroid kernels for each cluster,
        and generates output files (`kernelslist.g`, `kernels.csv`, `clusters.csv`)
        summarizing the results. If the number of kernels is small (<= 20),
        it bypasses the PCA and clustering steps, treating each kernel as its own cluster.

        Args:
            output_file (str, optional): Path to the output `kernelslist.g` file.
                If None, defaults to `kernelslist.g` within the `output_dir`.
                The directory of this file will be used as the output directory
                for other generated files (`.csv`). Defaults to None.
            delete (bool): If True, deletes non-centroid trace files during the
                           `generate_kernelslist` step. Defaults to False.

        Returns:
            None: Generates output files with the results of the analysis.
        """

        if not self.bypass:
            # Apply PCA to reduce dimensionality
            self.pca_df, _ = self.pca(self.ncu_data)
            print(f"PCA shape: {self.pca_df.shape}")

            # Find optimal number of clusters
            self.kmeans_scan(self.pca_df)
            print(f"Optimal number of clusters: {self.cluster_count}")

            # Select representative kernels
            self.select_centroid()

        else:
            # Copy Kernel ID as Centroid Kernel ID for bypass case
            self.kernel_df["Centroid Kernel ID"] = self.kernel_df["Kernel ID"]
            # Copy Kernel ID as Cluster ID for bypass case
            self.kernel_df["Cluster ID"] = self.kernel_df["Kernel ID"]
            # Create cluster_df with all kernels as clusters
            self.cluster_df = pd.DataFrame({
                "Cluster ID": self.kernel_df["Kernel ID"],
                "Kernel Count": 1,
                "Centroid Kernel ID": self.kernel_df["Kernel ID"],
                "Centroid Kernel Name": self.kernel_df["Kernel Name"]
            })

        # Adjust for 1-based indexing used by trace files
        self.kernel_df["Kernel ID"] = \
            self.kernel_df["Kernel ID"].astype(int) + 1
        self.kernel_df["Centroid Kernel ID"] = \
            self.kernel_df["Centroid Kernel ID"].astype(int) + 1
        self.cluster_df["Centroid Kernel ID"] = \
            self.cluster_df["Centroid Kernel ID"].astype(int) + 1

        # Generate output files
        if output_file:
            output_dir = os.path.dirname(output_file)
            if output_dir and not os.path.exists(output_dir):
                os.makedirs(output_dir)
            self.output_dir = output_dir if output_dir else self.output_dir

        self.generate_kernelslist(delete=delete)
        self.generate_kernels_csv()

        # Display results summary
        print(
            f"Generated kernelslist.g with {len(self.cluster_df)} representative kernels")
        print(f"Selected {len(self.cluster_df)} out of {len(self.kernel_df)} kernels "
              f"({len(self.cluster_df)/len(self.kernel_df):.1%})")

__init__(ncu_input_file, output_dir=None)

Initialize the Principal Kernel Selector with an NCU report.

Parameters:

Name Type Description Default
ncu_input_file str

Path to the NCU report file to analyze.

required
output_dir str

Directory for saving output files. Should be the directory containing trace files (kernel-1.traceg, etc.). Defaults to the directory of this script.

None

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If the input file contains fewer than 2 kernels (unless bypass is triggered). |
| `FileNotFoundError` | If the `metrics_list.json` file is not found. |
| `ImportError` | If the `ncu_report` module cannot be imported. |

Source code in python-scripts/pks.py
def __init__(self, ncu_input_file: str, output_dir: str = None):
    """Initialize the Principal Kernel Selector with an NCU report.

    Args:
        ncu_input_file (str): Path to the NCU report file to analyze.
        output_dir (str, optional): Directory for saving output files. Should be the directory
            containing trace files (kernel-1.traceg, etc.). Defaults to the
            directory of this script.

    Raises:
        ValueError: If the input file contains fewer than 2 kernels (unless bypass is triggered).
        FileNotFoundError: If the metrics_list.json file is not found.
        ImportError: If the ncu_report module cannot be imported.
    """
    # Class member variables for main-function state
    self.metrics = None
    self.ncu_data = None
    self.pca_df = None

    # Path to output directory for generated files
    # Defaults to path of this script if not provided or invalid
    self.output_dir = output_dir if output_dir and os.path.exists(output_dir) \
        else os.path.dirname(os.path.realpath(__file__))

    # Number of clusters
    self.cluster_count = None

    # DataFrame containing kernel details and cluster assignments
    self.kernel_df = None

    # DataFrame containing cluster details and centroids
    self.cluster_df = None

    # Load metrics from configuration file
    metrics_path = os.path.join(os.path.dirname(
        os.path.realpath(__file__)), "metrics_list.json")
    with open(metrics_path, "r") as metrics_file:
        metrics = json.load(metrics_file)

    # Flatten metrics dictionary into a list
    concatenated_metrics = []
    for key, value in metrics.items():
        concatenated_metrics.extend(value)
    self.metrics = concatenated_metrics

    # Load NCU report data
    ncu_context = ncu_report.load_report(ncu_input_file)
    ncu_range = ncu_context[0]  # Use first range in report

    if ncu_context.num_ranges() > 1:
        print("Warning: Multiple ranges found in the input file. "
              "Using the first range.")

    print(
        f"Loaded {ncu_range.num_actions()} kernels from {ncu_input_file}")

    # With 20 or fewer kernels, bypass PCA/clustering and
    # simulate every kernel directly as its own cluster
    self.bypass = ncu_range.num_actions() <= 20

    # Create initial dataframe with kernel information
    kernel_ids = range(ncu_range.num_actions())
    kernel_names = [action.name() for action in ncu_range]

    data = {
        "Kernel Name": kernel_names,
        "Kernel ID": list(kernel_ids),
    }

    # Add metrics data to dataframe
    for metric in self.metrics:
        data[metric] = [action[metric].value() for action in ncu_range]

    self.ncu_data = pd.DataFrame(data)

    # Initialize kernel_df with basic kernel information
    self.kernel_df = self.ncu_data[["Kernel ID", "Kernel Name"]].copy()

    self.sum_lts_t_sectors_op_write = self.ncu_data["lts__t_sectors_op_write.sum"].sum()
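The metrics-flattening step in the constructor can be illustrated in isolation. A minimal sketch, with hypothetical metric groups standing in for the real `metrics_list.json` contents:

```python
import io
import json

# Hypothetical metric groups; the real names come from metrics_list.json
metrics_file = io.StringIO(json.dumps({
    "memory": ["lts__t_sectors_op_write.sum", "lts__t_sectors_op_read.sum"],
    "compute": ["sm__cycles_active.avg"],
}))
metrics = json.load(metrics_file)

# Flatten the {category: [metric, ...]} dictionary into one list,
# exactly as __init__ does before querying the NCU report
concatenated_metrics = []
for key, value in metrics.items():
    concatenated_metrics.extend(value)

print(concatenated_metrics)
```

The grouping in the JSON file is purely organizational; only the flat list is used when reading metric values from each kernel action.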

generate_kernels_csv()

Generate CSV files containing the kernel and cluster data.

Creates two CSV files in the `output_dir`:

- `kernels.csv`: Contains information about all kernels, including their original ID, name, assigned cluster ID, and the ID of the centroid representing their cluster.
- `clusters.csv`: Contains information about each cluster, including its ID, the number of kernels it contains, and the ID and name of its representative centroid kernel.

Returns:

| Type | Description |
| --- | --- |
| `None` | Writes `kernels.csv` and `clusters.csv` to the output directory. |

Source code in python-scripts/pks.py
def generate_kernels_csv(self) -> None:
    """Generate CSV files containing the kernel and cluster data.

    Creates two CSV files in the `output_dir`:
    - `kernels.csv`: Contains information about all kernels, including their
      original ID, name, assigned cluster ID, and the ID of the centroid
      representing their cluster.
    - `clusters.csv`: Contains information about each cluster, including its ID,
      the number of kernels it contains, and the ID and name of its
      representative centroid kernel.

    Returns:
        None: Writes `kernels.csv` and `clusters.csv` to the output directory.
    """
    # Export kernel_df to CSV
    self.kernel_df.to_csv(os.path.join(
        self.output_dir, "kernels.csv"), index=False)

    # Export cluster_df for reference (without PCA center coordinates)
    cluster_cols = ["Cluster ID", "Kernel Count",
                    "Centroid Kernel ID", "Centroid Kernel Name"]
    cluster_cols_available = [
        col for col in cluster_cols if col in self.cluster_df.columns]
    cluster_df_to_export = self.cluster_df[cluster_cols_available].copy()

    cluster_df_to_export.to_csv(os.path.join(
        self.output_dir, "clusters.csv"), index=False)

generate_kernelslist(delete=False)

Generate a list file containing the selected representative kernels.

Creates a kernelslist.g file in the output_dir. This file lists the trace file names (e.g., kernel-1.traceg, kernel-5.traceg) corresponding to the selected centroid kernels. Optionally, deletes trace files in the output_dir that do not correspond to the selected centroids.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `delete` | `bool` | If True, deletes trace files (`*.traceg`) in the `output_dir` that are not selected as centroids. | `False` |

Returns:

| Type | Description |
| --- | --- |
| `None` | Writes `kernelslist.g` to the output directory and potentially deletes files. |

Source code in python-scripts/pks.py
def generate_kernelslist(self, delete: bool = False) -> None:
    """Generate a list file containing the selected representative kernels.

    Creates a `kernelslist.g` file in the `output_dir`. This file lists the
    trace file names (e.g., `kernel-1.traceg`, `kernel-5.traceg`) corresponding
    to the selected centroid kernels. Optionally, deletes trace files in the
    `output_dir` that do not correspond to the selected centroids.

    Args:
        delete (bool): If True, deletes trace files (`*.traceg`) in the `output_dir`
                       that are not selected as centroids. Defaults to False.

    Returns:
        None: Writes `kernelslist.g` to the output directory and potentially deletes files.
    """
    # Get unique centroid kernel IDs (already 1-indexed from run_pks)
    centroid_ids = self.cluster_df["Centroid Kernel ID"].unique()
    centroid_ids = sorted([cid for cid in centroid_ids if cid >= 0])

    # Write kernelslist.g file
    with open(os.path.join(self.output_dir, "kernelslist.g"), "w") as f:
        for index in centroid_ids:
            f.write(f"kernel-{index}.traceg\n")

    # Optionally delete non-centroid trace files
    if delete:
        # iterate over all files in the output directory
        for filename in os.listdir(self.output_dir):
            # check if the file is a trace file and not in the centroid list
            if filename.startswith("kernel-") and (filename.endswith(".traceg") or filename.endswith(".traceg.xz")):
                kernel_id = int(filename.split("-")[1].split(".")[0])
                if kernel_id not in centroid_ids:
                    os.remove(os.path.join(self.output_dir, filename))
                    print(f"Deleted {filename}")
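The deletion path parses the kernel ID out of each trace file name before deciding whether to keep it. A minimal sketch of that parsing, using made-up file names and centroid IDs:

```python
# Hypothetical 1-based centroid kernel IDs kept by the selection
centroid_ids = [1, 5]

def trace_kernel_id(filename: str) -> int:
    # "kernel-12.traceg" and "kernel-12.traceg.xz" both map to 12
    return int(filename.split("-")[1].split(".")[0])

files = ["kernel-1.traceg", "kernel-2.traceg.xz", "kernel-5.traceg"]
to_delete = [f for f in files if trace_kernel_id(f) not in centroid_ids]
print(to_delete)
```

The real method additionally checks the `kernel-` prefix and `.traceg`/`.traceg.xz` suffixes so that `kernelslist.g` itself and unrelated files are never touched.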

kmeans(data, n_clusters=3)

Perform K-means clustering with a specified number of clusters.

Calculates cluster assignments, centroids, and a custom score based on the difference between the sum of 'lts__t_sectors_op_write.sum' for centroid-representative kernels (weighted by cluster size) and the total sum for all kernels.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data` | `DataFrame` | Data to cluster (typically PCA-transformed). | *required* |
| `n_clusters` | `int` | Number of clusters to form. | `3` |

Returns:

| Type | Description |
| --- | --- |
| `Tuple[ndarray, ndarray, float]` | A tuple containing the cluster labels for each data point, the cluster centers (centroids), and the custom score representing the relative error in `lts__t_sectors_op_write.sum`. |

Source code in python-scripts/pks.py
def kmeans(self, data: pd.DataFrame, n_clusters: int = 3) -> Tuple[np.ndarray, np.ndarray, float]:
    """Perform K-means clustering with a specified number of clusters.

    Calculates cluster assignments, centroids, and a custom score based on the
    difference between the sum of 'lts__t_sectors_op_write.sum' for centroid-representative
    kernels (weighted by cluster size) and the total sum for all kernels.

    Args:
        data (pd.DataFrame): Data to cluster (typically PCA-transformed).
        n_clusters (int): Number of clusters to form. Defaults to 3.

    Returns:
        Tuple[np.ndarray, np.ndarray, float]: A tuple containing:
            - np.ndarray: Array of cluster labels for each data point.
            - np.ndarray: Array of cluster centers (centroids).
            - float: Custom score representing the relative error in 'lts__t_sectors_op_write.sum'.
    """
    # Apply K-means clustering
    kmeans_result = KMeans(
        n_clusters=n_clusters,
        random_state=12,  # For reproducibility
        n_init="auto"  # Let scikit-learn choose the number of initializations
    ).fit(data)

    labels = kmeans_result.labels_
    centers = kmeans_result.cluster_centers_

    cluster_lts_t_sectors_op_write = 0

    for i, center in enumerate(centers):
        closest_cluster = i
        # get number of kernels in the cluster
        num_kernels = np.sum(labels == closest_cluster)
        # get the indices of the points in the closest cluster
        indices = np.where(labels == closest_cluster)[0]
        min_distance = np.inf
        closest_index = None
        for index in indices:
            # get the principal component values for the point
            point = self.pca_df.iloc[index].values
            # calculate the distance from the point to the cluster center
            distance = np.linalg.norm(point - center)
            if distance < min_distance:
                min_distance = distance
                closest_index = index
        # get the original data for the closest point
        original_data = self.ncu_data.iloc[closest_index]
        # get lts__t_sectors_op_write.sum
        lts_t_sectors_op_write = original_data["lts__t_sectors_op_write.sum"]
        cluster_lts_t_sectors_op_write += lts_t_sectors_op_write * num_kernels

    # Calculate custom score based on write count difference
    score = np.abs(cluster_lts_t_sectors_op_write -
                   self.sum_lts_t_sectors_op_write) / self.sum_lts_t_sectors_op_write

    # Return labels, centers, and the custom score
    return labels, centers, score
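The custom score above can be computed without running K-means at all, given labels and a representative kernel per cluster. A minimal sketch with synthetic write counts (all values hypothetical):

```python
import numpy as np

# Hypothetical per-kernel lts__t_sectors_op_write.sum values
writes = np.array([100.0, 110.0, 500.0, 520.0])
# Cluster assignment per kernel
labels = np.array([0, 0, 1, 1])
# Index of the representative kernel for each cluster
rep_index = {0: 0, 1: 2}

# Representative write count, weighted by cluster size, vs. the true total
total = writes.sum()
estimated = sum(writes[rep_index[c]] * np.sum(labels == c) for c in rep_index)
score = abs(estimated - total) / total
print(score)
```

A low score means the representatives, each weighted by its cluster's kernel count, reproduce the program's total L2 write traffic well, which is the property the scan optimizes for.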

kmeans_scan(data, lower_bound=2, upper_bound=20)

Find optimal number of clusters by scanning a range of values.

Tries K-means with different numbers of clusters (from lower_bound to upper_bound). Selects the number of clusters corresponding to the lowest custom score calculated by kmeans. If multiple cluster counts yield scores within 5% of the minimum, the one with the fewest clusters is chosen. Updates class attributes cluster_count, kernel_df, and cluster_df with the results of the best clustering.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data` | `DataFrame` | Data to cluster (typically PCA-transformed). | *required* |
| `lower_bound` | `int` | Minimum number of clusters to try. | `2` |
| `upper_bound` | `int` | Maximum number of clusters to try. | `20` |

Returns:

| Type | Description |
| --- | --- |
| `None` | Updates class attributes with clustering results. |

Source code in python-scripts/pks.py
def kmeans_scan(self, data: pd.DataFrame, lower_bound: int = 2, upper_bound: int = 20) -> None:
    """Find optimal number of clusters by scanning a range of values.

    Tries K-means with different numbers of clusters (from lower_bound to upper_bound).
    Selects the number of clusters corresponding to the lowest custom score calculated by `kmeans`.
    If multiple cluster counts yield scores within 5% of the minimum, the one with the
    fewest clusters is chosen. Updates class attributes `cluster_count`, `kernel_df`,
    and `cluster_df` with the results of the best clustering.

    Args:
        data (pd.DataFrame): Data to cluster (typically PCA-transformed).
        lower_bound (int): Minimum number of clusters to try. Defaults to 2.
        upper_bound (int): Maximum number of clusters to try. Defaults to 20.

    Returns:
        None: Updates class attributes with clustering results.
    """
    scores = []
    centers_list = []
    kmeans_labels_list = []

    # Try different numbers of clusters
    for i in range(lower_bound, upper_bound + 1):
        labels, centers, score = self.kmeans(data, i)
        print(f"Number of clusters: {i}, Write count error: {score}")
        scores.append(score)
        centers_list.append(centers)
        kmeans_labels_list.append(labels)

    # Find minimum score
    min_score = min(scores)

    # Use the first clustering within 5% of minimum score
    for i, score in enumerate(scores):
        if score <= 1.05 * min_score:
            self.cluster_count = i + lower_bound

            # Update kernel_df with cluster assignments
            self.kernel_df["Cluster ID"] = kmeans_labels_list[i]

            # Create cluster_df with centers and counts
            centers = centers_list[i]
            cluster_ids = range(self.cluster_count)
            counts = np.bincount(kmeans_labels_list[i])

            # Create dataframe with cluster details
            self.cluster_df = pd.DataFrame({
                "Cluster ID": cluster_ids,
                "Kernel Count": counts
            })

            # Ensure integer data types
            self.cluster_df["Cluster ID"] = pd.to_numeric(
                self.cluster_df["Cluster ID"], errors='coerce').fillna(-1).astype(int)
            self.cluster_df["Kernel Count"] = pd.to_numeric(
                self.cluster_df["Kernel Count"], errors='coerce').fillna(-1).astype(int)

            # Add center coordinates as separate columns
            for j in range(centers.shape[1]):
                self.cluster_df[f"Center_PC{j}"] = [
                    centers[k, j] for k in range(self.cluster_count)]
            break
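The within-5% selection rule favors fewer clusters over a marginally better score. A minimal sketch with hypothetical scores for k = 2..6:

```python
lower_bound = 2
# Hypothetical write-count error scores for k = 2, 3, 4, 5, 6
scores = [0.30, 0.12, 0.093, 0.09, 0.091]

# Pick the smallest k whose score is within 5% of the best score
min_score = min(scores)
for i, score in enumerate(scores):
    if score <= 1.05 * min_score:
        cluster_count = i + lower_bound
        break

print(cluster_count)
```

Here the minimum score occurs at k = 5, but k = 4 is already within 5% of it, so the scan settles on 4 clusters and thus fewer kernels to simulate.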

pca(data, var_threshold=0.95)

Perform Principal Component Analysis on kernel metrics.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data` | `DataFrame` | DataFrame containing kernel metrics. | *required* |
| `var_threshold` | `float` | Variance threshold for PCA dimensionality reduction. | `0.95` |

Returns:

| Type | Description |
| --- | --- |
| `Tuple[DataFrame, ndarray]` | A tuple containing a DataFrame with the transformed data (columns named `PC0`, `PC1`, etc.) and the raw NumPy array of the transformed data. |

Source code in python-scripts/pks.py
def pca(self, data: pd.DataFrame, var_threshold: float = 0.95) -> Tuple[pd.DataFrame, np.ndarray]:
    """Perform Principal Component Analysis on kernel metrics.

    Args:
        data (pd.DataFrame): DataFrame containing kernel metrics.
        var_threshold (float): Variance threshold for PCA dimensionality reduction. Defaults to 0.95.

    Returns:
        Tuple[pd.DataFrame, np.ndarray]: A tuple containing:
            - pd.DataFrame: DataFrame with transformed data, columns named 'PC0', 'PC1', etc.
            - np.ndarray: Raw numpy array of the transformed data.
    """
    # Create a copy to avoid modifying original data
    data_copy = data.copy()

    # Remove non-metric columns
    data_copy.drop(columns=["Kernel Name", "Kernel ID"], inplace=True)

    # Standardize the data
    data_copy = StandardScaler().fit_transform(data_copy)

    # Apply PCA
    pca_model = PCA(n_components=var_threshold)
    transformed_data = pca_model.fit_transform(data_copy)

    # Create a dataframe with the transformed data
    transformed_df = pd.DataFrame(
        transformed_data,
        columns=[f"PC{i}" for i in range(pca_model.n_components_)]
    )

    return transformed_df, transformed_data
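Passing a float in (0, 1) as `n_components` makes scikit-learn keep the fewest components whose cumulative explained variance exceeds that threshold. A minimal sketch with random data standing in for the NCU metrics:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Synthetic "metrics" matrix: 50 kernels x 10 metrics, with one
# redundant column so PCA has something to discard
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
X[:, 1] = X[:, 0] * 2.0

# Standardize, then reduce to the components explaining >= 95% variance
X_std = StandardScaler().fit_transform(X)
pca_model = PCA(n_components=0.95)
transformed = pca_model.fit_transform(X_std)

print(transformed.shape, pca_model.explained_variance_ratio_.sum())
```

Standardizing first matters here: the raw NCU metrics span wildly different scales (cycle counts vs. ratios), and without `StandardScaler` the largest-magnitude metrics would dominate the principal components.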

run_pks(output_file=None, delete=False)

Execute the complete Principal Kernel Selection workflow.

Performs PCA on the kernel metrics, finds the optimal number of clusters using K-means scanning, selects representative centroid kernels for each cluster, and generates output files (kernelslist.g, kernels.csv, clusters.csv) summarizing the results. If the number of kernels is small (<= 20), it bypasses the PCA and clustering steps, treating each kernel as its own cluster.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `output_file` | `str` | Path to the output `kernelslist.g` file. If None, defaults to `kernelslist.g` within the `output_dir`. The directory of this file will be used as the output directory for other generated files (`.csv`). | `None` |
| `delete` | `bool` | If True, deletes non-centroid trace files during the `generate_kernelslist` step. | `False` |

Returns:

| Type | Description |
| --- | --- |
| `None` | Generates output files with the results of the analysis. |

Source code in python-scripts/pks.py
def run_pks(self, output_file: str = None, delete: bool = False) -> None:
    """Execute the complete Principal Kernel Selection workflow.

    Performs PCA on the kernel metrics, finds the optimal number of clusters
    using K-means scanning, selects representative centroid kernels for each cluster,
    and generates output files (`kernelslist.g`, `kernels.csv`, `clusters.csv`)
    summarizing the results. If the number of kernels is small (<= 20),
    it bypasses the PCA and clustering steps, treating each kernel as its own cluster.

    Args:
        output_file (str, optional): Path to the output `kernelslist.g` file.
            If None, defaults to `kernelslist.g` within the `output_dir`.
            The directory of this file will be used as the output directory
            for other generated files (`.csv`). Defaults to None.
        delete (bool): If True, deletes non-centroid trace files during the
                       `generate_kernelslist` step. Defaults to False.

    Returns:
        None: Generates output files with the results of the analysis.
    """

    if not self.bypass:
        # Apply PCA to reduce dimensionality
        self.pca_df, _ = self.pca(self.ncu_data)
        print(f"PCA shape: {self.pca_df.shape}")

        # Find optimal number of clusters
        self.kmeans_scan(self.pca_df)
        print(f"Optimal number of clusters: {self.cluster_count}")

        # Select representative kernels
        self.select_centroid()

    else:
        # Copy Kernel ID as Centroid Kernel ID for bypass case
        self.kernel_df["Centroid Kernel ID"] = self.kernel_df["Kernel ID"]
        # Copy Kernel ID as Cluster ID for bypass case
        self.kernel_df["Cluster ID"] = self.kernel_df["Kernel ID"]
        # Create cluster_df with all kernels as clusters
        self.cluster_df = pd.DataFrame({
            "Cluster ID": self.kernel_df["Kernel ID"],
            "Kernel Count": 1,
            "Centroid Kernel ID": self.kernel_df["Kernel ID"],
            "Centroid Kernel Name": self.kernel_df["Kernel Name"]
        })

    # Adjust for 1-based indexing used by trace files
    self.kernel_df["Kernel ID"] = \
        self.kernel_df["Kernel ID"].astype(int) + 1
    self.kernel_df["Centroid Kernel ID"] = \
        self.kernel_df["Centroid Kernel ID"].astype(int) + 1
    self.cluster_df["Centroid Kernel ID"] = \
        self.cluster_df["Centroid Kernel ID"].astype(int) + 1

    # Generate output files
    if output_file:
        output_dir = os.path.dirname(output_file)
        if output_dir and not os.path.exists(output_dir):
            os.makedirs(output_dir)
        self.output_dir = output_dir if output_dir else self.output_dir

    self.generate_kernelslist(delete=delete)
    self.generate_kernels_csv()

    # Display results summary
    print(
        f"Generated kernelslist.g with {len(self.cluster_df)} representative kernels")
    print(f"Selected {len(self.cluster_df)} out of {len(self.kernel_df)} kernels "
          f"({len(self.cluster_df)/len(self.kernel_df):.1%})")
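The core of centroid selection is a nearest-point search in PCA space: a K-means center is generally not an actual kernel, so the closest member of the cluster stands in for it. A minimal sketch with synthetic coordinates:

```python
import numpy as np

# Hypothetical PCA coordinates of the kernels in one cluster
points = np.array([[0.0, 0.0], [1.0, 1.0], [0.2, 0.1]])
# Hypothetical K-means center for that cluster
center = np.array([0.1, 0.1])

# Euclidean distance from every member to the center; the nearest
# member becomes the cluster's representative kernel
distances = np.linalg.norm(points - center, axis=1)
closest = int(np.argmin(distances))
print(closest)
```

Only this representative kernel is simulated; its statistics are then scaled by the cluster's kernel count when estimating whole-program behavior.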

select_centroid()

Select representative kernels by finding points closest to cluster centers.

For each cluster identified by K-means, this method finds the kernel (data point) in the PCA space that is closest to the cluster's center (centroid) using Euclidean distance. It marks this kernel as the representative for that cluster. Updates both kernel_df and cluster_df with centroid information (Kernel ID and Name).

Returns:

| Type | Description |
| --- | --- |
| None | Updates `kernel_df` and `cluster_df` with centroid information. |

Source code in python-scripts/pks.py
def select_centroid(self) -> None:
    """Select representative kernels by finding points closest to cluster centers.

    For each cluster identified by K-means, this method finds the kernel
    (data point) in the PCA space that is closest to the cluster's center
    (centroid) using Euclidean distance. It marks this kernel as the
    representative for that cluster. Updates both `kernel_df` and `cluster_df`
    with centroid information (Kernel ID and Name).

    Returns:
        None: Updates `kernel_df` and `cluster_df` with centroid information.
    """
    # For each cluster, find the kernel closest to the center
    centroid_kernel_ids = []

    for cluster_id in range(self.cluster_count):
        # Get indices of kernels in this cluster
        cluster_mask = self.kernel_df["Cluster ID"] == cluster_id
        cluster_kernel_indices = self.kernel_df.index[cluster_mask].tolist()

        # Handle empty clusters
        if not cluster_kernel_indices:
            print(f"Warning: No kernels found in cluster {cluster_id}")
            centroid_kernel_ids.append(None)
            continue

        # Get PCA coordinates for kernels in this cluster
        cluster_data = self.pca_df.iloc[cluster_kernel_indices]

        # Get center coordinates for this cluster
        center_coords = []
        for j in range(self.pca_df.shape[1]):
            center_coords.append(
                self.cluster_df.loc[cluster_id, f"Center_PC{j}"])

        # Calculate Euclidean distances to center
        distances = np.linalg.norm(
            cluster_data.values - center_coords, axis=1)

        # Find the kernel closest to the center
        closest_idx = cluster_kernel_indices[np.argmin(distances)]
        kernel_id = int(self.kernel_df.loc[closest_idx, "Kernel ID"])
        centroid_kernel_ids.append(kernel_id)

        # Add centroid information to cluster_df
        self.cluster_df.at[cluster_id, "Centroid Kernel ID"] = kernel_id
        self.cluster_df.at[cluster_id, "Centroid Kernel Name"] = \
            self.kernel_df.loc[closest_idx, "Kernel Name"]

    # Update kernel_df with centroid assignments
    for idx, row in self.kernel_df.iterrows():
        cluster_id = row["Cluster ID"]
        centroid_id = self.cluster_df.loc[cluster_id, "Centroid Kernel ID"]
        self.kernel_df.at[idx, "Centroid Kernel ID"] = centroid_id

    # Ensure integer data types throughout
    self.kernel_df["Cluster ID"] = self.kernel_df["Cluster ID"].astype(int)
    self.kernel_df["Centroid Kernel ID"] = pd.to_numeric(
        self.kernel_df["Centroid Kernel ID"], errors='coerce').fillna(-1).astype(int)
    self.kernel_df["Kernel ID"] = self.kernel_df["Kernel ID"].astype(int)
    self.cluster_df["Centroid Kernel ID"] = pd.to_numeric(
        self.cluster_df["Centroid Kernel ID"], errors='coerce').fillna(-1).astype(int)
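The nearest-to-center selection at the heart of `select_centroid` can be sketched in isolation. The coordinates below are hypothetical stand-ins for the PCA space and a K-means cluster center:

```python
import numpy as np

# Hypothetical PCA coordinates for three kernels in one cluster (rows = kernels).
cluster_data = np.array([
    [0.0, 0.0],
    [1.0, 1.0],
    [0.2, 0.1],
])
# Hypothetical K-means center for this cluster.
center_coords = np.array([0.25, 0.15])

# Euclidean distance from each kernel to the cluster center,
# mirroring the np.linalg.norm call in select_centroid().
distances = np.linalg.norm(cluster_data - center_coords, axis=1)

# The representative ("centroid") kernel is the one closest to the center.
closest = int(np.argmin(distances))
print(closest)  # 2
```

Here the third kernel (index 2) is chosen because it lies closest to the center; in the real method, its `Kernel ID` would then be recorded in both `kernel_df` and `cluster_df`.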

Sampling Utilities: NVIDIA Nsight Compute Coarse-Grained Profiling

The two scripts nsight_nvbit.py and ncu_exec.py perform coarse-grained profiling of GPU kernels using NVIDIA Nsight Compute. The profiling results drive the sampling process, which selects the most representative kernels for simulation.

Class to run Nsight Compute and NVBit on a given program with the specified arguments.

Provides methods to execute NVIDIA Nsight Compute (ncu) and a custom NVBit tool for profiling GPU applications. It handles environment setup, command construction, and log file management.

The ability to run this script as a standalone program is deprecated. Please use ncu_exec.py instead.

NsightNVBitRunner

A runner class to execute Nsight Compute and NVBit profiling tools.

Manages the configuration and execution of Nsight Compute (ncu) and a custom NVBit tool (ncu-nvbit.so) on a specified target program. It sets up necessary environment variables, constructs command lines, runs the tools, and manages output log files.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `program` | str | Path to the executable program or interpreter. |
| `program_args` | List[str] | Arguments for the target program. |
| `log_file_name` | str | Base name for generated log files (e.g., program_timestamp). |
| `log_file_path` | str | Directory path where log files are stored. |
| `mangled` | bool | Flag indicating whether to use mangled kernel names (primarily for NVBit). |

Source code in python-scripts/nsight_nvbit.py
class NsightNVBitRunner:
    """
    A runner class to execute Nsight Compute and NVBit profiling tools.

    Manages the configuration and execution of Nsight Compute (ncu) and a custom
    NVBit tool (`ncu-nvbit.so`) on a specified target program. It sets up
    necessary environment variables, constructs command lines, runs the tools,
    and manages output log files.

    Attributes:
        program (str): Path to the executable program or interpreter.
        program_args (List[str]): Arguments for the target program.
        log_file_name (str): Base name for generated log files (e.g., program_timestamp).
        log_file_path (str): Directory path where log files are stored.
        mangled (bool): Flag indicating whether to use mangled kernel names (primarily for NVBit).
    """
    def __init__(self):
        """Initialize the NsightNVBitRunner with default None values."""
        # Initialize member variables that will hold the state from main.
        self.program = None
        self.program_args = None
        self.log_file_name = None
        self.log_file_path = None
        self.mangled = True

    def init_from_params(self, program_name, program_args, log_file_name, log_file_path, mangled=True):
        """
        Initialize the runner with explicitly provided parameters.

        Args:
            program_name (str): The program executable or interpreter name/path.
            program_args (List[str]): List of arguments for the program.
            log_file_name (str): Base name for log files.
            log_file_path (str): Directory path for log files.
            mangled (bool): Whether to use mangled kernel names. Defaults to True.

        Returns:
            None
        """
        # Initialize the class with the provided arguments.
        self.program = program_name
        self.program_args = program_args
        self.log_file_name = log_file_name
        self.log_file_path = log_file_path
        self.mangled = mangled 

    def init_from_program(self, program_name, program_args, mangled=True):
        """
        Initialize the runner based on program name and arguments.

        Determines the actual executable (handling Python scripts), generates
        log file names and paths based on the program name and timestamp, and
        prints a configuration summary.

        Args:
            program_name (str): The program executable or script name/path.
            program_args (List[str]): List of arguments for the program/script.
            mangled (bool): Whether to use mangled kernel names. Defaults to True.

        Returns:
            None
        """
        self.program = os.path.realpath(program_name)
        output_program_name = os.path.basename(program_name)
        self.program_args = program_args
        self.mangled = mangled
        if program_name == "python" or program_name.startswith("python3") or program_name == "torchrun":
            self.program = sys.executable
            output_program_name = os.path.basename(
                self.program_args[0]).replace(".py", "")
        if self.program.endswith(".py"):
            output_program_name = os.path.basename(
                self.program).replace(".py", "")
            self.program_args = [self.program] + self.program_args
            self.program = sys.executable

        # Create a log file to store the profiling results
        basename = os.path.basename(output_program_name)
        timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        self.log_file_name = f"{basename}_{timestamp}"
        self.log_file_path = os.path.join(
            os.getenv('PROJECT_ROOT', '.'), 'logs', basename)
        os.makedirs(self.log_file_path, exist_ok=True)

        # Print configuration summary
        print("*" * 27 + " Profiling Configuration " + "*" * 27)
        print(
            f"Running program: {self.program} with arguments: {' '.join(self.program_args)}")
        print(f"Log file name: {self.log_file_name}")
        print(f"Log file path: {self.log_file_path}")
        print("*" * 80 + "\n")

    def get_metrics(self):
        """
        Load and return the list of Nsight Compute metrics from a JSON file.

        Reads metrics specified in 'metrics_list.json' located in the same
        directory as this script. The JSON file should contain a dictionary where
        values are lists of metric names.

        Returns:
            str: A comma-separated string of all metric names found in the JSON file.

        Raises:
            FileNotFoundError: If 'metrics_list.json' is not found.
            ValueError: If 'metrics_list.json' is empty or contains no metrics.
            json.JSONDecodeError: If 'metrics_list.json' is not valid JSON.
        """
        metrics_path = os.path.join(
            os.path.dirname(__file__), "metrics_list.json")
        with open(metrics_path, 'r') as metrics_file:
            metrics = json.load(metrics_file)
        metrics_str = ""
        for key, value in metrics.items():
            metrics_str += ",".join(value) + ","
        if metrics_str == "":
            raise ValueError(
                "Error: No metrics found in the metrics_list.json file")
        return metrics_str[:-1]

    def run_nvbit(self, dry_run=True):
        """
        Run the custom NVBit tool (`ncu-nvbit.so`) on the target program.

        Sets up the environment (CUDA_INJECTION64_PATH, etc.), constructs the
        command, and executes the program under the NVBit tool. If `dry_run` is False,
        it captures the output to a `.nvbit` log file.

        Args:
            dry_run (bool): If True, print the command without executing it.
                            Defaults to True.

        Returns:
            str: The full path to the generated `.nvbit` output file.

        Raises:
            FileNotFoundError: If the `ncu-nvbit.so` library is not found and cannot be compiled.
            subprocess.CalledProcessError: If the NVBit execution fails (when dry_run=False).
        """
        nvbit_lib = os.path.join(
            os.getenv('PROJECT_ROOT', '.'),
            'backend', 'ncu-nvbit', 'ncu-nvbit.so')
        if not os.path.isfile(nvbit_lib):
            print("Compiling NVBit")
            subprocess.run(['make', '-C',
                            os.path.join(os.getenv('PROJECT_ROOT', '.'), 'backend', 'ncu-nvbit')])
        nvbit_output_file = os.path.join(
            self.log_file_path, f"{self.log_file_name}.nvbit")
        nvbit_env = os.environ.copy()
        nvbit_env_dict = {
            'NOBANNER': '1',
            'MANGLED_NAMES': '1' if self.mangled else '0',
            'CUDA_INJECTION64_PATH': nvbit_lib,
            'PATH': os.getenv('CUDA_INSTALL_PATH', '/usr/local/cuda') + '/bin:' + os.getenv('PATH', '')
        }
        for key, value in nvbit_env_dict.items():
            nvbit_env[key] = value
        nvbit_env_str = ' '.join(
            [f'{k}={v}' for k, v in nvbit_env_dict.items()])
        print(f"Running NVBit with the following command: \
              {nvbit_env_str} {self.program} {' '.join(self.program_args)}")
        if not dry_run:
            with open(nvbit_output_file, 'w') as log_file:
                subprocess.run([self.program] + self.program_args, stdout=log_file,
                               stderr=subprocess.STDOUT, env=nvbit_env)
            print(f"Check {nvbit_output_file} for the NVBit output")
        return nvbit_output_file

    def run_ncu(self, dry_run=True):
        """
        Run NVIDIA Nsight Compute (ncu) on the target program.

        Constructs the `ncu` command line with specified metrics (from `get_metrics`),
        configuration flags (e.g., `--force-overwrite`, `--replay-mode`), and output
        options (`--export`). Sets the `TMPDIR` environment variable. If `dry_run` is False,
        it executes `ncu` and captures its command-line output to a `.exec_ncu.log` file.
        The main report is saved to a `.ncu-rep` file.

        Args:
            dry_run (bool): If True, print the command without executing it.
                            Defaults to True.

        Returns:
            str: The full path to the generated `.ncu-rep` report file.

        Raises:
            FileNotFoundError: If the `ncu` executable is not found in the expected CUDA path.
            subprocess.CalledProcessError: If the `ncu` execution fails (when dry_run=False).
        """
        ncu = os.path.join(os.getenv('CUDA_INSTALL_PATH',
                           '/usr/local/cuda'), 'bin', 'ncu')
        if not os.path.isfile(ncu):
            raise FileNotFoundError(
                f"Error: Nsight Compute CLI not found at {ncu}")
        ncu_output_file = os.path.join(
            self.log_file_path, f"{self.log_file_name}.ncu-rep")
        ncu_log_file = os.path.join(
            self.log_file_path, f"{self.log_file_name}.exec_ncu.log")
        export_args = [
            '--rename-kernels-export', 'yes',
            '--rename-kernels-path', os.path.join(
                self.log_file_path, f"{self.log_file_name}.exec.kernels"),
            '--export', ncu_output_file
        ]
        ncu_args = [
            '--config-file', 'off',
            '--force-overwrite',
            '--clock-control', 'none',
            '--rename-kernels', 'off',
            '--replay-mode', 'kernel',
            '--launch-count', '1000'
        ]
        temp_dir = os.path.join(os.getenv('PROJECT_ROOT', '.'), '.tmp')
        os.makedirs(temp_dir, exist_ok=True)
        ncu_env = os.environ.copy()
        ncu_env['TMPDIR'] = temp_dir
        ncu_env['USER_DEFINED_FOLDERS'] = '1'
        cmd = [ncu] + ncu_args + ['--metrics', self.get_metrics()] + \
            export_args + [self.program] + self.program_args
        print(
            f"Running Nsight Compute with the following command: {' '.join(cmd)}")
        if not dry_run:
            with open(ncu_log_file, 'w') as log_file:
                subprocess.run(cmd, stdout=log_file,
                               stderr=subprocess.STDOUT, env=ncu_env)
            print(f"Check {ncu_output_file} for the Nsight Compute report")
        return ncu_output_file

__init__()

Initialize the NsightNVBitRunner with default None values.

Source code in python-scripts/nsight_nvbit.py
def __init__(self):
    """Initialize the NsightNVBitRunner with default None values."""
    # Initialize member variables that will hold the state from main.
    self.program = None
    self.program_args = None
    self.log_file_name = None
    self.log_file_path = None
    self.mangled = True

get_metrics()

Load and return the list of Nsight Compute metrics from a JSON file.

Reads metrics specified in 'metrics_list.json' located in the same directory as this script. The JSON file should contain a dictionary where values are lists of metric names.

Returns:

| Type | Description |
| --- | --- |
| str | A comma-separated string of all metric names found in the JSON file. |

Raises:

| Type | Description |
| --- | --- |
| FileNotFoundError | If 'metrics_list.json' is not found. |
| ValueError | If 'metrics_list.json' is empty or contains no metrics. |
| json.JSONDecodeError | If 'metrics_list.json' is not valid JSON. |

Source code in python-scripts/nsight_nvbit.py
def get_metrics(self):
    """
    Load and return the list of Nsight Compute metrics from a JSON file.

    Reads metrics specified in 'metrics_list.json' located in the same
    directory as this script. The JSON file should contain a dictionary where
    values are lists of metric names.

    Returns:
        str: A comma-separated string of all metric names found in the JSON file.

    Raises:
        FileNotFoundError: If 'metrics_list.json' is not found.
        ValueError: If 'metrics_list.json' is empty or contains no metrics.
        json.JSONDecodeError: If 'metrics_list.json' is not valid JSON.
    """
    metrics_path = os.path.join(
        os.path.dirname(__file__), "metrics_list.json")
    with open(metrics_path, 'r') as metrics_file:
        metrics = json.load(metrics_file)
    metrics_str = ""
    for key, value in metrics.items():
        metrics_str += ",".join(value) + ","
    if metrics_str == "":
        raise ValueError(
            "Error: No metrics found in the metrics_list.json file")
    return metrics_str[:-1]
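The expected shape of `metrics_list.json` and the flattening that `get_metrics` performs can be sketched as follows. The metric names here are placeholders, not real Nsight Compute metrics:

```python
import json

# Hypothetical metrics_list.json content: a dict whose values are lists of metric names.
metrics = json.loads('{"l1": ["l1_metric_a", "l1_metric_b"], "l2": ["l2_metric_c"]}')

# Flatten into the single comma-separated string passed to `ncu --metrics`.
metrics_str = ""
for key, value in metrics.items():
    metrics_str += ",".join(value) + ","
if metrics_str == "":
    raise ValueError("No metrics found")
metrics_str = metrics_str[:-1]  # drop the trailing comma
print(metrics_str)  # l1_metric_a,l1_metric_b,l2_metric_c
```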

init_from_params(program_name, program_args, log_file_name, log_file_path, mangled=True)

Initialize the runner with explicitly provided parameters.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `program_name` | str | The program executable or interpreter name/path. | required |
| `program_args` | List[str] | List of arguments for the program. | required |
| `log_file_name` | str | Base name for log files. | required |
| `log_file_path` | str | Directory path for log files. | required |
| `mangled` | bool | Whether to use mangled kernel names. Defaults to True. | True |

Returns:

| Type | Description |
| --- | --- |
| None | |

Source code in python-scripts/nsight_nvbit.py
def init_from_params(self, program_name, program_args, log_file_name, log_file_path, mangled=True):
    """
    Initialize the runner with explicitly provided parameters.

    Args:
        program_name (str): The program executable or interpreter name/path.
        program_args (List[str]): List of arguments for the program.
        log_file_name (str): Base name for log files.
        log_file_path (str): Directory path for log files.
        mangled (bool): Whether to use mangled kernel names. Defaults to True.

    Returns:
        None
    """
    # Initialize the class with the provided arguments.
    self.program = program_name
    self.program_args = program_args
    self.log_file_name = log_file_name
    self.log_file_path = log_file_path
    self.mangled = mangled 

init_from_program(program_name, program_args, mangled=True)

Initialize the runner based on program name and arguments.

Determines the actual executable (handling Python scripts), generates log file names and paths based on the program name and timestamp, and prints a configuration summary.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `program_name` | str | The program executable or script name/path. | required |
| `program_args` | List[str] | List of arguments for the program/script. | required |
| `mangled` | bool | Whether to use mangled kernel names. Defaults to True. | True |

Returns:

| Type | Description |
| --- | --- |
| None | |

Source code in python-scripts/nsight_nvbit.py
def init_from_program(self, program_name, program_args, mangled=True):
    """
    Initialize the runner based on program name and arguments.

    Determines the actual executable (handling Python scripts), generates
    log file names and paths based on the program name and timestamp, and
    prints a configuration summary.

    Args:
        program_name (str): The program executable or script name/path.
        program_args (List[str]): List of arguments for the program/script.
        mangled (bool): Whether to use mangled kernel names. Defaults to True.

    Returns:
        None
    """
    self.program = os.path.realpath(program_name)
    output_program_name = os.path.basename(program_name)
    self.program_args = program_args
    self.mangled = mangled
    if program_name == "python" or program_name.startswith("python3") or program_name == "torchrun":
        self.program = sys.executable
        output_program_name = os.path.basename(
            self.program_args[0]).replace(".py", "")
    if self.program.endswith(".py"):
        output_program_name = os.path.basename(
            self.program).replace(".py", "")
        self.program_args = [self.program] + self.program_args
        self.program = sys.executable

    # Create a log file to store the profiling results
    basename = os.path.basename(output_program_name)
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    self.log_file_name = f"{basename}_{timestamp}"
    self.log_file_path = os.path.join(
        os.getenv('PROJECT_ROOT', '.'), 'logs', basename)
    os.makedirs(self.log_file_path, exist_ok=True)

    # Print configuration summary
    print("*" * 27 + " Profiling Configuration " + "*" * 27)
    print(
        f"Running program: {self.program} with arguments: {' '.join(self.program_args)}")
    print(f"Log file name: {self.log_file_name}")
    print(f"Log file path: {self.log_file_path}")
    print("*" * 80 + "\n")
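The interpreter-handling behavior of `init_from_program` can be illustrated with a self-contained sketch of the same logic. `resolve_program` is a hypothetical stand-in for the method body, and `train.py` a placeholder script:

```python
import os
import sys

def resolve_program(program_name, program_args):
    """Standalone mirror of the interpreter handling in init_from_program()."""
    program = os.path.realpath(program_name)
    output_name = os.path.basename(program_name)
    if program_name == "python" or program_name.startswith("python3") or program_name == "torchrun":
        # A bare interpreter name is replaced with the running interpreter;
        # the first argument (the script) supplies the log-file base name.
        program = sys.executable
        output_name = os.path.basename(program_args[0]).replace(".py", "")
    if program.endswith(".py"):
        # A script invoked directly is re-run through the current interpreter.
        output_name = os.path.basename(program).replace(".py", "")
        program_args = [program] + program_args
        program = sys.executable
    return program, program_args, output_name

program, args, name = resolve_program("python3", ["train.py", "--epochs", "1"])
print(name)  # train
```

So logs for `python3 train.py --epochs 1` land under a directory named after the script (`train`), not the interpreter.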

run_ncu(dry_run=True)

Run NVIDIA Nsight Compute (ncu) on the target program.

Constructs the ncu command line with specified metrics (from get_metrics), configuration flags (e.g., --force-overwrite, --replay-mode), and output options (--export). Sets the TMPDIR environment variable. If dry_run is False, it executes ncu and captures its command-line output to a .exec_ncu.log file. The main report is saved to a .ncu-rep file.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `dry_run` | bool | If True, print the command without executing it. Defaults to True. | True |

Returns:

| Type | Description |
| --- | --- |
| str | The full path to the generated `.ncu-rep` report file. |

Raises:

| Type | Description |
| --- | --- |
| FileNotFoundError | If the `ncu` executable is not found in the expected CUDA path. |
| subprocess.CalledProcessError | If the `ncu` execution fails (when dry_run=False). |

Source code in python-scripts/nsight_nvbit.py
def run_ncu(self, dry_run=True):
    """
    Run NVIDIA Nsight Compute (ncu) on the target program.

    Constructs the `ncu` command line with specified metrics (from `get_metrics`),
    configuration flags (e.g., `--force-overwrite`, `--replay-mode`), and output
    options (`--export`). Sets the `TMPDIR` environment variable. If `dry_run` is False,
    it executes `ncu` and captures its command-line output to a `.exec_ncu.log` file.
    The main report is saved to a `.ncu-rep` file.

    Args:
        dry_run (bool): If True, print the command without executing it.
                        Defaults to True.

    Returns:
        str: The full path to the generated `.ncu-rep` report file.

    Raises:
        FileNotFoundError: If the `ncu` executable is not found in the expected CUDA path.
        subprocess.CalledProcessError: If the `ncu` execution fails (when dry_run=False).
    """
    ncu = os.path.join(os.getenv('CUDA_INSTALL_PATH',
                       '/usr/local/cuda'), 'bin', 'ncu')
    if not os.path.isfile(ncu):
        raise FileNotFoundError(
            f"Error: Nsight Compute CLI not found at {ncu}")
    ncu_output_file = os.path.join(
        self.log_file_path, f"{self.log_file_name}.ncu-rep")
    ncu_log_file = os.path.join(
        self.log_file_path, f"{self.log_file_name}.exec_ncu.log")
    export_args = [
        '--rename-kernels-export', 'yes',
        '--rename-kernels-path', os.path.join(
            self.log_file_path, f"{self.log_file_name}.exec.kernels"),
        '--export', ncu_output_file
    ]
    ncu_args = [
        '--config-file', 'off',
        '--force-overwrite',
        '--clock-control', 'none',
        '--rename-kernels', 'off',
        '--replay-mode', 'kernel',
        '--launch-count', '1000'
    ]
    temp_dir = os.path.join(os.getenv('PROJECT_ROOT', '.'), '.tmp')
    os.makedirs(temp_dir, exist_ok=True)
    ncu_env = os.environ.copy()
    ncu_env['TMPDIR'] = temp_dir
    ncu_env['USER_DEFINED_FOLDERS'] = '1'
    cmd = [ncu] + ncu_args + ['--metrics', self.get_metrics()] + \
        export_args + [self.program] + self.program_args
    print(
        f"Running Nsight Compute with the following command: {' '.join(cmd)}")
    if not dry_run:
        with open(ncu_log_file, 'w') as log_file:
            subprocess.run(cmd, stdout=log_file,
                           stderr=subprocess.STDOUT, env=ncu_env)
        print(f"Check {ncu_output_file} for the Nsight Compute report")
    return ncu_output_file

run_nvbit(dry_run=True)

Run the custom NVBit tool (ncu-nvbit.so) on the target program.

Sets up the environment (CUDA_INJECTION64_PATH, etc.), constructs the command, and executes the program under the NVBit tool. If dry_run is False, it captures the output to a .nvbit log file.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `dry_run` | bool | If True, print the command without executing it. Defaults to True. | True |

Returns:

| Type | Description |
| --- | --- |
| str | The full path to the generated `.nvbit` output file. |

Raises:

| Type | Description |
| --- | --- |
| FileNotFoundError | If the `ncu-nvbit.so` library is not found and cannot be compiled. |
| subprocess.CalledProcessError | If the NVBit execution fails (when dry_run=False). |

Source code in python-scripts/nsight_nvbit.py
def run_nvbit(self, dry_run=True):
    """
    Run the custom NVBit tool (`ncu-nvbit.so`) on the target program.

    Sets up the environment (CUDA_INJECTION64_PATH, etc.), constructs the
    command, and executes the program under the NVBit tool. If `dry_run` is False,
    it captures the output to a `.nvbit` log file.

    Args:
        dry_run (bool): If True, print the command without executing it.
                        Defaults to True.

    Returns:
        str: The full path to the generated `.nvbit` output file.

    Raises:
        FileNotFoundError: If the `ncu-nvbit.so` library is not found and cannot be compiled.
        subprocess.CalledProcessError: If the NVBit execution fails (when dry_run=False).
    """
    nvbit_lib = os.path.join(
        os.getenv('PROJECT_ROOT', '.'),
        'backend', 'ncu-nvbit', 'ncu-nvbit.so')
    if not os.path.isfile(nvbit_lib):
        print("Compiling NVBit")
        subprocess.run(['make', '-C',
                        os.path.join(os.getenv('PROJECT_ROOT', '.'), 'backend', 'ncu-nvbit')])
    nvbit_output_file = os.path.join(
        self.log_file_path, f"{self.log_file_name}.nvbit")
    nvbit_env = os.environ.copy()
    nvbit_env_dict = {
        'NOBANNER': '1',
        'MANGLED_NAMES': '1' if self.mangled else '0',
        'CUDA_INJECTION64_PATH': nvbit_lib,
        'PATH': os.getenv('CUDA_INSTALL_PATH', '/usr/local/cuda') + '/bin:' + os.getenv('PATH', '')
    }
    for key, value in nvbit_env_dict.items():
        nvbit_env[key] = value
    nvbit_env_str = ' '.join(
        [f'{k}={v}' for k, v in nvbit_env_dict.items()])
    print(f"Running NVBit with the following command: \
          {nvbit_env_str} {self.program} {' '.join(self.program_args)}")
    if not dry_run:
        with open(nvbit_output_file, 'w') as log_file:
            subprocess.run([self.program] + self.program_args, stdout=log_file,
                           stderr=subprocess.STDOUT, env=nvbit_env)
        print(f"Check {nvbit_output_file} for the NVBit output")
    return nvbit_output_file
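The environment overrides that `run_nvbit` applies before launching the target can be reproduced standalone; the shared-object path below is a placeholder for the real `ncu-nvbit.so` location:

```python
import os

# Hypothetical sketch of the environment run_nvbit() sets up for the NVBit tool.
nvbit_lib = "/path/to/ncu-nvbit.so"  # placeholder; normally under $PROJECT_ROOT/backend/ncu-nvbit
mangled = True
nvbit_env_dict = {
    "NOBANNER": "1",
    "MANGLED_NAMES": "1" if mangled else "0",
    "CUDA_INJECTION64_PATH": nvbit_lib,
    "PATH": os.getenv("CUDA_INSTALL_PATH", "/usr/local/cuda") + "/bin:" + os.getenv("PATH", ""),
}

# The overrides are layered on top of the current environment, as in the method.
nvbit_env = os.environ.copy()
nvbit_env.update(nvbit_env_dict)

# The printable KEY=VALUE prefix shown before the launched command.
nvbit_env_str = " ".join(f"{k}={v}" for k, v in nvbit_env_dict.items())
print(nvbit_env_str.split(" ")[0])  # NOBANNER=1
```

Setting `CUDA_INJECTION64_PATH` is what causes the CUDA runtime to load the NVBit tool into the target process; the target itself is launched unmodified.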

Execution-based, kernel-level, runtime cache analysis using Nsight Compute and NVBit.

This script profiles a specified program using NVIDIA Nsight Compute CLI and a custom NVBit tool. It runs both tools on the target program, parses their outputs, computes derived cache metrics (like lifetime, frequency, utilization), and generates CSV reports and optional plots.

Usage

python3 ncu_exec.py <program> [args ...] [--dry-run] [--mangled] [--histogram]

Pre-requisites

- The program must exist and be executable.
- The PROJECT_ROOT and CUDA_INSTALL_PATH environment variables must be set.
- The Nsight Compute CLI must be installed and available in CUDA_INSTALL_PATH.
- The NVBit library (ncu-nvbit.so) must be compiled and available in $PROJECT_ROOT/backend/ncu-nvbit.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `program` | str | The program to profile. | required |
| `args` | List[str] | Arguments to pass to the program. | required |
| `--dry-run` | bool | Print commands without running tools. Defaults to False. | required |
| `--mangled` | bool | Use mangled kernel names in output. Defaults to False. | required |
| `--histogram` | bool | Generate plots of computed metrics. Defaults to False. | required |
Output

Log files are saved under $PROJECT_ROOT/logs/<program_name>/ with base name <program_name>_<timestamp>:

- .ncu-rep: Raw Nsight Compute report file.
- .nvbit: Raw NVBit log file.
- .exec_ncu.log: Nsight Compute CLI command output log.
- .exec_cmd.log: Command used to run the script.
- .exec.kernels: Kernel name mapping file (if generated by ncu).
- .exec.csv: Computed metrics for each kernel (CSV format).
- .exec_l1.png: L1 cache metrics plots (optional).
- .exec_l2.png: L2 cache metrics plots (optional).

compute_kernel_metrics(kernel_id, kernel_action, unique_sector_counts)

Compute derived cache metrics for a single kernel based on NCU and NVBit data.

Calculates metrics like cache active time, read/write frequency, lifetime, and utilization for both L1 and L2 caches using raw counter values from the Nsight Compute report (kernel_action) and unique sector counts from NVBit data (unique_sector_counts).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `kernel_id` | int | The ID (index) of the kernel being processed. | required |
| `kernel_action` | IAction | The Nsight Compute action object containing metrics for this kernel. | required |
| `unique_sector_counts` | List[int] | A list containing [l1_unique_sectors, l2_unique_sectors] for this kernel. | required |

Returns:

| Type | Description |
| --- | --- |
| Optional[pd.DataFrame] | A one-row DataFrame containing the computed metrics for the kernel, or None if the kernel execution time or relevant cache access times are zero. Columns include "Kernel ID", "Function Name", "Total Cycles", "Kernel Execution Time", "L1 Active Time", "L1 Read Frequency", "L1 Write Frequency", "L1 Lifetime", "L1 Utilization", "L2 Active Time", "L2 Read Frequency", "L2 Write Frequency", "L2 Lifetime", "L2 Utilization". Times are in microseconds, frequencies in MHz. |
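The return shape can be illustrated with a hypothetical one-row result; the values and kernel name are placeholders, and only a subset of the columns is shown:

```python
import pandas as pd

# Hypothetical one-row DataFrame in the shape compute_kernel_metrics() returns.
row = pd.DataFrame([{
    "Kernel ID": 0,
    "Function Name": "vecadd_kernel",   # placeholder kernel name
    "Total Cycles": 120000,
    "Kernel Execution Time": 85.0,      # microseconds
    "L1 Active Time": 60.0,             # microseconds
    "L1 Lifetime": 2.4,                 # microseconds
    "L1 Utilization": 0.71,
}])
print(len(row))  # 1
```

The caller concatenates these one-row frames across kernels to build the final `.exec.csv` table.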

Source code in python-scripts/ncu_exec.py
def compute_kernel_metrics(kernel_id, kernel_action, unique_sector_counts):
    """
    Compute derived cache metrics for a single kernel based on NCU and NVBit data.

    Calculates metrics like cache active time, read/write frequency, lifetime, and
    utilization for both L1 and L2 caches using raw counter values from the Nsight
    Compute report (`kernel_action`) and unique sector counts from NVBit data
    (`unique_sector_counts`).

    Args:
        kernel_id (int): The ID (index) of the kernel being processed.
        kernel_action (ncu_report.IAction): The Nsight Compute action object containing metrics for this kernel.
        unique_sector_counts (List[int]): A list containing `[l1_unique_sectors, l2_unique_sectors]` for this kernel.

    Returns:
        Optional[pd.DataFrame]: A one-row DataFrame containing the computed metrics for the kernel,
                                or None if the kernel execution time or relevant cache access times are zero.
                                Columns include "Kernel ID", "Function Name", "Total Cycles",
                                "Kernel Execution Time", "L1 Active Time", "L1 Read Frequency",
                                "L1 Write Frequency", "L1 Lifetime", "L1 Utilization",
                                "L2 Active Time", "L2 Read Frequency", "L2 Write Frequency",
                                "L2 Lifetime", "L2 Utilization". Times are in microseconds,
                                frequencies in MHz.
    """
    # Get kernel execution time
    function_name = kernel_action.name()
    print(f"Processing kernel {kernel_id} ({function_name})...")
    kernel_cyc_avg = kernel_action['sm__cycles_elapsed.avg'].value()
    kernel_execution_time = kernel_cyc_avg * CYCLE_TIME
    if kernel_execution_time == 0:
        return None

    # Get number of unique L1 and L2 sectors
    l1_unique_sector_count = unique_sector_counts[0]
    l2_unique_sector_count = unique_sector_counts[1]

    # Get L1 load and store access metrics
    l1_load_access_global = kernel_action['l1tex__t_requests_pipe_lsu_mem_global_op_ld.avg'].value()
    l1_load_access_local = kernel_action['l1tex__t_requests_pipe_lsu_mem_local_op_ld.avg'].value()
    l1_store_access_global = kernel_action['l1tex__t_requests_pipe_lsu_mem_global_op_st.avg'].value()
    l1_store_access_local = kernel_action['l1tex__t_requests_pipe_lsu_mem_local_op_st.avg'].value()
    l1_store_hit_global = kernel_action['l1tex__t_sectors_pipe_lsu_mem_global_op_st_lookup_hit.avg'].value()
    l1_store_hit_local = kernel_action['l1tex__t_sectors_pipe_lsu_mem_local_op_st_lookup_hit.avg'].value()

    # Get other L1 cache metrics
    l1_fetch = kernel_action['l1tex__m_xbar2l1tex_read_sectors_mem_lg_op_ld.avg'].value()
    l1_read_from_xbar = kernel_action['l1tex__m_xbar2l1tex_read_sectors.avg'].value()
    l1_write_to_xbar = kernel_action['l1tex__m_l1tex2xbar_write_sectors.avg'].value()
    l1_access_cycles = kernel_action['l1tex__cycles_active.avg'].value()
    l1_access_time = l1_access_cycles * CYCLE_TIME
    if l1_access_time == 0:
        print(f"\tWarning: Kernel {kernel_id} has no L1 cache accesses.")
        return None

    # Calculate L1 read and write frequencies
    l1_read_count = l1_load_access_global + l1_load_access_local + l1_write_to_xbar
    l1_read_freq = l1_read_count / l1_access_time
    l1_write_count = l1_store_access_global + \
        l1_store_access_local + l1_read_from_xbar
    l1_write_freq = l1_write_count / l1_access_time
    print(f"\tL1 read count: {l1_read_count}")
    print(
        f"\tL1 read frequency: {1/l1_read_freq:.2f} ns/read or {1e3 * l1_read_freq:.2f} MHz")
    print(f"\tL1 write count: {l1_write_count}")
    print(
        f"\tL1 write frequency: {1/l1_write_freq:.2f} ns/write or {1e3 * l1_write_freq:.2f} MHz")

    # Report L1 store hit rate
    l1_store_access = l1_store_access_global + l1_store_access_local
    l1_store_hit = l1_store_hit_global + l1_store_hit_local
    if l1_store_access == 0:
        print(f"\tWarning: Kernel {kernel_id} has no L1 store accesses.")
        l1_store_hit_rate = 0
    else:
        l1_store_hit_rate = l1_store_hit / l1_store_access
    print(f"\tL1 store hit rate: {l1_store_hit_rate * 100:.2f}%")

    # Calculate L1 lifetime
    l1_sector_count = l1_unique_sector_count if l1_unique_sector_count < L1_SECTOR_COUNT else L1_SECTOR_COUNT
    if l1_store_hit + l1_fetch == 0:
        print(
            f"\tWarning: Kernel {kernel_id} has no L1 store hits or fetches.")
        l1_lifetime = 0
        l1_refreshes = 0
    else:
        l1_lifetime = l1_access_time * \
            l1_sector_count / (l1_store_hit + l1_fetch)
        l1_refreshes = math.floor(l1_lifetime / RETENTION_TIME)
    print(f"\tL1 lifetime: {l1_lifetime:.2f} ns")

    # Get L2 cache metrics
    l2_write_access = kernel_action['lts__t_sectors_op_write.sum'].value()
    l2_write_hit = kernel_action['lts__t_sectors_op_write_lookup_hit.sum'].value()
    l2_read_requests = kernel_action['lts__t_requests_op_read.sum'].value()
    l2_write_requests = kernel_action['lts__t_requests_op_write.sum'].value()
    l2_fetch_device = kernel_action['lts__t_sectors_aperture_device_lookup_miss.sum'].value()
    l2_fetch_sysmem = kernel_action['lts__t_sectors_aperture_sysmem_lookup_miss.sum'].value()
    l2_fetch_peer = kernel_action['lts__t_sectors_aperture_peer_lookup_miss.sum'].value()
    l2_fetch = l2_fetch_device + l2_fetch_sysmem + l2_fetch_peer
    l2_access_cycles = kernel_action['lts__cycles_active.avg'].value()
    l2_access_time = l2_access_cycles * CYCLE_TIME
    if l2_access_time == 0:
        print(f"\tWarning: Kernel {kernel_id} has no L2 cache accesses.")
        return None

    # Calculate L2 read and write frequencies
    l2_read_freq = l2_read_requests / l2_access_time
    l2_write_freq = l2_write_requests / l2_access_time
    print(f"\tL2 read count: {l2_read_requests}")
    print(
        f"\tL2 read frequency: {1/l2_read_freq:.2f} ns/read or {1e3 * l2_read_freq:.2f} MHz")
    print(f"\tL2 write count: {l2_write_requests}")
    print(
        f"\tL2 write frequency: {1/l2_write_freq:.2f} ns/write or {1e3 * l2_write_freq:.2f} MHz")
    # Report L2 write hit rate (guard against kernels with no L2 write accesses)
    if l2_write_access == 0:
        print(f"\tWarning: Kernel {kernel_id} has no L2 write accesses.")
        l2_write_hit_rate = 0
    else:
        l2_write_hit_rate = l2_write_hit / l2_write_access
    print(f"\tL2 write hit rate: {l2_write_hit_rate * 100:.2f}%")
    # Calculate L2 lifetime
    l2_sector_count = l2_unique_sector_count if l2_unique_sector_count < L2_SECTOR_COUNT else L2_SECTOR_COUNT
    if l2_write_hit + l2_fetch == 0:
        print(
            f"\tWarning: Kernel {kernel_id} has no L2 write hits or fetches.")
        l2_lifetime = 0
        l2_refreshes = 0
    else:
        l2_lifetime = l2_access_time * \
            l2_sector_count / (l2_write_hit + l2_fetch)
        l2_refreshes = math.floor(l2_lifetime / RETENTION_TIME)
    print(f"\tL2 lifetime: {l2_lifetime:.2f} ns")
    print()

    # Add results to DataFrame
    return pd.DataFrame([{
        "Kernel ID": kernel_id,
        "Function Name": function_name,
        "Total Cycles": kernel_cyc_avg,
        "Kernel Execution Time": kernel_execution_time,
        "L1 Active Time": l1_access_time / 1e3,  # convert to microseconds
        "L1 Read Frequency": l1_read_freq * 1e3,  # convert to MHz
        "L1 Write Frequency": l1_write_freq * 1e3,  # convert to MHz
        "L1 Lifetime": l1_lifetime / 1e3,  # convert to microseconds
        # "L1 Refreshes": l1_refreshes,
        "L1 Utilization": l1_sector_count / L1_SECTOR_COUNT,
        "L2 Active Time": l2_access_time / 1e3,  # convert to microseconds
        "L2 Read Frequency": l2_read_freq * 1e3,  # convert to MHz
        "L2 Write Frequency": l2_write_freq * 1e3,  # convert to MHz
        "L2 Lifetime": l2_lifetime / 1e3,  # convert to microseconds
        # "L2 Refreshes": l2_refreshes,
        "L2 Utilization": l2_sector_count / L2_SECTOR_COUNT
    }])
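The lifetime computation above reduces to a single formula: average sector residency ≈ cache active time × resident sectors / (write hits + fetches). The sketch below isolates that formula with made-up constants standing in for the module-level `CYCLE_TIME`, `RETENTION_TIME`, `L1_SECTOR_COUNT` (the real values are defined elsewhere in `ncu_exec.py` and derived from the target GPU):

```python
import math

# Hypothetical constants for illustration only; the real values come from
# ncu_exec.py (CYCLE_TIME from the GPU clock, RETENTION_TIME is the assumed
# retention window in ns, L1_SECTOR_COUNT from the cache geometry).
CYCLE_TIME = 1.0 / 1.41             # ns per cycle at an assumed 1410 MHz clock
RETENTION_TIME = 77_000             # 77 us expressed in ns
L1_SECTOR_COUNT = 256 * 1024 // 32  # assumed 256 KiB L1 with 32 B sectors


def cache_lifetime(access_cycles, unique_sectors, sector_capacity,
                   write_hits, fetches):
    """Average residency of a sector, mirroring the formula used above:
    lifetime = active_time * resident_sectors / (write_hits + fetches)."""
    access_time = access_cycles * CYCLE_TIME         # ns the cache was active
    resident = min(unique_sectors, sector_capacity)  # cannot exceed capacity
    if write_hits + fetches == 0:
        return 0.0, 0
    lifetime = access_time * resident / (write_hits + fetches)
    refreshes = math.floor(lifetime / RETENTION_TIME)
    return lifetime, refreshes


lifetime_ns, refreshes = cache_lifetime(
    access_cycles=1_000_000, unique_sectors=4096,
    sector_capacity=L1_SECTOR_COUNT, write_hits=500, fetches=1500)
print(f"lifetime: {lifetime_ns:.1f} ns, refreshes: {refreshes}")
```

The `min(unique_sectors, capacity)` clamp is the same capping that `compute_kernel_metrics` applies before dividing, so a kernel touching more sectors than the cache holds cannot inflate the lifetime.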

get_correspondence_table(results, mangled)

Extract and print a mapping between kernel IDs and kernel names.

Creates a DataFrame containing either the mangled or unmangled kernel names based on the mangled flag, alongside their corresponding kernel IDs. Prints this table to the console.

Parameters:

Name Type Description Default
results DataFrame

DataFrame containing computed metrics and kernel names (Mangled Names, Unmangled Names columns).

required
mangled bool

If True, use 'Mangled Names' column; otherwise, use 'Unmangled Names'.

required

Returns:

Type Description

pd.DataFrame: A DataFrame with two columns: 'Kernel ID' and either 'Mangled Names' or 'Unmangled Names'.

Source code in python-scripts/ncu_exec.py
def get_correspondence_table(results, mangled):
    """
    Extract and print a mapping between kernel IDs and kernel names.

    Creates a DataFrame containing either the mangled or unmangled kernel names
    based on the `mangled` flag, alongside their corresponding kernel IDs. Prints
    this table to the console.

    Args:
        results (pd.DataFrame): DataFrame containing computed metrics and kernel names
                                (Mangled Names, Unmangled Names columns).
        mangled (bool): If True, use 'Mangled Names' column; otherwise, use 'Unmangled Names'.

    Returns:
        pd.DataFrame: A DataFrame with two columns: 'Kernel ID' and either
                      'Mangled Names' or 'Unmangled Names'.
    """
    if mangled:
        correspondence_table = results[['Kernel ID', 'Mangled Names']]
    else:
        correspondence_table = results[['Kernel ID', 'Unmangled Names']]
    pd.set_option('display.max_colwidth', 120)
    pd.set_option('display.max_rows', None)
    pd.set_option('display.max_columns', None)
    print("Correspondence table:")
    print(correspondence_table)
    print()
    return correspondence_table
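For illustration, the column selection above behaves as follows on a small fabricated DataFrame (the kernel names here are invented, not from a real profile):

```python
import pandas as pd

# Fabricated results table with the two name columns the function expects.
results = pd.DataFrame({
    'Kernel ID': [0, 1],
    'Mangled Names': ['_Z9vectorAddPfS_S_i', '_Z6reducePfS_i'],
    'Unmangled Names': ['vectorAdd(float*, float*, float*, int)',
                        'reduce(float*, float*, int)'],
})

# mangled=True branch: keep the ID and the mangled-name column.
table = results[['Kernel ID', 'Mangled Names']]
print(table)
```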

parse_arguments()

Parse command-line arguments for the NCU/NVBit execution script.

Defines and parses arguments for specifying the target program, its arguments, and options like dry run, mangled names, and histogram generation.

Parameters:

Name Type Description Default
None

Uses defined CLI options.

required

Returns:

Type Description

argparse.Namespace: Parsed arguments including:

- program (str): Program to profile.
- args (List[str]): Arguments to the program.
- dry_run (bool): Print commands without running tools.
- mangled (bool): Use mangled kernel names.
- histogram (bool): Generate metric plots.

Source code in python-scripts/ncu_exec.py
def parse_arguments():
    """Parse command-line arguments for the NCU/NVBit execution script.

    Defines and parses arguments for specifying the target program, its arguments,
    and options like dry run, mangled names, and histogram generation.

    Args:
        None: Uses defined CLI options.

    Returns:
        argparse.Namespace: Parsed arguments including:
            program (str): Program to profile.
            args (List[str]): Arguments to the program.
            dry_run (bool): Print commands without running tools.
            mangled (bool): Use mangled kernel names.
            histogram (bool): Generate metric plots.
    """
    parser = argparse.ArgumentParser(
        description="Process files generated by the Nsight Compute/NVBit backend")
    # parser.add_argument(
    #     "input_file", help="Path to the input file", action="store")
    parser.add_argument('program', type=str, help='Program to profile')
    parser.add_argument('args', nargs=argparse.REMAINDER,
                        help='Arguments to the program')
    parser.add_argument('--dry-run', action='store_true',
                        help='Print the command without running it', default=False)
    parser.add_argument("--mangled", action="store_true",
                        help="Use mangled kernel names", default=False)
    parser.add_argument("--histogram", action="store_true",
                        help="Generate histograms of the metrics", default=False)
    return parser.parse_args()
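Because `args` uses `argparse.REMAINDER`, the script's own flags must come before the program name; everything after it, including dash-prefixed tokens, is forwarded untouched to the profiled binary. A self-contained reconstruction of the same parser shows this:

```python
import argparse

# Minimal reconstruction of the parser above (same options, help text omitted)
# to show how REMAINDER forwards everything after the program name.
parser = argparse.ArgumentParser(
    description="Process files generated by the Nsight Compute/NVBit backend")
parser.add_argument('program', type=str)
parser.add_argument('args', nargs=argparse.REMAINDER)
parser.add_argument('--dry-run', action='store_true', default=False)
parser.add_argument('--mangled', action='store_true', default=False)
parser.add_argument('--histogram', action='store_true', default=False)

# './vectorAdd' and '--size 1024' are made-up example values.
ns = parser.parse_args(['--dry-run', './vectorAdd', '--size', '1024'])
print(ns.program, ns.args, ns.dry_run)
```

Note that `--size` is not consumed by the script's parser: it lands in `ns.args` and is passed through to the target program.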

plot_l1_metrics(results, basename)

Create and save plots summarizing L1 cache metrics across kernels.

Generates a PNG image file containing bar plots for L1 Utilization, L1 Lifetime, and L1 Refreshes (calculated based on a fixed retention time).

Parameters:

Name Type Description Default
results DataFrame

DataFrame containing computed metrics for all kernels (output of compute_kernel_metrics).

required
basename str

The base path and filename prefix for the output PNG file (e.g., /path/to/logs/program_timestamp). The suffix .exec_l1.png will be appended.

required

Returns:

Name Type Description
None

Saves the plot to a file.

Source code in python-scripts/ncu_exec.py
def plot_l1_metrics(results, basename):
    """
    Create and save plots summarizing L1 cache metrics across kernels.

    Generates a PNG image file containing bar plots for L1 Utilization, L1 Lifetime,
    and L1 Refreshes (calculated based on a fixed retention time).

    Args:
        results (pd.DataFrame): DataFrame containing computed metrics for all kernels
                                (output of `compute_kernel_metrics`).
        basename (str): The base path and filename prefix for the output PNG file
                        (e.g., `/path/to/logs/program_timestamp`). The suffix `.exec_l1.png`
                        will be appended.

    Returns:
        None: Saves the plot to a file.
    """
    fig, ax = plt.subplots(1, 3, figsize=(20, 9))

    # sns.barplot(y="Kernel ID", x="L1 Active Time", data=results, ax=ax[0, 0], orient='h')
    # ax[0, 0].set_title("L1 Active Time (μs)")

    sns.barplot(y="Kernel ID", x="L1 Utilization",
                data=results, ax=ax[0], orient='h')
    ax[0].set_title("L1 Utilization")

    # sns.barplot(y="Kernel ID", x="L1 Read Frequency", data=results, ax=ax[1, 0], orient='h', color='b')
    # ax[1, 0].set_title("L1 Read Frequency (MHz)")
    # ax[1, 0].set_xlabel("L1 Read Frequency (MHz)", color='b')

    # sns.barplot(y="Kernel ID", x="L1 Write Frequency", data=results, ax=ax[1, 1], orient='h', color='r')
    # ax[1, 1].set_title("L1 Write Frequency (MHz)")
    # ax[1, 1].set_xlabel("L1 Write Frequency (MHz)", color='r')

    sns.barplot(y="Kernel ID", x="L1 Lifetime",
                data=results, ax=ax[1], orient='h')
    ax[1].set_title("L1 Lifetime (μs)")

    # "L1 Refreshes" is commented out in compute_kernel_metrics, so derive it
    # here from the lifetime (in μs) and the 77 μs retention time before plotting.
    if "L1 Refreshes" not in results.columns:
        results = results.assign(
            **{"L1 Refreshes": (results["L1 Lifetime"] // 77).astype(int)})
    sns.barplot(y="Kernel ID", x="L1 Refreshes",
                data=results, ax=ax[2], orient='h')
    ax[2].set_title("L1 Refreshes for 77 μs retention time")

    plt.tight_layout()
    plt.savefig(basename + ".exec_l1.png")
    plt.clf()

plot_l2_metrics(results, basename)

Create and save plots summarizing L2 cache metrics across kernels.

Generates a PNG image file containing bar plots for L2 Utilization and L2 Lifetime. Adds a horizontal line indicating the assumed retention time on the Lifetime plot.

Parameters:

Name Type Description Default
results DataFrame

DataFrame containing computed metrics for all kernels (output of compute_kernel_metrics).

required
basename str

The base path and filename prefix for the output PNG file (e.g., /path/to/logs/program_timestamp). The suffix .exec_l2.png will be appended.

required

Returns:

Name Type Description
None

Saves the plot to a file.

Source code in python-scripts/ncu_exec.py
def plot_l2_metrics(results, basename):
    """
    Create and save plots summarizing L2 cache metrics across kernels.

    Generates a PNG image file containing bar plots for L2 Utilization and L2 Lifetime.
    Adds a horizontal line indicating the assumed retention time on the Lifetime plot.

    Args:
        results (pd.DataFrame): DataFrame containing computed metrics for all kernels
                                (output of `compute_kernel_metrics`).
        basename (str): The base path and filename prefix for the output PNG file
                        (e.g., `/path/to/logs/program_timestamp`). The suffix `.exec_l2.png`
                        will be appended.

    Returns:
        None: Saves the plot to a file.
    """
    font_size = 25
    fig, ax = plt.subplots(2, 1, figsize=(16, 8))

    # sns.barplot(y="Kernel ID", x="L2 Active Time", data=results, ax=ax[0, 0], orient='h')
    # ax[0, 0].set_title("L2 Active Time (μs)")
    sns.set(font_scale=2.2)
    sns.barplot(x="Kernel ID", y="L2 Utilization", data=results, ax=ax[0])
    ax[0].set_title("L2 Utilization")
    ax[0].tick_params(axis='x', labelsize=font_size)
    ax[0].tick_params(axis='y', labelsize=font_size)
    # change the x-axis label to be more readable
    ax[0].set_xlabel("Kernel ID", fontsize=font_size)
    ax[0].set_ylabel("Utilization", fontsize=font_size)

    # sns.barplot(y="Kernel ID", x="L2 Read Frequency", data=results, ax=ax[1, 0], orient='h')
    # ax[1, 0].set_title("L2 Read Frequency (MHz)")

    # sns.barplot(y="Kernel ID", x="L2 Write Frequency", data=results, ax=ax[1, 1], orient='h')
    # ax[1, 1].set_title("L2 Write Frequency (MHz)")

    sns.barplot(x="Kernel ID", y="L2 Lifetime", data=results, ax=ax[1])
    ax[1].set_title("L2 Lifetime (μs)")
    # Add horizontal line for 77 μs retention time with label
    ax[1].axhline(y=77, color='r', linestyle='--',
                  label='77 μs retention time')
    ax[1].legend()
    plt.xticks(fontsize=font_size)
    plt.yticks(fontsize=font_size)
    ax[1].set_xlabel("Kernel ID", fontsize=font_size)
    ax[1].set_ylabel("Lifetime (μs)", fontsize=font_size)

    plt.tight_layout()
    plt.savefig(basename + ".exec_l2.png")
    plt.clf()

read_nvbit_data(nvbit_input_file, kernel_count)

Read and parse the NVBit log file to extract unique sector counts and kernel names.

Parses the .nvbit output file generated by the custom NVBit tool. It extracts the number of unique L1 and L2 cache sectors accessed by each kernel, as well as the mangled and unmangled names for each kernel ID.

Parameters:

Name Type Description Default
nvbit_input_file str

Path to the NVBit log file (.nvbit).

required
kernel_count int

The expected number of kernels (obtained from NCU report).

required

Returns:

Type Description

Tuple[List[List[int]], List[str], List[str]]: A tuple containing:

- List[List[int]]: A list where each inner list contains [l1_unique_sectors, l2_unique_sectors] for a kernel ID.
- List[str]: A list of unmangled kernel names indexed by kernel ID.
- List[str]: A list of mangled kernel names indexed by kernel ID.

Raises:

Type Description
SystemExit

If nvbit_input_file does not exist (the script prints an error and exits with status 1).

AssertionError

If the kernel count reported in the NVBit log does not match kernel_count.

Source code in python-scripts/ncu_exec.py
def read_nvbit_data(nvbit_input_file, kernel_count):
    """
    Read and parse the NVBit log file to extract unique sector counts and kernel names.

    Parses the `.nvbit` output file generated by the custom NVBit tool. It extracts
    the number of unique L1 and L2 cache sectors accessed by each kernel, as well
    as the mangled and unmangled names for each kernel ID.

    Args:
        nvbit_input_file (str): Path to the NVBit log file (`.nvbit`).
        kernel_count (int): The expected number of kernels (obtained from NCU report).

    Returns:
        Tuple[List[List[int]], List[str], List[str]]: A tuple containing:
            - List[List[int]]: A list where each inner list contains `[l1_unique_sectors, l2_unique_sectors]` for a kernel ID.
            - List[str]: A list of unmangled kernel names indexed by kernel ID.
            - List[str]: A list of mangled kernel names indexed by kernel ID.

    Raises:
        SystemExit: If `nvbit_input_file` does not exist (an error is printed and the script exits with status 1).
        AssertionError: If the kernel count reported in the NVBit log does not match `kernel_count`.
    """
    nvbit_lines = None
    try:
        with open(nvbit_input_file, newline='') as input_file:
            nvbit_lines = input_file.readlines()
    except FileNotFoundError:
        print(f"Error: File '{nvbit_input_file}' not found.")
        exit(1)

    unmangled_names = []
    mangled_names = []
    nvbit_results = []

    # initialize the results with empty values
    for i in range(kernel_count):
        # Zero out metrics for L1 and L2 cache
        nvbit_results.append([0, 0])
        unmangled_names.append("")
        mangled_names.append("")

    for line in nvbit_lines:
        # Assert correct number of kernels
        if "MEMTRACE: Size of id_kernel_map" in line:
            # Line format: MEMTRACE: Size of id_kernel_map: 308
            kernel_count_nvbit = int(line.split(": ")[-1])
            assert kernel_count == kernel_count_nvbit, f"Kernel count mismatch: {kernel_count} != {kernel_count_nvbit}"
        # Process L1 and L2 cache metrics
        if "L1" in line:
            # Line format: MEMTRACE: Kernel ID 57 - L1 unique sectors: 453
            kernel_id = int(line.split("Kernel ID ")[1].split(" -")[0])
            nvbit_results[kernel_id][0] = int(
                line.split("L1 unique sectors: ")[1])
        elif "L2" in line:
            # Line format: MEMTRACE: Kernel ID 0 - L2 unique sectors: 542
            kernel_id = int(line.split("Kernel ID ")[1].split(" -")[0])
            nvbit_results[kernel_id][1] = int(
                line.split("L2 unique sectors: ")[1])
        # Process kernel names
        elif "Mangled name" in line:
            kernel_id = int(line.split("Kernel ID ")[1].split(" -")[0])
            mangled_name = line.split("Mangled name: ")[1].split(" -")[0]
            unmangled_name = line.split("Unmangled name: ")[1].strip()
            mangled_names[kernel_id] = mangled_name
            unmangled_names[kernel_id] = unmangled_name
    return nvbit_results, unmangled_names, mangled_names
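To make the parsing logic concrete, here is the same split-based extraction applied to fabricated MEMTRACE lines in the formats quoted in the comments above (the kernel IDs and sector counts are invented):

```python
# Sample lines in the "MEMTRACE: ..." shapes documented in read_nvbit_data.
sample_log = [
    "MEMTRACE: Size of id_kernel_map: 2\n",
    "MEMTRACE: Kernel ID 0 - L1 unique sectors: 453\n",
    "MEMTRACE: Kernel ID 0 - L2 unique sectors: 542\n",
    "MEMTRACE: Kernel ID 1 - L1 unique sectors: 12\n",
    "MEMTRACE: Kernel ID 1 - L2 unique sectors: 30\n",
]

# One [l1_unique_sectors, l2_unique_sectors] slot per kernel.
results = [[0, 0] for _ in range(2)]
for line in sample_log:
    if "L1" in line:
        kid = int(line.split("Kernel ID ")[1].split(" -")[0])
        results[kid][0] = int(line.split("L1 unique sectors: ")[1])
    elif "L2" in line:
        kid = int(line.split("Kernel ID ")[1].split(" -")[0])
        results[kid][1] = int(line.split("L2 unique sectors: ")[1])
print(results)
```

`int()` tolerates the trailing newline on each count, which is why the splits need no explicit `strip()`.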

Parser for Accel-Sim Trace Files

The accel_sim_parser.py script is responsible for parsing the trace files generated by the Accel-Sim simulator. It extracts relevant information about data lifetime, read and write operations, and other performance metrics from the trace files.

Module for parsing Accel-Sim simulation logs and generating cache lifetime statistics.

This module provides classes and functions to process GPGPU-Sim simulation logs (specifically the cache access logs generated when running Accel-Sim) to calculate cache line lifetime metrics. It parses log lines, tracks cache line residency, computes lifetimes, and aggregates statistics per kernel and across the entire run. It outputs results in CSV format.

LifetimeType

Bases: object

Represents the lifetime of a cache line (sector) at a specific address.

Stores the start and end simulation cycles for a cache line's residency.

Attributes:

Name Type Description
address int

The memory address of the cache line.

start int

The simulation cycle when the cache line entered the cache.

end Optional[int]

The simulation cycle when the cache line was evicted or the last cycle it was accessed before the simulation ended. Initially None.

Source code in python-scripts/accel_sim_parser.py
class LifetimeType(object):
    """
    Represents the lifetime of a cache line (sector) at a specific address.

    Stores the start and end simulation cycles for a cache line's residency.

    Attributes:
        address (int): The memory address of the cache line.
        start (int): The simulation cycle when the cache line entered the cache.
        end (Optional[int]): The simulation cycle when the cache line was evicted
                             or the last cycle it was accessed before the simulation ended.
                             Initially None.
    """

    def __init__(self, address, start, end):
        """
        Initialize a LifetimeType object.

        Args:
            address (int): The memory address.
            start (int): The start cycle.
            end (Optional[int]): The end cycle (can be None initially).
        """
        self.address = address
        self.start = start
        self.end = end

    def __dict__(self):
        """
        Return a dictionary representation of the lifetime entry.

        Returns:
            dict: A dictionary with 'address' (hex), 'start', and 'end' keys.
        """
        return {
            # convert address to hex
            "address": hex(int(self.address)),
            "start": self.start,
            "end": self.end
        }

    def calculate_lifetime(self):
        """
        Calculate the duration of the cache line lifetime in cycles.

        Returns:
            int: The difference between the end and start cycles.

        Raises:
            TypeError: If `end` or `start` is None or not a number.
        """
        return self.end - self.start

__dict__()

Return a dictionary representation of the lifetime entry.

Returns:

Name Type Description
dict

A dictionary with 'address' (hex), 'start', and 'end' keys.

Source code in python-scripts/accel_sim_parser.py
def __dict__(self):
    """
    Return a dictionary representation of the lifetime entry.

    Returns:
        dict: A dictionary with 'address' (hex), 'start', and 'end' keys.
    """
    return {
        # convert address to hex
        "address": hex(int(self.address)),
        "start": self.start,
        "end": self.end
    }

__init__(address, start, end)

Initialize a LifetimeType object.

Parameters:

Name Type Description Default
address int

The memory address.

required
start int

The start cycle.

required
end Optional[int]

The end cycle (can be None initially).

required
Source code in python-scripts/accel_sim_parser.py
def __init__(self, address, start, end):
    """
    Initialize a LifetimeType object.

    Args:
        address (int): The memory address.
        start (int): The start cycle.
        end (Optional[int]): The end cycle (can be None initially).
    """
    self.address = address
    self.start = start
    self.end = end

calculate_lifetime()

Calculate the duration of the cache line lifetime in cycles.

Returns:

Name Type Description
int

The difference between the end and start cycles.

Raises:

Type Description
TypeError

If end or start is None or not a number.

Source code in python-scripts/accel_sim_parser.py
def calculate_lifetime(self):
    """
    Calculate the duration of the cache line lifetime in cycles.

    Returns:
        int: The difference between the end and start cycles.

    Raises:
        TypeError: If `end` or `start` is None or not a number.
    """
    return self.end - self.start
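A minimal usage sketch of LifetimeType follows: a line enters the cache at one cycle, is evicted at a later cycle, and the lifetime is converted to nanoseconds. The class body is reproduced from above; the 1 GHz clock (hence CYCLE_TIME = 1.0 ns) and the address are assumptions for illustration, since the real cycle time comes from the parsed GPGPU-Sim config.

```python
class LifetimeType(object):
    """Lifetime of one cache line: start/end cycles of its residency."""

    def __init__(self, address, start, end):
        self.address = address
        self.start = start
        self.end = end

    def calculate_lifetime(self):
        return self.end - self.start


CYCLE_TIME = 1.0  # ns per cycle, assuming a 1 GHz simulated clock

# Line enters the cache at cycle 120; end is unknown until eviction.
lt = LifetimeType(address=0x7f00a040, start=120, end=None)
lt.end = 4620  # eviction observed later in the log
cycles = lt.calculate_lifetime()
print(f"{hex(lt.address)} lived {cycles} cycles = {cycles * CYCLE_TIME} ns")
```

This mirrors how SimulationParser uses the class: `end` stays None while the line is resident and is filled in when an eviction (or the end of the kernel) is seen.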

SimulationParser

Parses Accel-Sim cache log lines for a single kernel to calculate lifetime statistics.

Processes log lines associated with one kernel launch, tracks cache line entries and exits based on load/store operations and cache status (hit/miss), considering the configured cache policies (write-allocate, write-back). It calculates the lifetime for each cache line instance and aggregates statistics.

Attributes:

Name Type Description
GPU_FREQ int

GPU frequency in MHz.

CYCLE_TIME float

GPU cycle time in nanoseconds.

log_path str

Path to the log directory.

log_file str

Base name of the log file.

kernel_name str

Name of the kernel being processed.

kernel_id int

ID of the kernel being processed.

sim_cycles int

Simulation cycles for this kernel.

sim_insn int

Instructions executed by this kernel.

ipc float

Instructions per cycle for this kernel.

total_sim_cycles int

Total simulation cycles up to this kernel.

total_sim_insn int

Total instructions executed up to this kernel.

log_lines list[str]

Log lines from .sim_cache.log for this kernel.

sector_size int

Cache line size in bytes.

l1_size int

L1 cache size in bytes.

l1_cache_lines float

Number of cache lines in L1.

l2_size int

L2 cache size in bytes.

l2_cache_lines float

Number of cache lines in L2.

l1_lifetimes list[LifetimeType]

List of completed L1 lifetime objects.

l2_lifetimes list[LifetimeType]

List of completed L2 lifetime objects.

l1_most_recent_read dict[int, int]

Maps L1 address to the cycle of its most recent read hit.

l2_most_recent_read dict[int, int]

Maps L2 address to the cycle of its most recent read hit.

l1_current_lifetime_index dict[int, int]

Maps L1 address to the index of its currently active lifetime in the internal list.

l2_current_lifetime_index dict[int, int]

Maps L2 address to the index of its currently active lifetime in the internal list.

l1_lifetime_cycles ndarray

Array of completed L1 lifetimes in cycles.

l1_lifetime_ns ndarray

Array of completed L1 lifetimes in nanoseconds.

l2_lifetime_cycles ndarray

Array of completed L2 lifetimes in cycles.

l2_lifetime_ns ndarray

Array of completed L2 lifetimes in nanoseconds.

l1_read_count int

Total L1 read operations.

l1_write_count int

Total L1 write operations.

l2_read_count int

Total L2 read operations.

l2_write_count int

Total L2 write operations.

l1_read_cycles list[int]

List of unique cycles with L1 reads.

l1_write_cycles list[int]

List of unique cycles with L1 writes.

l2_read_cycles list[int]

List of unique cycles with L2 reads.

l2_write_cycles list[int]

List of unique cycles with L2 writes.

l1_read_cycle_count int

Count of unique cycles with L1 reads.

l1_write_cycle_count int

Count of unique cycles with L1 writes.

l2_read_cycle_count int

Count of unique cycles with L2 reads.

l2_write_cycle_count int

Count of unique cycles with L2 writes.

l1_unique_addrs int

Count of unique addresses seen in L1.

l2_unique_addrs int

Count of unique addresses seen in L2.

l1_zero_count int

Count of L1 lifetimes calculated as zero or incomplete.

l2_zero_count int

Count of L2 lifetimes calculated as zero or incomplete.

l1_lifetimes_count int

Count of valid, non-zero L1 lifetimes calculated.

l2_lifetimes_count int

Count of valid, non-zero L2 lifetimes calculated.

l1_write_policy WritePolicy

L1 write policy enum.

l1_write_allocation WriteAllocation

L1 write allocation enum.

l2_write_policy WritePolicy

L2 write policy enum.

l2_write_allocation WriteAllocation

L2 write allocation enum.

Parameters:

Name Type Description Default
kernel dict

Dictionary containing kernel metadata and log lines from read_cache_log.

required
log_file_path str

Path to the log directory.

required
log_file_base str

Base name of the log files.

required
kernel_name str

Actual kernel name. Defaults to None.

None
config_file_path str

Path to the GPGPU-Sim config file. Required.

None

Raises:

Type Description
AssertionError

If config_file_path is None.

FileNotFoundError

If the config file cannot be read.

Returns:

Name Type Description
None

Initializes parser state for lifetime analysis.
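The parser reports lifetimes in both cycles and nanoseconds using the `GPU_FREQ`/`CYCLE_TIME` constants. A minimal sketch of that conversion, using the 1593 MHz default from the source (the lifetime values here are illustrative, not from a real simulation):

```python
import numpy as np

GPU_FREQ = 1593                      # MHz; overridable via the GPU_FREQ env var
CYCLE_TIME = 1e9 / (GPU_FREQ * 1e6)  # nanoseconds per cycle

# Completed cache-line lifetimes measured in simulation cycles
lifetime_cycles = np.array([120, 4500, 88000])
lifetime_ns = CYCLE_TIME * lifetime_cycles

print(np.round(lifetime_ns, 2))
```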

Source code in python-scripts/accel_sim_parser.py
class SimulationParser:
    """
    Parses Accel-Sim cache log lines for a single kernel to calculate lifetime statistics.

    Processes log lines associated with one kernel launch, tracks cache line entries
    and exits based on load/store operations and cache status (hit/miss), considering
    the configured cache policies (write-allocate, write-back). It calculates the
    lifetime for each cache line instance and aggregates statistics.

    Attributes:
        GPU_FREQ (int): GPU frequency in MHz.
        CYCLE_TIME (float): GPU cycle time in nanoseconds.
        log_path (str): Path to the log directory.
        log_file (str): Base name of the log file.
        kernel_name (str): Name of the kernel being processed.
        kernel_id (int): ID of the kernel being processed.
        sim_cycles (int): Simulation cycles for this kernel.
        sim_insn (int): Instructions executed by this kernel.
        ipc (float): Instructions per cycle for this kernel.
        total_sim_cycles (int): Total simulation cycles up to this kernel.
        total_sim_insn (int): Total instructions executed up to this kernel.
        log_lines (list[str]): Log lines from `.sim_cache.log` for this kernel.
        sector_size (int): Cache line size in bytes.
        l1_size (int): L1 cache size in bytes.
        l1_cache_lines (float): Number of cache lines in L1.
        l2_size (int): L2 cache size in bytes.
        l2_cache_lines (float): Number of cache lines in L2.
        l1_lifetimes (list[LifetimeType]): List of completed L1 lifetime objects.
        l2_lifetimes (list[LifetimeType]): List of completed L2 lifetime objects.
        l1_most_recent_read (dict[int, int]): Maps L1 address to the cycle of its most recent read hit.
        l2_most_recent_read (dict[int, int]): Maps L2 address to the cycle of its most recent read hit.
        l1_current_lifetime_index (dict[int, int]): Maps L1 address to the index of its currently active lifetime in the internal list.
        l2_current_lifetime_index (dict[int, int]): Maps L2 address to the index of its currently active lifetime in the internal list.
        l1_lifetime_cycles (np.ndarray): Array of completed L1 lifetimes in cycles.
        l1_lifetime_ns (np.ndarray): Array of completed L1 lifetimes in nanoseconds.
        l2_lifetime_cycles (np.ndarray): Array of completed L2 lifetimes in cycles.
        l2_lifetime_ns (np.ndarray): Array of completed L2 lifetimes in nanoseconds.
        l1_read_count (int): Total L1 read operations.
        l1_write_count (int): Total L1 write operations.
        l2_read_count (int): Total L2 read operations.
        l2_write_count (int): Total L2 write operations.
        l1_read_cycles (list[int]): List of unique cycles with L1 reads.
        l1_write_cycles (list[int]): List of unique cycles with L1 writes.
        l2_read_cycles (list[int]): List of unique cycles with L2 reads.
        l2_write_cycles (list[int]): List of unique cycles with L2 writes.
        l1_read_cycle_count (int): Count of unique cycles with L1 reads.
        l1_write_cycle_count (int): Count of unique cycles with L1 writes.
        l2_read_cycle_count (int): Count of unique cycles with L2 reads.
        l2_write_cycle_count (int): Count of unique cycles with L2 writes.
        l1_unique_addrs (int): Count of unique addresses seen in L1.
        l2_unique_addrs (int): Count of unique addresses seen in L2.
        l1_zero_count (int): Count of L1 lifetimes calculated as zero or incomplete.
        l2_zero_count (int): Count of L2 lifetimes calculated as zero or incomplete.
        l1_lifetimes_count (int): Count of valid, non-zero L1 lifetimes calculated.
        l2_lifetimes_count (int): Count of valid, non-zero L2 lifetimes calculated.
        l1_write_policy (WritePolicy): L1 write policy enum.
        l1_write_allocation (WriteAllocation): L1 write allocation enum.
        l2_write_policy (WritePolicy): L2 write policy enum.
        l2_write_allocation (WriteAllocation): L2 write allocation enum.

    Args:
        kernel (dict): Dictionary containing kernel metadata and log lines from `read_cache_log`.
        log_file_path (str): Path to the log directory.
        log_file_base (str): Base name of the log files.
        kernel_name (str, optional): Actual kernel name. Defaults to None.
        config_file_path (str, optional): Path to the GPGPU-Sim config file. Required.

    Raises:
        AssertionError: If `config_file_path` is None.
        FileNotFoundError: If the config file cannot be read.

    Returns:
        None: Initializes parser state for lifetime analysis.
    """

    def __init__(self, kernel: dict, log_file_path: str, log_file_base: str, kernel_name: str = None, config_file_path: str = None):
        """
        Initialize the SimulationParser for a specific kernel.

        Args:
            kernel (dict): Dictionary containing kernel metadata and log lines from `read_cache_log`.
            log_file_path (str): Path to the log directory.
            log_file_base (str): Base name of the log files.
            kernel_name (str, optional): Actual kernel name (e.g., from `kernels.csv`). Defaults to None.
            config_file_path (str, optional): Path to the GPGPU-Sim config file. Required.

        Raises:
            AssertionError: If `config_file_path` is None.
            FileNotFoundError: If the config file cannot be read by `read_config_file`.
        """
        # Constants
        # Lovelace GPU has frequency of 2235
        self.GPU_FREQ = int(os.getenv('GPU_FREQ', 1593))
        # Time in ns for one cycle
        self.CYCLE_TIME = 1e9 / (self.GPU_FREQ * 1e6)

        self.log_path = log_file_path
        self.log_file = log_file_base

        # Kernel identifiers
        self.kernel_name = kernel_name
        self.kernel_id = kernel["kernel_id"]
        self.sim_cycles = kernel["gpu_sim_cycle"]
        self.sim_insn = kernel["gpu_sim_insn"]
        self.ipc = kernel["gpu_ipc"]
        self.total_sim_cycles = kernel["gpu_tot_sim_cycle"]
        self.total_sim_insn = kernel["gpu_tot_sim_insn"]

        # Log lines for this kernel
        self.log_lines = kernel["lines"]

        self.sector_size = 32
        self.l1_size = kernel.get("l1_config", 128 * 1024)
        self.l1_cache_lines = self.l1_size / self.sector_size
        self.l2_size = convert_size(os.getenv('L2_SIZE', "50MB"))
        self.l2_cache_lines = self.l2_size / self.sector_size

        # Data structures for cache lifetimes replaced by LifetimeType lists
        self.l1_lifetimes = []  # list of LifetimeType objects for L1
        self.l2_lifetimes = []  # list of LifetimeType objects for L2
        self.l1_most_recent_read = {}
        self.l2_most_recent_read = {}
        # Lookup tables remain unchanged:
        self.l1_current_lifetime_index = {}
        self.l2_current_lifetime_index = {}

        self.l1_lifetime_cycles = []
        self.l1_lifetime_ns = np.array([], dtype=np.float64)
        self.l2_lifetime_cycles = []
        self.l2_lifetime_ns = np.array([], dtype=np.float64)

        # Counters for instructions
        self.l1_read_cycles = []
        self.l1_read_cycle_count = 0
        self.l1_read_count = 0
        self.l1_write_cycles = []
        self.l1_write_cycle_count = 0
        self.l1_write_count = 0
        self.l2_read_cycles = []
        self.l2_read_cycle_count = 0
        self.l2_read_count = 0
        self.l2_write_cycles = []
        self.l2_write_cycle_count = 0
        self.l2_write_count = 0
        self.l1_unique_addrs = 0
        self.l2_unique_addrs = 0

        # Cache configurations
        assert config_file_path is not None, "Config file path must be provided."
        l1_config, l2_config = read_config_file(config_file_path)
        self.l1_write_policy = l1_config["write_policy"]
        self.l1_write_allocation = l1_config["write_allocation"]
        self.l2_write_policy = l2_config["write_policy"]
        self.l2_write_allocation = l2_config["write_allocation"]

    def process_cycle(self, line: str) -> int:
        """
        Extract the simulation cycle number from a log line.

        Args:
            line (str): A log line containing 'Cycle <num>'.

        Returns:
            int: The extracted cycle number.

        Raises:
            ValueError: If the cycle number cannot be parsed as an integer.
            IndexError: If the line format is unexpected.
        """
        cycle_str = line.split("Cycle ")[1].split()[0]
        cycle = int(cycle_str.strip(":"))
        return cycle

    def process_line(self, line: str):
        """
        Process a single cache log line to update lifetime tracking.

        Parses the line to identify L1 or L2 cache accesses, address, cycle,
        status (hit/miss), and operation type (load/store). Updates the internal
        lifetime tracking structures (`l1_lifetimes`, `l2_lifetimes`,
        `l1_current_lifetime_index`, `l2_current_lifetime_index`,
        `l1_most_recent_read`, `l2_most_recent_read`) based on the access and
        configured cache policies. Also updates read/write counters.

        Args:
            line (str): The cache log line to process.

        Returns:
            None: Modifies internal state.
        """
        # Process L1 cache lines
        # GPGPU-Sim Cycle 11097: Load instr from L1D cache at SM 0 bank 3 addr 2d61c4e0 status 2
        if "L1D cache at SM" in line:
            # Get the number after "status" to determine if it's a hit or miss
            status = int(line.split("status ")[1].split()[0])
            if status >= 3:
                return
            # Get the cycle number
            cycle = int(line.split("Cycle ")[1].split()[0].strip(":"))
            # Get the address after "addr" and convert to decimal int
            address = int(line.split("addr ")[1].split()[0], 16)

            # Look up this address's currently active lifetime, if any
            index = self.l1_current_lifetime_index.get(address, None)

            l1_write = "Store" in line
            l1_read = "Load" in line

            # End the lifetime of the cache line
            if (status == 2 or l1_write) and index is not None:
                start = self.l1_lifetimes[index].start
                # If an active lifetime exists, end it using the latest read
                last_read = self.l1_most_recent_read.get(address, start)
                self.l1_lifetimes[index].end = last_read if last_read >= start else None

            # Decide whether to create a new lifetime entry:
            # - On a miss (status == 2):
            #   * If an active entry was just ended above, always start a new one.
            #   * On a cold-start miss (no active entry), start one for a load,
            #     or for a store only under a write-allocate policy.
            # - On a store hit to an active entry, start a new one.
            new_entry = (
                # Cache miss
                (status == 2 and
                 # An active entry existed (and was just ended above)
                 (index is not None or
                  # Cold-start miss: allocate on load, or on store if the
                  # write-allocation policy permits it
                  (self.l1_write_allocation == WriteAllocation.WRITE_ALLOCATE or l1_read)))
                # Store hit to an active entry
                or (l1_write and index is not None))

            # If a new lifetime entry is needed, create it.
            if new_entry:
                new_lt = LifetimeType(address, cycle, None)
                self.l1_lifetimes.append(new_lt)
                self.l1_current_lifetime_index[address] = len(
                    self.l1_lifetimes) - 1

            # Process the instruction type
            if l1_write:
                self.l1_write_count += 1
                if cycle not in self.l1_write_cycles:
                    self.l1_write_cycles.append(cycle)
            elif l1_read:
                self.l1_read_count += 1
                # store the most recent read if this is a hit
                if status == 0:
                    self.l1_most_recent_read[address] = cycle
                if cycle not in self.l1_read_cycles:
                    self.l1_read_cycles.append(cycle)
            else:
                print(f"Warning: Unknown L1D instruction type: {line}")
                return

        # Process L2 cache lines
        elif "L2 Address" in line:
            # GPGPU-Sim Cycle 12969: MEMORY_SUBPARTITION_UNIT -  2 - Store Request to L2 Address=71082d62bb60, status=0
            status = int(line.split("status=")[1].split()[0])
            if status >= 3:
                return
            cycle = int(line.split("Cycle ")[1].split()[0].strip(":"))
            address = int(line.split("Address=")[1].split(",")[0], 16)
            # Get the bank after MEMORY_SUBPARTITION_UNIT
            bank = int(line.split("MEMORY_SUBPARTITION_UNIT - ")
                       [1].split(" - ")[0].strip())
            # if bank != 0:
            #     return

            # Look up this address's currently active lifetime, if any
            index = self.l2_current_lifetime_index.get(address, None)

            l2_write = "Store" in line
            l2_read = "Load" in line

            # End lifetime if needed
            if (status == 2 or l2_write) and index is not None:
                start = self.l2_lifetimes[index].start
                # If an active lifetime exists, end it using the latest read
                last_read = self.l2_most_recent_read.get(address, start)
                self.l2_lifetimes[index].end = last_read if last_read >= start else None
            # Decide whether to create a new lifetime entry:
            # - On a miss (status == 2):
            #   * If an active entry was just ended above, always start a new one.
            #   * On a cold-start miss (no active entry), start one for a load,
            #     or for a store only under a write-allocate policy.
            # - On a store hit to an active entry, start a new one.
            new_entry = (
                # Cache miss
                (status == 2 and
                    # An active entry existed (and was just ended above)
                    (index is not None or
                     # Cold-start miss: allocate on load, or on store if the
                     # write-allocation policy permits it
                     (self.l2_write_allocation == WriteAllocation.WRITE_ALLOCATE or l2_read)))
                # Store hit to an active entry
                or (l2_write and index is not None))
            # If a new lifetime entry is needed, create it.
            if new_entry:
                new_lt = LifetimeType(address, cycle, None)
                self.l2_lifetimes.append(new_lt)
                self.l2_current_lifetime_index[address] = len(
                    self.l2_lifetimes) - 1

            # Process the instruction type
            if l2_write:
                self.l2_write_count += 1
                if cycle not in self.l2_write_cycles:
                    self.l2_write_cycles.append(cycle)
            elif l2_read:
                self.l2_read_count += 1
                if status == 0:
                    self.l2_most_recent_read[address] = cycle
                if cycle not in self.l2_read_cycles:
                    self.l2_read_cycles.append(cycle)
            else:
                print(f"Warning: Unknown L2 instruction type: {line}")
                return

    def parse_log_file(self) -> list:
        """
        Parse all log lines for the kernel and finalize lifetime calculations.

        Iterates through `self.log_lines`, calling `process_line` for each.
        After processing all lines, it finalizes any remaining active lifetimes
        using the `l1_most_recent_read` and `l2_most_recent_read` dictionaries.
        Calculates lifetime durations in cycles and nanoseconds, storing them in
        `l1_lifetime_cycles`, `l1_lifetime_ns`, etc. Updates final counts for
        unique addresses and zero/incomplete lifetimes.

        Returns:
            list: A list containing the state needed to continue parsing for the
                  next kernel: `[l1_lifetimes_incomplete, l2_lifetimes_incomplete,
                  l1_most_recent_read, l2_most_recent_read]`.
                  `l1_lifetimes_incomplete` and `l2_lifetimes_incomplete` are lists
                  of `LifetimeType` objects whose `end` cycle is still None.
        """

        print(
            f"Starting with {len(self.l1_lifetimes)} imported L1 and "
            f"{len(self.l2_lifetimes)} imported L2 lifetimes.")

        line_count = 0
        for line in self.log_lines:
            if line_count % 10000 == 0:
                print(
                    f"\tProcessing line {line_count} of {len(self.log_lines)}")
            self.process_line(line)
            line_count += 1

        # Finalize L1 lifetime entries
        valid_l1 = []
        self.l1_zero_count = 0
        for lt in self.l1_lifetimes:
            if lt.end is not None:
                if lt.end > lt.start:
                    valid_l1.append(lt)
                elif lt.end == lt.start:
                    self.l1_zero_count += 1
            elif self.l1_most_recent_read.get(lt.address) is not None:
                lt.end = self.l1_most_recent_read[lt.address]
                if lt.end > lt.start:
                    valid_l1.append(lt)
                elif lt.end == lt.start:
                    self.l1_zero_count += 1

        self.l1_lifetime_cycles = np.array(
            [lt.calculate_lifetime() for lt in valid_l1])
        # Convert to nanoseconds
        self.l1_lifetime_ns = self.CYCLE_TIME * self.l1_lifetime_cycles

        # Finalize L2 lifetime entries
        valid_l2 = []
        self.l2_zero_count = 0
        for lt in self.l2_lifetimes:
            if lt.end is not None:
                if lt.end > lt.start:
                    valid_l2.append(lt)
                elif lt.end == lt.start:
                    self.l2_zero_count += 1
            elif self.l2_most_recent_read.get(lt.address) is not None:
                lt.end = self.l2_most_recent_read[lt.address]
                if lt.end > lt.start:
                    valid_l2.append(lt)
                elif lt.end == lt.start:
                    self.l2_zero_count += 1

        self.l2_lifetime_cycles = np.array(
            [lt.calculate_lifetime() for lt in valid_l2])
        # Convert to nanoseconds
        self.l2_lifetime_ns = self.CYCLE_TIME * self.l2_lifetime_cycles

        # Update read/write counters and unique addresses
        self.l1_read_cycle_count = len(self.l1_read_cycles)
        self.l1_write_cycle_count = len(self.l1_write_cycles)
        self.l2_read_cycle_count = len(self.l2_read_cycles)
        self.l2_write_cycle_count = len(self.l2_write_cycles)
        self.l1_unique_addrs = len({lt.address for lt in self.l1_lifetimes})
        self.l2_unique_addrs = len({lt.address for lt in self.l2_lifetimes})

        l1_lifetimes_export = [
            lt for lt in self.l1_lifetimes if lt.end is None]
        l2_lifetimes_export = [
            lt for lt in self.l2_lifetimes if lt.end is None]
        self.l1_zero_count += len(l1_lifetimes_export)
        self.l2_zero_count += len(l2_lifetimes_export)
        self.l1_lifetimes = valid_l1
        self.l2_lifetimes = valid_l2
        self.l1_lifetimes_count = len(self.l1_lifetimes)
        self.l2_lifetimes_count = len(self.l2_lifetimes)

        print(
            f"Kernel {self.kernel_name} (ID: {self.kernel_id}): Processed {len(self.log_lines)} lines."
        )
        print(
            f"Collected {len(self.l1_lifetime_cycles)} valid L1 lifetime entries."
        )
        print(
            f"Collected {len(self.l2_lifetime_cycles)} valid L2 lifetime entries."
        )
        print(
            f"Exporting {len(l1_lifetimes_export)} L1 lifetimes and {len(l2_lifetimes_export)} L2 lifetimes.")

        return [
            l1_lifetimes_export,
            l2_lifetimes_export,
            self.l1_most_recent_read,
            self.l2_most_recent_read
        ]

    def get_instruction_stats(self):
        """
        Calculate instruction frequencies and cache utilization statistics.

        Computes L1/L2 load/store frequencies (in MHz) based on the number of
        unique cycles with corresponding operations and the total simulation cycles
        for the kernel. Also calculates overall load/store frequencies and
        L1/L2 cache utilization based on unique addresses accessed.

        Returns:
            dict: A dictionary containing various statistics:
                - 'kernel_name', 'kernel_id'
                - 'l1_load_count', 'l1_store_count': Total L1 operations.
                - 'l1_read_cycles', 'l1_write_cycles': Unique cycles with L1 ops.
                - 'l1_load_frequency', 'l1_store_frequency': Frequencies in MHz.
                - 'l2_load_count', 'l2_store_count': Total L2 operations.
                - 'l2_read_cycles', 'l2_write_cycles': Unique cycles with L2 ops.
                - 'l2_load_frequency', 'l2_store_frequency': Frequencies in MHz.
                - 'l1_unique_addrs', 'l2_unique_addrs': Unique addresses accessed.
                - 'l1_utilization', 'l2_utilization': Cache utilization ratios.
                - 'load_count', 'store_count': Total unique cycles with loads/stores.
                - 'load_frequency', 'store_frequency': Overall frequencies in MHz.
                - 'l1_zero_count', 'l2_zero_count': Counts of zero/incomplete lifetimes.
                - 'l1_lifetimes_count', 'l2_lifetimes_count': Counts of valid lifetimes.
        """
        if self.sim_cycles > 0:
            l1_load_freq = self.GPU_FREQ * self.l1_read_cycle_count / self.sim_cycles
            l1_store_freq = self.GPU_FREQ * self.l1_write_cycle_count / self.sim_cycles
            l2_load_freq = self.GPU_FREQ * self.l2_read_cycle_count / self.sim_cycles
            l2_store_freq = self.GPU_FREQ * self.l2_write_cycle_count / self.sim_cycles
            # Get the union between the L1 and L2 read cycle sets
            load_cycles = set(self.l1_read_cycles).union(
                set(self.l2_read_cycles))
            # Get the union between the L1 and L2 write cycle sets
            store_cycles = set(self.l1_write_cycles).union(
                set(self.l2_write_cycles))
            load_freq = self.GPU_FREQ * len(load_cycles) / self.sim_cycles
            store_freq = self.GPU_FREQ * len(store_cycles) / self.sim_cycles
        else:
            l1_load_freq = 0
            l1_store_freq = 0
            l2_load_freq = 0
            l2_store_freq = 0
            load_freq = 0
            store_freq = 0

        return {
            'kernel_name': self.kernel_name,
            'kernel_id': self.kernel_id,
            # 'l1_load_count': self.l1_read_cycle_count,
            # 'l1_store_count': self.l1_write_cycle_count,
            'l1_load_count': self.l1_read_count,
            'l1_store_count': self.l1_write_count,
            'l1_read_cycles': self.l1_read_cycle_count,
            'l1_write_cycles': self.l1_write_cycle_count,
            'l1_load_frequency': l1_load_freq,
            'l1_store_frequency': l1_store_freq,
            # 'l2_load_count': self.l2_read_cycle_count,
            # 'l2_store_count': self.l2_write_cycle_count,
            'l2_load_count': self.l2_read_count,
            'l2_store_count': self.l2_write_count,
            'l2_read_cycles': self.l2_read_cycle_count,
            'l2_write_cycles': self.l2_write_cycle_count,
            'l2_load_frequency': l2_load_freq,
            'l2_store_frequency': l2_store_freq,
            "l1_unique_addrs": self.l1_unique_addrs,
            "l2_unique_addrs": self.l2_unique_addrs,
            "l1_utilization": self.l1_unique_addrs / self.l1_cache_lines,
            "l2_utilization": self.l2_unique_addrs / self.l2_cache_lines,
            'load_count': self.l1_read_cycle_count + self.l2_read_cycle_count,
            'store_count': self.l1_write_cycle_count + self.l2_write_cycle_count,
            'load_frequency': load_freq,
            'store_frequency': store_freq,
            'l1_zero_count': self.l1_zero_count,
            'l2_zero_count': self.l2_zero_count,
            'l1_lifetimes_count': self.l1_lifetimes_count,
            'l2_lifetimes_count': self.l2_lifetimes_count,
        }

    def fine_grained_df(self):
        """
        Create DataFrames containing fine-grained lifetime data for L1 and L2 caches.

        Generates two pandas DataFrames, one for L1 and one for L2, containing
        individual cache line lifetime entries. Each row represents a completed
        lifetime instance.

        Returns:
            Tuple[pd.DataFrame, pd.DataFrame]: A tuple containing:
                - l1_df: DataFrame with columns 'kernel_id', 'address' (hex),
                         'lifetime_cycles', 'lifetime_ns'.
                - l2_df: DataFrame with columns 'kernel_id', 'address' (hex),
                         'lifetime_cycles', 'lifetime_ns'.
        """
        # Use valid lifetimes computed in parse_log_file
        min_l1_len = len(self.l1_lifetime_cycles)
        min_l2_len = len(self.l2_lifetime_cycles)
        l1_addresses = [hex(int(lt.address)) for lt in self.l1_lifetimes]
        l1_df = pd.DataFrame({
            "kernel_id": self.kernel_id,
            "address": l1_addresses[:min_l1_len],
            "lifetime_cycles": self.l1_lifetime_cycles[:min_l1_len],
            "lifetime_ns": self.l1_lifetime_ns[:min_l1_len]
        })
        l2_addresses = [hex(int(lt.address)) for lt in self.l2_lifetimes]
        l2_df = pd.DataFrame({
            "kernel_id": self.kernel_id,
            "address": l2_addresses[:min_l2_len],
            "lifetime_cycles": self.l2_lifetime_cycles[:min_l2_len],
            "lifetime_ns": self.l2_lifetime_ns[:min_l2_len]
        })
        return l1_df, l2_df

    def coarse_grained_dict(self):
        """
        Generate a dictionary containing coarse-grained, kernel-level statistics.

        Aggregates lifetime data (mean, median, 90th percentile, max) and combines
        it with instruction statistics from `get_instruction_stats()` and cache
        configuration details into a single dictionary summarizing the kernel's behavior.

        Returns:
            dict: A dictionary containing aggregated statistics for the kernel,
                  including lifetime metrics (mean, median, etc. in microseconds),
                  instruction counts and frequencies, utilization, cache configuration,
                  and lifetime counts. Keys match the column names used for the
                  coarse-grained CSV output.
        """
        # Check if lifetime data is empty
        if len(self.l1_lifetime_ns) == 0:
            l1_mean, l1_median, l1_90, l1_max = 0, 0, 0, 0
        else:
            l1_mean = np.mean(self.l1_lifetime_ns)
            l1_median = np.median(self.l1_lifetime_ns)
            l1_90 = np.percentile(self.l1_lifetime_ns, 90)
            l1_max = np.max(self.l1_lifetime_ns)
        if len(self.l2_lifetime_ns) == 0:
            l2_mean, l2_median, l2_90, l2_max = 0, 0, 0, 0
        else:
            l2_mean = np.mean(self.l2_lifetime_ns)
            l2_median = np.median(self.l2_lifetime_ns)
            l2_90 = np.percentile(self.l2_lifetime_ns, 90)
            l2_max = np.max(self.l2_lifetime_ns)
        stats = self.get_instruction_stats()
        return {
            "Kernel ID": self.kernel_id,
            "Mangled Names": self.kernel_name,
            "L1 Lifetime": l1_mean / 1e3,
            "L1 Lifetime Median": l1_median / 1e3,
            "L1 Lifetime 90%-tile": l1_90 / 1e3,
            "L1 Lifetime Max": l1_max / 1e3,
            "L2 Lifetime": l2_mean / 1e3,
            "L2 Lifetime Median": l2_median / 1e3,
            "L2 Lifetime 90%-tile": l2_90 / 1e3,
            "L2 Lifetime Max": l2_max / 1e3,
            "L1 Read Count": stats["l1_load_count"],
            "L1 Read Cycles": stats["l1_read_cycles"],
            "L1 Write Count": stats["l1_store_count"],
            "L1 Write Cycles": stats["l1_write_cycles"],
            "L2 Read Count": stats["l2_load_count"],
            "L2 Read Cycles": stats["l2_read_cycles"],
            "L2 Write Count": stats["l2_store_count"],
            "L2 Write Cycles": stats["l2_write_cycles"],
            "L1 Zero Count": stats["l1_zero_count"],
            "L2 Zero Count": stats["l2_zero_count"],
            "L1 Read Frequency": stats["l1_load_frequency"],
            "L1 Write Frequency": stats["l1_store_frequency"],
            "L2 Read Frequency": stats["l2_load_frequency"],
            "L2 Write Frequency": stats["l2_store_frequency"],
            "L1 Utilization": stats["l1_utilization"],
            "L1 Size": self.l1_size,
            "L1 Unique Addresses": stats["l1_unique_addrs"],
            "L2 Utilization": stats["l2_utilization"],
            "L2 Size": self.l2_size,
            "L2 Unique Addresses": stats["l2_unique_addrs"],
            "Total Read Frequency": stats["load_frequency"],
            "Total Write Frequency": stats["store_frequency"],
            "Total Cycles": self.sim_cycles,
            "L1 Write Policy": self.l1_write_policy.name,
            "L1 Write Allocation": self.l1_write_allocation.name,
            "L2 Write Policy": self.l2_write_policy.name,
            "L2 Write Allocation": self.l2_write_allocation.name,
            "L1 Lifetime Count": self.l1_lifetimes_count,
            "L2 Lifetime Count": self.l2_lifetimes_count,
        }

    def import_cache_states(self, cache_states):
        """
        Import cache state from the previous kernel to continue lifetime tracking.

        Takes the state returned by `parse_log_file` from the previously processed
        kernel and initializes the current parser instance with it. This allows
        lifetimes spanning across kernel boundaries to be tracked correctly.

        Args:
            cache_states (list): The list returned by `parse_log_file` containing
                                 incomplete lifetimes and most recent read dictionaries.
                                 `[l1_lifetimes_incomplete, l2_lifetimes_incomplete,
                                 l1_most_recent_read, l2_most_recent_read]`.

        Returns:
            None: Updates internal state (`l1_lifetimes`, `l2_lifetimes`,
                  `l1_most_recent_read`, `l2_most_recent_read`,
                  `l1_current_lifetime_index`, `l2_current_lifetime_index`).
        """
        # Unpack the cache states
        l1_lifetimes, l2_lifetimes, self.l1_most_recent_read, self.l2_most_recent_read = cache_states
        # Assign the lifetimes to the class variables
        self.l1_lifetimes = l1_lifetimes
        self.l2_lifetimes = l2_lifetimes
        # print length of the lifetimes
        print(
            f"Importing {len(self.l1_lifetimes)} L1 lifetimes and {len(self.l2_lifetimes)} L2 lifetimes.")
        # Rebuild the current lifetime index dictionaries
        # Find the last index for each address
        self.l1_current_lifetime_index = {}
        for i, lt in enumerate(self.l1_lifetimes):
            self.l1_current_lifetime_index[lt.address] = i
        self.l2_current_lifetime_index = {}
        for i, lt in enumerate(self.l2_lifetimes):
            self.l2_current_lifetime_index[lt.address] = i

__init__(kernel, log_file_path, log_file_base, kernel_name=None, config_file_path=None)

Initialize the SimulationParser for a specific kernel.

Parameters:

- kernel (dict): Dictionary containing kernel metadata and log lines from read_cache_log. Required.
- log_file_path (str): Path to the log directory. Required.
- log_file_base (str): Base name of the log files. Required.
- kernel_name (str): Actual kernel name (e.g., from kernels.csv). Defaults to None.
- config_file_path (str): Path to the GPGPU-Sim config file. Required (default None).

Raises:

- AssertionError: If config_file_path is None.
- FileNotFoundError: If the config file cannot be read by read_config_file.

Source code in python-scripts/accel_sim_parser.py
def __init__(self, kernel: dict, log_file_path: str, log_file_base: str, kernel_name: str = None, config_file_path: str = None):
    """
    Initialize the SimulationParser for a specific kernel.

    Args:
        kernel (dict): Dictionary containing kernel metadata and log lines from `read_cache_log`.
        log_file_path (str): Path to the log directory.
        log_file_base (str): Base name of the log files.
        kernel_name (str, optional): Actual kernel name (e.g., from `kernels.csv`). Defaults to None.
        config_file_path (str, optional): Path to the GPGPU-Sim config file. Required.

    Raises:
        AssertionError: If `config_file_path` is None.
        FileNotFoundError: If the config file cannot be read by `read_config_file`.
    """
    # Constants
    # GPU core clock in MHz (default 1593), overridable via the GPU_FREQ
    # environment variable; a Lovelace GPU runs at a frequency of 2235
    self.GPU_FREQ = int(os.getenv('GPU_FREQ', 1593))
    # Time in ns for one cycle
    self.CYCLE_TIME = 1e9 / (self.GPU_FREQ * 1e6)

    self.log_path = log_file_path
    self.log_file = log_file_base

    # Kernel identifiers
    self.kernel_name = kernel_name
    self.kernel_id = kernel["kernel_id"]
    self.sim_cycles = kernel["gpu_sim_cycle"]
    self.sim_insn = kernel["gpu_sim_insn"]
    self.ipc = kernel["gpu_ipc"]
    self.total_sim_cycles = kernel["gpu_tot_sim_cycle"]
    self.total_sim_insn = kernel["gpu_tot_sim_insn"]

    # Log lines for this kernel
    self.log_lines = kernel["lines"]

    self.sector_size = 32
    self.l1_size = kernel.get("l1_config", 128 * 1024)
    self.l1_cache_lines = self.l1_size / self.sector_size
    self.l2_size = convert_size(os.getenv('L2_SIZE', "50MB"))
    self.l2_cache_lines = self.l2_size / self.sector_size

    # Data structures for cache lifetimes replaced by LifetimeType lists
    self.l1_lifetimes = []  # list of LifetimeType objects for L1
    self.l2_lifetimes = []  # list of LifetimeType objects for L2
    self.l1_most_recent_read = {}
    self.l2_most_recent_read = {}
    # Lookup tables remain unchanged:
    self.l1_current_lifetime_index = {}
    self.l2_current_lifetime_index = {}

    self.l1_lifetime_cycles = []
    self.l1_lifetime_ns = np.array([], dtype=np.float64)
    self.l2_lifetime_cycles = []
    self.l2_lifetime_ns = np.array([], dtype=np.float64)

    # Counters for instructions
    self.l1_read_cycles = []
    self.l1_read_cycle_count = 0
    self.l1_read_count = 0
    self.l1_write_cycles = []
    self.l1_write_cycle_count = 0
    self.l1_write_count = 0
    self.l2_read_cycles = []
    self.l2_read_cycle_count = 0
    self.l2_read_count = 0
    self.l2_write_cycles = []
    self.l2_write_cycle_count = 0
    self.l2_write_count = 0
    self.l1_unique_addrs = 0
    self.l2_unique_addrs = 0

    # Cache configurations
    assert config_file_path is not None, "Config file path must be provided."
    l1_config, l2_config = read_config_file(config_file_path)
    self.l1_write_policy = l1_config["write_policy"]
    self.l1_write_allocation = l1_config["write_allocation"]
    self.l2_write_policy = l2_config["write_policy"]
    self.l2_write_allocation = l2_config["write_allocation"]
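The constructor derives its timing and capacity constants from two environment variables, `GPU_FREQ` (core clock in MHz) and `L2_SIZE` (a size string). The sketch below shows those derivations in isolation; the `convert_size` helper here is a simplified stand-in for the real helper in the parser module.

```python
import os

def convert_size(size_str: str) -> int:
    """Simplified stand-in for the module's convert_size helper."""
    units = {"GB": 1024 ** 3, "MB": 1024 ** 2, "KB": 1024}
    for suffix, factor in units.items():
        if size_str.upper().endswith(suffix):
            return int(float(size_str[:-len(suffix)]) * factor)
    return int(size_str)

# Same derivations as __init__: clock in MHz -> ns per cycle,
# cache size in bytes -> number of 32-byte sectors.
gpu_freq_mhz = int(os.getenv("GPU_FREQ", 1593))
cycle_time_ns = 1e9 / (gpu_freq_mhz * 1e6)
sector_size = 32
l2_size = convert_size(os.getenv("L2_SIZE", "50MB"))
l2_cache_lines = l2_size // sector_size
```

With the defaults, one cycle is roughly 0.63 ns and a 50 MB L2 holds 1,638,400 32-byte sectors.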

coarse_grained_dict()

Generate a dictionary containing coarse-grained, kernel-level statistics.

Aggregates lifetime data (mean, median, 90th percentile, max) and combines it with instruction statistics from get_instruction_stats() and cache configuration details into a single dictionary summarizing the kernel's behavior.

Returns:

- dict: A dictionary containing aggregated statistics for the kernel, including lifetime metrics (mean, median, etc., in microseconds), instruction counts and frequencies, utilization, cache configuration, and lifetime counts. Keys match the column names used for the coarse-grained CSV output.

Source code in python-scripts/accel_sim_parser.py
def coarse_grained_dict(self):
    """
    Generate a dictionary containing coarse-grained, kernel-level statistics.

    Aggregates lifetime data (mean, median, 90th percentile, max) and combines
    it with instruction statistics from `get_instruction_stats()` and cache
    configuration details into a single dictionary summarizing the kernel's behavior.

    Returns:
        dict: A dictionary containing aggregated statistics for the kernel,
              including lifetime metrics (mean, median, etc. in microseconds),
              instruction counts and frequencies, utilization, cache configuration,
              and lifetime counts. Keys match the column names used for the
              coarse-grained CSV output.
    """
    # Check if lifetime data is empty
    if len(self.l1_lifetime_ns) == 0:
        l1_mean, l1_median, l1_90, l1_max = 0, 0, 0, 0
    else:
        l1_mean = np.mean(self.l1_lifetime_ns)
        l1_median = np.median(self.l1_lifetime_ns)
        l1_90 = np.percentile(self.l1_lifetime_ns, 90)
        l1_max = np.max(self.l1_lifetime_ns)
    if len(self.l2_lifetime_ns) == 0:
        l2_mean, l2_median, l2_90, l2_max = 0, 0, 0, 0
    else:
        l2_mean = np.mean(self.l2_lifetime_ns)
        l2_median = np.median(self.l2_lifetime_ns)
        l2_90 = np.percentile(self.l2_lifetime_ns, 90)
        l2_max = np.max(self.l2_lifetime_ns)
    stats = self.get_instruction_stats()
    return {
        "Kernel ID": self.kernel_id,
        "Mangled Names": self.kernel_name,
        "L1 Lifetime": l1_mean / 1e3,
        "L1 Lifetime Median": l1_median / 1e3,
        "L1 Lifetime 90%-tile": l1_90 / 1e3,
        "L1 Lifetime Max": l1_max / 1e3,
        "L2 Lifetime": l2_mean / 1e3,
        "L2 Lifetime Median": l2_median / 1e3,
        "L2 Lifetime 90%-tile": l2_90 / 1e3,
        "L2 Lifetime Max": l2_max / 1e3,
        "L1 Read Count": stats["l1_load_count"],
        "L1 Read Cycles": stats["l1_read_cycles"],
        "L1 Write Count": stats["l1_store_count"],
        "L1 Write Cycles": stats["l1_write_cycles"],
        "L2 Read Count": stats["l2_load_count"],
        "L2 Read Cycles": stats["l2_read_cycles"],
        "L2 Write Count": stats["l2_store_count"],
        "L2 Write Cycles": stats["l2_write_cycles"],
        "L1 Zero Count": stats["l1_zero_count"],
        "L2 Zero Count": stats["l2_zero_count"],
        "L1 Read Frequency": stats["l1_load_frequency"],
        "L1 Write Frequency": stats["l1_store_frequency"],
        "L2 Read Frequency": stats["l2_load_frequency"],
        "L2 Write Frequency": stats["l2_store_frequency"],
        "L1 Utilization": stats["l1_utilization"],
        "L1 Size": self.l1_size,
        "L1 Unique Addresses": stats["l1_unique_addrs"],
        "L2 Utilization": stats["l2_utilization"],
        "L2 Size": self.l2_size,
        "L2 Unique Addresses": stats["l2_unique_addrs"],
        "Total Read Frequency": stats["load_frequency"],
        "Total Write Frequency": stats["store_frequency"],
        "Total Cycles": self.sim_cycles,
        "L1 Write Policy": self.l1_write_policy.name,
        "L1 Write Allocation": self.l1_write_allocation.name,
        "L2 Write Policy": self.l2_write_policy.name,
        "L2 Write Allocation": self.l2_write_allocation.name,
        "L1 Lifetime Count": self.l1_lifetimes_count,
        "L2 Lifetime Count": self.l2_lifetimes_count,
    }
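Because every kernel's `coarse_grained_dict()` uses the same keys, the per-kernel dictionaries stack directly into the kernel-level CSV. The sketch below is illustrative, not the script's actual driver: the rows are hypothetical and only a few of the real columns are shown.

```python
import pandas as pd

# Illustrative per-kernel rows in the shape coarse_grained_dict() returns;
# only a handful of the real columns are shown.
rows = [
    {"Kernel ID": 1, "Mangled Names": "_Z6kernelA", "L1 Lifetime": 12.5, "L2 Lifetime": 40.2},
    {"Kernel ID": 2, "Mangled Names": "_Z6kernelB", "L1 Lifetime": 9.1, "L2 Lifetime": 35.7},
]
coarse_df = pd.DataFrame(rows)
csv_text = coarse_df.to_csv(index=False)  # one row per kernel, dict keys as header
```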

fine_grained_df()

Create DataFrames containing fine-grained lifetime data for L1 and L2 caches.

Generates two pandas DataFrames, one for L1 and one for L2, containing individual cache line lifetime entries. Each row represents a completed lifetime instance.

Returns:

- Tuple[pd.DataFrame, pd.DataFrame]: A tuple containing:
    - l1_df: DataFrame with columns 'kernel_id', 'address' (hex), 'lifetime_cycles', 'lifetime_ns'.
    - l2_df: DataFrame with the same columns for the L2 cache.

Source code in python-scripts/accel_sim_parser.py
def fine_grained_df(self):
    """
    Create DataFrames containing fine-grained lifetime data for L1 and L2 caches.

    Generates two pandas DataFrames, one for L1 and one for L2, containing
    individual cache line lifetime entries. Each row represents a completed
    lifetime instance.

    Returns:
        Tuple[pd.DataFrame, pd.DataFrame]: A tuple containing:
            - l1_df: DataFrame with columns 'kernel_id', 'address' (hex),
                     'lifetime_cycles', 'lifetime_ns'.
            - l2_df: DataFrame with columns 'kernel_id', 'address' (hex),
                     'lifetime_cycles', 'lifetime_ns'.
    """
    # Use valid lifetimes computed in parse_log_file
    min_l1_len = len(self.l1_lifetime_cycles)
    min_l2_len = len(self.l2_lifetime_cycles)
    l1_addresses = [hex(int(lt.address)) for lt in self.l1_lifetimes]
    l1_df = pd.DataFrame({
        "kernel_id": self.kernel_id,
        "address": l1_addresses[:min_l1_len],
        "lifetime_cycles": self.l1_lifetime_cycles[:min_l1_len],
        "lifetime_ns": self.l1_lifetime_ns[:min_l1_len]
    })
    l2_addresses = [hex(int(lt.address)) for lt in self.l2_lifetimes]
    l2_df = pd.DataFrame({
        "kernel_id": self.kernel_id,
        "address": l2_addresses[:min_l2_len],
        "lifetime_cycles": self.l2_lifetime_cycles[:min_l2_len],
        "lifetime_ns": self.l2_lifetime_ns[:min_l2_len]
    })
    return l1_df, l2_df
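Since each call yields one DataFrame per kernel with a fixed schema, a driver can stack them into a single run-level table before writing the fine-grained CSV. The frames below are hypothetical examples of that shape; the stacking itself is the standard `pd.concat` pattern rather than code from this module.

```python
import pandas as pd

# Hypothetical per-kernel frames in the shape fine_grained_df() returns.
l1_k1 = pd.DataFrame({"kernel_id": 1, "address": ["0x10", "0x20"],
                      "lifetime_cycles": [4, 9], "lifetime_ns": [2.51, 5.65]})
l1_k2 = pd.DataFrame({"kernel_id": 2, "address": ["0x10"],
                      "lifetime_cycles": [3], "lifetime_ns": [1.88]})
# Stack into one run-level table; the kernel_id column keeps rows attributable.
all_l1 = pd.concat([l1_k1, l1_k2], ignore_index=True)
```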

get_instruction_stats()

Calculate instruction frequencies and cache utilization statistics.

Computes L1/L2 load/store frequencies (in MHz) based on the number of unique cycles with corresponding operations and the total simulation cycles for the kernel. Also calculates overall load/store frequencies and L1/L2 cache utilization based on unique addresses accessed.

Returns:

- dict: A dictionary containing:
    - 'kernel_name', 'kernel_id'
    - 'l1_load_count', 'l1_store_count': Total L1 operations.
    - 'l1_read_cycles', 'l1_write_cycles': Unique cycles with L1 ops.
    - 'l1_load_frequency', 'l1_store_frequency': Frequencies in MHz.
    - 'l2_load_count', 'l2_store_count': Total L2 operations.
    - 'l2_read_cycles', 'l2_write_cycles': Unique cycles with L2 ops.
    - 'l2_load_frequency', 'l2_store_frequency': Frequencies in MHz.
    - 'l1_unique_addrs', 'l2_unique_addrs': Unique addresses accessed.
    - 'l1_utilization', 'l2_utilization': Cache utilization ratios.
    - 'load_count', 'store_count': Total unique cycles with loads/stores.
    - 'load_frequency', 'store_frequency': Overall frequencies in MHz.
    - 'l1_zero_count', 'l2_zero_count': Counts of zero/incomplete lifetimes.
    - 'l1_lifetimes_count', 'l2_lifetimes_count': Counts of valid lifetimes.

Source code in python-scripts/accel_sim_parser.py
def get_instruction_stats(self):
    """
    Calculate instruction frequencies and cache utilization statistics.

    Computes L1/L2 load/store frequencies (in MHz) based on the number of
    unique cycles with corresponding operations and the total simulation cycles
    for the kernel. Also calculates overall load/store frequencies and
    L1/L2 cache utilization based on unique addresses accessed.

    Returns:
        dict: A dictionary containing various statistics:
            - 'kernel_name', 'kernel_id'
            - 'l1_load_count', 'l1_store_count': Total L1 operations.
            - 'l1_read_cycles', 'l1_write_cycles': Unique cycles with L1 ops.
            - 'l1_load_frequency', 'l1_store_frequency': Frequencies in MHz.
            - 'l2_load_count', 'l2_store_count': Total L2 operations.
            - 'l2_read_cycles', 'l2_write_cycles': Unique cycles with L2 ops.
            - 'l2_load_frequency', 'l2_store_frequency': Frequencies in MHz.
            - 'l1_unique_addrs', 'l2_unique_addrs': Unique addresses accessed.
            - 'l1_utilization', 'l2_utilization': Cache utilization ratios.
            - 'load_count', 'store_count': Total unique cycles with loads/stores.
            - 'load_frequency', 'store_frequency': Overall frequencies in MHz.
            - 'l1_zero_count', 'l2_zero_count': Counts of zero/incomplete lifetimes.
            - 'l1_lifetimes_count', 'l2_lifetimes_count': Counts of valid lifetimes.
    """
    if self.sim_cycles > 0:
        l1_load_freq = self.GPU_FREQ * self.l1_read_cycle_count / self.sim_cycles
        l1_store_freq = self.GPU_FREQ * self.l1_write_cycle_count / self.sim_cycles
        l2_load_freq = self.GPU_FREQ * self.l2_read_cycle_count / self.sim_cycles
        l2_store_freq = self.GPU_FREQ * self.l2_write_cycle_count / self.sim_cycles
        # Get the union between the L1 and L2 read cycle sets
        load_cycles = set(self.l1_read_cycles).union(
            set(self.l2_read_cycles))
        # Get the union between the L1 and L2 write cycle sets
        store_cycles = set(self.l1_write_cycles).union(
            set(self.l2_write_cycles))
        load_freq = self.GPU_FREQ * len(load_cycles) / self.sim_cycles
        store_freq = self.GPU_FREQ * len(store_cycles) / self.sim_cycles
    else:
        l1_load_freq = 0
        l1_store_freq = 0
        l2_load_freq = 0
        l2_store_freq = 0
        load_freq = 0
        store_freq = 0

    return {
        'kernel_name': self.kernel_name,
        'kernel_id': self.kernel_id,
        # 'l1_load_count': self.l1_read_cycle_count,
        # 'l1_store_count': self.l1_write_cycle_count,
        'l1_load_count': self.l1_read_count,
        'l1_store_count': self.l1_write_count,
        'l1_read_cycles': self.l1_read_cycle_count,
        'l1_write_cycles': self.l1_write_cycle_count,
        'l1_load_frequency': l1_load_freq,
        'l1_store_frequency': l1_store_freq,
        # 'l2_load_count': self.l2_read_cycle_count,
        # 'l2_store_count': self.l2_write_cycle_count,
        'l2_load_count': self.l2_read_count,
        'l2_store_count': self.l2_write_count,
        'l2_read_cycles': self.l2_read_cycle_count,
        'l2_write_cycles': self.l2_write_cycle_count,
        'l2_load_frequency': l2_load_freq,
        'l2_store_frequency': l2_store_freq,
        "l1_unique_addrs": self.l1_unique_addrs,
        "l2_unique_addrs": self.l2_unique_addrs,
        "l1_utilization": self.l1_unique_addrs / self.l1_cache_lines,
        "l2_utilization": self.l2_unique_addrs / self.l2_cache_lines,
        'load_count': self.l1_read_cycle_count + self.l2_read_cycle_count,
        'store_count': self.l1_write_cycle_count + self.l2_write_cycle_count,
        'load_frequency': load_freq,
        'store_frequency': store_freq,
        'l1_zero_count': self.l1_zero_count,
        'l2_zero_count': self.l2_zero_count,
        'l1_lifetimes_count': self.l1_lifetimes_count,
        'l2_lifetimes_count': self.l2_lifetimes_count,
    }
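The frequency formula above scales the core clock by the fraction of simulated cycles in which an operation was active. A worked example with made-up numbers:

```python
# Worked example of the frequency formula: an operation active in some
# fraction of simulated cycles runs at that fraction of the core clock (MHz).
GPU_FREQ = 1593                     # core clock in MHz (the parser's default)
sim_cycles = 100_000                # gpu_sim_cycle for the kernel
l1_read_cycles = [10, 12, 12, 40]   # cycles that saw at least one L1 load
l1_read_cycle_count = len(set(l1_read_cycles))  # unique cycles only

l1_load_frequency = GPU_FREQ * l1_read_cycle_count / sim_cycles
```

Here loads occur in 3 of 100,000 cycles, so the effective L1 load frequency is 1593 * 3 / 100000 ≈ 0.048 MHz.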

import_cache_states(cache_states)

Import cache state from the previous kernel to continue lifetime tracking.

Takes the state returned by parse_log_file from the previously processed kernel and initializes the current parser instance with it. This allows lifetimes spanning across kernel boundaries to be tracked correctly.

Parameters:

- cache_states (list): The list returned by parse_log_file containing incomplete lifetimes and most-recent-read dictionaries: [l1_lifetimes_incomplete, l2_lifetimes_incomplete, l1_most_recent_read, l2_most_recent_read]. Required.

Returns:

- None: Updates internal state (l1_lifetimes, l2_lifetimes, l1_most_recent_read, l2_most_recent_read, l1_current_lifetime_index, l2_current_lifetime_index).

Source code in python-scripts/accel_sim_parser.py
def import_cache_states(self, cache_states):
    """
    Import cache state from the previous kernel to continue lifetime tracking.

    Takes the state returned by `parse_log_file` from the previously processed
    kernel and initializes the current parser instance with it. This allows
    lifetimes spanning across kernel boundaries to be tracked correctly.

    Args:
        cache_states (list): The list returned by `parse_log_file` containing
                             incomplete lifetimes and most recent read dictionaries.
                             `[l1_lifetimes_incomplete, l2_lifetimes_incomplete,
                             l1_most_recent_read, l2_most_recent_read]`.

    Returns:
        None: Updates internal state (`l1_lifetimes`, `l2_lifetimes`,
              `l1_most_recent_read`, `l2_most_recent_read`,
              `l1_current_lifetime_index`, `l2_current_lifetime_index`).
    """
    # Unpack the cache states
    l1_lifetimes, l2_lifetimes, self.l1_most_recent_read, self.l2_most_recent_read = cache_states
    # Assign the lifetimes to the class variables
    self.l1_lifetimes = l1_lifetimes
    self.l2_lifetimes = l2_lifetimes
    # print length of the lifetimes
    print(
        f"Importing {len(self.l1_lifetimes)} L1 lifetimes and {len(self.l2_lifetimes)} L2 lifetimes.")
    # Rebuild the current lifetime index dictionaries
    # Find the last index for each address
    self.l1_current_lifetime_index = {}
    for i, lt in enumerate(self.l1_lifetimes):
        self.l1_current_lifetime_index[lt.address] = i
    self.l2_current_lifetime_index = {}
    for i, lt in enumerate(self.l2_lifetimes):
        self.l2_current_lifetime_index[lt.address] = i

parse_log_file()

Parse all log lines for the kernel and finalize lifetime calculations.

Iterates through self.log_lines, calling process_line for each. After processing all lines, it finalizes any remaining active lifetimes using the l1_most_recent_read and l2_most_recent_read dictionaries. Calculates lifetime durations in cycles and nanoseconds, storing them in l1_lifetime_cycles, l1_lifetime_ns, etc. Updates final counts for unique addresses and zero/incomplete lifetimes.

Returns:

- list: The state needed to continue parsing for the next kernel: [l1_lifetimes_incomplete, l2_lifetimes_incomplete, l1_most_recent_read, l2_most_recent_read]. The incomplete lists hold LifetimeType objects whose end cycle is still None.

Source code in python-scripts/accel_sim_parser.py
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
def parse_log_file(self) -> list:
    """
    Parse all log lines for the kernel and finalize lifetime calculations.

    Iterates through `self.log_lines`, calling `process_line` for each.
    After processing all lines, it finalizes any remaining active lifetimes
    using the `l1_most_recent_read` and `l2_most_recent_read` dictionaries.
    Calculates lifetime durations in cycles and nanoseconds, storing them in
    `l1_lifetime_cycles`, `l1_lifetime_ns`, etc. Updates final counts for
    unique addresses and zero/incomplete lifetimes.

    Returns:
        list: A list containing the state needed to continue parsing for the
              next kernel: `[l1_lifetimes_incomplete, l2_lifetimes_incomplete,
              l1_most_recent_read, l2_most_recent_read]`.
              `l1_lifetimes_incomplete` and `l2_lifetimes_incomplete` are lists
              of `LifetimeType` objects whose `end` cycle is still None.
    """

    print(len(self.l1_lifetimes), len(self.l2_lifetimes))

    line_count = 0
    for line in self.log_lines:
        if line_count % 10000 == 0:
            print(
                f"\tProcessing line {line_count} of {len(self.log_lines)}")
        self.process_line(line)
        line_count += 1

    # Finalize L1 lifetime entries
    valid_l1 = []
    self.l1_zero_count = 0
    for index, lt in enumerate(self.l1_lifetimes):
        if lt.end is not None:
            if lt.end > lt.start:
                valid_l1.append(lt)
            elif lt.end == lt.start:
                self.l1_zero_count += 1
        elif self.l1_most_recent_read.get(lt.address) is not None:
            lt.end = self.l1_most_recent_read[lt.address]
            if lt.end > lt.start:
                valid_l1.append(lt)
            elif lt.end == lt.start:
                self.l1_zero_count += 1

    self.l1_lifetime_cycles = np.array(
        [lt.calculate_lifetime() for lt in valid_l1])
    # Convert to nanoseconds
    self.l1_lifetime_ns = self.CYCLE_TIME * self.l1_lifetime_cycles

    # Finalize L2 lifetime entries
    valid_l2 = []
    self.l2_zero_count = 0
    for lt in self.l2_lifetimes:
        if lt.end is not None:
            if lt.end > lt.start:
                valid_l2.append(lt)
            elif lt.end == lt.start:
                self.l2_zero_count += 1
        elif self.l2_most_recent_read.get(lt.address) is not None:
            lt.end = self.l2_most_recent_read[lt.address]
            if lt.end > lt.start:
                valid_l2.append(lt)
            elif lt.end == lt.start:
                self.l2_zero_count += 1

    self.l2_lifetime_cycles = np.array(
        [lt.calculate_lifetime() for lt in valid_l2])
    # Convert to nanoseconds
    self.l2_lifetime_ns = self.CYCLE_TIME * self.l2_lifetime_cycles

    # Update read/write counters and unique addresses
    self.l1_read_cycle_count = len(self.l1_read_cycles)
    self.l1_write_cycle_count = len(self.l1_write_cycles)
    self.l2_read_cycle_count = len(self.l2_read_cycles)
    self.l2_write_cycle_count = len(self.l2_write_cycles)
    self.l1_unique_addrs = len({lt.address for lt in self.l1_lifetimes})
    self.l2_unique_addrs = len({lt.address for lt in self.l2_lifetimes})

    l1_lifetimes_export = [
        lt for lt in self.l1_lifetimes if lt.end is None]
    l2_lifetimes_export = [
        lt for lt in self.l2_lifetimes if lt.end is None]
    self.l1_zero_count += len(l1_lifetimes_export)
    self.l2_zero_count += len(l2_lifetimes_export)
    self.l1_lifetimes = valid_l1
    self.l2_lifetimes = valid_l2
    self.l1_lifetimes_count = len(self.l1_lifetimes)
    self.l2_lifetimes_count = len(self.l2_lifetimes)

    print(
        f"Kernel {self.kernel_name} (ID: {self.kernel_id}): Processed {len(self.log_lines)} lines."
    )
    print(
        f"Built L1 Lifetime DataFrame with {len(self.l1_lifetime_cycles)} entries."
    )
    print(
        f"Built L2 Lifetime DataFrame with {len(self.l2_lifetime_cycles)} entries."
    )
    print(
        f"Exporting {len(l1_lifetimes_export)} L1 lifetimes and {len(l2_lifetimes_export)} L2 lifetimes.")

    return [
        l1_lifetimes_export,
        l2_lifetimes_export,
        self.l1_most_recent_read,
        self.l2_most_recent_read
    ]
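The returned state list exists so a driver can hand still-open lifetimes to the next kernel's parser via `import_cache_states`. The toy class below is a self-contained stand-in that demonstrates only this hand-off pattern; it is not the real `SimulationParser`, and its event tuples are invented for illustration.

```python
# Minimal stand-in for the cross-kernel state hand-off: each "parser"
# closes what it can and exports still-open lifetimes to the next one.
class TinyParser:
    def __init__(self):
        self.open = {}        # address -> start cycle
        self.completed = []   # (address, start, end)

    def import_cache_states(self, state):
        # Adopt lifetimes left open by the previous kernel
        self.open = dict(state)

    def parse(self, events):
        # events: (op, address, cycle); "start" opens a lifetime, "end" closes it
        for op, addr, cycle in events:
            if op == "start":
                self.open[addr] = cycle
            elif op == "end" and addr in self.open:
                self.completed.append((addr, self.open.pop(addr), cycle))
        return self.open  # incomplete lifetimes carry over

k1 = TinyParser()
carry = k1.parse([("start", 0x10, 5), ("start", 0x20, 7), ("end", 0x10, 9)])
k2 = TinyParser()
k2.import_cache_states(carry)   # 0x20 is still open from kernel 1
k2.parse([("end", 0x20, 14)])   # its lifetime spans the kernel boundary
```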

process_cycle(line)

Extract the simulation cycle number from a log line.

Parameters:

- line (str): A log line containing 'Cycle <num>'. Required.

Returns:

- int: The extracted cycle number.

Raises:

- ValueError: If the cycle number cannot be parsed as an integer.
- IndexError: If the line format is unexpected.

Source code in python-scripts/accel_sim_parser.py
def process_cycle(self, line: str) -> int:
    """
    Extract the simulation cycle number from a log line.

    Args:
        line (str): A log line containing 'Cycle <num>'.

    Returns:
        int: The extracted cycle number.

    Raises:
        ValueError: If the cycle number cannot be parsed as an integer.
        IndexError: If the line format is unexpected.
    """
    cycle_str = line.split("Cycle ")[1].split()[0]
    cycle = int(cycle_str.strip(":"))
    return cycle
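Both `process_cycle` and `process_line` extract fields by plain string slicing on the simulator's log format. The sample line below mirrors the L1D format quoted in `process_line`'s comments:

```python
# Field extraction on a sample GPGPU-Sim cache log line, using the same
# split/strip slicing as process_cycle and process_line.
line = ("GPGPU-Sim Cycle 11097: Load instr from L1D cache "
        "at SM 0 bank 3 addr 2d61c4e0 status 2")
cycle = int(line.split("Cycle ")[1].split()[0].strip(":"))   # drop trailing ':'
address = int(line.split("addr ")[1].split()[0], 16)         # hex address
status = int(line.split("status ")[1].split()[0])            # hit/miss code
```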

process_line(line)

Process a single cache log line to update lifetime tracking.

Parses the line to identify L1 or L2 cache accesses, address, cycle, status (hit/miss), and operation type (load/store). Updates the internal lifetime tracking structures (l1_lifetimes, l2_lifetimes, l1_current_lifetime_index, l2_current_lifetime_index, l1_most_recent_read, l2_most_recent_read) based on the access and configured cache policies. Also updates read/write counters.

Parameters:

- line (str): The cache log line to process. Required.

Returns:

- None: Modifies internal state.

Source code in python-scripts/accel_sim_parser.py
def process_line(self, line: str):
    """
    Process a single cache log line to update lifetime tracking.

    Parses the line to identify L1 or L2 cache accesses, address, cycle,
    status (hit/miss), and operation type (load/store). Updates the internal
    lifetime tracking structures (`l1_lifetimes`, `l2_lifetimes`,
    `l1_current_lifetime_index`, `l2_current_lifetime_index`,
    `l1_most_recent_read`, `l2_most_recent_read`) based on the access and
    configured cache policies. Also updates read/write counters.

    Args:
        line (str): The cache log line to process.

    Returns:
        None: Modifies internal state.
    """
    # Process L1 cache lines
    # GPGPU-Sim Cycle 11097: Load instr from L1D cache at SM 0 bank 3 addr 2d61c4e0 status 2
    if "L1D cache at SM" in line:
        # Get the number after "status" to determine if it's a hit or miss
        status = int(line.split("status ")[1].split()[0])
        if status >= 3:
            return
        # Get the cycle number
        cycle = int(line.split("Cycle ")[1].split()[0].strip(":"))
        # Get the address after "addr" and convert to decimal int
        address = int(line.split("addr ")[1].split()[0], 16)

        # look up the index in the three lifetime lists
        index = self.l1_current_lifetime_index.get(address, None)

        l1_write = "Store" in line
        l1_read = "Load" in line

        # End the lifetime of the cache line
        if (status == 2 or l1_write) and index is not None:
            start = self.l1_lifetimes[index].start
            # If an active lifetime exists, end it using the latest read
            last_read = self.l1_most_recent_read.get(address, start)
            self.l1_lifetimes[index].end = last_read if last_read >= start else None

        # Decide if we need to create a new lifetime entry:
        # - On a miss (status == 2):
        #   * If an active entry exists, it was just ended above, so always
        #     start a new one.
        #   * If no entry exists, start one on a load, or on a store only
        #     when the policy is WRITE_ALLOCATE.
        # - On a store hit to an existing entry, start a new lifetime.
        new_entry = (
            # Cache miss
            (status == 2 and
             # An active entry was just ended above; replace it
             (index is not None or
              # No active entry: allocate on a load, or on a store under
              # the WRITE_ALLOCATE policy
              (self.l1_write_allocation == WriteAllocation.WRITE_ALLOCATE or l1_read)))
            # Store hit to an existing entry
            or (l1_write and index is not None))

        # If a new lifetime entry is needed, create it.
        if new_entry:
            new_lt = LifetimeType(address, cycle, None)
            self.l1_lifetimes.append(new_lt)
            self.l1_current_lifetime_index[address] = len(
                self.l1_lifetimes) - 1

        # Process the instruction type
        if l1_write:
            self.l1_write_count += 1
            if cycle not in self.l1_write_cycles:
                self.l1_write_cycles.append(cycle)
        elif l1_read:
            self.l1_read_count += 1
            # store the most recent read if this is a hit
            if status == 0:
                self.l1_most_recent_read[address] = cycle
            if cycle not in self.l1_read_cycles:
                self.l1_read_cycles.append(cycle)
        else:
            print(f"Warning: Unknown L1D instruction type: {line}")
            return

    # Process L2 cache lines
    elif "L2 Address" in line:
        # GPGPU-Sim Cycle 12969: MEMORY_SUBPARTITION_UNIT -  2 - Store Request to L2 Address=71082d62bb60, status=0
        status = int(line.split("status=")[1].split()[0])
        if status >= 3:
            return
        cycle = int(line.split("Cycle ")[1].split()[0].strip(":"))
        address = int(line.split("Address=")[1].split(",")[0], 16)
        # Get the bank after MEMORY_SUBPARTITION_UNIT
        bank = int(line.split("MEMORY_SUBPARTITION_UNIT - ")
                   [1].split(" - ")[0].strip())
        # if bank != 0:
        #     return

        # Look up the active lifetime entry (if any) for this address
        index = self.l2_current_lifetime_index.get(address, None)

        l2_write = "Store" in line
        l2_read = "Load" in line

        # End lifetime if needed
        if (status == 2 or l2_write) and index is not None:
            start = self.l2_lifetimes[index].start
            # If an active lifetime exists, end it using the latest read
            last_read = self.l2_most_recent_read.get(address, start)
            self.l2_lifetimes[index].end = last_read if last_read >= start else None
        # Decide whether a new lifetime entry is needed:
        # - On a miss (status == 2):
        #   * If an active entry exists, always start a new one.
        #   * If no active entry exists, start one on a load, or on a store
        #     only under the WRITE_ALLOCATE policy.
        # - On a store hit to an active entry: start a new one.
        new_entry = (
            # Cache miss
            (status == 2 and
             # an active lifetime already exists...
             (index is not None or
              # ...or start a fresh one on a load, or on a store under WRITE_ALLOCATE
              (self.l2_write_allocation == WriteAllocation.WRITE_ALLOCATE or l2_read)))
            # Store hit to an existing entry
            or (l2_write and index is not None))
        # If a new lifetime entry is needed, create it.
        if new_entry:
            new_lt = LifetimeType(address, cycle, None)
            self.l2_lifetimes.append(new_lt)
            self.l2_current_lifetime_index[address] = len(
                self.l2_lifetimes) - 1

        # Process the instruction type
        if l2_write:
            self.l2_write_count += 1
            if cycle not in self.l2_write_cycles:
                self.l2_write_cycles.append(cycle)
        elif l2_read:
            self.l2_read_count += 1
            if status == 0:
                self.l2_most_recent_read[address] = cycle
            if cycle not in self.l2_read_cycles:
                self.l2_read_cycles.append(cycle)
        else:
            print(f"Warning: Unknown L2 instruction type: {line}")
            return
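
The string-splitting used above to pull the status, cycle, and address out of a cache log line can be exercised in isolation. The sample line below is taken from the format comment in the source; treat it as an illustrative log line rather than real simulator output:

```python
# Field extraction for an L1D cache log line, mirroring the parsing above.
# (status == 2 is treated as a miss, status == 0 as a hit in the parser.)
line = ("GPGPU-Sim Cycle 11097: Load instr from L1D cache at SM 0 "
        "bank 3 addr 2d61c4e0 status 2")

# Status code after "status "
status = int(line.split("status ")[1].split()[0])
# Cycle number after "Cycle ", with the trailing colon stripped
cycle = int(line.split("Cycle ")[1].split()[0].strip(":"))
# Hexadecimal address after "addr ", converted to an int
address = int(line.split("addr ")[1].split()[0], 16)
```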

WriteAllocation

Bases: Enum

Cache write allocation policies.

Source code in python-scripts/accel_sim_parser.py
class WriteAllocation(enum.Enum):
    """Cache write allocation policies."""
    WRITE_ALLOCATE = 1
    NO_WRITE_ALLOCATE = 2

WritePolicy

Bases: Enum

Cache write policies.

Source code in python-scripts/accel_sim_parser.py
class WritePolicy(enum.Enum):
    """Cache write policies."""
    WRITE_BACK = 1
    WRITE_THROUGH = 2

convert_size(size_str)

Convert a size string (e.g., "128KB", "50MB") to bytes.

Parses strings with suffixes KB, MB, GB (case-insensitive) and returns the equivalent size in bytes.

Parameters:

- size_str (str, required): The size string to convert (e.g., "256KB", "1GB").

Returns:

- int: The size in bytes.

Raises:

- ValueError: If the string format or suffix is invalid.

Source code in python-scripts/accel_sim_parser.py
def convert_size(size_str: str) -> int:
    """
    Convert a size string (e.g., "128KB", "50MB") to bytes.

    Parses strings with suffixes KB, MB, GB (case-insensitive) and returns
    the equivalent size in bytes.

    Args:
        size_str (str): The size string to convert (e.g., "256KB", "1GB").

    Returns:
        int: The size in bytes.

    Raises:
        ValueError: If the string format or suffix is invalid.
    """
    size_str = size_str.upper()  # Ensure case-insensitivity
    if "GB" in size_str:
        return int(size_str.split("GB")[0]) * 1024 * 1024 * 1024
    elif "MB" in size_str:
        return int(size_str.split("MB")[0]) * 1024 * 1024
    elif "KB" in size_str:
        return int(size_str.split("KB")[0]) * 1024
    elif "B" in size_str:
        return int(size_str.split("B")[0])
    else:
        raise ValueError(f"Invalid size string: {size_str}")
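
A quick sanity check of convert_size; the function body is reproduced verbatim so the snippet is self-contained:

```python
def convert_size(size_str: str) -> int:
    """Convert a size string such as '128KB' or '1GB' to bytes."""
    size_str = size_str.upper()  # ensure case-insensitivity
    if "GB" in size_str:
        return int(size_str.split("GB")[0]) * 1024 * 1024 * 1024
    elif "MB" in size_str:
        return int(size_str.split("MB")[0]) * 1024 * 1024
    elif "KB" in size_str:
        return int(size_str.split("KB")[0]) * 1024
    elif "B" in size_str:
        return int(size_str.split("B")[0])
    else:
        raise ValueError(f"Invalid size string: {size_str}")

assert convert_size("128KB") == 131072   # 128 * 1024
assert convert_size("2mb") == 2097152    # suffixes are case-insensitive
assert convert_size("512B") == 512
```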

find_and_append(kernels, current_index, key, line, dtype=int)

Find a key-value pair in a log line and append it to the current kernel's data.

Searches for '&lt;key&gt; = &lt;value&gt;' in the line. If found, converts the value to the specified dtype and adds it to the dictionary at kernels[current_index] with the given key. Conversion errors are caught and the value is stored as None.

Parameters:

- kernels (list, required): A list of dictionaries, where each dictionary holds data for a kernel.
- current_index (int, required): The index in the kernels list corresponding to the current kernel.
- key (str, required): The key string to search for in the line (e.g., "gpu_sim_cycle").
- line (str, required): The log line to parse.
- dtype (type, default int): The data type to convert the found value to (e.g., int, float).

Returns:

- None: Modifies the kernels list in place.

Source code in python-scripts/accel_sim_parser.py
def find_and_append(kernels: list, current_index: int, key: str, line: str, dtype=int):
    """
    Find a key-value pair in a log line and append it to the current kernel's data.

    Searches for '<key> = <value>' in the `line`. If found, converts the value
    to the specified `dtype` and adds it to the dictionary at `kernels[current_index]`
    with the given `key`. Handles potential errors during conversion.

    Args:
        kernels (list): A list of dictionaries, where each dictionary holds data for a kernel.
        current_index (int): The index in the `kernels` list corresponding to the current kernel.
        key (str): The key string to search for in the line (e.g., "gpu_sim_cycle").
        line (str): The log line to parse.
        dtype (type): The data type to convert the found value to (e.g., int, float). Defaults to int.

    Returns:
        None: Modifies the `kernels` list in place.
    """
    if key in line:
        try:
            num = dtype(line.split("=")[1].strip())
            kernels[current_index][key] = num
        except Exception as e:
            print(f"Error parsing {key} in line: {line}")
            print(e)
            kernels[current_index][key] = None
        finally:
            return
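
Usage of find_and_append on a synthetic log line. The function is reproduced (slightly condensed, without the early-return) so the snippet runs standalone; the sample lines are hypothetical:

```python
def find_and_append(kernels: list, current_index: int, key: str, line: str, dtype=int):
    """Store '<key> = <value>' from a log line into kernels[current_index]."""
    if key in line:
        try:
            num = dtype(line.split("=")[1].strip())
            kernels[current_index][key] = num
        except Exception as e:
            print(f"Error parsing {key} in line: {line}")
            print(e)
            kernels[current_index][key] = None

kernels = [{}]
find_and_append(kernels, 0, "gpu_sim_cycle", "gpu_sim_cycle = 12345", int)
find_and_append(kernels, 0, "gpu_ipc", "gpu_ipc = 1.25", float)
```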

get_cache_policy(line)

Extract cache write policy and allocation policy from a GPGPU-Sim config line.

Parses a GPGPU-Sim configuration line (e.g., starting with '-gpgpu_cache:dl1') to determine the cache's write policy (Write-Back/Write-Through) and write allocation policy (Write-Allocate/No-Write-Allocate).

Parameters:

- line (str, required): A configuration line string (e.g., "-gpgpu_cache:dl1 S:4:128:256,L:T:m:L:L,A:384:48,16:0,32").

Returns:

- dict: A dictionary containing:
    - 'write_policy' (WritePolicy): The parsed write policy enum.
    - 'write_allocation' (WriteAllocation): The parsed write allocation enum.

Source code in python-scripts/accel_sim_parser.py
def get_cache_policy(line: str) -> dict:
    """
    Extract cache write policy and allocation policy from a GPGPU-Sim config line.

    Parses a GPGPU-Sim configuration line (e.g., starting with '-gpgpu_cache:dl1')
    to determine the cache's write policy (Write-Back/Write-Through) and
    write allocation policy (Write-Allocate/No-Write-Allocate).

    Args:
        line (str): A configuration line string (e.g., "-gpgpu_cache:dl1 S:4:128:256,L:T:m:L:L,A:384:48,16:0,32").

    Returns:
        dict: A dictionary containing:
            - 'write_policy' (WritePolicy): The parsed write policy enum.
            - 'write_allocation' (WriteAllocation): The parsed write allocation enum.
    """
    line = line.split(" ")[1]
    # split by colon
    parts = line.split(":")
    # get the parts corresponding to wr and wr_alloc
    wr = parts[4]
    wr_alloc = parts[6]
    print(wr, wr_alloc)  # debug output of the raw policy fields
    if wr == "T":
        write_policy = WritePolicy.WRITE_THROUGH
    else:
        write_policy = WritePolicy.WRITE_BACK
    if wr_alloc == "N":
        write_alloc = WriteAllocation.NO_WRITE_ALLOCATE
    else:
        write_alloc = WriteAllocation.WRITE_ALLOCATE
    return {
        "write_policy": write_policy,
        "write_allocation": write_alloc
    }
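
A condensed, behavior-equivalent sketch of get_cache_policy, run on the example config line from the docstring (enums re-declared so the snippet is self-contained):

```python
import enum

class WritePolicy(enum.Enum):
    WRITE_BACK = 1
    WRITE_THROUGH = 2

class WriteAllocation(enum.Enum):
    WRITE_ALLOCATE = 1
    NO_WRITE_ALLOCATE = 2

def get_cache_policy(line: str) -> dict:
    """Extract the write policy (parts[4]) and write allocation
    policy (parts[6]) from a '-gpgpu_cache:...' config line."""
    parts = line.split(" ")[1].split(":")
    write_policy = (WritePolicy.WRITE_THROUGH if parts[4] == "T"
                    else WritePolicy.WRITE_BACK)
    write_alloc = (WriteAllocation.NO_WRITE_ALLOCATE if parts[6] == "N"
                   else WriteAllocation.WRITE_ALLOCATE)
    return {"write_policy": write_policy, "write_allocation": write_alloc}

cfg = get_cache_policy("-gpgpu_cache:dl1 S:4:128:256,L:T:m:L:L,A:384:48,16:0,32")
# 'T' -> write-through; 'L' (not 'N') -> write-allocate
```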

read_cache_log(log_file_path, basename=None)

Read Accel-Sim cache and simulation logs, grouping lines by kernel.

Parses the .sim_cache.log and .sim.log files. It identifies kernel launches and groups subsequent log lines belonging to each kernel. It also extracts kernel metadata like name, ID, simulation cycles, instructions, IPC, and detected cache configuration changes from the log lines.

Parameters:

- log_file_path (str, required): The directory containing the log files.
- basename (str, default None): The base name of the log files (e.g., "program_timestamp").

Returns:

- list[dict]: A list of dictionaries. Each dictionary represents a kernel and contains:
    - 'kernel_name' (str): The name of the kernel.
    - 'kernel_id' (int): The ID of the kernel.
    - 'lines' (list[str]): Log lines associated with this kernel from .sim_cache.log.
    - 'gpu_sim_cycle' (Optional[int]): Simulation cycles for this kernel.
    - 'gpu_tot_sim_cycle' (Optional[int]): Total simulation cycles up to this kernel.
    - 'gpu_sim_insn' (Optional[int]): Instructions executed by this kernel.
    - 'gpu_tot_sim_insn' (Optional[int]): Total instructions executed up to this kernel.
    - 'gpu_ipc' (Optional[float]): Instructions per cycle for this kernel.
    - 'l1_config' (Optional[int]): Detected L1 cache size in bytes (if reconfigured).

Source code in python-scripts/accel_sim_parser.py
def read_cache_log(log_file_path: str, basename: str = None):
    """
    Read Accel-Sim cache and simulation logs, grouping lines by kernel.

    Parses the `.sim_cache.log` and `.sim.log` files. It identifies kernel
    launches and groups subsequent log lines belonging to each kernel. It also
    extracts kernel metadata like name, ID, simulation cycles, instructions, IPC,
    and detected cache configuration changes from the log lines.

    Args:
        log_file_path (str): The directory containing the log files.
        basename (str): The base name of the log files (e.g., "program_timestamp").

    Returns:
        list[dict]: A list of dictionaries. Each dictionary represents a kernel and contains:
            - 'kernel_name' (str): The name of the kernel.
            - 'kernel_id' (int): The ID of the kernel.
            - 'lines' (list[str]): A list of log lines associated with this kernel from `.sim_cache.log`.
            - 'gpu_sim_cycle' (Optional[int]): Simulation cycles for this kernel.
            - 'gpu_tot_sim_cycle' (Optional[int]): Total simulation cycles up to this kernel.
            - 'gpu_sim_insn' (Optional[int]): Instructions executed by this kernel.
            - 'gpu_tot_sim_insn' (Optional[int]): Total instructions executed up to this kernel.
            - 'gpu_ipc' (Optional[float]): Instructions per cycle for this kernel.
            - 'l1_config' (Optional[int]): Detected L1 cache size in bytes (if reconfigured).
    """

    # Read cache log file
    log_file = os.path.join(log_file_path, basename + ".sim_cache.log")
    print(f"Reading log file: {log_file}")
    with open(log_file, 'r') as f:
        lines = f.readlines()
    kernels = []
    current_index = -1
    current_kernel_name = ""
    current_kernel_id = 0
    for line in lines:
        if "-kernel name = " in line:
            current_kernel_name = line.split("-kernel name = ")[1].strip()
        if "-kernel id = " in line:
            current_kernel_id = int(line.split("-kernel id = ")[1].strip())
        if "launching kernel name" in line:
            current_index += 1
            print(
                f"Found kernel: {current_kernel_name} (ID: {current_kernel_id})")
            kernels.append({
                "kernel_name": current_kernel_name,
                "kernel_id": current_kernel_id,
                "lines": []
            })
        else:
            if current_index >= 0:
                kernels[current_index]["lines"].append(line)

    # Read report log file
    report_file = os.path.join(log_file_path, basename + ".sim.log")
    print(f"Reading log file: {report_file}")
    with open(report_file, 'r') as f:
        lines = f.readlines()

    current_index = -1
    for line in lines:
        if "launching kernel name" in line:
            m = re.search(r"launching kernel name: (\S+) uid: (\d+)", line)
            if m:
                current_index += 1

        find_and_append(kernels, current_index, "gpu_sim_cycle", line, int)
        find_and_append(kernels, current_index, "gpu_tot_sim_cycle", line, int)
        find_and_append(kernels, current_index, "gpu_sim_insn", line, int)
        find_and_append(kernels, current_index, "gpu_tot_sim_insn", line, int)
        find_and_append(kernels, current_index, "gpu_ipc", line, float)

        if "GPGPU-Sim: Reconfigure L1 cache to" in line:
            m = re.search(r"Reconfigure L1 cache to (\S+)", line)
            if m:
                kernels[current_index]["l1_config"] = convert_size(m.group(1))

    return kernels
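
The kernel-grouping loop at the heart of read_cache_log can be exercised on a few hypothetical log lines (the kernel name and cycle values below are made up for illustration):

```python
# Grouping sketch mirroring the first loop of read_cache_log.
lines = [
    "-kernel name = vecAdd",
    "-kernel id = 1",
    "launching kernel name: vecAdd uid: 1",
    "GPGPU-Sim Cycle 10: Load instr from L1D cache at SM 0 ...",
]
kernels, current_index = [], -1
current_kernel_name, current_kernel_id = "", 0
for line in lines:
    if "-kernel name = " in line:
        current_kernel_name = line.split("-kernel name = ")[1].strip()
    if "-kernel id = " in line:
        current_kernel_id = int(line.split("-kernel id = ")[1].strip())
    if "launching kernel name" in line:
        # A launch line opens a new kernel group
        current_index += 1
        kernels.append({"kernel_name": current_kernel_name,
                        "kernel_id": current_kernel_id,
                        "lines": []})
    else:
        # All other lines attach to the most recently launched kernel
        if current_index >= 0:
            kernels[current_index]["lines"].append(line)
```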

read_config_file(config_file_path=None)

Read L1 and L2 cache policies from a GPGPU-Sim configuration file.

Parses the specified configuration file (or the default configs/gpgpusim.config) to find lines defining the L1 data cache (-gpgpu_cache:dl1) and L2 data cache (-gpgpu_cache:dl2) and extracts their write and allocation policies using get_cache_policy.

Parameters:

- config_file_path (str, optional): Path to the GPGPU-Sim config file. Defaults to configs/gpgpusim.config relative to this script's directory.

Returns:

- Tuple[dict, dict]: A tuple of two dictionaries:
    - l1_config: 'write_policy' and 'write_allocation' for L1.
    - l2_config: 'write_policy' and 'write_allocation' for L2.
    Empty dictionaries are returned if the config lines are not found.

Raises:

- FileNotFoundError: If the specified or default config file does not exist.

Source code in python-scripts/accel_sim_parser.py
def read_config_file(config_file_path: str = None):
    """
    Read L1 and L2 cache policies from a GPGPU-Sim configuration file.

    Parses the specified configuration file (or the default `configs/gpgpusim.config`)
    to find lines defining the L1 data cache (`-gpgpu_cache:dl1`) and L2 data cache
    (`-gpgpu_cache:dl2`) and extracts their write and allocation policies using
    `get_cache_policy`.

    Args:
        config_file_path (str, optional): Path to the GPGPU-Sim config file.
                                          Defaults to `configs/gpgpusim.config` relative
                                          to this script's directory.

    Returns:
        Tuple[dict, dict]: A tuple containing two dictionaries:
            - l1_config: Dictionary with 'write_policy' and 'write_allocation' for L1.
            - l2_config: Dictionary with 'write_policy' and 'write_allocation' for L2.
                         Returns empty dictionaries if config lines are not found.

    Raises:
        FileNotFoundError: If the specified or default config file does not exist.
    """
    if config_file_path is None:
        config_file_path = os.path.join(
            os.path.dirname(__file__), "configs", "gpgpusim.config")
    print(f"Reading config file: {config_file_path}")
    with open(config_file_path, 'r') as f:
        lines = f.readlines()
    # Initialize to empty dicts so the documented fallback (empty dicts when
    # the config lines are missing) holds instead of raising a NameError.
    l1_config = {}
    l2_config = {}
    for line in lines:
        if line.startswith("-gpgpu_cache:dl1 "):
            l1_config = get_cache_policy(line)
            print(f"L1 config: {l1_config}")
        if line.startswith("-gpgpu_cache:dl2 "):
            l2_config = get_cache_policy(line)
            print(f"L2 config: {l2_config}")
    return l1_config, l2_config

read_function_names(log_file_path, basename)

Read kernel ID to kernel name mapping from the kernels.csv file.

Parses the kernels.csv file (typically generated by PKS or tracing) to create a dictionary mapping kernel IDs (as integers) to their corresponding names (as strings).

Parameters:

- log_file_path (str, required): The directory containing the traces subdirectory.
- basename (str, required): The base name of the log files (unused in this function, but kept for consistency).

Returns:

- dict[int, str]: A dictionary mapping kernel IDs to kernel names. Returns an empty dictionary if kernels.csv is not found.

Source code in python-scripts/accel_sim_parser.py
def read_function_names(log_file_path: str, basename: str):
    """
    Read kernel ID to kernel name mapping from the `kernels.csv` file.

    Parses the `kernels.csv` file (typically generated by PKS or tracing)
    to create a dictionary mapping kernel IDs (as integers) to their
    corresponding names (as strings).

    Args:
        log_file_path (str): The directory containing the `traces` subdirectory.
        basename (str): The base name of the log files (unused in this function,
                        but kept for consistency).

    Returns:
        dict[int, str]: A dictionary mapping kernel IDs to kernel names. Returns
                        an empty dictionary if `kernels.csv` is not found.
    """
    kernels_file = os.path.join(log_file_path, "traces", "kernels.csv")
    if not os.path.exists(kernels_file):
        print(f"File {kernels_file} does not exist.")
        return {}
    print(f"Reading kernel names from {kernels_file}")
    # Read CSV file as pandas DataFrame
    df = pd.read_csv(kernels_file)
    # Construct a dictionary with "Kernel ID" as key and "Kernel Name" as value
    kernel_names = {}
    for index, row in df.iterrows():
        kernel_id = row["Kernel ID"]
        kernel_name = row["Kernel Name"]
        kernel_names[kernel_id] = kernel_name
    return kernel_names
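
The ID-to-name mapping built from kernels.csv can be reproduced with the standard-library csv module (the script itself uses pandas; the "Kernel ID" and "Kernel Name" column headers match those the parser expects, and the rows below are hypothetical):

```python
import csv
import io

# Hypothetical kernels.csv content with the expected column headers.
csv_text = "Kernel ID,Kernel Name\n1,vecAdd\n2,matMul\n"

kernel_names = {}
for row in csv.DictReader(io.StringIO(csv_text)):
    # pandas infers an int dtype for "Kernel ID"; convert explicitly here
    kernel_names[int(row["Kernel ID"])] = row["Kernel Name"]
```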

run_parser(log_file_path, log_file_base, config_file_path)

Main function to run the simulation log parsing process for all kernels.

Reads the grouped kernel data using read_cache_log, iterates through each kernel, creates a SimulationParser instance, imports cache state from the previous kernel (if any), parses the current kernel's logs, and collects both fine-grained (per-lifetime) and coarse-grained (per-kernel) statistics. Finally, saves the aggregated statistics to CSV files with pandas.

Parameters:

- log_file_path (str, required): Path to the directory containing the log files.
- log_file_base (str, required): Base name of the log files (e.g., "program_timestamp").
- config_file_path (str, required): Path to the GPGPU-Sim configuration file used for simulation.

Returns:

- None: Generates CSV output files:
    - <log_file_base>.sim.csv (coarse-grained kernel stats)
    - <log_file_base>.sim_l1.csv (fine-grained L1 lifetimes)
    - <log_file_base>.sim_l2.csv (fine-grained L2 lifetimes)

Raises:

- SystemExit: If no kernels are found in the log file.

Source code in python-scripts/accel_sim_parser.py
def run_parser(log_file_path: str, log_file_base: str, config_file_path: str):
    """
    Main function to run the simulation log parsing process for all kernels.

    Reads the grouped kernel data using `read_cache_log`, iterates through each
    kernel, creates a `SimulationParser` instance, imports state from the previous
    kernel (if any), parses the current kernel's logs, and collects both
    fine-grained (per-lifetime) and coarse-grained (per-kernel) statistics.
    Finally, saves the aggregated statistics to CSV files with pandas.

    Args:
        log_file_path (str): Path to the directory containing the log files.
        log_file_base (str): Base name of the log files (e.g., "program_timestamp").
        config_file_path (str): Path to the GPGPU-Sim configuration file used for simulation.

    Returns:
        None: Generates CSV output files:
              - `<log_file_base>.sim.csv` (coarse-grained kernel stats)
              - `<log_file_base>.sim_l1.csv` (fine-grained L1 lifetimes)
              - `<log_file_base>.sim_l2.csv` (fine-grained L2 lifetimes)

    Raises:
        SystemExit: If no kernels are found in the log file.
    """
    groups = read_cache_log(log_file_path, log_file_base)
    kernel_names = read_function_names(log_file_path, log_file_base)
    overall_load = 0
    overall_store = 0
    kernels = []

    print(f"Found {len(groups)} kernels in log file.")
    if len(groups) == 0:
        print("No kernels found in log file. Exiting.")
        sys.exit(1)

    coarse_grain_stats = []
    l1_fine_grain_df = pd.DataFrame()
    l2_fine_grain_df = pd.DataFrame()

    cache_state = None
    print(config_file_path)

    for kernel in groups:
        kernel_id = kernel["kernel_id"]
        if kernel_id == 991:
            print("Skipping kernel ID 991")
            continue
        kernel_name = kernel_names.get(kernel_id, kernel["kernel_name"])
        print(
            f"Processing kernel {kernel_name} (ID: {kernel_id}) with {len(kernel['lines'])} lines.")
        try:
            parser_instance = SimulationParser(
                kernel, log_file_path, log_file_base, kernel_name, config_file_path)
        except Exception as e:
            print(f"Error parsing kernel {kernel_name} (ID: {kernel_id}): {e}")
            continue
        if cache_state is not None:
            parser_instance.import_cache_states(cache_state)
        cache_state = parser_instance.parse_log_file()
        l1_df, l2_df = parser_instance.fine_grained_df()
        # Concatenate the fine-grained dataframes
        l1_fine_grain_df = pd.concat([l1_fine_grain_df, l1_df])
        l2_fine_grain_df = pd.concat([l2_fine_grain_df, l2_df])
        coarse_grain_stats.append(parser_instance.coarse_grained_dict())

    coarse_grain_df = pd.DataFrame(coarse_grain_stats)
    coarse_csv = os.path.join(log_file_path, f"{log_file_base}.sim.csv")

    # An earlier Dask-based export (dd.from_pandas + single-file to_csv) was
    # replaced by a direct pandas write.
    coarse_grain_df.to_csv(coarse_csv, index=False)
    print(f"Coarse-grained CSV file for all kernels saved to: {log_file_path}")
    l1_csv = os.path.join(log_file_path, f"{log_file_base}.sim_l1.csv")
    l2_csv = os.path.join(log_file_path, f"{log_file_base}.sim_l2.csv")

    # The fine-grained lifetime tables are likewise written directly with
    # pandas (the Dask-based export was dropped).
    l1_fine_grain_df.to_csv(l1_csv, index=False)
    l2_fine_grain_df.to_csv(l2_csv, index=False)
    print(f"Fine-grained CSV files for all kernels saved to: {log_file_path}")
    print("All kernels processed successfully.")