GainSight Accel-Sim Backend Python API Documentation

This page documents the Python scripts in gainsight/backend/python-scripts/ that support the Accel-Sim backend, generated with mkdocstrings. Refer to the backend wiki for a summary of implementation details and usage instructions.


Accel-Sim Runner Script

The accel_sim.py script is the main entry point for running the Accel-Sim simulator. It handles command-line arguments, initializes the simulation environment, and manages the execution of the simulation.

Simulation-based, cache line-level analysis of GPU programs using Accel-Sim.

This script runs the Accel-Sim simulator on a given program with the specified arguments: it first generates SASS traces using the NVBit tracer and then runs the simulator on those traces. It also offers kernel sampling via principal kernel selection (PKS) to reduce the number of kernels in the traces.

TODO: Output redirection to log files, error handling, and more detailed documentation.

Usage

python3 accel_sim.py <program> [args ...] [--sample] [--arch <arch>] [--delete] [--verbose]

Pre-requisites
  • The program must exist and be executable.
  • The program must be run within its desired working directory.
  • The Accel-Sim tools must be compiled and available in the environment.
  • The PROJECT_ROOT and CUDA_INSTALL_PATH environment variables must be set to the project root directory and the CUDA installation path, respectively.
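These preconditions can be sanity-checked up front. The following is a minimal, hypothetical helper (not part of accel_sim.py) that uses only the standard library; the launcher names and environment variable names are taken from the prerequisites above:

```python
import os
import shutil

def check_prerequisites(program):
    """Return a list of human-readable problems; an empty list means ready.

    Hypothetical helper for illustration; not part of accel_sim.py.
    """
    problems = []
    # Interpreter launchers are resolved via PATH; anything else
    # must be an existing executable file.
    if program in ("python", "python3", "torchrun"):
        if shutil.which(program) is None:
            problems.append(f"{program} not found on PATH")
    elif not (os.path.isfile(program) and os.access(program, os.X_OK)):
        problems.append(f"{program} is not an executable file")
    for var in ("PROJECT_ROOT", "CUDA_INSTALL_PATH"):
        if not os.getenv(var):
            problems.append(f"environment variable {var} is not set")
    return problems
```

Running this before invoking the script surfaces missing environment variables early instead of failing mid-trace.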

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| program | str | The program to profile. | required |
| args | List[str] | The arguments to pass to the program. | required |
| --sample | bool | Run the program with kernel sampling using PKS. | False |
| --arch | str | The architecture to simulate. | "SM90_H100" |
| --delete | bool | Delete the traces directory. | False |
| --verbose | bool | Store verbose output from Accel-Sim. | False |
Output

The log files are saved under $PROJECT_ROOT/logs/ and contain the following files:

- traces: The directory containing the generated SASS traces.
- traces/kernelslist.g: The list of kernels in the traces; if kernel sampling is used, this file is updated with the selected kernels.
- .sim_cache.log: The cache log file containing detailed memory access information.
- .sim.log: The main log file containing the simulator output with key simulation runtime information.
- .sim.csv: The kernel-level simulation statistics in tabular (CSV) format.
- .sim_l1.csv: The L1 cache lifetime statistics in tabular format.
- .sim_l2.csv: The L2 cache lifetime statistics in tabular format.
- .frontend.json: JSON output from the frontend post-processing script.
- .postprocess.log: Log output from the frontend post-processing script.
- .sim_cmd.log: The command used to run the simulation.
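Since the kernel-level statistics land in CSV files, they can be inspected with the standard csv module. The column names below are placeholders standing in for a real .sim.csv file (the actual schema comes from the parser), so this only sketches the access pattern:

```python
import csv
import io

# Placeholder data standing in for a <name>.sim.csv file; the real
# column names depend on the parser and may differ.
sample = io.StringIO(
    "kernel_name,gpu_sim_cycle,gpu_ipc\n"
    "vecadd_kernel,12345,0.87\n"
    "reduce_kernel,6789,1.02\n"
)
rows = list(csv.DictReader(sample))
# Find the kernel with the most simulated cycles.
slowest = max(rows, key=lambda r: int(r["gpu_sim_cycle"]))
print(slowest["kernel_name"])  # prints: vecadd_kernel
```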

AccelSimRunner

Manages the Accel-Sim workflow: tracing, sampling, simulation, and post-processing.

This class encapsulates the steps required to profile a GPU program using Accel-Sim:

1. Optionally run Nsight Compute and Principal Kernel Selection (PKS) for sampling.
2. Run the NVBit tracer to generate SASS instruction traces.
3. Run the Accel-Sim GPGPU simulator on the generated (or sampled) traces.
4. Run post-processing scripts to analyze simulation output and generate reports.
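To make the control flow concrete, here is a stand-in with the same method names and signatures as AccelSimRunner that records the stage order instead of launching external tools; argument values and the class itself are illustrative only:

```python
class FakeRunner:
    """Stand-in mirroring AccelSimRunner's method names; records the
    stage order instead of running NCU, NVBit, or the simulator."""

    def __init__(self):
        self.stages = []

    def run_sampling(self, sample_delete=False):
        self.stages.append("sampling")         # 1. NCU + PKS (optional)

    def run_tracer(self, sample=False):
        self.stages.append("tracing")          # 2. NVBit SASS tracing

    def run_accel_sim(self, no_write_allocate=False):
        self.stages.append("simulation")       # 3. Accel-Sim on the traces

    def run_post_processing(self, config_file, sample=False):
        self.stages.append("post-processing")  # 4. CSV/JSON reports

r = FakeRunner()
r.run_sampling()
r.run_tracer(sample=True)
r.run_accel_sim()
r.run_post_processing("gpgpusim.config")
print(r.stages)  # ['sampling', 'tracing', 'simulation', 'post-processing']
```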

Attributes:

| Name | Type | Description |
|---|---|---|
| program | str | Absolute path to the executable program or Python interpreter. |
| program_args | List[str] | List of arguments to pass to the program. |
| arch | str | Target GPU architecture for simulation (e.g., "SM90_H100"). |
| verbose | bool | Flag to enable verbose logging during simulation. |
| delete | bool | Flag to delete existing trace directory before tracing. |
| pwd | str | The original working directory from where the script was invoked. |
| original_cwd | str | The original current working directory (same as pwd). |
| log_file_name | str | Base name for log files (program_timestamp). |
| log_file_path | str | Path to the directory where logs are stored. |
| kernelslist_g | str | Path to the kernelslist.g file within the trace directory. |
| ncu_file | str | Path to an existing NCU report file to use for sampling. |

Source code in python-scripts/accel_sim.py
class AccelSimRunner:
    """Manages the Accel-Sim workflow: tracing, sampling, simulation, and post-processing.

    This class encapsulates the steps required to profile a GPU program using Accel-Sim:
    1. Optionally run Nsight Compute and Principal Kernel Selection (PKS) for sampling.
    2. Run the NVBit tracer to generate SASS instruction traces.
    3. Run the Accel-Sim GPGPU simulator on the generated (or sampled) traces.
    4. Run post-processing scripts to analyze simulation output and generate reports.

    Attributes:
        program (str): Absolute path to the executable program or Python interpreter.
        program_args (List[str]): List of arguments to pass to the program.
        arch (str): Target GPU architecture for simulation (e.g., "SM90_H100").
        verbose (bool): Flag to enable verbose logging during simulation.
        delete (bool): Flag to delete existing trace directory before tracing.
        pwd (str): The original working directory from where the script was invoked.
        original_cwd (str): The original current working directory (same as pwd).
        log_file_name (str): Base name for log files (program_timestamp).
        log_file_path (str): Path to the directory where logs are stored.
        kernelslist_g (str): Path to the kernelslist.g file within the trace directory.
        ncu_file (str): Path to an existing NCU report file to use for sampling.
    """
    def __init__(self, program, args, arch="SM90_H100", verbose=False, delete=False, rename=None, ncu_file=None):
        """Initialize the AccelSimRunner.

        Sets up paths, log naming, and execution environment based on input parameters.

        Args:
            program (str): Path or name of the program to profile.
            args (List[str]): Arguments to pass to the program.
            arch (str): GPU architecture (default: "SM90_H100").
            verbose (bool): Enable verbose simulator logging.
            delete (bool): Remove existing traces directory.
            rename (str): Custom base name for log files.
            ncu_file (str): Existing Nsight Compute report for sampling.

        Returns:
            None: Initializes runner state and prepares directories.
        """
        # Convert program to absolute path if it's a relative path
        self.program = os.path.abspath(program) if program not in [
            "python", "python3", "torchrun"] else program
        self.program_args = args
        self.arch = arch
        self.verbose = verbose
        self.delete = delete
        self.pwd = os.getenv('PWD', os.getcwd())

        # Store original working directory
        self.original_cwd = os.getcwd()

        output_program_name = os.path.basename(program)
        # If the program is python3, we need to use sys.executable
        program_basename = os.path.basename(program)
        if program_basename == "python" or program_basename.startswith("python3") or program_basename == "torchrun":
            output_program_name = ".".join(
                os.path.basename(args[0]).split('.')[:-1])
            self.program = sys.executable
            self.program_args[0] = os.path.abspath(args[0])
        # If the program is a .py file, we need to use sys.executable
        if program.endswith(".py"):
            output_program_name = ".".join(
                os.path.basename(program).split('.')[:-1])
            self.program_args = [os.path.abspath(program)] + args
            self.program = sys.executable
        if rename:
            output_program_name = rename

        # Generate a timestamp for the log file name
        timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        self.log_file_name = f"{output_program_name}_{timestamp}"
        self.log_file_path = os.path.join(
            os.getenv('PROJECT_ROOT', '.'), 'logs', output_program_name)
        os.makedirs(self.log_file_path, exist_ok=True)

        self.kernelslist_g = os.path.join(
            self.log_file_path, 'traces', 'kernelslist.g')

        # Dump program and arguments to a file
        with open(os.path.join(self.log_file_path, f"{self.log_file_name}.sim_cmd.log"), 'w') as f:
            f.write(f"{self.program} {' '.join(self.program_args)}")

        self.ncu_file = ncu_file

    def run_tracer(self, sample=False):
        """Run the NVBit tracer to generate SASS instruction traces for the program.

        Sets up the environment for the NVBit tracer tool, executes the target program
        under the tracer, and then runs the post-processing script to convert raw
        traces into the format required by Accel-Sim (`.traceg` files). Optionally
        deletes the intermediate `.trace` files.

        Args:
            sample (bool): Enable kernel sampling via environment variable. Defaults to False.

        Returns:
            str: Path to the directory containing processed traces.

        Raises:
            FileNotFoundError: If the program executable is not found.
        """
        trace_path = os.path.join(self.log_file_path, 'traces')
        project_root = os.getenv('PROJECT_ROOT', '/gainsight')
        tracer_tool_path = os.getenv('TRACER_TOOLS', os.path.join(
            project_root, 'backend', 'accel-sim', 'util', 'tracer_nvbit', 'tracer_tool'))

        # Print kernelslist_g if it exists
        if os.path.exists(self.kernelslist_g):
            print(f"Kernels list file already exists: {self.kernelslist_g}")
            with open(self.kernelslist_g, 'r') as f:
                print(f.read())

        # Clear the traces directory
        if os.path.exists(os.path.join(trace_path, 'stats.csv')):
            print(
                f"Processed traces from {self.program} have already been generated and saved to {trace_path}")
            return trace_path
        os.makedirs(trace_path, exist_ok=True)

        # Run the tracer
        env = os.environ.copy()
        env['CUDA_INJECTION64_PATH'] = os.path.join(
            tracer_tool_path, 'tracer_tool.so')
        env['SAMPLE'] = '1' if sample else '0'
        env['KERNELSLIST'] = self.kernelslist_g
        env['USER_DEFINED_FOLDERS'] = '1'
        env['TRACES_FOLDER'] = trace_path
        env['TRACE_FILE_COMPRESS'] = '0'
        # Only trace the first 1000 kernels
        env['DYNAMIC_KERNEL_LIMIT_START'] = '0'
        env['DYNAMIC_KERNEL_LIMIT_END'] = '1000'
        env['INSTR_END'] = '200000000'

        # Check if the program exists
        if not os.path.exists(self.program) and self.program not in ["python", "python3", "torchrun"]:
            raise FileNotFoundError(f"Program not found: {self.program}")

        # Use original working directory instead of log file path
        # only run this if kernelslist does not exist
        if not os.path.exists(os.path.join(trace_path, 'kernelslist')):
            print("Running the tracer...")
            subprocess.run(
                [self.program] + self.program_args,
                env=env,
                cwd=self.pwd
                # cwd=self.log_file_path
            )
        else:
            print(
                f"Traces from {self.program} have already been generated and saved to {trace_path}")

        # Process the traces
        print("Processing traces...")
        subprocess.run([
            os.path.join(tracer_tool_path, 'traces-processing',
                         'post-traces-processing'),
            os.path.join(trace_path, 'kernelslist')
        ])

        # Delete the .trace files to save space
        for trace_file in os.listdir(trace_path):
            if trace_file.endswith('.trace') or trace_file.endswith('.trace.xz'):
                os.remove(os.path.join(trace_path, trace_file))
        print(
            f"Traces from {self.program} consisting of {len(os.listdir(trace_path)) - 3} kernel(s) have been saved to {trace_path}")
        return trace_path

    def run_accel_sim(self, no_write_allocate=False):
        """
        Run the Accel-Sim GPGPU simulator on the generated traces.

        Sets up the environment for Accel-Sim, constructs the command line with
        appropriate configuration files (including handling the write-allocate setting),
        and executes the simulator. It captures and redirects the simulator's output
        to different log files (`.sim_cache.log`, `.sim.log`, `.sim_verbose.log`)
        based on regex patterns.

        Args:
            no_write_allocate (bool): Disable cache write-allocate configuration. Defaults to False.

        Returns:
            None: Executes the simulation and writes log files.
        """
        # Get and set the environment variables
        env = os.environ.copy()
        project_root = os.getenv('PROJECT_ROOT', '/gainsight')
        ld_library_path = env.get('LD_LIBRARY_PATH', '')
        accel_sim_root = os.path.join(
            project_root, 'backend', 'accel-sim', 'gpu-simulator')
        env['ACCELSIM_ROOT'] = accel_sim_root
        gpgpusim_root = os.path.join(accel_sim_root, 'gpgpu-sim')
        gpgpusim_config = env.get(
            'GPGPUSIM_CONFIG', 'gcc-11.4.0/cuda-11080/release')
        ld_library_path = re.sub(
            rf'{re.escape(gpgpusim_root)}/lib/[0-9]+/(debug|release):', '', ld_library_path)
        ld_library_path = f"{gpgpusim_root}/lib/{gpgpusim_config}:{ld_library_path}"
        env['LD_LIBRARY_PATH'] = ld_library_path

        # Set arguments for the simulator
        accel_sim_exec = os.path.join(
            accel_sim_root, 'bin', 'release', 'accel-sim.out')
        trace_path = os.path.join(
            self.log_file_path, 'traces', 'kernelslist.g')
        # look for the config files under the config directory of the current script
        if no_write_allocate:
            gpgpu_sim_config = os.path.join(
                os.path.dirname(os.path.abspath(__file__)), 'configs', 'no_write_allocate.config')
        else:
            gpgpu_sim_config = os.path.join(
                os.path.dirname(os.path.abspath(__file__)), 'configs', 'gpgpusim.config')
        accel_sim_config = os.path.join(
            os.path.dirname(os.path.abspath(__file__)), 'configs', 'trace.config')
        cmd = [
            accel_sim_exec,
            '-trace', trace_path,
            '-config', gpgpu_sim_config,
            '-config', accel_sim_config,
            '-gpgpu_max_insn 10000000000'
        ]
        cmd_str = ' '.join(cmd)

        cache_log_path = os.path.join(
            self.log_file_path, f"{self.log_file_name}.sim_cache.log")
        main_log_path = os.path.join(
            self.log_file_path, f"{self.log_file_name}.sim.log")
        verbose_log_path = os.path.join(
            self.log_file_path, f"{self.log_file_name}.sim_verbose.log")

        cache_log_file = open(cache_log_path, 'w')
        main_log_file = open(main_log_path, 'w')
        verbose_log_file = open(
            verbose_log_path, 'w') if self.verbose else None

        # Define the regex patterns
        cache_pattern = r"L1|L2|Processing kernel|kernel name|kernel id|kernel command:CACHE_EVICT_WB:CACHE_WB"

        log_pattern = r"GPGPU-Sim:|gpgpu_simulation_time =|gpgpu_simulation_rate =|gpgpu_silicon_slowdown =|CPU Runtime:|GPU Runtime:|Processing kernel|kernel name|kernel id|kernel command|gpu_sim_cycle|gpu_sim_insn|gpu_ipc|gpu_tot_sim_cycle|gpu_tot_sim_insn|gpu_total_sim_rate|kernel_name"

        env['CMD_STR'] = cmd_str
        env['LOG_FILE_NAME'] = self.log_file_name
        env['LOG_FILE_PATH'] = self.log_file_path
        if self.verbose:
            env['VERBOSE'] = '1'

        # Run the simulator
        script = """#!/usr/bin/env bash
        source $ACCELSIM_ROOT/setup_environment.sh;
        stdbuf -oL -eL $CMD_STR
        """

        process = subprocess.Popen(
            ['bash', '-c', script],
            env=env,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            bufsize=1,
            executable='/bin/bash'
        )
        # Process the output line by line
        for line in iter(process.stdout.readline, ''):
            # Write to verbose log if enabled
            if self.verbose:
                verbose_log_file.write(line)
                verbose_log_file.flush()

            # Check for cache log pattern
            if re.search(cache_pattern, line):
                cache_log_file.write(line)
                cache_log_file.flush()

            # Check for sim log pattern
            if re.search(log_pattern, line):
                main_log_file.write(line)
                main_log_file.flush()
                # Print this line to stdout
                print(line, end='')

            sys.stdout.flush()
        # Wait for process to complete
        process.wait()
        # Close the log files
        cache_log_file.close()
        main_log_file.close()
        if self.verbose:
            verbose_log_file.close()

    def run_sampling(self, sample_delete=False):
        """Perform kernel sampling using Nsight Compute and Principal Kernel Selection (PKS).

        If an NCU report (`self.ncu_file`) is not provided, it first runs Nsight Compute
        to generate one. Then, it runs the `PrincipalKernelSelector` on the NCU report
        to identify representative kernels. It renames the original `kernelslist.g`
        (if it exists) to `kernelslist.old.g` and generates a new `kernelslist.g`
        containing only the selected kernels. Optionally deletes trace files not
        corresponding to the selected kernels.

        Args:
            sample_delete (bool): Remove traces not selected by PKS. Defaults to False.

        Returns:
            str: Path to the directory with updated `kernelslist.g`.

        Raises:
            Exception: Propagates errors from NCU or PKS execution.
        """
        trace_path = os.path.join(
            self.log_file_path, 'traces', 'kernelslist.g')
        old_trace_path = os.path.join(
            self.log_file_path, 'traces', 'kernelslist.old.g')

        if os.path.exists(trace_path) and self.delete:
            shutil.rmtree(os.path.dirname(trace_path))

        # Create parent directory if it doesn't exist
        os.makedirs(os.path.join(
            self.log_file_path, 'traces'), exist_ok=True)

        # If old_trace_path exists, then the program has already been run with sampling
        if os.path.exists(old_trace_path) and os.path.exists(trace_path):
            print(
                f"Samples from {self.program} have already been generated and saved to {trace_path}")
            if self.delete:
                os.remove(old_trace_path)
            else:
                return trace_path

        if not self.ncu_file or not os.path.exists(self.ncu_file):
            runner = NsightNVBitRunner()
            runner.init_from_params(
                program_name=self.program,
                program_args=self.program_args,
                log_file_name=self.log_file_name,
                log_file_path=self.log_file_path
            )
            ncu_file = runner.run_ncu(dry_run=False)
        else:
            ncu_file = self.ncu_file
        try:
            pks = PrincipalKernelSelector(
                ncu_input_file=ncu_file,
                output_dir=os.path.join(self.log_file_path, 'traces')
            )
            # Copy the existing kernelslist.g file to kernelslist.old.g
            if os.path.exists(trace_path):
                os.rename(trace_path, os.path.join(
                    self.log_file_path, 'traces', 'kernelslist.old.g'))
            pks.run_pks(trace_path, delete=sample_delete)
        except Exception as e:
            print(f"Error: {e}")
        finally:
            return trace_path

    def run_post_processing(self, config_file, sample=False):
        """Post-process simulation outputs into CSV and JSON reports.

        First, runs the `accel_sim_parser.run_parser` function to generate CSV summaries
        (`*.sim.csv`, `*.sim_l1.csv`, `*.sim_l2.csv`) from the simulation cache log.
        Second, runs the `gain_cell_frontend.py` script to generate a JSON file
        (`*.frontend.json`) for visualization or further analysis.

        Args:
            config_file (str): Path to GPGPU-Sim configuration used during simulation.
            sample (bool): Indicates if sampling was used. Defaults to False.

        Returns:
            None: Outputs `.sim.csv`, `.sim_l1.csv`, `.sim_l2.csv` and `.frontend.json`.
        """
        cache_log_path = os.path.join(
            self.log_file_path, f"{self.log_file_name}.sim.csv")
        run_parser(self.log_file_path, self.log_file_name, config_file)

        project_root = os.getenv('PROJECT_ROOT', '/gainsight')
        frontend_path = os.path.join(project_root, 'frontend')
        frontend_scripts_path = os.path.join(
            frontend_path, 'gain_cell_frontend.py')
        sample_string = "--sample" if sample else ""
        postprocess_log_path = os.path.join(
            self.log_file_path, f"{self.log_file_name}.postprocess.log")
        postprocess_log_file = open(postprocess_log_path, 'w')
        # Run the frontend script with cache_log_path as argument in the working directory of frontend_path
        process = subprocess.Popen(
            [sys.executable, frontend_scripts_path, sample_string, cache_log_path],
            cwd=frontend_path,
            stdout=postprocess_log_file,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            bufsize=1
        )
        if process.stdout is not None:
            for line in iter(process.stdout.readline, ''):
                postprocess_log_file.write(line)
                postprocess_log_file.flush()
                # Print this line to stdout
                print(line, end='')
                sys.stdout.flush()
        # Wait for process to complete
        process.wait()
        postprocess_log_file.close()

        # Check if the output JSON file exists
        output_json_path = os.path.join(
            self.log_file_path, f"{self.log_file_name}.frontend.json")
        if os.path.exists(output_json_path):
            print(
                f"Frontend JSON file generated: {output_json_path}")
        else:
            print(
                "Error: Frontend JSON file not generated. Please check the frontend script for errors.")
        return
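The log naming scheme from __init__ (program stem plus timestamp) can be reproduced compactly; os.path.splitext is an equivalent, slightly more idiomatic way to drop the extension than the manual split('.') used in the source. This helper is illustrative and not part of accel_sim.py:

```python
import os
from datetime import datetime

def log_base_name(program, rename=None):
    """Mirror the naming in AccelSimRunner.__init__: <stem>_<timestamp>.

    Illustrative helper, not part of accel_sim.py.
    """
    # splitext drops only the final extension, like the split('.') logic.
    stem = rename or os.path.splitext(os.path.basename(program))[0]
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    return f"{stem}_{timestamp}"

print(log_base_name("/workloads/vectoradd.py"))  # e.g. vectoradd_2025-01-01_12-00-00
```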

__init__(program, args, arch='SM90_H100', verbose=False, delete=False, rename=None, ncu_file=None)

Initialize the AccelSimRunner.

Sets up paths, log naming, and execution environment based on input parameters.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| program | str | Path or name of the program to profile. | required |
| args | List[str] | Arguments to pass to the program. | required |
| arch | str | GPU architecture. | "SM90_H100" |
| verbose | bool | Enable verbose simulator logging. | False |
| delete | bool | Remove existing traces directory. | False |
| rename | str | Custom base name for log files. | None |
| ncu_file | str | Existing Nsight Compute report for sampling. | None |

Returns:

| Type | Description |
|---|---|
| None | Initializes runner state and prepares directories. |

Source code in python-scripts/accel_sim.py
def __init__(self, program, args, arch="SM90_H100", verbose=False, delete=False, rename=None, ncu_file=None):
    """Initialize the AccelSimRunner.

    Sets up paths, log naming, and execution environment based on input parameters.

    Args:
        program (str): Path or name of the program to profile.
        args (List[str]): Arguments to pass to the program.
        arch (str): GPU architecture (default: "SM90_H100").
        verbose (bool): Enable verbose simulator logging.
        delete (bool): Remove existing traces directory.
        rename (str): Custom base name for log files.
        ncu_file (str): Existing Nsight Compute report for sampling.

    Returns:
        None: Initializes runner state and prepares directories.
    """
    # Convert program to absolute path if it's a relative path
    self.program = os.path.abspath(program) if program not in [
        "python", "python3", "torchrun"] else program
    self.program_args = args
    self.arch = arch
    self.verbose = verbose
    self.delete = delete
    self.pwd = os.getenv('PWD', os.getcwd())

    # Store original working directory
    self.original_cwd = os.getcwd()

    output_program_name = os.path.basename(program)
    # If the program is python3, we need to use sys.executable
    program_basename = os.path.basename(program)
    if program_basename == "python" or program_basename.startswith("python3") or program_basename == "torchrun":
        output_program_name = ".".join(
            os.path.basename(args[0]).split('.')[:-1])
        self.program = sys.executable
        self.program_args[0] = os.path.abspath(args[0])
    # If the program is a .py file, we need to use sys.executable
    if program.endswith(".py"):
        output_program_name = ".".join(
            os.path.basename(program).split('.')[:-1])
        self.program_args = [os.path.abspath(program)] + args
        self.program = sys.executable
    if rename:
        output_program_name = rename

    # Generate a timestamp for the log file name
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    self.log_file_name = f"{output_program_name}_{timestamp}"
    self.log_file_path = os.path.join(
        os.getenv('PROJECT_ROOT', '.'), 'logs', output_program_name)
    os.makedirs(self.log_file_path, exist_ok=True)

    self.kernelslist_g = os.path.join(
        self.log_file_path, 'traces', 'kernelslist.g')

    # Dump program and arguments to a file
    with open(os.path.join(self.log_file_path, f"{self.log_file_name}.sim_cmd.log"), 'w') as f:
        f.write(f"{self.program} {' '.join(self.program_args)}")

    self.ncu_file = ncu_file

run_accel_sim(no_write_allocate=False)

Run the Accel-Sim GPGPU simulator on the generated traces.

Sets up the environment for Accel-Sim, constructs the command line with appropriate configuration files (including handling the write-allocate setting), and executes the simulator. It captures and redirects the simulator's output to different log files (.sim_cache.log, .sim.log, .sim_verbose.log) based on regex patterns.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| no_write_allocate | bool | Disable cache write-allocate configuration. | False |

Returns:

| Type | Description |
|---|---|
| None | Executes the simulation and writes log files. |

Source code in python-scripts/accel_sim.py
def run_accel_sim(self, no_write_allocate=False):
    """
    Run the Accel-Sim GPGPU simulator on the generated traces.

    Sets up the environment for Accel-Sim, constructs the command line with
    appropriate configuration files (including handling the write-allocate setting),
    and executes the simulator. It captures and redirects the simulator's output
    to different log files (`.sim_cache.log`, `.sim.log`, `.sim_verbose.log`)
    based on regex patterns.

    Args:
        no_write_allocate (bool): Disable cache write-allocate configuration. Defaults to False.

    Returns:
        None: Executes the simulation and writes log files.
    """
    # Get and set the environment variables
    env = os.environ.copy()
    project_root = os.getenv('PROJECT_ROOT', '/gainsight')
    ld_library_path = env.get('LD_LIBRARY_PATH', '')
    accel_sim_root = os.path.join(
        project_root, 'backend', 'accel-sim', 'gpu-simulator')
    env['ACCELSIM_ROOT'] = accel_sim_root
    gpgpusim_root = os.path.join(accel_sim_root, 'gpgpu-sim')
    gpgpusim_config = env.get(
        'GPGPUSIM_CONFIG', 'gcc-11.4.0/cuda-11080/release')
    ld_library_path = re.sub(
        rf'{re.escape(gpgpusim_root)}/lib/[0-9]+/(debug|release):', '', ld_library_path)
    ld_library_path = f"{gpgpusim_root}/lib/{gpgpusim_config}:{ld_library_path}"
    env['LD_LIBRARY_PATH'] = ld_library_path

    # Set arguments for the simulator
    accel_sim_exec = os.path.join(
        accel_sim_root, 'bin', 'release', 'accel-sim.out')
    trace_path = os.path.join(
        self.log_file_path, 'traces', 'kernelslist.g')
    # look for the config files under the config directory of the current script
    if no_write_allocate:
        gpgpu_sim_config = os.path.join(
            os.path.dirname(os.path.abspath(__file__)), 'configs', 'no_write_allocate.config')
    else:
        gpgpu_sim_config = os.path.join(
            os.path.dirname(os.path.abspath(__file__)), 'configs', 'gpgpusim.config')
    accel_sim_config = os.path.join(
        os.path.dirname(os.path.abspath(__file__)), 'configs', 'trace.config')
    cmd = [
        accel_sim_exec,
        '-trace', trace_path,
        '-config', gpgpu_sim_config,
        '-config', accel_sim_config,
        '-gpgpu_max_insn 10000000000'
    ]
    cmd_str = ' '.join(cmd)

    cache_log_path = os.path.join(
        self.log_file_path, f"{self.log_file_name}.sim_cache.log")
    main_log_path = os.path.join(
        self.log_file_path, f"{self.log_file_name}.sim.log")
    verbose_log_path = os.path.join(
        self.log_file_path, f"{self.log_file_name}.sim_verbose.log")

    cache_log_file = open(cache_log_path, 'w')
    main_log_file = open(main_log_path, 'w')
    verbose_log_file = open(
        verbose_log_path, 'w') if self.verbose else None

    # Define the regex patterns
    cache_pattern = r"L1|L2|Processing kernel|kernel name|kernel id|kernel command:CACHE_EVICT_WB:CACHE_WB"

    log_pattern = r"GPGPU-Sim:|gpgpu_simulation_time =|gpgpu_simulation_rate =|gpgpu_silicon_slowdown =|CPU Runtime:|GPU Runtime:|Processing kernel|kernel name|kernel id|kernel command|gpu_sim_cycle|gpu_sim_insn|gpu_ipc|gpu_tot_sim_cycle|gpu_tot_sim_insn|gpu_total_sim_rate|kernel_name"

    env['CMD_STR'] = cmd_str
    env['LOG_FILE_NAME'] = self.log_file_name
    env['LOG_FILE_PATH'] = self.log_file_path
    if self.verbose:
        env['VERBOSE'] = '1'

    # Run the simulator
    script = """#!/usr/bin/env bash
    source $ACCELSIM_ROOT/setup_environment.sh;
    stdbuf -oL -eL $CMD_STR
    """

    process = subprocess.Popen(
        ['bash', '-c', script],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        bufsize=1,
        executable='/bin/bash'
    )
    # Process the output line by line
    for line in iter(process.stdout.readline, ''):
        # Write to verbose log if enabled
        if self.verbose:
            verbose_log_file.write(line)
            verbose_log_file.flush()

        # Check for cache log pattern
        if re.search(cache_pattern, line):
            cache_log_file.write(line)
            cache_log_file.flush()

        # Check for sim log pattern
        if re.search(log_pattern, line):
            main_log_file.write(line)
            main_log_file.flush()
            # Print this line to stdout
            print(line, end='')

        sys.stdout.flush()
    # Wait for process to complete
    process.wait()
    # Close the log files
    cache_log_file.close()
    main_log_file.close()
    if self.verbose:
        verbose_log_file.close()

run_post_processing(config_file, sample=False)

Post-process simulation outputs into CSV and JSON reports.

First, runs the accel_sim_parser.run_parser function to generate CSV summaries (*.sim.csv, *.sim_l1.csv, *.sim_l2.csv) from the simulation cache log. Second, runs the gain_cell_frontend.py script to generate a JSON file (*.frontend.json) for visualization or further analysis.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `config_file` | `str` | Path to the GPGPU-Sim configuration used during simulation. | required |
| `sample` | `bool` | Indicates if sampling was used. Defaults to False. | `False` |

Returns:

| Type | Description |
| --- | --- |
| `None` | Outputs `.sim.csv`, `.sim_l1.csv`, `.sim_l2.csv`, and `.frontend.json`. |
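The line-by-line "tee" used when launching the frontend script can be sketched in isolation. This is a minimal, self-contained sketch: the child command below is a stand-in for `gain_cell_frontend.py`, and the log path is a throwaway temp file rather than the real `.postprocess.log`:

```python
import os
import subprocess
import sys
import tempfile

# Stream a child process's output to both a log file and stdout,
# mirroring how the post-processing log is captured.
log_path = os.path.join(tempfile.mkdtemp(), "postprocess.log")
child = [sys.executable, "-c", "print('parsing csv'); print('writing json')"]

with open(log_path, "w") as log_file:
    process = subprocess.Popen(
        child,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        bufsize=1,
    )
    for line in iter(process.stdout.readline, ""):
        log_file.write(line)   # persist to the log file
        print(line, end="")    # echo to the console as it arrives
    process.wait()
```

Reading the child's stdout through a pipe (rather than redirecting it straight to the file) is what makes the simultaneous console echo possible.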

Source code in python-scripts/accel_sim.py
def run_post_processing(self, config_file, sample=False):
    """Post-process simulation outputs into CSV and JSON reports.

    First, runs the `accel_sim_parser.run_parser` function to generate CSV summaries
    (`*.sim.csv`, `*.sim_l1.csv`, `*.sim_l2.csv`) from the simulation cache log.
    Second, runs the `gain_cell_frontend.py` script to generate a JSON file
    (`*.frontend.json`) for visualization or further analysis.

    Args:
        config_file (str): Path to GPGPU-Sim configuration used during simulation.
        sample (bool): Indicates if sampling was used. Defaults to False.

    Returns:
        None: Outputs `.sim.csv`, `.sim_l1.csv`, `.sim_l2.csv` and `.frontend.json`.
    """
    # Note: despite the name, this path points to the parsed CSV summary,
    # which the frontend script consumes
    cache_log_path = os.path.join(
        self.log_file_path, f"{self.log_file_name}.sim.csv")
    run_parser(self.log_file_path, self.log_file_name, config_file)

    project_root = os.getenv('PROJECT_ROOT', '/gainsight')
    frontend_path = os.path.join(project_root, 'frontend')
    frontend_scripts_path = os.path.join(
        frontend_path, 'gain_cell_frontend.py')
    postprocess_log_path = os.path.join(
        self.log_file_path, f"{self.log_file_name}.postprocess.log")
    postprocess_log_file = open(postprocess_log_path, 'w')
    # Build the frontend arguments; omit --sample entirely when sampling is
    # off so the script does not receive an empty positional argument
    frontend_cmd = [sys.executable, frontend_scripts_path]
    if sample:
        frontend_cmd.append("--sample")
    frontend_cmd.append(cache_log_path)
    # Run the frontend script in the working directory of frontend_path,
    # teeing its output to both the post-processing log and stdout
    process = subprocess.Popen(
        frontend_cmd,
        cwd=frontend_path,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        bufsize=1
    )
    for line in iter(process.stdout.readline, ''):
        postprocess_log_file.write(line)
        postprocess_log_file.flush()
        # Print this line to stdout as well
        print(line, end='')
        sys.stdout.flush()
    # Wait for process to complete
    process.wait()
    postprocess_log_file.close()

    # Check if the output JSON file exists
    output_json_path = os.path.join(
        self.log_file_path, f"{self.log_file_name}.frontend.json")
    if os.path.exists(output_json_path):
        print(
            f"Frontend JSON file generated: {output_json_path}")
    else:
        print(
            "Error: Frontend JSON file not generated. Please check the frontend script for errors.")
    return

run_sampling(sample_delete=False)

Perform kernel sampling using Nsight Compute and Principal Kernel Selection (PKS).

If an NCU report (self.ncu_file) is not provided, it first runs Nsight Compute to generate one. Then, it runs the PrincipalKernelSelector on the NCU report to identify representative kernels. It renames the original kernelslist.g (if it exists) to kernelslist.old.g and generates a new kernelslist.g containing only the selected kernels. Optionally deletes trace files not corresponding to the selected kernels.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `sample_delete` | `bool` | Remove traces not selected by PKS. Defaults to False. | `False` |

Returns:

| Type | Description |
| --- | --- |
| `str` | Path to the directory with the updated `kernelslist.g`. |

Raises:

| Type | Description |
| --- | --- |
| `Exception` | Propagates errors from NCU or PKS execution. |
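The `kernelslist.g` bookkeeping described above (preserve the original list, then write a reduced one) can be sketched with a throwaway directory. The selected kernel name below is a hypothetical stand-in for real PKS output:

```python
import os
import tempfile

traces = tempfile.mkdtemp()
trace_path = os.path.join(traces, "kernelslist.g")
old_trace_path = os.path.join(traces, "kernelslist.old.g")

# Pretend the tracer already produced a full kernel list
with open(trace_path, "w") as f:
    f.write("kernel-1.traceg\nkernel-2.traceg\nkernel-3.traceg\n")

# Preserve the original list, then write only the selected kernels
os.rename(trace_path, old_trace_path)
with open(trace_path, "w") as f:
    f.write("kernel-2.traceg\n")  # hypothetical PKS selection

print(sorted(os.listdir(traces)))
```

Keeping `kernelslist.old.g` alongside the reduced list is what lets a later run detect that sampling has already been performed.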

Source code in python-scripts/accel_sim.py
def run_sampling(self, sample_delete=False):
    """Perform kernel sampling using Nsight Compute and Principal Kernel Selection (PKS).

    If an NCU report (`self.ncu_file`) is not provided, it first runs Nsight Compute
    to generate one. Then, it runs the `PrincipalKernelSelector` on the NCU report
    to identify representative kernels. It renames the original `kernelslist.g`
    (if it exists) to `kernelslist.old.g` and generates a new `kernelslist.g`
    containing only the selected kernels. Optionally deletes trace files not
    corresponding to the selected kernels.

    Args:
        sample_delete (bool): Remove traces not selected by PKS. Defaults to False.

    Returns:
        str: Path to the directory with updated `kernelslist.g`.

    Raises:
        Exception: Propagates errors from NCU or PKS execution.
    """
    trace_path = os.path.join(
        self.log_file_path, 'traces', 'kernelslist.g')
    old_trace_path = os.path.join(
        self.log_file_path, 'traces', 'kernelslist.old.g')

    if os.path.exists(trace_path) and self.delete:
        shutil.rmtree(os.path.dirname(trace_path))

    # Create parent directory if it doesn't exist
    os.makedirs(os.path.join(
        self.log_file_path, 'traces'), exist_ok=True)

    # If old_trace_path exists, then the program has already been run with sampling
    if os.path.exists(old_trace_path) and os.path.exists(trace_path):
        print(
            f"Samples from {self.program} have already been generated and saved to {trace_path}")
        if self.delete:
            os.remove(old_trace_path)
        else:
            return trace_path

    if not self.ncu_file or not os.path.exists(self.ncu_file):
        runner = NsightNVBitRunner()
        runner.init_from_params(
            program_name=self.program,
            program_args=self.program_args,
            log_file_name=self.log_file_name,
            log_file_path=self.log_file_path
        )
        ncu_file = runner.run_ncu(dry_run=False)
    else:
        ncu_file = self.ncu_file
    try:
        pks = PrincipalKernelSelector(
            ncu_input_file=ncu_file,
            output_dir=os.path.join(self.log_file_path, 'traces')
        )
        # Rename the existing kernelslist.g file to kernelslist.old.g
        if os.path.exists(trace_path):
            os.rename(trace_path, old_trace_path)
        pks.run_pks(trace_path, delete=sample_delete)
    except Exception as e:
        # Log and re-raise so NCU/PKS failures propagate, as documented
        print(f"Error: {e}")
        raise
    return trace_path

run_tracer(sample=False)

Run the NVBit tracer to generate SASS instruction traces for the program.

Sets up the environment for the NVBit tracer tool, executes the target program under the tracer, and then runs the post-processing script to convert raw traces into the format required by Accel-Sim (.traceg files). Optionally deletes the intermediate .trace files.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `sample` | `bool` | Enable kernel sampling via environment variable. Defaults to False. | `False` |

Returns:

| Type | Description |
| --- | --- |
| `str` | Path to the directory containing processed traces. |

Raises:

| Type | Description |
| --- | --- |
| `FileNotFoundError` | If the program executable is not found. |

Source code in python-scripts/accel_sim.py
def run_tracer(self, sample=False):
    """Run the NVBit tracer to generate SASS instruction traces for the program.

    Sets up the environment for the NVBit tracer tool, executes the target program
    under the tracer, and then runs the post-processing script to convert raw
    traces into the format required by Accel-Sim (`.traceg` files). Optionally
    deletes the intermediate `.trace` files.

    Args:
        sample (bool): Enable kernel sampling via environment variable. Defaults to False.

    Returns:
        str: Path to the directory containing processed traces.

    Raises:
        FileNotFoundError: If the program executable is not found.
    """
    trace_path = os.path.join(self.log_file_path, 'traces')
    project_root = os.getenv('PROJECT_ROOT', '/gainsight')
    tracer_tool_path = os.getenv('TRACER_TOOLS', os.path.join(
        project_root, 'backend', 'accel-sim', 'util', 'tracer_nvbit', 'tracer_tool'))

    # Print kernelslist_g if it exists
    if os.path.exists(self.kernelslist_g):
        print(f"Kernels list file already exists: {self.kernelslist_g}")
        with open(self.kernelslist_g, 'r') as f:
            print(f.read())

    # Clear the traces directory
    if os.path.exists(os.path.join(trace_path, 'stats.csv')):
        print(
            f"Processed traces from {self.program} have already been generated and saved to {trace_path}")
        return trace_path
    os.makedirs(trace_path, exist_ok=True)

    # Run the tracer
    env = os.environ.copy()
    env['CUDA_INJECTION64_PATH'] = os.path.join(
        tracer_tool_path, 'tracer_tool.so')
    env['SAMPLE'] = '1' if sample else '0'
    env['KERNELSLIST'] = self.kernelslist_g
    env['USER_DEFINED_FOLDERS'] = '1'
    env['TRACES_FOLDER'] = trace_path
    env['TRACE_FILE_COMPRESS'] = '0'
    # Only trace the first 1000 kernels
    env['DYNAMIC_KERNEL_LIMIT_START'] = '0'
    env['DYNAMIC_KERNEL_LIMIT_END'] = '1000'
    env['INSTR_END'] = '200000000'

    # Check that the program exists (interpreter launchers are allowed through)
    if not os.path.exists(self.program) and self.program not in ["python", "python3", "torchrun"]:
        raise FileNotFoundError(f"Program not found: {self.program}")

    # Use original working directory instead of log file path
    # only run this if kernelslist does not exist
    if not os.path.exists(os.path.join(trace_path, 'kernelslist')):
        print("Running the tracer...")
        subprocess.run(
            [self.program] + self.program_args,
            env=env,
            cwd=self.pwd
            # cwd=self.log_file_path
        )
    else:
        print(
            f"Traces from {self.program} have already been generated and saved to {trace_path}")

    # Process the traces
    print("Processing traces...")
    subprocess.run([
        os.path.join(tracer_tool_path, 'traces-processing',
                     'post-traces-processing'),
        os.path.join(trace_path, 'kernelslist')
    ])

    # Delete the .trace files to save space
    for trace_file in os.listdir(trace_path):
        if trace_file.endswith('.trace') or trace_file.endswith('.trace.xz'):
            os.remove(os.path.join(trace_path, trace_file))
    print(
        f"Traces from {self.program} consisting of {len(os.listdir(trace_path)) - 3} kernel(s) have been saved to {trace_path}")
    return trace_path

parse_args()

Parse command-line arguments for the Accel-Sim runner script.

Defines and parses arguments related to program execution, simulation options, tracing, sampling, and post-processing.

Returns:

| Type | Description |
| --- | --- |
| `argparse.Namespace` | An object containing the parsed command-line arguments, including attributes such as `program`, `args`, `sample`, `arch`, `delete`, `verbose`, `rename`, `trace_only`, `replay_only`, `post_process`, `sample_delete`, `no_write_allocate`, and `ncu_file`. |
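One detail worth noting is the use of `argparse.REMAINDER` for the trailing program arguments: everything after the program name is forwarded untouched, even tokens that look like options. A minimal sketch with only a couple of the flags above (`./my_app` and `--foo` are made-up examples):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--sample", action="store_true")
parser.add_argument("program", type=str)
# REMAINDER swallows everything after the program name, even
# option-like tokens, so they reach the target program unparsed
parser.add_argument("args", nargs=argparse.REMAINDER)

ns = parser.parse_args(["--sample", "./my_app", "--foo", "bar"])
print(ns.sample, ns.program, ns.args)
```

Without `REMAINDER`, argparse would reject `--foo` as an unrecognized option instead of passing it through to the profiled program.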

Source code in python-scripts/accel_sim.py
def parse_args():
    """Parse command-line arguments for the Accel-Sim runner script.

    Defines and parses arguments related to program execution, simulation options,
    tracing, sampling, and post-processing.

    Returns:
        argparse.Namespace: An object containing the parsed command-line arguments.
            Includes attributes like `program`, `args`, `sample`, `arch`, `delete`,
            `verbose`, `rename`, `trace_only`, `replay_only`, `post_process`,
            `sample_delete`, `no_write_allocate`, `ncu_file`.
    """
    # echo "Usage: ./generate_traces.sh [--sample] <program> <args>"
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--sample",
        help="Run the program with kernel sampling",
        action="store_true",
        default=False
    )
    parser.add_argument(
        "--arch",
        help="The architecture to simulate",
        type=str,
        default="SM90_H100"
    )
    parser.add_argument(
        "--delete",
        help="Delete the traces directory before running the tracer",
        action="store_true",
        default=False
    )
    parser.add_argument(
        "--verbose",
        help="Store verbose output from Accel-Sim",
        action="store_true",
        default=False
    )
    parser.add_argument(
        "--rename",
        help="Rename the log file",
        type=str,
        default=None
    )
    parser.add_argument(
        "--trace-only",
        help="Only run the tracer and exit",
        action="store_true",
        default=False
    )
    parser.add_argument(
        "--replay-only",
        help="Only run the replay and exit",
        action="store_true",
        default=False
    )
    parser.add_argument(
        "--post-process",
        help="Run the post-processing step to generate the simulation results",
        type=str,
        default=None
    )
    parser.add_argument(
        "--sample-delete",
        help="Delete traces that are not used for sampling",
        action="store_true",
        default=False
    )
    parser.add_argument(
        "--no-write-allocate",
        help="Disable write allocate for the cache",
        action="store_true",
        default=False
    )
    parser.add_argument(
        "--ncu-file",
        help="The Nsight Compute report to use for kernel sampling",
        type=str,
        default=None
    )
    parser.add_argument(
        "program",
        help="The program to profile",
        type=str
    )
    parser.add_argument(
        "args",
        help="Arguments to pass to the program",
        type=str,
        nargs=argparse.REMAINDER
    )
    args = parser.parse_args()
    return args

Sampling via Principal Kernel Selection (PKS)

The pks.py script implements the Principal Kernel Selection (PKS) algorithm for sampling in the Accel-Sim simulator. It provides functions to select a subset of kernels based on their execution characteristics and performance metrics.

Principal Kernel Selection implementation for CUDA kernel profiling.

This module implements the Principal Kernel Selection (PKS) algorithm, which uses PCA and K-means clustering to identify representative CUDA kernels from an NVIDIA Nsight Compute report. This helps reduce simulation time by focusing on a smaller set of representative kernels.

The functionality of this module is an adaptation and reproduction of the following work: Cesar Avalos Baddouh, Mahmoud Khairy, Roland N. Green, Mathias Payer, and Timothy G. Rogers. 2021. Principal Kernel Analysis: A Tractable Methodology to Simulate Scaled GPU Workloads. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '21). Association for Computing Machinery, New York, NY, USA, 724–737. https://doi.org/10.1145/3466752.3480100

TODO: Implement real-time monitoring of read/write frequencies, bursts, and lifetime trends.

Typical usage

```python
pks = PrincipalKernelSelector(
    ncu_input_file="path/to/report.ncu-rep",
    output_dir="path/to/traces_dir"
)
pks.run_pks()
```
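At its core, PKS picks, for each K-means cluster, the kernel whose PCA coordinates lie closest to the cluster centroid. This is a self-contained sketch of that selection step using toy coordinates and hand-picked cluster assignments, not real kernel data:

```python
import numpy as np

# Toy PCA-space coordinates for 6 kernels (rows) in 2 components
points = np.array([
    [0.0, 0.0], [0.1, 0.1], [0.2, 0.0],   # cluster near the origin
    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],   # cluster near (5, 5)
])
labels = np.array([0, 0, 0, 1, 1, 1])      # K-means assignments
centers = np.array([[0.1, 0.05], [5.0, 5.03]])

# For each cluster, take the kernel closest to its centroid
representatives = []
for cid, center in enumerate(centers):
    idx = np.where(labels == cid)[0]
    dists = np.linalg.norm(points[idx] - center, axis=1)
    representatives.append(int(idx[np.argmin(dists)]))

print(representatives)
```

Each representative kernel then stands in for every kernel in its cluster during simulation, weighted by the cluster size.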

PrincipalKernelSelector

Analyzes CUDA kernel metrics to identify representative kernels.

Implements the Principal Kernel Selection (PKS) algorithm using PCA and K-means clustering to select representative CUDA kernels for simulation.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `ncu_input_file` | `str` | Path to the NCU report file to analyze. | required |
| `output_dir` | `str` | Directory for saving output files. Defaults to the script directory. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `None` | |

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `metrics` | `List[str]` | List of collected NCU metrics for analysis. |
| `ncu_data` | `DataFrame` | DataFrame containing raw kernel metrics. |
| `pca_df` | `DataFrame` | DataFrame with PCA-transformed kernel data. |
| `cluster_count` | `int` | Optimal number of clusters determined. |
| `kernel_df` | `DataFrame` | DataFrame with kernel details and cluster assignments. |
| `cluster_df` | `DataFrame` | DataFrame with cluster details and centroids. |
| `output_dir` | `str` | Directory for saving output files. |
| `bypass` | `bool` | Flag indicating if PKS should be bypassed. |
| `sum_lts_t_sectors_op_write` | `float` | Sum of `lts__t_sectors_op_write.sum` across all kernels. |
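The PCA step keeps just enough principal components to explain 95% of the variance in the kernel metrics. A NumPy-only sketch of that idea on synthetic, deliberately redundant metrics (the class itself uses scikit-learn's `PCA(n_components=0.95)`, which does the equivalent internally):

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(6, 2))
X = np.hstack([base, base * 2.0])   # 4 "metrics", only 2 independent

# Standardize, then rank components by explained variance via SVD
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
explained = (S ** 2) / (S ** 2).sum()

# Smallest k whose cumulative explained variance reaches 95%
k = int(np.searchsorted(np.cumsum(explained), 0.95) + 1)
pcs = Xs @ Vt[:k].T                 # PCA-transformed kernel data

print(pcs.shape[0], k <= 2)
```

Because two of the four columns are exact multiples of the others, the data has rank 2, so at most two components are ever needed to hit the threshold.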

Source code in python-scripts/pks.py
class PrincipalKernelSelector:
    """Analyzes CUDA kernel metrics to identify representative kernels.

    Implements the Principal Kernel Selection (PKS) algorithm using PCA and
    K-means clustering to select representative CUDA kernels for simulation.

    Args:
        ncu_input_file (str): Path to the NCU report file to analyze.
        output_dir (str, optional): Directory for saving output files. Defaults to script directory.

    Returns:
        None

    Attributes:
        metrics (List[str]): List of collected NCU metrics for analysis.
        ncu_data (pd.DataFrame): DataFrame containing raw kernel metrics.
        pca_df (pd.DataFrame): DataFrame with PCA-transformed kernel data.
        cluster_count (int): Optimal number of clusters determined.
        kernel_df (pd.DataFrame): DataFrame with kernel details and cluster assignments.
        cluster_df (pd.DataFrame): DataFrame with cluster details and centroids.
        output_dir (str): Directory for saving output files.
        bypass (bool): Flag indicating if PKS should be bypassed.
        sum_lts_t_sectors_op_write (float): Sum of 'lts__t_sectors_op_write.sum' for all kernels.
    """

    def __init__(self, ncu_input_file: str, output_dir: str = None):
        """Initialize the Principal Kernel Selector with an NCU report.

        Args:
            ncu_input_file (str): Path to the NCU report file to analyze.
            output_dir (str, optional): Directory for saving output files. Should be the directory
                containing trace files (kernel-1.traceg, etc.). Defaults to the
                directory of this script.

        Raises:
            ValueError: If the input file contains fewer than 2 kernels (unless bypass is triggered).
            FileNotFoundError: If the metrics_list.json file is not found.
            ImportError: If the ncu_report module cannot be imported.
        """
        # Class member variables for main-function state
        self.metrics = None
        self.ncu_data = None
        self.pca_df = None

        # Path to output directory for generated files
        # Defaults to path of this script if not provided or invalid
        self.output_dir = output_dir if output_dir and os.path.exists(output_dir) \
            else os.path.dirname(os.path.realpath(__file__))

        # Number of clusters
        self.cluster_count = None

        # DataFrame containing kernel details and cluster assignments
        self.kernel_df = None

        # DataFrame containing cluster details and centroids
        self.cluster_df = None

        # Load metrics from configuration file
        metrics_path = os.path.join(os.path.dirname(
            os.path.realpath(__file__)), "metrics_list.json")
        with open(metrics_path, "r") as metrics_file:
            metrics = json.load(metrics_file)

        # Flatten metrics dictionary into a list
        concatenated_metrics = []
        for key, value in metrics.items():
            concatenated_metrics.extend(value)
        self.metrics = concatenated_metrics

        # Load NCU report data
        ncu_context = ncu_report.load_report(ncu_input_file)
        ncu_range = ncu_context[0]  # Use first range in report

        if ncu_context.num_ranges() > 1:
            print("Warning: Multiple ranges found in the input file. "
                  "Using the first range.")

        print(
            f"Loaded {ncu_range.num_actions()} kernels from {ncu_input_file}")

        # Bypass PKS entirely when the workload has 20 or fewer kernels,
        # since clustering such a small population is not worthwhile
        self.bypass = ncu_range.num_actions() <= 20

        # Create initial dataframe with kernel information
        kernel_ids = range(ncu_range.num_actions())
        kernel_names = [action.name() for action in ncu_range]

        data = {
            "Kernel Name": kernel_names,
            "Kernel ID": list(kernel_ids),
        }

        # Add metrics data to dataframe
        for metric in self.metrics:
            data[metric] = [action[metric].value() for action in ncu_range]

        self.ncu_data = pd.DataFrame(data)

        # Initialize kernel_df with basic kernel information
        self.kernel_df = self.ncu_data[["Kernel ID", "Kernel Name"]].copy()

        self.sum_lts_t_sectors_op_write = self.ncu_data["lts__t_sectors_op_write.sum"].sum()

    def pca(self, data: pd.DataFrame, var_threshold: float = 0.95) -> Tuple[pd.DataFrame, np.ndarray]:
        """Perform Principal Component Analysis on kernel metrics.

        Args:
            data (pd.DataFrame): DataFrame containing kernel metrics.
            var_threshold (float): Variance threshold for PCA dimensionality reduction. Defaults to 0.95.

        Returns:
            Tuple[pd.DataFrame, np.ndarray]: A tuple containing:
                - pd.DataFrame: DataFrame with transformed data, columns named 'PC0', 'PC1', etc.
                - np.ndarray: Raw numpy array of the transformed data.
        """
        # Create a copy to avoid modifying original data
        data_copy = data.copy()

        # Remove non-metric columns
        data_copy.drop(columns=["Kernel Name", "Kernel ID"], inplace=True)

        # Standardize the data
        data_copy = StandardScaler().fit_transform(data_copy)

        # Apply PCA
        pca_model = PCA(n_components=var_threshold)
        transformed_data = pca_model.fit_transform(data_copy)

        # Create a dataframe with the transformed data
        transformed_df = pd.DataFrame(
            transformed_data,
            columns=[f"PC{i}" for i in range(pca_model.n_components_)]
        )

        return transformed_df, transformed_data

    def kmeans(self, data: pd.DataFrame, n_clusters: int = 3) -> Tuple[np.ndarray, np.ndarray, float]:
        """Perform K-means clustering with a specified number of clusters.

        Calculates cluster assignments, centroids, and a custom score based on the
        difference between the sum of 'lts__t_sectors_op_write.sum' for centroid-representative
        kernels (weighted by cluster size) and the total sum for all kernels.

        Args:
            data (pd.DataFrame): Data to cluster (typically PCA-transformed).
            n_clusters (int): Number of clusters to form. Defaults to 3.

        Returns:
            Tuple[np.ndarray, np.ndarray, float]: A tuple containing:
                - np.ndarray: Array of cluster labels for each data point.
                - np.ndarray: Array of cluster centers (centroids).
                - float: Custom score representing the relative error in 'lts__t_sectors_op_write.sum'.
        """
        # Apply K-means clustering
        kmeans_result = KMeans(
            n_clusters=n_clusters,
            random_state=12,  # For reproducibility
            n_init="auto"  # Use default initialization
        ).fit(data)

        labels = kmeans_result.labels_
        centers = kmeans_result.cluster_centers_

        cluster_lts_t_sectors_op_write = 0

        for i, center in enumerate(centers):
            # Number of kernels assigned to this cluster
            num_kernels = np.sum(labels == i)
            # Indices of the points in this cluster
            indices = np.where(labels == i)[0]
            # Find the kernel closest to the cluster center in PCA space
            min_distance = np.inf
            closest_index = None
            for index in indices:
                # get the principal component values for the point
                point = self.pca_df.iloc[index].values
                # calculate the distance from the point to the center of the cluster
                distance = np.linalg.norm(point - center)
                if distance < min_distance:
                    min_distance = distance
                    closest_index = index
            # get the original data for the closest point
            original_data = self.ncu_data.iloc[closest_index]
            # get lts__t_sectors_op_write.sum
            lts_t_sectors_op_write = original_data["lts__t_sectors_op_write.sum"]
            cluster_lts_t_sectors_op_write += lts_t_sectors_op_write * num_kernels

        # Calculate custom score based on write count difference
        score = np.abs(cluster_lts_t_sectors_op_write -
                       self.sum_lts_t_sectors_op_write) / self.sum_lts_t_sectors_op_write

        # Return labels, centers, and the custom score
        return labels, centers, score

    def kmeans_scan(self, data: pd.DataFrame, lower_bound: int = 2, upper_bound: int = 20) -> None:
        """Find optimal number of clusters by scanning a range of values.

        Tries K-means with different numbers of clusters (from lower_bound to upper_bound).
        Selects the number of clusters corresponding to the lowest custom score calculated by `kmeans`.
        If multiple cluster counts yield scores within 5% of the minimum, the one with the
        fewest clusters is chosen. Updates class attributes `cluster_count`, `kernel_df`,
        and `cluster_df` with the results of the best clustering.

        Args:
            data (pd.DataFrame): Data to cluster (typically PCA-transformed).
            lower_bound (int): Minimum number of clusters to try. Defaults to 2.
            upper_bound (int): Maximum number of clusters to try. Defaults to 20.

        Returns:
            None: Updates class attributes with clustering results.
        """
        scores = []
        centers_list = []
        kmeans_labels_list = []

        # Try different numbers of clusters
        for i in range(lower_bound, upper_bound + 1):
            labels, centers, score = self.kmeans(data, i)
            print(f"Number of clusters: {i}, Write count error: {score}")
            scores.append(score)
            centers_list.append(centers)
            kmeans_labels_list.append(labels)

        # Find minimum score
        min_score = min(scores)

        # Use the first clustering within 5% of minimum score
        for i, score in enumerate(scores):
            if score <= 1.05 * min_score:
                self.cluster_count = i + lower_bound

                # Update kernel_df with cluster assignments
                self.kernel_df["Cluster ID"] = kmeans_labels_list[i]

                # Create cluster_df with centers and counts
                centers = centers_list[i]
                cluster_ids = range(self.cluster_count)
                counts = np.bincount(kmeans_labels_list[i])

                # Create dataframe with cluster details
                self.cluster_df = pd.DataFrame({
                    "Cluster ID": cluster_ids,
                    "Kernel Count": counts
                })

                # Ensure integer data types
                self.cluster_df["Cluster ID"] = pd.to_numeric(
                    self.cluster_df["Cluster ID"], errors='coerce').fillna(-1).astype(int)
                self.cluster_df["Kernel Count"] = pd.to_numeric(
                    self.cluster_df["Kernel Count"], errors='coerce').fillna(-1).astype(int)

                # Add center coordinates as separate columns
                for j in range(centers.shape[1]):
                    self.cluster_df[f"Center_PC{j}"] = [
                        centers[k, j] for k in range(self.cluster_count)]
                break

    def select_centroid(self) -> None:
        """Select representative kernels by finding points closest to cluster centers.

        For each cluster identified by K-means, this method finds the kernel
        (data point) in the PCA space that is closest to the cluster's center
        (centroid) using Euclidean distance. It marks this kernel as the
        representative for that cluster. Updates both `kernel_df` and `cluster_df`
        with centroid information (Kernel ID and Name).

        Returns:
            None: Updates `kernel_df` and `cluster_df` with centroid information.
        """
        # For each cluster, find the kernel closest to the center
        centroid_kernel_ids = []

        for cluster_id in range(self.cluster_count):
            # Get indices of kernels in this cluster
            cluster_mask = self.kernel_df["Cluster ID"] == cluster_id
            cluster_kernel_indices = self.kernel_df.index[cluster_mask].tolist()

            # Handle empty clusters
            if not cluster_kernel_indices:
                print(f"Warning: No kernels found in cluster {cluster_id}")
                centroid_kernel_ids.append(None)
                continue

            # Get PCA coordinates for kernels in this cluster
            cluster_data = self.pca_df.iloc[cluster_kernel_indices]

            # Get center coordinates for this cluster
            center_coords = []
            for j in range(self.pca_df.shape[1]):
                center_coords.append(
                    self.cluster_df.loc[cluster_id, f"Center_PC{j}"])

            # Calculate Euclidean distances to center
            distances = np.linalg.norm(
                cluster_data.values - center_coords, axis=1)

            # Find the kernel closest to the center
            closest_idx = cluster_kernel_indices[np.argmin(distances)]
            kernel_id = int(self.kernel_df.loc[closest_idx, "Kernel ID"])
            centroid_kernel_ids.append(kernel_id)

            # Add centroid information to cluster_df
            self.cluster_df.at[cluster_id, "Centroid Kernel ID"] = kernel_id
            self.cluster_df.at[cluster_id, "Centroid Kernel Name"] = \
                self.kernel_df.loc[closest_idx, "Kernel Name"]

        # Update kernel_df with centroid assignments
        for idx, row in self.kernel_df.iterrows():
            cluster_id = row["Cluster ID"]
            centroid_id = self.cluster_df.loc[cluster_id, "Centroid Kernel ID"]
            self.kernel_df.at[idx, "Centroid Kernel ID"] = centroid_id

        # Ensure integer data types throughout
        self.kernel_df["Cluster ID"] = self.kernel_df["Cluster ID"].astype(int)
        self.kernel_df["Centroid Kernel ID"] = pd.to_numeric(
            self.kernel_df["Centroid Kernel ID"], errors='coerce').fillna(-1).astype(int)
        self.kernel_df["Kernel ID"] = self.kernel_df["Kernel ID"].astype(int)
        self.cluster_df["Centroid Kernel ID"] = pd.to_numeric(
            self.cluster_df["Centroid Kernel ID"], errors='coerce').fillna(-1).astype(int)

    def generate_kernelslist(self, delete: bool = False) -> None:
        """Generate a list file containing the selected representative kernels.

        Creates a `kernelslist.g` file in the `output_dir`. This file lists the
        trace file names (e.g., `kernel-1.traceg`, `kernel-5.traceg`) corresponding
        to the selected centroid kernels. Optionally, deletes trace files in the
        `output_dir` that do not correspond to the selected centroids.

        Args:
            delete (bool): If True, deletes trace files (`*.traceg`) in the `output_dir`
                           that are not selected as centroids. Defaults to False.

        Returns:
            None: Writes `kernelslist.g` to the output directory and potentially deletes files.
        """
        # Get unique centroid kernel IDs (already 1-indexed from run_pks)
        centroid_ids = self.cluster_df["Centroid Kernel ID"].unique()
        centroid_ids = sorted([cid for cid in centroid_ids if cid >= 0])

        # Write kernelslist.g file
        with open(os.path.join(self.output_dir, "kernelslist.g"), "w") as f:
            for index in centroid_ids:
                f.write(f"kernel-{index}.traceg\n")

        # Optionally delete non-centroid trace files
        if delete:
            # iterate over all files in the output directory
            for filename in os.listdir(self.output_dir):
                # check if the file is a trace file and not in the centroid list
                if filename.startswith("kernel-") and (filename.endswith(".traceg") or filename.endswith(".traceg.xz")):
                    kernel_id = int(filename.split("-")[1].split(".")[0])
                    if kernel_id not in centroid_ids:
                        os.remove(os.path.join(self.output_dir, filename))
                        print(f"Deleted {filename}")

    def generate_kernels_csv(self) -> None:
        """Generate CSV files containing the kernel and cluster data.

        Creates two CSV files in the `output_dir`:
        - `kernels.csv`: Contains information about all kernels, including their
          original ID, name, assigned cluster ID, and the ID of the centroid
          representing their cluster.
        - `clusters.csv`: Contains information about each cluster, including its ID,
          the number of kernels it contains, and the ID and name of its
          representative centroid kernel.

        Returns:
            None: Writes `kernels.csv` and `clusters.csv` to the output directory.
        """
        # Export kernel_df to CSV
        self.kernel_df.to_csv(os.path.join(
            self.output_dir, "kernels.csv"), index=False)

        # Export cluster_df for reference (without PCA center coordinates)
        cluster_cols = ["Cluster ID", "Kernel Count",
                        "Centroid Kernel ID", "Centroid Kernel Name"]
        cluster_cols_available = [
            col for col in cluster_cols if col in self.cluster_df.columns]
        cluster_df_to_export = self.cluster_df[cluster_cols_available].copy()

        cluster_df_to_export.to_csv(os.path.join(
            self.output_dir, "clusters.csv"), index=False)

    def run_pks(self, output_file: str = None, delete: bool = False) -> None:
        """Execute the complete Principal Kernel Selection workflow.

        Performs PCA on the kernel metrics, finds the optimal number of clusters
        using K-means scanning, selects representative centroid kernels for each cluster,
        and generates output files (`kernelslist.g`, `kernels.csv`, `clusters.csv`)
        summarizing the results. If the number of kernels is small (<= 20),
        it bypasses the PCA and clustering steps, treating each kernel as its own cluster.

        Args:
            output_file (str, optional): Path to the output `kernelslist.g` file.
                If None, defaults to `kernelslist.g` within the `output_dir`.
                The directory of this file will be used as the output directory
                for other generated files (`.csv`). Defaults to None.
            delete (bool): If True, deletes non-centroid trace files during the
                           `generate_kernelslist` step. Defaults to False.

        Returns:
            None: Generates output files with the results of the analysis.
        """

        if not self.bypass:
            # Apply PCA to reduce dimensionality
            self.pca_df, _ = self.pca(self.ncu_data)
            print(f"PCA shape: {self.pca_df.shape}")

            # Find optimal number of clusters
            self.kmeans_scan(self.pca_df)
            print(f"Optimal number of clusters: {self.cluster_count}")

            # Select representative kernels
            self.select_centroid()

        else:
            # Copy Kernel ID as Centroid Kernel ID for bypass case
            self.kernel_df["Centroid Kernel ID"] = self.kernel_df["Kernel ID"]
            # Copy Kernel ID as Cluster ID for bypass case
            self.kernel_df["Cluster ID"] = self.kernel_df["Kernel ID"]
            # Create cluster_df with all kernels as clusters
            self.cluster_df = pd.DataFrame({
                "Cluster ID": self.kernel_df["Kernel ID"],
                "Kernel Count": 1,
                "Centroid Kernel ID": self.kernel_df["Kernel ID"],
                "Centroid Kernel Name": self.kernel_df["Kernel Name"]
            })

        # Adjust for 1-based indexing used by trace files
        self.kernel_df["Kernel ID"] = \
            self.kernel_df["Kernel ID"].astype(int) + 1
        self.kernel_df["Centroid Kernel ID"] = \
            self.kernel_df["Centroid Kernel ID"].astype(int) + 1
        self.cluster_df["Centroid Kernel ID"] = \
            self.cluster_df["Centroid Kernel ID"].astype(int) + 1

        # Generate output files
        if output_file:
            output_dir = os.path.dirname(output_file)
            if output_dir and not os.path.exists(output_dir):
                os.makedirs(output_dir)
            self.output_dir = output_dir if output_dir else self.output_dir

        self.generate_kernelslist(delete=delete)
        self.generate_kernels_csv()

        # Display results summary
        print(
            f"Generated kernelslist.g with {len(self.cluster_df)} representative kernels")
        print(f"Selected {len(self.cluster_df)} out of {len(self.kernel_df)} kernels "
              f"({len(self.cluster_df)/len(self.kernel_df):.1%})")

__init__(ncu_input_file, output_dir=None)

Initialize the Principal Kernel Selector with an NCU report.

Parameters:

Name Type Description Default
ncu_input_file str

Path to the NCU report file to analyze.

required
output_dir str

Directory for saving output files. Should be the directory containing trace files (kernel-1.traceg, etc.). Defaults to the directory of this script.

None

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If the input file contains fewer than 2 kernels (unless bypass is triggered). |
| `FileNotFoundError` | If the `metrics_list.json` file is not found. |
| `ImportError` | If the `ncu_report` module cannot be imported. |

Source code in python-scripts/pks.py
def __init__(self, ncu_input_file: str, output_dir: str = None):
    """Initialize the Principal Kernel Selector with an NCU report.

    Args:
        ncu_input_file (str): Path to the NCU report file to analyze.
        output_dir (str, optional): Directory for saving output files. Should be the directory
            containing trace files (kernel-1.traceg, etc.). Defaults to the
            directory of this script.

    Raises:
        ValueError: If the input file contains fewer than 2 kernels (unless bypass is triggered).
        FileNotFoundError: If the metrics_list.json file is not found.
        ImportError: If the ncu_report module cannot be imported.
    """
    # Class member variables for main-function state
    self.metrics = None
    self.ncu_data = None
    self.pca_df = None

    # Path to output directory for generated files
    # Defaults to path of this script if not provided or invalid
    self.output_dir = output_dir if output_dir and os.path.exists(output_dir) \
        else os.path.dirname(os.path.realpath(__file__))

    # Number of clusters
    self.cluster_count = None

    # DataFrame containing kernel details and cluster assignments
    self.kernel_df = None

    # DataFrame containing cluster details and centroids
    self.cluster_df = None

    # Load metrics from configuration file
    metrics_path = os.path.join(os.path.dirname(
        os.path.realpath(__file__)), "metrics_list.json")
    with open(metrics_path, "r") as metrics_file:
        metrics = json.load(metrics_file)

    # Flatten metrics dictionary into a list
    concatenated_metrics = []
    for key, value in metrics.items():
        concatenated_metrics.extend(value)
    self.metrics = concatenated_metrics

    # Load NCU report data
    ncu_context = ncu_report.load_report(ncu_input_file)
    ncu_range = ncu_context[0]  # Use first range in report

    if ncu_context.num_ranges() > 1:
        print("Warning: Multiple ranges found in the input file. "
              "Using the first range.")

    print(
        f"Loaded {ncu_range.num_actions()} kernels from {ncu_input_file}")

    # With 20 or fewer kernels, bypass PCA/clustering and
    # simulate every kernel directly as its own cluster
    self.bypass = ncu_range.num_actions() <= 20

    # Create initial dataframe with kernel information
    kernel_ids = range(ncu_range.num_actions())
    kernel_names = [action.name() for action in ncu_range]

    data = {
        "Kernel Name": kernel_names,
        "Kernel ID": list(kernel_ids),
    }

    # Add metrics data to dataframe
    for metric in self.metrics:
        data[metric] = [action[metric].value() for action in ncu_range]

    self.ncu_data = pd.DataFrame(data)

    # Initialize kernel_df with basic kernel information
    self.kernel_df = self.ncu_data[["Kernel ID", "Kernel Name"]].copy()

    self.sum_lts_t_sectors_op_write = self.ncu_data["lts__t_sectors_op_write.sum"].sum()
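The metrics-flattening step in the constructor can be illustrated in isolation. A minimal sketch, with hypothetical metric groups standing in for the real `metrics_list.json` contents:

```python
import io
import json

# Hypothetical metric groups; the real names come from metrics_list.json
metrics_file = io.StringIO(json.dumps({
    "memory": ["lts__t_sectors_op_write.sum", "lts__t_sectors_op_read.sum"],
    "compute": ["sm__cycles_active.avg"],
}))
metrics = json.load(metrics_file)

# Flatten the {category: [metric, ...]} dictionary into one list,
# exactly as __init__ does before querying the NCU report
concatenated_metrics = []
for key, value in metrics.items():
    concatenated_metrics.extend(value)

print(concatenated_metrics)
```

The grouping in the JSON file is purely organizational; only the flat list is used when reading metric values from each kernel action.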

generate_kernels_csv()

Generate CSV files containing the kernel and cluster data.

Creates two CSV files in the `output_dir`:

- `kernels.csv`: Contains information about all kernels, including their original ID, name, assigned cluster ID, and the ID of the centroid representing their cluster.
- `clusters.csv`: Contains information about each cluster, including its ID, the number of kernels it contains, and the ID and name of its representative centroid kernel.

Returns:

| Type | Description |
| --- | --- |
| `None` | Writes `kernels.csv` and `clusters.csv` to the output directory. |

Source code in python-scripts/pks.py
def generate_kernels_csv(self) -> None:
    """Generate CSV files containing the kernel and cluster data.

    Creates two CSV files in the `output_dir`:
    - `kernels.csv`: Contains information about all kernels, including their
      original ID, name, assigned cluster ID, and the ID of the centroid
      representing their cluster.
    - `clusters.csv`: Contains information about each cluster, including its ID,
      the number of kernels it contains, and the ID and name of its
      representative centroid kernel.

    Returns:
        None: Writes `kernels.csv` and `clusters.csv` to the output directory.
    """
    # Export kernel_df to CSV
    self.kernel_df.to_csv(os.path.join(
        self.output_dir, "kernels.csv"), index=False)

    # Export cluster_df for reference (without PCA center coordinates)
    cluster_cols = ["Cluster ID", "Kernel Count",
                    "Centroid Kernel ID", "Centroid Kernel Name"]
    cluster_cols_available = [
        col for col in cluster_cols if col in self.cluster_df.columns]
    cluster_df_to_export = self.cluster_df[cluster_cols_available].copy()

    cluster_df_to_export.to_csv(os.path.join(
        self.output_dir, "clusters.csv"), index=False)

generate_kernelslist(delete=False)

Generate a list file containing the selected representative kernels.

Creates a kernelslist.g file in the output_dir. This file lists the trace file names (e.g., kernel-1.traceg, kernel-5.traceg) corresponding to the selected centroid kernels. Optionally, deletes trace files in the output_dir that do not correspond to the selected centroids.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `delete` | `bool` | If True, deletes trace files (`*.traceg`) in the `output_dir` that are not selected as centroids. | `False` |

Returns:

| Type | Description |
| --- | --- |
| `None` | Writes `kernelslist.g` to the output directory and potentially deletes files. |

Source code in python-scripts/pks.py
def generate_kernelslist(self, delete: bool = False) -> None:
    """Generate a list file containing the selected representative kernels.

    Creates a `kernelslist.g` file in the `output_dir`. This file lists the
    trace file names (e.g., `kernel-1.traceg`, `kernel-5.traceg`) corresponding
    to the selected centroid kernels. Optionally, deletes trace files in the
    `output_dir` that do not correspond to the selected centroids.

    Args:
        delete (bool): If True, deletes trace files (`*.traceg`) in the `output_dir`
                       that are not selected as centroids. Defaults to False.

    Returns:
        None: Writes `kernelslist.g` to the output directory and potentially deletes files.
    """
    # Get unique centroid kernel IDs (already 1-indexed from run_pks)
    centroid_ids = self.cluster_df["Centroid Kernel ID"].unique()
    centroid_ids = sorted([cid for cid in centroid_ids if cid >= 0])

    # Write kernelslist.g file
    with open(os.path.join(self.output_dir, "kernelslist.g"), "w") as f:
        for index in centroid_ids:
            f.write(f"kernel-{index}.traceg\n")

    # Optionally delete non-centroid trace files
    if delete:
        # iterate over all files in the output directory
        for filename in os.listdir(self.output_dir):
            # check if the file is a trace file and not in the centroid list
            if filename.startswith("kernel-") and (filename.endswith(".traceg") or filename.endswith(".traceg.xz")):
                kernel_id = int(filename.split("-")[1].split(".")[0])
                if kernel_id not in centroid_ids:
                    os.remove(os.path.join(self.output_dir, filename))
                    print(f"Deleted {filename}")
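The deletion path parses the kernel ID out of each trace file name before deciding whether to keep it. A minimal sketch of that parsing, using made-up file names and centroid IDs:

```python
# Hypothetical 1-based centroid kernel IDs kept by the selection
centroid_ids = [1, 5]

def trace_kernel_id(filename: str) -> int:
    # "kernel-12.traceg" and "kernel-12.traceg.xz" both map to 12
    return int(filename.split("-")[1].split(".")[0])

files = ["kernel-1.traceg", "kernel-2.traceg.xz", "kernel-5.traceg"]
to_delete = [f for f in files if trace_kernel_id(f) not in centroid_ids]
print(to_delete)
```

The real method additionally checks the `kernel-` prefix and `.traceg`/`.traceg.xz` suffixes so that `kernelslist.g` itself and unrelated files are never touched.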

kmeans(data, n_clusters=3)

Perform K-means clustering with a specified number of clusters.

Calculates cluster assignments, centroids, and a custom score based on the difference between the sum of 'lts__t_sectors_op_write.sum' for centroid-representative kernels (weighted by cluster size) and the total sum for all kernels.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data` | `DataFrame` | Data to cluster (typically PCA-transformed). | *required* |
| `n_clusters` | `int` | Number of clusters to form. | `3` |

Returns:

| Type | Description |
| --- | --- |
| `Tuple[ndarray, ndarray, float]` | A tuple containing the cluster labels for each data point, the cluster centers (centroids), and the custom score representing the relative error in `lts__t_sectors_op_write.sum`. |

Source code in python-scripts/pks.py
def kmeans(self, data: pd.DataFrame, n_clusters: int = 3) -> Tuple[np.ndarray, np.ndarray, float]:
    """Perform K-means clustering with a specified number of clusters.

    Calculates cluster assignments, centroids, and a custom score based on the
    difference between the sum of 'lts__t_sectors_op_write.sum' for centroid-representative
    kernels (weighted by cluster size) and the total sum for all kernels.

    Args:
        data (pd.DataFrame): Data to cluster (typically PCA-transformed).
        n_clusters (int): Number of clusters to form. Defaults to 3.

    Returns:
        Tuple[np.ndarray, np.ndarray, float]: A tuple containing:
            - np.ndarray: Array of cluster labels for each data point.
            - np.ndarray: Array of cluster centers (centroids).
            - float: Custom score representing the relative error in 'lts__t_sectors_op_write.sum'.
    """
    # Apply K-means clustering
    kmeans_result = KMeans(
        n_clusters=n_clusters,
        random_state=12,  # For reproducibility
        n_init="auto"  # Let scikit-learn choose the number of initializations
    ).fit(data)

    labels = kmeans_result.labels_
    centers = kmeans_result.cluster_centers_

    cluster_lts_t_sectors_op_write = 0

    for i, center in enumerate(centers):
        closest_cluster = i
        # get number of kernels in the cluster
        num_kernels = np.sum(labels == closest_cluster)
        # get the indices of the points in the closest cluster
        indices = np.where(labels == closest_cluster)[0]
        min_distance = np.inf
        closest_index = None
        for index in indices:
            # get the principal component values for the point
            point = self.pca_df.iloc[index].values
            # calculate the distance from the point to the cluster center
            distance = np.linalg.norm(point - center)
            if distance < min_distance:
                min_distance = distance
                closest_index = index
        # get the original data for the closest point
        original_data = self.ncu_data.iloc[closest_index]
        # get lts__t_sectors_op_write.sum
        lts_t_sectors_op_write = original_data["lts__t_sectors_op_write.sum"]
        cluster_lts_t_sectors_op_write += lts_t_sectors_op_write * num_kernels

    # Calculate custom score based on write count difference
    score = np.abs(cluster_lts_t_sectors_op_write -
                   self.sum_lts_t_sectors_op_write) / self.sum_lts_t_sectors_op_write

    # Return labels, centers, and the custom score
    return labels, centers, score
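The custom score above can be computed without running K-means at all, given labels and a representative kernel per cluster. A minimal sketch with synthetic write counts (all values hypothetical):

```python
import numpy as np

# Hypothetical per-kernel lts__t_sectors_op_write.sum values
writes = np.array([100.0, 110.0, 500.0, 520.0])
# Cluster assignment per kernel
labels = np.array([0, 0, 1, 1])
# Index of the representative kernel for each cluster
rep_index = {0: 0, 1: 2}

# Representative write count, weighted by cluster size, vs. the true total
total = writes.sum()
estimated = sum(writes[rep_index[c]] * np.sum(labels == c) for c in rep_index)
score = abs(estimated - total) / total
print(score)
```

A low score means the representatives, each weighted by its cluster's kernel count, reproduce the program's total L2 write traffic well, which is the property the scan optimizes for.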

kmeans_scan(data, lower_bound=2, upper_bound=20)

Find optimal number of clusters by scanning a range of values.

Tries K-means with different numbers of clusters (from lower_bound to upper_bound). Selects the number of clusters corresponding to the lowest custom score calculated by kmeans. If multiple cluster counts yield scores within 5% of the minimum, the one with the fewest clusters is chosen. Updates class attributes cluster_count, kernel_df, and cluster_df with the results of the best clustering.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data` | `DataFrame` | Data to cluster (typically PCA-transformed). | *required* |
| `lower_bound` | `int` | Minimum number of clusters to try. | `2` |
| `upper_bound` | `int` | Maximum number of clusters to try. | `20` |

Returns:

| Type | Description |
| --- | --- |
| `None` | Updates class attributes with clustering results. |

Source code in python-scripts/pks.py
def kmeans_scan(self, data: pd.DataFrame, lower_bound: int = 2, upper_bound: int = 20) -> None:
    """Find optimal number of clusters by scanning a range of values.

    Tries K-means with different numbers of clusters (from lower_bound to upper_bound).
    Selects the number of clusters corresponding to the lowest custom score calculated by `kmeans`.
    If multiple cluster counts yield scores within 5% of the minimum, the one with the
    fewest clusters is chosen. Updates class attributes `cluster_count`, `kernel_df`,
    and `cluster_df` with the results of the best clustering.

    Args:
        data (pd.DataFrame): Data to cluster (typically PCA-transformed).
        lower_bound (int): Minimum number of clusters to try. Defaults to 2.
        upper_bound (int): Maximum number of clusters to try. Defaults to 20.

    Returns:
        None: Updates class attributes with clustering results.
    """
    scores = []
    centers_list = []
    kmeans_labels_list = []

    # Try different numbers of clusters
    for i in range(lower_bound, upper_bound + 1):
        labels, centers, score = self.kmeans(data, i)
        print(f"Number of clusters: {i}, Write count error: {score}")
        scores.append(score)
        centers_list.append(centers)
        kmeans_labels_list.append(labels)

    # Find minimum score
    min_score = min(scores)

    # Use the first clustering within 5% of minimum score
    for i, score in enumerate(scores):
        if score <= 1.05 * min_score:
            self.cluster_count = i + lower_bound

            # Update kernel_df with cluster assignments
            self.kernel_df["Cluster ID"] = kmeans_labels_list[i]

            # Create cluster_df with centers and counts
            centers = centers_list[i]
            cluster_ids = range(self.cluster_count)
            counts = np.bincount(kmeans_labels_list[i])

            # Create dataframe with cluster details
            self.cluster_df = pd.DataFrame({
                "Cluster ID": cluster_ids,
                "Kernel Count": counts
            })

            # Ensure integer data types
            self.cluster_df["Cluster ID"] = pd.to_numeric(
                self.cluster_df["Cluster ID"], errors='coerce').fillna(-1).astype(int)
            self.cluster_df["Kernel Count"] = pd.to_numeric(
                self.cluster_df["Kernel Count"], errors='coerce').fillna(-1).astype(int)

            # Add center coordinates as separate columns
            for j in range(centers.shape[1]):
                self.cluster_df[f"Center_PC{j}"] = [
                    centers[k, j] for k in range(self.cluster_count)]
            break
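The within-5% selection rule favors fewer clusters over a marginally better score. A minimal sketch with hypothetical scores for k = 2..6:

```python
lower_bound = 2
# Hypothetical write-count error scores for k = 2, 3, 4, 5, 6
scores = [0.30, 0.12, 0.093, 0.09, 0.091]

# Pick the smallest k whose score is within 5% of the best score
min_score = min(scores)
for i, score in enumerate(scores):
    if score <= 1.05 * min_score:
        cluster_count = i + lower_bound
        break

print(cluster_count)
```

Here the minimum score occurs at k = 5, but k = 4 is already within 5% of it, so the scan settles on 4 clusters and thus fewer kernels to simulate.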

pca(data, var_threshold=0.95)

Perform Principal Component Analysis on kernel metrics.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data` | `DataFrame` | DataFrame containing kernel metrics. | *required* |
| `var_threshold` | `float` | Variance threshold for PCA dimensionality reduction. | `0.95` |

Returns:

| Type | Description |
| --- | --- |
| `Tuple[DataFrame, ndarray]` | A tuple containing a DataFrame with the transformed data (columns named `PC0`, `PC1`, etc.) and the raw NumPy array of the transformed data. |

Source code in python-scripts/pks.py
def pca(self, data: pd.DataFrame, var_threshold: float = 0.95) -> Tuple[pd.DataFrame, np.ndarray]:
    """Perform Principal Component Analysis on kernel metrics.

    Args:
        data (pd.DataFrame): DataFrame containing kernel metrics.
        var_threshold (float): Variance threshold for PCA dimensionality reduction. Defaults to 0.95.

    Returns:
        Tuple[pd.DataFrame, np.ndarray]: A tuple containing:
            - pd.DataFrame: DataFrame with transformed data, columns named 'PC0', 'PC1', etc.
            - np.ndarray: Raw numpy array of the transformed data.
    """
    # Create a copy to avoid modifying original data
    data_copy = data.copy()

    # Remove non-metric columns
    data_copy.drop(columns=["Kernel Name", "Kernel ID"], inplace=True)

    # Standardize the data
    data_copy = StandardScaler().fit_transform(data_copy)

    # Apply PCA
    pca_model = PCA(n_components=var_threshold)
    transformed_data = pca_model.fit_transform(data_copy)

    # Create a dataframe with the transformed data
    transformed_df = pd.DataFrame(
        transformed_data,
        columns=[f"PC{i}" for i in range(pca_model.n_components_)]
    )

    return transformed_df, transformed_data
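Passing a float in (0, 1) as `n_components` makes scikit-learn keep the fewest components whose cumulative explained variance exceeds that threshold. A minimal sketch with random data standing in for the NCU metrics:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Synthetic "metrics" matrix: 50 kernels x 10 metrics, with one
# redundant column so PCA has something to discard
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
X[:, 1] = X[:, 0] * 2.0

# Standardize, then reduce to the components explaining >= 95% variance
X_std = StandardScaler().fit_transform(X)
pca_model = PCA(n_components=0.95)
transformed = pca_model.fit_transform(X_std)

print(transformed.shape, pca_model.explained_variance_ratio_.sum())
```

Standardizing first matters here: the raw NCU metrics span wildly different scales (cycle counts vs. ratios), and without `StandardScaler` the largest-magnitude metrics would dominate the principal components.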

run_pks(output_file=None, delete=False)

Execute the complete Principal Kernel Selection workflow.

Performs PCA on the kernel metrics, finds the optimal number of clusters using K-means scanning, selects representative centroid kernels for each cluster, and generates output files (kernelslist.g, kernels.csv, clusters.csv) summarizing the results. If the number of kernels is small (<= 20), it bypasses the PCA and clustering steps, treating each kernel as its own cluster.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `output_file` | `str` | Path to the output `kernelslist.g` file. If None, defaults to `kernelslist.g` within the `output_dir`. The directory of this file will be used as the output directory for other generated files (`.csv`). | `None` |
| `delete` | `bool` | If True, deletes non-centroid trace files during the `generate_kernelslist` step. | `False` |

Returns:

| Type | Description |
| --- | --- |
| `None` | Generates output files with the results of the analysis. |

Source code in python-scripts/pks.py
def run_pks(self, output_file: str = None, delete: bool = False) -> None:
    """Execute the complete Principal Kernel Selection workflow.

    Performs PCA on the kernel metrics, finds the optimal number of clusters
    using K-means scanning, selects representative centroid kernels for each cluster,
    and generates output files (`kernelslist.g`, `kernels.csv`, `clusters.csv`)
    summarizing the results. If the number of kernels is small (<= 20),
    it bypasses the PCA and clustering steps, treating each kernel as its own cluster.

    Args:
        output_file (str, optional): Path to the output `kernelslist.g` file.
            If None, defaults to `kernelslist.g` within the `output_dir`.
            The directory of this file will be used as the output directory
            for other generated files (`.csv`). Defaults to None.
        delete (bool): If True, deletes non-centroid trace files during the
                       `generate_kernelslist` step. Defaults to False.

    Returns:
        None: Generates output files with the results of the analysis.
    """

    if not self.bypass:
        # Apply PCA to reduce dimensionality
        self.pca_df, _ = self.pca(self.ncu_data)
        print(f"PCA shape: {self.pca_df.shape}")

        # Find optimal number of clusters
        self.kmeans_scan(self.pca_df)
        print(f"Optimal number of clusters: {self.cluster_count}")

        # Select representative kernels
        self.select_centroid()

    else:
        # Copy Kernel ID as Centroid Kernel ID for bypass case
        self.kernel_df["Centroid Kernel ID"] = self.kernel_df["Kernel ID"]
        # Copy Kernel ID as Cluster ID for bypass case
        self.kernel_df["Cluster ID"] = self.kernel_df["Kernel ID"]
        # Create cluster_df with all kernels as clusters
        self.cluster_df = pd.DataFrame({
            "Cluster ID": self.kernel_df["Kernel ID"],
            "Kernel Count": 1,
            "Centroid Kernel ID": self.kernel_df["Kernel ID"],
            "Centroid Kernel Name": self.kernel_df["Kernel Name"]
        })

    # Adjust for 1-based indexing used by trace files
    self.kernel_df["Kernel ID"] = \
        self.kernel_df["Kernel ID"].astype(int) + 1
    self.kernel_df["Centroid Kernel ID"] = \
        self.kernel_df["Centroid Kernel ID"].astype(int) + 1
    self.cluster_df["Centroid Kernel ID"] = \
        self.cluster_df["Centroid Kernel ID"].astype(int) + 1

    # Generate output files
    if output_file:
        output_dir = os.path.dirname(output_file)
        if output_dir and not os.path.exists(output_dir):
            os.makedirs(output_dir)
        self.output_dir = output_dir if output_dir else self.output_dir

    self.generate_kernelslist(delete=delete)
    self.generate_kernels_csv()

    # Display results summary
    print(
        f"Generated kernelslist.g with {len(self.cluster_df)} representative kernels")
    print(f"Selected {len(self.cluster_df)} out of {len(self.kernel_df)} kernels "
          f"({len(self.cluster_df)/len(self.kernel_df):.1%})")
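The core of centroid selection is a nearest-point search in PCA space: a K-means center is generally not an actual kernel, so the closest member of the cluster stands in for it. A minimal sketch with synthetic coordinates:

```python
import numpy as np

# Hypothetical PCA coordinates of the kernels in one cluster
points = np.array([[0.0, 0.0], [1.0, 1.0], [0.2, 0.1]])
# Hypothetical K-means center for that cluster
center = np.array([0.1, 0.1])

# Euclidean distance from every member to the center; the nearest
# member becomes the cluster's representative kernel
distances = np.linalg.norm(points - center, axis=1)
closest = int(np.argmin(distances))
print(closest)
```

Only this representative kernel is simulated; its statistics are then scaled by the cluster's kernel count when estimating whole-program behavior.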

select_centroid()

Select representative kernels by finding points closest to cluster centers.

For each cluster identified by K-means, this method finds the kernel (data point) in the PCA space that is closest to the cluster's center (centroid) using Euclidean distance. It marks this kernel as the representative for that cluster. Updates both kernel_df and cluster_df with centroid information (Kernel ID and Name).

Returns:

| Type | Description |
| --- | --- |
| None | Updates `kernel_df` and `cluster_df` with centroid information. |

Source code in python-scripts/pks.py
def select_centroid(self) -> None:
    """Select representative kernels by finding points closest to cluster centers.

    For each cluster identified by K-means, this method finds the kernel
    (data point) in the PCA space that is closest to the cluster's center
    (centroid) using Euclidean distance. It marks this kernel as the
    representative for that cluster. Updates both `kernel_df` and `cluster_df`
    with centroid information (Kernel ID and Name).

    Returns:
        None: Updates `kernel_df` and `cluster_df` with centroid information.
    """
    # For each cluster, find the kernel closest to the center
    centroid_kernel_ids = []

    for cluster_id in range(self.cluster_count):
        # Get indices of kernels in this cluster
        cluster_mask = self.kernel_df["Cluster ID"] == cluster_id
        cluster_kernel_indices = self.kernel_df.index[cluster_mask].tolist()

        # Handle empty clusters
        if not cluster_kernel_indices:
            print(f"Warning: No kernels found in cluster {cluster_id}")
            centroid_kernel_ids.append(None)
            continue

        # Get PCA coordinates for kernels in this cluster
        cluster_data = self.pca_df.iloc[cluster_kernel_indices]

        # Get center coordinates for this cluster
        center_coords = []
        for j in range(self.pca_df.shape[1]):
            center_coords.append(
                self.cluster_df.loc[cluster_id, f"Center_PC{j}"])

        # Calculate Euclidean distances to center
        distances = np.linalg.norm(
            cluster_data.values - center_coords, axis=1)

        # Find the kernel closest to the center
        closest_idx = cluster_kernel_indices[np.argmin(distances)]
        kernel_id = int(self.kernel_df.loc[closest_idx, "Kernel ID"])
        centroid_kernel_ids.append(kernel_id)

        # Add centroid information to cluster_df
        self.cluster_df.at[cluster_id, "Centroid Kernel ID"] = kernel_id
        self.cluster_df.at[cluster_id, "Centroid Kernel Name"] = \
            self.kernel_df.loc[closest_idx, "Kernel Name"]

    # Update kernel_df with centroid assignments
    for idx, row in self.kernel_df.iterrows():
        cluster_id = row["Cluster ID"]
        centroid_id = self.cluster_df.loc[cluster_id, "Centroid Kernel ID"]
        self.kernel_df.at[idx, "Centroid Kernel ID"] = centroid_id

    # Ensure integer data types throughout
    self.kernel_df["Cluster ID"] = self.kernel_df["Cluster ID"].astype(int)
    self.kernel_df["Centroid Kernel ID"] = pd.to_numeric(
        self.kernel_df["Centroid Kernel ID"], errors='coerce').fillna(-1).astype(int)
    self.kernel_df["Kernel ID"] = self.kernel_df["Kernel ID"].astype(int)
    self.cluster_df["Centroid Kernel ID"] = pd.to_numeric(
        self.cluster_df["Centroid Kernel ID"], errors='coerce').fillna(-1).astype(int)
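The nearest-to-center selection at the heart of `select_centroid` can be sketched in isolation. The coordinates below are hypothetical stand-ins for the PCA space and a K-means cluster center:

```python
import numpy as np

# Hypothetical PCA coordinates for three kernels in one cluster (rows = kernels).
cluster_data = np.array([
    [0.0, 0.0],
    [1.0, 1.0],
    [0.2, 0.1],
])
# Hypothetical K-means center for this cluster.
center_coords = np.array([0.25, 0.15])

# Euclidean distance from each kernel to the cluster center,
# mirroring the np.linalg.norm call in select_centroid().
distances = np.linalg.norm(cluster_data - center_coords, axis=1)

# The representative ("centroid") kernel is the one closest to the center.
closest = int(np.argmin(distances))
print(closest)  # 2
```

Here the third kernel (index 2) is chosen because it lies closest to the center; in the real method, its `Kernel ID` would then be recorded in both `kernel_df` and `cluster_df`.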

Sampling Utilities: NVIDIA Nsight Compute Coarse-Grained Profiling

The two scripts nsight_nvbit.py and ncu_exec.py perform coarse-grained profiling of GPU kernels using NVIDIA Nsight Compute. The profiling results drive the sampling process, which selects the most representative kernels for simulation.

Class to run Nsight Compute and NVBit on a given program with the specified arguments.

Provides methods to execute NVIDIA Nsight Compute (ncu) and a custom NVBit tool for profiling GPU applications. It handles environment setup, command construction, and log file management.

The ability to run this script as a standalone program is deprecated. Please use ncu_exec.py instead.

NsightNVBitRunner

A runner class to execute Nsight Compute and NVBit profiling tools.

Manages the configuration and execution of Nsight Compute (ncu) and a custom NVBit tool (ncu-nvbit.so) on a specified target program. It sets up necessary environment variables, constructs command lines, runs the tools, and manages output log files.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `program` | str | Path to the executable program or interpreter. |
| `program_args` | List[str] | Arguments for the target program. |
| `log_file_name` | str | Base name for generated log files (e.g., program_timestamp). |
| `log_file_path` | str | Directory path where log files are stored. |
| `mangled` | bool | Flag indicating whether to use mangled kernel names (primarily for NVBit). |

Source code in python-scripts/nsight_nvbit.py
class NsightNVBitRunner:
    """
    A runner class to execute Nsight Compute and NVBit profiling tools.

    Manages the configuration and execution of Nsight Compute (ncu) and a custom
    NVBit tool (`ncu-nvbit.so`) on a specified target program. It sets up
    necessary environment variables, constructs command lines, runs the tools,
    and manages output log files.

    Attributes:
        program (str): Path to the executable program or interpreter.
        program_args (List[str]): Arguments for the target program.
        log_file_name (str): Base name for generated log files (e.g., program_timestamp).
        log_file_path (str): Directory path where log files are stored.
        mangled (bool): Flag indicating whether to use mangled kernel names (primarily for NVBit).
    """
    def __init__(self):
        """Initialize the NsightNVBitRunner with default None values."""
        # Initialize member variables that will hold the state from main.
        self.program = None
        self.program_args = None
        self.log_file_name = None
        self.log_file_path = None
        self.mangled = True

    def init_from_params(self, program_name, program_args, log_file_name, log_file_path, mangled=True):
        """
        Initialize the runner with explicitly provided parameters.

        Args:
            program_name (str): The program executable or interpreter name/path.
            program_args (List[str]): List of arguments for the program.
            log_file_name (str): Base name for log files.
            log_file_path (str): Directory path for log files.
            mangled (bool): Whether to use mangled kernel names. Defaults to True.

        Returns:
            None
        """
        # Initialize the class with the provided arguments.
        self.program = program_name
        self.program_args = program_args
        self.log_file_name = log_file_name
        self.log_file_path = log_file_path
        self.mangled = mangled 

    def init_from_program(self, program_name, program_args, mangled=True):
        """
        Initialize the runner based on program name and arguments.

        Determines the actual executable (handling Python scripts), generates
        log file names and paths based on the program name and timestamp, and
        prints a configuration summary.

        Args:
            program_name (str): The program executable or script name/path.
            program_args (List[str]): List of arguments for the program/script.
            mangled (bool): Whether to use mangled kernel names. Defaults to True.

        Returns:
            None
        """
        self.program = os.path.realpath(program_name)
        output_program_name = os.path.basename(program_name)
        self.program_args = program_args
        self.mangled = mangled
        if program_name == "python" or program_name.startswith("python3") or program_name == "torchrun":
            self.program = sys.executable
            output_program_name = os.path.basename(
                self.program_args[0]).replace(".py", "")
        if self.program.endswith(".py"):
            output_program_name = os.path.basename(
                self.program).replace(".py", "")
            self.program_args = [self.program] + self.program_args
            self.program = sys.executable

        # Create a log file to store the profiling results
        basename = os.path.basename(output_program_name)
        timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        self.log_file_name = f"{basename}_{timestamp}"
        self.log_file_path = os.path.join(
            os.getenv('PROJECT_ROOT', '.'), 'logs', basename)
        os.makedirs(self.log_file_path, exist_ok=True)

        # Print configuration summary
        print("*" * 27 + " Profiling Configuration " + "*" * 27)
        print(
            f"Running program: {self.program} with arguments: {' '.join(self.program_args)}")
        print(f"Log file name: {self.log_file_name}")
        print(f"Log file path: {self.log_file_path}")
        print("*" * 80 + "\n")

    def get_metrics(self):
        """
        Load and return the list of Nsight Compute metrics from a JSON file.

        Reads metrics specified in 'metrics_list.json' located in the same
        directory as this script. The JSON file should contain a dictionary where
        values are lists of metric names.

        Returns:
            str: A comma-separated string of all metric names found in the JSON file.

        Raises:
            FileNotFoundError: If 'metrics_list.json' is not found.
            ValueError: If 'metrics_list.json' is empty or contains no metrics.
            json.JSONDecodeError: If 'metrics_list.json' is not valid JSON.
        """
        metrics_path = os.path.join(
            os.path.dirname(__file__), "metrics_list.json")
        with open(metrics_path, 'r') as metrics_file:
            metrics = json.load(metrics_file)
        metrics_str = ""
        for key, value in metrics.items():
            metrics_str += ",".join(value) + ","
        if metrics_str == "":
            raise ValueError(
                "Error: No metrics found in the metrics_list.json file")
        return metrics_str[:-1]

    def run_nvbit(self, dry_run=True):
        """
        Run the custom NVBit tool (`ncu-nvbit.so`) on the target program.

        Sets up the environment (CUDA_INJECTION64_PATH, etc.), constructs the
        command, and executes the program under the NVBit tool. If `dry_run` is False,
        it captures the output to a `.nvbit` log file.

        Args:
            dry_run (bool): If True, print the command without executing it.
                            Defaults to True.

        Returns:
            str: The full path to the generated `.nvbit` output file.

        Raises:
            FileNotFoundError: If the `ncu-nvbit.so` library is not found and cannot be compiled.
            subprocess.CalledProcessError: If the NVBit execution fails (when dry_run=False).
        """
        nvbit_lib = os.path.join(
            os.getenv('PROJECT_ROOT', '.'),
            'backend', 'ncu-nvbit', 'ncu-nvbit.so')
        if not os.path.isfile(nvbit_lib):
            print("Compiling NVBit")
            subprocess.run(['make', '-C',
                            os.path.join(os.getenv('PROJECT_ROOT', '.'), 'backend', 'ncu-nvbit')])
        nvbit_output_file = os.path.join(
            self.log_file_path, f"{self.log_file_name}.nvbit")
        nvbit_env = os.environ.copy()
        nvbit_env_dict = {
            'NOBANNER': '1',
            'MANGLED_NAMES': '1' if self.mangled else '0',
            'CUDA_INJECTION64_PATH': nvbit_lib,
            'PATH': os.getenv('CUDA_INSTALL_PATH', '/usr/local/cuda') + '/bin:' + os.getenv('PATH', '')
        }
        for key, value in nvbit_env_dict.items():
            nvbit_env[key] = value
        nvbit_env_str = ' '.join(
            [f'{k}={v}' for k, v in nvbit_env_dict.items()])
        print(f"Running NVBit with the following command: \
              {nvbit_env_str} {self.program} {' '.join(self.program_args)}")
        if not dry_run:
            with open(nvbit_output_file, 'w') as log_file:
                subprocess.run([self.program] + self.program_args, stdout=log_file,
                               stderr=subprocess.STDOUT, env=nvbit_env)
            print(f"Check {nvbit_output_file} for the NVBit output")
        return nvbit_output_file

    def run_ncu(self, dry_run=True):
        """
        Run NVIDIA Nsight Compute (ncu) on the target program.

        Constructs the `ncu` command line with specified metrics (from `get_metrics`),
        configuration flags (e.g., `--force-overwrite`, `--replay-mode`), and output
        options (`--export`). Sets the `TMPDIR` environment variable. If `dry_run` is False,
        it executes `ncu` and captures its command-line output to a `.exec_ncu.log` file.
        The main report is saved to a `.ncu-rep` file.

        Args:
            dry_run (bool): If True, print the command without executing it.
                            Defaults to True.

        Returns:
            str: The full path to the generated `.ncu-rep` report file.

        Raises:
            FileNotFoundError: If the `ncu` executable is not found in the expected CUDA path.
            subprocess.CalledProcessError: If the `ncu` execution fails (when dry_run=False).
        """
        ncu = os.path.join(os.getenv('CUDA_INSTALL_PATH',
                           '/usr/local/cuda'), 'bin', 'ncu')
        if not os.path.isfile(ncu):
            raise FileNotFoundError(
                f"Error: Nsight Compute CLI not found at {ncu}")
        ncu_output_file = os.path.join(
            self.log_file_path, f"{self.log_file_name}.ncu-rep")
        ncu_log_file = os.path.join(
            self.log_file_path, f"{self.log_file_name}.exec_ncu.log")
        export_args = [
            '--rename-kernels-export', 'yes',
            '--rename-kernels-path', os.path.join(
                self.log_file_path, f"{self.log_file_name}.exec.kernels"),
            '--export', ncu_output_file
        ]
        ncu_args = [
            '--config-file', 'off',
            '--force-overwrite',
            '--clock-control', 'none',
            '--rename-kernels', 'off',
            '--replay-mode', 'kernel',
            '--launch-count', '1000'
        ]
        temp_dir = os.path.join(os.getenv('PROJECT_ROOT', '.'), '.tmp')
        os.makedirs(temp_dir, exist_ok=True)
        ncu_env = os.environ.copy()
        ncu_env['TMPDIR'] = temp_dir
        ncu_env['USER_DEFINED_FOLDERS'] = '1'
        cmd = [ncu] + ncu_args + ['--metrics', self.get_metrics()] + \
            export_args + [self.program] + self.program_args
        print(
            f"Running Nsight Compute with the following command: {' '.join(cmd)}")
        if not dry_run:
            with open(ncu_log_file, 'w') as log_file:
                subprocess.run(cmd, stdout=log_file,
                               stderr=subprocess.STDOUT, env=ncu_env)
            print(f"Check {ncu_output_file} for the Nsight Compute report")
        return ncu_output_file

__init__()

Initialize the NsightNVBitRunner with default None values.

Source code in python-scripts/nsight_nvbit.py
def __init__(self):
    """Initialize the NsightNVBitRunner with default None values."""
    # Initialize member variables that will hold the state from main.
    self.program = None
    self.program_args = None
    self.log_file_name = None
    self.log_file_path = None
    self.mangled = True

get_metrics()

Load and return the list of Nsight Compute metrics from a JSON file.

Reads metrics specified in 'metrics_list.json' located in the same directory as this script. The JSON file should contain a dictionary where values are lists of metric names.

Returns:

| Type | Description |
| --- | --- |
| str | A comma-separated string of all metric names found in the JSON file. |

Raises:

| Type | Description |
| --- | --- |
| FileNotFoundError | If 'metrics_list.json' is not found. |
| ValueError | If 'metrics_list.json' is empty or contains no metrics. |
| json.JSONDecodeError | If 'metrics_list.json' is not valid JSON. |

Source code in python-scripts/nsight_nvbit.py
def get_metrics(self):
    """
    Load and return the list of Nsight Compute metrics from a JSON file.

    Reads metrics specified in 'metrics_list.json' located in the same
    directory as this script. The JSON file should contain a dictionary where
    values are lists of metric names.

    Returns:
        str: A comma-separated string of all metric names found in the JSON file.

    Raises:
        FileNotFoundError: If 'metrics_list.json' is not found.
        ValueError: If 'metrics_list.json' is empty or contains no metrics.
        json.JSONDecodeError: If 'metrics_list.json' is not valid JSON.
    """
    metrics_path = os.path.join(
        os.path.dirname(__file__), "metrics_list.json")
    with open(metrics_path, 'r') as metrics_file:
        metrics = json.load(metrics_file)
    metrics_str = ""
    for key, value in metrics.items():
        metrics_str += ",".join(value) + ","
    if metrics_str == "":
        raise ValueError(
            "Error: No metrics found in the metrics_list.json file")
    return metrics_str[:-1]
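The expected shape of `metrics_list.json` and the flattening that `get_metrics` performs can be sketched as follows. The metric names here are placeholders, not real Nsight Compute metrics:

```python
import json

# Hypothetical metrics_list.json content: a dict whose values are lists of metric names.
metrics = json.loads('{"l1": ["l1_metric_a", "l1_metric_b"], "l2": ["l2_metric_c"]}')

# Flatten into the single comma-separated string passed to `ncu --metrics`.
metrics_str = ""
for key, value in metrics.items():
    metrics_str += ",".join(value) + ","
if metrics_str == "":
    raise ValueError("No metrics found")
metrics_str = metrics_str[:-1]  # drop the trailing comma
print(metrics_str)  # l1_metric_a,l1_metric_b,l2_metric_c
```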

init_from_params(program_name, program_args, log_file_name, log_file_path, mangled=True)

Initialize the runner with explicitly provided parameters.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `program_name` | str | The program executable or interpreter name/path. | required |
| `program_args` | List[str] | List of arguments for the program. | required |
| `log_file_name` | str | Base name for log files. | required |
| `log_file_path` | str | Directory path for log files. | required |
| `mangled` | bool | Whether to use mangled kernel names. Defaults to True. | True |

Returns:

| Type | Description |
| --- | --- |
| None | |

Source code in python-scripts/nsight_nvbit.py
def init_from_params(self, program_name, program_args, log_file_name, log_file_path, mangled=True):
    """
    Initialize the runner with explicitly provided parameters.

    Args:
        program_name (str): The program executable or interpreter name/path.
        program_args (List[str]): List of arguments for the program.
        log_file_name (str): Base name for log files.
        log_file_path (str): Directory path for log files.
        mangled (bool): Whether to use mangled kernel names. Defaults to True.

    Returns:
        None
    """
    # Initialize the class with the provided arguments.
    self.program = program_name
    self.program_args = program_args
    self.log_file_name = log_file_name
    self.log_file_path = log_file_path
    self.mangled = mangled 

init_from_program(program_name, program_args, mangled=True)

Initialize the runner based on program name and arguments.

Determines the actual executable (handling Python scripts), generates log file names and paths based on the program name and timestamp, and prints a configuration summary.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `program_name` | str | The program executable or script name/path. | required |
| `program_args` | List[str] | List of arguments for the program/script. | required |
| `mangled` | bool | Whether to use mangled kernel names. Defaults to True. | True |

Returns:

| Type | Description |
| --- | --- |
| None | |

Source code in python-scripts/nsight_nvbit.py
def init_from_program(self, program_name, program_args, mangled=True):
    """
    Initialize the runner based on program name and arguments.

    Determines the actual executable (handling Python scripts), generates
    log file names and paths based on the program name and timestamp, and
    prints a configuration summary.

    Args:
        program_name (str): The program executable or script name/path.
        program_args (List[str]): List of arguments for the program/script.
        mangled (bool): Whether to use mangled kernel names. Defaults to True.

    Returns:
        None
    """
    self.program = os.path.realpath(program_name)
    output_program_name = os.path.basename(program_name)
    self.program_args = program_args
    self.mangled = mangled
    if program_name == "python" or program_name.startswith("python3") or program_name == "torchrun":
        self.program = sys.executable
        output_program_name = os.path.basename(
            self.program_args[0]).replace(".py", "")
    if self.program.endswith(".py"):
        output_program_name = os.path.basename(
            self.program).replace(".py", "")
        self.program_args = [self.program] + self.program_args
        self.program = sys.executable

    # Create a log file to store the profiling results
    basename = os.path.basename(output_program_name)
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    self.log_file_name = f"{basename}_{timestamp}"
    self.log_file_path = os.path.join(
        os.getenv('PROJECT_ROOT', '.'), 'logs', basename)
    os.makedirs(self.log_file_path, exist_ok=True)

    # Print configuration summary
    print("*" * 27 + " Profiling Configuration " + "*" * 27)
    print(
        f"Running program: {self.program} with arguments: {' '.join(self.program_args)}")
    print(f"Log file name: {self.log_file_name}")
    print(f"Log file path: {self.log_file_path}")
    print("*" * 80 + "\n")
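The interpreter-handling behavior of `init_from_program` can be illustrated with a self-contained sketch of the same logic. `resolve_program` is a hypothetical stand-in for the method body, and `train.py` a placeholder script:

```python
import os
import sys

def resolve_program(program_name, program_args):
    """Standalone mirror of the interpreter handling in init_from_program()."""
    program = os.path.realpath(program_name)
    output_name = os.path.basename(program_name)
    if program_name == "python" or program_name.startswith("python3") or program_name == "torchrun":
        # A bare interpreter name is replaced with the running interpreter;
        # the first argument (the script) supplies the log-file base name.
        program = sys.executable
        output_name = os.path.basename(program_args[0]).replace(".py", "")
    if program.endswith(".py"):
        # A script invoked directly is re-run through the current interpreter.
        output_name = os.path.basename(program).replace(".py", "")
        program_args = [program] + program_args
        program = sys.executable
    return program, program_args, output_name

program, args, name = resolve_program("python3", ["train.py", "--epochs", "1"])
print(name)  # train
```

So logs for `python3 train.py --epochs 1` land under a directory named after the script (`train`), not the interpreter.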

run_ncu(dry_run=True)

Run NVIDIA Nsight Compute (ncu) on the target program.

Constructs the ncu command line with specified metrics (from get_metrics), configuration flags (e.g., --force-overwrite, --replay-mode), and output options (--export). Sets the TMPDIR environment variable. If dry_run is False, it executes ncu and captures its command-line output to a .exec_ncu.log file. The main report is saved to a .ncu-rep file.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `dry_run` | bool | If True, print the command without executing it. Defaults to True. | True |

Returns:

| Type | Description |
| --- | --- |
| str | The full path to the generated `.ncu-rep` report file. |

Raises:

| Type | Description |
| --- | --- |
| FileNotFoundError | If the `ncu` executable is not found in the expected CUDA path. |
| subprocess.CalledProcessError | If the `ncu` execution fails (when dry_run=False). |

Source code in python-scripts/nsight_nvbit.py
def run_ncu(self, dry_run=True):
    """
    Run NVIDIA Nsight Compute (ncu) on the target program.

    Constructs the `ncu` command line with specified metrics (from `get_metrics`),
    configuration flags (e.g., `--force-overwrite`, `--replay-mode`), and output
    options (`--export`). Sets the `TMPDIR` environment variable. If `dry_run` is False,
    it executes `ncu` and captures its command-line output to a `.exec_ncu.log` file.
    The main report is saved to a `.ncu-rep` file.

    Args:
        dry_run (bool): If True, print the command without executing it.
                        Defaults to True.

    Returns:
        str: The full path to the generated `.ncu-rep` report file.

    Raises:
        FileNotFoundError: If the `ncu` executable is not found in the expected CUDA path.
        subprocess.CalledProcessError: If the `ncu` execution fails (when dry_run=False).
    """
    ncu = os.path.join(os.getenv('CUDA_INSTALL_PATH',
                       '/usr/local/cuda'), 'bin', 'ncu')
    if not os.path.isfile(ncu):
        raise FileNotFoundError(
            f"Error: Nsight Compute CLI not found at {ncu}")
    ncu_output_file = os.path.join(
        self.log_file_path, f"{self.log_file_name}.ncu-rep")
    ncu_log_file = os.path.join(
        self.log_file_path, f"{self.log_file_name}.exec_ncu.log")
    export_args = [
        '--rename-kernels-export', 'yes',
        '--rename-kernels-path', os.path.join(
            self.log_file_path, f"{self.log_file_name}.exec.kernels"),
        '--export', ncu_output_file
    ]
    ncu_args = [
        '--config-file', 'off',
        '--force-overwrite',
        '--clock-control', 'none',
        '--rename-kernels', 'off',
        '--replay-mode', 'kernel',
        '--launch-count', '1000'
    ]
    temp_dir = os.path.join(os.getenv('PROJECT_ROOT', '.'), '.tmp')
    os.makedirs(temp_dir, exist_ok=True)
    ncu_env = os.environ.copy()
    ncu_env['TMPDIR'] = temp_dir
    ncu_env['USER_DEFINED_FOLDERS'] = '1'
    cmd = [ncu] + ncu_args + ['--metrics', self.get_metrics()] + \
        export_args + [self.program] + self.program_args
    print(
        f"Running Nsight Compute with the following command: {' '.join(cmd)}")
    if not dry_run:
        with open(ncu_log_file, 'w') as log_file:
            subprocess.run(cmd, stdout=log_file,
                           stderr=subprocess.STDOUT, env=ncu_env)
        print(f"Check {ncu_output_file} for the Nsight Compute report")
    return ncu_output_file

run_nvbit(dry_run=True)

Run the custom NVBit tool (ncu-nvbit.so) on the target program.

Sets up the environment (CUDA_INJECTION64_PATH, etc.), constructs the command, and executes the program under the NVBit tool. If dry_run is False, it captures the output to a .nvbit log file.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `dry_run` | bool | If True, print the command without executing it. Defaults to True. | True |

Returns:

| Type | Description |
| --- | --- |
| str | The full path to the generated `.nvbit` output file. |

Raises:

| Type | Description |
| --- | --- |
| FileNotFoundError | If the `ncu-nvbit.so` library is not found and cannot be compiled. |
| subprocess.CalledProcessError | If the NVBit execution fails (when dry_run=False). |

Source code in python-scripts/nsight_nvbit.py
def run_nvbit(self, dry_run=True):
    """
    Run the custom NVBit tool (`ncu-nvbit.so`) on the target program.

    Sets up the environment (CUDA_INJECTION64_PATH, etc.), constructs the
    command, and executes the program under the NVBit tool. If `dry_run` is False,
    it captures the output to a `.nvbit` log file.

    Args:
        dry_run (bool): If True, print the command without executing it.
                        Defaults to True.

    Returns:
        str: The full path to the generated `.nvbit` output file.

    Raises:
        FileNotFoundError: If the `ncu-nvbit.so` library is not found and cannot be compiled.
        subprocess.CalledProcessError: If the NVBit execution fails (when dry_run=False).
    """
    nvbit_lib = os.path.join(
        os.getenv('PROJECT_ROOT', '.'),
        'backend', 'ncu-nvbit', 'ncu-nvbit.so')
    if not os.path.isfile(nvbit_lib):
        print("Compiling NVBit")
        subprocess.run(['make', '-C',
                        os.path.join(os.getenv('PROJECT_ROOT', '.'), 'backend', 'ncu-nvbit')])
    nvbit_output_file = os.path.join(
        self.log_file_path, f"{self.log_file_name}.nvbit")
    nvbit_env = os.environ.copy()
    nvbit_env_dict = {
        'NOBANNER': '1',
        'MANGLED_NAMES': '1' if self.mangled else '0',
        'CUDA_INJECTION64_PATH': nvbit_lib,
        'PATH': os.getenv('CUDA_INSTALL_PATH', '/usr/local/cuda') + '/bin:' + os.getenv('PATH', '')
    }
    for key, value in nvbit_env_dict.items():
        nvbit_env[key] = value
    nvbit_env_str = ' '.join(
        [f'{k}={v}' for k, v in nvbit_env_dict.items()])
    print(f"Running NVBit with the following command: \
          {nvbit_env_str} {self.program} {' '.join(self.program_args)}")
    if not dry_run:
        with open(nvbit_output_file, 'w') as log_file:
            subprocess.run([self.program] + self.program_args, stdout=log_file,
                           stderr=subprocess.STDOUT, env=nvbit_env)
        print(f"Check {nvbit_output_file} for the NVBit output")
    return nvbit_output_file
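The environment overrides that `run_nvbit` applies before launching the target can be reproduced standalone; the shared-object path below is a placeholder for the real `ncu-nvbit.so` location:

```python
import os

# Hypothetical sketch of the environment run_nvbit() sets up for the NVBit tool.
nvbit_lib = "/path/to/ncu-nvbit.so"  # placeholder; normally under $PROJECT_ROOT/backend/ncu-nvbit
mangled = True
nvbit_env_dict = {
    "NOBANNER": "1",
    "MANGLED_NAMES": "1" if mangled else "0",
    "CUDA_INJECTION64_PATH": nvbit_lib,
    "PATH": os.getenv("CUDA_INSTALL_PATH", "/usr/local/cuda") + "/bin:" + os.getenv("PATH", ""),
}

# The overrides are layered on top of the current environment, as in the method.
nvbit_env = os.environ.copy()
nvbit_env.update(nvbit_env_dict)

# The printable KEY=VALUE prefix shown before the launched command.
nvbit_env_str = " ".join(f"{k}={v}" for k, v in nvbit_env_dict.items())
print(nvbit_env_str.split(" ")[0])  # NOBANNER=1
```

Setting `CUDA_INJECTION64_PATH` is what causes the CUDA runtime to load the NVBit tool into the target process; the target itself is launched unmodified.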

Execution-based, kernel-level, runtime cache analysis using Nsight Compute and NVBit.

This script profiles a specified program using NVIDIA Nsight Compute CLI and a custom NVBit tool. It runs both tools on the target program, parses their outputs, computes derived cache metrics (like lifetime, frequency, utilization), and generates CSV reports and optional plots.

Usage

python3 ncu_exec.py <program> [args ...] [--dry-run] [--mangled] [--histogram]

Pre-requisites

- The program must exist and be executable.
- The PROJECT_ROOT and CUDA_INSTALL_PATH environment variables must be set.
- The Nsight Compute CLI must be installed and available in CUDA_INSTALL_PATH.
- The NVBit library (ncu-nvbit.so) must be compiled and available in $PROJECT_ROOT/backend/ncu-nvbit.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `program` | str | The program to profile. | required |
| `args` | List[str] | Arguments to pass to the program. | required |
| `--dry-run` | bool | Print commands without running tools. Defaults to False. | required |
| `--mangled` | bool | Use mangled kernel names in output. Defaults to False. | required |
| `--histogram` | bool | Generate plots of computed metrics. Defaults to False. | required |
Output

Log files are saved under $PROJECT_ROOT/logs/<program_name>/ with base name <program_name>_<timestamp>:

- .ncu-rep: Raw Nsight Compute report file.
- .nvbit: Raw NVBit log file.
- .exec_ncu.log: Nsight Compute CLI command output log.
- .exec_cmd.log: Command used to run the script.
- .exec.kernels: Kernel name mapping file (if generated by ncu).
- .exec.csv: Computed metrics for each kernel (CSV format).
- .exec_l1.png: L1 cache metrics plots (optional).
- .exec_l2.png: L2 cache metrics plots (optional).

compute_kernel_metrics(kernel_id, kernel_action, unique_sector_counts)

Compute derived cache metrics for a single kernel based on NCU and NVBit data.

Calculates metrics like cache active time, read/write frequency, lifetime, and utilization for both L1 and L2 caches using raw counter values from the Nsight Compute report (kernel_action) and unique sector counts from NVBit data (unique_sector_counts).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `kernel_id` | int | The ID (index) of the kernel being processed. | required |
| `kernel_action` | IAction | The Nsight Compute action object containing metrics for this kernel. | required |
| `unique_sector_counts` | List[int] | A list containing [l1_unique_sectors, l2_unique_sectors] for this kernel. | required |

Returns:

| Type | Description |
| --- | --- |
| Optional[pd.DataFrame] | A one-row DataFrame containing the computed metrics for the kernel, or None if the kernel execution time or relevant cache access times are zero. Columns include "Kernel ID", "Function Name", "Total Cycles", "Kernel Execution Time", "L1 Active Time", "L1 Read Frequency", "L1 Write Frequency", "L1 Lifetime", "L1 Utilization", "L2 Active Time", "L2 Read Frequency", "L2 Write Frequency", "L2 Lifetime", "L2 Utilization". Times are in microseconds, frequencies in MHz. |
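The return shape can be illustrated with a hypothetical one-row result; the values and kernel name are placeholders, and only a subset of the columns is shown:

```python
import pandas as pd

# Hypothetical one-row DataFrame in the shape compute_kernel_metrics() returns.
row = pd.DataFrame([{
    "Kernel ID": 0,
    "Function Name": "vecadd_kernel",   # placeholder kernel name
    "Total Cycles": 120000,
    "Kernel Execution Time": 85.0,      # microseconds
    "L1 Active Time": 60.0,             # microseconds
    "L1 Lifetime": 2.4,                 # microseconds
    "L1 Utilization": 0.71,
}])
print(len(row))  # 1
```

The caller concatenates these one-row frames across kernels to build the final `.exec.csv` table.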

Source code in python-scripts/ncu_exec.py
def compute_kernel_metrics(kernel_id, kernel_action, unique_sector_counts):
    """
    Compute derived cache metrics for a single kernel based on NCU and NVBit data.

    Calculates metrics like cache active time, read/write frequency, lifetime, and
    utilization for both L1 and L2 caches using raw counter values from the Nsight
    Compute report (`kernel_action`) and unique sector counts from NVBit data
    (`unique_sector_counts`).

    Args:
        kernel_id (int): The ID (index) of the kernel being processed.
        kernel_action (ncu_report.IAction): The Nsight Compute action object containing metrics for this kernel.
        unique_sector_counts (List[int]): A list containing `[l1_unique_sectors, l2_unique_sectors]` for this kernel.

    Returns:
        Optional[pd.DataFrame]: A one-row DataFrame containing the computed metrics for the kernel,
                                or None if the kernel execution time or relevant cache access times are zero.
                                Columns include "Kernel ID", "Function Name", "Total Cycles",
                                "Kernel Execution Time", "L1 Active Time", "L1 Read Frequency",
                                "L1 Write Frequency", "L1 Lifetime", "L1 Utilization",
                                "L2 Active Time", "L2 Read Frequency", "L2 Write Frequency",
                                "L2 Lifetime", "L2 Utilization". Times are in microseconds,
                                frequencies in MHz.
    """
    # Get kernel execution time
    function_name = kernel_action.name()
    print(f"Processing kernel {kernel_id} ({function_name})...")
    kernel_cyc_avg = kernel_action['sm__cycles_elapsed.avg'].value()
    kernel_execution_time = kernel_cyc_avg * CYCLE_TIME
    if kernel_execution_time == 0:
        return None

    # Get number of unique L1 and L2 sectors
    l1_unique_sector_count = unique_sector_counts[0]
    l2_unique_sector_count = unique_sector_counts[1]

    # Get L1 load and store access metrics
    l1_load_access_global = kernel_action['l1tex__t_requests_pipe_lsu_mem_global_op_ld.avg'].value()
    l1_load_access_local = kernel_action['l1tex__t_requests_pipe_lsu_mem_local_op_ld.avg'].value()
    l1_store_access_global = kernel_action['l1tex__t_requests_pipe_lsu_mem_global_op_st.avg'].value()
    l1_store_access_local = kernel_action['l1tex__t_requests_pipe_lsu_mem_local_op_st.avg'].value()
    l1_store_hit_global = kernel_action['l1tex__t_sectors_pipe_lsu_mem_global_op_st_lookup_hit.avg'].value()
    l1_store_hit_local = kernel_action['l1tex__t_sectors_pipe_lsu_mem_local_op_st_lookup_hit.avg'].value()

    # Get other L1 cache metrics
    l1_fetch = kernel_action['l1tex__m_xbar2l1tex_read_sectors_mem_lg_op_ld.avg'].value()
    l1_read_from_xbar = kernel_action['l1tex__m_xbar2l1tex_read_sectors.avg'].value()
    l1_write_to_xbar = kernel_action['l1tex__m_l1tex2xbar_write_sectors.avg'].value()
    l1_access_cycles = kernel_action['l1tex__cycles_active.avg'].value()
    l1_access_time = l1_access_cycles * CYCLE_TIME
    if l1_access_time == 0:
        print(f"\tWarning: Kernel {kernel_id} has no L1 cache accesses.")
        return None

    # Calculate L1 read and write frequencies
    l1_read_count = l1_load_access_global + l1_load_access_local + l1_write_to_xbar
    l1_read_freq = l1_read_count / l1_access_time
    l1_write_count = l1_store_access_global + \
        l1_store_access_local + l1_read_from_xbar
    l1_write_freq = l1_write_count / l1_access_time
    print(f"\tL1 read count: {l1_read_count}")
    print(
        f"\tL1 read frequency: {1/l1_read_freq:.2f} ns/read or {1e3 * l1_read_freq:.2f} MHz")
    print(f"\tL1 write count: {l1_write_count}")
    print(
        f"\tL1 write frequency: {1/l1_write_freq:.2f} ns/write or {1e3 * l1_write_freq:.2f} MHz")

    # Report L1 store hit rate
    l1_store_access = l1_store_access_global + l1_store_access_local
    l1_store_hit = l1_store_hit_global + l1_store_hit_local
    if l1_store_access == 0:
        print(f"\tWarning: Kernel {kernel_id} has no L1 store accesses.")
        l1_store_hit_rate = 0
    else:
        l1_store_hit_rate = l1_store_hit / l1_store_access
    print(f"\tL1 store hit rate: {l1_store_hit_rate * 100:.2f}%")

    # Calculate L1 lifetime
    l1_sector_count = l1_unique_sector_count if l1_unique_sector_count < L1_SECTOR_COUNT else L1_SECTOR_COUNT
    if l1_store_hit + l1_fetch == 0:
        print(
            f"\tWarning: Kernel {kernel_id} has no L1 store hits or fetches.")
        l1_lifetime = 0
        l1_refreshes = 0
    else:
        l1_lifetime = l1_access_time * \
            l1_sector_count / (l1_store_hit + l1_fetch)
        l1_refreshes = math.floor(l1_lifetime / RETENTION_TIME)
    print(f"\tL1 lifetime: {l1_lifetime:.2f} ns")

    # Get L2 cache metrics
    l2_write_access = kernel_action['lts__t_sectors_op_write.sum'].value()
    l2_write_hit = kernel_action['lts__t_sectors_op_write_lookup_hit.sum'].value()
    l2_read_requests = kernel_action['lts__t_requests_op_read.sum'].value()
    l2_write_requests = kernel_action['lts__t_requests_op_write.sum'].value()
    l2_fetch_device = kernel_action['lts__t_sectors_aperture_device_lookup_miss.sum'].value()
    l2_fetch_sysmem = kernel_action['lts__t_sectors_aperture_sysmem_lookup_miss.sum'].value()
    l2_fetch_peer = kernel_action['lts__t_sectors_aperture_peer_lookup_miss.sum'].value()
    l2_fetch = l2_fetch_device + l2_fetch_sysmem + l2_fetch_peer
    l2_access_cycles = kernel_action['lts__cycles_active.avg'].value()
    l2_access_time = l2_access_cycles * CYCLE_TIME
    if l2_access_time == 0:
        print(f"\tWarning: Kernel {kernel_id} has no L2 cache accesses.")
        return None

    # Calculate L2 read and write frequencies
    l2_read_freq = l2_read_requests / l2_access_time
    l2_write_freq = l2_write_requests / l2_access_time
    print(f"\tL2 read count: {l2_read_requests}")
    print(
        f"\tL2 read frequency: {1/l2_read_freq:.2f} ns/read or {1e3 * l2_read_freq:.2f} MHz")
    print(f"\tL2 write count: {l2_write_requests}")
    print(
        f"\tL2 write frequency: {1/l2_write_freq:.2f} ns/write or {1e3 * l2_write_freq:.2f} MHz")
    # Report L2 write hit rate (guard against kernels with no L2 write accesses)
    if l2_write_access == 0:
        print(f"\tWarning: Kernel {kernel_id} has no L2 write accesses.")
        l2_write_hit_rate = 0
    else:
        l2_write_hit_rate = l2_write_hit / l2_write_access
    print(f"\tL2 write hit rate: {l2_write_hit_rate * 100:.2f}%")
    # Calculate L2 lifetime
    l2_sector_count = l2_unique_sector_count if l2_unique_sector_count < L2_SECTOR_COUNT else L2_SECTOR_COUNT
    if l2_write_hit + l2_fetch == 0:
        print(
            f"\tWarning: Kernel {kernel_id} has no L2 write hits or fetches.")
        l2_lifetime = 0
        l2_refreshes = 0
    else:
        l2_lifetime = l2_access_time * \
            l2_sector_count / (l2_write_hit + l2_fetch)
        l2_refreshes = math.floor(l2_lifetime / RETENTION_TIME)
    print(f"\tL2 lifetime: {l2_lifetime:.2f} ns")
    print()

    # Add results to DataFrame
    return pd.DataFrame([{
        "Kernel ID": kernel_id,
        "Function Name": function_name,
        "Total Cycles": kernel_cyc_avg,
        "Kernel Execution Time": kernel_execution_time,
        "L1 Active Time": l1_access_time / 1e3,  # convert to microseconds
        "L1 Read Frequency": l1_read_freq * 1e3,  # convert to MHz
        "L1 Write Frequency": l1_write_freq * 1e3,  # convert to MHz
        "L1 Lifetime": l1_lifetime / 1e3,  # convert to microseconds
        # "L1 Refreshes": l1_refreshes,
        "L1 Utilization": l1_sector_count / L1_SECTOR_COUNT,
        "L2 Active Time": l2_access_time / 1e3,  # convert to microseconds
        "L2 Read Frequency": l2_read_freq * 1e3,  # convert to MHz
        "L2 Write Frequency": l2_write_freq * 1e3,  # convert to MHz
        "L2 Lifetime": l2_lifetime / 1e3,  # convert to microseconds
        # "L2 Refreshes": l2_refreshes,
        "L2 Utilization": l2_sector_count / L2_SECTOR_COUNT
    }])
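The lifetime computation above reduces to a single formula: average sector residency ≈ cache active time × resident sectors / (write hits + fetches). The sketch below isolates that formula with made-up constants standing in for the module-level `CYCLE_TIME`, `RETENTION_TIME`, `L1_SECTOR_COUNT` (the real values are defined elsewhere in `ncu_exec.py` and derived from the target GPU):

```python
import math

# Hypothetical constants for illustration only; the real values come from
# ncu_exec.py (CYCLE_TIME from the GPU clock, RETENTION_TIME is the assumed
# retention window in ns, L1_SECTOR_COUNT from the cache geometry).
CYCLE_TIME = 1.0 / 1.41             # ns per cycle at an assumed 1410 MHz clock
RETENTION_TIME = 77_000             # 77 us expressed in ns
L1_SECTOR_COUNT = 256 * 1024 // 32  # assumed 256 KiB L1 with 32 B sectors


def cache_lifetime(access_cycles, unique_sectors, sector_capacity,
                   write_hits, fetches):
    """Average residency of a sector, mirroring the formula used above:
    lifetime = active_time * resident_sectors / (write_hits + fetches)."""
    access_time = access_cycles * CYCLE_TIME         # ns the cache was active
    resident = min(unique_sectors, sector_capacity)  # cannot exceed capacity
    if write_hits + fetches == 0:
        return 0.0, 0
    lifetime = access_time * resident / (write_hits + fetches)
    refreshes = math.floor(lifetime / RETENTION_TIME)
    return lifetime, refreshes


lifetime_ns, refreshes = cache_lifetime(
    access_cycles=1_000_000, unique_sectors=4096,
    sector_capacity=L1_SECTOR_COUNT, write_hits=500, fetches=1500)
print(f"lifetime: {lifetime_ns:.1f} ns, refreshes: {refreshes}")
```

The `min(unique_sectors, capacity)` clamp is the same capping that `compute_kernel_metrics` applies before dividing, so a kernel touching more sectors than the cache holds cannot inflate the lifetime.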

get_correspondence_table(results, mangled)

Extract and print a mapping between kernel IDs and kernel names.

Creates a DataFrame containing either the mangled or unmangled kernel names based on the mangled flag, alongside their corresponding kernel IDs. Prints this table to the console.

Parameters:

Name Type Description Default
results DataFrame

DataFrame containing computed metrics and kernel names (Mangled Names, Unmangled Names columns).

required
mangled bool

If True, use 'Mangled Names' column; otherwise, use 'Unmangled Names'.

required

Returns:

Type Description

pd.DataFrame: A DataFrame with two columns: 'Kernel ID' and either 'Mangled Names' or 'Unmangled Names'.

Source code in python-scripts/ncu_exec.py
def get_correspondence_table(results, mangled):
    """
    Extract and print a mapping between kernel IDs and kernel names.

    Creates a DataFrame containing either the mangled or unmangled kernel names
    based on the `mangled` flag, alongside their corresponding kernel IDs. Prints
    this table to the console.

    Args:
        results (pd.DataFrame): DataFrame containing computed metrics and kernel names
                                (Mangled Names, Unmangled Names columns).
        mangled (bool): If True, use 'Mangled Names' column; otherwise, use 'Unmangled Names'.

    Returns:
        pd.DataFrame: A DataFrame with two columns: 'Kernel ID' and either
                      'Mangled Names' or 'Unmangled Names'.
    """
    if mangled:
        correspondence_table = results[['Kernel ID', 'Mangled Names']]
    else:
        correspondence_table = results[['Kernel ID', 'Unmangled Names']]
    pd.set_option('display.max_colwidth', 120)
    pd.set_option('display.max_rows', None)
    pd.set_option('display.max_columns', None)
    print("Correspondence table:")
    print(correspondence_table)
    print()
    return correspondence_table
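For illustration, the column selection above behaves as follows on a small fabricated DataFrame (the kernel names here are invented, not from a real profile):

```python
import pandas as pd

# Fabricated results table with the two name columns the function expects.
results = pd.DataFrame({
    'Kernel ID': [0, 1],
    'Mangled Names': ['_Z9vectorAddPfS_S_i', '_Z6reducePfS_i'],
    'Unmangled Names': ['vectorAdd(float*, float*, float*, int)',
                        'reduce(float*, float*, int)'],
})

# mangled=True branch: keep the ID and the mangled-name column.
table = results[['Kernel ID', 'Mangled Names']]
print(table)
```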

parse_arguments()

Parse command-line arguments for the NCU/NVBit execution script.

Defines and parses arguments for specifying the target program, its arguments, and options like dry run, mangled names, and histogram generation.

Parameters:

Name Type Description Default
None

Uses defined CLI options.

required

Returns:

Type Description

argparse.Namespace: Parsed arguments including:

- program (str): Program to profile.
- args (List[str]): Arguments to the program.
- dry_run (bool): Print commands without running tools.
- mangled (bool): Use mangled kernel names.
- histogram (bool): Generate metric plots.

Source code in python-scripts/ncu_exec.py
def parse_arguments():
    """Parse command-line arguments for the NCU/NVBit execution script.

    Defines and parses arguments for specifying the target program, its arguments,
    and options like dry run, mangled names, and histogram generation.

    Args:
        None: Uses defined CLI options.

    Returns:
        argparse.Namespace: Parsed arguments including:
            program (str): Program to profile.
            args (List[str]): Arguments to the program.
            dry_run (bool): Print commands without running tools.
            mangled (bool): Use mangled kernel names.
            histogram (bool): Generate metric plots.
    """
    parser = argparse.ArgumentParser(
        description="Process files generated by the Nsight Compute/NVBit backend")
    # parser.add_argument(
    #     "input_file", help="Path to the input file", action="store")
    parser.add_argument('program', type=str, help='Program to profile')
    parser.add_argument('args', nargs=argparse.REMAINDER,
                        help='Arguments to the program')
    parser.add_argument('--dry-run', action='store_true',
                        help='Print the command without running it', default=False)
    parser.add_argument("--mangled", action="store_true",
                        help="Use mangled kernel names", default=False)
    parser.add_argument("--histogram", action="store_true",
                        help="Generate histograms of the metrics", default=False)
    return parser.parse_args()
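Because `args` uses `argparse.REMAINDER`, the script's own flags must come before the program name; everything after it, including dash-prefixed tokens, is forwarded untouched to the profiled binary. A self-contained reconstruction of the same parser shows this:

```python
import argparse

# Minimal reconstruction of the parser above (same options, help text omitted)
# to show how REMAINDER forwards everything after the program name.
parser = argparse.ArgumentParser(
    description="Process files generated by the Nsight Compute/NVBit backend")
parser.add_argument('program', type=str)
parser.add_argument('args', nargs=argparse.REMAINDER)
parser.add_argument('--dry-run', action='store_true', default=False)
parser.add_argument('--mangled', action='store_true', default=False)
parser.add_argument('--histogram', action='store_true', default=False)

# './vectorAdd' and '--size 1024' are made-up example values.
ns = parser.parse_args(['--dry-run', './vectorAdd', '--size', '1024'])
print(ns.program, ns.args, ns.dry_run)
```

Note that `--size` is not consumed by the script's parser: it lands in `ns.args` and is passed through to the target program.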

plot_l1_metrics(results, basename)

Create and save plots summarizing L1 cache metrics across kernels.

Generates a PNG image file containing bar plots for L1 Utilization, L1 Lifetime, and L1 Refreshes (calculated based on a fixed retention time).

Parameters:

Name Type Description Default
results DataFrame

DataFrame containing computed metrics for all kernels (output of compute_kernel_metrics).

required
basename str

The base path and filename prefix for the output PNG file (e.g., /path/to/logs/program_timestamp). The suffix .exec_l1.png will be appended.

required

Returns:

Name Type Description
None

Saves the plot to a file.

Source code in python-scripts/ncu_exec.py
def plot_l1_metrics(results, basename):
    """
    Create and save plots summarizing L1 cache metrics across kernels.

    Generates a PNG image file containing bar plots for L1 Utilization, L1 Lifetime,
    and L1 Refreshes (calculated based on a fixed retention time).

    Args:
        results (pd.DataFrame): DataFrame containing computed metrics for all kernels
                                (output of `compute_kernel_metrics`).
        basename (str): The base path and filename prefix for the output PNG file
                        (e.g., `/path/to/logs/program_timestamp`). The suffix `.exec_l1.png`
                        will be appended.

    Returns:
        None: Saves the plot to a file.
    """
    fig, ax = plt.subplots(1, 3, figsize=(20, 9))

    # sns.barplot(y="Kernel ID", x="L1 Active Time", data=results, ax=ax[0, 0], orient='h')
    # ax[0, 0].set_title("L1 Active Time (μs)")

    sns.barplot(y="Kernel ID", x="L1 Utilization",
                data=results, ax=ax[0], orient='h')
    ax[0].set_title("L1 Utilization")

    # sns.barplot(y="Kernel ID", x="L1 Read Frequency", data=results, ax=ax[1, 0], orient='h', color='b')
    # ax[1, 0].set_title("L1 Read Frequency (MHz)")
    # ax[1, 0].set_xlabel("L1 Read Frequency (MHz)", color='b')

    # sns.barplot(y="Kernel ID", x="L1 Write Frequency", data=results, ax=ax[1, 1], orient='h', color='r')
    # ax[1, 1].set_title("L1 Write Frequency (MHz)")
    # ax[1, 1].set_xlabel("L1 Write Frequency (MHz)", color='r')

    sns.barplot(y="Kernel ID", x="L1 Lifetime",
                data=results, ax=ax[1], orient='h')
    ax[1].set_title("L1 Lifetime (μs)")

    # "L1 Refreshes" is commented out in compute_kernel_metrics, so derive it
    # here from the lifetime (in μs) and the 77 μs retention time before plotting.
    if "L1 Refreshes" not in results.columns:
        results = results.assign(
            **{"L1 Refreshes": (results["L1 Lifetime"] // 77).astype(int)})
    sns.barplot(y="Kernel ID", x="L1 Refreshes",
                data=results, ax=ax[2], orient='h')
    ax[2].set_title("L1 Refreshes for 77 μs retention time")

    plt.tight_layout()
    plt.savefig(basename + ".exec_l1.png")
    plt.clf()

plot_l2_metrics(results, basename)

Create and save plots summarizing L2 cache metrics across kernels.

Generates a PNG image file containing bar plots for L2 Utilization and L2 Lifetime. Adds a horizontal line indicating the assumed retention time on the Lifetime plot.

Parameters:

Name Type Description Default
results DataFrame

DataFrame containing computed metrics for all kernels (output of compute_kernel_metrics).

required
basename str

The base path and filename prefix for the output PNG file (e.g., /path/to/logs/program_timestamp). The suffix .exec_l2.png will be appended.

required

Returns:

Name Type Description
None

Saves the plot to a file.

Source code in python-scripts/ncu_exec.py
def plot_l2_metrics(results, basename):
    """
    Create and save plots summarizing L2 cache metrics across kernels.

    Generates a PNG image file containing bar plots for L2 Utilization and L2 Lifetime.
    Adds a horizontal line indicating the assumed retention time on the Lifetime plot.

    Args:
        results (pd.DataFrame): DataFrame containing computed metrics for all kernels
                                (output of `compute_kernel_metrics`).
        basename (str): The base path and filename prefix for the output PNG file
                        (e.g., `/path/to/logs/program_timestamp`). The suffix `.exec_l2.png`
                        will be appended.

    Returns:
        None: Saves the plot to a file.
    """
    font_size = 25
    fig, ax = plt.subplots(2, 1, figsize=(16, 8))

    # sns.barplot(y="Kernel ID", x="L2 Active Time", data=results, ax=ax[0, 0], orient='h')
    # ax[0, 0].set_title("L2 Active Time (μs)")
    sns.set(font_scale=2.2)
    sns.barplot(x="Kernel ID", y="L2 Utilization", data=results, ax=ax[0])
    ax[0].set_title("L2 Utilization")
    ax[0].tick_params(axis='x', labelsize=font_size)
    ax[0].tick_params(axis='y', labelsize=font_size)
    # change the x-axis label to be more readable
    ax[0].set_xlabel("Kernel ID", fontsize=font_size)
    ax[0].set_ylabel("Utilization", fontsize=font_size)

    # sns.barplot(y="Kernel ID", x="L2 Read Frequency", data=results, ax=ax[1, 0], orient='h')
    # ax[1, 0].set_title("L2 Read Frequency (MHz)")

    # sns.barplot(y="Kernel ID", x="L2 Write Frequency", data=results, ax=ax[1, 1], orient='h')
    # ax[1, 1].set_title("L2 Write Frequency (MHz)")

    sns.barplot(x="Kernel ID", y="L2 Lifetime", data=results, ax=ax[1])
    ax[1].set_title("L2 Lifetime (μs)")
    # Add horizontal line for 77 μs retention time with label
    ax[1].axhline(y=77, color='r', linestyle='--',
                  label='77 μs retention time')
    ax[1].legend()
    plt.xticks(fontsize=font_size)
    plt.yticks(fontsize=font_size)
    ax[1].set_xlabel("Kernel ID", fontsize=font_size)
    ax[1].set_ylabel("Lifetime (μs)", fontsize=font_size)

    plt.tight_layout()
    plt.savefig(basename + ".exec_l2.png")
    plt.clf()

read_nvbit_data(nvbit_input_file, kernel_count)

Read and parse the NVBit log file to extract unique sector counts and kernel names.

Parses the .nvbit output file generated by the custom NVBit tool. It extracts the number of unique L1 and L2 cache sectors accessed by each kernel, as well as the mangled and unmangled names for each kernel ID.

Parameters:

Name Type Description Default
nvbit_input_file str

Path to the NVBit log file (.nvbit).

required
kernel_count int

The expected number of kernels (obtained from NCU report).

required

Returns:

Type Description

Tuple[List[List[int]], List[str], List[str]]: A tuple containing:

- List[List[int]]: A list where each inner list contains [l1_unique_sectors, l2_unique_sectors] for a kernel ID.
- List[str]: A list of unmangled kernel names indexed by kernel ID.
- List[str]: A list of mangled kernel names indexed by kernel ID.

Raises:

Type Description
SystemExit

If nvbit_input_file does not exist (the script prints an error and exits with status 1).

AssertionError

If the kernel count reported in the NVBit log does not match kernel_count.

Source code in python-scripts/ncu_exec.py
def read_nvbit_data(nvbit_input_file, kernel_count):
    """
    Read and parse the NVBit log file to extract unique sector counts and kernel names.

    Parses the `.nvbit` output file generated by the custom NVBit tool. It extracts
    the number of unique L1 and L2 cache sectors accessed by each kernel, as well
    as the mangled and unmangled names for each kernel ID.

    Args:
        nvbit_input_file (str): Path to the NVBit log file (`.nvbit`).
        kernel_count (int): The expected number of kernels (obtained from NCU report).

    Returns:
        Tuple[List[List[int]], List[str], List[str]]: A tuple containing:
            - List[List[int]]: A list where each inner list contains `[l1_unique_sectors, l2_unique_sectors]` for a kernel ID.
            - List[str]: A list of unmangled kernel names indexed by kernel ID.
            - List[str]: A list of mangled kernel names indexed by kernel ID.

    Raises:
        SystemExit: If `nvbit_input_file` does not exist (an error is printed and the script exits with status 1).
        AssertionError: If the kernel count reported in the NVBit log does not match `kernel_count`.
    """
    nvbit_lines = None
    try:
        with open(nvbit_input_file, newline='') as input_file:
            nvbit_lines = input_file.readlines()
    except FileNotFoundError:
        print(f"Error: File '{nvbit_input_file}' not found.")
        exit(1)

    unmangled_names = []
    mangled_names = []
    nvbit_results = []

    # initialize the results with empty values
    for i in range(kernel_count):
        # Zero out metrics for L1 and L2 cache
        nvbit_results.append([0, 0])
        unmangled_names.append("")
        mangled_names.append("")

    for line in nvbit_lines:
        # Assert correct number of kernels
        if "MEMTRACE: Size of id_kernel_map" in line:
            # Line format: MEMTRACE: Size of id_kernel_map: 308
            kernel_count_nvbit = int(line.split(": ")[-1])
            assert kernel_count == kernel_count_nvbit, f"Kernel count mismatch: {kernel_count} != {kernel_count_nvbit}"
        # Process L1 and L2 cache metrics
        if "L1" in line:
            # Line format: MEMTRACE: Kernel ID 57 - L1 unique sectors: 453
            kernel_id = int(line.split("Kernel ID ")[1].split(" -")[0])
            nvbit_results[kernel_id][0] = int(
                line.split("L1 unique sectors: ")[1])
        elif "L2" in line:
            # Line format: MEMTRACE: Kernel ID 0 - L2 unique sectors: 542
            kernel_id = int(line.split("Kernel ID ")[1].split(" -")[0])
            nvbit_results[kernel_id][1] = int(
                line.split("L2 unique sectors: ")[1])
        # Process kernel names
        elif "Mangled name" in line:
            kernel_id = int(line.split("Kernel ID ")[1].split(" -")[0])
            mangled_name = line.split("Mangled name: ")[1].split(" -")[0]
            unmangled_name = line.split("Unmangled name: ")[1].strip()
            mangled_names[kernel_id] = mangled_name
            unmangled_names[kernel_id] = unmangled_name
    return nvbit_results, unmangled_names, mangled_names
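To make the parsing logic concrete, here is the same split-based extraction applied to fabricated MEMTRACE lines in the formats quoted in the comments above (the kernel IDs and sector counts are invented):

```python
# Sample lines in the "MEMTRACE: ..." shapes documented in read_nvbit_data.
sample_log = [
    "MEMTRACE: Size of id_kernel_map: 2\n",
    "MEMTRACE: Kernel ID 0 - L1 unique sectors: 453\n",
    "MEMTRACE: Kernel ID 0 - L2 unique sectors: 542\n",
    "MEMTRACE: Kernel ID 1 - L1 unique sectors: 12\n",
    "MEMTRACE: Kernel ID 1 - L2 unique sectors: 30\n",
]

# One [l1_unique_sectors, l2_unique_sectors] slot per kernel.
results = [[0, 0] for _ in range(2)]
for line in sample_log:
    if "L1" in line:
        kid = int(line.split("Kernel ID ")[1].split(" -")[0])
        results[kid][0] = int(line.split("L1 unique sectors: ")[1])
    elif "L2" in line:
        kid = int(line.split("Kernel ID ")[1].split(" -")[0])
        results[kid][1] = int(line.split("L2 unique sectors: ")[1])
print(results)
```

`int()` tolerates the trailing newline on each count, which is why the splits need no explicit `strip()`.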

Parser for Accel-Sim Trace Files

The accel_sim_parser.py script is responsible for parsing the trace files generated by the Accel-Sim simulator. It extracts relevant information about data lifetime, read and write operations, and other performance metrics from the trace files.

Module for parsing Accel-Sim simulation logs and generating cache lifetime statistics.

This module provides classes and functions to process GPGPU-Sim simulation logs (specifically the cache access logs generated when running Accel-Sim) to calculate cache line lifetime metrics. It parses log lines, tracks cache line residency, computes lifetimes, and aggregates statistics per kernel and across the entire run. It outputs results in CSV format.

LifetimeType

Bases: object

Represents the lifetime of a cache line (sector) at a specific address.

Stores the start and end simulation cycles for a cache line's residency.

Attributes:

Name Type Description
address int

The memory address of the cache line.

start int

The simulation cycle when the cache line entered the cache.

end Optional[int]

The simulation cycle when the cache line was evicted or the last cycle it was accessed before the simulation ended. Initially None.

Source code in python-scripts/accel_sim_parser.py
class LifetimeType(object):
    """
    Represents the lifetime of a cache line (sector) at a specific address.

    Stores the start and end simulation cycles for a cache line's residency.

    Attributes:
        address (int): The memory address of the cache line.
        start (int): The simulation cycle when the cache line entered the cache.
        end (Optional[int]): The simulation cycle when the cache line was evicted
                             or the last cycle it was accessed before the simulation ended.
                             Initially None.
    """

    def __init__(self, address, start, end):
        """
        Initialize a LifetimeType object.

        Args:
            address (int): The memory address.
            start (int): The start cycle.
            end (Optional[int]): The end cycle (can be None initially).
        """
        self.address = address
        self.start = start
        self.end = end

    def __dict__(self):
        """
        Return a dictionary representation of the lifetime entry.

        Returns:
            dict: A dictionary with 'address' (hex), 'start', and 'end' keys.
        """
        return {
            # convert address to hex
            "address": hex(int(self.address)),
            "start": self.start,
            "end": self.end
        }

    def calculate_lifetime(self):
        """
        Calculate the duration of the cache line lifetime in cycles.

        Returns:
            int: The difference between the end and start cycles.

        Raises:
            TypeError: If `end` or `start` is None or not a number.
        """
        return self.end - self.start

__dict__()

Return a dictionary representation of the lifetime entry.

Returns:

Name Type Description
dict

A dictionary with 'address' (hex), 'start', and 'end' keys.

Source code in python-scripts/accel_sim_parser.py
def __dict__(self):
    """
    Return a dictionary representation of the lifetime entry.

    Returns:
        dict: A dictionary with 'address' (hex), 'start', and 'end' keys.
    """
    return {
        # convert address to hex
        "address": hex(int(self.address)),
        "start": self.start,
        "end": self.end
    }

__init__(address, start, end)

Initialize a LifetimeType object.

Parameters:

Name Type Description Default
address int

The memory address.

required
start int

The start cycle.

required
end Optional[int]

The end cycle (can be None initially).

required
Source code in python-scripts/accel_sim_parser.py
def __init__(self, address, start, end):
    """
    Initialize a LifetimeType object.

    Args:
        address (int): The memory address.
        start (int): The start cycle.
        end (Optional[int]): The end cycle (can be None initially).
    """
    self.address = address
    self.start = start
    self.end = end

calculate_lifetime()

Calculate the duration of the cache line lifetime in cycles.

Returns:

Name Type Description
int

The difference between the end and start cycles.

Raises:

Type Description
TypeError

If end or start is None or not a number.

Source code in python-scripts/accel_sim_parser.py
def calculate_lifetime(self):
    """
    Calculate the duration of the cache line lifetime in cycles.

    Returns:
        int: The difference between the end and start cycles.

    Raises:
        TypeError: If `end` or `start` is None or not a number.
    """
    return self.end - self.start
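A minimal usage sketch of LifetimeType follows: a line enters the cache at one cycle, is evicted at a later cycle, and the lifetime is converted to nanoseconds. The class body is reproduced from above; the 1 GHz clock (hence CYCLE_TIME = 1.0 ns) and the address are assumptions for illustration, since the real cycle time comes from the parsed GPGPU-Sim config.

```python
class LifetimeType(object):
    """Lifetime of one cache line: start/end cycles of its residency."""

    def __init__(self, address, start, end):
        self.address = address
        self.start = start
        self.end = end

    def calculate_lifetime(self):
        return self.end - self.start


CYCLE_TIME = 1.0  # ns per cycle, assuming a 1 GHz simulated clock

# Line enters the cache at cycle 120; end is unknown until eviction.
lt = LifetimeType(address=0x7f00a040, start=120, end=None)
lt.end = 4620  # eviction observed later in the log
cycles = lt.calculate_lifetime()
print(f"{hex(lt.address)} lived {cycles} cycles = {cycles * CYCLE_TIME} ns")
```

This mirrors how SimulationParser uses the class: `end` stays None while the line is resident and is filled in when an eviction (or the end of the kernel) is seen.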

SimulationParser

Parses Accel-Sim cache log lines for a single kernel to calculate lifetime statistics.

Processes log lines associated with one kernel launch, tracks cache line entries and exits based on load/store operations and cache status (hit/miss), considering the configured cache policies (write-allocate, write-back). It calculates the lifetime for each cache line instance and aggregates statistics.

Attributes:

Name Type Description
GPU_FREQ int

GPU frequency in MHz.

CYCLE_TIME float

GPU cycle time in nanoseconds.

log_path str

Path to the log directory.

log_file str

Base name of the log file.

kernel_name str

Name of the kernel being processed.

kernel_id int

ID of the kernel being processed.

sim_cycles int

Simulation cycles for this kernel.

sim_insn int

Instructions executed by this kernel.

ipc float

Instructions per cycle for this kernel.

total_sim_cycles int

Total simulation cycles up to this kernel.

total_sim_insn int

Total instructions executed up to this kernel.

log_lines list[str]

Log lines from .sim_cache.log for this kernel.

sector_size int

Cache line size in bytes.

l1_size int

L1 cache size in bytes.

l1_cache_lines float

Number of cache lines in L1.

l2_size int

L2 cache size in bytes.

l2_cache_lines float

Number of cache lines in L2.

l1_lifetimes list[LifetimeType]

List of completed L1 lifetime objects.

l2_lifetimes list[LifetimeType]

List of completed L2 lifetime objects.

l1_most_recent_read dict[int, int]

Maps L1 address to the cycle of its most recent read hit.

l2_most_recent_read dict[int, int]

Maps L2 address to the cycle of its most recent read hit.

l1_current_lifetime_index dict[int, int]

Maps L1 address to the index of its currently active lifetime in the internal list.

l2_current_lifetime_index dict[int, int]

Maps L2 address to the index of its currently active lifetime in the internal list.

l1_lifetime_cycles ndarray

Array of completed L1 lifetimes in cycles.

l1_lifetime_ns ndarray

Array of completed L1 lifetimes in nanoseconds.

l2_lifetime_cycles ndarray

Array of completed L2 lifetimes in cycles.

l2_lifetime_ns ndarray

Array of completed L2 lifetimes in nanoseconds.

l1_read_count int

Total L1 read operations.

l1_write_count int

Total L1 write operations.

l2_read_count int

Total L2 read operations.

l2_write_count int

Total L2 write operations.

l1_read_cycles list[int]

List of unique cycles with L1 reads.

l1_write_cycles list[int]

List of unique cycles with L1 writes.

l2_read_cycles list[int]

List of unique cycles with L2 reads.

l2_write_cycles list[int]

List of unique cycles with L2 writes.

l1_read_cycle_count int

Count of unique cycles with L1 reads.

l1_write_cycle_count int

Count of unique cycles with L1 writes.

l2_read_cycle_count int

Count of unique cycles with L2 reads.

l2_write_cycle_count int

Count of unique cycles with L2 writes.

l1_unique_addrs int

Count of unique addresses seen in L1.

l2_unique_addrs int

Count of unique addresses seen in L2.

l1_zero_count int

Count of L1 lifetimes calculated as zero or incomplete.

l2_zero_count int

Count of L2 lifetimes calculated as zero or incomplete.

l1_lifetimes_count int

Count of valid, non-zero L1 lifetimes calculated.

l2_lifetimes_count int

Count of valid, non-zero L2 lifetimes calculated.

l1_write_policy WritePolicy

L1 write policy enum.

l1_write_allocation WriteAllocation

L1 write allocation enum.

l2_write_policy WritePolicy

L2 write policy enum.

l2_write_allocation WriteAllocation

L2 write allocation enum.

Parameters:

Name Type Description Default
kernel dict

Dictionary containing kernel metadata and log lines from read_cache_log.

required
log_file_path str

Path to the log directory.

required
log_file_base str

Base name of the log files.

required
kernel_name str

Actual kernel name. Defaults to None.

None
config_file_path str

Path to the GPGPU-Sim config file. Required.

None

Raises:

Type Description
AssertionError

If config_file_path is None.

FileNotFoundError

If the config file cannot be read.

Returns:

Name Type Description
None

Initializes parser state for lifetime analysis.
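The parser reports lifetimes in both cycles and nanoseconds using the `GPU_FREQ`/`CYCLE_TIME` constants. A minimal sketch of that conversion, using the 1593 MHz default from the source (the lifetime values here are illustrative, not from a real simulation):

```python
import numpy as np

GPU_FREQ = 1593                      # MHz; overridable via the GPU_FREQ env var
CYCLE_TIME = 1e9 / (GPU_FREQ * 1e6)  # nanoseconds per cycle

# Completed cache-line lifetimes measured in simulation cycles
lifetime_cycles = np.array([120, 4500, 88000])
lifetime_ns = CYCLE_TIME * lifetime_cycles

print(np.round(lifetime_ns, 2))
```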

Source code in python-scripts/accel_sim_parser.py
class SimulationParser:
    """
    Parses Accel-Sim cache log lines for a single kernel to calculate lifetime statistics.

    Processes log lines associated with one kernel launch, tracks cache line entries
    and exits based on load/store operations and cache status (hit/miss), considering
    the configured cache policies (write-allocate, write-back). It calculates the
    lifetime for each cache line instance and aggregates statistics.

    Attributes:
        GPU_FREQ (int): GPU frequency in MHz.
        CYCLE_TIME (float): GPU cycle time in nanoseconds.
        log_path (str): Path to the log directory.
        log_file (str): Base name of the log file.
        kernel_name (str): Name of the kernel being processed.
        kernel_id (int): ID of the kernel being processed.
        sim_cycles (int): Simulation cycles for this kernel.
        sim_insn (int): Instructions executed by this kernel.
        ipc (float): Instructions per cycle for this kernel.
        total_sim_cycles (int): Total simulation cycles up to this kernel.
        total_sim_insn (int): Total instructions executed up to this kernel.
        log_lines (list[str]): Log lines from `.sim_cache.log` for this kernel.
        sector_size (int): Cache line size in bytes.
        l1_size (int): L1 cache size in bytes.
        l1_cache_lines (float): Number of cache lines in L1.
        l2_size (int): L2 cache size in bytes.
        l2_cache_lines (float): Number of cache lines in L2.
        l1_lifetimes (list[LifetimeType]): List of completed L1 lifetime objects.
        l2_lifetimes (list[LifetimeType]): List of completed L2 lifetime objects.
        l1_most_recent_read (dict[int, int]): Maps L1 address to the cycle of its most recent read hit.
        l2_most_recent_read (dict[int, int]): Maps L2 address to the cycle of its most recent read hit.
        l1_current_lifetime_index (dict[int, int]): Maps L1 address to the index of its currently active lifetime in the internal list.
        l2_current_lifetime_index (dict[int, int]): Maps L2 address to the index of its currently active lifetime in the internal list.
        l1_lifetime_cycles (np.ndarray): Array of completed L1 lifetimes in cycles.
        l1_lifetime_ns (np.ndarray): Array of completed L1 lifetimes in nanoseconds.
        l2_lifetime_cycles (np.ndarray): Array of completed L2 lifetimes in cycles.
        l2_lifetime_ns (np.ndarray): Array of completed L2 lifetimes in nanoseconds.
        l1_read_count (int): Total L1 read operations.
        l1_write_count (int): Total L1 write operations.
        l2_read_count (int): Total L2 read operations.
        l2_write_count (int): Total L2 write operations.
        l1_read_cycles (list[int]): List of unique cycles with L1 reads.
        l1_write_cycles (list[int]): List of unique cycles with L1 writes.
        l2_read_cycles (list[int]): List of unique cycles with L2 reads.
        l2_write_cycles (list[int]): List of unique cycles with L2 writes.
        l1_read_cycle_count (int): Count of unique cycles with L1 reads.
        l1_write_cycle_count (int): Count of unique cycles with L1 writes.
        l2_read_cycle_count (int): Count of unique cycles with L2 reads.
        l2_write_cycle_count (int): Count of unique cycles with L2 writes.
        l1_unique_addrs (int): Count of unique addresses seen in L1.
        l2_unique_addrs (int): Count of unique addresses seen in L2.
        l1_zero_count (int): Count of L1 lifetimes calculated as zero or incomplete.
        l2_zero_count (int): Count of L2 lifetimes calculated as zero or incomplete.
        l1_lifetimes_count (int): Count of valid, non-zero L1 lifetimes calculated.
        l2_lifetimes_count (int): Count of valid, non-zero L2 lifetimes calculated.
        l1_write_policy (WritePolicy): L1 write policy enum.
        l1_write_allocation (WriteAllocation): L1 write allocation enum.
        l2_write_policy (WritePolicy): L2 write policy enum.
        l2_write_allocation (WriteAllocation): L2 write allocation enum.

    Args:
        kernel (dict): Dictionary containing kernel metadata and log lines from `read_cache_log`.
        log_file_path (str): Path to the log directory.
        log_file_base (str): Base name of the log files.
        kernel_name (str, optional): Actual kernel name. Defaults to None.
        config_file_path (str, optional): Path to the GPGPU-Sim config file. Required.

    Raises:
        AssertionError: If `config_file_path` is None.
        FileNotFoundError: If the config file cannot be read.

    Returns:
        None: Initializes parser state for lifetime analysis.
    """

    def __init__(self, kernel: dict, log_file_path: str, log_file_base: str, kernel_name: str = None, config_file_path: str = None):
        """
        Initialize the SimulationParser for a specific kernel.

        Args:
            kernel (dict): Dictionary containing kernel metadata and log lines from `read_cache_log`.
            log_file_path (str): Path to the log directory.
            log_file_base (str): Base name of the log files.
            kernel_name (str, optional): Actual kernel name (e.g., from `kernels.csv`). Defaults to None.
            config_file_path (str, optional): Path to the GPGPU-Sim config file. Required.

        Raises:
            AssertionError: If `config_file_path` is None.
            FileNotFoundError: If the config file cannot be read by `read_config_file`.
        """
        # Constants
        # Lovelace GPU has frequency of 2235
        self.GPU_FREQ = int(os.getenv('GPU_FREQ', 1593))
        # Time in ns for one cycle
        self.CYCLE_TIME = 1e9 / (self.GPU_FREQ * 1e6)

        self.log_path = log_file_path
        self.log_file = log_file_base

        # Kernel identifiers
        self.kernel_name = kernel_name
        self.kernel_id = kernel["kernel_id"]
        self.sim_cycles = kernel["gpu_sim_cycle"]
        self.sim_insn = kernel["gpu_sim_insn"]
        self.ipc = kernel["gpu_ipc"]
        self.total_sim_cycles = kernel["gpu_tot_sim_cycle"]
        self.total_sim_insn = kernel["gpu_tot_sim_insn"]

        # Log lines for this kernel
        self.log_lines = kernel["lines"]

        self.sector_size = 32
        self.l1_size = kernel.get("l1_config", 128 * 1024)
        self.l1_cache_lines = self.l1_size / self.sector_size
        self.l2_size = convert_size(os.getenv('L2_SIZE', "50MB"))
        self.l2_cache_lines = self.l2_size / self.sector_size

        # Data structures for cache lifetimes replaced by LifetimeType lists
        self.l1_lifetimes = []  # list of LifetimeType objects for L1
        self.l2_lifetimes = []  # list of LifetimeType objects for L2
        self.l1_most_recent_read = {}
        self.l2_most_recent_read = {}
        # Lookup tables remain unchanged:
        self.l1_current_lifetime_index = {}
        self.l2_current_lifetime_index = {}

        self.l1_lifetime_cycles = []
        self.l1_lifetime_ns = np.array([], dtype=np.float64)
        self.l2_lifetime_cycles = []
        self.l2_lifetime_ns = np.array([], dtype=np.float64)

        # Counters for instructions
        self.l1_read_cycles = []
        self.l1_read_cycle_count = 0
        self.l1_read_count = 0
        self.l1_write_cycles = []
        self.l1_write_cycle_count = 0
        self.l1_write_count = 0
        self.l2_read_cycles = []
        self.l2_read_cycle_count = 0
        self.l2_read_count = 0
        self.l2_write_cycles = []
        self.l2_write_cycle_count = 0
        self.l2_write_count = 0
        self.l1_unique_addrs = 0
        self.l2_unique_addrs = 0

        # Cache configurations
        assert config_file_path is not None, "Config file path must be provided."
        l1_config, l2_config = read_config_file(config_file_path)
        self.l1_write_policy = l1_config["write_policy"]
        self.l1_write_allocation = l1_config["write_allocation"]
        self.l2_write_policy = l2_config["write_policy"]
        self.l2_write_allocation = l2_config["write_allocation"]

    def process_cycle(self, line: str) -> int:
        """
        Extract the simulation cycle number from a log line.

        Args:
            line (str): A log line containing 'Cycle <num>'.

        Returns:
            int: The extracted cycle number.

        Raises:
            ValueError: If the cycle number cannot be parsed as an integer.
            IndexError: If the line format is unexpected.
        """
        cycle_str = line.split("Cycle ")[1].split()[0]
        cycle = int(cycle_str.strip(":"))
        return cycle

    def process_line(self, line: str):
        """
        Process a single cache log line to update lifetime tracking.

        Parses the line to identify L1 or L2 cache accesses, address, cycle,
        status (hit/miss), and operation type (load/store). Updates the internal
        lifetime tracking structures (`l1_lifetimes`, `l2_lifetimes`,
        `l1_current_lifetime_index`, `l2_current_lifetime_index`,
        `l1_most_recent_read`, `l2_most_recent_read`) based on the access and
        configured cache policies. Also updates read/write counters.

        Args:
            line (str): The cache log line to process.

        Returns:
            None: Modifies internal state.
        """
        # Process L1 cache lines
        # GPGPU-Sim Cycle 11097: Load instr from L1D cache at SM 0 bank 3 addr 2d61c4e0 status 2
        if "L1D cache at SM" in line:
            # Get the number after "status" to determine if it's a hit or miss
            status = int(line.split("status ")[1].split()[0])
            if status >= 3:
                return
            # Get the cycle number
            cycle = int(line.split("Cycle ")[1].split()[0].strip(":"))
            # Get the address after "addr" and convert to decimal int
            address = int(line.split("addr ")[1].split()[0], 16)

            # Look up this address's currently active lifetime, if any
            index = self.l1_current_lifetime_index.get(address, None)

            l1_write = "Store" in line
            l1_read = "Load" in line

            # End the lifetime of the cache line
            if (status == 2 or l1_write) and index is not None:
                start = self.l1_lifetimes[index].start
                # If an active lifetime exists, end it using the latest read
                last_read = self.l1_most_recent_read.get(address, start)
                self.l1_lifetimes[index].end = last_read if last_read >= start else None

            # Decide whether to create a new lifetime entry:
            # - On a miss (status == 2):
            #   * If an active entry was just ended above, always start a new one.
            #   * On a cold-start miss (no active entry), start one for a load,
            #     or for a store only under a write-allocate policy.
            # - On a store hit to an active entry, start a new one.
            new_entry = (
                # Cache miss
                (status == 2 and
                 # An active entry existed (and was just ended above)
                 (index is not None or
                  # Cold-start miss: allocate on load, or on store if the
                  # write-allocation policy permits it
                  (self.l1_write_allocation == WriteAllocation.WRITE_ALLOCATE or l1_read)))
                # Store hit to an active entry
                or (l1_write and index is not None))

            # If a new lifetime entry is needed, create it.
            if new_entry:
                new_lt = LifetimeType(address, cycle, None)
                self.l1_lifetimes.append(new_lt)
                self.l1_current_lifetime_index[address] = len(
                    self.l1_lifetimes) - 1

            # Process the instruction type
            if l1_write:
                self.l1_write_count += 1
                if cycle not in self.l1_write_cycles:
                    self.l1_write_cycles.append(cycle)
            elif l1_read:
                self.l1_read_count += 1
                # store the most recent read if this is a hit
                if status == 0:
                    self.l1_most_recent_read[address] = cycle
                if cycle not in self.l1_read_cycles:
                    self.l1_read_cycles.append(cycle)
            else:
                print(f"Warning: Unknown L1D instruction type: {line}")
                return

        # Process L2 cache lines
        elif "L2 Address" in line:
            # GPGPU-Sim Cycle 12969: MEMORY_SUBPARTITION_UNIT -  2 - Store Request to L2 Address=71082d62bb60, status=0
            status = int(line.split("status=")[1].split()[0])
            if status >= 3:
                return
            cycle = int(line.split("Cycle ")[1].split()[0].strip(":"))
            address = int(line.split("Address=")[1].split(",")[0], 16)
            # Get the bank after MEMORY_SUBPARTITION_UNIT
            bank = int(line.split("MEMORY_SUBPARTITION_UNIT - ")
                       [1].split(" - ")[0].strip())
            # if bank != 0:
            #     return

            # Look up this address's currently active lifetime, if any
            index = self.l2_current_lifetime_index.get(address, None)

            l2_write = "Store" in line
            l2_read = "Load" in line

            # End lifetime if needed
            if (status == 2 or l2_write) and index is not None:
                start = self.l2_lifetimes[index].start
                # If an active lifetime exists, end it using the latest read
                last_read = self.l2_most_recent_read.get(address, start)
                self.l2_lifetimes[index].end = last_read if last_read >= start else None
            # Decide whether to create a new lifetime entry:
            # - On a miss (status == 2):
            #   * If an active entry was just ended above, always start a new one.
            #   * On a cold-start miss (no active entry), start one for a load,
            #     or for a store only under a write-allocate policy.
            # - On a store hit to an active entry, start a new one.
            new_entry = (
                # Cache miss
                (status == 2 and
                    # An active entry existed (and was just ended above)
                    (index is not None or
                     # Cold-start miss: allocate on load, or on store if the
                     # write-allocation policy permits it
                     (self.l2_write_allocation == WriteAllocation.WRITE_ALLOCATE or l2_read)))
                # Store hit to an active entry
                or (l2_write and index is not None))
            # If a new lifetime entry is needed, create it.
            if new_entry:
                new_lt = LifetimeType(address, cycle, None)
                self.l2_lifetimes.append(new_lt)
                self.l2_current_lifetime_index[address] = len(
                    self.l2_lifetimes) - 1

            # Process the instruction type
            if l2_write:
                self.l2_write_count += 1
                if cycle not in self.l2_write_cycles:
                    self.l2_write_cycles.append(cycle)
            elif l2_read:
                self.l2_read_count += 1
                if status == 0:
                    self.l2_most_recent_read[address] = cycle
                if cycle not in self.l2_read_cycles:
                    self.l2_read_cycles.append(cycle)
            else:
                print(f"Warning: Unknown L2 instruction type: {line}")
                return

    def parse_log_file(self) -> list:
        """
        Parse all log lines for the kernel and finalize lifetime calculations.

        Iterates through `self.log_lines`, calling `process_line` for each.
        After processing all lines, it finalizes any remaining active lifetimes
        using the `l1_most_recent_read` and `l2_most_recent_read` dictionaries.
        Calculates lifetime durations in cycles and nanoseconds, storing them in
        `l1_lifetime_cycles`, `l1_lifetime_ns`, etc. Updates final counts for
        unique addresses and zero/incomplete lifetimes.

        Returns:
            list: A list containing the state needed to continue parsing for the
                  next kernel: `[l1_lifetimes_incomplete, l2_lifetimes_incomplete,
                  l1_most_recent_read, l2_most_recent_read]`.
                  `l1_lifetimes_incomplete` and `l2_lifetimes_incomplete` are lists
                  of `LifetimeType` objects whose `end` cycle is still None.
        """

        print(
            f"Starting with {len(self.l1_lifetimes)} imported L1 and "
            f"{len(self.l2_lifetimes)} imported L2 lifetimes.")

        line_count = 0
        for line in self.log_lines:
            if line_count % 10000 == 0:
                print(
                    f"\tProcessing line {line_count} of {len(self.log_lines)}")
            self.process_line(line)
            line_count += 1

        # Finalize L1 lifetime entries
        valid_l1 = []
        self.l1_zero_count = 0
        for lt in self.l1_lifetimes:
            if lt.end is not None:
                if lt.end > lt.start:
                    valid_l1.append(lt)
                elif lt.end == lt.start:
                    self.l1_zero_count += 1
            elif self.l1_most_recent_read.get(lt.address) is not None:
                lt.end = self.l1_most_recent_read[lt.address]
                if lt.end > lt.start:
                    valid_l1.append(lt)
                elif lt.end == lt.start:
                    self.l1_zero_count += 1

        self.l1_lifetime_cycles = np.array(
            [lt.calculate_lifetime() for lt in valid_l1])
        # Convert to nanoseconds
        self.l1_lifetime_ns = self.CYCLE_TIME * self.l1_lifetime_cycles

        # Finalize L2 lifetime entries
        valid_l2 = []
        self.l2_zero_count = 0
        for lt in self.l2_lifetimes:
            if lt.end is not None:
                if lt.end > lt.start:
                    valid_l2.append(lt)
                elif lt.end == lt.start:
                    self.l2_zero_count += 1
            elif self.l2_most_recent_read.get(lt.address) is not None:
                lt.end = self.l2_most_recent_read[lt.address]
                if lt.end > lt.start:
                    valid_l2.append(lt)
                elif lt.end == lt.start:
                    self.l2_zero_count += 1

        self.l2_lifetime_cycles = np.array(
            [lt.calculate_lifetime() for lt in valid_l2])
        # Convert to nanoseconds
        self.l2_lifetime_ns = self.CYCLE_TIME * self.l2_lifetime_cycles

        # Update read/write counters and unique addresses
        self.l1_read_cycle_count = len(self.l1_read_cycles)
        self.l1_write_cycle_count = len(self.l1_write_cycles)
        self.l2_read_cycle_count = len(self.l2_read_cycles)
        self.l2_write_cycle_count = len(self.l2_write_cycles)
        self.l1_unique_addrs = len({lt.address for lt in self.l1_lifetimes})
        self.l2_unique_addrs = len({lt.address for lt in self.l2_lifetimes})

        l1_lifetimes_export = [
            lt for lt in self.l1_lifetimes if lt.end is None]
        l2_lifetimes_export = [
            lt for lt in self.l2_lifetimes if lt.end is None]
        self.l1_zero_count += len(l1_lifetimes_export)
        self.l2_zero_count += len(l2_lifetimes_export)
        self.l1_lifetimes = valid_l1
        self.l2_lifetimes = valid_l2
        self.l1_lifetimes_count = len(self.l1_lifetimes)
        self.l2_lifetimes_count = len(self.l2_lifetimes)

        print(
            f"Kernel {self.kernel_name} (ID: {self.kernel_id}): Processed {len(self.log_lines)} lines."
        )
        print(
            f"Collected {len(self.l1_lifetime_cycles)} valid L1 lifetime entries."
        )
        print(
            f"Collected {len(self.l2_lifetime_cycles)} valid L2 lifetime entries."
        )
        print(
            f"Exporting {len(l1_lifetimes_export)} L1 lifetimes and {len(l2_lifetimes_export)} L2 lifetimes.")

        return [
            l1_lifetimes_export,
            l2_lifetimes_export,
            self.l1_most_recent_read,
            self.l2_most_recent_read
        ]

    def get_instruction_stats(self):
        """
        Calculate instruction frequencies and cache utilization statistics.

        Computes L1/L2 load/store frequencies (in MHz) based on the number of
        unique cycles with corresponding operations and the total simulation cycles
        for the kernel. Also calculates overall load/store frequencies and
        L1/L2 cache utilization based on unique addresses accessed.

        Returns:
            dict: A dictionary containing various statistics:
                - 'kernel_name', 'kernel_id'
                - 'l1_load_count', 'l1_store_count': Total L1 operations.
                - 'l1_read_cycles', 'l1_write_cycles': Unique cycles with L1 ops.
                - 'l1_load_frequency', 'l1_store_frequency': Frequencies in MHz.
                - 'l2_load_count', 'l2_store_count': Total L2 operations.
                - 'l2_read_cycles', 'l2_write_cycles': Unique cycles with L2 ops.
                - 'l2_load_frequency', 'l2_store_frequency': Frequencies in MHz.
                - 'l1_unique_addrs', 'l2_unique_addrs': Unique addresses accessed.
                - 'l1_utilization', 'l2_utilization': Cache utilization ratios.
                - 'load_count', 'store_count': Total unique cycles with loads/stores.
                - 'load_frequency', 'store_frequency': Overall frequencies in MHz.
                - 'l1_zero_count', 'l2_zero_count': Counts of zero/incomplete lifetimes.
                - 'l1_lifetimes_count', 'l2_lifetimes_count': Counts of valid lifetimes.
        """
        if self.sim_cycles > 0:
            l1_load_freq = self.GPU_FREQ * self.l1_read_cycle_count / self.sim_cycles
            l1_store_freq = self.GPU_FREQ * self.l1_write_cycle_count / self.sim_cycles
            l2_load_freq = self.GPU_FREQ * self.l2_read_cycle_count / self.sim_cycles
            l2_store_freq = self.GPU_FREQ * self.l2_write_cycle_count / self.sim_cycles
            # Get the union between the L1 and L2 read cycle sets
            load_cycles = set(self.l1_read_cycles).union(
                set(self.l2_read_cycles))
            # Get the union between the L1 and L2 write cycle sets
            store_cycles = set(self.l1_write_cycles).union(
                set(self.l2_write_cycles))
            load_freq = self.GPU_FREQ * len(load_cycles) / self.sim_cycles
            store_freq = self.GPU_FREQ * len(store_cycles) / self.sim_cycles
        else:
            l1_load_freq = 0
            l1_store_freq = 0
            l2_load_freq = 0
            l2_store_freq = 0
            load_freq = 0
            store_freq = 0

        return {
            'kernel_name': self.kernel_name,
            'kernel_id': self.kernel_id,
            # 'l1_load_count': self.l1_read_cycle_count,
            # 'l1_store_count': self.l1_write_cycle_count,
            'l1_load_count': self.l1_read_count,
            'l1_store_count': self.l1_write_count,
            'l1_read_cycles': self.l1_read_cycle_count,
            'l1_write_cycles': self.l1_write_cycle_count,
            'l1_load_frequency': l1_load_freq,
            'l1_store_frequency': l1_store_freq,
            # 'l2_load_count': self.l2_read_cycle_count,
            # 'l2_store_count': self.l2_write_cycle_count,
            'l2_load_count': self.l2_read_count,
            'l2_store_count': self.l2_write_count,
            'l2_read_cycles': self.l2_read_cycle_count,
            'l2_write_cycles': self.l2_write_cycle_count,
            'l2_load_frequency': l2_load_freq,
            'l2_store_frequency': l2_store_freq,
            "l1_unique_addrs": self.l1_unique_addrs,
            "l2_unique_addrs": self.l2_unique_addrs,
            "l1_utilization": self.l1_unique_addrs / self.l1_cache_lines,
            "l2_utilization": self.l2_unique_addrs / self.l2_cache_lines,
            'load_count': self.l1_read_cycle_count + self.l2_read_cycle_count,
            'store_count': self.l1_write_cycle_count + self.l2_write_cycle_count,
            'load_frequency': load_freq,
            'store_frequency': store_freq,
            'l1_zero_count': self.l1_zero_count,
            'l2_zero_count': self.l2_zero_count,
            'l1_lifetimes_count': self.l1_lifetimes_count,
            'l2_lifetimes_count': self.l2_lifetimes_count,
        }

    def fine_grained_df(self):
        """
        Create DataFrames containing fine-grained lifetime data for L1 and L2 caches.

        Generates two pandas DataFrames, one for L1 and one for L2, containing
        individual cache line lifetime entries. Each row represents a completed
        lifetime instance.

        Returns:
            Tuple[pd.DataFrame, pd.DataFrame]: A tuple containing:
                - l1_df: DataFrame with columns 'kernel_id', 'address' (hex),
                         'lifetime_cycles', 'lifetime_ns'.
                - l2_df: DataFrame with columns 'kernel_id', 'address' (hex),
                         'lifetime_cycles', 'lifetime_ns'.
        """
        # Use valid lifetimes computed in parse_log_file
        min_l1_len = len(self.l1_lifetime_cycles)
        min_l2_len = len(self.l2_lifetime_cycles)
        l1_addresses = [hex(int(lt.address)) for lt in self.l1_lifetimes]
        l1_df = pd.DataFrame({
            "kernel_id": self.kernel_id,
            "address": l1_addresses[:min_l1_len],
            "lifetime_cycles": self.l1_lifetime_cycles[:min_l1_len],
            "lifetime_ns": self.l1_lifetime_ns[:min_l1_len]
        })
        l2_addresses = [hex(int(lt.address)) for lt in self.l2_lifetimes]
        l2_df = pd.DataFrame({
            "kernel_id": self.kernel_id,
            "address": l2_addresses[:min_l2_len],
            "lifetime_cycles": self.l2_lifetime_cycles[:min_l2_len],
            "lifetime_ns": self.l2_lifetime_ns[:min_l2_len]
        })
        return l1_df, l2_df

    def coarse_grained_dict(self):
        """
        Generate a dictionary containing coarse-grained, kernel-level statistics.

        Aggregates lifetime data (mean, median, 90th percentile, max) and combines
        it with instruction statistics from `get_instruction_stats()` and cache
        configuration details into a single dictionary summarizing the kernel's behavior.

        Returns:
            dict: A dictionary containing aggregated statistics for the kernel,
                  including lifetime metrics (mean, median, etc. in microseconds),
                  instruction counts and frequencies, utilization, cache configuration,
                  and lifetime counts. Keys match the column names used for the
                  coarse-grained CSV output.
        """
        # Check if lifetime data is empty
        if len(self.l1_lifetime_ns) == 0:
            l1_mean, l1_median, l1_90, l1_max = 0, 0, 0, 0
        else:
            l1_mean = np.mean(self.l1_lifetime_ns)
            l1_median = np.median(self.l1_lifetime_ns)
            l1_90 = np.percentile(self.l1_lifetime_ns, 90)
            l1_max = np.max(self.l1_lifetime_ns)
        if len(self.l2_lifetime_ns) == 0:
            l2_mean, l2_median, l2_90, l2_max = 0, 0, 0, 0
        else:
            l2_mean = np.mean(self.l2_lifetime_ns)
            l2_median = np.median(self.l2_lifetime_ns)
            l2_90 = np.percentile(self.l2_lifetime_ns, 90)
            l2_max = np.max(self.l2_lifetime_ns)
        stats = self.get_instruction_stats()
        return {
            "Kernel ID": self.kernel_id,
            "Mangled Names": self.kernel_name,
            "L1 Lifetime": l1_mean / 1e3,
            "L1 Lifetime Median": l1_median / 1e3,
            "L1 Lifetime 90%-tile": l1_90 / 1e3,
            "L1 Lifetime Max": l1_max / 1e3,
            "L2 Lifetime": l2_mean / 1e3,
            "L2 Lifetime Median": l2_median / 1e3,
            "L2 Lifetime 90%-tile": l2_90 / 1e3,
            "L2 Lifetime Max": l2_max / 1e3,
            "L1 Read Count": stats["l1_load_count"],
            "L1 Read Cycles": stats["l1_read_cycles"],
            "L1 Write Count": stats["l1_store_count"],
            "L1 Write Cycles": stats["l1_write_cycles"],
            "L2 Read Count": stats["l2_load_count"],
            "L2 Read Cycles": stats["l2_read_cycles"],
            "L2 Write Count": stats["l2_store_count"],
            "L2 Write Cycles": stats["l2_write_cycles"],
            "L1 Zero Count": stats["l1_zero_count"],
            "L2 Zero Count": stats["l2_zero_count"],
            "L1 Read Frequency": stats["l1_load_frequency"],
            "L1 Write Frequency": stats["l1_store_frequency"],
            "L2 Read Frequency": stats["l2_load_frequency"],
            "L2 Write Frequency": stats["l2_store_frequency"],
            "L1 Utilization": stats["l1_utilization"],
            "L1 Size": self.l1_size,
            "L1 Unique Addresses": stats["l1_unique_addrs"],
            "L2 Utilization": stats["l2_utilization"],
            "L2 Size": self.l2_size,
            "L2 Unique Addresses": stats["l2_unique_addrs"],
            "Total Read Frequency": stats["load_frequency"],
            "Total Write Frequency": stats["store_frequency"],
            "Total Cycles": self.sim_cycles,
            "L1 Write Policy": self.l1_write_policy.name,
            "L1 Write Allocation": self.l1_write_allocation.name,
            "L2 Write Policy": self.l2_write_policy.name,
            "L2 Write Allocation": self.l2_write_allocation.name,
            "L1 Lifetime Count": self.l1_lifetimes_count,
            "L2 Lifetime Count": self.l2_lifetimes_count,
        }

    def import_cache_states(self, cache_states):
        """
        Import cache state from the previous kernel to continue lifetime tracking.

        Takes the state returned by `parse_log_file` from the previously processed
        kernel and initializes the current parser instance with it. This allows
        lifetimes spanning across kernel boundaries to be tracked correctly.

        Args:
            cache_states (list): The list returned by `parse_log_file` containing
                                 incomplete lifetimes and most recent read dictionaries.
                                 `[l1_lifetimes_incomplete, l2_lifetimes_incomplete,
                                 l1_most_recent_read, l2_most_recent_read]`.

        Returns:
            None: Updates internal state (`l1_lifetimes`, `l2_lifetimes`,
                  `l1_most_recent_read`, `l2_most_recent_read`,
                  `l1_current_lifetime_index`, `l2_current_lifetime_index`).
        """
        # Unpack the cache states
        l1_lifetimes, l2_lifetimes, self.l1_most_recent_read, self.l2_most_recent_read = cache_states
        # Assign the lifetimes to the class variables
        self.l1_lifetimes = l1_lifetimes
        self.l2_lifetimes = l2_lifetimes
        # print length of the lifetimes
        print(
            f"Importing {len(self.l1_lifetimes)} L1 lifetimes and {len(self.l2_lifetimes)} L2 lifetimes.")
        # Rebuild the current lifetime index dictionaries
        # Find the last index for each address
        self.l1_current_lifetime_index = {}
        for i, lt in enumerate(self.l1_lifetimes):
            self.l1_current_lifetime_index[lt.address] = i
        self.l2_current_lifetime_index = {}
        for i, lt in enumerate(self.l2_lifetimes):
            self.l2_current_lifetime_index[lt.address] = i

__init__(kernel, log_file_path, log_file_base, kernel_name=None, config_file_path=None)

Initialize the SimulationParser for a specific kernel.

Parameters:

- kernel (dict): Dictionary containing kernel metadata and log lines from read_cache_log. Required.
- log_file_path (str): Path to the log directory. Required.
- log_file_base (str): Base name of the log files. Required.
- kernel_name (str): Actual kernel name (e.g., from kernels.csv). Defaults to None.
- config_file_path (str): Path to the GPGPU-Sim config file. Required (default None).

Raises:

- AssertionError: If config_file_path is None.
- FileNotFoundError: If the config file cannot be read by read_config_file.

Source code in python-scripts/accel_sim_parser.py
def __init__(self, kernel: dict, log_file_path: str, log_file_base: str, kernel_name: str = None, config_file_path: str = None):
    """
    Initialize the SimulationParser for a specific kernel.

    Args:
        kernel (dict): Dictionary containing kernel metadata and log lines from `read_cache_log`.
        log_file_path (str): Path to the log directory.
        log_file_base (str): Base name of the log files.
        kernel_name (str, optional): Actual kernel name (e.g., from `kernels.csv`). Defaults to None.
        config_file_path (str, optional): Path to the GPGPU-Sim config file. Required.

    Raises:
        AssertionError: If `config_file_path` is None.
        FileNotFoundError: If the config file cannot be read by `read_config_file`.
    """
    # Constants
    # GPU core clock in MHz (default 1593), overridable via the GPU_FREQ
    # environment variable; a Lovelace GPU runs at a frequency of 2235
    self.GPU_FREQ = int(os.getenv('GPU_FREQ', 1593))
    # Time in ns for one cycle
    self.CYCLE_TIME = 1e9 / (self.GPU_FREQ * 1e6)

    self.log_path = log_file_path
    self.log_file = log_file_base

    # Kernel identifiers
    self.kernel_name = kernel_name
    self.kernel_id = kernel["kernel_id"]
    self.sim_cycles = kernel["gpu_sim_cycle"]
    self.sim_insn = kernel["gpu_sim_insn"]
    self.ipc = kernel["gpu_ipc"]
    self.total_sim_cycles = kernel["gpu_tot_sim_cycle"]
    self.total_sim_insn = kernel["gpu_tot_sim_insn"]

    # Log lines for this kernel
    self.log_lines = kernel["lines"]

    self.sector_size = 32
    self.l1_size = kernel.get("l1_config", 128 * 1024)
    self.l1_cache_lines = self.l1_size / self.sector_size
    self.l2_size = convert_size(os.getenv('L2_SIZE', "50MB"))
    self.l2_cache_lines = self.l2_size / self.sector_size

    # Data structures for cache lifetimes replaced by LifetimeType lists
    self.l1_lifetimes = []  # list of LifetimeType objects for L1
    self.l2_lifetimes = []  # list of LifetimeType objects for L2
    self.l1_most_recent_read = {}
    self.l2_most_recent_read = {}
    # Lookup tables remain unchanged:
    self.l1_current_lifetime_index = {}
    self.l2_current_lifetime_index = {}

    self.l1_lifetime_cycles = []
    self.l1_lifetime_ns = np.array([], dtype=np.float64)
    self.l2_lifetime_cycles = []
    self.l2_lifetime_ns = np.array([], dtype=np.float64)

    # Counters for instructions
    self.l1_read_cycles = []
    self.l1_read_cycle_count = 0
    self.l1_read_count = 0
    self.l1_write_cycles = []
    self.l1_write_cycle_count = 0
    self.l1_write_count = 0
    self.l2_read_cycles = []
    self.l2_read_cycle_count = 0
    self.l2_read_count = 0
    self.l2_write_cycles = []
    self.l2_write_cycle_count = 0
    self.l2_write_count = 0
    self.l1_unique_addrs = 0
    self.l2_unique_addrs = 0

    # Cache configurations
    assert config_file_path is not None, "Config file path must be provided."
    l1_config, l2_config = read_config_file(config_file_path)
    self.l1_write_policy = l1_config["write_policy"]
    self.l1_write_allocation = l1_config["write_allocation"]
    self.l2_write_policy = l2_config["write_policy"]
    self.l2_write_allocation = l2_config["write_allocation"]
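The constructor derives its timing and capacity constants from two environment variables, `GPU_FREQ` (core clock in MHz) and `L2_SIZE` (a size string). The sketch below shows those derivations in isolation; the `convert_size` helper here is a simplified stand-in for the real helper in the parser module.

```python
import os

def convert_size(size_str: str) -> int:
    """Simplified stand-in for the module's convert_size helper."""
    units = {"GB": 1024 ** 3, "MB": 1024 ** 2, "KB": 1024}
    for suffix, factor in units.items():
        if size_str.upper().endswith(suffix):
            return int(float(size_str[:-len(suffix)]) * factor)
    return int(size_str)

# Same derivations as __init__: clock in MHz -> ns per cycle,
# cache size in bytes -> number of 32-byte sectors.
gpu_freq_mhz = int(os.getenv("GPU_FREQ", 1593))
cycle_time_ns = 1e9 / (gpu_freq_mhz * 1e6)
sector_size = 32
l2_size = convert_size(os.getenv("L2_SIZE", "50MB"))
l2_cache_lines = l2_size // sector_size
```

With the defaults, one cycle is roughly 0.63 ns and a 50 MB L2 holds 1,638,400 32-byte sectors.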

coarse_grained_dict()

Generate a dictionary containing coarse-grained, kernel-level statistics.

Aggregates lifetime data (mean, median, 90th percentile, max) and combines it with instruction statistics from get_instruction_stats() and cache configuration details into a single dictionary summarizing the kernel's behavior.

Returns:

- dict: A dictionary containing aggregated statistics for the kernel, including lifetime metrics (mean, median, etc., in microseconds), instruction counts and frequencies, utilization, cache configuration, and lifetime counts. Keys match the column names used for the coarse-grained CSV output.

Source code in python-scripts/accel_sim_parser.py
def coarse_grained_dict(self):
    """
    Generate a dictionary containing coarse-grained, kernel-level statistics.

    Aggregates lifetime data (mean, median, 90th percentile, max) and combines
    it with instruction statistics from `get_instruction_stats()` and cache
    configuration details into a single dictionary summarizing the kernel's behavior.

    Returns:
        dict: A dictionary containing aggregated statistics for the kernel,
              including lifetime metrics (mean, median, etc. in microseconds),
              instruction counts and frequencies, utilization, cache configuration,
              and lifetime counts. Keys match the column names used for the
              coarse-grained CSV output.
    """
    # Check if lifetime data is empty
    if len(self.l1_lifetime_ns) == 0:
        l1_mean, l1_median, l1_90, l1_max = 0, 0, 0, 0
    else:
        l1_mean = np.mean(self.l1_lifetime_ns)
        l1_median = np.median(self.l1_lifetime_ns)
        l1_90 = np.percentile(self.l1_lifetime_ns, 90)
        l1_max = np.max(self.l1_lifetime_ns)
    if len(self.l2_lifetime_ns) == 0:
        l2_mean, l2_median, l2_90, l2_max = 0, 0, 0, 0
    else:
        l2_mean = np.mean(self.l2_lifetime_ns)
        l2_median = np.median(self.l2_lifetime_ns)
        l2_90 = np.percentile(self.l2_lifetime_ns, 90)
        l2_max = np.max(self.l2_lifetime_ns)
    stats = self.get_instruction_stats()
    return {
        "Kernel ID": self.kernel_id,
        "Mangled Names": self.kernel_name,
        "L1 Lifetime": l1_mean / 1e3,
        "L1 Lifetime Median": l1_median / 1e3,
        "L1 Lifetime 90%-tile": l1_90 / 1e3,
        "L1 Lifetime Max": l1_max / 1e3,
        "L2 Lifetime": l2_mean / 1e3,
        "L2 Lifetime Median": l2_median / 1e3,
        "L2 Lifetime 90%-tile": l2_90 / 1e3,
        "L2 Lifetime Max": l2_max / 1e3,
        "L1 Read Count": stats["l1_load_count"],
        "L1 Read Cycles": stats["l1_read_cycles"],
        "L1 Write Count": stats["l1_store_count"],
        "L1 Write Cycles": stats["l1_write_cycles"],
        "L2 Read Count": stats["l2_load_count"],
        "L2 Read Cycles": stats["l2_read_cycles"],
        "L2 Write Count": stats["l2_store_count"],
        "L2 Write Cycles": stats["l2_write_cycles"],
        "L1 Zero Count": stats["l1_zero_count"],
        "L2 Zero Count": stats["l2_zero_count"],
        "L1 Read Frequency": stats["l1_load_frequency"],
        "L1 Write Frequency": stats["l1_store_frequency"],
        "L2 Read Frequency": stats["l2_load_frequency"],
        "L2 Write Frequency": stats["l2_store_frequency"],
        "L1 Utilization": stats["l1_utilization"],
        "L1 Size": self.l1_size,
        "L1 Unique Addresses": stats["l1_unique_addrs"],
        "L2 Utilization": stats["l2_utilization"],
        "L2 Size": self.l2_size,
        "L2 Unique Addresses": stats["l2_unique_addrs"],
        "Total Read Frequency": stats["load_frequency"],
        "Total Write Frequency": stats["store_frequency"],
        "Total Cycles": self.sim_cycles,
        "L1 Write Policy": self.l1_write_policy.name,
        "L1 Write Allocation": self.l1_write_allocation.name,
        "L2 Write Policy": self.l2_write_policy.name,
        "L2 Write Allocation": self.l2_write_allocation.name,
        "L1 Lifetime Count": self.l1_lifetimes_count,
        "L2 Lifetime Count": self.l2_lifetimes_count,
    }
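Because every kernel's `coarse_grained_dict()` uses the same keys, the per-kernel dictionaries stack directly into the kernel-level CSV. The sketch below is illustrative, not the script's actual driver: the rows are hypothetical and only a few of the real columns are shown.

```python
import pandas as pd

# Illustrative per-kernel rows in the shape coarse_grained_dict() returns;
# only a handful of the real columns are shown.
rows = [
    {"Kernel ID": 1, "Mangled Names": "_Z6kernelA", "L1 Lifetime": 12.5, "L2 Lifetime": 40.2},
    {"Kernel ID": 2, "Mangled Names": "_Z6kernelB", "L1 Lifetime": 9.1, "L2 Lifetime": 35.7},
]
coarse_df = pd.DataFrame(rows)
csv_text = coarse_df.to_csv(index=False)  # one row per kernel, dict keys as header
```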

fine_grained_df()

Create DataFrames containing fine-grained lifetime data for L1 and L2 caches.

Generates two pandas DataFrames, one for L1 and one for L2, containing individual cache line lifetime entries. Each row represents a completed lifetime instance.

Returns:

- Tuple[pd.DataFrame, pd.DataFrame]: A tuple containing:
    - l1_df: DataFrame with columns 'kernel_id', 'address' (hex), 'lifetime_cycles', 'lifetime_ns'.
    - l2_df: DataFrame with the same columns for the L2 cache.

Source code in python-scripts/accel_sim_parser.py
def fine_grained_df(self):
    """
    Create DataFrames containing fine-grained lifetime data for L1 and L2 caches.

    Generates two pandas DataFrames, one for L1 and one for L2, containing
    individual cache line lifetime entries. Each row represents a completed
    lifetime instance.

    Returns:
        Tuple[pd.DataFrame, pd.DataFrame]: A tuple containing:
            - l1_df: DataFrame with columns 'kernel_id', 'address' (hex),
                     'lifetime_cycles', 'lifetime_ns'.
            - l2_df: DataFrame with columns 'kernel_id', 'address' (hex),
                     'lifetime_cycles', 'lifetime_ns'.
    """
    # Use valid lifetimes computed in parse_log_file
    min_l1_len = len(self.l1_lifetime_cycles)
    min_l2_len = len(self.l2_lifetime_cycles)
    l1_addresses = [hex(int(lt.address)) for lt in self.l1_lifetimes]
    l1_df = pd.DataFrame({
        "kernel_id": self.kernel_id,
        "address": l1_addresses[:min_l1_len],
        "lifetime_cycles": self.l1_lifetime_cycles[:min_l1_len],
        "lifetime_ns": self.l1_lifetime_ns[:min_l1_len]
    })
    l2_addresses = [hex(int(lt.address)) for lt in self.l2_lifetimes]
    l2_df = pd.DataFrame({
        "kernel_id": self.kernel_id,
        "address": l2_addresses[:min_l2_len],
        "lifetime_cycles": self.l2_lifetime_cycles[:min_l2_len],
        "lifetime_ns": self.l2_lifetime_ns[:min_l2_len]
    })
    return l1_df, l2_df
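Since each call yields one DataFrame per kernel with a fixed schema, a driver can stack them into a single run-level table before writing the fine-grained CSV. The frames below are hypothetical examples of that shape; the stacking itself is the standard `pd.concat` pattern rather than code from this module.

```python
import pandas as pd

# Hypothetical per-kernel frames in the shape fine_grained_df() returns.
l1_k1 = pd.DataFrame({"kernel_id": 1, "address": ["0x10", "0x20"],
                      "lifetime_cycles": [4, 9], "lifetime_ns": [2.51, 5.65]})
l1_k2 = pd.DataFrame({"kernel_id": 2, "address": ["0x10"],
                      "lifetime_cycles": [3], "lifetime_ns": [1.88]})
# Stack into one run-level table; the kernel_id column keeps rows attributable.
all_l1 = pd.concat([l1_k1, l1_k2], ignore_index=True)
```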

get_instruction_stats()

Calculate instruction frequencies and cache utilization statistics.

Computes L1/L2 load/store frequencies (in MHz) based on the number of unique cycles with corresponding operations and the total simulation cycles for the kernel. Also calculates overall load/store frequencies and L1/L2 cache utilization based on unique addresses accessed.

Returns:

- dict: A dictionary containing:
    - 'kernel_name', 'kernel_id'
    - 'l1_load_count', 'l1_store_count': Total L1 operations.
    - 'l1_read_cycles', 'l1_write_cycles': Unique cycles with L1 ops.
    - 'l1_load_frequency', 'l1_store_frequency': Frequencies in MHz.
    - 'l2_load_count', 'l2_store_count': Total L2 operations.
    - 'l2_read_cycles', 'l2_write_cycles': Unique cycles with L2 ops.
    - 'l2_load_frequency', 'l2_store_frequency': Frequencies in MHz.
    - 'l1_unique_addrs', 'l2_unique_addrs': Unique addresses accessed.
    - 'l1_utilization', 'l2_utilization': Cache utilization ratios.
    - 'load_count', 'store_count': Total unique cycles with loads/stores.
    - 'load_frequency', 'store_frequency': Overall frequencies in MHz.
    - 'l1_zero_count', 'l2_zero_count': Counts of zero/incomplete lifetimes.
    - 'l1_lifetimes_count', 'l2_lifetimes_count': Counts of valid lifetimes.

Source code in python-scripts/accel_sim_parser.py
def get_instruction_stats(self):
    """
    Calculate instruction frequencies and cache utilization statistics.

    Computes L1/L2 load/store frequencies (in MHz) based on the number of
    unique cycles with corresponding operations and the total simulation cycles
    for the kernel. Also calculates overall load/store frequencies and
    L1/L2 cache utilization based on unique addresses accessed.

    Returns:
        dict: A dictionary containing various statistics:
            - 'kernel_name', 'kernel_id'
            - 'l1_load_count', 'l1_store_count': Total L1 operations.
            - 'l1_read_cycles', 'l1_write_cycles': Unique cycles with L1 ops.
            - 'l1_load_frequency', 'l1_store_frequency': Frequencies in MHz.
            - 'l2_load_count', 'l2_store_count': Total L2 operations.
            - 'l2_read_cycles', 'l2_write_cycles': Unique cycles with L2 ops.
            - 'l2_load_frequency', 'l2_store_frequency': Frequencies in MHz.
            - 'l1_unique_addrs', 'l2_unique_addrs': Unique addresses accessed.
            - 'l1_utilization', 'l2_utilization': Cache utilization ratios.
            - 'load_count', 'store_count': Total unique cycles with loads/stores.
            - 'load_frequency', 'store_frequency': Overall frequencies in MHz.
            - 'l1_zero_count', 'l2_zero_count': Counts of zero/incomplete lifetimes.
            - 'l1_lifetimes_count', 'l2_lifetimes_count': Counts of valid lifetimes.
    """
    if self.sim_cycles > 0:
        l1_load_freq = self.GPU_FREQ * self.l1_read_cycle_count / self.sim_cycles
        l1_store_freq = self.GPU_FREQ * self.l1_write_cycle_count / self.sim_cycles
        l2_load_freq = self.GPU_FREQ * self.l2_read_cycle_count / self.sim_cycles
        l2_store_freq = self.GPU_FREQ * self.l2_write_cycle_count / self.sim_cycles
        # Get the union between the L1 and L2 read cycle sets
        load_cycles = set(self.l1_read_cycles).union(
            set(self.l2_read_cycles))
        # Get the union between the L1 and L2 write cycle sets
        store_cycles = set(self.l1_write_cycles).union(
            set(self.l2_write_cycles))
        load_freq = self.GPU_FREQ * len(load_cycles) / self.sim_cycles
        store_freq = self.GPU_FREQ * len(store_cycles) / self.sim_cycles
    else:
        l1_load_freq = 0
        l1_store_freq = 0
        l2_load_freq = 0
        l2_store_freq = 0
        load_freq = 0
        store_freq = 0

    return {
        'kernel_name': self.kernel_name,
        'kernel_id': self.kernel_id,
        # 'l1_load_count': self.l1_read_cycle_count,
        # 'l1_store_count': self.l1_write_cycle_count,
        'l1_load_count': self.l1_read_count,
        'l1_store_count': self.l1_write_count,
        'l1_read_cycles': self.l1_read_cycle_count,
        'l1_write_cycles': self.l1_write_cycle_count,
        'l1_load_frequency': l1_load_freq,
        'l1_store_frequency': l1_store_freq,
        # 'l2_load_count': self.l2_read_cycle_count,
        # 'l2_store_count': self.l2_write_cycle_count,
        'l2_load_count': self.l2_read_count,
        'l2_store_count': self.l2_write_count,
        'l2_read_cycles': self.l2_read_cycle_count,
        'l2_write_cycles': self.l2_write_cycle_count,
        'l2_load_frequency': l2_load_freq,
        'l2_store_frequency': l2_store_freq,
        "l1_unique_addrs": self.l1_unique_addrs,
        "l2_unique_addrs": self.l2_unique_addrs,
        "l1_utilization": self.l1_unique_addrs / self.l1_cache_lines,
        "l2_utilization": self.l2_unique_addrs / self.l2_cache_lines,
        'load_count': self.l1_read_cycle_count + self.l2_read_cycle_count,
        'store_count': self.l1_write_cycle_count + self.l2_write_cycle_count,
        'load_frequency': load_freq,
        'store_frequency': store_freq,
        'l1_zero_count': self.l1_zero_count,
        'l2_zero_count': self.l2_zero_count,
        'l1_lifetimes_count': self.l1_lifetimes_count,
        'l2_lifetimes_count': self.l2_lifetimes_count,
    }
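The frequency formula above scales the core clock by the fraction of simulated cycles in which an operation was active. A worked example with made-up numbers:

```python
# Worked example of the frequency formula: an operation active in some
# fraction of simulated cycles runs at that fraction of the core clock (MHz).
GPU_FREQ = 1593                     # core clock in MHz (the parser's default)
sim_cycles = 100_000                # gpu_sim_cycle for the kernel
l1_read_cycles = [10, 12, 12, 40]   # cycles that saw at least one L1 load
l1_read_cycle_count = len(set(l1_read_cycles))  # unique cycles only

l1_load_frequency = GPU_FREQ * l1_read_cycle_count / sim_cycles
```

Here loads occur in 3 of 100,000 cycles, so the effective L1 load frequency is 1593 * 3 / 100000 ≈ 0.048 MHz.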

import_cache_states(cache_states)

Import cache state from the previous kernel to continue lifetime tracking.

Takes the state returned by parse_log_file from the previously processed kernel and initializes the current parser instance with it. This allows lifetimes spanning across kernel boundaries to be tracked correctly.

Parameters:

- cache_states (list): The list returned by parse_log_file containing incomplete lifetimes and most-recent-read dictionaries: [l1_lifetimes_incomplete, l2_lifetimes_incomplete, l1_most_recent_read, l2_most_recent_read]. Required.

Returns:

- None: Updates internal state (l1_lifetimes, l2_lifetimes, l1_most_recent_read, l2_most_recent_read, l1_current_lifetime_index, l2_current_lifetime_index).

Source code in python-scripts/accel_sim_parser.py
def import_cache_states(self, cache_states):
    """
    Import cache state from the previous kernel to continue lifetime tracking.

    Takes the state returned by `parse_log_file` from the previously processed
    kernel and initializes the current parser instance with it. This allows
    lifetimes spanning across kernel boundaries to be tracked correctly.

    Args:
        cache_states (list): The list returned by `parse_log_file` containing
                             incomplete lifetimes and most recent read dictionaries.
                             `[l1_lifetimes_incomplete, l2_lifetimes_incomplete,
                             l1_most_recent_read, l2_most_recent_read]`.

    Returns:
        None: Updates internal state (`l1_lifetimes`, `l2_lifetimes`,
              `l1_most_recent_read`, `l2_most_recent_read`,
              `l1_current_lifetime_index`, `l2_current_lifetime_index`).
    """
    # Unpack the cache states
    l1_lifetimes, l2_lifetimes, self.l1_most_recent_read, self.l2_most_recent_read = cache_states
    # Assign the lifetimes to the class variables
    self.l1_lifetimes = l1_lifetimes
    self.l2_lifetimes = l2_lifetimes
    # print length of the lifetimes
    print(
        f"Importing {len(self.l1_lifetimes)} L1 lifetimes and {len(self.l2_lifetimes)} L2 lifetimes.")
    # Rebuild the current lifetime index dictionaries
    # Find the last index for each address
    self.l1_current_lifetime_index = {}
    for i, lt in enumerate(self.l1_lifetimes):
        self.l1_current_lifetime_index[lt.address] = i
    self.l2_current_lifetime_index = {}
    for i, lt in enumerate(self.l2_lifetimes):
        self.l2_current_lifetime_index[lt.address] = i

parse_log_file()

Parse all log lines for the kernel and finalize lifetime calculations.

Iterates through self.log_lines, calling process_line for each. After processing all lines, it finalizes any remaining active lifetimes using the l1_most_recent_read and l2_most_recent_read dictionaries. Calculates lifetime durations in cycles and nanoseconds, storing them in l1_lifetime_cycles, l1_lifetime_ns, etc. Updates final counts for unique addresses and zero/incomplete lifetimes.

Returns:

- list: The state needed to continue parsing for the next kernel: [l1_lifetimes_incomplete, l2_lifetimes_incomplete, l1_most_recent_read, l2_most_recent_read]. The incomplete lists hold LifetimeType objects whose end cycle is still None.

Source code in python-scripts/accel_sim_parser.py
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
def parse_log_file(self) -> list:
    """
    Parse all log lines for the kernel and finalize lifetime calculations.

    Iterates through `self.log_lines`, calling `process_line` for each.
    After processing all lines, it finalizes any remaining active lifetimes
    using the `l1_most_recent_read` and `l2_most_recent_read` dictionaries.
    Calculates lifetime durations in cycles and nanoseconds, storing them in
    `l1_lifetime_cycles`, `l1_lifetime_ns`, etc. Updates final counts for
    unique addresses and zero/incomplete lifetimes.

    Returns:
        list: A list containing the state needed to continue parsing for the
              next kernel: `[l1_lifetimes_incomplete, l2_lifetimes_incomplete,
              l1_most_recent_read, l2_most_recent_read]`.
              `l1_lifetimes_incomplete` and `l2_lifetimes_incomplete` are lists
              of `LifetimeType` objects whose `end` cycle is still None.
    """

    print(len(self.l1_lifetimes), len(self.l2_lifetimes))

    line_count = 0
    for line in self.log_lines:
        if line_count % 10000 == 0:
            print(
                f"\tProcessing line {line_count} of {len(self.log_lines)}")
        self.process_line(line)
        line_count += 1

    # Finalize L1 lifetime entries
    valid_l1 = []
    self.l1_zero_count = 0
    for index, lt in enumerate(self.l1_lifetimes):
        if lt.end is not None:
            if lt.end > lt.start:
                valid_l1.append(lt)
            elif lt.end == lt.start:
                self.l1_zero_count += 1
        elif self.l1_most_recent_read.get(lt.address) is not None:
            lt.end = self.l1_most_recent_read[lt.address]
            if lt.end > lt.start:
                valid_l1.append(lt)
            elif lt.end == lt.start:
                self.l1_zero_count += 1

    self.l1_lifetime_cycles = np.array(
        [lt.calculate_lifetime() for lt in valid_l1])
    # Convert to nanoseconds
    self.l1_lifetime_ns = self.CYCLE_TIME * self.l1_lifetime_cycles

    # Finalize L2 lifetime entries
    valid_l2 = []
    self.l2_zero_count = 0
    for lt in self.l2_lifetimes:
        if lt.end is not None:
            if lt.end > lt.start:
                valid_l2.append(lt)
            elif lt.end == lt.start:
                self.l2_zero_count += 1
        elif self.l2_most_recent_read.get(lt.address) is not None:
            lt.end = self.l2_most_recent_read[lt.address]
            if lt.end > lt.start:
                valid_l2.append(lt)
            elif lt.end == lt.start:
                self.l2_zero_count += 1

    self.l2_lifetime_cycles = np.array(
        [lt.calculate_lifetime() for lt in valid_l2])
    # Convert to nanoseconds
    self.l2_lifetime_ns = self.CYCLE_TIME * self.l2_lifetime_cycles

    # Update read/write counters and unique addresses
    self.l1_read_cycle_count = len(self.l1_read_cycles)
    self.l1_write_cycle_count = len(self.l1_write_cycles)
    self.l2_read_cycle_count = len(self.l2_read_cycles)
    self.l2_write_cycle_count = len(self.l2_write_cycles)
    self.l1_unique_addrs = len({lt.address for lt in self.l1_lifetimes})
    self.l2_unique_addrs = len({lt.address for lt in self.l2_lifetimes})

    l1_lifetimes_export = [
        lt for lt in self.l1_lifetimes if lt.end is None]
    l2_lifetimes_export = [
        lt for lt in self.l2_lifetimes if lt.end is None]
    self.l1_zero_count += len(l1_lifetimes_export)
    self.l2_zero_count += len(l2_lifetimes_export)
    self.l1_lifetimes = valid_l1
    self.l2_lifetimes = valid_l2
    self.l1_lifetimes_count = len(self.l1_lifetimes)
    self.l2_lifetimes_count = len(self.l2_lifetimes)

    print(
        f"Kernel {self.kernel_name} (ID: {self.kernel_id}): Processed {len(self.log_lines)} lines."
    )
    print(
        f"Built L1 Lifetime DataFrame with {len(self.l1_lifetime_cycles)} entries."
    )
    print(
        f"Built L2 Lifetime DataFrame with {len(self.l2_lifetime_cycles)} entries."
    )
    print(
        f"Exporting {len(l1_lifetimes_export)} L1 lifetimes and {len(l2_lifetimes_export)} L2 lifetimes.")

    return [
        l1_lifetimes_export,
        l2_lifetimes_export,
        self.l1_most_recent_read,
        self.l2_most_recent_read
    ]
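The returned state list exists so a driver can hand still-open lifetimes to the next kernel's parser via `import_cache_states`. The toy class below is a self-contained stand-in that demonstrates only this hand-off pattern; it is not the real `SimulationParser`, and its event tuples are invented for illustration.

```python
# Minimal stand-in for the cross-kernel state hand-off: each "parser"
# closes what it can and exports still-open lifetimes to the next one.
class TinyParser:
    def __init__(self):
        self.open = {}        # address -> start cycle
        self.completed = []   # (address, start, end)

    def import_cache_states(self, state):
        # Adopt lifetimes left open by the previous kernel
        self.open = dict(state)

    def parse(self, events):
        # events: (op, address, cycle); "start" opens a lifetime, "end" closes it
        for op, addr, cycle in events:
            if op == "start":
                self.open[addr] = cycle
            elif op == "end" and addr in self.open:
                self.completed.append((addr, self.open.pop(addr), cycle))
        return self.open  # incomplete lifetimes carry over

k1 = TinyParser()
carry = k1.parse([("start", 0x10, 5), ("start", 0x20, 7), ("end", 0x10, 9)])
k2 = TinyParser()
k2.import_cache_states(carry)   # 0x20 is still open from kernel 1
k2.parse([("end", 0x20, 14)])   # its lifetime spans the kernel boundary
```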

process_cycle(line)

Extract the simulation cycle number from a log line.

Parameters:

- line (str): A log line containing 'Cycle <num>'. Required.

Returns:

- int: The extracted cycle number.

Raises:

- ValueError: If the cycle number cannot be parsed as an integer.
- IndexError: If the line format is unexpected.

Source code in python-scripts/accel_sim_parser.py
def process_cycle(self, line: str) -> int:
    """
    Extract the simulation cycle number from a log line.

    Args:
        line (str): A log line containing 'Cycle <num>'.

    Returns:
        int: The extracted cycle number.

    Raises:
        ValueError: If the cycle number cannot be parsed as an integer.
        IndexError: If the line format is unexpected.
    """
    cycle_str = line.split("Cycle ")[1].split()[0]
    cycle = int(cycle_str.strip(":"))
    return cycle
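Both `process_cycle` and `process_line` extract fields by plain string slicing on the simulator's log format. The sample line below mirrors the L1D format quoted in `process_line`'s comments:

```python
# Field extraction on a sample GPGPU-Sim cache log line, using the same
# split/strip slicing as process_cycle and process_line.
line = ("GPGPU-Sim Cycle 11097: Load instr from L1D cache "
        "at SM 0 bank 3 addr 2d61c4e0 status 2")
cycle = int(line.split("Cycle ")[1].split()[0].strip(":"))   # drop trailing ':'
address = int(line.split("addr ")[1].split()[0], 16)         # hex address
status = int(line.split("status ")[1].split()[0])            # hit/miss code
```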

process_line(line)

Process a single cache log line to update lifetime tracking.

Parses the line to identify L1 or L2 cache accesses, address, cycle, status (hit/miss), and operation type (load/store). Updates the internal lifetime tracking structures (l1_lifetimes, l2_lifetimes, l1_current_lifetime_index, l2_current_lifetime_index, l1_most_recent_read, l2_most_recent_read) based on the access and configured cache policies. Also updates read/write counters.

Parameters:

- line (str): The cache log line to process. Required.

Returns:

- None: Modifies internal state.

Source code in python-scripts/accel_sim_parser.py
def process_line(self, line: str):
    """
    Process a single cache log line to update lifetime tracking.

    Parses the line to identify L1 or L2 cache accesses, address, cycle,
    status (hit/miss), and operation type (load/store). Updates the internal
    lifetime tracking structures (`l1_lifetimes`, `l2_lifetimes`,
    `l1_current_lifetime_index`, `l2_current_lifetime_index`,
    `l1_most_recent_read`, `l2_most_recent_read`) based on the access and
    configured cache policies. Also updates read/write counters.

    Args:
        line (str): The cache log line to process.

    Returns:
        None: Modifies internal state.
    """
    # Process L1 cache lines
    # GPGPU-Sim Cycle 11097: Load instr from L1D cache at SM 0 bank 3 addr 2d61c4e0 status 2
    if "L1D cache at SM" in line:
        # Get the number after "status" to determine if it's a hit or miss
        status = int(line.split("status ")[1].split()[0])
        if status >= 3:
            return
        # Get the cycle number
        cycle = int(line.split("Cycle ")[1].split()[0].strip(":"))
        # Get the address after "addr" and convert to decimal int
        address = int(line.split("addr ")[1].split()[0], 16)

        # look up the index in the three lifetime lists
        index = self.l1_current_lifetime_index.get(address, None)

        l1_write = "Store" in line
        l1_read = "Load" in line

        # End the lifetime of the cache line
        if (status == 2 or l1_write) and index is not None:
            start = self.l1_lifetimes[index].start
            # If an active lifetime exists, end it using the latest read
            last_read = self.l1_most_recent_read.get(address, start)
            self.l1_lifetimes[index].end = last_read if last_read >= start else None

        # Decide if we need to create a new lifetime entry:
        # - On a miss (status == 2):
        #   * If an active entry exists, it was just ended above, so always
        #     start a new one.
        #   * If no entry exists, start one on a load, or on a store only
        #     when the policy is WRITE_ALLOCATE.
        # - On a store hit to an existing entry, start a new lifetime.
        new_entry = (
            # Cache miss
            (status == 2 and
             # An active entry was just ended above; replace it
             (index is not None or
              # No active entry: allocate on a load, or on a store under
              # the WRITE_ALLOCATE policy
              (self.l1_write_allocation == WriteAllocation.WRITE_ALLOCATE or l1_read)))
            # Store hit to an existing entry
            or (l1_write and index is not None))

        # If a new lifetime entry is needed, create it.
        if new_entry:
            new_lt = LifetimeType(address, cycle, None)
            self.l1_lifetimes.append(new_lt)
            self.l1_current_lifetime_index[address] = len(
                self.l1_lifetimes) - 1

        # Process the instruction type
        if l1_write:
            self.l1_write_count += 1
            if cycle not in self.l1_write_cycles:
                self.l1_write_cycles.append(cycle)
        elif l1_read:
            self.l1_read_count += 1
            # store the most recent read if this is a hit
            if status == 0:
                self.l1_most_recent_read[address] = cycle
            if cycle not in self.l1_read_cycles:
                self.l1_read_cycles.append(cycle)
        else:
            print(f"Warning: Unknown L1D instruction type: {line}")
            return

    # Process L2 cache lines
    elif "L2 Address" in line:
        # GPGPU-Sim Cycle 12969: MEMORY_SUBPARTITION_UNIT -  2 - Store Request to L2 Address=71082d62bb60, status=0
        status = int(line.split("status=")[1].split()[0])
        if status >= 3:
            return
        cycle = int(line.split("Cycle ")[1].split()[0].strip(":"))
        address = int(line.split("Address=")[1].split(",")[0], 16)
        # Get the bank after MEMORY_SUBPARTITION_UNIT
        bank = int(line.split("MEMORY_SUBPARTITION_UNIT - ")
                   [1].split(" - ")[0].strip())
        # if bank != 0:
        #     return

        # Look up the active lifetime entry (if any) for this address
        index = self.l2_current_lifetime_index.get(address, None)

        l2_write = "Store" in line
        l2_read = "Load" in line

        # End lifetime if needed
        if (status == 2 or l2_write) and index is not None:
            start = self.l2_lifetimes[index].start
            # If an active lifetime exists, end it using the latest read
            last_read = self.l2_most_recent_read.get(address, start)
            self.l2_lifetimes[index].end = last_read if last_read >= start else None
        # Decide whether a new lifetime entry is needed:
        # - On a miss (status == 2):
        #   * If an active entry exists, always start a new one.
        #   * If no active entry exists, start one on a load, or on a store
        #     only under the WRITE_ALLOCATE policy.
        # - On a store hit to an active entry: start a new one.
        new_entry = (
            # Cache miss
            (status == 2 and
             # an active lifetime already exists...
             (index is not None or
              # ...or start a fresh one on a load, or on a store under WRITE_ALLOCATE
              (self.l2_write_allocation == WriteAllocation.WRITE_ALLOCATE or l2_read)))
            # Store hit to an existing entry
            or (l2_write and index is not None))
        # If a new lifetime entry is needed, create it.
        if new_entry:
            new_lt = LifetimeType(address, cycle, None)
            self.l2_lifetimes.append(new_lt)
            self.l2_current_lifetime_index[address] = len(
                self.l2_lifetimes) - 1

        # Process the instruction type
        if l2_write:
            self.l2_write_count += 1
            if cycle not in self.l2_write_cycles:
                self.l2_write_cycles.append(cycle)
        elif l2_read:
            self.l2_read_count += 1
            if status == 0:
                self.l2_most_recent_read[address] = cycle
            if cycle not in self.l2_read_cycles:
                self.l2_read_cycles.append(cycle)
        else:
            print(f"Warning: Unknown L2 instruction type: {line}")
            return
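
The string-splitting used above to pull the status, cycle, and address out of a cache log line can be exercised in isolation. The sample line below is taken from the format comment in the source; treat it as an illustrative log line rather than real simulator output:

```python
# Field extraction for an L1D cache log line, mirroring the parsing above.
# (status == 2 is treated as a miss, status == 0 as a hit in the parser.)
line = ("GPGPU-Sim Cycle 11097: Load instr from L1D cache at SM 0 "
        "bank 3 addr 2d61c4e0 status 2")

# Status code after "status "
status = int(line.split("status ")[1].split()[0])
# Cycle number after "Cycle ", with the trailing colon stripped
cycle = int(line.split("Cycle ")[1].split()[0].strip(":"))
# Hexadecimal address after "addr ", converted to an int
address = int(line.split("addr ")[1].split()[0], 16)
```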

WriteAllocation

Bases: Enum

Cache write allocation policies.

Source code in python-scripts/accel_sim_parser.py
class WriteAllocation(enum.Enum):
    """Cache write allocation policies."""
    WRITE_ALLOCATE = 1
    NO_WRITE_ALLOCATE = 2

WritePolicy

Bases: Enum

Cache write policies.

Source code in python-scripts/accel_sim_parser.py
class WritePolicy(enum.Enum):
    """Cache write policies."""
    WRITE_BACK = 1
    WRITE_THROUGH = 2

convert_size(size_str)

Convert a size string (e.g., "128KB", "50MB") to bytes.

Parses strings with suffixes KB, MB, GB (case-insensitive) and returns the equivalent size in bytes.

Parameters:

- size_str (str, required): The size string to convert (e.g., "256KB", "1GB").

Returns:

- int: The size in bytes.

Raises:

- ValueError: If the string format or suffix is invalid.

Source code in python-scripts/accel_sim_parser.py
def convert_size(size_str: str) -> int:
    """
    Convert a size string (e.g., "128KB", "50MB") to bytes.

    Parses strings with suffixes KB, MB, GB (case-insensitive) and returns
    the equivalent size in bytes.

    Args:
        size_str (str): The size string to convert (e.g., "256KB", "1GB").

    Returns:
        int: The size in bytes.

    Raises:
        ValueError: If the string format or suffix is invalid.
    """
    size_str = size_str.upper()  # Ensure case-insensitivity
    if "GB" in size_str:
        return int(size_str.split("GB")[0]) * 1024 * 1024 * 1024
    elif "MB" in size_str:
        return int(size_str.split("MB")[0]) * 1024 * 1024
    elif "KB" in size_str:
        return int(size_str.split("KB")[0]) * 1024
    elif "B" in size_str:
        return int(size_str.split("B")[0])
    else:
        raise ValueError(f"Invalid size string: {size_str}")
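
A quick sanity check of convert_size; the function body is reproduced verbatim so the snippet is self-contained:

```python
def convert_size(size_str: str) -> int:
    """Convert a size string such as '128KB' or '1GB' to bytes."""
    size_str = size_str.upper()  # ensure case-insensitivity
    if "GB" in size_str:
        return int(size_str.split("GB")[0]) * 1024 * 1024 * 1024
    elif "MB" in size_str:
        return int(size_str.split("MB")[0]) * 1024 * 1024
    elif "KB" in size_str:
        return int(size_str.split("KB")[0]) * 1024
    elif "B" in size_str:
        return int(size_str.split("B")[0])
    else:
        raise ValueError(f"Invalid size string: {size_str}")

assert convert_size("128KB") == 131072   # 128 * 1024
assert convert_size("2mb") == 2097152    # suffixes are case-insensitive
assert convert_size("512B") == 512
```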

find_and_append(kernels, current_index, key, line, dtype=int)

Find a key-value pair in a log line and append it to the current kernel's data.

Searches for '&lt;key&gt; = &lt;value&gt;' in the line. If found, converts the value to the specified dtype and adds it to the dictionary at kernels[current_index] with the given key. Conversion errors are caught and the value is stored as None.

Parameters:

- kernels (list, required): A list of dictionaries, where each dictionary holds data for a kernel.
- current_index (int, required): The index in the kernels list corresponding to the current kernel.
- key (str, required): The key string to search for in the line (e.g., "gpu_sim_cycle").
- line (str, required): The log line to parse.
- dtype (type, default int): The data type to convert the found value to (e.g., int, float).

Returns:

- None: Modifies the kernels list in place.

Source code in python-scripts/accel_sim_parser.py
def find_and_append(kernels: list, current_index: int, key: str, line: str, dtype=int):
    """
    Find a key-value pair in a log line and append it to the current kernel's data.

    Searches for '<key> = <value>' in the `line`. If found, converts the value
    to the specified `dtype` and adds it to the dictionary at `kernels[current_index]`
    with the given `key`. Handles potential errors during conversion.

    Args:
        kernels (list): A list of dictionaries, where each dictionary holds data for a kernel.
        current_index (int): The index in the `kernels` list corresponding to the current kernel.
        key (str): The key string to search for in the line (e.g., "gpu_sim_cycle").
        line (str): The log line to parse.
        dtype (type): The data type to convert the found value to (e.g., int, float). Defaults to int.

    Returns:
        None: Modifies the `kernels` list in place.
    """
    if key in line:
        try:
            num = dtype(line.split("=")[1].strip())
            kernels[current_index][key] = num
        except Exception as e:
            print(f"Error parsing {key} in line: {line}")
            print(e)
            kernels[current_index][key] = None
        finally:
            return
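
Usage of find_and_append on a synthetic log line. The function is reproduced (slightly condensed, without the early-return) so the snippet runs standalone; the sample lines are hypothetical:

```python
def find_and_append(kernels: list, current_index: int, key: str, line: str, dtype=int):
    """Store '<key> = <value>' from a log line into kernels[current_index]."""
    if key in line:
        try:
            num = dtype(line.split("=")[1].strip())
            kernels[current_index][key] = num
        except Exception as e:
            print(f"Error parsing {key} in line: {line}")
            print(e)
            kernels[current_index][key] = None

kernels = [{}]
find_and_append(kernels, 0, "gpu_sim_cycle", "gpu_sim_cycle = 12345", int)
find_and_append(kernels, 0, "gpu_ipc", "gpu_ipc = 1.25", float)
```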

get_cache_policy(line)

Extract cache write policy and allocation policy from a GPGPU-Sim config line.

Parses a GPGPU-Sim configuration line (e.g., starting with '-gpgpu_cache:dl1') to determine the cache's write policy (Write-Back/Write-Through) and write allocation policy (Write-Allocate/No-Write-Allocate).

Parameters:

- line (str, required): A configuration line string (e.g., "-gpgpu_cache:dl1 S:4:128:256,L:T:m:L:L,A:384:48,16:0,32").

Returns:

- dict: A dictionary containing:
    - 'write_policy' (WritePolicy): The parsed write policy enum.
    - 'write_allocation' (WriteAllocation): The parsed write allocation enum.

Source code in python-scripts/accel_sim_parser.py
def get_cache_policy(line: str) -> dict:
    """
    Extract cache write policy and allocation policy from a GPGPU-Sim config line.

    Parses a GPGPU-Sim configuration line (e.g., starting with '-gpgpu_cache:dl1')
    to determine the cache's write policy (Write-Back/Write-Through) and
    write allocation policy (Write-Allocate/No-Write-Allocate).

    Args:
        line (str): A configuration line string (e.g., "-gpgpu_cache:dl1 S:4:128:256,L:T:m:L:L,A:384:48,16:0,32").

    Returns:
        dict: A dictionary containing:
            - 'write_policy' (WritePolicy): The parsed write policy enum.
            - 'write_allocation' (WriteAllocation): The parsed write allocation enum.
    """
    line = line.split(" ")[1]
    # split by colon
    parts = line.split(":")
    # get the parts corresponding to wr and wr_alloc
    wr = parts[4]
    wr_alloc = parts[6]
    print(wr, wr_alloc)  # debug output of the raw policy fields
    if wr == "T":
        write_policy = WritePolicy.WRITE_THROUGH
    else:
        write_policy = WritePolicy.WRITE_BACK
    if wr_alloc == "N":
        write_alloc = WriteAllocation.NO_WRITE_ALLOCATE
    else:
        write_alloc = WriteAllocation.WRITE_ALLOCATE
    return {
        "write_policy": write_policy,
        "write_allocation": write_alloc
    }
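
A condensed, behavior-equivalent sketch of get_cache_policy, run on the example config line from the docstring (enums re-declared so the snippet is self-contained):

```python
import enum

class WritePolicy(enum.Enum):
    WRITE_BACK = 1
    WRITE_THROUGH = 2

class WriteAllocation(enum.Enum):
    WRITE_ALLOCATE = 1
    NO_WRITE_ALLOCATE = 2

def get_cache_policy(line: str) -> dict:
    """Extract the write policy (parts[4]) and write allocation
    policy (parts[6]) from a '-gpgpu_cache:...' config line."""
    parts = line.split(" ")[1].split(":")
    write_policy = (WritePolicy.WRITE_THROUGH if parts[4] == "T"
                    else WritePolicy.WRITE_BACK)
    write_alloc = (WriteAllocation.NO_WRITE_ALLOCATE if parts[6] == "N"
                   else WriteAllocation.WRITE_ALLOCATE)
    return {"write_policy": write_policy, "write_allocation": write_alloc}

cfg = get_cache_policy("-gpgpu_cache:dl1 S:4:128:256,L:T:m:L:L,A:384:48,16:0,32")
# 'T' -> write-through; 'L' (not 'N') -> write-allocate
```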

read_cache_log(log_file_path, basename=None)

Read Accel-Sim cache and simulation logs, grouping lines by kernel.

Parses the .sim_cache.log and .sim.log files. It identifies kernel launches and groups subsequent log lines belonging to each kernel. It also extracts kernel metadata like name, ID, simulation cycles, instructions, IPC, and detected cache configuration changes from the log lines.

Parameters:

- log_file_path (str, required): The directory containing the log files.
- basename (str, default None): The base name of the log files (e.g., "program_timestamp").

Returns:

- list[dict]: A list of dictionaries. Each dictionary represents a kernel and contains:
    - 'kernel_name' (str): The name of the kernel.
    - 'kernel_id' (int): The ID of the kernel.
    - 'lines' (list[str]): Log lines associated with this kernel from .sim_cache.log.
    - 'gpu_sim_cycle' (Optional[int]): Simulation cycles for this kernel.
    - 'gpu_tot_sim_cycle' (Optional[int]): Total simulation cycles up to this kernel.
    - 'gpu_sim_insn' (Optional[int]): Instructions executed by this kernel.
    - 'gpu_tot_sim_insn' (Optional[int]): Total instructions executed up to this kernel.
    - 'gpu_ipc' (Optional[float]): Instructions per cycle for this kernel.
    - 'l1_config' (Optional[int]): Detected L1 cache size in bytes (if reconfigured).

Source code in python-scripts/accel_sim_parser.py
def read_cache_log(log_file_path: str, basename: str = None):
    """
    Read Accel-Sim cache and simulation logs, grouping lines by kernel.

    Parses the `.sim_cache.log` and `.sim.log` files. It identifies kernel
    launches and groups subsequent log lines belonging to each kernel. It also
    extracts kernel metadata like name, ID, simulation cycles, instructions, IPC,
    and detected cache configuration changes from the log lines.

    Args:
        log_file_path (str): The directory containing the log files.
        basename (str): The base name of the log files (e.g., "program_timestamp").

    Returns:
        list[dict]: A list of dictionaries. Each dictionary represents a kernel and contains:
            - 'kernel_name' (str): The name of the kernel.
            - 'kernel_id' (int): The ID of the kernel.
            - 'lines' (list[str]): A list of log lines associated with this kernel from `.sim_cache.log`.
            - 'gpu_sim_cycle' (Optional[int]): Simulation cycles for this kernel.
            - 'gpu_tot_sim_cycle' (Optional[int]): Total simulation cycles up to this kernel.
            - 'gpu_sim_insn' (Optional[int]): Instructions executed by this kernel.
            - 'gpu_tot_sim_insn' (Optional[int]): Total instructions executed up to this kernel.
            - 'gpu_ipc' (Optional[float]): Instructions per cycle for this kernel.
            - 'l1_config' (Optional[int]): Detected L1 cache size in bytes (if reconfigured).
    """

    # Read cache log file
    log_file = os.path.join(log_file_path, basename + ".sim_cache.log")
    print(f"Reading log file: {log_file}")
    with open(log_file, 'r') as f:
        lines = f.readlines()
    kernels = []
    current_index = -1
    current_kernel_name = ""
    current_kernel_id = 0
    for line in lines:
        if "-kernel name = " in line:
            current_kernel_name = line.split("-kernel name = ")[1].strip()
        if "-kernel id = " in line:
            current_kernel_id = int(line.split("-kernel id = ")[1].strip())
        if "launching kernel name" in line:
            current_index += 1
            print(
                f"Found kernel: {current_kernel_name} (ID: {current_kernel_id})")
            kernels.append({
                "kernel_name": current_kernel_name,
                "kernel_id": current_kernel_id,
                "lines": []
            })
        else:
            if current_index >= 0:
                kernels[current_index]["lines"].append(line)

    # Read report log file
    report_file = os.path.join(log_file_path, basename + ".sim.log")
    print(f"Reading log file: {report_file}")
    with open(report_file, 'r') as f:
        lines = f.readlines()

    current_index = -1
    for line in lines:
        if "launching kernel name" in line:
            m = re.search(r"launching kernel name: (\S+) uid: (\d+)", line)
            if m:
                current_index += 1

        find_and_append(kernels, current_index, "gpu_sim_cycle", line, int)
        find_and_append(kernels, current_index, "gpu_tot_sim_cycle", line, int)
        find_and_append(kernels, current_index, "gpu_sim_insn", line, int)
        find_and_append(kernels, current_index, "gpu_tot_sim_insn", line, int)
        find_and_append(kernels, current_index, "gpu_ipc", line, float)

        if "GPGPU-Sim: Reconfigure L1 cache to" in line:
            m = re.search(r"Reconfigure L1 cache to (\S+)", line)
            if m:
                kernels[current_index]["l1_config"] = convert_size(m.group(1))

    return kernels
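
The kernel-grouping loop at the heart of read_cache_log can be exercised on a few hypothetical log lines (the kernel name and cycle values below are made up for illustration):

```python
# Grouping sketch mirroring the first loop of read_cache_log.
lines = [
    "-kernel name = vecAdd",
    "-kernel id = 1",
    "launching kernel name: vecAdd uid: 1",
    "GPGPU-Sim Cycle 10: Load instr from L1D cache at SM 0 ...",
]
kernels, current_index = [], -1
current_kernel_name, current_kernel_id = "", 0
for line in lines:
    if "-kernel name = " in line:
        current_kernel_name = line.split("-kernel name = ")[1].strip()
    if "-kernel id = " in line:
        current_kernel_id = int(line.split("-kernel id = ")[1].strip())
    if "launching kernel name" in line:
        # A launch line opens a new kernel group
        current_index += 1
        kernels.append({"kernel_name": current_kernel_name,
                        "kernel_id": current_kernel_id,
                        "lines": []})
    else:
        # All other lines attach to the most recently launched kernel
        if current_index >= 0:
            kernels[current_index]["lines"].append(line)
```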

read_config_file(config_file_path=None)

Read L1 and L2 cache policies from a GPGPU-Sim configuration file.

Parses the specified configuration file (or the default configs/gpgpusim.config) to find lines defining the L1 data cache (-gpgpu_cache:dl1) and L2 data cache (-gpgpu_cache:dl2) and extracts their write and allocation policies using get_cache_policy.

Parameters:

- config_file_path (str, optional): Path to the GPGPU-Sim config file. Defaults to configs/gpgpusim.config relative to this script's directory.

Returns:

- Tuple[dict, dict]: A tuple of two dictionaries:
    - l1_config: 'write_policy' and 'write_allocation' for L1.
    - l2_config: 'write_policy' and 'write_allocation' for L2.
    Empty dictionaries are returned if the config lines are not found.

Raises:

- FileNotFoundError: If the specified or default config file does not exist.

Source code in python-scripts/accel_sim_parser.py
def read_config_file(config_file_path: str = None):
    """
    Read L1 and L2 cache policies from a GPGPU-Sim configuration file.

    Parses the specified configuration file (or the default `configs/gpgpusim.config`)
    to find lines defining the L1 data cache (`-gpgpu_cache:dl1`) and L2 data cache
    (`-gpgpu_cache:dl2`) and extracts their write and allocation policies using
    `get_cache_policy`.

    Args:
        config_file_path (str, optional): Path to the GPGPU-Sim config file.
                                          Defaults to `configs/gpgpusim.config` relative
                                          to this script's directory.

    Returns:
        Tuple[dict, dict]: A tuple containing two dictionaries:
            - l1_config: Dictionary with 'write_policy' and 'write_allocation' for L1.
            - l2_config: Dictionary with 'write_policy' and 'write_allocation' for L2.
                         Returns empty dictionaries if config lines are not found.

    Raises:
        FileNotFoundError: If the specified or default config file does not exist.
    """
    if config_file_path is None:
        config_file_path = os.path.join(
            os.path.dirname(__file__), "configs", "gpgpusim.config")
    print(f"Reading config file: {config_file_path}")
    with open(config_file_path, 'r') as f:
        lines = f.readlines()
    # Initialize to empty dicts so the documented fallback (empty dicts when
    # the config lines are missing) holds instead of raising a NameError.
    l1_config = {}
    l2_config = {}
    for line in lines:
        if line.startswith("-gpgpu_cache:dl1 "):
            l1_config = get_cache_policy(line)
            print(f"L1 config: {l1_config}")
        if line.startswith("-gpgpu_cache:dl2 "):
            l2_config = get_cache_policy(line)
            print(f"L2 config: {l2_config}")
    return l1_config, l2_config

read_function_names(log_file_path, basename)

Read kernel ID to kernel name mapping from the kernels.csv file.

Parses the kernels.csv file (typically generated by PKS or tracing) to create a dictionary mapping kernel IDs (as integers) to their corresponding names (as strings).

Parameters:

- log_file_path (str, required): The directory containing the traces subdirectory.
- basename (str, required): The base name of the log files (unused in this function, but kept for consistency).

Returns:

- dict[int, str]: A dictionary mapping kernel IDs to kernel names. Returns an empty dictionary if kernels.csv is not found.

Source code in python-scripts/accel_sim_parser.py
def read_function_names(log_file_path: str, basename: str):
    """
    Read kernel ID to kernel name mapping from the `kernels.csv` file.

    Parses the `kernels.csv` file (typically generated by PKS or tracing)
    to create a dictionary mapping kernel IDs (as integers) to their
    corresponding names (as strings).

    Args:
        log_file_path (str): The directory containing the `traces` subdirectory.
        basename (str): The base name of the log files (unused in this function,
                        but kept for consistency).

    Returns:
        dict[int, str]: A dictionary mapping kernel IDs to kernel names. Returns
                        an empty dictionary if `kernels.csv` is not found.
    """
    kernels_file = os.path.join(log_file_path, "traces", "kernels.csv")
    if not os.path.exists(kernels_file):
        print(f"File {kernels_file} does not exist.")
        return {}
    print(f"Reading kernel names from {kernels_file}")
    # Read CSV file as pandas DataFrame
    df = pd.read_csv(kernels_file)
    # Construct a dictionary with "Kernel ID" as key and "Kernel Name" as value
    kernel_names = {}
    for index, row in df.iterrows():
        kernel_id = row["Kernel ID"]
        kernel_name = row["Kernel Name"]
        kernel_names[kernel_id] = kernel_name
    return kernel_names
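
The ID-to-name mapping built from kernels.csv can be reproduced with the standard-library csv module (the script itself uses pandas; the "Kernel ID" and "Kernel Name" column headers match those the parser expects, and the rows below are hypothetical):

```python
import csv
import io

# Hypothetical kernels.csv content with the expected column headers.
csv_text = "Kernel ID,Kernel Name\n1,vecAdd\n2,matMul\n"

kernel_names = {}
for row in csv.DictReader(io.StringIO(csv_text)):
    # pandas infers an int dtype for "Kernel ID"; convert explicitly here
    kernel_names[int(row["Kernel ID"])] = row["Kernel Name"]
```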

run_parser(log_file_path, log_file_base, config_file_path)

Main function to run the simulation log parsing process for all kernels.

Reads the grouped kernel data using read_cache_log, iterates through each kernel, creates a SimulationParser instance, imports cache state from the previous kernel (if any), parses the current kernel's logs, and collects both fine-grained (per-lifetime) and coarse-grained (per-kernel) statistics. Finally, saves the aggregated statistics to CSV files with pandas.

Parameters:

- log_file_path (str, required): Path to the directory containing the log files.
- log_file_base (str, required): Base name of the log files (e.g., "program_timestamp").
- config_file_path (str, required): Path to the GPGPU-Sim configuration file used for simulation.

Returns:

- None: Generates CSV output files:
    - <log_file_base>.sim.csv (coarse-grained kernel stats)
    - <log_file_base>.sim_l1.csv (fine-grained L1 lifetimes)
    - <log_file_base>.sim_l2.csv (fine-grained L2 lifetimes)

Raises:

- SystemExit: If no kernels are found in the log file.

Source code in python-scripts/accel_sim_parser.py
def run_parser(log_file_path: str, log_file_base: str, config_file_path: str):
    """
    Main function to run the simulation log parsing process for all kernels.

    Reads the grouped kernel data using `read_cache_log`, iterates through each
    kernel, creates a `SimulationParser` instance, imports state from the previous
    kernel (if any), parses the current kernel's logs, and collects both
    fine-grained (per-lifetime) and coarse-grained (per-kernel) statistics.
    Finally, saves the aggregated statistics to CSV files with pandas.

    Args:
        log_file_path (str): Path to the directory containing the log files.
        log_file_base (str): Base name of the log files (e.g., "program_timestamp").
        config_file_path (str): Path to the GPGPU-Sim configuration file used for simulation.

    Returns:
        None: Generates CSV output files:
              - `<log_file_base>.sim.csv` (coarse-grained kernel stats)
              - `<log_file_base>.sim_l1.csv` (fine-grained L1 lifetimes)
              - `<log_file_base>.sim_l2.csv` (fine-grained L2 lifetimes)

    Raises:
        SystemExit: If no kernels are found in the log file.
    """
    groups = read_cache_log(log_file_path, log_file_base)
    kernel_names = read_function_names(log_file_path, log_file_base)
    overall_load = 0
    overall_store = 0
    kernels = []

    print(f"Found {len(groups)} kernels in log file.")
    if len(groups) == 0:
        print("No kernels found in log file. Exiting.")
        sys.exit(1)

    coarse_grain_stats = []
    l1_fine_grain_df = pd.DataFrame()
    l2_fine_grain_df = pd.DataFrame()

    cache_state = None
    print(config_file_path)

    for kernel in groups:
        kernel_id = kernel["kernel_id"]
        if kernel_id == 991:
            print("Skipping kernel ID 991")
            continue
        kernel_name = kernel_names.get(kernel_id, kernel["kernel_name"])
        print(
            f"Processing kernel {kernel_name} (ID: {kernel_id}) with {len(kernel['lines'])} lines.")
        try:
            parser_instance = SimulationParser(
                kernel, log_file_path, log_file_base, kernel_name, config_file_path)
        except Exception as e:
            print(f"Error parsing kernel {kernel_name} (ID: {kernel_id}): {e}")
            continue
        if cache_state is not None:
            parser_instance.import_cache_states(cache_state)
        cache_state = parser_instance.parse_log_file()
        l1_df, l2_df = parser_instance.fine_grained_df()
        # Concatenate the fine-grained dataframes
        l1_fine_grain_df = pd.concat([l1_fine_grain_df, l1_df])
        l2_fine_grain_df = pd.concat([l2_fine_grain_df, l2_df])
        coarse_grain_stats.append(parser_instance.coarse_grained_dict())

    coarse_grain_df = pd.DataFrame(coarse_grain_stats)
    coarse_csv = os.path.join(log_file_path, f"{log_file_base}.sim.csv")

    # An earlier Dask-based export (dd.from_pandas + single-file to_csv) was
    # replaced by a direct pandas write.
    coarse_grain_df.to_csv(coarse_csv, index=False)
    print(f"Coarse-grained CSV file for all kernels saved to: {log_file_path}")
    l1_csv = os.path.join(log_file_path, f"{log_file_base}.sim_l1.csv")
    l2_csv = os.path.join(log_file_path, f"{log_file_base}.sim_l2.csv")

    # The fine-grained lifetime tables are likewise written directly with
    # pandas (the Dask-based export was dropped).
    l1_fine_grain_df.to_csv(l1_csv, index=False)
    l2_fine_grain_df.to_csv(l2_csv, index=False)
    print(f"Fine-grained CSV files for all kernels saved to: {log_file_path}")
    print("All kernels processed successfully.")