Data Formats for Gainsight Profiler

This documentation describes the data formats used by the Gainsight Profiler, including the backend memory access logs, processed lifetime statistics, and frontend reports.

When integrating a new hardware backend, make sure that it generates logs in the format specified in the first section. Once the correct logs are available, the profiler frontend can produce the lifetime statistics and frontend reports automatically.

Backend Memory Access Logs

It is up to individual backends to produce the correctly formatted logs. An example log file for a memory access on an NVIDIA GPU simulator might look like the following -- each part of the log will be explained in detail below.

Processing kernel $PROJECT_ROOT/logs/generate/traces/kernel-25.traceg
-kernel name = _ZN8internal5gemvx6kernelIiiffffLb0ELb1ELb1ELb0ELi6ELb0E16cublasGemvParamsI30cublasGemvTensorStridedBatchedIKfES5_S3_IfEfEEENSt9enable_ifIXntT5_EvE4typeET11_
-kernel id = 25
GPGPU-Sim: Reconfigure L1 cache to 120KB
[...]
GPGPU-Sim Cycle 5401: Load instr from L1D cache at SM 0 bank 0 addr 93ce6a00 status 2
[...]
[After simulation ends for the kernel]
gpu_sim_cycle = 31057
gpu_sim_insn = 3310536
gpu_ipc =     106.5955
gpu_tot_sim_cycle = 103839
gpu_tot_sim_insn = 88674750
gpgpu_simulation_time = 0 days, 0 hrs, 0 min, 28 sec (28 sec)

There are two major types of entries in the log file: entries that describe the kernel being simulated, and entries that describe the memory accesses that occurred during the simulation.

Kernel Information

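Most of the kernel information is logged at the start of the kernel's section of the log file; the execution time is logged once the simulation of the kernel ends. The following fields are required: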

Kernel or subroutine name: The name of the kernel or subroutine being simulated. This can be the same or similar across different kernels or subroutines, but each kernel or subroutine should have a unique numerical ID.
Kernel or subroutine ID: The ID of the kernel or subroutine being simulated. This should be a unique identifier for the kernel.
Memory hierarchy configuration: The configuration of the memory hierarchy being used for the simulation. For example, NVIDIA GPUs have L1 caches with configurable sizes, and this should be logged.
Kernel execution time: The time taken to execute the kernel. This should be logged at the end of the simulation for the kernel and can either be an absolute time or a cycle count.

Other statistics that can be logged but are not strictly required include:

Number of instructions executed: The number of instructions executed during the simulation.
IPC: The instructions per cycle (IPC) for the kernel.
Total simulation time: The total time taken for the simulation as observed by the system.
Total instructions executed: The total number of instructions executed during the simulation up to that point.
Total simulation cycles: The total number of cycles taken for the simulation up to that point.
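
As a rough illustration of how the kernel fields above could be extracted, the sketch below collects the "key = value" lines from the example log into a dictionary. The helper name parse_kernel_info and the regular expression are hypothetical and are not taken from the profiler's code; the sketch also assumes that kernel statistics always use the "key = value" layout shown in the example, so lines such as the L1 cache reconfiguration message would need separate handling.

```python
import re

# Matches lines such as "-kernel id = 25" or "gpu_ipc =     106.5955".
# Assumption: kernel statistics always follow this "key = value" layout.
KEY_VALUE_RE = re.compile(r"^-?(?P<key>[\w ]+?)\s*=\s*(?P<value>.+?)\s*$")


def parse_kernel_info(lines):
    """Collect "key = value" entries from log lines into a dict of strings."""
    info = {}
    for line in lines:
        match = KEY_VALUE_RE.match(line)
        if match:
            info[match["key"]] = match["value"]
    return info


# Lines copied from the example log above.
sample = [
    "-kernel id = 25",
    "GPGPU-Sim: Reconfigure L1 cache to 120KB",  # not key = value, skipped
    "gpu_sim_cycle = 31057",
    "gpu_ipc =     106.5955",
]
info = parse_kernel_info(sample)
kernel_id = int(info["kernel id"])
cycles = int(info["gpu_sim_cycle"])
```

Values are kept as strings so that the caller decides how to convert each field, since some entries (for example gpgpu_simulation_time) are not plain numbers.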

On-chip Memory Access Logs

Each access to each level or bank of the on-chip memory hierarchy should be logged. The logs should contain the following fields:

Timestamp: The time at which the access occurred. This can be either an absolute time or a cycle count.
Access Type: The type of access, e.g., read or write.
Access Size: The size of the access. This could also be inferred from the backend configuration -- e.g., NVIDIA GPUs are assumed to have a 32-byte cache block size, and this could be used to infer the access size.
Cache Level and Bank: If there are multiple levels or banks of cache, the level and bank of the cache that was accessed should be logged. For instance, NVIDIA GPUs have L1 and L2 caches, each split into multiple banks.
Address: The address of the memory location that was accessed.
Cache State: The state of the cache line at the time of access. This could be a simple hit/miss, or it could be more detailed, e.g., dirty/clean.

An example log entry for an L1 cache access on an NVIDIA GPU simulator might look like the following. It records a load instruction from the L1D cache at SM 0, bank 0, with address 93ce6c00 and status 2 (i.e., a cache miss).

GPGPU-Sim Cycle 5380: Load instr from L1D cache at SM 0 bank 0 addr 93ce6c00 status 2
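
The entry above could be split into its fields with a regular expression, as in the following minimal sketch. The pattern is modeled only on this one example line; the names ACCESS_RE, Access, and parse_access are hypothetical, and entries for other cache levels or other backends may need a different pattern.

```python
import re
from typing import NamedTuple, Optional

# Assumption: every access entry follows the layout of the example line.
ACCESS_RE = re.compile(
    r"GPGPU-Sim Cycle (?P<cycle>\d+): (?P<op>\w+) instr from (?P<cache>\w+) cache "
    r"at SM (?P<sm>\d+) bank (?P<bank>\d+) "
    r"addr (?P<addr>[0-9a-fA-F]+) status (?P<status>\d+)"
)


class Access(NamedTuple):
    cycle: int   # timestamp as a cycle count
    op: str      # access type, e.g. "Load"
    cache: str   # cache level, e.g. "L1D"
    sm: int
    bank: int
    addr: int    # address, parsed from hexadecimal
    status: int  # raw cache status code


def parse_access(line: str) -> Optional[Access]:
    """Return the parsed access, or None if the line is not an access entry."""
    m = ACCESS_RE.search(line)
    if m is None:
        return None
    return Access(
        cycle=int(m["cycle"]),
        op=m["op"],
        cache=m["cache"],
        sm=int(m["sm"]),
        bank=int(m["bank"]),
        addr=int(m["addr"], 16),
        status=int(m["status"]),
    )


access = parse_access(
    "GPGPU-Sim Cycle 5380: Load instr from L1D cache at SM 0 bank 0 "
    "addr 93ce6c00 status 2"
)
```

Returning None for non-matching lines lets the same function be applied to every line of a log, including the kernel information lines.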

Processed Lifetime Statistics

The profiler frontend can automatically process the logs generated by the backend to produce two types of lifetime statistics for the memory accesses:

Fine-grained individual lifetime statistics: This includes the lifetime of each individual memory access, including the start and end times of the access, the address of the memory location, and the size of the access. These files generally have the suffix .sim_l1.csv or .sim_l2.csv.
Coarse-grained kernel-level lifetime statistics: This includes the aggregate lifetime statistics for each kernel or subroutine, including the mean, median, 90%-tile, and max lifetime for all levels of caches or the on-chip memory hierarchy. These files generally have the suffix .sim.csv.

Fine-grained Individual Lifetime Statistics

The profiler frontend can process the previous logs to generate a lifetime statistics file in the comma-separated value (CSV) format, as shown below.

kernel_id,address,lifetime_cycles,lifetime_ns
25,2479779712.00,17116.00,7658.17
25,2479779584.00,17115.00,7657.72
25,2479779456.00,17114.00,7657.27
25,2479779328.00,17113.00,7656.82
25,2479779200.00,17112.00,7656.38
25,2479779072.00,17111.00,7655.93

The CSV file contains the following fields:

Kernel ID: The ID of the kernel or subroutine being simulated.
Address: The address of the memory location that was accessed.
Lifetime Cycles: The length of the lifetime in cycles, measured from the creation of a variable at this memory location to the last time the variable was consumed.
Lifetime ns: The same lifetime expressed in nanoseconds.

Note that each line in the CSV file corresponds to one data lifetime, from start to end. Thus, if the program accesses the same memory location multiple times, the CSV file may contain multiple lines for that memory location.
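
A file in this format can be read with Python's standard csv module. The sketch below loads the sample rows shown above and computes a few summary values; the helper name load_lifetimes is hypothetical, and the summary is only illustrative since it is not necessarily how the profiler frontend aggregates lifetimes.

```python
import csv
import io
import statistics

# Sample rows copied from the example above.
SAMPLE = """kernel_id,address,lifetime_cycles,lifetime_ns
25,2479779712.00,17116.00,7658.17
25,2479779584.00,17115.00,7657.72
25,2479779456.00,17114.00,7657.27
25,2479779328.00,17113.00,7656.82
25,2479779200.00,17112.00,7656.38
25,2479779072.00,17111.00,7655.93
"""


def load_lifetimes(text):
    """Parse fine-grained lifetime rows into dicts with numeric values."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "kernel_id": int(row["kernel_id"]),
            # Addresses are stored with a ".00" suffix, so go through float.
            "address": int(float(row["address"])),
            "lifetime_cycles": float(row["lifetime_cycles"]),
            "lifetime_ns": float(row["lifetime_ns"]),
        })
    return rows


rows = load_lifetimes(SAMPLE)
lifetimes_ns = [r["lifetime_ns"] for r in rows]
summary = {
    "mean": statistics.mean(lifetimes_ns),
    "median": statistics.median(lifetimes_ns),
    "max": max(lifetimes_ns),
}
```

For a real file, the same function can be fed the file contents, for example load_lifetimes(open(path).read()), where path is whichever .sim_l1.csv or .sim_l2.csv file is being inspected.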

Coarse-grained Kernel-level Lifetime Statistics

The profiler frontend can also process the previous logs to generate a kernel-level lifetime statistics file in the comma-separated value (CSV) format. The CSV file contains the following fields:

Kernel ID: The ID of the kernel or subroutine being simulated.
Kernel Name: The name of the kernel or subroutine being simulated.
Lifetime statistics: mean, median, 90%-tile, and max lifetime for all levels of caches or the on-chip memory hierarchy.
Read and write frequencies: the number of read and write accesses to each level of the memory hierarchy.
Unique addresses: the number of unique addresses accessed at each level of the memory hierarchy.
Utilization: the utilization of each level of the memory hierarchy, as a ratio between the number of unique addresses accessed and the size of each level of the memory hierarchy.
Total cycles: the total number of cycles taken for the simulation up to that point.

An example header and data row from a kernel-level lifetime statistics file might look like this:

Kernel ID,Mangled Names,L1 Lifetime,L1 Lifetime Median,L1 Lifetime 90%-tile,L1 Lifetime Max,L2 Lifetime,L2 Lifetime Median,L2 Lifetime 90%-tile,L2 Lifetime Max,L1 Read Frequency,L1 Write Frequency,L2 Read Frequency,L2 Write Frequency,L1 Utilization,L1 Size,L1 Unique Addresses,L2 Utilization,L2 Size,L2 Unique Addresses,Total Read Frequency,Total Write Frequency,Total Cycles
25,_ZN8internal5gemvx6kernelIiiffffLb0ELb1ELb1ELb0ELi6ELb0E16cublasGemvParamsI30cublasGemvTensorStridedBatchedIKfES5_S3_IfEfEEENSt9enable_ifIXntT5_EvE4typeET11_,3.596146539905556,3.4937360178970915,6.696868008948546,7.658165548098434,4.194980254624459,4.3,7.182102908277405,7.609395973154362,1193.795578397058,8.71383633866466,425.39364671481104,12.674671038057685,0.7802083333333333,122880,2996,0.08632568359375,52428800,141436,1.6191892251118694,0.021388507376722345,22571
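
The utilization values in the example row appear consistent with counting each unique address as one 32-byte cache block (the block size assumed for NVIDIA GPUs earlier in this document) and dividing by the size of the level in bytes. The sketch below reproduces the two values under that assumption; the function name utilization is hypothetical, and the formula is inferred from the example row rather than taken from the profiler source.

```python
BLOCK_BYTES = 32  # assumed cache block size, as for NVIDIA GPUs


def utilization(unique_addresses, size_bytes, block_bytes=BLOCK_BYTES):
    """Fraction of a memory level touched by the unique addresses."""
    return unique_addresses * block_bytes / size_bytes


# Values copied from the example row above.
l1_utilization = utilization(unique_addresses=2996, size_bytes=122880)
l2_utilization = utilization(unique_addresses=141436, size_bytes=52428800)
```

Under this assumption, the results agree with the L1 Utilization and L2 Utilization columns of the example row.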

Frontend Reports

The profiler frontend can generate reports in JSON format. The write frequencies and data lifetimes of the program execution are compared against the write frequency and retention time requirements of different gain cell devices. From this, the profiler frontend projects metrics such as the number of refreshes required and the area of the on-chip memory arrays, illustrating the impact of potentially replacing the on-chip SRAM of the target device with a gain cell array.

The JSON file contains the following fields:

Name of the program, as well as execution time and date
Total number of valid data lifetimes created at each level of the memory hierarchy
Write frequencies of each level of the memory hierarchy
Projected refreshes required for each level of the memory hierarchy if they were replaced by each of silicon, hybrid, or oxide gain cells
Projected area of on-chip memory arrays for each level of the memory hierarchy if they were replaced by gain cells and how they compare to the original SRAM
Projected read and refresh energy for each level of the memory hierarchy if they were replaced by gain cells and how they compare to the original SRAM

Frontend Report Example

{
    "Name": "generate",
    "Date": "2025-03-03",
    "Time": "18-20-21",
    "L1 Lifetime Count": 43181,
    "L2 Lifetime Count": 72470,
    "L1 Write Frequency": {
        "max": 334.9061770970687,
        "maxidx": 10,
        "90%-tile": 47.64617816321571,
        "weighted": 8.920562336156094
    },
    "L2 Write Frequency": {
        "max": 102.03507880020337,
        "maxidx": 12,
        "90%-tile": 81.64392629545705,
        "weighted": 18.811884671120435
    },
    "L1 Refreshes": {
        "5nm Silicon": 223554.0,
        "16nm Silicon": 0.0,
        "Hybrid": 4604.0,
        "Oxide": 0.0
    },
    "L2 Refreshes": {
        "5nm Silicon": 1100543.0,
        "16nm Silicon": 5296.0,
        "Hybrid": 79549.0,
        "Oxide": 0.0
    },
    "L1 Area": {
        "sram": 1691702.6133333333,
        "silicon": 879685.3589333334,
        "hybrid": 873813.3333333334,
        "oxide": 1342177.28
    },
    "L2 Area": {
        "sram": 845851.3066666666,
        "silicon": 439842.6794666667,
        "hybrid": 436906.6666666667,
        "oxide": 671088.64
    }
}
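
A report like the one above can be loaded with Python's standard json module. As a minimal sketch, the snippet below compares the projected gain cell areas against the SRAM baseline for one level of the hierarchy. Only the two area sections of the example report are reproduced, and the helper name area_relative_to_sram is hypothetical.

```python
import json

# Excerpt of the example report above (area sections only).
REPORT_TEXT = """
{
    "L1 Area": {
        "sram": 1691702.6133333333,
        "silicon": 879685.3589333334,
        "hybrid": 873813.3333333334,
        "oxide": 1342177.28
    },
    "L2 Area": {
        "sram": 845851.3066666666,
        "silicon": 439842.6794666667,
        "hybrid": 436906.6666666667,
        "oxide": 671088.64
    }
}
"""


def area_relative_to_sram(report, level):
    """Return each gain cell area as a fraction of the SRAM area."""
    areas = report[f"{level} Area"]
    sram = areas["sram"]
    return {tech: area / sram for tech, area in areas.items() if tech != "sram"}


report = json.loads(REPORT_TEXT)
l1_ratios = area_relative_to_sram(report, "L1")
l2_ratios = area_relative_to_sram(report, "L2")
```

A ratio below 1 means that the projected gain cell array is smaller than the original SRAM array for that level.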