Show HN: gpu-core — see which silicon is actually doing the work on your NVIDIA GPU
nvidia-smi says your GPU is 80% busy. It doesn't tell you if that's Tensor cores, CUDA cores, or just memory transfers. gpu-core does.
The problem nobody talks about
You launch a PyTorch training run. nvidia-smi shows 80% GPU utilization. You think you're in good shape.
You're not. That 80% is a single number aggregated across the entire GPU. It tells you the percentage of time over the past sample period when at least one kernel was running. It says nothing about which part of the silicon was busy.
Was it the Tensor cores doing fast FP16 matrix math? Or did PyTorch silently promote your half-precision tensors to FP32 and run everything on CUDA cores at half the throughput? Is your FP64 workload using the dedicated double-precision pipe, or thrashing through the FP32 path at 1/64th the theoretical rate? Is the GPU actually compute-bound, or is it 80% busy just moving data around?
nvidia-smi can't tell you. Neither can gpustat, nvtop, or nvitop. Those tools are prettier wrappers around the same NVML metrics. They show utilization, temperature, VRAM, processes. None of them answer the question that actually matters when you're optimizing ML workloads: which hardware pipe is doing the work?
What gpu-core does
gpu-core is a terminal-based NVIDIA GPU monitor that shows what every part of the silicon is doing. Two Python dependencies (nvidia-ml-py and torch), no CUDA toolkit required for the monitor, works on every NVIDIA GPU from Volta through Hopper.
It does three things that nothing else does:
1. Core inventory with real TFLOPS math. For any NVIDIA GPU, it tells you exactly how many CUDA cores, Tensor cores, and RT cores you have, organized by SM (Streaming Multiprocessor). It computes theoretical peak TFLOPS per pipe using the formula:
peak_TFLOPS = cores × 2 × boost_clock_GHz / 1000

The factor of 2 counts a fused multiply-add as two floating-point operations.
This isn't a look-up table approximation. It reads the actual SM count and boost clock from NVML at runtime and multiplies through the architecture table. The table covers every compute capability from 7.0 (Volta) through 9.0 (Hopper), with correct per-architecture values for FP32 cores/SM, Tensor cores/SM, FP64:FP32 ratios, FP16 CUDA ratios, and sparsity support.
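For illustration, here is a minimal sketch of that math. The dictionary and function names are mine, not gpu-core's; in the real tool the SM count and boost clock come from NVML at runtime, and the per-SM figures come from the architecture table shown further down.

```python
# Minimal sketch of the peak-TFLOPS math (illustrative names, not gpu-core's).
# SM count and boost clock would normally be read from NVML at runtime.

ARCH_TABLE = {
    # compute capability: (FP32 cores/SM, Tensor cores/SM, FP64:FP32 ratio)
    (7, 0): (64, 8, 1 / 2),    # Volta
    (7, 5): (64, 8, 1 / 32),   # Turing
    (8, 0): (64, 4, 1 / 2),    # Ampere (datacenter)
    (8, 6): (128, 4, 1 / 64),  # Ampere (consumer)
    (8, 9): (128, 4, 1 / 64),  # Ada Lovelace
    (9, 0): (128, 4, 1 / 2),   # Hopper
}

def peak_fp32_tflops(cc: tuple, sm_count: int, boost_clock_ghz: float) -> float:
    """peak_TFLOPS = cores x 2 (FMA = 2 FLOPs) x boost clock in GHz / 1000."""
    fp32_per_sm, _, _ = ARCH_TABLE[cc]
    cores = fp32_per_sm * sm_count
    return cores * 2 * boost_clock_ghz / 1000.0

# Example: RTX A3000 Laptop GPU (CC 8.6), 32 SMs, 1.74 GHz boost -> ~14.3 TFLOPS
print(f"{peak_fp32_tflops((8, 6), 32, 1.74):.1f} TFLOPS")
```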
2. Live per-pipe utilization on datacenter GPUs. On Hopper+ GPUs (H100, H200), the monitor uses NVML's GPU Performance Monitoring (GPM) interface to show live bars for FP32, FP16, FP64, and Tensor utilization separately. On datacenter Ampere and Ada (A100, L40), it falls back to DCGM (Data Center GPU Manager) for the same per-pipe breakdown. On consumer GPUs where NVIDIA locks these metrics behind the datacenter SKU, it's honest about the limitation and falls back to aggregate SM utilization.
3. Empirical pipe verification. This is the part that has no equivalent in any other tool I'm aware of. Instead of relying on counters that may or may not be available on your hardware, gpu-core includes a verifier (gpu_verify.py) that proves which pipe executed a workload by measuring throughput and comparing it to theoretical peaks.
The logic is simple and airtight:
RTX A3000 Laptop (Ampere):
FP32 CUDA peak: 14.3 TFLOPS (4,096 cores × 2 × 1.74 GHz)
Tensor dense peak: 28.5 TFLOPS (128 Tensor cores × 128 ops/clock × 1.74 GHz)
If FP16 matmul measures > 14.3 TFLOPS → only Tensor cores could have done it.
Tensor cores on FP16 matrix multiply produce throughput that is physically impossible for CUDA cores alone. If your measured TFLOPS exceeds the FP32 CUDA peak, no CUDA-core-only path can explain it. That's the proof.
The verifier runs actual workloads (FP32 matmul, FP64 matmul, FP16 matmul targeting Tensor cores) and prints PASS/FAIL for each:
[FP32 CUDA] measured: 13.8 TFLOPS | expected peak: 14.3
[PASS] 97% of FP32 peak — consistent with FP32 CUDA cores
[Tensor] measured: 28.4 TFLOPS | FP32 CUDA peak (must exceed): 14.3
[PASS] 199% of FP32 peak — ONLY tensor cores can exceed this
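To make the throughput argument concrete, here is roughly what such a measurement looks like in PyTorch. This is a simplified sketch, not gpu_verify.py itself; the peak figure is the RTX A3000 number quoted above.

```python
# Simplified sketch of the throughput argument (not the actual gpu_verify.py).
import torch

FP32_CUDA_PEAK_TFLOPS = 14.3  # RTX A3000: 4,096 cores x 2 x 1.74 GHz

def measured_tflops(dtype: torch.dtype, n: int = 8192, iters: int = 50) -> float:
    """Time an n x n matmul and convert to TFLOPS (2*n^3 FLOPs per matmul)."""
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    for _ in range(5):                       # warm-up
        a @ b
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000.0   # elapsed_time returns ms
    return (2 * n**3 * iters) / seconds / 1e12

fp16 = measured_tflops(torch.float16)
print(f"FP16 matmul: {fp16:.1f} TFLOPS")
if fp16 > 1.1 * FP32_CUDA_PEAK_TFLOPS:
    print("PASS: only Tensor cores can exceed the FP32 CUDA peak")
```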
If you want hardware-counter ground truth on top of the empirical measurement, there's verify_with_ncu.sh that reads sm__inst_executed_pipe_tensor and friends from Nsight Compute. But for most people, the throughput argument is sufficient and doesn't require root or the CUDA toolkit.
Why the existing tools don't solve this
I keep a terminal with nvtop running during every training job. It's a good tool. But here's what it shows me:
- GPU utilization (aggregate, same number as nvidia-smi)
- Memory utilization
- Temperature, fan speed, power
- Per-process breakdown
Here's what it doesn't show me:
- Whether my Tensor cores are being used at all
- Whether FP16 operations are actually running in half precision or getting promoted
- Whether my GPU is compute-bound or memory-bound
- Theoretical peak vs actual measured throughput per pipe
gpustat is even simpler. One line per GPU: utilization, temperature, memory, processes. Clean, but the same blind spot.
nvitop is the most feature-rich of the group. Beautiful TUI, per-process details, Windows support, Python API. Still no per-pipe breakdown, still no way to answer "are my Tensor cores firing?"
DCGM (NVIDIA's Data Center GPU Manager) does expose per-pipe metrics via DCGM_FI_PROF_PIPE_TENSOR_ACTIVE and similar fields. But DCGM only works on datacenter GPUs (A100, H100), requires a system service running as root, uses a custom C library with Python bindings buried in /usr/local/dcgm/, and none of the popular monitoring tools surface those fields in their TUI. gpu-core integrates DCGM when available and falls back gracefully when it's not.
Nsight Compute (ncu) gives you absolute ground truth via hardware instruction counters. It can tell you exactly which pipe executed which instruction. But it's a profiling tool, not a monitor. It intercepts kernel execution, adds significant overhead, requires root or CAP_SYS_ADMIN, and produces per-kernel reports rather than continuous monitoring. gpu-core wraps ncu as an optional verification step but doesn't depend on it.
The architecture table is the backbone
A GPU monitoring tool is only as good as its knowledge of the hardware. gpu-core maintains a mapping from CUDA compute capability to architecture specification:
| CC | Architecture | CUDA cores/SM | Tensor cores/SM | RT cores/SM | FP64:FP32 |
|---|---|---|---|---|---|
| 7.0 | Volta | 64 | 8 | 0 | 1:2 |
| 7.5 | Turing | 64 | 8 | 1 | 1:32 |
| 8.0 | Ampere (DC) | 64 | 4 | 0 | 1:2 |
| 8.6 | Ampere | 128 | 4 | 1 | 1:64 |
| 8.9 | Ada Lovelace | 128 | 4 | 1 | 1:64 |
| 9.0 | Hopper | 128 | 4 | 0 | 1:2 |
Notice the FP64:FP32 ratios. Datacenter Ampere (A100, CC 8.0) gives you 1:2 FP64 throughput. Consumer Ampere (RTX 3090, CC 8.6) gives you 1:64. Same architecture name, dramatically different silicon. If your scientific computing workload is FP64-heavy on a consumer GPU, you're getting 1/64th the throughput you'd get on an A100. gpu-core shows you this immediately.
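To put rough numbers on that, here is a back-of-the-envelope sketch using approximate public spec-sheet SM counts and clocks (my figures, not values gpu-core reads from NVML):

```python
# Back-of-the-envelope FP64 peaks implied by the ratios above.
# SM counts and clocks are approximate spec-sheet values, not NVML reads.
a100_fp32    = 108 * 64 * 2 * 1.41 / 1000   # ~19.5 TFLOPS (CC 8.0)
rtx3090_fp32 = 82 * 128 * 2 * 1.70 / 1000   # ~35.7 TFLOPS (CC 8.6)

a100_fp64    = a100_fp32 / 2       # 1:2 ratio  -> ~9.7 TFLOPS
rtx3090_fp64 = rtx3090_fp32 / 64   # 1:64 ratio -> ~0.56 TFLOPS

print(f"A100 FP64:     {a100_fp64:.2f} TFLOPS")
print(f"RTX 3090 FP64: {rtx3090_fp64:.2f} TFLOPS")
```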
The stress kit makes it visual
The most satisfying part of the project: gpu_stress.py lets you hammer each pipe individually and watch the monitor react. Run gpu_stress.py tensor in one terminal and ./gpu-core in another. Watch the Tensor core utilization bar go from zero to saturated while the FP32 bar stays flat. Then switch to gpu_stress.py fp32 and watch the opposite happen.
This isn't just a demo. It's a debugging technique. If you think your training loop is using Tensor cores, run it alongside the monitor. If the Tensor bar doesn't move, your code has a problem, not your GPU.
The dedicated Tensor core script (tensor_core_only.py) runs four workloads that should exclusively light up Tensor cores: FP16 matmul, BF16 matmul, FP16 Linear layers, and FP16 Conv2D. After each, it prints CONFIRMED or WARNING based on whether measured throughput exceeded the FP32 CUDA peak. If you see WARNING, cuBLAS silently fell back to CUDA cores, and you need to investigate why.
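For a sense of what one of those checks looks like, here is a hedged sketch of the FP16 Linear-layer case (my simplification, not the actual script): run the layer in half precision, count FLOPs as 2 × batch × in_features × out_features, and compare against the FP32 CUDA peak.

```python
# Sketch of an FP16 Linear-layer check in the spirit of tensor_core_only.py
# (simplified; not the actual script). CONFIRMED means measured throughput
# exceeded the FP32 CUDA peak, which only Tensor cores can do.
import torch

FP32_CUDA_PEAK_TFLOPS = 14.3
batch, features, iters = 16384, 8192, 50

layer = torch.nn.Linear(features, features, bias=False).cuda().half()
x = torch.randn(batch, features, device="cuda", dtype=torch.float16)

for _ in range(5):                      # warm-up
    layer(x)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(iters):
    layer(x)
end.record()
torch.cuda.synchronize()

flops = 2 * batch * features * features * iters
tflops = flops / (start.elapsed_time(end) / 1000.0) / 1e12
verdict = "CONFIRMED" if tflops > 1.1 * FP32_CUDA_PEAK_TFLOPS else "WARNING"
print(f"[FP16 Linear] {tflops:.1f} TFLOPS -> {verdict}")
```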
What makes the empirical approach work
The key insight is that different hardware pipes have non-overlapping throughput ranges. Tensor core throughput on FP16 matmul is typically 2-8x the FP32 CUDA peak (depending on architecture). If you measure 2x the FP32 peak on an FP16 workload, there is no physical explanation other than Tensor cores.
This works because:
- Throughput ceilings are hard physical limits. 4,096 FP32 CUDA cores running at 1.74 GHz can produce at most 14.3 TFLOPS. You can't exceed it on FP32 CUDA cores. Period.
- Tensor cores have a separate, higher ceiling. 128 Tensor cores at 128 ops/clock at 1.74 GHz give you 28.5 TFLOPS. When you observe throughput in the 20-28 TFLOPS range on an FP16 workload, only Tensor cores explain it.
- The gap between ceilings is large enough to be unambiguous. Even with thermal throttling, power limits, and memory bottlenecks, a workload hitting 180% of the FP32 peak is clearly on Tensor cores. The verifier uses conservative thresholds (must exceed 110% of FP32 peak to claim Tensor) to avoid false positives.
The one case where this gets ambiguous: FP16 on CUDA cores. Ampere and Ada CUDA cores can process FP16 at 2x FP32 throughput natively. On those architectures, FP16 matmul on CUDA cores can hit up to 2x the FP32 peak. Tensor cores hit 4-8x. So the distinction still holds, but the threshold needs to account for the FP16 CUDA boost. The architecture table encodes the fp16_cuda_ratio per compute capability to handle this correctly.
Installation
git clone https://github.com/sumit-ai-ml/gpu-core.git
cd gpu-core
pip install nvidia-ml-py torch
./gpu-core              # launch the monitor

In another terminal:

python3 gpu_verify.py   # prove which cores your GPU uses

Two pip packages. No CUDA toolkit for the monitor. No root access. Works on laptops, workstations, and datacenter GPUs.
Honest limitations
- Single GPU only. Monitors device index 0. Multi-GPU support is planned but not shipped.
- Consumer GPUs can't show live per-pipe bars. NVIDIA gates the per-pipe profiling metrics (GPM/DCGM) to datacenter SKUs. The empirical verifier and ncu still work. You just don't get a continuous live bar.
- RT core utilization has no public API. gpu-core shows the RT core count but can't monitor utilization. NVIDIA hasn't exposed this metric.
- Fused PyTorch kernels on Ampere are sometimes ambiguous. When cuBLAS uses a fused FP16 kernel that internally mixes Tensor and CUDA core instructions, the throughput lands between the two peaks. The stress kit labels these honestly as "pipe unspecified."
These are NVIDIA API limitations, not bugs. The tool is transparent about what it can and can't measure on each GPU class.
The 48 tests
The project has 48 unit tests covering the architecture table, TFLOPS calculations, bar rendering, color thresholds, and the verifier's pass/fail decision logic. The verifier's decide_pass_fail function is pure (no GPU required) so the pass/fail thresholds are tested against a matrix of inputs. The actual pipe workloads need real hardware and are exercised manually via gpu_verify.py.
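Because the decision logic is pure, tests like the following can run anywhere, GPU or not. The helper and thresholds below are illustrative sketches of the idea, not code copied from the repo; the real decide_pass_fail may have a different signature.

```python
# Sketch of testing a pure pass/fail decision without a GPU (pytest).
# The helper and thresholds are illustrative, not the repo's actual logic.
import pytest

def decide(measured_tflops: float, fp32_peak_tflops: float,
           claim_tensor: bool, tensor_margin: float = 1.10) -> bool:
    """Pass a Tensor claim only if the measurement exceeds the FP32 CUDA
    ceiling by the safety margin; pass an FP32 run if it lands near the peak."""
    if claim_tensor:
        return measured_tflops > tensor_margin * fp32_peak_tflops
    return 0.5 * fp32_peak_tflops <= measured_tflops <= 1.05 * fp32_peak_tflops

@pytest.mark.parametrize("measured,peak,claim_tensor,expected", [
    (13.8, 14.3, False, True),   # 97% of FP32 peak -> consistent with CUDA cores
    (28.4, 14.3, True,  True),   # 199% of FP32 peak -> only Tensor cores
    (12.0, 14.3, True,  False),  # below the ceiling -> cannot claim Tensor
])
def test_decide(measured, peak, claim_tensor, expected):
    assert decide(measured, peak, claim_tensor) is expected
```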
python3 -m pytest test_gpu_core.py -v

Who needs this
If you're training ML models and have ever wondered whether torch.cuda.amp.autocast is actually engaging Tensor cores, or whether your FP16 inference is running at full speed, or why your H100 feels no faster than an A100 on a particular workload, gpu-core gives you the answer in seconds.
If you're a systems engineer capacity-planning GPU clusters and need to know whether your workloads are compute-bound or memory-bound per pipe, the live monitor shows it.
If you're debugging a performance regression and suspect the wrong hardware unit is executing your kernels, the verifier proves it.
Or if you just want to watch the Tensor cores light up when you run a matmul, that's a valid reason too.
Repository: github.com/sumit-ai-ml/gpu-core
License: MIT
Requirements: Linux, NVIDIA GPU (Volta+), Python 3.8+, nvidia-ml-py, torch