CUDA Kernel Optimization: What Actually Moves the Needle on Jetson
Most CUDA optimization guides are written for datacenter GPUs with 80 GB of HBM and a 400W TDP. Jetson Orin has a unified memory architecture, a 15W power budget, and thermal constraints that will throttle you if you ignore them. The optimization priorities are different.
Here's the workflow I've landed on after deploying several inference pipelines on Orin at Ciena.
Start with Nsight Systems, Not Intuition
The single most important habit is profiling before touching code. nsys profile --trace=cuda,nvtx ./your_app gives you a timeline showing exactly where time goes — kernel execution, memory transfers, synchronization stalls, and CPU overhead.
On Jetson, the first surprise is usually memory bandwidth contention. Orin's GPU shares memory with the CPU, and if your pipeline simultaneously runs inference kernels and video decode (via NVDEC), they fight over the same memory bus. You'll see kernels queued but not running — the GPU is memory-starved, not compute-starved. No amount of kernel optimization fixes this without addressing the scheduling.
nsys profile --trace=cuda,nvtx,osrt --output=profile_run ./inference_app
nsys-ui profile_run.nsys-rep  # open the report in the timeline GUI
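Raw CUDA traces are hard to read without markers. Here's a minimal sketch of NVTX range annotations so each pipeline stage shows up as a named span on the nsys timeline; the stage functions are hypothetical placeholders for your own code:

#include <nvtx3/nvToolsExt.h>

// Hypothetical stage functions -- stand-ins for your own pipeline code.
void preprocess_frame() { /* resize, normalize, upload */ }
void run_inference()    { /* enqueue the TensorRT context */ }

int main() {
    nvtxRangePushA("preprocess");   // appears as a named span in the timeline
    preprocess_frame();
    nvtxRangePop();

    nvtxRangePushA("inference");
    run_inference();
    nvtxRangePop();
    return 0;
}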
Memory Coalescing Is the First Real Win
Uncoalesced global memory access is the most common performance killer I've seen in edge inference code. When threads in a warp access non-contiguous addresses, the hardware issues multiple memory transactions instead of one. On a bandwidth-constrained platform like Jetson, this multiplies your effective memory pressure.
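To see the effect in isolation, here's a toy sketch contrasting the two patterns; sizes and stride are arbitrary. Profile both kernels with ncu and compare the memory transaction counts:

#include <cuda_runtime.h>

// Coalesced: adjacent lanes of a warp read adjacent floats, so the
// hardware services the warp with a minimal number of transactions.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: lanes hit addresses `stride` elements apart, scattering the
// warp across many cache lines and multiplying memory traffic.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = (i * stride) % n;
    out[j] = in[j];
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    copy_coalesced<<<(n + 255) / 256, 256>>>(in, out, n);
    copy_strided<<<(n + 255) / 256, 256>>>(in, out, n, 32);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}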
For convolution inputs and feature maps, ensure your tensor layout matches the kernel's access pattern. NCHW vs NHWC matters: TensorRT's default I/O format is linear NCHW, but you can configure NHWC-style I/O formats so the engine consumes your data as-is. A layout mismatch means an extra transpose on every inference call.
// Check tensor layout before passing to TRT
// Bad: forcing a transpose per inference call
auto input = allocate_tensor(NHWC);
convert_to_nchw(input, engine_input); // unnecessary copy on every call
// Better: configure TRT to accept NHWC directly
// kHWC8 is an FP16 format, so the input type must be set to half as well
network->getInput(0)->setType(DataType::kHALF);
network->getInput(0)->setAllowedFormats(1U << static_cast<int>(TensorFormat::kHWC8));
Occupancy Tuning
High occupancy (many warps resident per SM) hides memory latency; low occupancy means stalls while warps wait on memory. But maximum occupancy isn't always optimal: pushing occupancy higher means giving each thread fewer registers, and once the compiler starts spilling registers to local memory, the spill traffic costs more than the extra latency hiding saved.
The occupancy sweet spot on Orin's Ampere GPU is usually around 50-75%. Use ncu --metrics sm__warps_active.avg.pct_of_peak_sustained_active to measure it. If you're below 40%, investigate register usage and shared memory allocation. If you're at 100% but still slow, the bottleneck is elsewhere.
ncu --metrics sm__warps_active.avg.pct_of_peak_sustained_active,\
l1tex__t_bytes.sum,\
sm__throughput.avg.pct_of_peak_sustained_elapsed \
./inference_app
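You can also query the theoretical ceiling programmatically and pin the register budget with __launch_bounds__. A sketch with a placeholder kernel:

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel. __launch_bounds__(256, 4) tells the compiler to cap
// register usage so that at least 4 blocks of 256 threads can be resident
// per SM -- the lever for trading registers against occupancy.
__global__ void __launch_bounds__(256, 4) scale_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    int blocks_per_sm = 0;
    // How many blocks of this kernel fit per SM at this block size
    // and dynamic shared memory usage (0 bytes here)?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, scale_kernel, /*blockSize=*/256, /*dynSmemBytes=*/0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    double occ = 100.0 * blocks_per_sm * 256 / prop.maxThreadsPerMultiProcessor;
    printf("theoretical occupancy: %.0f%%\n", occ);
    return 0;
}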
INT8 Quantization: When It Helps and When It Doesn't
INT8 is commonly presented as a free 2x speedup. In practice, it depends on what's bottlenecking you.
If you're compute-bound on tensor core operations, INT8 roughly doubles throughput compared to FP16, since Orin's tensor cores execute INT8 at twice the FP16 rate. If you're memory bandwidth-bound (which is common on Jetson at small batch sizes), the speedup is much lower because you're still moving similar amounts of data.
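A quick way to predict which regime you're in is to compare the layer's arithmetic intensity (ops per byte moved) against the machine balance (peak ops per second divided by memory bandwidth). A back-of-envelope sketch; every number below is a hypothetical placeholder, since Orin's peaks depend on module and power mode:

#include <cstdio>

int main() {
    // Hypothetical machine numbers -- substitute your module's real specs.
    double peak_ops = 100e12;           // INT8 ops/s (placeholder)
    double peak_bw  = 100e9;            // memory bytes/s (placeholder)
    double balance  = peak_ops / peak_bw;

    // Hypothetical layer: ops executed and bytes moved per inference.
    double layer_ops   = 2.0e9;
    double layer_bytes = 8.0e6;
    double intensity   = layer_ops / layer_bytes;

    // Below the balance point the layer is bandwidth-bound: doubling the
    // compute rate with INT8 barely helps unless the bytes shrink too.
    printf("intensity %.0f ops/B vs balance %.0f ops/B -> %s-bound\n",
           intensity, balance, intensity < balance ? "bandwidth" : "compute");
    return 0;
}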
The calibration process also matters. PTQ (post-training quantization) with a representative calibration dataset produces acceptable accuracy for most detection models, but activation distributions that deviate significantly from the calibration set will cause accuracy regression at runtime. For production deployments, I always run accuracy benchmarks on field-representative data, not just the standard benchmarks.
# TensorRT INT8 calibration (assumes `import tensorrt as trt` and a builder config)
config.set_flag(trt.BuilderFlag.INT8)
config.int8_calibrator = MyEntropyCalibrator(
    calibration_data=calib_dataset,  # use real deployment data
    cache_file="calibration.cache",
)
Thermal Management Is Not Optional
Jetson will throttle under sustained load. The default thermal policies are conservative for a reason — sustained kernel-intensive workloads will push the SoC above 60°C in an enclosure, and the firmware will drop clock speeds to maintain thermal headroom.
For sustained inference deployments, I set a fixed power mode rather than letting the system scale dynamically:
sudo nvpmodel -m 0 # MAXN — full performance
sudo jetson_clocks # lock clocks at max
Monitor thermals during profiling: tegrastats --interval 500. If you see CPU@65C and clocks dropping, your benchmark numbers are meaningless — you're measuring throttled performance. Either improve the thermal solution (heatsink, airflow, copper pour) or set a sustainable power mode and benchmark at that configuration.
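To catch throttling from inside the benchmark harness rather than watching tegrastats by eye, the kernel's thermal zones under /sys are readable from any process. A sketch; zone numbering varies across Jetson modules, so the index here is an assumption — check each zone's type file first:

#include <fstream>
#include <iostream>
#include <string>

// Read a thermal zone in millidegrees Celsius. Match the zone index to the
// sensor you care about via /sys/devices/virtual/thermal/thermal_zone*/type.
int read_temp_mC(int zone) {
    std::ifstream f("/sys/devices/virtual/thermal/thermal_zone" +
                    std::to_string(zone) + "/temp");
    int mC = -1;
    f >> mC;
    return mC;
}

int main() {
    int t = read_temp_mC(0);  // hypothetical zone index
    std::cout << "zone0: " << t / 1000.0 << " C\n";
    if (t > 60000)  // the ~60 C territory where throttling kicks in
        std::cerr << "warning: measuring while hot, results are throttled\n";
    return 0;
}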
The Workflow That Actually Works
1. Profile with Nsight Systems — identify the actual bottleneck
2. Fix memory layout and access patterns first
3. Tune occupancy — check register usage with ncu
4. Evaluate quantization against real accuracy requirements
5. Benchmark at steady-state thermal conditions, not cold-start
The gains compound. On one Ciena pipeline, fixing memory coalescing added 18% throughput, occupancy tuning added another 12%, and INT8 quantization added 31% on top of that — 74% total improvement before touching the model architecture.
Profiling first is what made each step count.