GPU VRAM Calculator
Estimate the GPU VRAM needed to run or train a model.
Recommended next steps
Related tools
Estimate text tokens and rough AI API costs before running a prompt.
Estimate the cost of generating images across model presets.
Strip EXIF and location data from images, locally in your browser.
Frequently asked questions
Weights come from the parameter count times the bytes per parameter for your chosen precision (fp16 = 2, 8-bit = 1, 4-bit = 0.5). On top of that we add the KV cache (which grows with hidden size, layers, context length, and batch size) plus a configurable activation overhead. Training mode also adds gradients and Adam optimizer state.
Every token you process keeps a key and value entry in the attention cache for each layer. Doubling the context length or the batch size roughly doubles that KV cache, which is why long-context or high-concurrency serving needs noticeably more memory than the weights alone suggest.
Quantizing the weights cuts their memory in proportion to the byte width: 8-bit halves the weight memory versus fp16, and 4-bit quarters it. It does not shrink the KV cache or activations unless you also quantize those, so the total saving is smaller than the weight saving alone.
Treat it as a planning ballpark, not a guarantee. Real usage shifts with the framework, the attention kernel, memory fragmentation, and how the runtime allocates buffers. Leave headroom above the estimate and confirm on the actual stack before committing to hardware.
Last updated 2026-06-23.