Matching CUDA arch and CUDA gencode for various NVIDIA architectures

Updated July 12th 2024

tl;dr

I’ve seen some confusion regarding NVIDIA’s nvcc sm flags and what they’re used for:
When compiling with NVCC, the arch flag (‘-arch‘) specifies the name of the NVIDIA GPU architecture that the CUDA files will be compiled for.
Gencodes (‘-gencode‘) allows for more PTX generations and can be repeated many times for different architectures.

Here’s a list of NVIDIA architecture names, and which compute capabilities they have:

Fermi, Kepler^†	Maxwell^‡	Pascal	Volta	Turing	Ampere	Ada	Hopper	Blackwell	Rubin
sm_20	sm_50	sm_60	sm_70	sm_75	sm_80	sm_89	sm_90, sm_90a (Thor)	sm_100, sm_100a	sm_xxx?
sm_30, sm_35, sm_37	sm_52	sm_61	sm_72 (Xavier)		sm_86			sm_101, sm_101a
	sm_53	sm_62			sm_87 (Orin)			sm_120, sm_120a

† Fermi and Kepler are deprecated from CUDA 9 and 11 onwards
‡ Maxwell is deprecated from CUDA 11.6 onwards

When should different ‘gencodes’ or ‘cuda arch’ be used?

When you compile CUDA code, you should always compile only one ‘-arch‘ flag that matches your most used GPU cards. This will enable faster runtime, because code generation will occur during compilation.
If you only mention ‘-gencode‘, but omit the ‘-arch‘ flag, the GPU code generation will occur on the JIT compiler by the CUDA driver.

When you want to speed up CUDA compilation, you want to reduce the amount of irrelevant ‘-gencode‘ flags. However, sometimes you may wish to have better CUDA backwards compatibility by adding more comprehensive ‘-gencode‘ flags.

Before you continue, identify which GPU you have and which CUDA version you have installed first.

Supported SM and Gencode variations

Below are the supported sm variations and sample cards from that generation.

I’ve tried to supply representative NVIDIA GPU cards for each architecture name, and CUDA version.

Fermi cards (CUDA 3.2 until CUDA 8)

Deprecated from CUDA 9, support completely dropped from CUDA 10.

SM20 or SM_20, compute_30 –
GeForce 400, 500, 600, GT-630.
Completely dropped from CUDA 10 onwards.

Kepler cards (CUDA 5 until CUDA 10)

Deprecated from CUDA 11.

SM30 or SM_30, compute_30 –
Kepler architecture (e.g. generic Kepler, GeForce 700, GT-730).
Adds support for unified memory programming
Completely dropped from CUDA 11 onwards.
SM35 or SM_35, compute_35 –
Tesla K40.
Adds support for dynamic parallelism.
Deprecated from CUDA 11, will be dropped in future versions.
SM37 or SM_37, compute_37 –
Tesla K80.
Adds a few more registers.
Deprecated from CUDA 11, will be dropped in future versions, strongly suggest replacing with a 32GB PCIe Tesla V100.

Maxwell cards (CUDA 6 until CUDA 11)

SM50 or SM_50, compute_50 –
Tesla/Quadro M series.
Deprecated from CUDA 11, will be dropped in future versions, strongly suggest replacing with a Quadro RTX 4000 or A6000.
SM52 or SM_52, compute_52 –
Quadro M6000 , GeForce 900, GTX-970, GTX-980, GTX Titan X.
SM53 or SM_53, compute_53 –
Tegra (Jetson) TX1 / Tegra X1, Drive CX, Drive PX, Jetson Nano.

Pascal (CUDA 8 and later)

SM60 or SM_60, compute_60 –
Quadro GP100, Tesla P100, DGX-1 (Generic Pascal)
SM61 or SM_61, compute_61–
GTX 1080, GTX 1070, GTX 1060, GTX 1050, GTX 1030 (GP108), GT 1010 (GP108) Titan Xp, Tesla P40, Tesla P4, Discrete GPU on the NVIDIA Drive PX2
SM62 or SM_62, compute_62 –
Integrated GPU on the NVIDIA Drive PX2, Tegra (Jetson) TX2

Volta (CUDA 9 and later)

SM70 or SM_70, compute_70 –
DGX-1 with Volta, Tesla V100, GTX 1180 (GV104), Titan V, Quadro GV100
SM72 or SM_72, compute_72 –
Jetson AGX Xavier, Drive AGX Pegasus, Xavier NX

Turing (CUDA 10 and later)

SM75 or SM_75, compute_75 –
GTX/RTX Turing – GTX 1660 Ti, RTX 2060, RTX 2070, RTX 2080, Titan RTX, Quadro RTX 4000, Quadro RTX 5000, Quadro RTX 6000, Quadro RTX 8000, Quadro T1000/T2000, Tesla T4

Ampere (CUDA 11.1 and later)

SM80 or SM_80, compute_80 –
NVIDIA A100 (the name “Tesla” has been dropped – GA100), NVIDIA DGX-A100
SM86 or SM_86, compute_86 – (from CUDA 11.1 onwards)
Tesla GA10x cards, RTX Ampere – RTX 3080, GA102 – RTX 3090, RTX A2000, A3000, RTX A4000, RTX A5000, RTX A6000 Ada, NVIDIA A40, GA106 – RTX 3060, GA104 – RTX 3070, GA107 – RTX 3050, RTX A10, RTX A16, RTX A40, A2 Tensor Core GPU, A800 40GB

SM87 or SM_87, compute_87 – (from CUDA 11.4 onwards, introduced with PTX ISA 7.4 / Driver r470 and newer) – for Jetson AGX Orin and Drive AGX Orin only

“Devices of compute capability 8.6 have 2x more FP32 operations per cycle per SM than devices of compute capability 8.0. While a binary compiled for 8.0 will run as is on 8.6, it is recommended to compile explicitly for 8.6 to benefit from the increased FP32 throughput.“

https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html#improved_fp32

Ada Lovelace (CUDA 11.8 and later)

SM89 or SM_89, compute_89 –
NVIDIA GeForce RTX 4090, RTX 4080, RTX 6000 Ada, Tesla L40, L40s Ada, L4 Ada, RTX 4500

Hopper (CUDA 12 and later)

Requires PTX 8.0

SM90 or SM_90, compute_90 –
NVIDIA H100 (GH100), NVIDIA H200
SM90a or SM_90a, compute_90a – (not forwards compatible, specialized accelerated features) – adds acceleration for features like wgmma and setmaxnreg. This is required for NVIDIA CUTLASS

Blackwell (CUDA 12.6 and later)

Requires PTX 8.6

SM100 or SM_100, compute_100 –
NVIDIA B100 (GB100), B200, GB202, GB203, GB205, GB206, GB207, GeForce RTX 50 series, RTX 5080, RTX 5090, NVIDIA B40
SM100a or SM_100A, compute_100a – (not forwards compatible, specialized accelerated features)
NVIDIA B100 (GB100), B200, GB202, GB203, GB205, GB206, GB207, GeForce RTX 50 series, RTX 5080, RTX 5090, NVIDIA B40

Blackwell (CUDA 12.8 and later)

Requires PTX 8.7

SM120 or SM_120, compute_120 –
NVIDIA B100 (GB100), B200, GB202, GB203, GB205, GB206, GB207, GeForce RTX 50 series, RTX 5080, RTX 5090, NVIDIA B40
SM120a or SM_120a, compute_120a – (not forwards compatible, specialized accelerated features)
NVIDIA B100 (GB100), B200, GB202, GB203, GB205, GB206, GB207, GeForce RTX 50 series, RTX 5080, RTX 5090, NVIDIA B40

Sample `nvcc` `gencode` and `arch` Flags in GCC

According to NVIDIA:

The arch= clause of the -gencode= command-line option to nvcc specifies the front-end compilation target and must always be a PTX version. The code= clause specifies the back-end compilation target and can either be cubin or PTX or both. Only the back-end target version(s) specified by the code= clause will be retained in the resulting binary; at least one must be PTX to provide Ampere compatibility.

Sample flags for GCC generation on CUDA 7.0 for maximum compatibility with all cards from the era:

-arch=sm_30 \
 -gencode=arch=compute_20,code=sm_20 \
 -gencode=arch=compute_30,code=sm_30 \
 -gencode=arch=compute_50,code=sm_50 \
 -gencode=arch=compute_52,code=sm_52 \
 -gencode=arch=compute_52,code=compute_52

Sample flags for generation on CUDA 8.1 for maximum compatibility with cards predating Volta:

-arch=sm_30 \
 -gencode=arch=compute_20,code=sm_20 \
 -gencode=arch=compute_30,code=sm_30 \
 -gencode=arch=compute_50,code=sm_50 \
 -gencode=arch=compute_52,code=sm_52 \
 -gencode=arch=compute_60,code=sm_60 \
 -gencode=arch=compute_61,code=sm_61 \
 -gencode=arch=compute_61,code=compute_61

Sample flags for generation on CUDA 9.2 for maximum compatibility with Volta cards:

-arch=sm_50 \
-gencode=arch=compute_50,code=sm_50 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \ 
-gencode=arch=compute_70,code=compute_70

Sample flags for generation on CUDA 10.1 for maximum compatibility with V100 and T4 Turing cards:

-arch=sm_50 \ 
-gencode=arch=compute_50,code=sm_50 \ 
-gencode=arch=compute_52,code=sm_52 \ 
-gencode=arch=compute_60,code=sm_60 \ 
-gencode=arch=compute_61,code=sm_61 \ 
-gencode=arch=compute_70,code=sm_70 \ 
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_75,code=compute_75

Sample flags for generation on CUDA 11.0 for maximum compatibility with V100 and T4 Turing cards:

-arch=sm_52 \ 
-gencode=arch=compute_52,code=sm_52 \ 
-gencode=arch=compute_60,code=sm_60 \ 
-gencode=arch=compute_61,code=sm_61 \ 
-gencode=arch=compute_70,code=sm_70 \ 
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_80,code=compute_80

Sample flags for generation on CUDA 11.7 for maximum compatibility with V100 and T4 Turing Datacenter cards, but also support newer RTX 3080, and Drive AGX Orin:

-arch=sm_52 \ 
-gencode=arch=compute_52,code=sm_52 \ 
-gencode=arch=compute_60,code=sm_60 \ 
-gencode=arch=compute_61,code=sm_61 \ 
-gencode=arch=compute_70,code=sm_70 \ 
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_86,code=sm_86 \
-gencode=arch=compute_87,code=sm_87
-gencode=arch=compute_86,code=compute_86

Sample flags for generation on CUDA 11.4 for best performance with RTX 3080 cards:

-arch=sm_80 \ 
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_86,code=sm_86 \
-gencode=arch=compute_87,code=sm_87 \
-gencode=arch=compute_86,code=compute_86

Sample flags for generation on CUDA 12 for best performance with GeForce RTX 4080, L40s, L4, and RTX A6000 Ada cards:

-arch=sm_89 \ 
-gencode=arch=compute_89,code=sm_89 \
-gencode=arch=compute_89,code=compute_89

Sample flags for generation on CUDA 12 (PTX ISA version 8.0) for best performance with NVIDIA H100 and H200 (Hopper) GPUs, and no backwards OR FORWARDS compatibility for previous generations:

-arch=sm_90 \ 
-gencode=arch=compute_90,code=sm_90 \
-gencode=arch=compute_90a,code=sm_90a \
-gencode=arch=compute_90a,code=compute_90a

Note that sm_90a implies that it includes specific architecture-accelerated features that are not supported on other architectures, and can’t be run on later generation devices. They are neither forward nor backward compatible.

Sample flags for generation on CUDA 12.8 (PTX ISA version 8.7) for best performance with NVIDIA GB100 and GB20x (Blackwell) GPUs like the B40 or RTX 50 series:

-arch=sm_100 \ 
-gencode=arch=compute_100,code=sm_100 \
-gencode=arch=compute_100,code=compute_100

For RTX 50xx series specifically,

-arch=sm_100 \ 
-gencode=arch=compute_100,code=sm_100 \
-gencode=arch=compute_120,code=sm_120 \
-gencode=arch=compute_120,code=compute_120

To add more compatibility for Blackwell GPUs and some backwards compatibility:

-arch=sm_52 \  
-gencode=arch=compute_52,code=sm_52 \  
-gencode=arch=compute_60,code=sm_60 \  
-gencode=arch=compute_61,code=sm_61 \  
-gencode=arch=compute_70,code=sm_70 \  
-gencode=arch=compute_75,code=sm_75 \ 
-gencode=arch=compute_80,code=sm_80 \ 
-gencode=arch=compute_86,code=sm_86 \ 
-gencode=arch=compute_87,code=sm_87 \
-gencode=arch=compute_90,code=sm_90 \ 
-gencode=arch=compute_100,code=sm_100 \
-gencode=arch=compute_100,code=compute_100

For RTX50xx, and full backwards compatibility, add sm_120 and compute_120, like this:

-arch=sm_52 \  
-gencode=arch=compute_52,code=sm_52 \  
-gencode=arch=compute_60,code=sm_60 \  
-gencode=arch=compute_61,code=sm_61 \  
-gencode=arch=compute_70,code=sm_70 \  
-gencode=arch=compute_75,code=sm_75 \ 
-gencode=arch=compute_80,code=sm_80 \ 
-gencode=arch=compute_86,code=sm_86 \ 
-gencode=arch=compute_87,code=sm_87 \
-gencode=arch=compute_90,code=sm_90 \ 
-gencode=arch=compute_100,code=sm_100 \
-gencode=arch=compute_120,code=sm_120 \
-gencode=arch=compute_120,code=compute_120

Using TORCH_CUDA_ARCH_LIST for PyTorch

If you’re using PyTorch you can set the architectures using the TORCH_CUDA_ARCH_LIST env variable during installation like this:

$ TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6" python3 setup.py install

Note that while you can specify every single arch in this variable, each one will prolong the build time as kernels will have to compiled for every architecture.

You can also tell PyTorch to generate PTX code that is forward compatible by newer cards by adding a +PTX suffix to the most recent architecture you specify:

$ TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6+PTX" python3 build_my_extension.py

Using Cmake for TensorRT

If you’re compiling TensorRT with CMAKE, drop the sm_ and compute_ prefixes, refer only to the compute capabilities instead.

Example for Tesla V100 and Volta cards in general:
cmake <...> -DGPU_ARCHS="70"

Example for NVIDIA RTX 2070 and Tesla T4:
cmake <...> -DGPU_ARCHS="75"

Example for NVIDIA A100:
cmake <...> -DGPU_ARCHS="80"

Example for NVIDIA RTX 3080 and A100 together:
cmake <...> -DGPU_ARCHS="80 86"

Example for NVIDIA H100:
cmake <...> -DGPU_ARCHS="90"

Using Cmake for CUTLASS with Hopper GH100

cmake .. -DCUTLASS_NVCC_ARCHS=90a

What does `"Value 'sm_86' is not defined for option 'gpu-architecture'"` mean?

If you get an error that looks like this:

nvcc fatal : Value 'sm_86' is not defined for option 'gpu-architecture'

You probably have an older version of CUDA and/or the driver installed. Upgrade to a more recent driver, at least 450.36.06 or higher, to support sm_8x cards like the A100, RTX 3080.

What does “CUDA runtime error: operation not supported” mean?

If you get an std::runtime_error that looks like this:

CUDA runtime error: operation not supported

The implication is that your card is not supported with the runtime code that was generated.

Check with nvidia-smi to see which card and driver version you have. Then, try to match the gencodes to generate the correct runtime code suitable for your card.

Posted

27/10/2020

GPUs

Arnon Shimoni

Tags:

compute, compute_52, cuda, cuda arch, gencode, gpu, gtx 1080, nvcc, nvcc arch, nvcc flags, nvcc sm, nvidia, nvidia sm, pascal, sm

Comments

45 responses to “Matching CUDA arch and CUDA gencode for various NVIDIA architectures”

Alexander Stohr

02/12/2016

i can not find any hard information for term “SM62” on the web.
at least some are speculating that it is meant for Tegra.

what are your sources for your statements on “SM62”?

Reply
1. Arnon Shimoni
  
  02/12/2016
  
  You could be right… I’m not entirely sure
  
  Reply
2. Arunabh Athreya
  
  17/12/2019
  
  SM62 is meant for compute capability version 6.2. Tegra X2, Jetson TX2, DRIVE PX 2 and GP10B fall in this category.
  You can find more information in the following wikipedia page:
  
  https://en.wikipedia.org/wiki/CUDA#GPUs_supported
  
  Reply
Yan

08/03/2017

Hi,

Then what happens if I only use the following at compile time
-gencode arch=compute_20,code=\”sm_20,compute_20\”

but run the compiled code on a 5.0 card? The JIT compiler will generate the GPU code, but is it going to compile with
-gencode arch=compute_50,code=\”sm_50,compute_50\”

I’ve been searching the web, but couldn’t find anything. Please advice.

Thanks,

Ian

Reply
1. Arnon Shimoni
  
  09/03/2017
  
  Hey Ian
  If you’re compiling for a 5.0 card, the second option you suggested is better. If you have to have cross-compatibility, I’d recommend the first.
  
  Reply
  1. Matthias
    
    04/10/2023
    
    Thanks great articles!
    
    Reply
jg

24/05/2017

Thank you, very useful, what about sm_37 ?

Reply
1. Arnon Shimoni
  
  24/05/2017
  
  `sm_37` is for the Tesla K80 cards, but our experience proves that it’s not effective to compile for it specifically. sm_30 gives the same results and is better if you also have K40s or similar.
  
  Reply
LostWorld

19/06/2017

kindly help me to find SM for GTX950 and compute_????

Reply
1. Arnon Shimoni
  
  20/06/2017
  
  -gencode=arch=compute_52,code=sm_52 – make sure you have CUDA 6.5 at least.
  
  Reply
  1. LostWorld
    
    20/06/2017
    
    thank you. so nice of u
    
    Reply
Mandar Gogate

23/07/2017

Thank you. 🙂

Reply
dee6600durgesh

09/05/2018

can you help me with gtx860 please

Reply
ku4eto

26/09/2018

Heads up, the Turing SM_80 is incorrect.
Turing uses SM_75 (sm_75/compute_75) according to the NVCC CUDA Toolkit Documentation.

Reply
1. Arnon Shimoni
  
  26/09/2018
  
  Fixed, thanks
  
  Reply
Divya Mohan

08/11/2018

My laptop has CUDA V9.1.85, and GeForce MX130. What SM would be suggested? Thank you in advanced!

Reply
1. Arunabh Athreya
  
  17/12/2019
  
  Hi Divya
  It should be sm61, I think.
  
  Reply
Chris Jacobi

31/01/2019

Thank you, that was very helpful.

Reply
villjoie

15/01/2020

CUDA V10 , and GTX1080. What SM would be suggested? Thank you in advanced!

Reply
Girish Biswas

19/05/2020

In a document, it is said that Compute Capablity 2.0 (Fermi) supports dynamic parallelism.. but when I use it in kernel function it shows error saying: Global function cannot be called from a global function which is only supported in 3.5 architecture!
How can I use dynamic parallelism in 2.0 architecture? Plz help! Thanks in advance

Reply
1. Arnon Shimoni
  
  19/05/2020
  
  From what I recall, the syntax for dynamic parallelism is different between Fermi and subsequent architectures like Kepler.
  The error you’re getting is telling you that. Try asking on the NVIDIA Development forums!
  
  Reply
Kunal Khosla

08/06/2020

I have Nvidia complier 10.2, NVidia 930MX , can anyone help me with the architecture (sm_xx). Thank you in advance

Reply
1. Arnon Shimoni
  
  08/06/2020
  
  930MX is a Maxwell generation card, so `sm_50`
  
  Reply
CUDA

22/07/2020

https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html

Support for Kepler sm_30 and sm_32 architecture based products is dropped.

Support for the following compute capabilities are deprecated in the CUDA Toolkit:

sm_35 (Kepler)
sm_37 (Kepler)
sm_50 (Maxwell)

Reply
NiKo

16/09/2020

Both –gpu-architecture and –gpu-code options can be omitted according to the NVCC CUDA Toolkit Documentation. I want to know the difference between not setting them and setting them. Thank you in advanced!

Reply
1. Arnon Shimoni
  
  17/09/2020
  
  If you don’t specify them, you’ll only get a compilation for the current “default”.
  With CUDA 11, that’s sm_52. It may not be the best option for the GPU you have installed.
  
  Reply
svenstaro

29/09/2020

cuda 11.1 adds 8.6: Added support for NVIDIA Ampere GPU architecture based GA10x GPUs GPUs (compute capability 8.6), including the GeForce RTX-30 series.
https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html

Reply
Manjunath K N

15/12/2020

I have GeForce 840 GPU, CUDA 11.0 installed and a CUDA C++ project created in Visual Studio. My task is to compile this .cu file into PTX as I need this PTX file to create the CudaKernel in my C# code. Code written in C# is as follows where CudaContext and CUModule belongs to the ManagedCUDA library (ManagedCUDA is the wrapper class written to access cuda modules from C#).

CudaContext cntxt = new CudaContext();
CUmodule cumodule = cntxt.LoadModule(@”E:\Manjunath K N\Programs\15-12-2020_2\15-12-2020_2\Debug\kernel.ptx”);
kernel.ptx is the kernel file that I need to load into CUModule. But when this LoadModule is called, I get the error “ErrorNoBinaryForGPU: This indicates that there is no kernel image available that is suitable for the device. This can occur when a user specifies code generation options for a particular CUDA source file that do not include the corresponding device configuration”.

To resolve this I went back to CUDA and checked for the CUDA project properties. I made the required changes following changes. Still I am unable to create the ptx file.
1. Keep preprocessed files as Yes, and
2. NVC Compilation type as “Generate .ptx fier”
3. The device configuration is set to compute_52 and sm_52 by default when the CUDA project was created.
Could you please provide a solution for this?

Reply
1. Arnon Shimoni
  
  15/12/2020
  
  The compute capability for the 840 is `compute_50, sm_50`.
  
  Note that it’s deprecated from CUDA 11 onwards.
  
  Reply
  1. Manjunath K N
    
    15/12/2020
    
    Thank you. Yes my project properties shows that compute capability is _50. Please suggest the solution? Should I use the older version CUDA like 10.0 or 7.5 etc in order to create my CUDA project?
    
    Reply
2. Ioan
  
  05/02/2021
  
  I think for Nvidia GeForce 840, you can do the following in Visual Studio:
  
  Project properties > Configuration Properties > CUDA C/C++ > Device > Code Generation > drop-down list > Edit
  
  The Code Generation window opens.
  Enter
  
  compute_50,sm_50
  
  in the edit field at the top then click OK.
  Back in the previous window click OK.
  
  Reply
Ganesh Rohan

22/06/2021

kindly help me to find it for GTX 1650Ti with cuda 11

Reply
danvd

20/08/2022

tl;dr Run “C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\bin\__nvcc_device_query.exe” to find out the version.

I followed this guide and “thought” compute_87 will work on my RTX 3090. __WRONG__ should be compute_86 !

When I tried to use compute_87
The CUDNN deep neural network I was trying to run had error rate 100%.
The CUDNN API did not return ANY errors!

Only when running my program in NSIGHT Debugger I got cudaErrorNoKernelImageForDevice error…

Reply
1. Arnon Shimoni
  
  21/08/2022
  
  You’re right. It should be sm_87 and compute_86.
  
  Reply
Aaron

19/10/2022

MIght be a silly question but where do you type these changes discussed above? Is it in the bias_act.py found in torch utils? In which script do I do these changes? And how can I just set the GPU to use the sm_75? When I tried running TORCH_CUDA_ARCH_LIST=”7.5″ python3 file.py in the Linux terminal it still kept using the default architectures ie. 5.2 till 8.6. Would truly appreciate your help as I have been stuck on trying to make my code work solely on the 8.0 or less, I just want to exclude the 8.6. I am currently usign pytorch version 1.12.1 (latest one).

Reply
Andrzej

18/01/2023

Hi, this statement is not true:
“If you only mention ‘-gencode‘, but omit the ‘-arch‘ flag, the GPU code generation will occur on the JIT compiler by the CUDA driver.”
– gencode can be used to specify both – PTX and cubin generation

That one can be also misleading:
“you should always compile only one ‘-arch‘ flag that matches your most used GPU cards. This will enable faster runtime, because code generation will occur during compilation.”
– if you specify “-arch=compute_XY” then you obtain PTX without cubin, so it will be JIT

Reply
Ben

25/05/2024

What’s the SM/Compute values for CUDA12 + RTX 4070 gpu?

Reply
1. Arnon Shimoni
  
  26/05/2024
  
  `-arch=sm_89 -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_89,code=compute_89`
  
  Reply
Parth

16/09/2024

This is a very very helpful website. I have bookmarked it. Keep maintaining please 🙂

Reply
Jiacai Lai

29/10/2024

What’s the SM/Compute values for CUDA12.6 + RTX 4090D gpu?

Reply
1. Arnon Shimoni
  
  30/10/2024
  
  `-arch=sm_89 -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_89,code=compute_89`
  
  Reply
Shaun

31/01/2025

Such a great resource! A minor correction: sm_100 / compute_100 requires CUDA 12.8, CUDA 12.6 only goes up to sm_90 / compute_90. CUDA 12.8 also includes support for sm_101 and sm_120, with an approximate mapping to cards of:

10.0 Blackwell B100/B200
10.1 Blackwell Thor
10.1a Blackwell DIGITS
12.0 Blackwell RTX50

Reply
1. Alexis Girault (NVIDIA)
  
  24/04/2025
  
  indeed!
  
  Reply
Martin

21/02/2025

I have this page on my Fav list for nearlya 10 years now and it’s still the best ressource for this matching problem.

Thank you for always keeping it up to date and helping developers all around the world <3

Reply
Alexis Girault (NVIDIA)

09/05/2025

Some references of interest, which could help address some errors in this page:
– https://developer.nvidia.com/cuda-gpus
– https://developer.nvidia.com/blog/nvidia-blackwell-and-nvidia-cuda-12-9-introduce-family-specific-architecture-features/

Reply