Matching CUDA arch and CUDA gencode for various NVIDIA architectures

Updated November 14th 2023

tl;dr

I’ve seen some confusion regarding NVIDIA’s nvcc sm flags and what they’re used for:
When compiling with NVCC, the arch flag (‘-arch‘) specifies the name of the NVIDIA GPU architecture that the CUDA files will be compiled for.
Gencode flags (‘-gencode‘) add further compilation targets (PTX and/or machine code) and can be repeated for as many architectures as you need.
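
For example, a minimal invocation (a sketch, assuming a single source file called kernel.cu and an Ampere A100 as the target) might look like this:

nvcc kernel.cu -o kernel \
 -gencode=arch=compute_80,code=sm_80 \
 -gencode=arch=compute_80,code=compute_80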

Here’s a list of NVIDIA architecture names, and which compute capabilities they have:

  • Fermi: sm_20
  • Kepler: sm_30, sm_35, sm_37
  • Maxwell: sm_50, sm_52, sm_53
  • Pascal: sm_60, sm_61, sm_62
  • Volta: sm_70, sm_72 (Xavier)
  • Turing: sm_75
  • Ampere: sm_80, sm_86, sm_87 (Orin)
  • Ada: sm_89
  • Hopper: sm_90, sm_90a (Thor)
  • Blackwell: ???

Fermi is deprecated from CUDA 9 onwards and Kepler from CUDA 11 onwards
Maxwell is deprecated from CUDA 11.6 onwards

When should different ‘gencodes’ or ‘cuda arch’ be used?

When you compile CUDA code, set the ‘-arch‘ flag (or an equivalent ‘-gencode‘ entry that emits machine code, i.e. code=sm_XX) to match the GPU you use most. This gives the fastest start-up at runtime, because the GPU machine code is generated during compilation rather than on first launch.
If the binary only embeds PTX for an architecture (code=compute_XX) and no matching machine code, the CUDA driver’s JIT compiler generates the GPU code the first time the program runs, which slows down the initial launch.
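
As a concrete illustration of the shorthand (assuming a source file called x.cu), the following two invocations are roughly equivalent; the first is simply a compact form of the second:

nvcc x.cu -arch=sm_52
nvcc x.cu -gencode=arch=compute_52,code=[sm_52,compute_52]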

To speed up CUDA compilation, reduce the number of irrelevant ‘-gencode‘ flags, since each one adds another compilation pass. Conversely, if you need broader backwards compatibility across GPU generations, add more comprehensive ‘-gencode‘ flags.

Before you continue, identify which GPU you have and which CUDA version you have installed.
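
On Linux you can check both from the terminal; on recent drivers, nvidia-smi can even report the compute capability directly (the query field in the last command may not exist on older drivers):

$ nvidia-smi
$ nvcc --version
$ nvidia-smi --query-gpu=name,compute_cap --format=csv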

Supported SM and Gencode variations

Below are the supported sm variations and sample cards from that generation.

I’ve tried to supply representative NVIDIA GPU cards for each architecture name, and CUDA version.

Fermi cards (CUDA 3.2 until CUDA 8)

Deprecated from CUDA 9, support completely dropped from CUDA 10.

  • SM20 or SM_20, compute_20
    GeForce 400, 500, 600, GT-630.
    Completely dropped from CUDA 10 onwards.

Kepler cards (CUDA 5 until CUDA 10)

Deprecated from CUDA 11.

  • SM30 or SM_30, compute_30
    Kepler architecture (e.g. generic Kepler, GeForce 700, GT-730).
    Adds support for unified memory programming
    Completely dropped from CUDA 11 onwards.
  • SM35 or SM_35, compute_35
    Tesla K40.
    Adds support for dynamic parallelism.
    Deprecated from CUDA 11, will be dropped in future versions.
  • SM37 or SM_37, compute_37
    Tesla K80.
    Adds a few more registers.
    Deprecated from CUDA 11, will be dropped in future versions, strongly suggest replacing with a 32GB PCIe Tesla V100.

Maxwell cards (CUDA 6 until CUDA 11)

  • SM50 or SM_50, compute_50
    Tesla/Quadro M series.
    Deprecated from CUDA 11, will be dropped in future versions, strongly suggest replacing with a Quadro RTX 4000 or A6000.
  • SM52 or SM_52, compute_52
    Quadro M6000, GeForce 900, GTX-970, GTX-980, GTX Titan X.
  • SM53 or SM_53, compute_53
    Tegra (Jetson) TX1 / Tegra X1, Drive CX, Drive PX, Jetson Nano.

Pascal (CUDA 8 and later)

  • SM60 or SM_60, compute_60
    Quadro GP100, Tesla P100, DGX-1 (Generic Pascal)
  • SM61 or SM_61, compute_61
    GTX 1080, GTX 1070, GTX 1060, GTX 1050, GT 1030 (GP108), GT 1010 (GP108), Titan Xp, Tesla P40, Tesla P4, Discrete GPU on the NVIDIA Drive PX2
  • SM62 or SM_62, compute_62
    Integrated GPU on the NVIDIA Drive PX2, Tegra (Jetson) TX2

Volta (CUDA 9 and later)

  • SM70 or SM_70, compute_70
    DGX-1 with Volta, Tesla V100, Titan V, Quadro GV100
  • SM72 or SM_72, compute_72
    Jetson AGX Xavier, Drive AGX Pegasus, Xavier NX

Turing (CUDA 10 and later)

  • SM75 or SM_75, compute_75
    GTX/RTX Turing – GTX 1660 Ti, RTX 2060, RTX 2070, RTX 2080, Titan RTX, Quadro RTX 4000, Quadro RTX 5000, Quadro RTX 6000, Quadro RTX 8000, Quadro T1000/T2000, Tesla T4

Ampere (CUDA 11.1 and later)

  • SM80 or SM_80, compute_80
    NVIDIA A100 (the name “Tesla” has been dropped – GA100), NVIDIA DGX-A100
  • SM86 or SM_86, compute_86 (from CUDA 11.1 onwards)
    GA10x cards – RTX 3090, RTX 3080 (GA102), RTX 3070 (GA104), RTX 3060 (GA106), RTX 3050 (GA107), RTX A2000, A3000, A4000, A5000, A6000, NVIDIA A40, A10, A16, A2 Tensor Core GPU, A800 40GB
  • SM87 or SM_87, compute_87 (from CUDA 11.4 onwards, introduced with PTX ISA 7.4 / Driver r470 and newer) – for Jetson AGX Orin and Drive AGX Orin only

Devices of compute capability 8.6 have 2x more FP32 operations per cycle per SM than devices of compute capability 8.0. While a binary compiled for 8.0 will run as is on 8.6, it is recommended to compile explicitly for 8.6 to benefit from the increased FP32 throughput.

https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html#improved_fp32

Ada Lovelace (CUDA 11.8 and later)
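
  • SM89 or SM_89, compute_89
    NVIDIA GeForce RTX 4090, RTX 4080, RTX 6000 Ada, L40, L40S, L4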

Hopper (CUDA 12 and later)

  • SM90 or SM_90, compute_90
    NVIDIA H100 (GH100), NVIDIA H200
  • SM90a or SM_90a, compute_90a – (for PTX ISA version 8.0) – adds acceleration for features like wgmma and setmaxnreg. This is required for NVIDIA CUTLASS

Blackwell (CUDA 12.8 and later)

  • SM100 or SM_100, compute_100
    NVIDIA B100 and B200 (GB100)

Sample nvcc gencode and arch flags

According to NVIDIA:

The arch= clause of the -gencode= command-line option to nvcc specifies the front-end compilation target and must always be a PTX version. The code= clause specifies the back-end compilation target and can either be cubin or PTX or both. Only the back-end target version(s) specified by the code= clause will be retained in the resulting binary; at least one must be PTX to provide Ampere compatibility.
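
For instance (an illustrative sketch, not part of the quoted documentation), a single ‘-gencode‘ clause can carry both back-end targets at once, producing a cubin and PTX for the same architecture:

-gencode=arch=compute_80,code=[sm_80,compute_80]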

Sample flags for generation on CUDA 7.0 for maximum compatibility with all cards from that era:

-arch=sm_30 \
 -gencode=arch=compute_20,code=sm_20 \
 -gencode=arch=compute_30,code=sm_30 \
 -gencode=arch=compute_50,code=sm_50 \
 -gencode=arch=compute_52,code=sm_52 \
 -gencode=arch=compute_52,code=compute_52

Sample flags for generation on CUDA 8.0 for maximum compatibility with cards predating Volta:

-arch=sm_30 \
 -gencode=arch=compute_20,code=sm_20 \
 -gencode=arch=compute_30,code=sm_30 \
 -gencode=arch=compute_50,code=sm_50 \
 -gencode=arch=compute_52,code=sm_52 \
 -gencode=arch=compute_60,code=sm_60 \
 -gencode=arch=compute_61,code=sm_61 \
 -gencode=arch=compute_61,code=compute_61

Sample flags for generation on CUDA 9.2 for maximum compatibility with Volta cards:

-arch=sm_50 \
-gencode=arch=compute_50,code=sm_50 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \ 
-gencode=arch=compute_70,code=compute_70

Sample flags for generation on CUDA 10.1 for maximum compatibility with V100 and T4 Turing cards:

-arch=sm_50 \ 
-gencode=arch=compute_50,code=sm_50 \ 
-gencode=arch=compute_52,code=sm_52 \ 
-gencode=arch=compute_60,code=sm_60 \ 
-gencode=arch=compute_61,code=sm_61 \ 
-gencode=arch=compute_70,code=sm_70 \ 
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_75,code=compute_75 

Sample flags for generation on CUDA 11.0 for maximum compatibility with V100 and T4 Turing cards:

-arch=sm_52 \ 
-gencode=arch=compute_52,code=sm_52 \ 
-gencode=arch=compute_60,code=sm_60 \ 
-gencode=arch=compute_61,code=sm_61 \ 
-gencode=arch=compute_70,code=sm_70 \ 
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_80,code=compute_80 

Sample flags for generation on CUDA 11.7 for maximum compatibility with V100 and T4 Turing Datacenter cards, but also support newer RTX 3080, and Drive AGX Orin:

-arch=sm_52 \ 
-gencode=arch=compute_52,code=sm_52 \ 
-gencode=arch=compute_60,code=sm_60 \ 
-gencode=arch=compute_61,code=sm_61 \ 
-gencode=arch=compute_70,code=sm_70 \ 
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_86,code=sm_86 \
-gencode=arch=compute_87,code=sm_87 \
-gencode=arch=compute_86,code=compute_86

Sample flags for generation on CUDA 11.4 for best performance with RTX 3080 cards:

-arch=sm_80 \ 
-gencode=arch=compute_80,code=sm_80 \
-gencode=arch=compute_86,code=sm_86 \
-gencode=arch=compute_87,code=sm_87 \
-gencode=arch=compute_86,code=compute_86

Sample flags for generation on CUDA 12 for best performance with GeForce RTX 4080, L40S, L4, and RTX 6000 Ada cards:

-arch=sm_89 \ 
-gencode=arch=compute_89,code=sm_89 \
-gencode=arch=compute_89,code=compute_89

Sample flags for generation on CUDA 12 (PTX ISA version 8.0) for best performance with NVIDIA H100 and H200 (Hopper) GPUs, and no backwards compatibility for previous generations:

-arch=sm_90 \ 
-gencode=arch=compute_90,code=sm_90 \
-gencode=arch=compute_90a,code=sm_90a \
-gencode=arch=compute_90a,code=compute_90a

To add more compatibility for Hopper GPUs and some backwards compatibility:

-arch=sm_52 \  
-gencode=arch=compute_52,code=sm_52 \  
-gencode=arch=compute_60,code=sm_60 \  
-gencode=arch=compute_61,code=sm_61 \  
-gencode=arch=compute_70,code=sm_70 \  
-gencode=arch=compute_75,code=sm_75 \ 
-gencode=arch=compute_80,code=sm_80 \ 
-gencode=arch=compute_86,code=sm_86 \ 
-gencode=arch=compute_87,code=sm_87 \
-gencode=arch=compute_90,code=sm_90 \ 
-gencode=arch=compute_90,code=compute_90

Using TORCH_CUDA_ARCH_LIST for PyTorch

If you’re using PyTorch you can set the architectures using the TORCH_CUDA_ARCH_LIST env variable during installation like this:

$ TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6" python3 setup.py install

Note that while you can specify every single arch in this variable, each one will prolong the build time, as kernels will have to be compiled for every architecture.

You can also tell PyTorch to generate PTX code that is forward compatible with newer cards by adding a +PTX suffix to the most recent architecture you specify:

$ TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6+PTX" python3 build_my_extension.py
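
To check which architectures an existing PyTorch build actually contains (a quick sanity check, assuming torch is importable), you can run:

$ python3 -c "import torch; print(torch.cuda.get_arch_list())"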

Using CMake for TensorRT

If you’re compiling TensorRT with CMake, drop the sm_ and compute_ prefixes and refer only to the compute capability numbers instead.

Example for Tesla V100 and Volta cards in general:
cmake <...> -DGPU_ARCHS="70"

Example for NVIDIA RTX 2070 and Tesla T4:
cmake <...> -DGPU_ARCHS="75"

Example for NVIDIA A100:
cmake <...> -DGPU_ARCHS="80"

Example for NVIDIA RTX 3080 and A100 together:
cmake <...> -DGPU_ARCHS="80 86"

Example for NVIDIA H100:
cmake <...> -DGPU_ARCHS="90"

Using CMake for CUTLASS with Hopper GH100

Example for building CUTLASS for Hopper with the sm_90a features enabled:
cmake .. -DCUTLASS_NVCC_ARCHS=90a

What does "Value 'sm_86' is not defined for option 'gpu-architecture'" mean?

If you get an error that looks like this:

nvcc fatal : Value 'sm_86' is not defined for option 'gpu-architecture'

This error comes from nvcc itself, which means the CUDA toolkit you are compiling with is too old to recognise that architecture: sm_86 is only supported from CUDA 11.1 onwards. Upgrade to a newer toolkit (and a sufficiently new driver, at least 450.36.06) to build and run code for sm_8x cards like the A100 and RTX 3080.
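
On recent toolkits (roughly CUDA 11.x and newer) you can also list the architectures your installed nvcc supports, which makes the mismatch easy to spot:

$ nvcc --list-gpu-arch
$ nvcc --list-gpu-code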

What does “CUDA runtime error: operation not supported” mean?

If you get an std::runtime_error that looks like this:

CUDA runtime error: operation not supported

The implication is that your card is not supported by the runtime code that was generated.

Check with nvidia-smi to see which card and driver version you have. Then, try to match the gencodes to generate the correct runtime code suitable for your card.
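
You can also inspect which SASS and PTX targets are embedded in an existing binary with cuobjdump from the CUDA toolkit (my_app below is a placeholder for your executable or library):

$ cuobjdump --list-elf my_app
$ cuobjdump --list-ptx my_app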



36 responses to “Matching CUDA arch and CUDA gencode for various NVIDIA architectures”

  1. Alexander Stohr

    i can not find any hard information for term “SM62” on the web.
    at least some are speculating that it is meant for Tegra.

    what are your sources for your statements on “SM62”?

    1. Arnon Shimoni

      You could be right… I’m not entirely sure

    2. Arunabh Athreya

      SM62 is meant for compute capability version 6.2. Tegra X2, Jetson TX2, DRIVE PX 2 and GP10B fall in this category.
      You can find more information in the following wikipedia page:

      https://en.wikipedia.org/wiki/CUDA#GPUs_supported

  2. Yan

    Hi,

    Then what happens if I only use the following at compile time
    -gencode arch=compute_20,code=\”sm_20,compute_20\”

    but run the compiled code on a 5.0 card? The JIT compiler will generate the GPU code, but is it going to compile with
    -gencode arch=compute_50,code=\”sm_50,compute_50\”

    I’ve been searching the web, but couldn’t find anything. Please advice.

    Thanks,

    Ian

    1. Arnon Shimoni

      Hey Ian
      If you’re compiling for a 5.0 card, the second option you suggested is better. If you have to have cross-compatibility, I’d recommend the first.

      1. Matthias

        Thanks great articles! 

  3. jg

    Thank you, very useful, what about sm_37 ?

    1. Arnon Shimoni

      `sm_37` is for the Tesla K80 cards, but our experience proves that it’s not effective to compile for it specifically. sm_30 gives the same results and is better if you also have K40s or similar.

  4. LostWorld

    kindly help me to find SM for GTX950 and compute_????

    1. Arnon Shimoni

      -gencode=arch=compute_52,code=sm_52 – make sure you have CUDA 6.5 at least.

      1. LostWorld

        thank you. so nice of u

  5. Mandar Gogate

    Thank you. 🙂

  6. dee6600durgesh

    can you help me with gtx860 please

  7. ku4eto

    Heads up, the Turing SM_80 is incorrect.
    Turing uses SM_75 (sm_75/compute_75) according to the NVCC CUDA Toolkit Documentation.

    1. Arnon Shimoni

      Fixed, thanks

  8. Divya Mohan

    My laptop has CUDA V9.1.85, and GeForce MX130. What SM would be suggested? Thank you in advanced!

    1. Arunabh Athreya

      Hi Divya
      It should be sm61, I think.

  9. Chris Jacobi

    Thank you, that was very helpful.

  10. villjoie

    CUDA V10 , and GTX1080. What SM would be suggested? Thank you in advanced!

  11. Girish Biswas

    In a document, it is said that Compute Capablity 2.0 (Fermi) supports dynamic parallelism.. but when I use it in kernel function it shows error saying: Global function cannot be called from a global function which is only supported in 3.5 architecture!
    How can I use dynamic parallelism in 2.0 architecture? Plz help! Thanks in advance

    1. Arnon Shimoni

      From what I recall, the syntax for dynamic parallelism is different between Fermi and subsequent architectures like Kepler.
      The error you’re getting is telling you that. Try asking on the NVIDIA Development forums!

  12. Kunal Khosla

    I have Nvidia complier 10.2, NVidia 930MX , can anyone help me with the architecture (sm_xx). Thank you in advance

    1. Arnon Shimoni

      930MX is a Maxwell generation card, so `sm_50`

  13. CUDA

    https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html

    Support for Kepler sm_30 and sm_32 architecture based products is dropped.

    Support for the following compute capabilities are deprecated in the CUDA Toolkit:

    sm_35 (Kepler)
    sm_37 (Kepler)
    sm_50 (Maxwell)

  14. NiKo

    Both –gpu-architecture and –gpu-code options can be omitted according to the NVCC CUDA Toolkit Documentation. I want to know the difference between not setting them and setting them. Thank you in advanced!

    1. Arnon Shimoni

      If you don’t specify them, you’ll only get a compilation for the current “default”.
      With CUDA 11, that’s sm_52. It may not be the best option for the GPU you have installed.

  15. svenstaro

    cuda 11.1 adds 8.6: Added support for NVIDIA Ampere GPU architecture based GA10x GPUs GPUs (compute capability 8.6), including the GeForce RTX-30 series.
    https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html

  16. Manjunath K N

    I have GeForce 840 GPU, CUDA 11.0 installed and a CUDA C++ project created in Visual Studio. My task is to compile this .cu file into PTX as I need this PTX file to create the CudaKernel in my C# code. Code written in C# is as follows where CudaContext and CUModule belongs to the ManagedCUDA library (ManagedCUDA is the wrapper class written to access cuda modules from C#).

    CudaContext cntxt = new CudaContext();
    CUmodule cumodule = cntxt.LoadModule(@”E:\Manjunath K N\Programs\15-12-2020_2\15-12-2020_2\Debug\kernel.ptx”);
    kernel.ptx is the kernel file that I need to load into CUModule. But when this LoadModule is called, I get the error “ErrorNoBinaryForGPU: This indicates that there is no kernel image available that is suitable for the device. This can occur when a user specifies code generation options for a particular CUDA source file that do not include the corresponding device configuration”.

    To resolve this I went back to CUDA and checked for the CUDA project properties. I made the required changes following changes. Still I am unable to create the ptx file.
    1. Keep preprocessed files as Yes, and
    2. NVC Compilation type as “Generate .ptx fier”
    3. The device configuration is set to compute_52 and sm_52 by default when the CUDA project was created.
    Could you please provide a solution for this?

    1. Arnon Shimoni

      The compute capability for the 840 is `compute_50, sm_50`.

      Note that it’s deprecated from CUDA 11 onwards.

      1. Manjunath K N

        Thank you. Yes my project properties shows that compute capability is _50. Please suggest the solution? Should I use the older version CUDA like 10.0 or 7.5 etc in order to create my CUDA project?

    2. Ioan

      I think for Nvidia GeForce 840, you can do the following in Visual Studio:

      Project properties > Configuration Properties > CUDA C/C++ > Device > Code Generation > drop-down list > Edit

      The Code Generation window opens.
      Enter

      compute_50,sm_50

      in the edit field at the top then click OK.
      Back in the previous window click OK.

  17. Ganesh Rohan

    kindly help me to find it for GTX 1650Ti with cuda 11

  18. danvd

    tl;dr Run “C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\bin\__nvcc_device_query.exe” to find out the version.

    I followed this guide and “thought” compute_87 will work on my RTX 3090. __WRONG__ should be compute_86 !

    When I tried to use compute_87
    The CUDNN deep neural network I was trying to run had error rate 100%.
    The CUDNN API did not return ANY errors!

    Only when running my program in NSIGHT Debugger I got cudaErrorNoKernelImageForDevice error…

    1. Arnon Shimoni

      You’re right. It should be sm_87 and compute_86.

  19. Aaron

    MIght be a silly question but where do you type these changes discussed above? Is it in the bias_act.py found in torch utils? In which script do I do these changes? And how can I just set the GPU to use the sm_75? When I tried running TORCH_CUDA_ARCH_LIST=”7.5″ python3 file.py in the Linux terminal it still kept using the default architectures ie. 5.2 till 8.6. Would truly appreciate your help as I have been stuck on trying to make my code work solely on the 8.0 or less, I just want to exclude the 8.6. I am currently usign pytorch version 1.12.1 (latest one).

  20. Andrzej

    Hi, this statement is not true:
    “If you only mention ‘-gencode‘, but omit the ‘-arch‘ flag, the GPU code generation will occur on the JIT compiler by the CUDA driver.”
    – gencode can be used to specify both – PTX and cubin generation

    That one can be also misleading:
    “you should always compile only one ‘-arch‘ flag that matches your most used GPU cards. This will enable faster runtime, because code generation will occur during compilation.”
    – if you specify “-arch=compute_XY” then you obtain PTX without cubin, so it will be JIT
