To achieve the best possible performance whilst being portable, GPU code should be generated for the architecture(s) it will be executed upon.
This is controlled by passing -gencode arguments to NVCC which, unlike the -arch and -code arguments, allow for ‘fatbinary’ executables that are optimised for multiple device architectures.
Each -gencode argument requires two values, the virtual architecture and the real architecture, for use in NVCC’s two-stage compilation. For example, -gencode=arch=compute_60,code=sm_60 specifies a virtual architecture of compute_60 and a real architecture of sm_60. Note that there must be no space after the comma.
The lowest virtual architecture specified must be less than or equal to the Compute Capability of the GPU that executes the code.
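The argument format and the rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of NVCC or of any cluster tooling: the helper names gencode_flags and can_run are hypothetical, and the compute capabilities 35 and 60 are example values only, not a statement about which GPUs a particular cluster has. The extra code=compute_XX entry for the highest architecture embeds PTX so that newer GPUs can JIT-compile the code; whether you want it is a build choice.

```python
def gencode_flags(capabilities, embed_ptx=True):
    """Build NVCC -gencode arguments (hypothetical helper).

    capabilities: compute capabilities as integers, e.g. 35 for 3.5, 60 for 6.0.
    embed_ptx:    also embed PTX for the highest architecture so that GPUs
                  newer than any listed real architecture can JIT-compile it.
    """
    caps = sorted(set(capabilities))
    # One virtual/real architecture pair per target GPU generation.
    flags = [f"-gencode=arch=compute_{cc},code=sm_{cc}" for cc in caps]
    if embed_ptx and caps:
        top = caps[-1]
        flags.append(f"-gencode=arch=compute_{top},code=compute_{top}")
    return flags


def can_run(device_cc, capabilities):
    """Check the necessary condition from the text: the lowest virtual
    architecture built must not exceed the device's compute capability."""
    return min(capabilities) <= device_cc


if __name__ == "__main__":
    # Example values only; substitute the capabilities of your own GPUs.
    print("nvcc " + " ".join(gencode_flags([35, 60])) + " -o app app.cu")
    print(can_run(60, [35, 60]))  # device with compute capability 6.0
    print(can_run(30, [35, 60]))  # device with compute capability 3.0
```

The printed command line can be pasted into a build script; can_run only checks the condition stated above and does not query the GPU itself.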
To build a CUDA application that targets any GPU on the HPC cluster “Rudens”, use the following -gencode arguments (for CUDA 8.0):