Part I: Introduction to high-performance computing technology CUDA

This course covers the theoretical and practical principles of the massively parallel approach to high-performance computing using multiprocessor systems and/or a combination of GPU hardware and a specialised software environment. The seminar gives an overview of the types of high-performance computing hardware and software architectures, computing algorithms, application libraries and tools. Particular attention is paid to applied interdisciplinary uses of the GPU-based parallel computing platform CUDA, e.g., analysis of large volumes of data, image processing, and machine learning tasks. Alongside the theory, participants have the opportunity to acquire basic skills in developing IT solutions using CUDA.

Day 1

CUDA architecture (30 min):

  • History of GPU development
  • Types of GPU architecture supported by CUDA

CUDA programming (30 min):

  • CUDA programming model
  • Basic principles of CUDA programming
  • Concepts of threads and blocks
  • Data exchange between the CPU and the GPU

Parallel algorithms in the CUDA environment (30 min):

  • Parallel reduction
  • Prefix sum (scan)

Practical task: exercises in developing simple CUDA programs (2 h)

Day 2

CUDA memory hierarchy (30 min):

  • Memory levels
  • Register file, constant memory
  • Global memory
  • Shared memory
  • Texture memory
  • Unified memory

CUDA libraries (20 min):


CUDA interaction with computer graphics (20 min):

  • OpenGL interoperability
  • Image filter

CUDA application in machine learning (20 min):

  • cuDNN deep neural network library
  • TensorFlow machine learning library

Practical task: exercises in using CUDA development tools (2 h):

  • Data processing using cuBLAS and cuRAND
  • Image processing in the GPU environment

Part II: Applied use of the high-performance computing technology CUDA

This course continues the previous course, “Introduction to high-performance computing technology CUDA”, focusing in particular on CUDA implementation on multiprocessor graphics systems and on CUDA cloud computing in a remote server environment. The course concludes with practical CUDA exercises on the RTU HPC cluster.

Day 1

Efficient use of CUDA memory (30 min):

  • Textures, arrays and possibilities of using them
  • Use of CUDA shared memory
  • CUDA unified memory

CUDA streams and events (30 min):

  • Streams and events for parallel execution
  • Asynchronous data copying
  • Concurrent kernel execution and data transfer

Debugging and profiling of CUDA software (30 min):

  • Performance evaluation and metrics
  • Overview of the NVIDIA Nsight tools for debugging and profiling
  • Debugging of CUDA software
  • Profiling of CUDA performance

Practical task: exercises in developing applied CUDA programs (2 h):

  • N-body interaction simulation
  • Interactive fluid simulation
  • Implementation of the normalised correlation algorithm for image processing

Day 2

CUDA on multiprocessor graphics systems (30 min):

  • Data exchange between GPUs
  • Synchronisation of execution

CUDA in a remote server environment (1 h):

  • CUDA as a cloud computing service
  • Operating principles and architecture of an HPC cluster
  • Parallel execution of CUDA jobs on the RTU HPC cluster

Practical task: exercises in developing applied CUDA programs (2 h):

  • CUDA exercises on the RTU HPC cluster
  • Photorealistic rendering in the CUDA environment