16. GPU Compute Programming
Written by Marius Horga & Caroline Begbie


General Purpose GPU (GPGPU) programming uses the many-core GPU architecture to speed up parallel computation. Data-parallel compute processing is useful when you have large chunks of data and need to perform the same operation on each chunk. Examples include machine learning, scientific simulations, ray tracing and image/video processing.

In this chapter, you’ll perform some simple GPU programming and explore how to use the GPU in ways other than vertex rendering.

The Starter Project

➤ Open Xcode and build and run this chapter’s starter project.

The scene contains a lonely garden gnome. The renderer is a simplified forward renderer with no shadows.

From this render, you might think that the gnome is holding the lamp in his left hand. Depending on how you render him, he can be ambidextrous.

➤ Press 1 on your keyboard.

The view changes to the front view. However, the gnome faces towards positive z instead of toward the camera.

The way the gnome renders is due to both math and file formats. In Chapter 6, “Coordinate Spaces”, you learned that this book uses a left-handed coordinate system. This USD file expects a right-handed coordinate system.

If you want a right-handed gnome, there are a few ways to solve this issue:

1. Rewrite all of your coordinate positioning.

2. In vertex_main, invert position.z when rendering the model.

3. On loading the model, invert position.z.

If all of your models are reversed, option #1 or #2 might be good. However, if you only need some models reversed, option #3 is the way to go. All you need is a fast parallel operation. Thankfully, one is available to you using the GPU.

Note: Ideally, you would convert the model as part of your model pipeline rather than in your final app. After flipping the vertices, you can write the model out to a new file.

Winding Order and Culling

Inverting the z position will flip the winding order of vertices, so you may need to consider this. When Model I/O reads in the model, the vertices are in clockwise winding order.
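To see why negating z affects winding, here is a minimal sketch in plain Swift that is not part of the chapter’s project. The Vec3 type and its helpers are hypothetical stand-ins for the simd types. It mirrors one triangle across the z = 0 plane, then compares the normal implied by the vertex order with the mirrored surface normal.

```swift
// Hypothetical stand-in for a simd float3, so the sketch runs without Metal.
struct Vec3 {
  var x: Float, y: Float, z: Float

  static func - (a: Vec3, b: Vec3) -> Vec3 {
    Vec3(x: a.x - b.x, y: a.y - b.y, z: a.z - b.z)
  }
  func cross(_ o: Vec3) -> Vec3 {
    Vec3(x: y * o.z - z * o.y, y: z * o.x - x * o.z, z: x * o.y - y * o.x)
  }
  func dot(_ o: Vec3) -> Float { x * o.x + y * o.y + z * o.z }

  // The same point mirrored across the z = 0 plane.
  var zFlipped: Vec3 { Vec3(x: x, y: y, z: -z) }
}

// Normal implied by the order of the vertices (right-hand rule).
func windingNormal(_ a: Vec3, _ b: Vec3, _ c: Vec3) -> Vec3 {
  (b - a).cross(c - a)
}

let a = Vec3(x: 0, y: 0, z: 1)
let b = Vec3(x: 1, y: 0, z: 1)
let c = Vec3(x: 0, y: 1, z: 2)

let before = windingNormal(a, b, c)
let after = windingNormal(a.zFlipped, b.zFlipped, c.zFlipped)

// A negative value means the vertex order now implies the opposite side
// of the mirrored surface, so the winding order has flipped.
let agreement = after.dot(before.zFlipped)
print(agreement < 0 ? "winding flipped" : "winding kept")
```

The vertex order never changes in memory. Only the geometry is mirrored, which is why the same index order reads as the opposite winding afterward.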

➤ In draw(commandBuffer:scene:uniforms:params:), add this code after renderEncoder.setRenderPipelineState(pipelineState):

renderEncoder.setFrontFacing(.counterClockwise)
renderEncoder.setCullMode(.back)

Rendering with incorrect winding order

Because the winding order of the mesh is currently clockwise, the GPU is culling the wrong faces, and the model appears to be inside-out. Rotate the model to see this more clearly. Inverting the z coordinates will correct the winding order.

Reversing the Model on the CPU

Before working out the parallel algorithm for the GPU, you’ll first explore how to reverse the gnome on the CPU. You’ll compare the performance with the GPU result. In the process, you’ll learn how to access and change Swift data buffer contents with pointers.

➤ In the Geometry folder, open VertexDescriptor.swift. Take a moment to refresh your memory about the layout in which Model I/O loads the model buffers in defaultLayout.

The vertex descriptor loads the model’s vertex data into more than one buffer layout. You’re currently only interested in the first vertex buffer layout, VertexBuffer. It consists of a float3 for Position and a float3 for Normal. You don’t need to consider UVs because they’re in the next layout.

struct VertexLayout {
  vector_float3 position;
  vector_float3 normal;
};

➤ In the Game folder, open GameScene.swift, and add a new method to GameScene:

mutating func convertMesh(_ model: Model) {
  let startTime = CFAbsoluteTimeGetCurrent()
  for mesh in model.meshes {
    // 1
    let vertexBuffer = mesh.vertexBuffers[VertexBuffer.index]
    let count =
      vertexBuffer.length / MemoryLayout<VertexLayout>.stride
    // 2
    var pointer = vertexBuffer
      .contents()
      .bindMemory(to: VertexLayout.self, capacity: count)
    // 3
    for _ in 0..<count {
      // 4
      pointer.pointee.position.z = -pointer.pointee.position.z
      // 5
      pointer = pointer.advanced(by: 1)
    }
  }
  // 6
  print("CPU Time:", CFAbsoluteTimeGetCurrent() - startTime)
}

1. First, you find the number of vertices in the vertex buffer. You calculate the number of vertices in the model by dividing the buffer length by the stride of the vertex layout. The result should match the number of vertices in the file. There are 15949 for the gnome.

2. vertexBuffer is an MTLBuffer, and contents() returns an UnsafeMutableRawPointer to its memory. You bind the buffer contents to pointer, making pointer an UnsafeMutablePointer<VertexLayout>.

3. You then iterate through each vertex.

4. The pointee is an instance of VertexLayout, and you invert the z position.

5. You then advance the pointer to the next vertex instance and continue.

6. Finally, you print out the time taken to do the operation.
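The pointer walk in convertMesh(_:) can be tried without a GPU. The following is a minimal sketch, assuming a hypothetical Swift mirror of VertexLayout and a manually allocated block of memory standing in for the MTLBuffer contents. The flipZ(in:byteCount:) helper is an illustrative name, not something from the project.

```swift
// Hypothetical Swift mirror of the chapter's VertexLayout struct.
struct VertexLayout {
  var position: SIMD3<Float>
  var normal: SIMD3<Float>
}

// Negates z for every vertex stored in a block of raw memory.
func flipZ(in raw: UnsafeMutableRawPointer, byteCount: Int) {
  // The vertex count comes from the byte length, as in the chapter's code.
  let count = byteCount / MemoryLayout<VertexLayout>.stride
  var pointer = raw.bindMemory(to: VertexLayout.self, capacity: count)
  for _ in 0..<count {
    pointer.pointee.position.z = -pointer.pointee.position.z
    pointer = pointer.advanced(by: 1)
  }
}

// Stand-in for MTLBuffer.contents(): a manually allocated block.
let vertexCount = 3
let byteCount = vertexCount * MemoryLayout<VertexLayout>.stride
let raw = UnsafeMutableRawPointer.allocate(
  byteCount: byteCount,
  alignment: MemoryLayout<VertexLayout>.alignment)
let vertices = raw.bindMemory(to: VertexLayout.self, capacity: vertexCount)
for i in 0..<vertexCount {
  (vertices + i).initialize(to: VertexLayout(
    position: SIMD3<Float>(0, 0, Float(i + 1)),
    normal: SIMD3<Float>(0, 0, 1)))
}

flipZ(in: raw, byteCount: byteCount)

// Copy the results out before releasing the memory.
let flippedZ = (0..<vertexCount).map { vertices[$0].position.z }
print(flippedZ)
raw.deallocate()
```

Dividing by stride rather than size matters here: SIMD3<Float> is padded to 16 bytes, so each VertexLayout occupies 32 bytes even though it holds only six floats.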

➤ Add this code to the end of init() to call the new method:

convertMesh(gnome)

A right-handed gnome

The gnome is now right-handed. The console prints the time the CPU took to convert the mesh. Even if that’s pretty fast, the gnome is a small model with just under sixteen thousand vertices.

Look for operations you could possibly do in parallel and process with a GPU kernel. Within the for loop, you perform the same operation on every vertex independently, so it’s a good candidate for GPU compute. Independently is the critical word, as GPU threads perform operations independently from each other.

Compute Processing

In many ways, compute processing is similar to the render pipeline. You set up a command queue and a command buffer. In place of the render command encoder, compute uses a compute command encoder. Instead of using vertex or fragment functions in a compute pass, you use a kernel function. Threads are the input to the kernel function, and the kernel function operates on each thread.

Threads and Threadgroups

To determine how many times you want the kernel function to run, you need to know the size of the array, texture or volume you want to process. This size is the grid and consists of threads organized into threadgroups.

Threads per grid: In this example, the grid is two dimensions, and the number of threads per grid is the image size of 512 by 384.

Threads per threadgroup: Specific to the device, the pipeline state’s threadExecutionWidth suggests the best width for performance, and maxTotalThreadsPerThreadgroup specifies the maximum number of threads in a threadgroup. On a device with 512 as the maximum number of threads, and a thread execution width of 32, the optimal 2D threadgroup size would have a width of 32 and a height of 512 / 32 = 16. So the threads per threadgroup will be 32 by 16.

let threadsPerGrid = MTLSize(width: 512, height: 384, depth: 1)
let width = pipelineState.threadExecutionWidth
let threadsPerThreadgroup = MTLSize(
  width: width,
  height: pipelineState.maxTotalThreadsPerThreadgroup / width,
  depth: 1)
computeEncoder.dispatchThreads(
  threadsPerGrid,
  threadsPerThreadgroup: threadsPerThreadgroup)
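The threadsPerThreadgroup arithmetic can be checked with plain integers. This is a minimal sketch with a hypothetical helper, threadgroupSize(executionWidth:maxTotalThreads:). The device values are assumptions for illustration, since the real numbers come from the pipeline state at runtime.

```swift
// Hypothetical helper: derives a 2D threadgroup size from the two values
// a compute pipeline state reports.
func threadgroupSize(
  executionWidth: Int, maxTotalThreads: Int
) -> (width: Int, height: Int) {
  (width: executionWidth, height: maxTotalThreads / executionWidth)
}

// Assumed device values, for illustration only.
let small = threadgroupSize(executionWidth: 32, maxTotalThreads: 512)
let large = threadgroupSize(executionWidth: 32, maxTotalThreads: 1024)

print("threads per threadgroup: \(small.width) by \(small.height)")
print("threads per threadgroup: \(large.width) by \(large.height)")
```

Keeping the width equal to the execution width and varying only the height means the same code adapts to devices that allow more threads per threadgroup.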

Non-uniform Threadgroups

The threads and threadgroups work out evenly across the grid in the previous image example. However, if the grid size isn’t a multiple of the threadgroup size, Metal provides non-uniform threadgroups.
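As a rough mental model of what non-uniform means, the sketch below splits a one-dimensional grid into threadgroups and lets the last one shrink to fit. The helper name and the sizes are assumptions, and how Metal partitions the grid internally is an implementation detail that this toy model does not claim to reproduce.

```swift
// Splits a one-dimensional grid into threadgroups, letting the last one
// shrink when the grid isn't a multiple of the threadgroup width.
func threadgroupWidths(gridWidth: Int, groupWidth: Int) -> [Int] {
  stride(from: 0, to: gridWidth, by: groupWidth).map { start in
    min(groupWidth, gridWidth - start)
  }
}

// Assumed sizes: a grid of 500 threads with threadgroups of 32.
let widths = threadgroupWidths(gridWidth: 500, groupWidth: 32)
print("threadgroups:", widths.count, "last width:", widths.last ?? 0)
```

Because the widths always add up to the grid size, no thread ever falls outside the grid, and the kernel needs no bounds check.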

Threadgroups per Grid

You can choose how you split up the grid. Threadgroups have the advantage of executing a group of threads together and also sharing a small chunk of memory. It’s common to organize threads into threadgroups to work on smaller parts of the problem independently from other threadgroups.

In the following image, the same two-dimensional grid is split into threadgroups in two different ways, first into one arrangement of threadgroups and then into another.

Threadgroups in a 2D grid

In the kernel function, you can locate each pixel in the grid. The red pixel has the same grid position in both grids.

You can also uniquely identify each thread within the threadgroup. The blue threadgroup in the left and right grids is identified by its position in the grid of threadgroups. The red pixels in both grids are threads identified by their position within their own threadgroup.

let width = 32
let height = 16
let threadsPerThreadgroup = MTLSize(
  width: width, height: height, depth: 1)
let gridWidth = 512
let gridHeight = 384
let threadGroupCount = MTLSize(
  width: (gridWidth + width - 1) / width,
  height: (gridHeight + height - 1) / height,
  depth: 1)
computeEncoder.dispatchThreadgroups(
  threadGroupCount,
  threadsPerThreadgroup: threadsPerThreadgroup)

You specify the threads per threadgroup. In this case, the threadgroup will consist of 32 threads wide, 16 threads high and 1 thread deep.

In the following example, with a threadgroup size of 32 by 16 threads, the threadgroups needed to process the image extend past its edges. You’d have to check that the threadgroup isn’t using threads that are off the edge of the image.

Underutilized threads
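To make the cost of underutilized threads concrete, here is a minimal sketch that repeats the round-up division from the dispatchThreadgroups example and counts how many dispatched threads land outside the image. The 300 by 200 image size is an assumption chosen so that it isn’t a multiple of 32 by 16, and threadgroupCount(gridSize:groupSize:) is a hypothetical helper.

```swift
// Rounds up so that partial threadgroups at the edge are still dispatched.
func threadgroupCount(gridSize: Int, groupSize: Int) -> Int {
  (gridSize + groupSize - 1) / groupSize
}

// Assumed image size, chosen so it isn't a multiple of 32 by 16.
let imageWidth = 300, imageHeight = 200
let groupWidth = 32, groupHeight = 16

let groupsWide = threadgroupCount(gridSize: imageWidth, groupSize: groupWidth)
let groupsHigh = threadgroupCount(gridSize: imageHeight, groupSize: groupHeight)

// Walk every dispatched thread and apply the guard a kernel would need.
var inBounds = 0
var outOfBounds = 0
for y in 0..<(groupsHigh * groupHeight) {
  for x in 0..<(groupsWide * groupWidth) {
    if x < imageWidth && y < imageHeight {
      inBounds += 1      // the thread has a pixel to process
    } else {
      outOfBounds += 1   // the thread must return without touching memory
    }
  }
}

print("threadgroups:", groupsWide, "by", groupsHigh)
print("useful threads:", inBounds, "idle threads:", outOfBounds)
```

The condition inside the loop is the same comparison a kernel would make against the image dimensions before reading or writing.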

Reversing the Gnome Using GPU Compute Processing

The previous example was a two-dimensional image, but you can create grids in one, two or three dimensions. The gnome problem acts on an array in a buffer and will require a one-dimensional grid.

➤ In the Geometry folder, open Model.swift, and add a new method to Model:

func convertMesh() {
  // 1
  guard let commandBuffer =
    Renderer.commandQueue.makeCommandBuffer(),
    let computeEncoder = commandBuffer.makeComputeCommandEncoder()
      else { return }
  // 2
  let startTime = CFAbsoluteTimeGetCurrent()
  // 3
  let pipelineState: MTLComputePipelineState
  do {
    // 4
    guard let kernelFunction =
      Renderer.library.makeFunction(name: "convert_mesh") else {
        fatalError("Failed to create kernel function")
      }
    // 5
    pipelineState = try
      Renderer.device.makeComputePipelineState(
        function: kernelFunction)
  } catch {
    fatalError(error.localizedDescription)
  }
  computeEncoder.setComputePipelineState(pipelineState)
}

1. You create the compute command encoder the same way you created the render command encoder.

2. You add a start time to see how long the conversion takes to execute.

3. For compute processing, you use a compute pipeline state. This requires fewer state changes on the GPU, so you don’t need a descriptor.

4. Soon, you’ll create the kernel function convert_mesh.

5. Finally, you create the pipeline state using the kernel function. You then set the GPU pipeline state on the compute encoder.

➤ Continue by adding the following code to the end of convertMesh():

for mesh in meshes {
  let vertexBuffer = mesh.vertexBuffers[VertexBuffer.index]
  computeEncoder.setBuffer(vertexBuffer, offset: 0, index: 0)
  let vertexCount = vertexBuffer.length /
    MemoryLayout<VertexLayout>.stride
}

Setting up Threadgroups

➤ Within the previous for loop, continue with:

let threadsPerGroup = MTLSize(
  width: pipelineState.threadExecutionWidth,
  height: 1,
  depth: 1)
let threadsPerGrid = MTLSize(width: vertexCount, height: 1, depth: 1)
computeEncoder.dispatchThreads(
  threadsPerGrid,
  threadsPerThreadgroup: threadsPerGroup)
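// End encoding once all dispatches are encoded. This placement assumes the
// model has a single mesh; with several meshes, call it after the for loop.
computeEncoder.endEncoding()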

You set up the grid and threadgroup the same way as the initial image example. Since your model’s vertices are a one-dimensional array, you only set up width. Then, you extract the device-dependent thread execution width from the pipeline state to get the number of threads in a thread group. The grid size is the number of vertices in the model.

Performing Code After Completing GPU Execution

The command buffer can execute a closure after its GPU operations have finished.

➤ Outside the for loop, add this code at the end of convertMesh():

commandBuffer.addCompletedHandler { _ in
  print(
    "GPU conversion time:",
    CFAbsoluteTimeGetCurrent() - startTime)
}
commandBuffer.commit()

The Kernel Function

That completes the Swift setup. You simply specify the kernel function to the pipeline state and create an encoder using that pipeline state. With that, it’s only necessary to give the thread information to the encoder. The rest of the action takes place inside the kernel function.

#import "Common.h"

kernel void convert_mesh(
  device VertexLayout *vertices [[buffer(0)]],
  uint id [[thread_position_in_grid]])
{
  vertices[id].position.z = -vertices[id].position.z;
}

A kernel function can’t have a return value. You pass in the vertex buffer and identify the thread ID using the thread_position_in_grid attribute. You then invert the vertex’s z position.
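If it helps to picture how convert_mesh is invoked, the sketch below emulates the dispatch on the CPU. It assumes DispatchQueue.concurrentPerform as a stand-in for the GPU grid, with each iteration playing the role of one thread. convertMeshKernel is a hypothetical Swift counterpart to the kernel, and only positions are stored to keep the sketch short.

```swift
import Dispatch

// CPU stand-in for the kernel body: one call handles one element,
// selected by id, just as thread_position_in_grid selects one vertex.
func convertMeshKernel(
  _ positions: UnsafeMutableBufferPointer<SIMD3<Float>>, id: Int
) {
  positions[id].z = -positions[id].z
}

var positions = (1...8).map { SIMD3<Float>(0, 0, Float($0)) }

positions.withUnsafeMutableBufferPointer { buffer in
  // Each iteration plays the role of one thread in a one-dimensional grid.
  // Iterations may run concurrently and in any order.
  DispatchQueue.concurrentPerform(iterations: buffer.count) { id in
    convertMeshKernel(buffer, id: id)
  }
}

print(positions.map { $0.z })
```

This is safe only because every iteration touches a different element. As soon as two iterations write to the same location, the result depends on timing.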

➤ Open GameScene.swift. In init(), replace convertMesh(gnome) with:

gnome.convertMesh()

➤ Build and run the app. Press the 1 key for the front view of the model.

Compare the time with the CPU conversion. The time printed in the console will vary by machine; the sample output later in this section shows about 0.0013 seconds. Always check the comparative times, as setting up a GPU pipeline is a time cost. For small operations, it may take less time to do the work on the CPU.

Atomic Functions

Kernel functions perform operations on individual threads. However, you may want to perform an operation that requires information from other threads. For example, you might want to find out the total number of vertices your kernel worked on.

➤ Open Model.swift. In convertMesh(), add the following code before for mesh in meshes:

// Metal's int is 32 bits wide, so use Int32 rather than Swift's 64-bit Int.
let totalBuffer = Renderer.device.makeBuffer(
  length: MemoryLayout<Int32>.stride,
  options: [])
let vertexTotal = totalBuffer?.contents()
  .bindMemory(to: Int32.self, capacity: 1)
vertexTotal?.pointee = 0
computeEncoder.setBuffer(totalBuffer, offset: 0, index: 1)

➤ Still in convertMesh(), add this code to the command buffer’s completion handler:

print("Total Vertices:", vertexTotal?.pointee ?? -1)

➤ Open ConvertMesh.metal and add this code to convert_mesh’s parameters:

device int &vertexTotal [[buffer(1)]],

➤ Then add this code to the end of the function:

vertexTotal++;

You add one to vertexTotal each time the function executes.

GPU conversion time: 0.0012869834899902344
Total Vertices: 2
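A total this low is the signature of a lost update: many threads read the shared value before any of them writes its incremented copy back. The sketch below is a deliberately simplified toy model in plain Swift, assuming threads run in lockstep batches. The batch size of 32 and the thread count of 64 are assumptions, and real GPU scheduling is more involved than this.

```swift
// Toy model: threads run in lockstep batches. Every thread in a batch reads
// the shared total before any of them writes its incremented copy back.
func nonAtomicTotal(threadCount: Int, batchSize: Int) -> Int {
  var total = 0
  var start = 0
  while start < threadCount {
    let batch = min(batchSize, threadCount - start)
    // All threads in the batch read the same stale value first...
    let reads = Array(repeating: total, count: batch)
    // ...then each writes back its own read plus one, overwriting the others.
    for value in reads { total = value + 1 }
    start += batch
  }
  return total
}

// Same model, but each read-modify-write completes before the next begins,
// which is the guarantee an atomic operation provides.
func atomicTotal(threadCount: Int) -> Int {
  var total = 0
  for _ in 0..<threadCount { total += 1 }
  return total
}

let lost = nonAtomicTotal(threadCount: 64, batchSize: 32)
let exact = atomicTotal(threadCount: 64)
print("non-atomic:", lost, "atomic:", exact)
```

In this model, each batch contributes only one increment no matter how many threads it contains, so the count reflects the number of batches rather than the number of threads.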

➤ Still in ConvertMesh.metal, change the vertexTotal parameter to:

device atomic_int &vertexTotal [[buffer(1)]],

Instead of an int, you define an atomic_int, telling the GPU that this will work in shared memory.

➤ Replace vertexTotal++ with:

atomic_fetch_add_explicit(&vertexTotal, 1, memory_order_relaxed);

Since you can’t do simple operations on the atomic variable anymore, you call the built-in function that takes in vertexTotal as the first parameter and the amount to add as the second parameter.

GPU conversion time: 0.0013600587844848633
Total Vertices: 15949

Key Points

GPU compute, or general purpose GPU programming, helps you perform data operations in parallel without using the more specialized rendering pipeline.

You can move any task that operates on multiple items independently to the GPU. Later, you’ll see that you can even move the repetitive task of rendering a scene to a compute shader.

GPU memory is good at simple parallel operations, and with Apple silicon, you can keep chained operations in tile memory instead of moving them back to system memory.

Compute processing uses a compute pipeline with a kernel function.

The kernel function operates on a grid of threads organized into threadgroups. This grid can be 1D, 2D or 3D.

Atomic functions allow inter-thread operations.
