16. GPU Compute Programming
Written by Caroline Begbie & Marius Horga

General Purpose GPU (GPGPU) programming uses the many-core GPU architecture to speed up parallel computation. Data-parallel compute processing is useful when you have large chunks of data and need to perform the same operation on each chunk. Examples include machine learning, scientific simulations, ray tracing and image/video processing.

In this chapter, you’ll perform some simple GPU programming and explore how to use the GPU in ways other than vertex rendering.

The Starter Project

➤ Open Xcode and build and run this chapter’s starter project.

The scene contains a lonely garden gnome. The renderer is a simplified forward renderer with no shadows.

From this render, you might think that the gnome is holding the lamp in his left hand. Depending on how you render him, he can be ambidextrous.

➤ Press 1 on your keyboard.

The view changes to the front view. However, the gnome faces towards positive z instead of toward the camera.

The way the gnome renders is due to both math and file formats. In Chapter 6, “Coordinate Spaces”, you learned that this book uses a left-handed coordinate system. This USD file expects a right-handed coordinate system.

If you want a right-handed gnome, there are a few ways to solve this issue:

1. Rewrite all of your coordinate positioning.

2. In vertex_main, invert position.z when rendering the model.

3. On loading the model, invert position.z.

If all of your models are reversed, option #1 or #2 might be good. However, if you only need some models reversed, option #3 is the way to go. All you need is a fast parallel operation. Thankfully, one is available to you using the GPU.

Note: Ideally, you would convert the model as part of your model pipeline rather than in your final app. After flipping the vertices, you can write the model out to a new file.

Winding Order and Culling

Inverting the z position will flip the winding order of vertices, so you may need to consider this. When Model I/O reads in the model, the vertices are in clockwise winding order.

➤ In draw(commandBuffer:scene:uniforms:params:), add this code after renderEncoder.setRenderPipelineState(pipelineState):

renderEncoder.setFrontFacing(.counterClockwise)
renderEncoder.setCullMode(.back)

Rendering with incorrect winding order

Because the winding order of the mesh is currently clockwise, the GPU is culling the wrong faces, and the model appears to be inside-out. Rotate the model to see this more clearly. Inverting the z coordinates will correct the winding order.
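To see why negating z interacts with winding order, here is a minimal sketch in plain Swift. The `Vec3` type and the helper functions are hypothetical and exist only for this illustration. The sketch computes the normal implied by a triangle's vertex order before and after mirroring in z, and compares it with the normal the mirrored surface should have.

```swift
// Minimal 3D vector so the sketch needs no graphics framework (hypothetical type).
struct Vec3: Equatable {
  var x: Float
  var y: Float
  var z: Float

  static func - (lhs: Vec3, rhs: Vec3) -> Vec3 {
    Vec3(x: lhs.x - rhs.x, y: lhs.y - rhs.y, z: lhs.z - rhs.z)
  }
}

func cross(_ a: Vec3, _ b: Vec3) -> Vec3 {
  Vec3(
    x: a.y * b.z - a.z * b.y,
    y: a.z * b.x - a.x * b.z,
    z: a.x * b.y - a.y * b.x)
}

// The normal implied by the order in which the triangle's vertices are listed.
func windingNormal(_ triangle: [Vec3]) -> Vec3 {
  cross(triangle[1] - triangle[0], triangle[2] - triangle[0])
}

// Mirrors a point or direction in the z axis.
func mirrorZ(_ v: Vec3) -> Vec3 {
  Vec3(x: v.x, y: v.y, z: -v.z)
}

let triangle = [
  Vec3(x: 0, y: 0, z: 1),
  Vec3(x: 1, y: 0, z: 2),
  Vec3(x: 0, y: 1, z: 3)
]
let originalNormal = windingNormal(triangle)
let flippedNormal = windingNormal(triangle.map(mirrorZ))
// The direction the mirrored surface should face.
let expectedNormal = mirrorZ(originalNormal)
```

Here `flippedNormal` points exactly opposite to `expectedNormal`: the vertex order of the mirrored triangle now describes the other side of the face, which is what a flipped winding order means.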

Reversing the Model on the CPU

Before working out the parallel algorithm for the GPU, you’ll first explore how to reverse the gnome on the CPU. You’ll compare the performance with the GPU result. In the process, you’ll learn how to access and change Swift data buffer contents with pointers.

➤ In the Geometry group, open VertexDescriptor.swift. Take a moment to refresh your memory about the layout in which Model I/O loads the model buffers in defaultLayout.

Five buffers are involved, but you’re only interested in the first one, VertexBuffer. It consists of a float3 for Position and a float3 for Normal. You don’t need to consider UVs because they’re in the next layout.

struct VertexLayout {
  vector_float3 position;
  vector_float3 normal;
};

➤ In the Game group, open GameScene.swift, and add a new method to GameScene:

mutating func convertMesh(_ model: Model) {
  let startTime = CFAbsoluteTimeGetCurrent()
  for mesh in model.meshes {
    // 1
    let vertexBuffer = mesh.vertexBuffers[VertexBuffer.index]
    let count =
      vertexBuffer.length / MemoryLayout<VertexLayout>.stride
    // 2
    var pointer = vertexBuffer
      .contents()
      .bindMemory(to: VertexLayout.self, capacity: count)
    // 3
    for _ in 0..<count {
      // 4
      pointer.pointee.position.z = -pointer.pointee.position.z
      // 5
      pointer = pointer.advanced(by: 1)
    }
  }
  // 6
  print("CPU Time:", CFAbsoluteTimeGetCurrent() - startTime)
}

1. First, you find the number of vertices in the vertex buffer. You calculate the number of vertices in the model by dividing the buffer length by the size (stride) of the vertex attribute layout. The result should match the number of vertices in the file. There are 15949 for the gnome.

2. vertexBuffer.contents() returns a raw pointer to the MTLBuffer's memory. You bind the buffer contents to pointer, making pointer an UnsafeMutablePointer<VertexLayout>.

3. You then iterate through each vertex.

4. The pointee is an instance of VertexLayout, and you invert the z position.

5. You then advance the pointer to the next vertex instance and continue.

6. Finally, you print out the time taken to do the operation.
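The pointer steps above can be tried without Metal. The following is a minimal sketch that assumes a hypothetical Swift mirror of VertexLayout built from SIMD3<Float>, and uses a manually allocated raw buffer as a stand-in for the MTLBuffer contents. The function names are made up for this example.

```swift
// Hypothetical Swift mirror of the chapter's VertexLayout struct.
struct VertexLayout {
  var position: SIMD3<Float>
  var normal: SIMD3<Float>
}

// Negates z for every vertex stored in a raw byte buffer and
// returns the number of vertices visited.
func flipZ(in rawBuffer: UnsafeMutableRawPointer, byteLength: Int) -> Int {
  let count = byteLength / MemoryLayout<VertexLayout>.stride
  var pointer = rawBuffer.bindMemory(to: VertexLayout.self, capacity: count)
  for _ in 0..<count {
    pointer.pointee.position.z = -pointer.pointee.position.z
    pointer = pointer.advanced(by: 1)
  }
  return count
}

// Builds a three-vertex buffer, flips it and returns the resulting z values.
func demoFlippedZValues() -> [Float] {
  let vertexCount = 3
  let byteLength = vertexCount * MemoryLayout<VertexLayout>.stride
  // Stand-in for the memory that MTLBuffer.contents() would point to.
  let raw = UnsafeMutableRawPointer.allocate(
    byteCount: byteLength,
    alignment: MemoryLayout<VertexLayout>.alignment)
  defer { raw.deallocate() }

  let vertices = raw.initializeMemory(
    as: VertexLayout.self,
    repeating: VertexLayout(position: .zero, normal: SIMD3<Float>(0, 0, 1)),
    count: vertexCount)
  for i in 0..<vertexCount {
    vertices[i].position = SIMD3<Float>(Float(i), 0, Float(i + 1))
  }

  let visited = flipZ(in: raw, byteLength: byteLength)
  assert(visited == vertexCount)
  return (0..<vertexCount).map { vertices[$0].position.z }
}

let flippedZ = demoFlippedZValues()
```

The z values start as 1, 2 and 3, so `flippedZ` ends up as -1, -2 and -3. Dividing by the stride rather than the size matters because the stride includes any padding between consecutive elements.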

➤ Add this code to the end of init() to call the new method:

convertMesh(gnome)

A right-handed gnome

The gnome is now right-handed, and the console shows the time taken. The operation is pretty fast, but the gnome is a small model with only sixteen thousand vertices.

Look for operations you could possibly do in parallel and process with a GPU kernel. Within the for loop, you perform the same operation on every vertex independently, so it’s a good candidate for GPU compute. Independently is the critical word, as GPU threads perform operations independently from each other.
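As a rough CPU-side analogy for independent per-element work, the sketch below uses Dispatch's concurrentPerform to run one iteration per element. This is not how the chapter solves the problem, and the function name is hypothetical. It only illustrates that iterations which touch disjoint elements can run in any order, or at the same time.

```swift
import Dispatch

// Negates every element, letting Dispatch spread the iterations across CPU cores.
func negatedInParallel(_ values: [Float]) -> [Float] {
  var result = values
  result.withUnsafeMutableBufferPointer { buffer in
    guard let base = buffer.baseAddress else { return }
    let count = buffer.count
    DispatchQueue.concurrentPerform(iterations: count) { index in
      // Each iteration reads and writes only its own element,
      // so no iteration depends on any other.
      base[index] = -base[index]
    }
  }
  return result
}

let zValues: [Float] = [0.5, -1.25, 3, 0]
let negated = negatedInParallel(zValues)
```

If an iteration needed the result of another iteration, this pattern would no longer be safe, and the same would be true of a GPU kernel.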

Compute Processing

In many ways, compute processing is similar to the render pipeline. You set up a command queue and a command buffer. In place of the render command encoder, compute uses a compute command encoder. Instead of using vertex or fragment functions in a compute pass, you use a kernel function. Threads are the input to the kernel function, and the kernel function operates on each thread.

Threads and Threadgroups

To determine how many times you want the kernel function to run, you need to know the size of the array, texture or volume you want to process. This size is the grid and consists of threads organized into threadgroups.

Threads per grid: In this example, the grid is two dimensions, and the number of threads per grid is the image size of 512 by 384.

Threads per threadgroup: Specific to the device, the pipeline state’s threadExecutionWidth suggests the best width for performance, and maxTotalThreadsPerThreadgroup specifies the maximum number of threads in a threadgroup. On a device with 512 as the maximum number of threads, and a thread execution width of 32, the optimal 2D threadgroup size would have a width of 32 and a height of 512 / 32 = 16. So the threads per threadgroup will be 32 by 16.

let threadsPerGrid = MTLSize(width: 512, height: 384, depth: 1)
let width = pipelineState.threadExecutionWidth
let threadsPerThreadgroup = MTLSize(
  width: width,
  height: pipelineState.maxTotalThreadsPerThreadgroup / width,
  depth: 1)
computeEncoder.dispatchThreads(
  threadsPerGrid,
  threadsPerThreadgroup: threadsPerThreadgroup)
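The threadgroup arithmetic can be checked without a GPU. This minimal sketch uses a hypothetical `GridSize` struct as a stand-in for MTLSize, and assumes example device values of 32 for the thread execution width and 512 for the maximum threads per threadgroup. Real devices report their own values.

```swift
// Stand-in for MTLSize so the sketch runs without Metal (hypothetical type).
struct GridSize: Equatable {
  var width: Int
  var height: Int
  var depth: Int
}

// Builds a 2D threadgroup that is one execution width wide and as tall
// as the device's thread limit allows.
func threadgroupSize(executionWidth: Int, maxTotalThreads: Int) -> GridSize {
  GridSize(
    width: executionWidth,
    height: maxTotalThreads / executionWidth,
    depth: 1)
}

// Assumed example values, not read from a real device.
let group = threadgroupSize(executionWidth: 32, maxTotalThreads: 512)
let threadsInGroup = group.width * group.height * group.depth
```

With these inputs the threadgroup is 32 by 16 by 1, which uses all 512 threads the assumed device allows.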

Non-uniform Threadgroups

The threads and threadgroups work out evenly across the grid in the previous image example. However, if the grid size isn’t a multiple of the threadgroup size, Metal provides non-uniform threadgroups.
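Conceptually, a non-uniform dispatch lets the threadgroups at the edge of the grid be smaller than the rest. The sketch below only illustrates that arithmetic along one dimension. It is not a description of how Metal schedules threads internally, and the function name is hypothetical.

```swift
// Splits one grid dimension into threadgroups, letting the last one shrink
// when the grid length is not a multiple of the threadgroup length.
func threadgroupLengths(gridLength: Int, groupLength: Int) -> [Int] {
  precondition(gridLength >= 0 && groupLength > 0)
  let fullGroups = gridLength / groupLength
  let remainder = gridLength % groupLength
  var lengths = Array(repeating: groupLength, count: fullGroups)
  if remainder > 0 {
    lengths.append(remainder)
  }
  return lengths
}

// A 100-thread-wide grid with 32-thread-wide threadgroups.
let lengths = threadgroupLengths(gridLength: 100, groupLength: 32)
let coveredThreads = lengths.reduce(0, +)
```

Three full threadgroups of 32 are followed by one of 4, so exactly 100 threads run and the kernel needs no bounds check.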

Threadgroups per Grid

You can choose how you split up the grid. Threadgroups have the advantage of executing a group of threads together and also sharing a small chunk of memory. It’s common to organize threads into threadgroups to work on smaller parts of the problem independently from other threadgroups.

In the following image, a grid is split into threadgroups in two different ways.

Threadgroups in a 2D grid

In the kernel function, you can locate each pixel in the grid by its position in the grid.

You can also uniquely identify each thread within its threadgroup, so the same pixel has both a position in the grid and a position within its own threadgroup.

let width = 32
let height = 16
let threadsPerThreadgroup = MTLSize(
  width: width, height: height, depth: 1)
let gridWidth = 512
let gridHeight = 384
let threadGroupCount = MTLSize(
  width: (gridWidth + width - 1) / width,
  height: (gridHeight + height - 1) / height,
  depth: 1)
computeEncoder.dispatchThreadgroups(
  threadGroupCount,
  threadsPerThreadgroup: threadsPerThreadgroup)

You specify the threads per threadgroup. In this case, the threadgroup will consist of 32 threads wide, 16 threads high and 1 thread deep.

With a threadgroup size of 32 by 16 threads, the rounded-up threadgroup count covers the whole image. When the image size isn’t an exact multiple of the threadgroup size, some of the dispatched threads fall outside the image. You’d have to check that the threadgroup isn’t using threads that are off the edge of the image.

Underutilized threads
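To make the cost of underutilized threads concrete, here is a minimal sketch in plain Swift. The 500 by 380 image size is a hypothetical example chosen because it is not a multiple of the 32 by 16 threadgroup. The loop emulates the bounds check each kernel thread would need.

```swift
// Rounds up so the threadgroups always cover the whole dimension.
func groupCount(gridLength: Int, groupLength: Int) -> Int {
  (gridLength + groupLength - 1) / groupLength
}

// Hypothetical image that is not a multiple of the 32 by 16 threadgroup.
let imageWidth = 500
let imageHeight = 380
let groupWidth = 32
let groupHeight = 16

let groupsAcross = groupCount(gridLength: imageWidth, groupLength: groupWidth)
let groupsDown = groupCount(gridLength: imageHeight, groupLength: groupHeight)
let dispatchedThreads = groupsAcross * groupWidth * groupsDown * groupHeight

// Emulate the per-thread guard a kernel would need:
// only threads whose position lies inside the image do any work.
var threadsInsideImage = 0
for y in 0..<(groupsDown * groupHeight) {
  for x in 0..<(groupsAcross * groupWidth) {
    if x < imageWidth && y < imageHeight {
      threadsInsideImage += 1
    }
  }
}
let idleThreads = dispatchedThreads - threadsInsideImage
```

The dispatch covers 16 by 24 threadgroups, or 196608 threads, while the image only has 190000 pixels. The remaining 6608 threads must return early without touching memory.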

Reversing the Gnome Using GPU Compute Processing

The previous example was a two-dimensional image, but you can create grids in one, two or three dimensions. The gnome problem acts on an array in a buffer and will require a one-dimensional grid.

➤ In the Geometry group, open Model.swift, and add a new method to Model:

func convertMesh() {
  // 1
  guard let commandBuffer =
    Renderer.commandQueue.makeCommandBuffer(),
    let computeEncoder = commandBuffer.makeComputeCommandEncoder()
      else { return }
  // 2
  let startTime = CFAbsoluteTimeGetCurrent()
  // 3
  let pipelineState: MTLComputePipelineState
  do {
    // 4
    guard let kernelFunction =
      Renderer.library.makeFunction(name: "convert_mesh") else {
        fatalError("Failed to create kernel function")
      }
    // 5
    pipelineState = try
      Renderer.device.makeComputePipelineState(
        function: kernelFunction)
  } catch {
    fatalError(error.localizedDescription)
  }
  computeEncoder.setComputePipelineState(pipelineState)
}

1. You create the compute command encoder the same way you created the render command encoder.

2. You add a start time so you can measure how long the conversion takes to execute.

3. For compute processing, you use a compute pipeline state. This requires fewer state changes on the GPU, so you don’t need a descriptor.

4. Soon, you’ll create the kernel function convert_mesh.

5. Finally, you create the pipeline state using the kernel function. You then set the GPU pipeline state on the compute encoder.

➤ Continue by adding the following code to the end of convertMesh():

for mesh in meshes {
  let vertexBuffer = mesh.vertexBuffers[VertexBuffer.index]
  computeEncoder.setBuffer(vertexBuffer, offset: 0, index: 0)
  let vertexCount = vertexBuffer.length /
    MemoryLayout<VertexLayout>.stride
}

Setting up Threadgroups

➤ Within the previous for loop, after calculating vertexCount, continue with:

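let threadsPerGroup = MTLSize(
  width: pipelineState.threadExecutionWidth,
  height: 1,
  depth: 1)
let threadsPerGrid = MTLSize(width: vertexCount, height: 1, depth: 1)
computeEncoder.dispatchThreads(
  threadsPerGrid,
  threadsPerThreadgroup: threadsPerGroup)
// Ending encoding inside the loop is only safe when the model has one mesh.
// With several meshes, call endEncoding() once, after the loop.
computeEncoder.endEncoding()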

You set up the grid and threadgroup the same way as in the initial image example. Since your model’s vertices are a one-dimensional array, you only set up width. Then, you extract the device-dependent thread execution width from the pipeline state to get the number of threads in a thread group. The grid size is the number of vertices in the model.

Performing Code After Completing GPU Execution

The command buffer can execute a closure after its GPU operations have finished.

➤ Outside the for loop, add this code at the end of convertMesh():

commandBuffer.addCompletedHandler { _ in
  print(
    "GPU conversion time:",
    CFAbsoluteTimeGetCurrent() - startTime)
}
commandBuffer.commit()

The Kernel Function

That completes the Swift setup. You create a pipeline state from the kernel function and set that pipeline state on the compute encoder. After that, the encoder only needs the thread information. The rest of the action takes place inside the kernel function.

#import "Common.h"

kernel void convert_mesh(
  device VertexLayout *vertices [[buffer(0)]],
  uint id [[thread_position_in_grid]])
{
  vertices[id].position.z = -vertices[id].position.z;
}

A kernel function can’t have a return value. You pass in the vertex buffer and identify the thread ID using the thread_position_in_grid attribute. You then invert the vertex’s z position.

➤ Open GameScene.swift. In init(), replace convertMesh(gnome) with:

gnome.convertMesh()

➤ Build and run the app. Press the 1 key for the front view of the model.

Compare the time shown in the console with the CPU conversion. Always check the comparative times, as setting up a GPU pipeline is a time cost. It may take less time to perform the operation on the CPU on small operations.

Atomic Functions

Kernel functions perform operations on individual threads. However, you may want to perform an operation that requires information from other threads. For example, you might want to find out the total number of vertices your kernel worked on.

➤ Open Model.swift. In convertMesh(), add the following code before for mesh in meshes:

// The kernel's int and atomic_int are 32 bits wide, so use Int32
// here rather than Swift's 64-bit Int.
let totalBuffer = Renderer.device.makeBuffer(
  length: MemoryLayout<Int32>.stride,
  options: [])
let vertexTotal = totalBuffer?.contents().bindMemory(to: Int32.self, capacity: 1)
vertexTotal?.pointee = 0
computeEncoder.setBuffer(totalBuffer, offset: 0, index: 1)

➤ Still in convertMesh(), add this code to the command buffer’s completion handler:

print("Total Vertices:", vertexTotal?.pointee ?? -1)

➤ Open ConvertMesh.metal and add this code to convert_mesh’s parameters:

device int &vertexTotal [[buffer(1)]],

vertexTotal++;

You add one to vertexTotal each time the function executes.

GPU conversion time: 0.0012869834899902344
Total Vertices: 2

➤ Still in ConvertMesh.metal, change the vertexTotal parameter to:

device atomic_int &vertexTotal [[buffer(1)]],

Instead of an int, you define an atomic_int, telling the GPU that this will work in shared memory.

➤ Replace vertexTotal++ with:

atomic_fetch_add_explicit(&vertexTotal, 1, memory_order_relaxed);

Since you can’t do simple operations on the atomic variable anymore, you call the built-in function that takes in vertexTotal as the first parameter and the amount to add as the second parameter.

GPU conversion time: 0.0013600587844848633
Total Vertices: 15949
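A CPU-side analogy may help show why the shared total needs protection. The sketch below is not a GPU atomic: it swaps in an NSLock around a shared counter while Dispatch runs the iterations concurrently. The function name is hypothetical. Like atomic_fetch_add_explicit, the lock makes each read-modify-write on the total indivisible, so no increment is lost.

```swift
import Dispatch
import Foundation

// Counts how many times the body ran, guarding the shared total with a lock
// so that concurrent increments cannot overwrite each other.
func countIterations(_ iterations: Int) -> Int {
  let lock = NSLock()
  var total = 0
  DispatchQueue.concurrentPerform(iterations: iterations) { _ in
    lock.lock()
    total += 1
    lock.unlock()
  }
  return total
}

// One iteration per gnome vertex, mirroring one GPU thread per vertex.
let countedVertices = countIterations(15949)
```

Without the lock, two iterations could read the same old value and both write back the same new one, which is the kind of lost update behind the incorrect total printed earlier.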

Key Points

GPU compute, or general purpose GPU programming, helps you perform data operations in parallel without using the more specialized rendering pipeline.

You can move any task that operates on multiple items independently to the GPU. Later, you’ll see that you can even move the repetitive task of rendering a scene to a compute shader.

GPU memory is good at simple parallel operations, and with Apple silicon, you can keep chained operations in tile memory instead of moving them back to system memory.

Compute processing uses a compute pipeline with a kernel function.

The kernel function operates on a grid of threads organized into threadgroups. This grid can be 1D, 2D or 3D.

Atomic functions allow inter-thread operations.
