General Purpose GPU (GPGPU) programming uses the many-core GPU architecture to speed up parallel computation. Data-parallel compute processing is useful when you have large chunks of data and need to perform the same operation on each chunk. Examples include machine learning, scientific simulations, ray tracing and image/video processing.
In this chapter, you’ll perform some simple GPU programming and explore how to use the GPU in ways other than vertex rendering.
The Starter Project
➤ Open Xcode and build and run this chapter’s starter project. The scene contains a lonely warrior. The renderer is the forward renderer using your Phong shader.
The starter project
From this render, you might think that the warrior is left-handed. Depending on how you render him, he can be ambidextrous.
➤ Press 1 on your keyboard.
The view changes to the front view. However, the warrior faces towards positive z instead of toward the camera.
Facing backwards
The way the warrior renders is due to both math and file formats. In Chapter 6, “Coordinate Spaces”, you learned that this book uses a left-handed coordinate system. Blender exports the obj file for use in a right-handed coordinate system.
If you want a right-handed warrior, there are a few ways to solve this issue:
1. Rewrite all of your coordinate positioning.
2. In vertex_main, invert position.z when rendering the model.
3. On loading the model, invert position.z.
If all of your models are reversed, option #1 or #2 might be good. However, if you only need some models reversed, option #3 is the way to go. All you need is a fast parallel operation. Thankfully, one is available to you using the GPU.
Note: Ideally, you would convert the model as part of your model pipeline rather than in your final app. After flipping the vertices, you can write the model out to a new file.
Winding Order and Culling
Inverting the z position flips the winding order of the vertices, so you'll need to account for that. When Model I/O reads in the model, the vertices are in clockwise winding order.
➤ Open ForwardRenderPass.swift.
➤ In draw(commandBuffer:scene:uniforms:params:), add this code after renderEncoder.setRenderPipelineState(pipelineState):
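The listing itself is elided in this excerpt. Based on the surrounding text — counterclockwise front faces and back-face culling — a sketch using the standard MTLRenderCommandEncoder API would be:

```swift
// Treat counterclockwise vertices as front-facing (the default is clockwise)
renderEncoder.setFrontFacing(.counterClockwise)
// Skip rasterizing faces that point away from the camera
renderEncoder.setCullMode(.back)
```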
Here, you tell the GPU to treat vertices wound in counterclockwise order as front-facing. The default is clockwise. You also tell the GPU to cull any faces that point away from the camera. As a general rule, you should cull back faces since they're usually hidden, and rendering them isn't necessary.
➤ Build and run the app.
Rendering with incorrect winding order
Because the winding order of the mesh is currently clockwise, the GPU is culling the wrong faces, and the model appears to be inside-out. Rotate the model to see what this produces. Inverting the z coordinates will correct the winding order.
Reversing the Model on the CPU
Before working out the parallel algorithm for the GPU, you’ll first explore how to reverse the warrior on the CPU. You’ll compare the performance with the GPU result. In the process, you’ll learn how to access and change Swift data buffer contents with pointers.
➤ In the Geometry group, open VertexDescriptor.swift. Take a moment to refresh your memory about the order in which Model I/O loads the model buffers in defaultLayout.
Some buffers are optional, but you're only interested in the first one, VertexBuffer. It consists of a float3 for Position and a float3 for Normal. You don't need to consider UVs because they're in the last buffer.
➤ In the Shaders group, open Common.h, and add a new structure:
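The structure itself isn't shown in this excerpt. Given that the Swift code below divides the buffer length by MemoryLayout<VertexLayout>.stride, and that the buffer holds a float3 position and a float3 normal, a plausible sketch is:

```c
// Hypothetical reconstruction: the field order must match VertexBuffer
typedef struct {
  vector_float3 position;
  vector_float3 normal;
} VertexLayout;
```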
➤ In the Game group, open GameScene.swift, and add a new method to GameScene:
mutating func convertMesh(_ model: Model) {
  let startTime = CFAbsoluteTimeGetCurrent()
  for mesh in model.meshes {
    // 1
    let vertexBuffer = mesh.vertexBuffers[VertexBuffer.index]
    let count =
      vertexBuffer.length / MemoryLayout<VertexLayout>.stride
    // 2
    var pointer = vertexBuffer
      .contents()
      .bindMemory(to: VertexLayout.self, capacity: count)
    // 3
    for _ in 0..<count {
      // 4
      pointer.pointee.position.z = -pointer.pointee.position.z
      // 5
      pointer = pointer.advanced(by: 1)
    }
  }
  // 6
  print("CPU Time:", CFAbsoluteTimeGetCurrent() - startTime)
}
Here's a code breakdown:
1. First, you get the vertex buffer and calculate the number of vertices in the model by dividing the buffer length by the stride of the vertex structure layout. The result should match the vertex count of the model file. There are a few thousand for the warrior.
2. vertexBuffer.contents() returns an UnsafeMutableRawPointer. You bind the buffer contents to pointer, making pointer an UnsafeMutablePointer<VertexLayout>.
3. You then iterate through each vertex.
4. The pointee is an instance of VertexLayout, and you invert the z position.
5. You then advance the pointer to the next vertex instance and continue.
6. Finally, you print out the time taken to do the operation.
➤ Add this code to the end of init() to call the new method:
convertMesh(warrior)
➤ Build and run the app.
A right-handed warrior
The warrior is now right-handed. On an M1 Mac mini, the conversion takes a fraction of a second. That's pretty fast, but the warrior is a small model with only a few thousand vertices.
For loops like this are ideal candidates to refactor and process with a GPU kernel. Inside the for loop, you perform the same operation on every vertex independently, so it's a good candidate for GPU compute. Interestingly, on the hardware side, GPU threads also perform their operations independently from each other.
Compute Processing
In many ways, compute processing is similar to the render pipeline. You set up a command queue and a command buffer. In place of the render command encoder, compute uses a compute command encoder. Instead of using vertex or fragment functions in a compute pass, you use a kernel function. Threads are the input to the kernel function, and the kernel function operates on each thread.
Threads and Threadgroups
To determine how many times you want the kernel function to run, you need to know the size of the array, texture or volume you want to process. This size is the grid and consists of threads organized into threadgroups.
The grid is defined in three dimensions: width, height and depth. But often, especially when you're processing images, you'll only ever work with a 1D or 2D grid. Every point in the grid runs one instance of the kernel function, each in a separate thread.
➤ Look at the following example image:
Threads and threadgroups
The image is 512×384 pixels. You need to tell the GPU the number of threads per grid and the number of threads per threadgroup.
Threads per grid: In this example, the grid is two dimensions, and the number of threads per grid is the image size of 512 by 384.
Threads per threadgroup: Specific to the device, the pipeline state's threadExecutionWidth suggests the best width for performance, and maxTotalThreadsPerThreadgroup specifies the maximum number of threads in a threadgroup. On a device with 512 as the maximum number of threads, and a thread execution width of 32, the optimal 2D threadgroup size would have a width of 32 and a height of 512 / 32 = 16. So the threads per threadgroup will be 32 by 16.
let threadsPerGrid = MTLSize(width: 512, height: 384, depth: 1)
let width = pipelineState.threadExecutionWidth
let threadsPerThreadgroup = MTLSize(
  width: width,
  height: pipelineState.maxTotalThreadsPerThreadgroup / width,
  depth: 1)
computeEncoder.dispatchThreads(
  threadsPerGrid,
  threadsPerThreadgroup: threadsPerThreadgroup)
You specify the threads per grid and let the pipeline state work out the optimal threads per threadgroup.
Non-uniform Threadgroups
In the previous image example, the threads and threadgroups divide evenly across the grid. However, if the grid size isn't a multiple of the threadgroup size, Metal provides non-uniform threadgroups.
Non-uniform threadgroups
Non-uniform threadgroups are only a feature of Apple GPU family 4 and upwards. The feature was first introduced with A11 devices running iOS 11. A11 chips first appeared in iPhone 8.
Threadgroups per Grid
You can choose how you split up the grid. Threadgroups have the advantage of executing a group of threads together and also sharing a small chunk of memory. It’s common to organize threads into threadgroups to work on smaller parts of the problem independently from other threadgroups.
In the following image, a 2D grid of threads splits into threadgroups, and each threadgroup into a small block of individual threads.
Threadgroups in a 2D grid
In the kernel function, you can locate each pixel in the grid. The red pixel in the dark gray threadgroup has a unique position within the grid.
You can also uniquely identify each thread within its threadgroup. The blue threadgroups on the left and on the right each have their own threadgroup position within the grid, and the red pixels in dark gray are threads located at the same local position within their own threadgroups.
You have control over the number of threadgroups. However, you may need to add an extra threadgroup in each dimension of the grid to make sure the threadgroups cover the entire grid.
Using the earlier image example, you would set up the threadgroups in the compute encoder like this:
let width = 32
let height = 16
let threadsPerThreadgroup = MTLSize(
  width: width, height: height, depth: 1)
let gridWidth = 512
let gridHeight = 384
let threadGroupCount = MTLSize(
  width: (gridWidth + width - 1) / width,
  height: (gridHeight + height - 1) / height,
  depth: 1)
computeEncoder.dispatchThreadgroups(
  threadGroupCount,
  threadsPerThreadgroup: threadsPerThreadgroup)
You specify the threads per threadgroup. In this case, each threadgroup will consist of 32 threads wide, 16 threads high and 1 thread deep.
If the size of your data doesn't match the size of the grid, you may have to perform boundary checks in the kernel function.
In the following example, with a threadgroup size of 32 by 16 threads, the number of threadgroups necessary to process the image rounds up in each dimension. You'd have to check that the threadgroups aren't using threads that are off the edge of the image.
Underutilized threads
The threads that are off the edge are underutilized. That is, they're threads that you dispatched, but there was no work for them to do.
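A boundary check in the kernel is the usual fix for those extra threads. A sketch, with a hypothetical image kernel (the name process_image and the texture index are assumptions, not from this chapter's project):

```metal
#include <metal_stdlib>
using namespace metal;

kernel void process_image(
  texture2d<float, access::write> output [[texture(0)]],
  uint2 id [[thread_position_in_grid]])
{
  // Ignore threads dispatched past the edge of the image
  if (id.x >= output.get_width() || id.y >= output.get_height()) {
    return;
  }
  output.write(float4(1, 0, 0, 1), id);
}
```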
Reversing the Warrior Using GPU Compute Processing
The previous example was a two-dimensional image, but you can create grids in one, two or three dimensions. The warrior problem acts on an array in a buffer and will require a one-dimensional grid.
➤ In the Geometry group, open Model.swift, and add a new method to Model:
func convertMesh() {
  // 1
  guard let commandBuffer =
    Renderer.commandQueue.makeCommandBuffer(),
    let computeEncoder = commandBuffer.makeComputeCommandEncoder()
  else { return }
  // 2
  let startTime = CFAbsoluteTimeGetCurrent()
  // 3
  let pipelineState: MTLComputePipelineState
  do {
    // 4
    guard let kernelFunction =
      Renderer.library.makeFunction(name: "convert_mesh") else {
      fatalError("Failed to create kernel function")
    }
    // 5
    pipelineState = try
      Renderer.device.makeComputePipelineState(
        function: kernelFunction)
  } catch {
    fatalError(error.localizedDescription)
  }
  computeEncoder.setComputePipelineState(pipelineState)
}
You attach a closure that calculates the amount of time the procedure takes and prints it out. You then commit the command buffer to the GPU.
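The timing and commit code described above is elided in this excerpt. A sketch, assuming the commandBuffer, computeEncoder and startTime constants defined in convertMesh() (the per-mesh buffer binding and thread dispatch also belong before encoding ends):

```swift
// Print the elapsed time once the GPU finishes the work
commandBuffer.addCompletedHandler { _ in
  print("GPU Time:", CFAbsoluteTimeGetCurrent() - startTime)
}
computeEncoder.endEncoding()
// Send the encoded work to the GPU
commandBuffer.commit()
```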
The Kernel Function
That completes the Swift setup. You create a pipeline state from the kernel function and set it on the compute encoder. With that, it's only necessary to give the thread information to the encoder. The rest of the action takes place inside the kernel function.
➤ In the Shaders group, create a new Metal file named ConvertMesh.metal, and add:
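The kernel source is elided in this excerpt. A sketch consistent with the Swift setup above — the function name convert_mesh and the VertexLayout structure from Common.h; the buffer index is an assumption:

```metal
#include <metal_stdlib>
using namespace metal;
#import "Common.h"

// One thread runs per vertex; each thread flips one z coordinate
kernel void convert_mesh(
  device VertexLayout *vertices [[buffer(0)]],
  uint id [[thread_position_in_grid]])
{
  vertices[id].position.z = -vertices[id].position.z;
}
```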
Compare the time with the CPU conversion. On an M1 Mac mini, the GPU conversion also completes in a fraction of a second. Always check the comparative times, as setting up a GPU pipeline is a time cost. It may take more time to perform the operation on the GPU for small operations.
Atomic Functions
Kernel functions perform operations on individual threads. However, you may want to perform an operation that requires information from other threads. For example, you might want to find out the total number of vertices your kernel worked on.
Your kernel function operates on each thread independently, and those threads update buffer locations simultaneously. If you give the kernel function a variable in a buffer to store the total, the function can increment the total, but other threads may be trying to touch the same chunk simultaneously. Therefore, you won't get the correct total.
An atomic operation works on shared memory and is visible to other threads.
➤ Open Model.swift. In convertMesh(), add the following code before for mesh in meshes:
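The listing is elided in this excerpt. A sketch matching the description that follows — the name vertexTotalBuffer and the buffer index are assumptions:

```swift
// Buffer to hold the running vertex total, initialized to zero
guard let vertexTotalBuffer = Renderer.device.makeBuffer(
  length: MemoryLayout<Int32>.stride,
  options: []) else { return }
let vertexTotalPointer = vertexTotalBuffer.contents()
  .bindMemory(to: Int32.self, capacity: 1)
vertexTotalPointer.pointee = 0
// Make the buffer visible to the kernel function
computeEncoder.setBuffer(vertexTotalBuffer, offset: 0, index: 1)
```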
Here, you create a buffer to hold the total number of vertices. You bind the buffer to a pointer and set the contents to zero. You then send the buffer to the GPU.
➤ Still in convertMesh(), add this code to the command buffer's completion handler:
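The handler code is elided in this excerpt. A one-line sketch, assuming the vertexTotalPointer bound earlier:

```swift
// Inside the command buffer's completion handler,
// read back the total after the GPU finishes:
print("Total Vertices:", vertexTotalPointer.pointee)
```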
Since you can't do simple operations on the atomic variable directly, you call the built-in atomic function with vertexTotal as the first parameter and the amount to add as the second parameter.
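The kernel changes aren't shown in this excerpt. A sketch of the atomic update, assuming the buffer index chosen on the Swift side and MSL's atomic_fetch_add_explicit:

```metal
#include <metal_stdlib>
using namespace metal;
#import "Common.h"

kernel void convert_mesh(
  device VertexLayout *vertices [[buffer(0)]],
  // The atomic total shared by every thread
  device atomic_int *vertexTotal [[buffer(1)]],
  uint id [[thread_position_in_grid]])
{
  vertices[id].position.z = -vertices[id].position.z;
  // Safely add 1 to the shared total from each thread
  atomic_fetch_add_explicit(vertexTotal, 1, memory_order_relaxed);
}
```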
➤ Build and run the app.
Now the correct number of vertices prints out in the debug console.
There are various atomic functions, and you can find out more about them in the "Atomic Functions" section of Apple's Metal Shading Language Specification.
GPU compute, or general purpose GPU programming, helps you perform data operations in parallel without using the more specialized rendering pipeline.
You can move any task that operates on multiple items independently to the GPU. Later, you’ll see that you can even move the repetitive task of rendering a scene to a compute shader.
GPU memory is good at simple parallel operations, and with Apple Silicon, you can keep chained operations in tile memory instead of moving them back to system memory.
Compute processing uses a compute pipeline with a kernel function.
The kernel function operates on a grid of threads organized into threadgroups. This grid can be 1D, 2D or 3D.