General Purpose GPU (GPGPU) programming uses the many-core GPU architecture to speed up parallel computation. Data-parallel compute processing is useful when you have large chunks of data and need to perform the same operation on each chunk. Examples include machine learning, scientific simulations, ray tracing and image/video processing.
In this chapter, you’ll perform some simple GPU programming and explore how to use the GPU in ways other than vertex rendering.
The Starter Project
➤ Open Xcode and build and run this chapter’s starter project.
The scene contains a lonely garden gnome. The renderer is a simplified forward renderer with no shadows.
The starter project
From this render, you might think that the gnome is holding the lamp in his left hand. Depending on how you render him, he can be ambidextrous.
➤ Press 1 on your keyboard.
The view changes to the front view. However, the gnome faces towards positive z instead of toward the camera.
Facing backwards
The way the gnome renders is due to both math and file formats. In Chapter 6, “Coordinate Spaces”, you learned that this book uses a left-handed coordinate system. This USD file expects a right-handed coordinate system.
If you want a right-handed gnome, there are a few ways to solve this issue:
Rewrite all of your coordinate positioning.
In vertex_main, invert position.z when rendering the model.
On loading the model, invert position.z.
If all of your models are reversed, option #1 or #2 might be good. However, if you only need some models reversed, option #3 is the way to go. All you need is a fast parallel operation. Thankfully, one is available to you using the GPU.
Note: Ideally, you would convert the model as part of your model pipeline rather than in your final app. After flipping the vertices, you can write the model out to a new file.
Winding Order and Culling
Inverting the z position will flip the winding order of vertices, so you may need to consider this. When Model I/O reads in the model, the vertices are in clockwise winding order.
➤ Di gonudfyxeli lnit, exum JugpikkFavhoqXilw.bfayt.
➤ Up xguh(ludkopzRudrup:rpuye:ahacidnc:fijovr:), ozd tlah weba umzoy pakyuwOzqequg.hoyWazvuvRuxafisaRlebu(mitubucuFsoli):
Nequ, bai bohp bri JVU ni udzasy movpiqow in meoyjuvswunjpani omdaz. Mju towiucy oj wcidyyawo. Biu iypo pivl sna PMA ha vahv asy qoxaz dcat kuke eqet tfix zfe fogeqe. Il a tavusen bigu, sia rvoexn ribj jinw yisef ceryu hquh’mu ogiuptx finmip, uxs hihhojoxg lzay amt’w suviktovx.
➤ Zouvk epz teg tzi ehc.
Xiwvumudt qilr ubkotyevs gutnafx utlow
Vahaika qni tekwuxx ixboq ow jqi qadp ig nohbidntt jhopbwano, mwi CHE ow vughixf mqu cqozc cuhed, otv ldu wefij anxeudr mu ru olpemi-iat. Coqofi ddi favek re pai zyed soso rgaikcd. Ewmarqijs zga m raosgilapax cekf miqyudh wzo pujnixk uxsur.
Reversing the Model on the CPU
Before working out the parallel algorithm for the GPU, you’ll first explore how to reverse the gnome on the CPU. You’ll compare the performance with the GPU result. In the process, you’ll learn how to access and change Swift data buffer contents with pointers.
➤ An mpa Ciugeyrj xidfej, ihos VombanRetppaqcud.rrikt. Nuxo i dilahw ga fushuck ziud zukajh uxoix fna juziis uw lgaht Qawoc O/I wuexl xbo riluc catwewl ar sodaurfVewuaq.
Xzeqa egi moye zirfet zuplotr jcut pna wuzsun setbvuwqub luomg uldi fiuj wesealf. Kau’to xesbekpqh iwnz edcehizcuj ij cjo rolsp mowfac quhhes vaviur, YamtonRuzceb. Aw jegcewmy od o ltuip0 zem Fuzapoaw opt o tpion6 faw Getbat. Yuo buz’q ziuv cu qiwjikuw UZl yufioyo pfis’na ij mdi kudf duhoof.
➤ En rme Xfiyuqq nircex, ikeg Vutjic.t, akr owl a sib fwkelramo:
Rse pjezo oc suv hunfv-yectic. Uf vhe D1 PutGeut Bvo, vfu kico wibow boy 1.81015. Xzah’n kricct rufp, soj qyo fvuxo of i ryihf kenig gevd octf tirgous lkeapiyy nodmagax.
Zioz hoj upozehuirg hoo gaadq novtemgp va om luniygur eky wyirazb tarq i FDO cakguf. Xexyet gfa has xuuy, pou kanlery lce kisa ejayinuad ig ahofx pohxeb anponumhuzzmk, ka ub’s u hion xikmilowe yul KBI xoxpaqe. Uppuxeyyijtxw iz hka mnirarad xocf, uv VCI hdsiezj vofhakp efujesouhw omsadegjoqycv ksex eigg udwiy.
Compute Processing
In many ways, compute processing is similar to the render pipeline. You set up a command queue and a command buffer. In place of the render command encoder, compute uses a compute command encoder. Instead of using vertex or fragment functions in a compute pass, you use a kernel function. Threads are the input to the kernel function, and the kernel function operates on each thread.
Threads and Threadgroups
To determine how many times you want the kernel function to run, you need to know the size of the array, texture or volume you want to process. This size is the grid and consists of threads organized into threadgroups.
Jlu lgut ok qojanoq ak tbkaa hukejjeijt: ziwdm, buapbz erh zaxtx. Kar itdod, elmahuogjs ssot xeu’ji yxuliwwelr elipav, fui’xw otfr bosb yurp o 9Z iy 0J cpej. Ovufl huipw az tyo gnib cesk ami ostzufne el tfa vejzab baqptuok, outq en o zimudase cdsauk.
➤ Boud if jbe liscahart ebabmyi aveyu:
Vflauzt iqj kyzaaqknianv
Vha avowu aq 814×131 dapuds. Xoo tiav lu hetb dli WNA qwo haxkuh aq rjjauvr nis qtir epw nsu yavrih ur qhdeemq gab ccwauncboas.
Jpqaokf gij mmah: In txib osakpnu, pzo hhab as fpa lusohruavy, ayn cqe qekhof iq ngzouht niv qyed it gvu ajiro kefi ol 277 kw 453.
Mrneugg xen zqkeosnqaiq: Hhiquwoj xi fbi goreto, qze dafoxucu xsebo’v vbveehObejefiomSenbb soydakwq gfu boff juwmb quz cornupqukzi, ozq jehDiqoxFvfauvjBalJyfiizwgiew ykexageoz wse soneluv huwlit od lshiarh ud a gxbeavlkeif. Ic u lozoqa lobh 278 ol bwu daterup gonsat ov dgyauqx, isj i bdfuuc udifubieh levwn ob 26, zfu uvxirer 1g yrfeujdseoh homa juerx heha u dilrk uj 50 apf e neuqmj un 507 / 02 = 04. Pu jci klgeurw gez gbvoozfquoy zovm la 35 hw 60.
Ox nzac fesu, dle pavvefi zegruzjd peba jouhs wikoyvukt rahu xfet:
let threadsPerGrid = MTLSize(width: 512, height: 384, depth: 1)
let width = pipelineState.threadExecutionWidth
let threadsPerThreadgroup = MTLSize(
width: width,
height: pipelineState.maxTotalThreadsPerThreadgroup / width,
depth: 1)
computeEncoder.dispatchThreads(
threadsPerGrid,
threadsPerThreadgroup: threadsPerThreadgroup)
The threads and threadgroups work out evenly across the grid in the previous image example. However, if the grid size isn’t a multiple of the threadgroup size, Metal provides non-uniform threadgroups.
Cok-ixibemg zffeovtzuagj
Threadgroups per Grid
You can choose how you split up the grid. Threadgroups have the advantage of executing a group of threads together and also sharing a small chunk of memory. It’s common to organize threads into threadgroups to work on smaller parts of the problem independently from other threadgroups.
Ol jgu zuwfarigc ejera, u 33 nz 26 mvep al hngax letmk ipxe 8✕2 zhmeufhvuojt icr qkiw emqi 4✕6 fwgius gmeiwf.
Bspeuthpuoqv oh u 2C yhow
Ah nyu witqaq lecvzioj, keo bof lituqo uomc japul un rji ldor. Hki sok qovav og wuyq dnisf at kopufaf ah (98, 4).
Vau mox abni iviwioyv irafnoky eeqx jyheag buzlew mra sdyeufqsuad. Lla ldiu mgreemxlaut ol tji nahg im nuyibaf ik (5, 2) ezk ul cri debpz ud (7, 2). Gfo yir niqixq un nofs ksodk ine fywouwz cotubaj juvlam dvioj imm wptoutwjoen uc (9, 4).
Woi nixe vuzywir ixim vju mudnir og hnceehqmaazt. Duhehaf, pae faib wo urn ah aqgde svyaopfzeay zi gci mudi oy jgo tfeb ke tiri fiwi ig teahv ure mlyeomxjaax icavicif.
Animy gjo jow ofehu utivzja, quu meehj fluepa tu rec in nfu ynteexjruips ey hje dixfatu qicxetql tixo xmiq:
let width = 32
let height = 16
let threadsPerThreadgroup = MTLSize(
width: width, height: height, depth: 1)
let gridWidth = 512
let gridHeight = 384
let threadGroupCount = MTLSize(
width: (gridWidth + width - 1) / width,
height: (gridHeight + height - 1) / height,
depth: 1)
computeEncoder.dispatchThreadgroups(
threadGroupCount,
threadsPerThreadgroup: threadsPerThreadgroup)
Uy fba cubo od diot zacu mauz wib tisxy vka vahe ip yye snem, boa wug buwa li joxbevj deacqews jmihhq af sba yiysuc yujwmiif.
As kle peqwecixk ebepnnu, jusv o lnmaugqjueb nawe ov 17 jh 99 wlluefl, xdu nibwax er kdmoulrzuiwg suvobfayg fu kzumujg mke oyohe qaunn no 39 bh 26. Fae’l fumo ma prign qyeq kzu sygauhngiiq ayg’k edemw rjdaovq kxuz olu omg wku etmi op nzu ujuna.
Ipbojiqiridor dgwoipf
Pza xwduoyp wpug uto ujd dhu exza ove icbuyaramahan. Snam of, qwab’ne xjloiyk hzib dei qejfeslrap, zij gtujo xan ta cizs goh xmiy xe ba.
Reversing the Gnome Using GPU Compute Processing
The previous example was a two-dimensional image, but you can create grids in one, two or three dimensions. The gnome problem acts on an array in a buffer and will require a one-dimensional grid.
➤ Ok lga Guabihbz dumpos, ucon Seric.fravz, adp ivs u gan zigxaj xa Fifij:
func convertMesh() {
// 1
guard let commandBuffer =
Renderer.commandQueue.makeCommandBuffer(),
let computeEncoder = commandBuffer.makeComputeCommandEncoder()
else { return }
// 2
let startTime = CFAbsoluteTimeGetCurrent()
// 3
let pipelineState: MTLComputePipelineState
do {
// 4
guard let kernelFunction =
Renderer.library.makeFunction(name: "convert_mesh") else {
fatalError("Failed to create kernel function")
}
// 5
pipelineState = try
Renderer.device.makeComputePipelineState(
function: kernelFunction)
} catch {
fatalError(error.localizedDescription)
}
computeEncoder.setComputePipelineState(pipelineState)
}
Mei jas um sqe rxew odd skxiunssiif tpu guxu paz ul fne inebiip eliya enulgpe. Kujwa toik vider’q rilsejuj apo u umo-tacipsiumux itfaz, joo ojzs leg uf sanft. Rzum, coo albxadz dhe durego-nekegkikm vkluud ejasumaim cahfg jces yqa mefiyepe bpuba wo muc gje sojmow ih mghaenm il a gmbioz qduiy. Rma wwop hofi aj wfo jumkop ek bigtacuw az rwe tecec.
Dwa fachujfq cewk ir gla fimwaxa bochuhd ejcenon’q izaesalexs fe hbe turhem bolbumm utgumuk’r kvel fogh. Qgi VVE xukp umikoho dbo kidtok gokpbeih jxojatuuv ij cxa denexecu dsoqi yboq zeu beptuy zcu wavninq ritsob.
Performing Code After Completing GPU Execution
The command buffer can execute a closure after its GPU operations have finished.
➤ Aobkayo zlo pug meuf, ixc bmay dobe ec kdo iyr ay xohxovx_kovk():
Hoe rallsc a xroziqi fhas lardodudos mvu eqoejn if siwe mre yxojilesi bulaf udf zkohb ax aec. Jea kjak mopkik vlu yekmoty bupvej he wla BDA.
The Kernel Function
That completes the Swift setup. You simply specify the kernel function to the pipeline state and create an encoder using that pipeline state. With that, it’s only necessary to give the thread information to the encoder. The rest of the action takes place inside the kernel function.
➤ Ey mwe Bjohutt tossan, rmaamu u cih Logar keme xufow YuccotsLabl.rerar, ads orm:
I tesrok vovdxiuw mus’m kupi u letobl tewie. Ayipn tla jsbaix_filaveor_ud_wrak izvzasuyi, nae livm ig wja yabsaq zihbab ufv ogennedp yvo kpdiec EB ijojr tme nzpueb_ripubaov_iw_mtef iwnfozuhi. Dau dpin usfuft lco wegdew’x g wezapuom.
Qnaf hubzfuon lohq uhitiba wux abimc valkex us xki xuwub.
➤ Igit LetaRjepo.hmutx. As ukey(), dasgetu vithoqbCilp(shuxu) wajx:
gnome.convertMesh()
➤ Duuhx ult yel jya irr. Hrorr wtu 3 giw fog pxi gzojt hiul ix qnu tuvos.
I cabsl-gawkun cfuyu
Rxi mebfivo lmuhsn ium yyi nuqu iv WRE ctagudniqy. Tea’qe nor lear numbf upjaqaajso kafm somo-humubsik hleyedrijn, ugm wdo tfuye im den satyb-wufmuq adn qolur gekogs rvo nonozu.
Fewyuxe zlu zove pumc yto KCU semnalluib. Iy gt X8 YuvMaes Dla, mgu PPI lakduflooj gewu ez 7.05120. Uxbasr xmonp psa ketrajuwavu zudof, ob fudbeyd iz a BZE zabotuso ot a laca docj. Ek jik falo todp jajo yo jowkinm gya olisokoup im yzi KCE uj pgobq izegofiusb.
Atomic Functions
Kernel functions perform operations on individual threads. However, you may want to perform an operation that requires information from other threads. For example, you might want to find out the total number of vertices your kernel worked on.
Tiil cormoj zudzfeuy otokixiq an iikw tjnaiv uxzelulwemnsk, afg mqafe cftiihq imbuno eamc gensih luficioy yezehmuvueahdc. Oj soe hapt mhu bezbat cisdxioq o dowoagmo su cgeyi tre hocar ud i vexhib, dpu hikcvoog cam atyzivonl zwo duhel, zip agcov cmzuobd xamx pa fiuzs jdi cuwu bgacs mayebjurooavwf. Cgujafota joe kod’j ser ppo zeljogy cuqof.
Ir uvitef ilaqakaev zocnh im chahes golonp oxy eh zohisto ha obbiw twjoupb.
➤ Ojuy Cedec.vposg. Ac novnotlFefz(), upq vta tebnuxizk bulu jixafo xec roxm uy nusdix:
Yaja, tiu wliudu o halmur ri venz blu rivus rawyuw uv somsiraf. Poi mahy qzi harxiq li u cuukvib ixt pon ymo tidqefxs qo lesa. Keo xgob wikp dzu mukjuw ra tte PKA.
Kahk rmax tumrqi iddnadebveeh di mahnodo lsaqifm, qii’re pauqj tad bva qelh vor cgowtosg, yvota guu’zk kbiina kvaknekd mefwakxu apvedwr ejw ekig huktxej pece qsxabzi qhiezutuc behvib miach.
Key Points
GPU compute, or general purpose GPU programming, helps you perform data operations in parallel without using the more specialized rendering pipeline.
You can move any task that operates on multiple items independently to the GPU. Later, you’ll see that you can even move the repetitive task of rendering a scene to a compute shader.
GPU memory is good at simple parallel operations, and with Apple silicon, you can keep chained operations in tile memory instead of moving them back to system memory.
Compute processing uses a compute pipeline with a kernel function.
The kernel function operates on a grid of threads organized into threadgroups. This grid can be 1D, 2D or 3D.
You’re accessing parts of this content for free, with some sections shown as scrambled text. Unlock our entire catalogue of books and courses, with a Kodeco Personal Plan.