CDOT Wiki β

GPU621/GameEngineParallelisation

4 bytes added, 09:18, 13 December 2021
Game engines run many systems in tandem and have a lot of work to do for every frame that needs to be rendered. Old games like Pong and Final Fantasy didn't have much going on at once and could comfortably run as serial applications. Modern games, on the other hand, commonly run at 60 to 144 frames per second, which leaves roughly 16.7 ms down to 6.9 ms per frame, with layers upon layers of systems working together. Running the game as efficiently as possible is a must to keep up with these refresh rates, especially when players demand extreme graphical fidelity without reduced frame rates. Before we parallelise any work, we first need to break it up into tasks for either the CPU or the GPU to handle.
[[File:Pong.png]]
(Pong 1972)
[[File:FFIV.png]]
(Final Fantasy IV 1991)
[[File:Destiny.png]]
(Destiny 2014)
The game loop is the core of any game. It runs continuously: reading human interface devices, updating all of the game systems (collision, physics, and everything else), and finally rendering the result to the screen. It is important to structure the loop efficiently, in a way that allows for parallelisation.
[[File:GameLoop.png]]
(Serial Game Loop)
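The three phases of the serial loop can be sketched in a few lines. This is a minimal illustration in Python, not code from any particular engine: the function names, the dictionary used as game state, and the fixed 1/60 s timestep are all assumptions made for the example.

```python
def read_input(state):
    # Placeholder for polling human interface devices (hypothetical).
    state["input"] = "none"

def update(state, dt):
    # Advance every game system by dt seconds; here just one moving object.
    state["time"] += dt
    state["x"] += state["vx"] * dt

def render(state):
    # Placeholder: a real engine would submit draw calls here.
    return f"t={state['time']:.3f} x={state['x']:.3f}"

def run_serial_loop(frames, dt=1 / 60):
    # Assumed fixed timestep; every phase runs one after another on one thread.
    state = {"time": 0.0, "x": 0.0, "vx": 2.0, "input": None}
    output = []
    for _ in range(frames):
        read_input(state)             # 1. read input
        update(state, dt)             # 2. update the simulation
        output.append(render(state))  # 3. render the frame
    return state, output
```

Because each phase waits for the previous one, the frame time is the sum of all three phases, which is the cost the rest of this article tries to reduce.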
The easiest way to start parallelising a game engine is to dedicate threads to certain activities. You can have one thread that exclusively updates the simulation and another dedicated to rendering. To avoid unsafely sharing data between threads, the update thread writes to a buffer that is handed to the rendering thread once the update is finished. The rendering thread then renders that fixed game state while the update thread works on the next frame's data. This was fine for games in the early 2000s, but it wastes a lot of time that could be spent computing something else: whenever a subsystem's thread has no work, its core sits idle. This approach also needs one thread per subsystem, and it does not scale with hardware. Having a 16-core CPU won’t make the game run faster; it just leaves more CPU cores idle. If you have too few cores, the operating system has to context switch between the subsystem threads every frame, and that overhead hurts the performance of the game. To make use of the idle cores we need a different solution.
[[File:OneThread.png]]
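The buffer handoff between a dedicated update thread and a dedicated render thread might look like the sketch below. It is written in Python for brevity and is only an illustration: the function names and the state layout are hypothetical, and a bounded queue of size one stands in for the shared buffer so that the update thread can run at most one frame ahead of the renderer.

```python
import copy
import queue
import threading

def update_thread(frames, handoff):
    state = {"frame": 0, "x": 0.0}
    for frame in range(1, frames + 1):
        state["frame"] = frame
        state["x"] += 0.5
        # Hand over a snapshot so the renderer never sees a half-updated state.
        handoff.put(copy.deepcopy(state))
    handoff.put(None)  # sentinel: no more frames

def render_thread(handoff, rendered):
    while True:
        snapshot = handoff.get()
        if snapshot is None:
            break
        # Placeholder for drawing the fixed game state.
        rendered.append((snapshot["frame"], snapshot["x"]))

def run_two_thread_engine(frames):
    handoff = queue.Queue(maxsize=1)  # update may be one frame ahead at most
    rendered = []
    updater = threading.Thread(target=update_thread, args=(frames, handoff))
    renderer = threading.Thread(target=render_thread, args=(handoff, rendered))
    updater.start()
    renderer.start()
    updater.join()
    renderer.join()
    return rendered
```

Note that the thread count is fixed by the design (two here), which is why extra cores bring no benefit to this model.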
=== Fork / Join ===
The next logical step in parallelising the code is to spawn threads as they are needed. Tasks are offloaded from the main thread as they come up, and the threads won’t sit idle so long as there is work to do. Rendering, inverse kinematics, animation state handling, particle systems, enemy AI, and anything else desired can easily be sent to a thread. The only requirement for a task to be worth multithreading is that it costs more than the overhead of spawning a thread. Spawning threads is expensive, however, and doing it every frame will grind things to a halt. Vectorizing loops with SIMD instructions can ease things, but the biggest optimization we can make is to use thread pools.
[[File:ForkJoin.png]]
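A single fork / join frame can be sketched as follows. This is an assumed, simplified illustration in Python: `fork_join_frame` is a hypothetical helper that spawns one thread per task, waits for all of them, and returns their results in order.

```python
import threading

def fork_join_frame(tasks):
    # One result slot per task; each thread writes only to its own slot.
    results = [None] * len(tasks)

    def run(index, task):
        results[index] = task()

    threads = [
        threading.Thread(target=run, args=(i, task))
        for i, task in enumerate(tasks)
    ]
    for t in threads:
        t.start()  # fork: one new thread per task
    for t in threads:
        t.join()   # join: the frame cannot finish until every task is done
    return results
```

Every call creates and destroys its threads, so running this once per frame pays the thread spawning cost described above on every frame.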
=== Thread Pools ===
Thread pools are just a group of threads that are spawned at the start of a program, usually one for every available CPU core. Instead of spawning a thread whenever one is needed, tasks are sent to an idle thread in the pool.
[[File:TaskQueue.png]]
[[File:ThreadPoolEx1.png]][[File:ThreadPoolEx2.png]]
(Simple Thread Pool Implementation)
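For comparison, the same idea can be sketched with the thread pool from Python's standard library. This is not the implementation pictured above; `update_particles` and the batch layout are made up for the example. Also note that CPython threads do not execute pure Python code in parallel, so the sketch shows the structure of the technique rather than the speedup a native engine would see.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def update_particles(batch):
    # Hypothetical per-frame task: advance each particle by one step.
    return [p + 1 for p in batch]

def run_frames(pool, batches, frames):
    for _ in range(frames):
        # Each batch goes to an idle worker; no threads are spawned per frame.
        batches = list(pool.map(update_particles, batches))
    return batches

def demo():
    workers = os.cpu_count() or 1  # one thread per core, as described above
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return run_frames(pool, [[0, 1], [10, 11], [20]], frames=3)
```

The pool is created once and reused for all three frames, which is the difference from the fork / join model.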
Setting up a game engine to use thread pools directly can get difficult to maintain, so an abstraction over thread pools was created: job systems. Instead of sending the data to process straight into the thread pool, all jobs to be executed are sent to a queue. The game engine schedules the job queue and sends the jobs off to a thread pool (or another system) for execution. Job systems are highly customizable because jobs can be arbitrarily fine-grained: each job does as much or as little work as is required of it, and the jobs in one system can differ greatly in size.
[[File:JobSystem.png]]
(Job System Game Loop Diagram)
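A very small job system along these lines could look like the sketch below. The `JobSystem` class and its method names are hypothetical, chosen for this illustration; it assumes a single shared queue drained by a fixed set of worker threads.

```python
import queue
import threading

class JobSystem:
    def __init__(self, worker_count):
        self._jobs = queue.Queue()
        self._workers = [
            threading.Thread(target=self._work, daemon=True)
            for _ in range(worker_count)
        ]
        for worker in self._workers:
            worker.start()

    def submit(self, func, *args):
        # A job is any callable plus its arguments, large or small.
        self._jobs.put((func, args))

    def _work(self):
        while True:
            item = self._jobs.get()
            if item is None:  # shutdown sentinel
                self._jobs.task_done()
                break
            func, args = item
            try:
                func(*args)
            finally:
                self._jobs.task_done()

    def wait_all(self):
        # Blocks until every submitted job has finished.
        self._jobs.join()

    def shutdown(self):
        for _ in self._workers:
            self._jobs.put(None)
        for worker in self._workers:
            worker.join()

def demo():
    results = {}
    lock = threading.Lock()

    def record(name, value):
        with lock:
            results[name] = value

    jobs = JobSystem(worker_count=2)
    jobs.submit(lambda: record("physics", sum(range(1000))))  # larger job
    jobs.submit(record, "ai", "idle")                         # tiny job
    jobs.submit(record, "audio", 3)
    jobs.wait_all()
    jobs.shutdown()
    return results
```

The game loop would call `submit` for each piece of work in a frame and `wait_all` before presenting it; the jobs themselves do not need to know which thread runs them.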
When working with jobs and a thread pool, a job worker handles each job in the queue, locking it, sending the job to a thread, and unlocking it afterwards. A problem arises, however, when a job needs to be halted partway through (put to sleep and woken later).
[[File:JobThreadPool.png]]
(Job Example)
[[File:JobWorker.png]]
(Job Worker Example)
Consider a job in which an enemy fires a raycast at the player and reacts according to whether it hit. The enemy job has to wait for the nested raycast job to finish executing before it can continue. With a plain thread pool, this requires a full context switch of the thread to the new job. This is akin to spawning jobs with the fork / join model, and it is a big performance hit. To accommodate this use case, there are three common ways to handle jobs other than thread pools.
[[File:Raycast.png]]
=== Jobs as Coroutines ===
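One way to picture a job as a coroutine is with Python generators, as in the sketch below. Everything here is an assumption made for illustration: the one-dimensional `raycast_job`, the `enemy_job` that yields a request for it, and the tiny single-threaded scheduler. The point is only that the enemy job suspends itself at `yield` and is resumed with the raycast result, without a thread being blocked or switched.

```python
from collections import deque

def raycast_job(origin, target, walls):
    # Hypothetical 1D raycast: hits if no wall lies strictly between the two.
    lo, hi = sorted((origin, target))
    return not any(lo < wall < hi for wall in walls)

def enemy_job(enemy_x, player_x, walls, log):
    # Suspend here until the scheduler has run the requested raycast.
    hit = yield (raycast_job, (enemy_x, player_x, walls))
    log.append("attack" if hit else "search")

def run_jobs(coroutines):
    # Each entry pairs a coroutine with the value to resume it with.
    ready = deque((co, None) for co in coroutines)
    while ready:
        co, value = ready.popleft()
        try:
            func, args = co.send(value)  # run until the next yield
        except StopIteration:
            continue                     # job finished
        # Run the sub-job inline; a real system would queue it on a worker.
        ready.append((co, func(*args)))

def demo():
    log = []
    run_jobs([enemy_job(0, 10, [5], log), enemy_job(0, 10, [20], log)])
    return log
```

In `demo`, the first enemy has a wall between it and the player and the second does not, so the two jobs record different reactions once they are resumed.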
While the intricacies of GPU parallelisation are out of scope for this wiki article, there are some basic ideas that can be discussed here. GPU work follows a set pipeline called the Rendering Pipeline. The OpenGL API will be the focus of this section.
[[File:Pipeline.png]]
Not all portions of the rendering pipeline are accessible to the programmer, as some are handled automatically by OpenGL. The stages that the programmer can manipulate are the Vertex Shader, Tessellation, the Geometry Shader, and the Fragment Shader. These stages are programmed through shader files written in GLSL (OpenGL Shading Language). GLSL is a very limited language, which makes some optimizations difficult to implement, if not impossible. Communication from the CPU to the GPU has inherent latency, so minimizing communication is important; usually the CPU only communicates with the GPU when submitting a render job. GPUs automatically split up the work given to them by OpenGL, leaving optimizations to be done in the shaders. Passing less data to the GPU gives it less work to do overall and lessens the communication cost. This can be done through culling any objects that are unused (not in the player’s view) and structuring data in an optimal way for the GPU. The GPU doesn’t care about the internal animation state of the objects; it only cares about what it needs to render in the current frame, so passing only the vertex data instead of the entire model object is optimal.
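The CPU-side preparation described above, culling objects and sending only vertex data, can be sketched as follows. This is a deliberately simplified illustration: `GameObject`, `is_in_front`, and `build_vertex_buffer` are hypothetical names, and the visibility test only checks whether an object is in front of the camera, whereas a real engine would typically test against the full view frustum before uploading the buffer with a call such as `glBufferData`.

```python
from dataclasses import dataclass

@dataclass
class GameObject:
    name: str
    position: tuple                 # (x, y, z) world position
    vertices: list                  # list of (x, y, z) tuples
    animation_state: str = "idle"   # CPU-only data, never sent to the GPU

def is_in_front(camera_pos, camera_forward, obj):
    # Positive dot product means the object lies in front of the camera.
    to_obj = tuple(o - c for o, c in zip(obj.position, camera_pos))
    return sum(a * b for a, b in zip(to_obj, camera_forward)) > 0

def build_vertex_buffer(camera_pos, camera_forward, objects):
    buffer = []
    for obj in objects:
        if not is_in_front(camera_pos, camera_forward, obj):
            continue                # culled: nothing is sent for this object
        for vertex in obj.vertices:
            buffer.extend(vertex)   # flat x, y, z, x, y, z, ... layout
    return buffer
```

For a camera at the origin looking along +z, an object at z = 5 contributes its vertices to the buffer while an object at z = -3 is skipped entirely, and in both cases the animation state stays on the CPU.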