Open main menu

CDOT Wiki β

Changes

Team NP Complete

20 bytes removed, 01:31, 22 December 2017
Parallelized and Dynamic Performance
We parallelized the code for rendering and state evolution above. We did not, however, yet parallelize the FFT code. The reason for this is that we initially thought that there was a loop-carried dependency that could not be resolved:
[[File:FFT_crop.png|1000px|centernone|Hotspot]]
The issue is the <code>T *= phiT;</code> line. This means that every time the loop counter <code>l</code> increases, <code>T</code> gets multiplied by <code>phiT</code>. This statement seems painfully obvious, but it prevents us from parallelizing the code since the iterations can't be done in an arbitrary order. What it does mean, however, is that we can remove that line and replace any usage of <code>T</code> with <code>pow(phiT, l)</code>. We can then parallelize it, since the iterations are now order-invariant. When we do that, the FPS somehow does not change. In fact, if we remove the parallel for construct, the FPS drops to a meager 1 frame per second. This is awful, and likely because the <code>pow()</code> operation is very computationally expensive. Are we stuck? Of course not. We can apply math. There is a property of complex numbers which allows us to turn exponentiation into multiplication. If we write the complex number <code>phiT</code> as <code>phiT = cos(arg) + i * sin(arg)</code>, and we can since it has norm 1, we have <code>phiT ** l = cos(l * arg) + i * sin(l * arg)</code>. This gave a tremendous speedup since the trigonometric functions are apparently less costly than exponentiation. The code and new vTune analyses are below.
[[File:PFFT_crop_no_dynamic.png|1000px500px|centernone|No dynamic]]
[[File:after_fft_analysis.png|1000px750px|centernone|After FFT analysis]]
[[File:after_fft_hist.png|1000px750px|centernone|After FFT analysis]]
So we've better parallelized the program by attacking a computationally expensive loop which runs many times. We have achieved a massive speedup. That said, it seems that a lot of CPU time is spent idle, especially in the region we have just parallelized. (In addition, the FPS is not very stable). This is because each loop iteration is guaranteed to do a different amount of work. When the loop counter <code>l</code> is small, the inner loop runs many times. When it is large, it does not. This is most easily fixed by adding dynamic scheduling.
[[File:PFFT_crop.png|1000px500px|centernone|Dynamic]]
[[File:after_dynamic_analysis.png|1000px750px|centernone|Dynamic Performance]]
[[File:after_dynamic_histo.png|1000px750px|centernone|Dynamic Histogram]]
As the analyses show, that loop now shows more efficient CPU usage. It also gave a slightly higher, stabilized FPS.
25
edits