25
edits
Changes
→Parallelized and Dynamic Performance
[[File:FFT_crop.png|1000px|center|Hotspot]]
The issue is the <code>T *= phiT;</code> line. This means that every time the loop counter <code>l</code> increases, <code>T</code> gets multiplied by <code>phiT</code>. This statement seems painfully obvious, but it prevents us from parallelizing the code since the iterations can't be done in an arbitrary order. What it does mean, however, is that we can remove that line and replace any usage of <code>T</code> with <code>pow(phiT, l)</code>. We can then parallelize it, since the iterations are now order-invariant. When we do that, the FPS somehow does not change. In fact, if we remove the parallel for construct, the FPS drops to a meager 1 frame per second. This is awful, and likely because the <code>pow()</code> operation is very computationally expensive. Are we stuck? Of course not. We can apply math. There is a property of complex numbers which allows us to turn exponentiation into multiplication. If we write the complex number <code>phiT</code> as <code>phiT = cos(arg) + i * sin(arg)</code>, and we can since it has norm 1, we have <code>phiT ** l = cos(l * arg) + i * sin(l * arg)</code>. This gave a tremendous speedup since the trigonometric functions are apparently less costly than exponentiation. The code and new vTune analyses are below.
[[File:PFFT_crop_no_dynamic.png|1000px|center|No dynamic]]
[[File:after_fft_analysis.png|1000px|center|After FFT analysis]]
[[File:after_fft_hist.png|1000px|center|After FFT analysis]]
So we've better parallelized the program by attacking a computationally expensive loop which runs many times. We have achieved a massive speedup. That said, it seems that a lot of CPU time is spent idle, especially in the region we have just parallelized. (In addition, the FPS is not very stable). This is because each loop iteration is guaranteed to do a different amount of work. When the loop counter <code>l</code> is small, the inner loop runs many times. When it is large, it does not. This is most easily fixed by adding dynamic scheduling.
[[File:PFFT_crop.png|1000px|center|Dynamic]]
[[File:after_dynamic_analysis.png|1000px|center|Dynamic Performance]]