GPUSquad
</source>
The hotspot is clearly the double for-loop over m and n in the Jacobi iteration code of the dojacobi() function. I believe these matrix calculations could be parallelized for improved performance. Note that the loop enclosing the double loop runs a constant number of times, iters, so it does not grow with the problem size. The total cost is O(iters * n^2), which is still O(n^2), not O(n^3).
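
To make the loop structure concrete, here is a minimal serial sketch of the pattern described above. It is not the original dojacobi() code: the names jacobi_sweep and jacobi are hypothetical, and the update is assumed to be a plain four-neighbour average with no source term. The point is only that each interior cell is computed from the previous iterate alone, so the (i, j) updates within one sweep are independent of each other and could each be handed to a separate GPU thread.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One Jacobi sweep over an m x n row-major grid (hypothetical helper, not the
// original dojacobi()). Boundary cells are left untouched; each interior cell
// becomes the average of its four neighbours taken from the previous iterate.
void jacobi_sweep(const std::vector<float>& u_old, std::vector<float>& u_new,
                  std::size_t m, std::size_t n) {
    assert(u_old.size() == m * n && u_new.size() == m * n);
    for (std::size_t i = 1; i + 1 < m; ++i) {        // rows: O(m)
        for (std::size_t j = 1; j + 1 < n; ++j) {    // columns: O(n)
            // Reads only u_old, writes only u_new: no dependency between cells.
            u_new[i * n + j] = 0.25f * (u_old[(i - 1) * n + j] + u_old[(i + 1) * n + j] +
                                        u_old[i * n + j - 1] + u_old[i * n + j + 1]);
        }
    }
}

// Runs a fixed number of sweeps, so the total work is O(iters * m * n).
std::vector<float> jacobi(std::vector<float> u, std::size_t m, std::size_t n, int iters) {
    std::vector<float> next = u;  // copy so both buffers hold the boundary values
    for (int k = 0; k < iters; ++k) {
        jacobi_sweep(u, next, m, n);
        std::swap(u, next);       // the new iterate becomes the old one
    }
    return u;
}
```

Because the outer loop over iters carries a dependency from one sweep to the next, only the two inner loops are candidates for parallelization; the sweeps themselves still have to run in order.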
==== Idea 2 - LZW Compression ====
const int total_iters = 5000;
const int error_every = 2;
const int m = 32, n = 1024;
const float xmin = -1, xmax = 1;
const float ymin = -1, ymax = 1;
cudaMemcpy(d_b, b, n * m * sizeof(float), cudaMemcpyHostToDevice);
int nblocks = n / 1024; dim3 dGrid(nblocks); dim3 dBlock(1024);
// Carry out Jacobi iterations