The optimized version of the source code with the kernels we created can be found [https://pastebin.com/R0xEfN9W here].
[[File:input.jpg]] [[File:outputPGM.jpg]] Due to device limitations we were only able to profile our program up to an enlarge scale of 8, but our results already showed a performance increase once the enlarge scale reached 4. [[File:GpuA2Spreadsheet.png]] As seen from the results above, our parallel implementation of the image processor shows a significant performance increase as the enlarge scale gets larger. We also noticed that the C++ implementation runs at around the same time as the CUDA implementation when there is no enlargement and the image is only negated and reflected. However, once the image is scaled by any factor, there is a definite performance increase from the CUDA implementation. It is also worth noting that the profiled times for the CUDA implementation varied a lot more than those of the C++ implementation. We believe this is due to the costly cudaMemcpy() operation: since the per-pixel operations we are doing are not that intensive, the time spent in cudaMemcpy() could easily exceed the time of the transformations themselves, and variation in transfer time would dominate the measurements.
To continue our optimizations, we think we could gain further performance by minimizing the amount of data copied to and from the GPU. We plan to look into keeping the image in device memory until all transformations have been applied, and only then copying the result back to the host.
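A minimal sketch of this idea is shown below. The kernel names, launch configuration, and grayscale image layout are hypothetical placeholders (the actual kernels are in the pastebin source linked above); the point is that all transformations chain on the same device buffer, with exactly one copy in and one copy out.

```cuda
#include <cuda_runtime.h>

// Hypothetical per-pixel kernels standing in for the real transformations.
__global__ void negateKernel(unsigned char* img, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) img[i] = 255 - img[i];  // invert grayscale value
}

__global__ void reflectKernel(unsigned char* img, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w / 2 && y < h) {  // swap each pixel with its horizontal mirror
        unsigned char tmp = img[y * w + x];
        img[y * w + x] = img[y * w + (w - 1 - x)];
        img[y * w + (w - 1 - x)] = tmp;
    }
}

void processImage(unsigned char* hostImg, int w, int h) {
    int n = w * h;
    unsigned char* devImg;
    cudaMalloc(&devImg, n);

    // One copy in...
    cudaMemcpy(devImg, hostImg, n, cudaMemcpyHostToDevice);

    // ...all transformations run on device memory, no intermediate transfers...
    negateKernel<<<(n + 255) / 256, 256>>>(devImg, n);
    dim3 block(16, 16);
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
    reflectKernel<<<grid, block>>>(devImg, w, h);

    // ...and one copy out.
    cudaMemcpy(hostImg, devImg, n, cudaMemcpyDeviceToHost);
    cudaFree(devImg);
}
```

With this structure, adding more transformations costs only extra kernel launches rather than extra round trips over PCIe, which is exactly where we believe our current timing variation comes from.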
=== Assignment 3 ===