
CDOT Wiki β



1,701 bytes added, 05:50, 13 April 2017
Assignment 2 - V1 Parallelization
[[File:Pycpu.PNG]]<br />
Here is the code that calculates each pixel. This work suits the GPU because every pixel is computed independently of the others.
for (int y = 0; y < N; ++y) {
for (int x = 0; x < N; ++x) {
== Assignment 2 - V1 Parallelization ==
Output result (converted to PNG format):
Run time graph:
CPU code:
The most expensive part of the program:
for (int y = 0; y < N; ++y) {
for (int x = 0; x < N; ++x) {
//Store RGB to array
pixs[3 * (y * N + x)] = (int)pix_col.x;
pixs[3 * (y * N + x) + 1] = (int)pix_col.y;
pixs[3 * (y * N + x) + 2] = (int)pix_col.z;
GPU
Main code in the .cu file:
1. Allocate memory on the device.
2. Run the kernel with ntpb = 1024.
3. Copy the key data out.
int size = N * N;
int nblocks = (size + ntpb - 1) / ntpb;
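The block count above is a ceiling division: it rounds up so that a partial last block still covers the remaining pixels. A minimal host-side C++ sketch of that arithmetic follows; the helper name `blocks_for` is hypothetical and not part of the original code.

```cpp
#include <cassert>

// Hypothetical helper: smallest number of blocks such that
// nblocks * ntpb >= size (ceiling division).
int blocks_for(int size, int ntpb) {
    assert(size >= 0 && ntpb > 0);
    return (size + ntpb - 1) / ntpb;
}
```

For example, a 100 x 100 image (size = 10000) with ntpb = 1024 needs 10 blocks, which launches 10240 threads. The 240 surplus threads have no pixel to compute, so a kernel launched this way generally needs a bounds check such as `if (idx < N * N)`.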
Kernel:
before:
for (int y = 0; y < N; ++y)
for (int x = 0; x < N; ++x)
after:
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int x = idx / N;
int y = idx % N;
__global__ void kernel_tray(Vec3 pix_col, int N, int* pixs_x, int* pixs_y, int* pixs_z) {
//Store RGB to arrays
pixs_x[y * N + x] = (int)pix_col.x;
pixs_y[y * N + x] = (int)pix_col.y;
pixs_z[y * N + x] = (int)pix_col.z;
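To check that the flat thread index replaces the two nested loops without skipping or repeating a pixel, the index arithmetic can be emulated on the host. This is only a sketch, not the original kernel: the function name `count_writes` and the `idx >= size` guard are assumptions added for illustration.

```cpp
#include <cassert>
#include <vector>

// Host-side emulation of the kernel's index arithmetic.
// Returns how many times each pixel of an N x N image would be written
// when nblocks * ntpb threads run with the mapping shown above.
std::vector<int> count_writes(int N, int ntpb) {
    assert(N > 0 && ntpb > 0);
    const int size = N * N;
    const int nblocks = (size + ntpb - 1) / ntpb;
    std::vector<int> hits(size, 0);
    for (int block = 0; block < nblocks; ++block) {
        for (int thread = 0; thread < ntpb; ++thread) {
            int idx = block * ntpb + thread;  // blockIdx.x * blockDim.x + threadIdx.x
            if (idx >= size) continue;        // assumed guard for surplus threads
            int x = idx / N;
            int y = idx % N;
            ++hits[y * N + x];
        }
    }
    return hits;
}
```

Every entry of the returned vector should be 1: each pixel is handled by exactly one thread, and threads past the end of the image do nothing.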
Profile on nvvp:
== Assignment 3 - Optimization ==
PPM file output: <br />
[[File:Txt.PNG]]<br />
We allocated three arrays to store all the results. Each pixel was spread across the 3 arrays, which is slow.
Instead of 3 arrays, we allocate one bigger array and store all the pixels in it.
For the first pixel:
1st: R _ _ _ _ _ _ _
2nd: G _ _ _ _ _ _ _
3rd: B _ _ _ _ _ _ _
new array: R G B _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
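The layout in the diagram can be sketched on the host as follows. Pixel i occupies three consecutive slots, matching the `3 * (y * N + x)` indexing used in the CPU code. The function name `interleave_rgb` is hypothetical; in the actual program the kernel would write into the single array directly rather than packing it afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: pack three per-channel arrays into one interleaved RGB array.
// Pixel i occupies slots 3*i (R), 3*i + 1 (G), 3*i + 2 (B).
std::vector<int> interleave_rgb(const std::vector<int>& r,
                                const std::vector<int>& g,
                                const std::vector<int>& b) {
    assert(r.size() == g.size() && g.size() == b.size());
    std::vector<int> pixs(3 * r.size());
    for (std::size_t i = 0; i < r.size(); ++i) {
        pixs[3 * i]     = r[i];
        pixs[3 * i + 1] = g[i];
        pixs[3 * i + 2] = b[i];
    }
    return pixs;
}
```

With a single array there is one device allocation and one copy back to the host instead of three of each.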
int* d_pixs_x;
=== V3 -- Occupancy ===
With 1024 threads per block, we only get 50% occupancy. If we change it to 640, we get 60%. <br />
const int ntpb = 1024;
=== V4 -- Coalescence ===
Before this modification, our array was laid out like this:
R1 G1 B1 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ R2 G2 B2 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
After we switch the x and y:
R1 G1 B1 R2 G2 B2 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
int idx = blockIdx.x * blockDim.x + threadIdx.x;
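The effect of switching x and y can be seen by comparing where consecutive threads write. The sketch below assumes the interleaved layout `3 * (y * N + x)` from the earlier version; the function name `red_offset` and the `swapped` flag are hypothetical and only illustrate the two mappings.

```cpp
#include <cassert>

// Offset of the R component written by thread idx in the interleaved array.
// swapped == false: x = idx / N, y = idx % N (the earlier mapping)
// swapped == true : x = idx % N, y = idx / N (x and y switched)
int red_offset(int idx, int N, bool swapped) {
    assert(N > 0 && idx >= 0 && idx < N * N);
    int x = swapped ? idx % N : idx / N;
    int y = swapped ? idx / N : idx % N;
    return 3 * (y * N + x);
}
```

With the earlier mapping, threads 0 and 1 write 3 * N ints apart; after the switch they write 3 ints apart. Neighbouring threads in a warp then touch neighbouring addresses, which lets the hardware combine their accesses into fewer memory transactions.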
'''What problems does it solve?''' <br />
1. Using too many registers
To get 100% occupancy, each thread has to use fewer than 32 registers. Changing the data type from double to float reduces register usage from 44 to 29.
2. Calculating in double is very slow on GeForce devices. <br />
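The link between registers, block size, and occupancy can be approximated with a small estimate. This is a rough sketch under stated assumptions, not the profiler's calculation: the per-SM limits (2048 resident threads, 65536 registers) are assumed values for illustration and should be checked against the actual device, and the function name `estimate_occupancy` is hypothetical. Register allocation granularity, shared memory, and the per-SM block limit are ignored.

```cpp
#include <algorithm>
#include <cassert>

// Rough occupancy estimate limited by threads and registers only.
double estimate_occupancy(int ntpb, int regs_per_thread) {
    const int max_threads_per_sm = 2048;  // assumed device limit
    const int regs_per_sm = 65536;        // assumed device limit
    assert(ntpb > 0 && ntpb <= 1024 && regs_per_thread > 0);
    // Blocks that fit on one SM by thread count and by register count.
    int blocks_by_threads = max_threads_per_sm / ntpb;
    int blocks_by_regs = regs_per_sm / (ntpb * regs_per_thread);
    int resident_blocks = std::min(blocks_by_threads, blocks_by_regs);
    return static_cast<double>(resident_blocks * ntpb) / max_threads_per_sm;
}
```

Under these assumed limits, 44 registers per thread gives 0.5 with 1024 threads per block and 0.625 with 640, in the same range as the 50% and 60% quoted earlier. With 29 registers and 1024 threads per block the estimate reaches 1.0.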
== References ==
Links:
References:
PPT: