CDOT Wiki β

UnknownX

6,582 bytes added, 05:50, 13 April 2017
= TBD... Ray Tracing =
== Team Member ==
Output: <br />
[[File:Txt.PNG]]<br />
[[File:Pycpu.PNG]]<br />
Here is the code that calculates each pixel. It is well suited to the GPU because each pixel is computed independently of every other pixel.
for (int y = 0; y < N; ++y) {
for (int x = 0; x < N; ++x) {
This nested loop has a running time of T(N) = O(N^2), where N is the width and height of the image in pixels.
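As a minimal sketch of why the nested loop parallelizes: the value of each pixel depends only on its own (x, y), so the N × N evaluations can run in any order. The helper names `hitsSphere` and `render` are hypothetical and not part of the original program. The sketch assumes the same setup as the program on this page: rays start at (x, y, 0) and travel along +z, and the sphere has radius 50 and is centred at (N/2, N/2) in the image plane.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: does the ray from (x, y, 0) along +z hit a sphere whose
// centre projects to (cx, cy) and whose radius is r? For an axis-aligned ray
// this reduces to a 2D distance test in the image plane.
bool hitsSphere(int x, int y, double cx, double cy, double r) {
    const double dx = x - cx;
    const double dy = y - cy;
    return dx * dx + dy * dy < r * r;
}

// Evaluates all n * n pixels. When reverse is true the pixels are visited in
// the opposite order, which must not change the result.
std::vector<int> render(int n, bool reverse) {
    assert(n > 0);
    std::vector<int> hit(n * n, 0);
    for (int k = 0; k < n * n; ++k) {
        const int idx = reverse ? n * n - 1 - k : k;
        const int y = idx / n;
        const int x = idx % n;
        hit[idx] = hitsSphere(x, y, n * 0.5, n * 0.5, 50.0) ? 1 : 0;
    }
    return hit;
}
```

Calling `render(N, false)` and `render(N, true)` gives identical images, which is the property the GPU version relies on. The loop body runs exactly N * N times, matching the O(N^2) estimate.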
== Presentation ==
What is Ray Tracing?

Ray tracing is the technique of generating an image by tracing the paths light would travel through pixels in an image plane and simulating the effects it encounters with virtual objects.

[[File:Ray_trace_diagram.svg.png]]

== Code ==
 #include <chrono>
 #include <cmath>
 #include <fstream>
 using namespace std::chrono;
 // clamp255() and reportTime() are used below but are not defined in this listing.
 // 3D vector with basic arithmetic
 struct Vec3 {
     double x, y, z;
     Vec3(double x, double y, double z) : x(x), y(y), z(z) {}
     Vec3 operator + (const Vec3& v) const { return Vec3(x + v.x, y + v.y, z + v.z); }
     Vec3 operator - (const Vec3& v) const { return Vec3(x - v.x, y - v.y, z - v.z); }
     Vec3 operator * (double d) const { return Vec3(x*d, y*d, z*d); }
     Vec3 operator / (double d) const { return Vec3(x / d, y / d, z / d); }
     Vec3 normalize() const {
         double mg = sqrt(x*x + y*y + z*z);
         return Vec3(x / mg, y / mg, z / mg);
     }
 };
 inline double dot(const Vec3& a, const Vec3& b) {
     return (a.x*b.x + a.y*b.y + a.z*b.z);
 }
 // Ray with origin o and direction d
 struct Ray {
     Vec3 o, d;
     Ray(const Vec3& o, const Vec3& d) : o(o), d(d) {}
 };
 // Sphere with centre c and radius r
 struct Sphere {
     Vec3 c;
     double r;
     Sphere(const Vec3& c, double r) : c(c), r(r) {}
     Vec3 getNormal(const Vec3& pi) const { return (pi - c) / r; }
     // Sets t to the nearer intersection parameter; returns false on a miss
     bool intersect(const Ray& ray, double &t) const {
         const Vec3 o = ray.o;
         const Vec3 d = ray.d;
         const Vec3 oc = o - c;
         const double b = 2 * dot(oc, d);
         const double c = dot(oc, oc) - r*r;   // shadows the member c from here on
         double disc = b*b - 4 * c;
         if (disc < 1e-4) return false;
         disc = sqrt(disc);
         // Note: for a unit-length d the roots are (-b +/- disc) / 2;
         // this code omits the division by 2, so t is twice the geometric distance.
         const double t0 = -b - disc;
         const double t1 = -b + disc;
         t = (t0 < t1) ? t0 : t1;
         return true;
     }
 };
 int main() {
     steady_clock::time_point ts, te, tm;
     ts = steady_clock::now();
     const int N = 500;
     const Vec3 white(255, 255, 255);
     const Vec3 black(0, 0, 0);
     const Vec3 red(0, 255, 0);   // note: (0, 255, 0) is green in RGB, despite the name
     const Sphere sphere(Vec3(N*0.5, N*0.5, 50), 50);
     const Sphere light(Vec3(0, 0, 50), 1);
     std::ofstream out("out.ppm");
     out << "P3\n" << N << ' ' << N << ' ' << "255\n";
     double t;
     Vec3 pix_col(black);
     int* pixs = new int[N * N * 3];
     // Trace one ray per pixel
     for (int y = 0; y < N; ++y) {
         for (int x = 0; x < N; ++x) {
             pix_col = black;
             const Ray ray(Vec3(x, y, 0), Vec3(0, 0, 1));
             if (sphere.intersect(ray, t)) {
                 const Vec3 pi = ray.o + ray.d*t;
                 const Vec3 L = light.c - pi;
                 const Vec3 N = sphere.getNormal(pi);
                 const double dt = dot(L.normalize(), N.normalize());
                 pix_col = (red + white*dt) * 0.5;
                 clamp255(pix_col);
             }
             pixs[3 * (y * N + x)] = (int)pix_col.x;
             pixs[3 * (y * N + x) + 1] = (int)pix_col.y;
             pixs[3 * (y * N + x) + 2] = (int)pix_col.z;
         }
     }
     te = steady_clock::now();
     reportTime("ray tracing", te - ts);
     // Write the pixels to the PPM file
     for (int y = 0; y < N; ++y) {
         for (int x = 0; x < N; ++x) {
             out << pixs[3 * (y * N + x)] << ' '
                 << pixs[3 * (y * N + x) + 1] << ' '
                 << pixs[3 * (y * N + x) + 2] << '\n';
         }
     }
     tm = steady_clock::now();
     reportTime("ray tracing and file output", tm - ts);
     delete[] pixs;
 }

== Points of possible Parallelization ==
 for (int y = 0; y < N; ++y) {
     for (int x = 0; x < N; ++x) {
         pix_col = black;
         const Ray ray(Vec3(x, y, 0), Vec3(0, 0, 1));
         if (sphere.intersect(ray, t)) {
             const Vec3 pi = ray.o + ray.d*t;
             const Vec3 L = light.c - pi;
             const Vec3 N = sphere.getNormal(pi);
             const double dt = dot(L.normalize(), N.normalize());
             pix_col = (red + white*dt) * 0.5;
             clamp255(pix_col);
         }
         pixs[3 * (y * N + x)] = (int)pix_col.x;
         pixs[3 * (y * N + x) + 1] = (int)pix_col.y;
         pixs[3 * (y * N + x) + 2] = (int)pix_col.z;
     }
 }

==Graph==
[[File:GraphDPS915kevin.JPG]]

== Assignment 2 - V1 Parallelization ==
Output result (converted to PNG format): [[File:GpuassOutput.PNG]]

Run time graph:
[[File:Pygpu2.PNG]]
CPU code:
 
This nested loop is the most expensive part of the program:
 
for (int y = 0; y < N; ++y) {
for (int x = 0; x < N; ++x) {
clamp255(pix_col);
}
//Store RGB to array
pixs[3 * (y * N + x)] = (int)pix_col.x;
pixs[3 * (y * N + x) + 1] = (int)pix_col.y;
pixs[3 * (y * N + x) + 2] = (int)pix_col.z;
}
GPU main code in the .cu file:

1. Allocate memory on the device.

2. Run the kernel, with ntpb = 1024.

3. Copy the key data back out to the host.
int size = N * N;
int nblocks = (size + ntpb - 1) / ntpb;
Kernel:

before:
 for (int y = 0; y < N; ++y)
     for (int x = 0; x < N; ++x)
after:
 int idx = blockIdx.x * blockDim.x + threadIdx.x;
 int x = idx / N;
 int y = idx % N;
__global__ void kernel_tray(Vec3 pix_col, int N, int* pixs_x, int* pixs_y, int* pixs_z) {
clamp255(pix_col);
}
//Store RGB to arrays
pixs_x[y * N + x] = (int)pix_col.x;
pixs_y[y * N + x] = (int)pix_col.y;
pixs_z[y * N + x] = (int)pix_col.z;
}
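The launch configuration and index mapping above can be checked on the host. The following is a sketch in plain C++ that emulates the 1D grid with two loops. The function names `blocksFor` and `emulateLaunch` are hypothetical, and the `idx >= size` guard is an assumption, since the full kernel body is not shown on this page.

```cpp
#include <cassert>
#include <vector>

// Number of blocks needed to cover `size` threads with `ntpb` threads per
// block (the same round-up formula used for nblocks above).
int blocksFor(int size, int ntpb) {
    assert(size > 0 && ntpb > 0);
    return (size + ntpb - 1) / ntpb;
}

// Hypothetical host-side emulation of the 1D launch: returns how many times
// each pixel of an n * n image is written.
std::vector<int> emulateLaunch(int n, int ntpb) {
    const int size = n * n;
    const int nblocks = blocksFor(size, ntpb);
    std::vector<int> writes(size, 0);
    for (int block = 0; block < nblocks; ++block) {
        for (int thread = 0; thread < ntpb; ++thread) {
            const int idx = block * ntpb + thread;  // blockIdx.x * blockDim.x + threadIdx.x
            if (idx >= size) continue;              // assumed guard for the padded last block
            const int x = idx / n;                  // mapping used in the kernel above
            const int y = idx % n;
            ++writes[y * n + x];
        }
    }
    return writes;
}
```

For N = 500 and ntpb = 1024 this gives 245 blocks, that is 250,880 threads for 250,000 pixels, so the last block contains threads with no pixel to compute. Every entry returned by `emulateLaunch(500, 1024)` is 1, meaning each pixel is written exactly once.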
 
Profile on nvvp:
[[File:matrix.senecac.on.ca/~zzha1/Capture.PNG]]
== Assignment 3 - Optimization ==
=== V2 -- One array ===
PPM file output: <br />
[[File:Txt.PNG]]<br />
In V1 we allocated three arrays to store all the results, so each pixel was stored across three arrays, which is slow. Instead of three arrays, we now allocate one bigger array and store all the pixels in it.

For the first pixel:
 1st: R _ _ _ _ _ _ _
 2nd: G _ _ _ _ _ _ _
 3rd: B _ _ _ _ _ _ _
 new array: R G B _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Before
[[File:pyonearray.PNG]]
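The two layouts can be sketched on the host as follows. The type names `PlanarImage` and `InterleavedImage` are hypothetical and only illustrate the indexing; the original code uses raw `int*` buffers (`pixs_x`, `pixs_y`, `pixs_z` in V1).

```cpp
#include <cassert>
#include <vector>

// V1 layout: three separate arrays, one per colour channel.
struct PlanarImage {
    std::vector<int> r, g, b;
    explicit PlanarImage(int pixels) : r(pixels), g(pixels), b(pixels) {}
    void store(int i, int R, int G, int B) {
        assert(i >= 0 && i < static_cast<int>(r.size()));
        r[i] = R; g[i] = G; b[i] = B;
    }
};

// V2 layout: one array of 3 * pixels ints, with the R, G and B values of
// pixel i stored next to each other at offsets 3*i, 3*i + 1 and 3*i + 2.
struct InterleavedImage {
    std::vector<int> rgb;
    explicit InterleavedImage(int pixels) : rgb(3 * pixels) {}
    void store(int i, int R, int G, int B) {
        assert(i >= 0 && 3 * i + 2 < static_cast<int>(rgb.size()));
        rgb[3 * i]     = R;
        rgb[3 * i + 1] = G;
        rgb[3 * i + 2] = B;
    }
};
```

With the single array, one allocation and one copy back to the host would replace three of each, and the buffer is already in the order in which the PPM file is written.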
=== V3 -- Occupancy ===
With 1024 threads per block, we only get 50% occupancy. However, if we change it to 640 threads per block, we can get 60%. <br />
before:
const int ntpb = 1024;
After:
const int ntpb = 640;
[[File:pyoccu.PNG]]
[[File:pythreads640.PNG]]
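The occupancy figures above can be approximated with a small register-based estimate. This is only a sketch under stated assumptions: the per-multiprocessor limits below (2048 resident threads, 65,536 registers, registers allocated per warp in units of 256) are assumed values, not taken from this page, and should be checked against the actual device. The function name `estimateOccupancy` is hypothetical. The 44 registers per thread used in the example is the figure reported for the double-precision kernel later on this page.

```cpp
#include <algorithm>
#include <cassert>

// Assumed per-multiprocessor limits (hypothetical; check your device).
const int kMaxThreadsPerSM = 2048;   // resident threads
const int kRegistersPerSM  = 65536;  // 32-bit registers
const int kWarpSize        = 32;
const int kRegAllocUnit    = 256;    // assumed register allocation unit per warp

// Estimated occupancy (resident threads / maximum resident threads) when only
// the thread limit and the register limit are considered. Shared memory and
// the limit on resident blocks are ignored in this sketch.
double estimateOccupancy(int threadsPerBlock, int regsPerThread) {
    assert(threadsPerBlock > 0 && regsPerThread > 0);
    const int warpsPerBlock = (threadsPerBlock + kWarpSize - 1) / kWarpSize;
    int regsPerWarp = regsPerThread * kWarpSize;
    regsPerWarp = (regsPerWarp + kRegAllocUnit - 1) / kRegAllocUnit * kRegAllocUnit;
    const int regsPerBlock    = regsPerWarp * warpsPerBlock;
    const int blocksByRegs    = kRegistersPerSM / regsPerBlock;
    const int blocksByThreads = kMaxThreadsPerSM / threadsPerBlock;
    const int blocks = std::min(blocksByRegs, blocksByThreads);
    return static_cast<double>(blocks * warpsPerBlock * kWarpSize) / kMaxThreadsPerSM;
}
```

Under these assumptions, `estimateOccupancy(1024, 44)` gives 0.5 because only one block of 1024 threads fits in the register file, while `estimateOccupancy(640, 44)` gives 0.625 because two blocks of 640 threads fit. That is in line with the roughly 50% and 60% reported above.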
----
=== V4 -- Coalescence ===
Before this modification, here is our array:
 R1 G1 B1 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ R2 G2 B2 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
After we switch the x and y:
 R1 G1 B1 R1 G1 B1 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Before
int idx = blockIdx.x * blockDim.x + threadIdx.x;
[[File:pyCoalescence.PNG]]
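The effect of switching x and y can be seen by computing where neighbouring threads write. This is a host-side sketch with hypothetical function names. The "before" mapping is the one shown in the kernel section (`x = idx / N`, `y = idx % N`); the "after" mapping (`x = idx % N`, `y = idx / N`) is an assumption about what the swap looks like, since the modified kernel line is not shown here.

```cpp
#include <cassert>

// Pixel offset (y * n + x) written by thread idx with the original mapping.
int offsetBefore(int idx, int n) {
    assert(n > 0 && idx >= 0);
    const int x = idx / n;
    const int y = idx % n;
    return y * n + x;
}

// Pixel offset written by thread idx after x and y are swapped (assumed form).
int offsetAfter(int idx, int n) {
    assert(n > 0 && idx >= 0);
    const int x = idx % n;
    const int y = idx / n;
    return y * n + x;
}
```

For N = 500, threads 0 and 1 write pixels 500 apart with the original mapping, but adjacent pixels after the swap, where the offset is simply equal to `idx`. Adjacent threads writing adjacent locations is what lets the hardware coalesce the stores. With the single interleaved RGB array each pixel offset is additionally multiplied by 3.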
----
=== V5 -- Double -> float ===
struct Vec3 {
'''What problems does it solve?''' <br />
1. Using too many registers. To get 100% occupancy, each thread has to use fewer than 32 registers. If we change from double to float, the register count drops from 44 to 29.
[[File:Pyoccu2.PNG]]
2. Calculating in double precision is very slow on GeForce devices. <br />
[[File:pyfloat.PNG]]
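The register threshold in point 1 can be reproduced with a one-line budget calculation. This is a sketch with hypothetical function names, and the device limits passed in the example (65,536 registers and 2048 resident threads per multiprocessor) are assumptions rather than values stated on this page.

```cpp
#include <cassert>

// Largest register count per thread that still allows the maximum number of
// resident threads, given the size of the register file.
int maxRegsForFullOccupancy(int registersPerSM, int maxThreadsPerSM) {
    assert(registersPerSM > 0 && maxThreadsPerSM > 0);
    return registersPerSM / maxThreadsPerSM;
}

// True when a kernel using regsPerThread registers fits within that budget.
bool allowsFullOccupancy(int regsPerThread, int registersPerSM, int maxThreadsPerSM) {
    return regsPerThread <= maxRegsForFullOccupancy(registersPerSM, maxThreadsPerSM);
}
```

With the assumed limits the budget works out to 32 registers per thread, so the float version at 29 registers fits while the double version at 44 does not.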
 
==Links==
References: https://www.youtube.com/watch?v=ARn_yhgk7aE
 
PPT: https://docs.google.com/presentation/d/10Cr_zIDUultkQLzdyC3_3B-GKO_bl6RJFHpWNg72tRk/edit#slide=id.g20678afd80_0_1313