=== Analysis ===
After analyzing this block of code, we decided to parallelize it. Here is the kernel we wrote:
 __global__ void setCenter(float* d_center, float* d_sample, int n, int dim, int randi) {
     // 2D thread index: i selects the row, j the dimension (column-major layout)
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     int j = blockIdx.y * blockDim.y + threadIdx.y;
     if (i < n && j < dim)
         d_center[j * n + i] = d_sample[j * randi + i];
 }
Launching the kernel:
 int nb = (n + ntpb - 1) / ntpb;
 dim3 dGrid(nb, nb, 1);
 dim3 dBlock(ntpb, ntpb, 1);
 float* d_center = nullptr;
 cudaMalloc((void**)&d_center, centers.rows * centers.cols * sizeof(float));
 cudaMemcpy(d_center, (float*)centers.data, centers.rows * centers.cols * sizeof(float), cudaMemcpyHostToDevice);
 check(cudaGetLastError());
 float* d_sample = nullptr;
 cudaMalloc((void**)&d_sample, samples.rows * samples.cols * sizeof(float));
 // copy the sample matrix using its own dimensions, not the centers'
 cudaMemcpy(d_sample, (float*)samples.data, samples.rows * samples.cols * sizeof(float), cudaMemcpyHostToDevice);
 check(cudaGetLastError());
 int randi = genrand_int31() % n;
 setCenter<<<dGrid, dBlock>>>(d_center, d_sample, n, dim, randi);
 check(cudaGetLastError());
 cudaDeviceSynchronize();
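For comparison with the serial baseline in the graph below, the same work can be written as a plain serial loop over the n × dim index space the kernel covers (a minimal C sketch; the function name and the layout assumptions are carried over from the kernel, not taken from the original serial program):

```c
/* Serial equivalent of the setCenter kernel: for every (i, j) in the
   n x dim index space, copy sample[j * randi + i] into center[j * n + i]. */
static void setCenterSerial(float* center, const float* sample,
                            int n, int dim, int randi) {
    for (int j = 0; j < dim; ++j)
        for (int i = 0; i < n; ++i)
            center[j * n + i] = sample[j * randi + i];
}
```

On the GPU each (i, j) pair becomes one thread, so the two nested loops disappear entirely; the launch configuration above supplies the same index space via the grid and block dimensions.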
After programming this kernel, we noticed an improvement in performance.
Here is a graph comparing the run times of the serial program and the parallelized version.
[[File:Assignment2Graph.png]]