=== Assignment 2 ===
==== CUDA-enabled functions ====
The main function was changed to copy the data from host to device, launch the kernel, copy the results back from device to host, and release all memory on both the host and the device.
cudaMalloc((void**)&d_A, size * sizeof(long)); // device buffer holding the data to sort
cudaMalloc((void**)&d_B, size * sizeof(long)); // device swap buffer for the merged output
cudaMemcpy(d_A, h_A, size * sizeof(long), cudaMemcpyHostToDevice);
// copy the launch configuration so the kernel can read it on the device
cudaMalloc((void**)&d_ntpb, sizeof(dim3));
cudaMalloc((void**)&d_nbpg, sizeof(dim3));
cudaMemcpy(d_ntpb, &ntpb, sizeof(dim3), cudaMemcpyHostToDevice);
cudaMemcpy(d_nbpg, &nbpg, sizeof(dim3), cudaMemcpyHostToDevice);
long *d_actual = d_A; // buffer the current pass reads from
long *d_swap = d_B;   // buffer the current pass writes to
long totalThreads = ntpb.x * nbpg.x;
float start_time = clock();
for (long width = 2; width < (size << 1); width <<= 1) {
    long slices = size / (totalThreads * width) + 1;
    merge_sort<<<nbpg, ntpb>>>(d_actual, d_swap, size, width, slices, d_ntpb, d_nbpg);
    cudaDeviceSynchronize();
    // swap the source and destination buffers for the next pass
    d_actual = d_actual == d_A ? d_B : d_A;
    d_swap = d_swap == d_A ? d_B : d_A;
}
float end_time = clock() - start_time; // elapsed host clock ticks for the whole sort
//copy the sorted array back from the device to the host
cudaMemcpy(h_C, d_actual, size * sizeof(long), cudaMemcpyDeviceToHost);
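The excerpt above assumes ntpb and nbpg were already initialized on the host and omits the cleanup mentioned earlier. A minimal sketch of those two pieces might look like the following; the specific launch configuration (32 threads per block, 8 blocks) and the use of new[]/delete[] for the host arrays are assumptions for illustration, not taken from the original code.
// hypothetical launch configuration (values chosen for illustration only)
dim3 ntpb(32); // threads per block
dim3 nbpg(8);  // blocks per grid
// ... sorting as shown above ...
// release device memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_ntpb);
cudaFree(d_nbpg);
// release host memory (assuming h_A and h_C were allocated with new[])
delete[] h_A;
delete[] h_C;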
The serial code did not use recursion, which is how merge sort is usually implemented. Instead, it performed a bottom-up merge, in which the array was divided into slices and each slice was sorted; the width of the slices doubled with every iteration. This was done with two functions in the serial code: merge_sort(...) and merge(...). The former sliced the array into the different widths; the latter received these slices and merged them, using a local STL map as temporary storage for the values before writing them back to the original array.
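To make that bottom-up structure concrete, here is a minimal serial sketch. This is not the original assignment code: the names merge_sort_serial and merge_slice are made up for illustration, and a temporary vector stands in for the STL map the original used.
#include <algorithm>
#include <vector>
// merge two adjacent sorted slices [beg, mid) and [mid, end) of a
void merge_slice(std::vector<long>& a, long beg, long mid, long end) {
    std::vector<long> tmp;
    long i1 = beg, i2 = mid;
    while (i1 < mid && i2 < end)
        tmp.push_back(a[i1] < a[i2] ? a[i1++] : a[i2++]);
    while (i1 < mid) tmp.push_back(a[i1++]);
    while (i2 < end) tmp.push_back(a[i2++]);
    std::copy(tmp.begin(), tmp.end(), a.begin() + beg);
}
void merge_sort_serial(std::vector<long>& a) {
    long size = (long)a.size();
    for (long width = 2; width < (size << 1); width <<= 1) { // outer loop: slice width
        for (long beg = 0; beg < size; beg += width) {       // inner loop: one slice per iteration
            long mid = std::min(beg + (width >> 1), size);
            long end = std::min(beg + width, size);
            merge_slice(a, beg, mid, end);
        }
    }
}
The two nested loops here, over the width and over the starting positions, are exactly the loops that the kernel below removes by handing each thread its own range of starting positions.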
In the CUDA code, it seemed advisable to make merge_sort the kernel. This can be justified because each thread now gets a slice of the array to sort, so more work is done concurrently than before. It also allowed the two nested for-loops from the serial code to be removed. Here is what the new merge_sort function looks like:
__global__ void merge_sort(long *sourceA, long *destA, long size, long width, long slices, dim3 *threads, dim3 *blocks) {
    int index = threadIdx.x + blockIdx.x * blockDim.x; // global thread index
    long beg = width * index * slices, middle, end;    // long to avoid overflow on large arrays
    for (long slice = 0; slice < slices; slice++) {    // replaces the two nested for-loops of the serial code
        if (beg < size) {
            middle = min(beg + (width >> 1), size);
            end = min(beg + width, size);
            merge(sourceA, destA, beg, middle, end);
            beg += width;
        }
    }
}
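For a concrete sense of the indexing (the numbers here are chosen purely for illustration): with 8 blocks of 32 threads, totalThreads is 256, so for size = 4096 and width = 2 the host computes slices = 4096 / (256 * 2) + 1 = 9. Thread 0 then merges the width-2 slices starting at 0, 2, ..., 16, thread 1 those starting at 18, 20, ..., 34, and so on, while the beg < size guard makes threads whose range falls past the end of the array simply do nothing.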
As before, the merge_sort() function calls the merge function. In the serial code, merge used an STL map to hold the slice's values temporarily and then wrote them back to the original array in sorted order based on the comparisons. In the CUDA code, merge is strictly a device function that is called from the kernel for each slice to be merged; instead of a map, it writes directly into the destination buffer.
__device__ void merge(long *sourceA, long *destA, long b, long c, long e) {
    long i1 = b; // cursor for the left half [b, c)
    long i2 = c; // cursor for the right half [c, e)
    for (long k = b; k < e; k++) {
        // take from the left half while it still has elements and its head is smaller
        if (i1 < c && (i2 >= e || sourceA[i1] < sourceA[i2])) {
            destA[k] = sourceA[i1];
            i1++;
        }
        else {
            destA[k] = sourceA[i2];
            i2++;
        }
    }
}
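Because merge writes into destA rather than sorting in place, every pass needs the second buffer; that is why the host code above ping-pongs d_actual and d_swap after each kernel launch and copies the final result back from whichever buffer held the output of the last pass.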
=== Parallelize ===