Changes


TriForce

42,986 bytes added, 14:17, 8 April 2019
Kernel Optimization Attempts
Sudoku Solver Profiling
 
Rather than continuously increasing the difficulty of a 9x9 sudoku, I decided to modify the program I found to handle larger and larger sudokus by increasing the size of the matrices that make up the sudoku (starting with a 9x9 sudoku, which is nine 3x3 matrices, then 16x16, which is sixteen 4x4 matrices, and finally 25x25, which is twenty-five 5x5 matrices). The logic of the program is unchanged (only constants differ), so larger sudokus are solved the same way as a normal one.
Source code from: https://www.geeksforgeeks.org/sudoku-backtracking-7/
{| class="wikitable mw-collapsible mw-collapsed"
! Original Code:
|-
|
// A Backtracking program in C++ to solve Sudoku problem
/* Check if 'num' is not already placed in current row,
current column and current 3x3 box */
return !UsedInRow(grid, row, num) && !UsedInCol(grid, col, num) && !UsedInBox(grid, row - row%3 , col - col%3, num)&& grid[row][col]==UNASSIGNED;
}
/* Check if 'num' is not already placed in current row,
current column and current 4x4 box */
return !UsedInRow(grid, row, num) && !UsedInCol(grid, col, num) && !UsedInBox(grid, row - row%4 , col - col%4, num)&& grid[row][col]==UNASSIGNED;
}
/* Check if 'num' is not already placed in current row,
current column and current 5x5 box */
return !UsedInRow(grid, row, num) && !UsedInCol(grid, col, num) && !UsedInBox(grid, row - row%5 , col - col%5, num)&& grid[row][col]==UNASSIGNED;
}
return 0;
}
|}
Obtaining flat profiles and call graphs on matrix environment:
Attempted to run the program with a number of files (8K resolution):
{| class="wikitable mw-collapsible mw-collapsed"
! Sample Images
|-
|
[[File:Cabin small.jpg]]
[[File:Cabin2 small.jpg]]
|}
{| class="wikitable mw-collapsible mw-collapsed"
[[File:Julia.jpg]]
|}
 
This problem would be fairly simple to parallelize: in an image created from a Julia set, each pixel is independent of the others. The problem involves complex numbers, but these can be represented simply by using two arrays, or pairs of floats.
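
A minimal sketch of that representation, assuming the usual escape-time iteration z <- z*z + c with a fixed constant c. The function names `juliaIterations` and `juliaImage`, the constant and the iteration limit are illustrative and are not taken from the profiled program. Each complex number is held as a pair of floats, and each pixel's count depends only on its own coordinates.

```cpp
#include <cassert>
#include <vector>

// Escape-time count for one point z = (zx, zy) under z <- z*z + c,
// with the complex numbers stored as pairs of floats.
// Hypothetical helper: the name, constant c and limit are illustrative only.
int juliaIterations(float zx, float zy, float cx, float cy, int maxIter) {
    int iter = 0;
    while (iter < maxIter && zx * zx + zy * zy <= 4.0f) {
        float nextX = zx * zx - zy * zy + cx;   // real part of z*z + c
        zy = 2.0f * zx * zy + cy;               // imaginary part of z*z + c
        zx = nextX;
        ++iter;
    }
    return iter;
}

// Fills a width x height image; every pixel is computed independently,
// so the loop body could be handed to one GPU thread per pixel.
std::vector<int> juliaImage(int width, int height, float cx, float cy, int maxIter) {
    assert(width > 1 && height > 1);
    std::vector<int> image(width * height);
    for (int py = 0; py < height; ++py)
        for (int px = 0; px < width; ++px) {
            // map the pixel to the square [-2, 2] x [-2, 2]
            float zx = -2.0f + 4.0f * px / (width - 1);
            float zy = -2.0f + 4.0f * py / (height - 1);
            image[py * width + px] = juliaIterations(zx, zy, cx, cy, maxIter);
        }
    return image;
}
```

No pixel reads another pixel's result, which is what makes the problem map so directly onto a grid of threads.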
 
==== Assignment 1: Selection for parallelizing ====
 
After reviewing the three programs above, we decided to attempt to parallelize the Sudoku Solver Program for a few reasons.
 
1. By increasing the dimensions of the smaller matrices that make up a sudoku by one, we see a major increase in the time it takes to solve the sudoku, from almost instantly to around 38 seconds, and then to '''36 minutes'''. With a 25x25 sudoku (of 5x5 matrices), several functions were called over '''100 million times'''.
 
2. Based on the massive time increases and the similarity to the Hamiltonian Path Problem [https://www.hackerearth.com/practice/algorithms/graphs/hamiltonian-path/tutorial/], which also uses backtracking to find a solution, we believe the run time of the sudoku solver grows at least exponentially. The solver uses recursion to try candidate values in each of the 'n' blank spaces, returning to previous steps if the tried value does not work, so in the worst case it explores on the order of N^n combinations, where N is the number of values a square can hold. Like the O(n!) of a brute-force Hamiltonian path search, this is far worse than a polynomial runtime such as O(n^2).
 
3. The Julia sets still took less than 6 minutes after increasing the image size, and EasyBMP only took a few seconds to convert a large, high resolution image. Therefore, the Sudoku Solver had the greatest amount of time to be shaved off through optimization and thus offered the most challenge.
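
To illustrate where call counts like these come from, here is a compact sketch of the same backtracking idea with the box width as a parameter. It is not the profiled program: the names `Board`, `canPlace` and `solve`, and the `calls` counter, are assumptions made for this example.

```cpp
#include <cassert>
#include <vector>

using Board = std::vector<std::vector<int>>;   // 0 marks an empty cell

// True if 'num' does not yet appear in the row, column or box of (row, col).
bool canPlace(const Board& g, int boxW, int row, int col, int num) {
    int n = boxW * boxW;
    for (int i = 0; i < n; ++i)
        if (g[row][i] == num || g[i][col] == num) return false;
    int r0 = row - row % boxW, c0 = col - col % boxW;
    for (int r = r0; r < r0 + boxW; ++r)
        for (int c = c0; c < c0 + boxW; ++c)
            if (g[r][c] == num) return false;
    return true;
}

// Plain backtracking over the cells in row-major order.
// 'calls' counts how many times the function is entered.
bool solve(Board& g, int boxW, long long& calls, int cell = 0) {
    ++calls;
    int n = boxW * boxW;
    assert(static_cast<int>(g.size()) == n);
    while (cell < n * n && g[cell / n][cell % n] != 0) ++cell;
    if (cell == n * n) return true;             // no empty cell left
    int row = cell / n, col = cell % n;
    for (int num = 1; num <= n; ++num) {
        if (canPlace(g, boxW, row, col, num)) {
            g[row][col] = num;                  // tentative guess
            if (solve(g, boxW, calls, cell + 1)) return true;
            g[row][col] = 0;                    // undo and try the next value
        }
    }
    return false;                               // triggers backtracking
}
```

Calling `solve(grid, 3, calls)` on a 9x9 board and then the same function with box widths 4 and 5 on larger boards gives a rough way to compare how quickly the number of calls grows.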
=== Assignment 2 ===
__syncthreads();
}
/**/
if (control)
d_a[row * N + col] = guess;
/**/
}
bool SolveSudoku(int grid[N][N], int* d_a, int row, int col)
{
    // If there is no unassigned location, we are done
    if (!FindUnassignedLocation(grid, row, col))
        return true; // success!
    dim3 block(N, 3);
    int guess = 0;
    while (guess <= N) {
        makeGuess << <1, block >> > (d_a, guess, row, col);
        cudaMemcpy(&guess, d_a + row * N + col, sizeof(int), cudaMemcpyDeviceToHost);
        if (guess <= N && SolveSudoku(grid, d_a, row, col + 1))
            return true;
    }
    //Erase the guess on the host
    int zero = UNASSIGNED;
    cudaMemcpy(d_a + row * N + col, &zero, sizeof(int), cudaMemcpyHostToDevice);
    return false; // this triggers backtracking
}
 
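/* Searches the grid to find an entry that is still unassigned. If
found, the reference parameters row, col will be set to the location
that is unassigned, and true is returned. If no unassigned entries
remain, false is returned. */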
bool FindUnassignedLocation(int grid[N][N], int &row, int &col)
{
    for (; row < N; ++row) {
        for (; col < N; ++col)
            if (grid[row][col] == UNASSIGNED)
                return true;
        col = 0;
    }
    return false;
}
 
/* A utility function to print grid */
void printGrid(int grid[N][N])
{
    for (int row = 0; row < N; row++) {
        for (int col = 0; col < N; col++)
            printf("%2d", grid[row][col]);
        printf("\n");
    }
}
 
/* Driver Program to test above functions */
int main()
{
/* 0 means unassigned cells */
int grid[N][N] = {{3, 0, 6, 5, 0, 8, 4, 0, 0},
{5, 2, 0, 0, 0, 0, 0, 0, 0},
{0, 8, 7, 0, 0, 0, 0, 3, 1},
{0, 0, 3, 0, 1, 0, 0, 8, 0},
{9, 0, 0, 8, 6, 3, 0, 0, 5},
{0, 5, 0, 0, 9, 0, 6, 0, 0},
{1, 3, 0, 0, 0, 0, 2, 5, 0},
{0, 0, 0, 0, 0, 0, 0, 7, 4},
{0, 0, 5, 2, 0, 6, 3, 0, 0} };
/**
int grid[N][N] = {{0, 8, 0, 0, 0, 0, 0, 3, 0, 0, 0, 10, 9, 7, 11, 0},
{0, 9, 15, 13, 0, 10, 0, 0, 2, 6, 8, 16, 0, 0, 0, 0},
{0, 0, 16, 0, 15, 0, 8, 0, 9, 0, 0, 0, 6, 0, 2, 0},
{1, 0, 2, 0, 9, 11, 4, 6, 15, 3, 5, 7, 0, 0, 12, 0},
{16, 6, 4, 0, 5, 2, 0, 0, 1, 0, 0, 0, 11, 0, 0, 12},
{5, 11, 0, 0, 0, 3, 0, 15, 0, 16, 0, 13, 0, 1, 0, 8},
{0, 0, 3, 0, 0, 6, 11, 14, 0, 5, 7, 0, 0, 9, 0, 0},
{0, 0, 0, 14, 8, 0, 10, 0, 0, 11, 12, 0, 0, 0, 0, 0},
{0, 7, 13, 0, 0, 0, 0, 12, 0, 8, 9, 0, 0, 0, 3, 0},
{0, 0, 11, 9, 0, 7, 0, 0, 0, 0, 0, 12, 0, 8, 16, 5},
{0, 0, 10, 0, 11, 13, 0, 0, 0, 0, 0, 3, 12, 0, 6, 0},
{0, 5, 0, 0, 10, 15, 0, 1, 7, 2, 0, 0, 14, 11, 0, 0},
{0, 0, 5, 0, 0, 12, 14, 0, 0, 10, 0, 0, 15, 0, 0, 4},
{9, 0, 14, 6, 0, 0, 1, 0, 16, 0, 2, 0, 3, 0, 13, 0},
{8, 13, 0, 4, 0, 0, 0, 0, 12, 7, 3, 0, 0, 6, 0, 0},
{0, 16, 12, 0, 0, 5, 0, 9, 0, 13, 14, 4, 1, 0, 0, 0} };
/**/
int* d_a;
cudaMalloc((void**)&d_a, N*N * sizeof(int));
cudaMemcpy(d_a, grid, N*N * sizeof(int), cudaMemcpyHostToDevice);
SolveSudoku(grid, d_a, 0, 0);
cudaMemcpy(grid, d_a, N*N * sizeof(int), cudaMemcpyDeviceToHost);
printGrid(grid);
cudaFree(d_a);
return 0;
}
|}
This code is unable to solve the 16x16 in any reasonable amount of time (I stopped it after 10+ minutes).
Considering the 130+ empty spaces in the grid, I estimate over 130^2 calls to cudaMemcpy either way...
 
So we need an algorithm which will check each open spot, calculate all possible values which can fit there, and assign a value wherever only a single one fits.
We can also check each section (box, row, col) for values which can only go in one place.
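
The two rules can be sketched on the CPU as follows. This is a sequential illustration under stated assumptions, not the kernel used in the attempts below: the names `Board`, `candidates` and `fillSingles` are invented for the example, candidate sets are kept as bit masks, and the "only one place" rule is applied to boxes only to keep the sketch short.

```cpp
#include <cassert>
#include <vector>

using Board = std::vector<std::vector<int>>;   // 0 marks an empty cell

// Bit (v - 1) of the result is set when value v can still go in (row, col).
unsigned candidates(const Board& g, int boxW, int row, int col) {
    int n = boxW * boxW;
    assert(n <= 32);                            // one bit per value
    unsigned used = 0;
    for (int i = 0; i < n; ++i) {
        if (g[row][i]) used |= 1u << (g[row][i] - 1);
        if (g[i][col]) used |= 1u << (g[i][col] - 1);
    }
    int r0 = row - row % boxW, c0 = col - col % boxW;
    for (int r = r0; r < r0 + boxW; ++r)
        for (int c = c0; c < c0 + boxW; ++c)
            if (g[r][c]) used |= 1u << (g[r][c] - 1);
    unsigned all = (n == 32) ? ~0u : (1u << n) - 1u;
    return all & ~used;
}

// One pass: fill every empty cell that has exactly one candidate, or that is
// the only cell in its box able to take some value. Returns cells filled.
int fillSingles(Board& g, int boxW) {
    int n = boxW * boxW, filled = 0;
    for (int row = 0; row < n; ++row)
        for (int col = 0; col < n; ++col) {
            if (g[row][col]) continue;
            unsigned mine = candidates(g, boxW, row, col);
            unsigned onlyHere = mine;           // values no other box cell can take
            int r0 = row - row % boxW, c0 = col - col % boxW;
            for (int r = r0; r < r0 + boxW; ++r)
                for (int c = c0; c < c0 + boxW; ++c)
                    if ((r != row || c != col) && !g[r][c])
                        onlyHere &= ~candidates(g, boxW, r, c);
            unsigned pick = (mine & (mine - 1)) == 0 ? mine : onlyHere;
            if (pick != 0 && (pick & (pick - 1)) == 0) {   // exactly one bit set
                int value = 1;
                while (!(pick & 1u)) { pick >>= 1; ++value; }
                g[row][col] = value;
                ++filled;
            }
        }
    return filled;
}
```

Calling `fillSingles` repeatedly until it returns 0 mirrors the "keep going while something changed" loop; a puzzle that still has empty cells at that point needs guessing or stronger rules.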
 
{| class="wikitable mw-collapsible mw-collapsed"
! Attempt One...
|-
|
/**
* Vincent Terpstra
* Sudoku.cu
* March 18 / 2019
* An Optimistic approach to solving a Sudoku on a CUDA enabled GPU
* Assumes that the puzzle is deterministic(single solvable solution)
* AND each next step can be found with the kernel
* KERNEL: educatedGuess
* searches each square in a box for
* squares that have only a single appropriate value
* OR values that (in the box) can only fit in one square
*/
 
#include <stdio.h>
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <iomanip>
// CUDA header file
#include "cuda_runtime.h"
#include <device_launch_parameters.h>
#ifndef __CUDACC__
#define __CUDACC__
#endif
#include <device_functions.h>
#include <stdio.h>
// UNASSIGNED is used for empty cells in sudoku grid
#define UNASSIGNED 0
// N is used for the size of Sudoku grid. Size will be NxN
#define BOXWIDTH 5
#define N (BOXWIDTH * BOXWIDTH)
 
/*
* kernel to solve a sudoku
* Input: sudoku puzzle partitioned into boxes
* * d_a = the sudoku puzzle
* figures out what values can fit in each square
* figures out how many spots each value can go
* assigns the appropriate values,
* saves to addedIdx to show that there is a change
*/
 
__global__ void educatedGuess(int * d_a, int * addedIdx) {
int idx = threadIdx.x + BOXWIDTH * threadIdx.y;
int gridX = threadIdx.x + BOXWIDTH * blockIdx.x;
int gridY = threadIdx.y + BOXWIDTH * blockIdx.y;
int gridIdx = gridX + N * gridY;
__shared__ bool hasValue[N]; //If the value occurs in the box
__shared__ int inBox[N]; //Number of places each integer can go in the box
hasValue[idx] = false;
inBox[idx] = 0;
__syncthreads();
int at = d_a[gridIdx];
if (at != 0)
hasValue[at - 1] = true;
__syncthreads();
if (at != 0)
return;
//For remembering which values were seen in the rows and columns
bool foundVal[N];
for (int i = 0; i < N; ++i)
foundVal[i] = hasValue[i];
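for (int check = 0; check < N; check++) {
// Skip empty squares: a value of 0 would index foundVal[-1]
int inCol = d_a[N * check + gridX];
int inRow = d_a[N * gridY + check];
if (inCol != UNASSIGNED)
foundVal[inCol - 1] = true;
if (inRow != UNASSIGNED)
foundVal[inRow - 1] = true;
}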
int fndVals = 0;
for( int i = 0; i < N; ++i)
if (!foundVal[i]) {
fndVals++;
at = i + 1;
}
if (fndVals == 1) {
//Only one possible value for this index
d_a[gridIdx] = at; //assign value
addedIdx[0] = gridIdx; //to tell host that the table has changed
inBox[at - 1] = 4; //Prevent one index per value
}
__syncthreads();
//Calculate the number of places each integer can go in the box
for (int i = 0; i < N; ++i) {
int num = (idx + i) % N; //keep each thread on a separate idx
if (!foundVal[num])
inBox[num]++;
__syncthreads();
}
for (int i = 0; i < N; ++i) {
//if there is only one possible index for that value assign the value
if (inBox[i] == 1 && !foundVal[i]) {
d_a[gridIdx] = i + 1; //assign value
addedIdx[0] = gridIdx; //to tell host that the table has changed
}
}
}
 
/* Solves the Sudoku, with best values */
void SolveSudoku(int grid[N][N], int* d_a, int* d_results)
{
dim3 block(BOXWIDTH, BOXWIDTH);
int lastIdx(-1), nextIdx(-1);
do {
lastIdx = nextIdx;
educatedGuess << <block, block >> > (d_a, d_results);
cudaMemcpy(&nextIdx, d_results, sizeof(int), cudaMemcpyDeviceToHost);
} while (lastIdx != nextIdx);
}
 
/* A utility function to print grid */
void printGrid(int grid[N][N])
{ for (int row = 0; row < N; row++) {
for (int col = 0; col < N; col++)
printf("%3d", grid[row][col]);
printf("\n");
}
}
 
/* Driver Program to test above functions */
int main()
{ /* 0 means unassigned cells *
int grid[N][N] =
{ {3, 0, 6, 5, 0, 8, 4, 0, 0},
{5, 2, 0, 0, 0, 0, 0, 0, 0},
{0, 8, 7, 0, 0, 0, 0, 3, 1},
{0, 0, 3, 0, 1, 0, 0, 8, 0},
{9, 0, 0, 8, 6, 3, 0, 0, 5},
{0, 5, 0, 0, 9, 0, 6, 0, 0},
{1, 3, 0, 0, 0, 0, 2, 5, 0},
{0, 0, 0, 0, 0, 0, 0, 7, 4},
{0, 0, 5, 2, 0, 6, 3, 0, 0} };
 
/**
int grid[N][N] =
{{0, 8, 0, 0, 0, 0, 0, 3, 0, 0, 0, 10, 9, 7, 11, 0},
{0, 9, 15, 13, 0, 10, 0, 0, 2, 6, 8, 16, 0, 0, 0, 0},
{0, 0, 16, 0, 15, 0, 8, 0, 9, 0, 0, 0, 6, 0, 2, 0},
{1, 0, 2, 0, 9, 11, 4, 6, 15, 3, 5, 7, 0, 0, 12, 0},
{16, 6, 4, 0, 5, 2, 0, 0, 1, 0, 0, 0, 11, 0, 0, 12},
{5, 11, 0, 0, 0, 3, 0, 15, 0, 16, 0, 13, 0, 1, 0, 8},
{0, 0, 3, 0, 0, 6, 11, 14, 0, 5, 7, 0, 0, 9, 0, 0},
{0, 0, 0, 14, 8, 0, 10, 0, 0, 11, 12, 0, 0, 0, 0, 0},
{0, 7, 13, 0, 0, 0, 0, 12, 0, 8, 9, 0, 0, 0, 3, 0},
{0, 0, 11, 9, 0, 7, 0, 0, 0, 0, 0, 12, 0, 8, 16, 5},
{0, 0, 10, 0, 11, 13, 0, 0, 0, 0, 0, 3, 12, 0, 6, 0},
{0, 5, 0, 0, 10, 15, 0, 1, 7, 2, 0, 0, 14, 11, 0, 0},
{0, 0, 5, 0, 0, 12, 14, 0, 0, 10, 0, 0, 15, 0, 0, 4},
{9, 0, 14, 6, 0, 0, 1, 0, 16, 0, 2, 0, 3, 0, 13, 0},
{8, 13, 0, 4, 0, 0, 0, 0, 12, 7, 3, 0, 0, 6, 0, 0},
{0, 16, 12, 0, 0, 5, 0, 9, 0, 13, 14, 4, 1, 0, 0, 0} };
/**/
 
int grid[N][N] =
{ {1, 0, 4, 0, 25, 0, 19, 0, 0, 10, 21, 8, 0, 14, 0, 6, 12, 9, 0, 0, 0, 0, 0, 0, 5},
{5, 0, 19, 23, 24, 0, 22, 12, 0, 0, 16, 6, 0, 20, 0, 18, 0, 25, 14, 13, 10, 11, 0, 1, 15},
{0, 0, 0, 0, 0, 0, 21, 5, 0, 20, 11, 10, 0, 1, 0, 4, 8, 24, 23, 15, 18, 0, 16, 22, 19},
{0, 7, 21, 8, 18, 0, 0, 0, 11, 0, 5, 0, 0, 24, 0, 0, 0, 17, 22, 1, 9, 6, 25, 0, 0},
{0, 13, 15, 0, 22, 14, 0, 18, 0, 16, 0, 0, 0, 4, 0, 0, 0, 19, 0, 0, 0, 24, 20, 21, 17},
{12, 0, 11, 0, 6, 0, 0, 0, 0, 15, 0, 0, 0, 0, 21, 25, 19, 0, 4, 0, 22, 14, 0, 20, 0},
{8, 0, 0, 21, 0, 16, 0, 0, 0, 2, 0, 3, 0, 0, 0, 0, 17, 23, 18, 22, 0, 0, 0, 24, 6},
{4, 0, 14, 18, 7, 9, 0, 22, 21, 19, 0, 0, 0, 2, 0, 5, 0, 0, 0, 6, 16, 15, 0, 11, 12},
{22, 0, 24, 0, 23, 0, 0, 11, 0, 7, 0, 0, 4, 0, 14, 0, 2, 12, 0, 8, 5, 19, 0, 25, 9},
{20, 0, 0, 0, 5, 0, 0, 0, 0, 17, 9, 0, 12, 18, 0, 1, 0, 0, 7, 24, 0, 0, 0, 13, 4},
{13, 0, 0, 5, 0, 2, 23, 14, 4, 18, 22, 0, 17, 0, 0, 20, 0, 1, 9, 21, 12, 0, 0, 8, 11},
{14, 23, 0, 24, 0, 0, 0, 0, 0, 0, 0, 0, 20, 25, 0, 3, 4, 13, 0, 11, 21, 9, 5, 18, 22},
{7, 0, 0, 11, 17, 20, 24, 0, 0, 0, 3, 4, 1, 12, 0, 0, 6, 14, 0, 5, 25, 13, 0, 0, 0},
{0, 0, 16, 9, 0, 17, 11, 7, 10, 25, 0, 0, 0, 13, 6, 0, 0, 18, 0, 0, 19, 4, 0, 0, 20},
{6, 15, 0, 19, 4, 13, 0, 0, 5, 0, 18, 11, 0, 0, 9, 8, 22, 16, 25, 10, 7, 0, 0, 0, 0},
{0, 0, 0, 2, 0, 0, 10, 19, 3, 0, 1, 0, 22, 9, 4, 11, 15, 0, 20, 0, 0, 8, 23, 0, 25},
{0, 24, 8, 13, 1, 0, 0, 4, 20, 0, 17, 14, 0, 0, 18, 0, 16, 22, 5, 0, 11, 0, 10, 0, 0},
{23, 10, 0, 0, 0, 0, 0, 0, 18, 0, 6, 0, 16, 0, 0, 17, 1, 0, 13, 0, 0, 3, 19, 12, 0},
{25, 5, 0, 14, 11, 0, 17, 0, 8, 24, 13, 0, 19, 23, 15, 9, 0, 0, 12, 0, 20, 0, 22, 0, 7},
{0, 0, 17, 4, 0, 22, 15, 0, 23, 11, 12, 25, 0, 0, 0, 0, 18, 8, 0, 7, 0, 0, 14, 0, 13},
{19, 6, 23, 22, 8, 0, 0, 1, 25, 4, 14, 2, 0, 3, 7, 13, 10, 11, 16, 0, 0, 0, 0, 0, 0},
{0, 4, 0, 17, 0, 3, 0, 24, 0, 8, 20, 23, 11, 10, 25, 22, 0, 0, 0, 12, 13, 2, 18, 6, 0},
{0, 0, 7, 16, 0, 0, 6, 17, 2, 21, 0, 18, 0, 0, 0, 19, 0, 0, 8, 0, 0, 0, 0, 4, 0},
{18, 9, 25, 1, 2, 11, 0, 0, 13, 22, 4, 0, 21, 0, 5, 0, 23, 7, 0, 0, 15, 0, 3, 0, 8},
{0, 21, 10, 0, 0, 12, 0, 20, 16, 0, 19, 0, 0, 0, 0, 15, 14, 4, 2, 18, 23, 25, 11, 7, 0} };
/**/
 
int* d_a; //Table
int* d_result; //Table change indicator
 
cudaMalloc((void**)&d_a, N*N * sizeof(int));
cudaMalloc((void**)&d_result, sizeof(int));
 
//Copy Sudoku over
cudaMemcpy(d_a, grid, N*N * sizeof(int), cudaMemcpyHostToDevice);
 
SolveSudoku(grid, d_a, d_result);
 
//Copy Sudoku back
cudaMemcpy(grid, d_a, N*N * sizeof(int), cudaMemcpyDeviceToHost);
printGrid(grid);
 
cudaFree(d_a);
cudaFree(d_result);
 
return 0;
}
 
|}
 
'''Single Pass Sudoku Solver'''
 
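This kernel was designed to run on a single block of N*N threads, one thread per square of the Sudoku.
Since a block is limited to 1024 threads, this limits us to a Sudoku of size 25 * 25 (625 threads); the next size up, 36 * 36, would need 1296.
For each empty square, it counts the number of possible values which can fit and how many squares in each section (row, column, box) can take each value.
If only one value can fit in a square, or a value has only one possible square in a section, the value is assigned.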
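
The size limit can be worked out with a few lines of arithmetic. The sketch below assumes a limit of 1024 threads per block, which holds for compute capability 2.0 and later; the function name `largestBoxWidth` is made up for this example, and the real limit should be queried from the device.

```cpp
#include <cassert>

// Largest box width whose whole grid fits in one thread block when every
// cell gets its own thread. The limit passed in is an assumption about the
// hardware (1024 on compute capability 2.0 and later).
int largestBoxWidth(int maxThreadsPerBlock) {
    assert(maxThreadsPerBlock >= 1);
    int boxW = 1;
    while (true) {
        int n = (boxW + 1) * (boxW + 1);        // side length of the next grid
        if (n * n > maxThreadsPerBlock) break;  // n * n threads would not fit
        ++boxW;
    }
    return boxW;
}
```

With a limit of 1024 the function stops at a box width of 5, because a box width of 6 gives a 36 * 36 grid and 1296 threads.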
 
 
__global__ void superSolve(int * d_a) {
//Used to remember which row | col | box ( section ) have which values
__shared__ bool rowHas[N][N];
__shared__ bool colHas[N][N];
__shared__ bool boxHas[N][N];
//Used to ensure that the table has changed
__shared__ int added, past;
//Number of spaces which can place the number in each section
__shared__ int rowCount[N][N];
__shared__ int colCount[N][N];
__shared__ int boxCount[N][N];
//Where the square is located in the Sudoku
int row = threadIdx.x;
int col = threadIdx.y;
int box = row / BOXWIDTH + (col / BOXWIDTH) * BOXWIDTH;
//Unique identifier for each square in row, col, box
//Corresponds to the generic Sudoku Solve
//Using a Sudoku to solve a Sudoku !!!
int offset = col + (row % BOXWIDTH) * BOXWIDTH + (box % BOXWIDTH);
//Square's location in the Sudoku
int gridIdx = col * N + row;
int at = d_a[gridIdx];
if (!gridIdx) { //Thread at 0,0 sets values
added = -1;
past = -2;
}
rowHas[col][row] = false;
colHas[col][row] = false;
boxHas[col][row] = false;
__syncthreads();
if (at != UNASSIGNED) {
rowHas[row][at - 1] = true;
colHas[col][at - 1] = true;
boxHas[box][at - 1] = true;
}
//Previous loop has not changed any values
while (added != past) {
//RESET counters
rowCount[col][row] = 0;
colCount[col][row] = 0;
boxCount[col][row] = 0;
__syncthreads();
if (!gridIdx) //forget previous change
past = added;
int count = 0; //number of values which can fit in this square
int guess = at; //last value found which can fit in this square
for (int idx = 0; idx < N; ++idx) {
//Ensures that every square in each section is working on a different number in the section
int num = (idx + offset) % N;
if (at == UNASSIGNED && !(rowHas[row][num] || colHas[col][num] || boxHas[box][num])) {
count++;
guess = num + 1;
rowCount[row][num] ++;
colCount[col][num] ++;
boxCount[box][num] ++;
}
__syncthreads();
}
//Only ONE value can fit in this spot
if (count == 1) {
at = guess--;
d_a[gridIdx] = at;
rowHas[row][guess] = true;
colHas[col][guess] = true;
boxHas[box][guess] = true;
added = gridIdx;
}
__syncthreads();
if (at == UNASSIGNED) {
//Find values which can go in only one spot in the section
for (int idx = 0; idx < N; ++idx) {
if (!(rowHas[row][idx] || colHas[col][idx] || boxHas[box][idx]) &&
(boxCount[box][idx] == 1 || rowCount[row][idx] == 1 || colCount[col][idx] == 1)) {
//In this section this value can only appear in this square
at = idx + 1;
d_a[gridIdx] = at;
rowHas[row][idx] = true;
colHas[col][idx] = true;
boxHas[box][idx] = true;
added = gridIdx;
}
}
}
__syncthreads();
}
}
[[File:Backtrack_vs_Kernel.png]]
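
The `offset` used in the kernel above is meant to make every square in a row, column or box start its loop on a different number, so that no two squares in the same section touch the same counter in the same step. A small host-side check of that property, with the hypothetical helper names `startOffset` and `offsetsAreDistinct`:

```cpp
#include <cassert>
#include <vector>

// Same arithmetic as the kernel's 'offset', reduced modulo N.
int startOffset(int boxW, int row, int col) {
    int n = boxW * boxW;
    int box = row / boxW + (col / boxW) * boxW;
    return (col + (row % boxW) * boxW + (box % boxW)) % n;
}

// True if no two squares sharing a row, column or box get the same offset.
bool offsetsAreDistinct(int boxW) {
    assert(boxW >= 1);
    int n = boxW * boxW;
    // seen[kind][section][offset]: kind 0 = row, 1 = column, 2 = box
    std::vector<std::vector<std::vector<bool>>> seen(
        3, std::vector<std::vector<bool>>(n, std::vector<bool>(n, false)));
    for (int row = 0; row < n; ++row)
        for (int col = 0; col < n; ++col) {
            int box = row / boxW + (col / boxW) * boxW;
            int section[3] = {row, col, box};
            int off = startOffset(boxW, row, col);
            for (int kind = 0; kind < 3; ++kind) {
                if (seen[kind][section[kind]][off]) return false;
                seen[kind][section[kind]][off] = true;
            }
        }
    return true;
}
```

The check only covers the starting offsets; it says nothing about whether the increments in the kernel are otherwise free of races.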
=== Assignment 3 ===
 
 
Changes:
Reduced Thread Divergence/CGMA
- each thread now remembers which values it has seen in a boolean array
- values are only assigned to the grid after the kernel 'solves' the sudoku
- the at value in the kernel and the shared memory for rowHas, colHas, boxHas are updated in a single place
Coalesced Memory
- changed the indexing used when modifying the _Has and _Count arrays from row->col to col->row, as row (threadIdx.x) is our fastest moving dimension
Clarified Code
- use gridIdx == 0 rather than !gridIdx
- use a do-while loop rather than a while loop
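
The effect of the indexing change can be seen by comparing how far apart the elements touched by two neighbouring threads are. This is a host-side illustration only, assuming row-major N x N arrays as in the kernels; the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdlib>

// Flat element index of arr[a][b] in a row-major n x n array.
int flatIndex(int n, int a, int b) {
    assert(a >= 0 && a < n && b >= 0 && b < n);
    return a * n + b;
}

// Distance in elements between the slots touched by two threads that are
// neighbours in threadIdx.x (called 'row' in the kernels), for each ordering.
int strideColRow(int n, int row, int col) {     // arr[col][row]
    return std::abs(flatIndex(n, col, row + 1) - flatIndex(n, col, row));
}
int strideRowCol(int n, int row, int col) {     // arr[row][col]
    return std::abs(flatIndex(n, row + 1, col) - flatIndex(n, row, col));
}
```

With N = 25, the col->row ordering puts neighbouring threads one element apart, while the row->col ordering puts them 25 elements apart. How much that matters in practice depends on the memory space and the hardware.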
 
{| class="wikitable mw-collapsible mw-collapsed"
! Full code
|-
|
 
#include <stdio.h>
// CUDA header file
#include "cuda_runtime.h"
#include <device_launch_parameters.h>
#ifndef __CUDACC__
#define __CUDACC__
#endif
// UNASSIGNED is used for empty cells in Sudoku grid
#define UNASSIGNED 0
// BOX_W is used for the length of one of the square sub-regions of the Sudoku grid.
// Overall length will be N * N.
#define BOX_W 5
#define N (BOX_W * BOX_W)
__global__ void solve(int* d_a) {
// Used to remember which row | col | box ( section ) have which values
__shared__ bool rowHas[N][N];
__shared__ bool colHas[N][N];
__shared__ bool boxHas[N][N];
// Used to ensure that the table has changed
__shared__ bool changed;
// Number of spaces which can place the number in each section
__shared__ int rowCount[N][N];
__shared__ int colCount[N][N];
__shared__ int boxCount[N][N];
// Where the square is located in the Sudoku
int row = threadIdx.x;
int col = threadIdx.y;
int box = row / BOX_W + (col / BOX_W) * BOX_W;
// Unique identifier for each square in row, col, box
// Corresponds to the generic Sudoku Solve
// Using a Sudoku to solve a Sudoku !!!
int offset = col + (row % BOX_W) * BOX_W + (box % BOX_W);
// Square's location in the Sudoku
int gridIdx = col * N + row;
int at = d_a[gridIdx];
bool notSeen[N];
for (int i = 0; i < N; ++i)
notSeen[i] = true;
rowHas[col][row] = false;
colHas[col][row] = false;
boxHas[col][row] = false;
__syncthreads();
if (at != UNASSIGNED) {
rowHas[row][at - 1] = true;
colHas[col][at - 1] = true;
boxHas[box][at - 1] = true;
}
// Previous loop has not changed any values
do {
// RESET counters
rowCount[col][row] = 0;
colCount[col][row] = 0;
boxCount[col][row] = 0;
__syncthreads();
if (gridIdx == 0) // forget previous change
changed = false;
int count = 0; // number of values which can fit in this square
int guess = 0; // last value found which can fit in this square
for (int idx = 0; idx < N; ++idx) {
// Ensures that every square in each section is working on a different number in the section
int num = (idx + offset) % N;
if (at == UNASSIGNED && notSeen[num]) {
if (rowHas[row][num] || boxHas[box][num] || colHas[col][num])
notSeen[num] = false;
else {
++count;
guess = num;
rowCount[row][num]++;
colCount[col][num]++;
boxCount[box][num]++;
}
}
__syncthreads();
}
// Find values which can go in only one spot in the section
for (int idx = 0; idx < N && count > 1; ++idx) {
if (notSeen[idx] &&
(rowCount[row][idx] == 1 || boxCount[box][idx] == 1 || colCount[col][idx] == 1)) {
// In this section this value can only appear in this square
guess = idx;
count = 1;
}
}
if (count == 1) {
at = guess + 1;
rowHas[row][guess] = true;
colHas[col][guess] = true;
boxHas[box][guess] = true;
changed = true;
}
__syncthreads();
} while (changed);
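// Wait until every thread has left the loop before 'changed' is reused
__syncthreads();
//SOLVED CHECK: every section must contain every value
if (!(rowHas[row][col] && colHas[row][col] && boxHas[row][col]))
changed = true;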
__syncthreads();
if (changed && gridIdx == 0)
at = 0;
d_a[gridIdx] = at;
}
void print(int result[N][N]) {
for (int row = 0; row < N; row++) {
for (int col = 0; col < N; col++)
printf("%3d", result[row][col]);
printf("\n");
}
}
// Driver program to test main program functions
int main() {
int h_a[N][N] = {
{ 1, 0, 4, 0, 25, 0, 19, 0, 0, 10, 21, 8, 0, 14, 0, 6, 12, 9, 0, 0, 0, 0, 0, 0, 5},
{ 5, 0, 19, 23, 24, 0, 22, 12, 0, 0, 16, 6, 0, 20, 0, 18, 0, 25, 14, 13, 10, 11, 0, 1, 15},
{ 0, 0, 0, 0, 0, 0, 21, 5, 0, 20, 11, 10, 0, 1, 0, 4, 8, 24, 23, 15, 18, 0, 16, 22, 19},
{ 0, 7, 21, 8, 18, 0, 0, 0, 11, 0, 5, 0, 0, 24, 0, 0, 0, 17, 22, 1, 9, 6, 25, 0, 0},
{ 0, 13, 15, 0, 22, 14, 0, 18, 0, 16, 0, 0, 0, 4, 0, 0, 0, 19, 0, 0, 0, 24, 20, 21, 17},
{ 12, 0, 11, 0, 6, 0, 0, 0, 0, 15, 0, 0, 0, 0, 21, 25, 19, 0, 4, 0, 22, 14, 0, 20, 0},
{ 8, 0, 0, 21, 0, 16, 0, 0, 0, 2, 0, 3, 0, 0, 0, 0, 17, 23, 18, 22, 0, 0, 0, 24, 6},
{ 4, 0, 14, 18, 7, 9, 0, 22, 21, 19, 0, 0, 0, 2, 0, 5, 0, 0, 0, 6, 16, 15, 0, 11, 12},
{ 22, 0, 24, 0, 23, 0, 0, 11, 0, 7, 0, 0, 4, 0, 14, 0, 2, 12, 0, 8, 5, 19, 0, 25, 9},
{ 20, 0, 0, 0, 5, 0, 0, 0, 0, 17, 9, 0, 12, 18, 0, 1, 0, 0, 7, 24, 0, 0, 0, 13, 4},
{ 13, 0, 0, 5, 0, 2, 23, 14, 4, 18, 22, 0, 17, 0, 0, 20, 0, 1, 9, 21, 12, 0, 0, 8, 11},
{ 14, 23, 0, 24, 0, 0, 0, 0, 0, 0, 0, 0, 20, 25, 0, 3, 4, 13, 0, 11, 21, 9, 5, 18, 22},
{ 7, 0, 0, 11, 17, 20, 24, 0, 0, 0, 3, 4, 1, 12, 0, 0, 6, 14, 0, 5, 25, 13, 0, 0, 0},
{ 0, 0, 16, 9, 0, 17, 11, 7, 10, 25, 0, 0, 0, 13, 6, 0, 0, 18, 0, 0, 19, 4, 0, 0, 20},
{ 6, 15, 0, 19, 4, 13, 0, 0, 5, 0, 18, 11, 0, 0, 9, 8, 22, 16, 25, 10, 7, 0, 0, 0, 0},
{ 0, 0, 0, 2, 0, 0, 10, 19, 3, 0, 1, 0, 22, 9, 4, 11, 15, 0, 20, 0, 0, 8, 23, 0, 25},
{ 0, 24, 8, 13, 1, 0, 0, 4, 20, 0, 17, 14, 0, 0, 18, 0, 16, 22, 5, 0, 11, 0, 10, 0, 0},
{ 23, 10, 0, 0, 0, 0, 0, 0, 18, 0, 6, 0, 16, 0, 0, 17, 1, 0, 13, 0, 0, 3, 19, 12, 0},
{ 25, 5, 0, 14, 11, 0, 17, 0, 8, 24, 13, 0, 19, 23, 15, 9, 0, 0, 12, 0, 20, 0, 22, 0, 7},
{ 0, 0, 17, 4, 0, 22, 15, 0, 23, 11, 12, 25, 0, 0, 0, 0, 18, 8, 0, 7, 0, 0, 14, 0, 13},
{ 19, 6, 23, 22, 8, 0, 0, 1, 25, 4, 14, 2, 0, 3, 7, 13, 10, 11, 16, 0, 0, 0, 0, 0, 0},
{ 0, 4, 0, 17, 0, 3, 0, 24, 0, 8, 20, 23, 11, 10, 25, 22, 0, 0, 0, 12, 13, 2, 18, 6, 0},
{ 0, 0, 7, 16, 0, 0, 6, 17, 2, 21, 0, 18, 0, 0, 0, 19, 0, 0, 8, 0, 0, 0, 0, 4, 0},
{ 18, 9, 25, 1, 2, 11, 0, 0, 13, 22, 4, 0, 21, 0, 5, 0, 23, 7, 0, 0, 15, 0, 3, 0, 8},
{ 0, 21, 10, 0, 0, 12, 0, 20, 16, 0, 19, 0, 0, 0, 0, 15, 14, 4, 2, 18, 23, 25, 11, 7, 0}
};
int* d_a; //Table
cudaMalloc((void**)&d_a, N * N * sizeof(int));
// Copy Sudoku to device
cudaMemcpy(d_a, h_a, N * N * sizeof(int), cudaMemcpyHostToDevice);
dim3 dBlock(N, N);
solve << <1, dBlock >> > (d_a);
// Copy Sudoku back to host
cudaMemcpy(h_a, d_a, N * N * sizeof(int), cudaMemcpyDeviceToHost);
// Check if solved
if (h_a[0][0])
print(h_a);
else
printf("No solution could be found.");
cudaFree(d_a);
return 0;
}
 
 
|}
 
[[File:Unoptimized_vs_Optimized.png]]
 
===Kernel Optimization Attempts===
These Kernels change a minor part of the Optimized Kernel or use a slightly different algorithm in an attempt to make it faster
 
 
Change : Replaces the boolean array notSeen with a single int & uses bitwise operators
Theory : Since local array variables of threads are stored in local memory, which physically resides in Global memory, this was an attempt to move that state into a register
Result : No speed up noticed, suggesting that more is happening beyond arrays stored in Global memory, perhaps some type of paging;
more testing would be needed on something less erratic than a Sudoku Solver
{| class="wikitable mw-collapsible mw-collapsed"
! Using an int as a boolean array
|-
|
__global__ void solve(int* d_a) {
// Used to remember which row | col | box ( section ) have which values
__shared__ bool rowHas[N][N];
__shared__ bool colHas[N][N];
__shared__ bool boxHas[N][N];
// Used to ensure that the table has changed
__shared__ bool changed;
// Number of spaces which can place the number in each section
__shared__ int rowCount[N][N];
__shared__ int colCount[N][N];
__shared__ int boxCount[N][N];
// Where the square is located in the Sudoku
int row = threadIdx.x;
int col = threadIdx.y;
int box = row / BOX_W + (col / BOX_W) * BOX_W;
int gridIdx = col * N + row;
int at = d_a[gridIdx];
// Unique identifier for each square in row, col, box
// Corresponds to the generic Sudoku Solve
// Using a Sudoku to solve a Sudoku !!!
int offset = col + (row % BOX_W) * BOX_W + (box % BOX_W);
// Square's location in the Sudoku
int notSeen = 0;
rowHas[col][row] = false;
colHas[col][row] = false;
boxHas[col][row] = false;
__syncthreads();
if (at != UNASSIGNED) {
rowHas[row][at - 1] = true;
colHas[col][at - 1] = true;
boxHas[box][at - 1] = true;
} else {
notSeen = ~0;
}
__syncthreads();
// Previous loop has not changed any values
do {
// RESET counters
rowCount[col][row] = 0;
colCount[col][row] = 0;
boxCount[col][row] = 0;
__syncthreads();
if (gridIdx == 0) // forget previous change
changed = false;
int count = 0; // number of values which can fit in this square
int guess = 0; // last value found which can fit in this square
int b_shuttle = 1;
for (int idx = 0; idx < N; ++idx) {
// Ensures that every square in each section is working on a different number in the section
int num = (idx + offset) % N;
if (b_shuttle & notSeen) {
if (rowHas[row][num] || boxHas[box][num] || colHas[col][num])
notSeen ^= b_shuttle;
else {
++count;
guess = num;
rowCount[row][num]++;
colCount[col][num]++;
boxCount[box][num]++;
}
}
b_shuttle <<= 1;
__syncthreads();
}
// Find values which can go in only one spot in the section
b_shuttle = 1;
for (int idx = 0; idx < N && count > 1; ++idx) {
int num = (idx + offset) % N;
if ((b_shuttle & notSeen) &&
(rowCount[row][num] == 1 || boxCount[box][num] == 1 || colCount[col][num] == 1)) {
// In this section this value can only appear in this square
guess = num;
count = 1;
}
b_shuttle <<= 1;
}
 
if (count == 1) {
at = guess + 1;
notSeen = 0;
rowHas[row][guess] = true;
colHas[col][guess] = true;
boxHas[box][guess] = true;
changed = true;
}
__syncthreads();
} while (changed);
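// Wait until every thread has left the loop before 'changed' is reused
__syncthreads();
//SOLVED CHECK: every section must contain every value
if (!(rowHas[row][col] && colHas[row][col] && boxHas[row][col]))
changed = true;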
__syncthreads();
if (changed && gridIdx == 0)
at = 0;
d_a[gridIdx] = at;
}
|}
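
The idea of using an int as a boolean array can be shown in isolation. The helpers below are a sketch with invented names (`allNotSeen`, `isNotSeen`, `markSeen`, `countNotSeen`); bit i plays the role of notSeen[i], which works as long as N is at most 32 (N is 25 here).

```cpp
#include <cassert>

// A set of up to 32 flags kept in one unsigned int instead of a bool array.
unsigned allNotSeen(int n) {
    assert(n >= 1 && n <= 32);
    return n == 32 ? ~0u : (1u << n) - 1u;      // lowest n bits set
}

// Reads flag i.
bool isNotSeen(unsigned mask, int i) { return (mask >> i) & 1u; }

// Clears flag i and returns the new mask; clearing twice is harmless.
unsigned markSeen(unsigned mask, int i) { return mask & ~(1u << i); }

// Counts the flags that are still set.
int countNotSeen(unsigned mask) {
    int count = 0;
    for (; mask != 0; mask &= mask - 1) ++count;   // clears the lowest set bit
    return count;
}
```

Clearing with `& ~bit` rather than toggling with `^` means a flag cannot be switched back on by accident if the same value is marked twice.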
Change : Removes the counters and the logic which checks for a section needing a value in one place
Theory : The counting logic requires an additional nested loop each solve cycle and creates more thread divergence
Result : The algorithm is slower, probably because the 'sections requiring a single value' logic adds more values early in the kernel, resulting in fewer passes overall
Also, this kernel is similar to one of my earlier builds, which was unable to solve the 9x9, getting stuck when every square had more than one possible value
{| class="wikitable mw-collapsible mw-collapsed"
! Dropping Section Logic
|-
|
__global__ void solve(int* d_a) {
// Used to remember which row | col | box ( section ) have which values
__shared__ bool rowHas[N][N];
__shared__ bool colHas[N][N];
__shared__ bool boxHas[N][N];
// Used to ensure that the table has changed
__shared__ bool changed;
// Number of spaces which can place the number in each section
// Where the square is located in the Sudoku
int row = threadIdx.x;
int col = threadIdx.y;
int box = row / BOX_W + (col / BOX_W) * BOX_W;
// Unique identifier for each square in row, col, box
// Corresponds to the generic Sudoku Solve
// Using a Sudoku to solve a Sudoku !!!
int offset = col + (row % BOX_W) * BOX_W + (box % BOX_W);
// Square's location in the Sudoku
int gridIdx = col * N + row;
int at = d_a[gridIdx];
bool notSeen[N];
for (int i = 0; i < N; ++i)
notSeen[i] = true;
rowHas[col][row] = false;
colHas[col][row] = false;
boxHas[col][row] = false;
__syncthreads();
if (at != UNASSIGNED) {
rowHas[row][at - 1] = true;
colHas[col][at - 1] = true;
boxHas[box][at - 1] = true;
}
// Previous loop has not changed any values
do {
// RESET counters
__syncthreads();
if (gridIdx == 0) // forget previous change
changed = false;
int count = 0; // number of values which can fit in this square
int guess = 0; // last value found which can fit in this square
for (int idx = 0; idx < N; ++idx) {
// Ensures that every square in each section is working on a different number in the section
int num = (idx + offset) % N;
if (at == UNASSIGNED && notSeen[num]) {
if (rowHas[row][num] || boxHas[box][num] || colHas[col][num])
notSeen[num] = false;
else {
++count;
guess = num;
}
}
__syncthreads();
}
if (count == 1) {
at = guess + 1;
rowHas[row][guess] = true;
colHas[col][guess] = true;
boxHas[box][guess] = true;
changed = true;
}
__syncthreads();
} while (changed);
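// Wait until every thread has left the loop before 'changed' is reused
__syncthreads();
//SOLVED CHECK: every section must contain every value
if (!(rowHas[row][col] && colHas[row][col] && boxHas[row][col]))
changed = true;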
__syncthreads();
if (changed && gridIdx == 0)
at = 0;
d_a[gridIdx] = at;
}
|}
 
Change : Quickly finds one section that requires a single value in one spot, by checking all sections at once and remembering a single section
Theory : Similar to the previous Kernel, trying to remove the second loop
Result : Surprisingly slow, gains little benefit from the section logic and shared memory, yet is still required to count all values
{| class="wikitable mw-collapsible mw-collapsed"
! Notify - Determines a single section that has a limited value (removes section loop)
|-
|
__global__ void solve(int* d_a) {
// Used to remember which row | col | box ( section ) have which values
__shared__ bool rowHas[N][N];
__shared__ bool colHas[N][N];
__shared__ bool boxHas[N][N];
// Used to ensure that the table has changed
__shared__ bool changed;
// Number of spaces which can place the number in each section
__shared__ int rowCount[N][N];
__shared__ int colCount[N][N];
__shared__ int boxCount[N][N];
// Where the square is located in the Sudoku
int row = threadIdx.x;
int col = threadIdx.y;
int box = row / BOX_W + (col / BOX_W) * BOX_W;
// Unique identifier for each square in row, col, box
// Corresponds to the generic Sudoku Solve
// Using a Sudoku to solve a Sudoku !!!
int offset = col + (row % BOX_W) * BOX_W + (box % BOX_W);
// Square's location in the Sudoku
int gridIdx = col * N + row;
int at = d_a[gridIdx];
bool notSeen[N];
for (int i = 0; i < N; ++i)
notSeen[i] = true;
rowHas[col][row] = false;
colHas[col][row] = false;
boxHas[col][row] = false;
__syncthreads();
__shared__ int notify;
if (at != UNASSIGNED) {
rowHas[row][at - 1] = true;
colHas[col][at - 1] = true;
boxHas[box][at - 1] = true;
}
// Repeat until a full pass changes no values
do {
// RESET counters
rowCount[col][row] = 0;
colCount[col][row] = 0;
boxCount[col][row] = 0;
__syncthreads();
if (gridIdx == 0) { // forget previous change
changed = false;
notify = -1;
}
int count = 0; // number of values which can fit in this square
int guess = 0; // last value found which can fit in this square
for (int idx = 0; idx < N; ++idx) {
// Ensures that every square in each section is working on a different number in the section
int num = (idx + offset) % N;
if (at == UNASSIGNED && notSeen[num]) {
if (rowHas[row][num] || boxHas[box][num] || colHas[col][num])
notSeen[num] = false;
else {
++count;
guess = num;
rowCount[row][num]++;
colCount[col][num]++;
boxCount[box][num]++;
}
}
__syncthreads();
}
if (rowCount[row][col] == 1 || colCount[row][col] == 1 || boxCount[row][col] == 1)
notify = col;
__syncthreads();
// Find values which can go in only one spot in the section
// notify is -1 when nothing was found; value index 0 is a valid notification
if (notify >= 0 && at == UNASSIGNED && notSeen[notify] &&
(rowCount[row][notify] == 1 || boxCount[box][notify] == 1 || colCount[col][notify] == 1)) {
// In this section this value can only appear in this square
guess = notify;
count = 1;
}
if (count == 1) {
at = guess + 1;
rowHas[row][guess] = true;
colHas[col][guess] = true;
boxHas[box][guess] = true;
changed = true;
}
__syncthreads();
} while (changed);
//SOLVED CHECK: not solved if any section is still missing a value
if (!(rowHas[row][col] && colHas[row][col] && boxHas[row][col]))
changed = true;
__syncthreads();
if (changed && gridIdx == 0)
at = 0;
d_a[gridIdx] = at;
}
|}
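
The section logic that the Notify kernel relies on can be illustrated on the host. The following is a minimal single-threaded C++ sketch, not the kernel itself: for one row it counts how many empty squares can still take each value and reports a value that fits in exactly one square. The names <code>canPlace</code> and <code>findHiddenSingleInRow</code> and the use of <code>std::array</code> are assumptions made for illustration; <code>N</code>, <code>BOX_W</code> and <code>UNASSIGNED</code> mirror the constants used by the kernels, and a 9x9 grid is assumed.

```cpp
#include <array>
#include <cassert>

constexpr int BOX_W = 3;            // assumption: 9x9 grid for this sketch
constexpr int N = BOX_W * BOX_W;
constexpr int UNASSIGNED = 0;

using Grid = std::array<std::array<int, N>, N>;

// True if `value` (1..N) does not already appear in the row, column,
// or box that contains (row, col).
bool canPlace(const Grid& g, int row, int col, int value) {
    for (int i = 0; i < N; ++i)
        if (g[row][i] == value || g[i][col] == value) return false;
    const int r0 = row - row % BOX_W;
    const int c0 = col - col % BOX_W;
    for (int r = r0; r < r0 + BOX_W; ++r)
        for (int c = c0; c < c0 + BOX_W; ++c)
            if (g[r][c] == value) return false;
    return true;
}

// Looks for a value that fits in exactly one empty square of `row`.
// On success stores that square's column and the value, and returns true.
bool findHiddenSingleInRow(const Grid& g, int row, int& outCol, int& outValue) {
    assert(row >= 0 && row < N);
    for (int value = 1; value <= N; ++value) {
        int count = 0;      // empty squares in this row that can take `value`
        int lastCol = -1;   // most recent such square
        for (int col = 0; col < N; ++col) {
            if (g[row][col] == UNASSIGNED && canPlace(g, row, col, value)) {
                ++count;
                lastCol = col;
            }
        }
        if (count == 1) {
            outCol = lastCol;
            outValue = value;
            return true;
        }
    }
    return false;
}
```

The kernel keeps the equivalent counts for rows, columns and boxes in shared memory so that every thread can read them; this sketch covers rows only to keep the idea visible.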
 
Change : Refactors the algorithm to count up front how many values can fit in each square and section,
then counts down as values are placed
Theory : Removes the redundant counting that the Optimized Kernel repeated on every pass
Result : Not faster. There is also a slight error: because notSeen is set to 0, the section counters will rarely reach one
{| class="wikitable mw-collapsible mw-collapsed"
! CountDown - using Int as Boolean Array (EDITED - now 4.28 seconds)
|-
|
__global__ void solve(int* d_a) {
// Used to remember which row | col | box ( section ) have which values
__shared__ bool rowHas[N][N];
__shared__ bool colHas[N][N];
__shared__ bool boxHas[N][N];
// Used to ensure that the table has changed
__shared__ bool changed;
// Number of spaces which can place the number in each section
__shared__ int rowCount[N][N];
__shared__ int colCount[N][N];
__shared__ int boxCount[N][N];
// Where the square is located in the Sudoku
int row = threadIdx.x;
int col = threadIdx.y;
int box = row / BOX_W + (col / BOX_W) * BOX_W;
// Square's location in the Sudoku
int gridIdx = col * N + row;
int at = d_a[gridIdx];
// Unique identifier for each square in row, col, box
// Corresponds to the generic Sudoku Solve
// Using a Sudoku to solve a Sudoku !!!
int offset = col + (row % BOX_W) * BOX_W + (box % BOX_W);
int count = 0; //Number of values which can fit in this square
int notSeen = 0; //Boolean Array as an Integer
if (gridIdx == 0)
changed = true;
rowHas[col][row] = false;
colHas[col][row] = false;
boxHas[col][row] = false;
rowCount[col][row] = 0;
colCount[col][row] = 0;
boxCount[col][row] = 0;
__syncthreads();
if (at != UNASSIGNED) {
rowHas[row][at - 1] = true;
colHas[col][at - 1] = true;
boxHas[box][at - 1] = true;
}
__syncthreads();
int guess;
int b_shuttle = 1;
for (int idx = 0; idx < N; ++idx) {
int num = (idx + offset) % N;
if (at == UNASSIGNED && !(rowHas[row][num] || boxHas[box][num] || colHas[col][num])) {
notSeen |= b_shuttle; //this value can go here
++count; //how many values this square can have
guess = num;
//how many values this section can have
rowCount[row][num]++;
colCount[col][num]++;
boxCount[box][num]++;
}
__syncthreads();
b_shuttle <<= 1;
}
if (at == UNASSIGNED && count == 0) //NOT POSSIBLE SUDOKU
changed = false;
__syncthreads();
if (count == 1) {
at = guess + 1;
notSeen = count = 0;
rowHas[row][guess] = true;
colHas[col][guess] = true;
boxHas[box][guess] = true;
}
// Repeat until a full pass changes no values
while (changed) {
__syncthreads();
if (gridIdx == 0) // forget previous change
changed = false;
int b_shuttle = 1;
for (int idx = 0; idx < N; ++idx) {
// Ensures that every square in each section is working on a different number in the section
int num = (idx + offset) % N;
if (b_shuttle & notSeen &&
(at != UNASSIGNED || rowHas[row][num] || boxHas[box][num] || colHas[col][num])) {
rowCount[row][num]--;
colCount[col][num]--;
boxCount[box][num]--;
notSeen ^= b_shuttle;
--count;
}
__syncthreads();
if (b_shuttle & notSeen &&
(count == 1 || rowCount[row][num] == 1 || boxCount[box][num] == 1 || colCount[col][num] == 1)) {
rowHas[row][num] = true;
colHas[col][num] = true;
boxHas[box][num] = true;
changed = true;
notSeen ^= b_shuttle;
at = num + 1;
count = 0;
}
b_shuttle <<= 1;
}
__syncthreads();
};
if (!(rowHas[row][col] && colHas[row][col] && boxHas[box][col]))
changed = true; //HAVE NOT SOLVED the sudoku
__syncthreads();
if (changed && gridIdx == 0)
at = 0;
d_a[gridIdx] = at;
}
|}
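
The result note for this kernel points at a bookkeeping hazard in the countdown approach: if a square's candidate mask is cleared without adjusting the section counters, the counters stay too high and seldom reach one. The following host-side C++ sketch shows one way to keep the two in step. It is an illustration under simplifying assumptions, not the kernel's code: the names <code>Square</code>, <code>Counters</code>, <code>addCandidate</code> and <code>clearCandidates</code> are hypothetical, only a single row counter is modelled, and bit <code>v</code> of the mask stands directly for value index <code>v</code>.

```cpp
#include <array>
#include <cassert>

constexpr int N = 9;  // assumption: 9x9 grid for this sketch

// Candidate values of one square, stored as bits 0..N-1 of an int
// (the "int as boolean array" idea).
struct Square {
    int candidates = 0;
};

// rowCount[v] = number of squares in the row that can still take value index v.
using Counters = std::array<int, N>;

// Marks value index `v` as a candidate and counts it for the section.
void addCandidate(Square& sq, Counters& rowCount, int v) {
    assert(v >= 0 && v < N);
    const int bit = 1 << v;
    if (!(sq.candidates & bit)) {
        sq.candidates |= bit;
        ++rowCount[v];
    }
}

// Clears every candidate of a square that has just been filled in.
// Each cleared bit also decrements its counter; assigning 0 to the mask
// on its own would leave the counters too high.
void clearCandidates(Square& sq, Counters& rowCount) {
    for (int v = 0; v < N; ++v) {
        const int bit = 1 << v;
        if (sq.candidates & bit) {
            sq.candidates ^= bit;
            --rowCount[v];
        }
    }
}
```

With this pairing, a counter that drops to one reliably means that exactly one square in the section can still take the value.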
 
Change : Uses the countdown logic with a boolean array instead of an integer bitmask
Result : Similar times to the other Countdown kernel
 
{| class="wikitable mw-collapsible mw-collapsed"
! Countdown Boolean Array (EDITED - now 4.37ms)
|-
|
__global__ void solve(int* d_a) {
// Used to remember which row | col | box ( section ) have which values
__shared__ bool rowHas[N][N];
__shared__ bool colHas[N][N];
__shared__ bool boxHas[N][N];
// Used to ensure that the table has changed
__shared__ bool changed;
// Number of spaces which can place the number in each section
__shared__ int rowCount[N][N];
__shared__ int colCount[N][N];
__shared__ int boxCount[N][N];
// Where the square is located in the Sudoku
int row = threadIdx.x;
int col = threadIdx.y;
int box = row / BOX_W + (col / BOX_W) * BOX_W;
// Square's location in the Sudoku
int gridIdx = col * N + row;
int at = d_a[gridIdx];
// Unique identifier for each square in row, col, box
// Corresponds to the generic Sudoku Solve
// Using a Sudoku to solve a Sudoku !!!
int offset = col + (row % BOX_W) * BOX_W + (box % BOX_W);
int count = 0; //Number of values which can fit in this square
bool notSeen[N]; // Per-thread boolean array: values that can still go in this square
for(int idx = 0; idx < N; ++idx)
notSeen[idx] = false;
if (gridIdx == 0)
changed = true;
rowHas[col][row] = false;
colHas[col][row] = false;
boxHas[col][row] = false;
rowCount[col][row] = 0;
colCount[col][row] = 0;
boxCount[col][row] = 0;
__syncthreads();
if (at != UNASSIGNED) {
rowHas[row][at - 1] = true;
colHas[col][at - 1] = true;
boxHas[box][at - 1] = true;
}
__syncthreads();
int guess;
for (int idx = 0; idx < N; ++idx) {
int num = (idx + offset) % N;
if (at == UNASSIGNED && !(rowHas[row][num] || boxHas[box][num] || colHas[col][num])) {
notSeen[num] = true; //this value can go here
++count; //how many values this square can have
guess = num;
//how many values this section can have
rowCount[row][num]++;
colCount[col][num]++;
boxCount[box][num]++;
}
__syncthreads();
}
if (at == UNASSIGNED && count == 0) //NOT POSSIBLE SUDOKU
changed = false;
__syncthreads();
if (count == 1) {
at = guess + 1;
count = 0;
notSeen[guess] = false;
rowHas[row][guess] = true;
colHas[col][guess] = true;
boxHas[box][guess] = true;
}
// Repeat until a full pass changes no values
while (changed) {
__syncthreads();
if (gridIdx == 0) // forget previous change
changed = false;
for (int idx = 0; idx < N; ++idx) {
// Ensures that every square in each section is working on a different number in the section
int num = (idx + offset) % N;
if (notSeen[num] &&
(at != UNASSIGNED || rowHas[row][num] || boxHas[box][num] || colHas[col][num])) {
rowCount[row][num]--;
colCount[col][num]--;
boxCount[box][num]--;
notSeen[num] = false;
--count;
}
__syncthreads();
if ( notSeen[num] &&
(count == 1 || rowCount[row][num] == 1 || boxCount[box][num] == 1 || colCount[col][num] == 1)) {
rowHas[row][num] = true;
colHas[col][num] = true;
boxHas[box][num] = true;
changed = true;
notSeen[num] = false;
at = num + 1;
count = 0;
}
}
__syncthreads();
};
if (!(rowHas[row][col] && colHas[row][col] && boxHas[box][col]))
changed = true; //HAVE NOT SOLVED the sudoku
__syncthreads();
if (changed && gridIdx == 0)
at = 0;
d_a[gridIdx] = at;
}
|}
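
Each kernel above staggers the value a thread inspects with <code>num = (idx + offset) % N</code>, and the code comments describe <code>offset</code> as unique within each row, column and box. That claim can be checked on the host. The sketch below is an illustration only: the helper names <code>boxOf</code>, <code>offsetOf</code> and <code>offsetsAreUnique</code> are hypothetical, and it assumes <code>N = BOX_W * BOX_W</code> with <code>row</code> and <code>col</code> ranging over 0..N-1 as in the kernels.

```cpp
#include <cassert>
#include <vector>

// Same expressions as in the kernels, with BOX_W passed in so that
// several grid sizes can be checked (N = boxW * boxW is assumed).
int boxOf(int row, int col, int boxW) {
    return row / boxW + (col / boxW) * boxW;
}

int offsetOf(int row, int col, int boxW) {
    return col + (row % boxW) * boxW + (boxOf(row, col, boxW) % boxW);
}

// Returns true if offset % N takes N distinct values inside every row,
// every column and every box of the grid.
bool offsetsAreUnique(int boxW) {
    assert(boxW > 0);
    const int n = boxW * boxW;
    // seen[kind][section][value]: kind 0 = row, 1 = column, 2 = box
    std::vector<std::vector<std::vector<bool>>> seen(
        3, std::vector<std::vector<bool>>(n, std::vector<bool>(n, false)));
    for (int row = 0; row < n; ++row) {
        for (int col = 0; col < n; ++col) {
            const int v = offsetOf(row, col, boxW) % n;
            const int section[3] = {row, col, boxOf(row, col, boxW)};
            for (int kind = 0; kind < 3; ++kind) {
                if (seen[kind][section[kind]][v]) return false;  // duplicate found
                seen[kind][section[kind]][v] = true;
            }
        }
    }
    return true;
}
```

Calling <code>offsetsAreUnique</code> with 3, 4 and 5 covers the 9x9, 16x16 and 25x25 grids used on this page. Because the offsets form this pattern, the threads of one section start on different values in any given iteration of the <code>idx</code> loop.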
[[File:Kernel_Compare.png]]
==== Occupancy Calculations ====
{| class="wikitable mw-collapsible mw-collapsed"
! For 9x9:
|-
|
[[File:Occupancy_9x9.png]]
|}
{| class="wikitable mw-collapsible mw-collapsed"
! For 16x16:
|-
|
[[File:Occupancy_16x16.png]]
|}
{| class="wikitable mw-collapsible mw-collapsed"
! For 25x25:
|-
|
[[File:Occupancy_25x25.png]]
|}
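
The inputs to an occupancy calculation follow from the kernel source. The sketch below is a rough host-side estimate under stated assumptions: the kernel is launched as a single block of N x N threads (one thread per square, as the <code>threadIdx.x</code> / <code>threadIdx.y</code> indexing suggests), a warp holds 32 threads, <code>bool</code> takes 1 byte and <code>int</code> takes 4 bytes, and alignment padding is ignored. The names <code>BlockResources</code> and <code>estimateResources</code> are hypothetical. It tallies the static shared memory declared by the countdown kernels: three <code>bool[N][N]</code> arrays, three <code>int[N][N]</code> arrays and the <code>changed</code> flag.

```cpp
#include <cassert>
#include <cstddef>

struct BlockResources {
    int threadsPerBlock;       // one thread per square
    int warpsPerBlock;         // threads rounded up to whole warps
    std::size_t sharedBytes;   // static shared memory, padding ignored
};

BlockResources estimateResources(int n) {
    assert(n > 0);
    constexpr int kWarpSize = 32;           // assumption: 32 threads per warp
    constexpr std::size_t kBoolBytes = 1;   // assumption: size of bool on the device
    constexpr std::size_t kIntBytes = 4;    // assumption: size of int on the device
    const std::size_t cells = static_cast<std::size_t>(n) * n;

    BlockResources r;
    r.threadsPerBlock = n * n;
    r.warpsPerBlock = (r.threadsPerBlock + kWarpSize - 1) / kWarpSize;
    r.sharedBytes = 3 * cells * kBoolBytes   // rowHas, colHas, boxHas
                  + 3 * cells * kIntBytes    // rowCount, colCount, boxCount
                  + kBoolBytes;              // changed
    return r;
}
```

Under these assumptions a 25x25 grid needs 625 threads (20 warps) and about 9.4 KB of shared memory per block. Register use and the limits of the target device are not modelled here, so the figures are only inputs to the calculation, not occupancy results.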
