= GPU610/TeamLean =

[mailto:aadamico@myseneca.ca,btulchinsky@myseneca.ca?subject=gpu610 Email All]
== Progress ==

=== Assignment 1 ===
'''For the first assignment we each profiled an open source library: Lame, an audio compression utility, and Squish, an image compression utility.'''
 
'''Barry - results for Squish'''
I looked at an image compression library called squish.
 
There are several possibilities for compression, depending on the number of adjacent colours and their similarity to one another. In other words, the more common the colours, the better the compression, as the library tries to fit the RGB values into a smaller vector object.
 
Below are my findings for the library:
 
There were several test files included, one of which tested PNG file compression. However, Linux didn't have a required library, so I was only able to profile simple colour compression. If I get the PNG compression test to work in the future, I will attempt to profile it and discuss with my team member whether he would like to pursue it. Nonetheless, I believe this was sufficient to profile the area of the code that could potentially benefit from parallelization.
 
Looking at the 3 profiles (squishtest.select1.flt, squishtest.select2.flt, squishtest.select3.flt), it seems that compression and decompression of one colour is extremely fast (the select1 and select2 files). The problem comes when there are 2 (or more) colours involved (the select3 file). I also noticed that the FloatTo565 function is called the most out of all the functions. Looking into that function, however, it's a method that compacts the RGB values into a single value using bitwise operations. There are only 3 of them, and bitwise operations are quite fast, so I don't think we should focus on it even though it's called many times.
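For context, 565 packing squeezes a colour into 16 bits with shifts and ORs, roughly like this (a sketch of the idea, not the exact squish source; the rounding here is simplified):

<pre>
// pack an RGB colour (components in [0,1]) into a 16-bit 5:6:5 value
int FloatTo565Sketch( float r, float g, float b )
{
    int ri = (int)( r * 31.0f + 0.5f ); // red:   5 bits
    int gi = (int)( g * 63.0f + 0.5f ); // green: 6 bits
    int bi = (int)( b * 31.0f + 0.5f ); // blue:  5 bits

    // a handful of shifts and ORs - very cheap per call
    return ( ri << 11 ) | ( gi << 5 ) | bi;
}
</pre>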
 
I believe the best places to offload the workload onto the GPU would be the Compress3 and Compress4 functions primarily, as well as the ComputeWeightedCovariance function. The profile only shows timing for Compress4 and not for Compress3, but I presume Compress3 may be called many times as well, depending on the compression format.
 
There is also a function that orders the vectors of colours, but it can't be parallelized because each iteration depends on the result of the previous one.
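To illustrate the kind of dependency involved (a generic running-total loop of the same shape, not the squish source):

<pre>
// each iteration reads the previous iteration's result, so a naive
// one-thread-per-iteration mapping onto the GPU does not work
void RunningTotals( const float* values, float* running, int n )
{
    running[0] = values[0];
    for ( int i = 1; i < n; ++i )
        running[i] = running[i - 1] + values[i]; // loop-carried dependency
}
</pre>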
 
=== squishtest.select1.flt ===
 
<pre>
Flat profile:

Each sample counts as 0.01 seconds.
 no time accumulated

  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 0.00      0.00     0.00     3366     0.00     0.00  squish::FloatTo565(squish::Vec3 const&)
 0.00      0.00     0.00     2000     0.00     0.00  squish::SingleColourFit::ComputeEndPoints(squish::SingleColourLookup const* const*)
 0.00      0.00     0.00     2000     0.00     0.00  squish::FixFlags(int)
 0.00      0.00     0.00     2000     0.00     0.00  squish::Unpack565(unsigned char const*, unsigned char*)
 0.00      0.00     0.00     1683     0.00     0.00  squish::WriteColourBlock(int, int, unsigned char*, void*)
 0.00      0.00     0.00     1683     0.00     0.00  squish::ColourSet::RemapIndices(unsigned char const*, unsigned char*) const
 0.00      0.00     0.00     1000     0.00     0.00  GetColourError(unsigned char const*, unsigned char const*)
 0.00      0.00     0.00     1000     0.00     0.00  squish::Decompress(unsigned char*, void const*, int)
 0.00      0.00     0.00     1000     0.00     0.00  squish::CompressMasked(unsigned char const*, int, void*, int)
 0.00      0.00     0.00     1000     0.00     0.00  squish::SingleColourFit::Compress3(void*)
 0.00      0.00     0.00     1000     0.00     0.00  squish::SingleColourFit::Compress4(void*)
 0.00      0.00     0.00     1000     0.00     0.00  squish::SingleColourFit::SingleColourFit(squish::ColourSet const*, int)
 0.00      0.00     0.00     1000     0.00     0.00  squish::DecompressColour(unsigned char*, void const*, bool)
 0.00      0.00     0.00     1000     0.00     0.00  squish::WriteColourBlock3(squish::Vec3 const&, squish::Vec3 const&, unsigned char const*, void*)
 0.00      0.00     0.00     1000     0.00     0.00  squish::Compress(unsigned char const*, void*, int)
 0.00      0.00     0.00     1000     0.00     0.00  squish::ColourFit::Compress(void*)
 0.00      0.00     0.00     1000     0.00     0.00  squish::ColourFit::ColourFit(squish::ColourSet const*, int)
 0.00      0.00     0.00     1000     0.00     0.00  squish::ColourSet::ColourSet(unsigned char const*, int, int)
 0.00      0.00     0.00      683     0.00     0.00  squish::WriteColourBlock4(squish::Vec3 const&, squish::Vec3 const&, unsigned char const*, void*)
 0.00      0.00     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z14GetColourErrorPKhS0_
</pre>
 
=== squishtest.select2.flt ===
 
<pre>
Flat profile:

Each sample counts as 0.01 seconds.
 no time accumulated

  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 0.00      0.00     0.00     2104     0.00     0.00  squish::FloatTo565(squish::Vec3 const&)
 0.00      0.00     0.00     1530     0.00     0.00  squish::SingleColourFit::ComputeEndPoints(squish::SingleColourLookup const* const*)
 0.00      0.00     0.00     1530     0.00     0.00  squish::FixFlags(int)
 0.00      0.00     0.00     1530     0.00     0.00  squish::Unpack565(unsigned char const*, unsigned char*)
 0.00      0.00     0.00     1052     0.00     0.00  squish::WriteColourBlock(int, int, unsigned char*, void*)
 0.00      0.00     0.00     1052     0.00     0.00  squish::ColourSet::RemapIndices(unsigned char const*, unsigned char*) const
 0.00      0.00     0.00      765     0.00     0.00  GetColourError(unsigned char const*, unsigned char const*)
 0.00      0.00     0.00      765     0.00     0.00  squish::Decompress(unsigned char*, void const*, int)
 0.00      0.00     0.00      765     0.00     0.00  squish::CompressMasked(unsigned char const*, int, void*, int)
 0.00      0.00     0.00      765     0.00     0.00  squish::SingleColourFit::Compress3(void*)
 0.00      0.00     0.00      765     0.00     0.00  squish::SingleColourFit::Compress4(void*)
 0.00      0.00     0.00      765     0.00     0.00  squish::SingleColourFit::SingleColourFit(squish::ColourSet const*, int)
 0.00      0.00     0.00      765     0.00     0.00  squish::DecompressColour(unsigned char*, void const*, bool)
 0.00      0.00     0.00      765     0.00     0.00  squish::WriteColourBlock3(squish::Vec3 const&, squish::Vec3 const&, unsigned char const*, void*)
 0.00      0.00     0.00      765     0.00     0.00  squish::Compress(unsigned char const*, void*, int)
 0.00      0.00     0.00      765     0.00     0.00  squish::ColourFit::Compress(void*)
 0.00      0.00     0.00      765     0.00     0.00  squish::ColourFit::ColourFit(squish::ColourSet const*, int)
 0.00      0.00     0.00      765     0.00     0.00  squish::ColourSet::ColourSet(unsigned char const*, int, int)
 0.00      0.00     0.00      287     0.00     0.00  squish::WriteColourBlock4(squish::Vec3 const&, squish::Vec3 const&, unsigned char const*, void*)
 0.00      0.00     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z14GetColourErrorPKhS0_
</pre>
 
=== squishtest.select3.flt ===
 
<pre>
Flat profile:

Each sample counts as 0.01 seconds.

  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
26.09      0.06     0.06    97155     0.62     0.84  squish::ClusterFit::Compress4(void*)
21.74      0.11     0.05   292548     0.17     0.17  squish::FloatTo565(squish::Vec3 const&)
21.74      0.16     0.05    97155     0.51     1.01  squish::ClusterFit::Compress3(void*)
 4.35      0.17     0.01   194310     0.05     0.05  squish::ClusterFit::ConstructOrdering(squish::Vec3 const&, int)
 4.35      0.18     0.01   194310     0.05     0.05  squish::Unpack565(unsigned char const*, unsigned char*)
 4.35      0.19     0.01    97155     0.10     0.21  squish::DecompressColour(unsigned char*, void const*, bool)
 4.35      0.20     0.01    97155     0.10     0.44  squish::WriteColourBlock3(squish::Vec3 const&, squish::Vec3 const&, unsigned char const*, void*)
 4.35      0.21     0.01    97155     0.10     0.10  squish::ComputeWeightedCovariance(int, squish::Vec3 const*, float const*)
 4.35      0.22     0.01    97155     0.10     0.10  squish::ColourSet::ColourSet(unsigned char const*, int, int)
 4.35      0.23     0.01                            TestTwoColour(int)
 0.00      0.23     0.00   194310     0.00     0.00  squish::FixFlags(int)
 0.00      0.23     0.00   146274     0.00     0.00  squish::WriteColourBlock(int, int, unsigned char*, void*)
 0.00      0.23     0.00   146274     0.00     0.00  squish::ColourSet::RemapIndices(unsigned char const*, unsigned char*) const
 0.00      0.23     0.00    97155     0.00     0.00  GetColourError(unsigned char const*, unsigned char const*)
 0.00      0.23     0.00    97155     0.00     0.10  squish::ClusterFit::ClusterFit(squish::ColourSet const*, int)
 0.00      0.23     0.00    97155     0.00     0.21  squish::Decompress(unsigned char*, void const*, int)
 0.00      0.23     0.00    97155     0.00     2.06  squish::CompressMasked(unsigned char const*, int, void*, int)
 0.00      0.23     0.00    97155     0.00     0.00  squish::ComputePrincipleComponent(squish::Sym3x3 const&)
 0.00      0.23     0.00    97155     0.00     2.06  squish::Compress(unsigned char const*, void*, int)
 0.00      0.23     0.00    97155     0.00     1.01  squish::ColourFit::Compress(void*)
 0.00      0.23     0.00    97155     0.00     0.00  squish::ColourFit::ColourFit(squish::ColourSet const*, int)
 0.00      0.23     0.00    97155     0.00     0.00  squish::GetMultiplicity1Evector(squish::Sym3x3 const&, float)
 0.00      0.23     0.00    49119     0.00     0.34  squish::WriteColourBlock4(squish::Vec3 const&, squish::Vec3 const&, unsigned char const*, void*)
 0.00      0.23     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z14GetColourErrorPKhS0_
</pre>
 
== Assignment 2 ==
 
We ran into a lot of difficulty with this assignment, from the beginning right up to the time of submission.
 
We decided to work on the squish project. However, the first problem we ran into was that we didn't know how to convert the Linux makefile to work on Windows with nvcc. We finally devised a workaround: we edited the header includes to point to squish.h in the current directory, so we wouldn't require dynamic linking and could compile everything in one line. Unfortunately, this took us several days.
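The edit itself amounted to a one-line change per source file (illustrative):

<pre>
// before: the includes expected an installed copy of the library
// #include <squish.h>

// after: point at the header in the current directory so a single
// nvcc invocation can build everything without dynamic linking
#include "squish.h"
</pre>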
 
The next issue was finding all the various class methods that needed to be available on both the device and the host and marking them with the __device__ __host__ qualifiers. This was a lot more difficult than we first anticipated: some classes had different versions depending on how the library is configured, and some methods used other (or nested) methods, which made them hard to follow and debug, so it took us quite a bit of time.
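As an illustration, annotating a small class so its methods are callable from both sides looks like this (a sketch in the style of squish's Vec3, not the exact source):

<pre>
class Vec3
{
public:
    // the qualifiers make these callable from host and device code alike
    __host__ __device__ Vec3( float x, float y, float z )
        : m_x( x ), m_y( y ), m_z( z ) {}

    __host__ __device__ float X() const { return m_x; }
    __host__ __device__ float Y() const { return m_y; }
    __host__ __device__ float Z() const { return m_z; }

private:
    float m_x, m_y, m_z;
};
</pre>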
 
In addition, this brought further difficulty because standard library functions such as min and max were used. We looked into using Thrust but decided, for simplicity, to work around the problem with conditional expressions instead.
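For example, a conditional expression compiles for both host and device where std::min was not available to device code in our setup (the helper name is ours):

<pre>
// conditional workaround for std::min, usable in host and device code
__host__ __device__ inline float MinF( float a, float b )
{
    return a < b ? a : b;
}

// usage: float m = MinF( x, 1.0f ); // instead of std::min( x, 1.0f )
</pre>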
 
The major problem was debugging the maths.cu code. On our first successful compilation with the kernel code, we noticed that there was an error somewhere, as our third and most important test, mixing 2 colours, didn't produce any results. We initially assumed it was a kernel error. But when we compiled a maths.cu file made from the original maths.cpp code (we copied the original maths.cpp and changed the extension to .cu), it produced the same results. This made it virtually impossible to debug, even with the visual profiler, as it only tells us that there was a non-zero (or 1) return from main.
 
It's for this reason that we were unable to produce any visual chart or comparison with the original code, as we still need to debug the code. However, according to the profiler, it appears that the kernel code is fine and the error occurs after its execution.
 
During our search for answers we noticed that CUDA uses squish for texture operations (https://developer.nvidia.com/gpu-accelerated-texture-compression). We will try to get in touch with Simon Brown, the creator of squish, to see if he can point us in the right direction, as the library can clearly be parallelized if it is used in some CUDA functionality.
 
== Assignment 3 ==
 
This project had a lot of potential at first, but both Alex and I found it very disappointing and frustrating, as we couldn't manage to get it to run with much success.
Because we didn't get it working, we decided to write about the theory behind our intended optimizations after explaining what we've done.
 
As we originally mentioned in our second assignment findings, we thought that our kernel code was fine and that we had issues elsewhere. However, that was not the case.
 
It turns out that there was an error in our kernel, and we did notice some logical errors that we had overlooked. We were able to get some profiling through the visual profiler,
but the code crashes, so the profile is incomplete; we were unable to get the code to work for our third and most important case, and we couldn't figure out why.
Since we couldn't get it working with the two sources we mentioned, we decided to focus on one, the ComputeWeightedCovariance method and its kernel, as seen below:
 
=== KERNEL: ===
<pre>
__global__ void kernelWeightedCovariance( float* weights, Vec3* points,
                                          float* total, Vec3* cudaCentroid, int n )
{
    // shared memory to reduce global memory latency
    __shared__ float sharedTotal;
    __shared__ Vec3 sharedCentroid;

    int idx = blockDim.x * blockIdx.x + threadIdx.x;

    // zero the global accumulators once
    if ( idx == 0 )
    {
        *total = 0.0f;
        *cudaCentroid = Vec3( 0.0f );
    }

    // zero the per-block accumulators once per block
    if ( threadIdx.x == 0 )
    {
        sharedCentroid = Vec3( 0.0f );
        sharedTotal = 0.0f;
    }
    __syncthreads();

    if ( idx < n )
    {
        // NOTE: every thread in the block performs an unsynchronized
        // read-modify-write on sharedTotal (a data race), and
        // sharedCentroid is overwritten rather than accumulated -
        // logical errors of the kind described above
        sharedTotal += weights[idx];
        sharedCentroid = weights[idx] * points[idx];
    }
    __syncthreads();

    // copy the per-block results to global memory; blocks also race
    // against one another here, since += on global memory is not atomic
    if ( threadIdx.x == 0 )
    {
        *total += sharedTotal;
        *cudaCentroid += sharedCentroid;
    }
}
</pre>
=== METHOD CONTAINING KERNEL CALL: ===
 
<pre>
Sym3x3 ComputeWeightedCovariance( int n, Vec3 const* points, const float* weights )
{
    // compute the centroid
    float total = 0.0f;
    Vec3 centroid( 0.0f );

    cudaError_t error;
    bool cudaContinue = true;

    // device memory addresses
    float* cudaWeights;
    Vec3* cudaPoints;
    float* cudaTotal;
    Vec3* cudaCentroid;

    // calculate the number of blocks
    int nblocks = ( n + ntpb - 1 ) / ntpb;

    // allocate device memory
    if ( cudaContinue && ( error = cudaMalloc( (void**)&cudaWeights, n * sizeof( float ) ) ) != cudaSuccess )
    {
        cout << "unable to create device memory for cudaWeights: " << cudaGetErrorString( error ) << endl;
        cudaContinue = false;
    }
    if ( cudaContinue && ( error = cudaMalloc( (void**)&cudaPoints, n * sizeof( Vec3 ) ) ) != cudaSuccess )
    {
        cout << "unable to create device memory for cudaPoints: " << cudaGetErrorString( error ) << endl;
        cudaContinue = false;
    }
    if ( cudaContinue && ( error = cudaMalloc( (void**)&cudaTotal, sizeof( float ) ) ) != cudaSuccess )
    {
        cout << "unable to create device memory for cudaTotal: " << cudaGetErrorString( error ) << endl;
        cudaContinue = false;
    }
    if ( cudaContinue && ( error = cudaMalloc( (void**)&cudaCentroid, sizeof( Vec3 ) ) ) != cudaSuccess )
    {
        cout << "unable to create device memory for cudaCentroid: " << cudaGetErrorString( error ) << endl;
        cudaContinue = false;
    }

    // copy the weights and points to the device
    if ( ( error = cudaMemcpy( cudaWeights, weights, n * sizeof( float ), cudaMemcpyHostToDevice ) ) != cudaSuccess )
        cout << "failed to copy weights to device: " << cudaGetErrorString( error ) << " " << weights << endl;
    if ( ( error = cudaMemcpy( cudaPoints, points, n * sizeof( Vec3 ), cudaMemcpyHostToDevice ) ) != cudaSuccess )
        cout << "failed to copy points to device: " << cudaGetErrorString( error ) << " " << points << endl;

    cout << "GOING INTO KERNEL" << endl; // printf-style debug trace

    // offload the accumulation loop to the GPU
    kernelWeightedCovariance<<<nblocks, ntpb>>>( cudaWeights, cudaPoints, cudaTotal, cudaCentroid, n );

    // ensure synchronization
    cudaDeviceSynchronize();

    // copy the results back to the host
    if ( ( error = cudaMemcpy( &total, cudaTotal, sizeof( float ), cudaMemcpyDeviceToHost ) ) != cudaSuccess )
        cout << "failed to copy total from device: " << cudaGetErrorString( error ) << endl;
    if ( ( error = cudaMemcpy( &centroid, cudaCentroid, sizeof( Vec3 ), cudaMemcpyDeviceToHost ) ) != cudaSuccess )
        cout << "failed to copy centroid from device: " << cudaGetErrorString( error ) << endl;

    // NOTE: this serial loop re-accumulates on top of the values just
    // copied back from the device, so total and centroid would be
    // double-counted once the kernel produces correct results
    for ( int i = 0; i < n; ++i )
    {
        total += weights[i];
        centroid += weights[i] * points[i];
    }
    centroid /= total;

    // accumulate the covariance matrix (still on the host)
    Sym3x3 covariance( 0.0f );
    for ( int i = 0; i < n; ++i )
    {
        Vec3 a = points[i] - centroid;
        Vec3 b = weights[i] * a;
        covariance[0] += a.X() * b.X();
        covariance[1] += a.X() * b.Y();
        covariance[2] += a.X() * b.Z();
        covariance[3] += a.Y() * b.Y();
        covariance[4] += a.Y() * b.Z();
        covariance[5] += a.Z() * b.Z();
    }

    // release device memory
    cudaFree( cudaTotal );
    cudaFree( cudaCentroid );
    cudaFree( cudaWeights );
    cudaFree( cudaPoints );
    cudaDeviceReset();

    cout << "something outta nothing\n"; // printf-style debug trace

    // return it
    return covariance;
}
</pre>
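For reference, here is a minimal sketch of how the racy accumulation in the kernel above could be made safe with atomic adds. This is under assumptions: compute capability 2.0+ (needed for float atomicAdd), the caller zeroes total, cx, cy and cz before launch, and Vec3's operators carry the __device__ qualifiers we added in Assignment 2; the kernel name and the cx/cy/cz split are ours, since atomicAdd works on plain floats rather than on Vec3:

<pre>
__global__ void kernelWeightedCovarianceAtomic( const float* weights, const Vec3* points,
                                                float* total, float* cx, float* cy,
                                                float* cz, int n )
{
    // per-block partial sums
    __shared__ float sharedTotal, sharedX, sharedY, sharedZ;

    int idx = blockDim.x * blockIdx.x + threadIdx.x;

    if ( threadIdx.x == 0 )
    {
        sharedTotal = 0.0f;
        sharedX = sharedY = sharedZ = 0.0f;
    }
    __syncthreads();

    if ( idx < n )
    {
        Vec3 wp = weights[idx] * points[idx];
        // atomic adds serialize conflicting writes instead of racing
        atomicAdd( &sharedTotal, weights[idx] );
        atomicAdd( &sharedX, wp.X() );
        atomicAdd( &sharedY, wp.Y() );
        atomicAdd( &sharedZ, wp.Z() );
    }
    __syncthreads();

    // one atomic per block combines the partial results across blocks
    if ( threadIdx.x == 0 )
    {
        atomicAdd( total, sharedTotal );
        atomicAdd( cx, sharedX );
        atomicAdd( cy, sharedY );
        atomicAdd( cz, sharedZ );
    }
}
</pre>

A tree reduction, sketched in the Optimization section below, would avoid most of these atomics entirely.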
 
We added some error checking, thinking the error was in memory allocation or a wrong computation, but that was not the case. We also tried to get some debuggers to work, but
since we couldn't, we relied on printf-style debugging, which unfortunately is only useful OUTSIDE the kernel. However, looking at the naive code, it appears correct.
 
What we were able to profile for the test case shows that it spends less than 2 microseconds copying to the device and less than 3 microseconds in the kernel. This doesn't
say much, as we're dealing with only 2 random colours, but the amount of computation depends on the file size.
 
 
=== OPTIMIZATION ===
 
As for how we planned to optimize this code, we identified several options.
 
1) As we learned in class, we can use a parallel reduction algorithm and store each block's result separately instead of having everything flushed into one global location. This
will reduce the number of conflicting operations in the block and potentially even lead to a reduction in the threads required (a sketch follows this list).
 
2) Since we know that the number of computations depends on the size of the (image) file, we could tune the number of threads per block and the number of blocks
to the file size and compute capability, in order to reduce overhead and avoid extra fragmentation of threads (i.e. threads in the last block that are not needed to complete the
computation).
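For example, option 1 could look roughly like this, with each block writing its own partial sum (a sketch under our assumptions: ntpb is a compile-time constant and a power of two, blockTotals holds at least nblocks floats, and the kernel and buffer names are ours):

<pre>
__global__ void kernelReduceWeights( const float* weights, float* blockTotals, int n )
{
    __shared__ float cache[ntpb]; // one slot per thread in the block

    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    cache[threadIdx.x] = ( idx < n ) ? weights[idx] : 0.0f;
    __syncthreads();

    // tree reduction: halve the number of active threads each step
    for ( int stride = blockDim.x / 2; stride > 0; stride >>= 1 )
    {
        if ( threadIdx.x < stride )
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }

    // each block stores its own result - no race on a single global location
    if ( threadIdx.x == 0 )
        blockTotals[blockIdx.x] = cache[0];
}
</pre>

On the host, a short loop (or a second kernel launch over blockTotals) finishes the sum; with one write per block there is no contention on a single global accumulator.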
 
 
=== OTHER NOTES ===
 
 
TO RUN ON WINDOWS:
 
nvcc alpha.cpp clusterfit.cpp colourblock.cpp colourfit.cpp colourset.cpp maths.cu rangefit.cpp singlecolourfit.cpp squish.cpp squishtest.cu
 
*Note: we excluded clusterfit.cu from the build command because we didn't concentrate on it for this iteration of the assignment.