Difference between revisions of "GPU610/TeamEh"
(Fixed headers) |
|||
Line 170: | Line 170: | ||
=== Assignment 2 === | === Assignment 2 === | ||
+ | We chose to parallelize the image processing program. Image processing is easy to parallelize but this project was a challenge due to the number of functions the program has. | ||
+ | |||
+ | ==== Benchmarking ==== | ||
+ | Intel Core i7 2600K (standard clock) | ||
+ | NVidia GeForce GTX 750 Ti | ||
+ | |||
+ | Two images were used to test each image function, a large one and a small one. | ||
+ | |||
+ | The small sample test image: | ||
+ | |||
+ | [[File:GPU610_2014_1_Team_Eh_sample.jpg]] | ||
+ | |||
+ | The sample after being processed by the canny filter: | ||
+ | |||
+ | [[File:GPU610_2014_1_Team_Eh_canny.jpg]] | ||
+ | |||
+ | ==== Results ==== | ||
+ | |||
+ | [[File:GPU610_2014_1_Team_Eh_chart.png]] | ||
+ | |||
+ | This chart compares the run times for the original and parallelized image processing functions. We saw dramatic improvements to image filtering performance. All functions are down to constant time with respect to image dimensions, down from O(n^2). | ||
+ | |||
+ | ==== Problems Encountered ==== | ||
+ | Many of the operations were composed of several different kernels and other operations. To avoid repeated copies to and from the device we wrapped each kernel in a function that took device pointers. That way images could be loaded once and passed through multiple filters without returning them to the host. | ||
+ | |||
+ | Several of the operations scan a pixels neighbors to determine the pixels value. This creates a problem when a pixel is near the edge of an image. To solve this problem we interpenetrated the image not as a flat surface but as a torus. Anytime a thread would access an off image pixel it would wrap around and use a pixel from the opposite side of the image. | ||
+ | |||
=== Assignment 3 === | === Assignment 3 === |
Revision as of 22:34, 30 October 2014
GPU610/DPS915 | Student List | Group and Project Index | Student Resources | Glossary
Contents
Team Eh
Team Members
- Benjamin Snively, Some responsibility
- Brad Hoover, Some other responsibility
- Balint Czunyi, Some other responsibility
- ...
Progress
Assignment 1
Benjamin Snively's Results
Introduction
This image processing program was found on github. It processes and manipulates images using convolutions matrices (kernels). It has several different functions including aligning and sharpening images.
To convolve an image the kernel is applied to each pixel. Using the kernel, the pixel's value is combined with that of its neighbors to create a new pixel value. This program implements the filter using two loops to loop over each pixel in sequence. For a given an image convolution is an O(rows x columns) function. As blurring operation on each pixel is independent of the others, therefore it is a perfect candidate for parallelization.
To profile the application, I created a large bitmap file (about 800 x 800, 2MB) and ran it through three different operations. To conserve space, I have not included a profile of all of the available operations.
Gassian Blur
Command: --gassian 5
Each sample counts as 0.01 seconds.
% cumulative self self total time seconds seconds calls s/call s/call name 44.81 88.57 88.57 _mcount_private 31.92 151.66 63.09 __fentry__ 4.85 161.25 9.59 1 9.59 45.84 Gauss_filter::smooth_ord(Matrix<std::tuple<unsigned int, unsigned int, unsigned int> >&) 1.92 165.05 3.80 633231640 0.00 0.00 Matrix<std::tuple<unsigned int, unsigned int, unsigned int> >::operator()(unsigned int, unsigned int) 1.43 167.88 2.83 633887736 0.00 0.00 std::__shared_ptr<std::tuple<unsigned int, unsigned int, unsigned int>, (__gnu_cxx::_Lock_policy)2>::get() const 1.37 170.58 2.70 630508256 0.00 0.00 std::_Tuple_impl<0ul, int&, int&, int&>& std::_Tuple_impl<0ul, int&, int&, int&>::operator=<unsigned int, unsigned int, unsigned int>(std::_Tuple_impl<0ul, unsigned int, unsigned int, unsigned int> const&) 0.92 172.40 1.82 630508256 0.00 0.00 std::_Head_base<0ul, int&, false>::_Head_base(int&) 0.87 174.12 1.72 630508256 0.00 0.00 std::_Tuple_impl<2ul, int&>& std::_Tuple_impl<2ul, int&>::operator=<unsigned int>(std::_Tuple_impl<2ul, unsigned int> const&) 0.86 175.81 1.69 630508256 0.00 0.00 std::_Head_base<1ul, int&, false>::_Head_base(int&) 0.84 177.47 1.66 630508256 0.00 0.00 std::_Tuple_impl<1ul, int&, int&>& std::_Tuple_impl<1ul, int&, int&>::operator=<unsigned int, unsigned int>(std::_Tuple_impl<1ul, unsigned int, unsigned int> const&) 0.78 179.02 1.55 630508256 0.00 0.00 std::_Head_base<2ul, int&, false>::_Head_base(int&) 0.77 180.54 1.52 630508256 0.00 0.00 std::tuple<int&, int&, int&> std::tie<int, int, int>(int&, int&, int&) 0.74 182.00 1.46 630508256 0.00 0.00 std::_Tuple_impl<2ul, int&>::_Tuple_impl(int&) 0.65 183.28 1.28 630508256 0.00 0.00 std::tuple<int&, int&, int&>& std::tuple<int&, int&, int&>::operator=<unsigned int, unsigned int, unsigned int, void>(std::tuple<unsigned int, unsigned int, unsigned int> const&) 0.57 184.41 1.13 630508256 0.00 0.00 std::tuple<int&, int&, int&>::tuple(int&, int&, int&) 0.55 185.50 1.09 630508256 0.00 0.00 std::_Head_base<0ul, int&, false>::_M_head(std::_Head_base<0ul, int&, false>&) 0.52 186.53 1.03 630508256 0.00 0.00 std::_Tuple_impl<0ul, int&, int&, int&>::_Tuple_impl(int&, int&, int&)
Sharpen
Command: --unsharp
Each sample counts as 0.01 seconds.
% cumulative self self total time seconds seconds calls ms/call ms/call name 44.44 0.96 0.96 _mcount_private 27.31 1.55 0.59 __fentry__ 7.41 1.71 0.16 1 160.00 458.44 unsharp(Matrix<std::tuple<unsigned int, unsigned int, unsigned int> >) 5.56 1.83 0.12 20345464 0.00 0.00 Matrix<std::tuple<unsigned int, unsigned int, unsigned int> >::operator()(unsigned int, unsigned int) 1.39 1.86 0.03 21001560 0.00 0.00 std::__shared_ptr<std::tuple<unsigned int, unsigned int, unsigned int>, (__gnu_cxx::_Lock_policy)2>::get() const 1.39 1.89 0.03 7876396 0.00 0.00 std::_Tuple_impl<0ul, unsigned int, unsigned int, unsigned int>::_M_head(std::_Tuple_impl<0ul, unsigned int, unsigned int, unsigned int>&) 1.39 1.92 0.03 656096 0.00 0.00 std::_Tuple_impl<2ul, unsigned int>& std::_Tuple_impl<2ul, unsigned int>::operator=<unsigned char>(std::_Tuple_impl<2ul, unsigned char>&&) 0.93 1.94 0.02 7876396 0.00 0.00 std::_Tuple_impl<1ul, unsigned int, unsigned int>::_M_head(std::_Tuple_impl<1ul, unsigned int, unsigned int>&) 0.93 1.96 0.02 1968288 0.00 0.00 unsigned char&& std::forward<unsigned char>(std::remove_reference<unsigned char>::type&) 0.93 1.98 0.02 656096 0.00 0.00 std::_Head_base<0ul, unsigned char, false>::_M_head(std::_Head_base<0ul, unsigned char, false>&)
Identity
command: --custom '0,0,0,0,1,0,0,0,0'
Each sample counts as 0.01 seconds.
% cumulative self self total time seconds seconds calls ms/call ms/call name 53.61 1.71 1.71 _mcount_private 28.21 2.61 0.90 __fentry__ 4.39 2.75 0.14 2 70.00 218.45 Use_kernel::new_im() 1.88 2.81 0.06 8542240 0.00 0.00 Matrix<std::tuple<unsigned int, unsigned int, unsigned int> >::operator()(unsigned int, unsigned int) 0.94 2.84 0.03 5904864 0.00 0.00 std::_Tuple_impl<0ul, int&, int&, int&>::_Tuple_impl(int&, int&, int&) 0.63 2.86 0.02 13126802 0.00 0.00 __gnu_cxx::__enable_if<std::__is_integer<int>::__value, double>::__type std::floor<int>(int) 0.63 2.88 0.02 9841440 0.00 0.00 double& std::forward<double&>(std::remove_reference<double&>::type&) 0.63 2.90 0.02 7223552 0.00 0.00 std::_Head_base<0ul, unsigned int, false>::_M_head(std::_Head_base<0ul, unsigned int, false> const&) 0.63 2.92 0.02 5904864 0.00 0.00 std::_Tuple_impl<1ul, int&, int&>& std::_Tuple_impl<1ul, int&, int&>::operator=<unsigned int, unsigned int>(std::_Tuple_impl<1ul, unsigned int, unsigned int> const&) 0.63 2.94 0.02 5904864 0.00 0.00 std::_Tuple_impl<2ul, int&>::_Tuple_impl(int&)
Summary
The functions that perform the filtering are Gauss_filter::smooth_ord
, unsharp
and Use_kernel::new_im()
. These functions are all O(r x c) with respect to image dimensions and thus where the biggest gains from parallelization will be found.
Bradly Hoovers Results
Introduction
The SHA-1 algorithm used in this project was implemented by Paul E. Jones[1]. It is a C++ version of the algorithm. The recursive permutation algorithm was taken from a user submission[2] on stack exchange.
[1]http://www.packetizer.com/security/sha1/
[2] http://codereview.stackexchange.com/questions/38474/brute-force-algorithm-in-c
After some tweaking to integrate the two to work in conjunction, I ran the program using the Upper and lower case alphabet for the permutation, with a length of 5, as input. The length and character set are hard coded.
Length of 5, upper and lowercase
Command: ./brutis
Each sample counts as 0.01 seconds.
Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 83.38 109.96 109.96 387659012 0.00 0.00 SHA1::ProcessMessageBlock() 8.77 121.52 11.56 387659012 0.00 0.00 SHA1::PadMessage() 4.01 126.81 5.28 1930693908 0.00 0.00 SHA1::Input(unsigned char const*, unsigned int) 1.48 128.76 1.96 387659012 0.00 0.00 SHA1::operator<<(char const*) 1.19 130.33 1.57 387659012 0.00 0.00 SHA1::Result(unsigned int*) 0.73 131.29 0.96 387659012 0.00 0.00 SHA1::Reset() 0.36 131.76 0.47 5 0.09 26.35 SHA1::check(char*, int, int, int, char const*) 0.05 131.83 0.07 SHA1::~SHA1() 0.03 131.88 0.05 SHA1::SHA1() 0.03 131.92 0.05 SHA1::operator<<(unsigned char) 0.02 131.94 0.02 SHA1::Input(char) 0.00 131.94 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN4SHA1C2Ev 0.00 131.94 0.00 1 0.00 0.00 _GLOBAL__sub_I_main
Summary
The total runtime for this test was approximately 132 seconds. Using Amdahl's law to calculate the speed up I would obtain on my laptop, the equation is:
S384 = 1 / ((1 - 0.8338) + (0.8338/384)) = 5.9393
The maximum expected speed is 5.9393. My laptop’s 650M GPU only has 384 cores. This is not a significant increase in speed.
Using my desktop’s GTX780 has 2304 core. Using my desktop’s gpu the resulting speed up would be:
S2304 = 1 / ((1 - 0.8338) + (0.8338/2304)) = 6.004
After observing these results, and further analysis of the algorithm, I have found that the SHA-1 algorithm is a sequential algorithm not entirely suitable for parallelisation.
Due to this, I choose Ben's image processing for parallelisation.
Balint Czunyi's Results
Introduction
My Project used a Heat Equation calculator.
Source: https://github.com/MyCodes/Heat-Equation
After some changes to the Makefile to work with the profiler and testing several different parameters for the calculations I have come up with the following results:
Explicit from -15 to +15
Each sample counts as 0.01 seconds.
% cumulative self self total time seconds seconds calls ms/call ms/call name 34.78 0.08 0.08 __fentry__ 26.09 0.14 0.06 2 30.00 31.67 Heat::writePlot() 26.09 0.20 0.06 _mcount_private 8.70 0.22 0.02 8986009 0.00 0.00 std::vector<double, std::allocator<double> >::operator[](unsigned long) 4.35 0.23 0.01 1 10.00 26.66 Heat::solveEquation(char) 0.00 0.23 0.00 8986009 0.00 0.00 std::vector<std::vector<double, std::allocator<double> >, std::allocator<std::vector<double, std::allocator<double> > > >::operator[](unsigned long) 0.00 0.23 0.00 1496502 0.00 0.00 heatFunction(double, double)
. . .
Summary
As you can see the std::vector<double, std::allocator<double> >::operator[](unsigned long) function is where 8.7 % of the time is spent and thus this is where the program would most benefit from parallelisation.
S1920 = (1 / ( 1 - 0.087 + 0.087 / 1920 ) = 0.913
As this program wont get much faster even with my high number of cores I also choose Ben's Image Processor for parallelisation.
Assignment 2
We chose to parallelize the image processing program. Image processing is easy to parallelize but this project was a challenge due to the number of functions the program has.
Benchmarking
Intel Core i7 2600K (standard clock) NVidia GeForce GTX 750 Ti
Two images were used to test each image function, a large one and a small one.
The small sample test image:
The sample after being processed by the canny filter:
Results
This chart compares the run times for the original and parallelized image processing functions. We saw dramatic improvements to image filtering performance. All functions are down to constant time with respect to image dimensions, down from O(n^2).
Problems Encountered
Many of the operations were composed of several different kernels and other operations. To avoid repeated copies to and from the device we wrapped each kernel in a function that took device pointers. That way images could be loaded once and passed through multiple filters without returning them to the host.
Several of the operations scan a pixels neighbors to determine the pixels value. This creates a problem when a pixel is near the edge of an image. To solve this problem we interpenetrated the image not as a flat surface but as a torus. Anytime a thread would access an off image pixel it would wrap around and use a pixel from the opposite side of the image.