Difference between revisions of "GPU610 Team Tsubame"
Yanhao Lei (talk | contribs) (→Pi) |
Yanhao Lei (talk | contribs) (→Pi) |
||
Line 358: | Line 358: | ||
Pi: 3.14172063999999995331791069475002586841583251953125000000000000 | Pi: 3.14172063999999995331791069475002586841583251953125000000000000 | ||
− | ''' | + | '''Summary''' |
+ | |||
Both programs calculate finite results of Pi, however the Leibniz implementation has higher accuracy and calculates more digits per iteration. Since both algorithms have O(n) runtime, the Leibniz implementation is also superior in terms of speed. | Both programs calculate finite results of Pi, however the Leibniz implementation has higher accuracy and calculates more digits per iteration. Since both algorithms have O(n) runtime, the Leibniz implementation is also superior in terms of speed. | ||
Revision as of 00:05, 10 February 2017
Contents
TBD...
Team Members
- Mark Anthony Villaflor (Leader)
- Huachen Li
- Yanhao Lei
eMail All
Progress
PHASE 1
Pi
This is a comparison between two programs that calculate Pi.
- Test System Specifications
OS: Windows 7 (64-bit) CPU: Intel Core i3-2350M @ 2.30GHz GPU: GeForce GT 520MX (48 CUDA cores) IDE: Microsoft Visual Studio Enterprise 2015; Version 14.0.25431.01 Update 3
- How To Execute On Linux?
1. Here is the Makefile:
# Change this to "monte-carlo" if needed VER = leibniz # Uncomment and modify the following lines to specify a specific version of GCC #GCC_VERSION = 5.2.0 #PREFIX = /usr/local/gcc/${GCC_VERSION}/bin/ CC = ${PREFIX}gcc CPP = ${PREFIX}g++ $(VER): $(VER).o ; \ $(CPP) -g -pg -o$(VER) $(VER).o $(VER).o: $(VER).cpp ; \ $(CPP) -c -O2 -g -pg -std=c++14 $(VER).cpp # Remember to ">make clean" after ">make" to cleanup if the cleaning does not happen automatically clean: ; \ rm *\.o
2. Copy leibniz.cpp and/or monte-carlo.cpp and put them into the same directory as the Makefile.
3. Execute the following command (in the same directory as the Makefile):
> make
4. Execute the binary with:
> ./leibniz [ n ]
OR
> ./monte-carlo [ n ]
Where n is the number of iterations.
Leibniz formula implementation:
leibniz.cpp
00 #include <iostream> 01 #include <iomanip> 02 #include <string> 03 04 #include <chrono> 05 06 // Function duplicated from Workshop 2 - BLAS 07 void reportTime(const char* msg, std::chrono::steady_clock::duration span) { 08 auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(span); 09 std::cout << msg << " - took - " 10 << ms.count() << " millisecs" << std::endl; 11 } 12 13 // Arctangent of 1; execute for ~n iterations to refine the result 14 long double arctan1( unsigned long long int it ) { 15 long double r = 0.0; // 1 op. (=) runs 1 time 16 17 // v0.25; due to performing the operations in a different order, there are rounding issues... 18 for ( long double d = 0.0; d < it; d++ ) { // 1 op. (=) runs 1 time; 2 op.s (<, ++) run n times 19 long double de = d * 4.0 + 1.0; // 3 op.s (=, *, +) run n times 20 r += (1.0 / de) - (1.0 / (de + 2.0)); // 6 op.s (+ & =, /, -, /, +) run n times 21 } 22 23 return r; // 1 op. (return) runs 1 time 24 } 25 26 int main( int argc, char* argv[] ) { 27 unsigned long long int n = std::stoull(argv[1], 0); 28 29 std::chrono::steady_clock::time_point ts, te; 30 ts = std::chrono::steady_clock::now(); 31 long double pi = 4.0 * arctan1( n ); 32 te = std::chrono::steady_clock::now(); 33 reportTime("Arctangent(1) ", te - ts); 34 35 // Maximum length of a long double is 64 digits; minus "3." gives 62 digits. 36 std::cout.precision(62); 37 std::cout << "Pi: " << std::fixed << pi << std::endl; 40 }
- STAGE 1 - Big-O:
There is only one for loop in this program (on line 18); it executes d times, where d is the first argument provided to the program on the command line. Summing up the operations in the (predicted) hotspot, T(n) = 11n + 3; therefore O(n) runtime.
- STAGE 2 - Potential Speedup
Using Amdahl's Law:
P = 1708ms / 1725ms (using the third test data from below...) P = 0.99014 n = 48 (processors reported by deviceQuery.exe) Sn = 1 / ( 1 - P + P / n ) S48 = 1 / ( 1 - 0.99014 + 0.99014 / 48 ) S48 = 32.79988
The maximum speedup on the test system is approximately 33 times.
- STAGE 3 - Test Runs
The following tests were done on the Matrix server:
4 digits are correct at 10K iterations:
> time ./leibniz 10000 Arctangent(1) - took - 0 millisecs Pi: 3.14154265358982449354505184224706226814305409789085388183593750 real 0m0.016s user 0m0.004s sys 0m0.008s
7 digits are correct at 10M iterations:
> time ./leibniz 10000000 Arctangent(1) - took - 171 millisecs Pi: 3.14159260358979321929411010483335076060029678046703338623046875 real 0m0.187s user 0m0.172s sys 0m0.012s
No difference at 100M iterations:
> time ./leibniz 100000000 Arctangent(1) - took - 1708 millisecs Pi: 3.14159264858979533105269588144636827564681880176067352294921875 real 0m1.725s user 0m1.704s sys 0m0.008s
The following tests were done on the test system; these results will be used for the comparison between the serial and parallel versions.
n = 10,000
Arctangent(1) - took - 0 millisecs Pi: 3.14154265358982032196877298702020198106765747070312500000000000
n = 10,000,000
Arctangent(1) - took - 199 millisecs Pi: 3.14159260358809833135751432564575225114822387695312500000000000
n = 100,000,000
Arctangent(1) - took - 2087 millisecs Pi: 3.14159264457621567601108836242929100990295410156250000000000000
Monte-Carlo algorithm implementation:
monte-carlo.cpp
00 #include <iostream> 01 #include <random> 02 #include <string> 03 04 #include <chrono> 05 06 // Duplicated from https://scs.senecac.on.ca/~gpu610/pages/workshops/w2.html 07 void reportTime(const char* msg, std::chrono::steady_clock::duration span) { 08 auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(span); 09 std::cout << msg << " - took - " << 10 ms.count() << " millisecs" << std::endl; 11 } 12 13 int main(int argc, char* argv[]) { 14 std::chrono::steady_clock::time_point ts, te; 15 16 ts = std::chrono::steady_clock::now(); 17 unsigned long long int n = std::stoull(argv[1], 0), 18 totalCircle = 0; 19 20 int stride = 1000, 21 circleSize = n / stride; 22 23 unsigned int* circle = new unsigned int[circleSize]; 24 25 for (int i = 0; i < circleSize; i++) 26 circle[i] = 0; 27 28 std::random_device rd; 29 std::mt19937 mt(rd()); 30 std::uniform_real_distribution<long double> dist(0.0, 1.0); 31 te = std::chrono::steady_clock::now(); 32 reportTime("Init. ", te - ts); 33 34 ts = std::chrono::steady_clock::now(); 35 for (unsigned long long int i = 0; i < circleSize; i++) { 36 for (int j = 0; j < stride; j++) { 37 long double x = dist(mt), 38 y = dist(mt); 39 // if position is inside the circle... 40 if (x * x + y * y < 1.0) { 41 circle[i]++; 42 } 43 } 44 } 45 46 for (int i = 0; i < circleSize; i++) 47 totalCircle += circle[i]; 48 49 long double pi = 4.0 * ((long double) totalCircle) / ((long double) n); 50 te = std::chrono::steady_clock::now(); 51 reportTime("Drop points ", te - ts); 52 53 std::cout.precision(62); 54 std::cout << "Pi: " << std::fixed << pi << std::endl; 55 56 delete [] circle; 57 }
- STAGE 1 - Big-O:
The (predicted) hotspot begins from line 35 and ends at line 44. Although there are two for loops, the outer for loop executes n / stride times while the inner for loop executes stride times; the actual iteration is just n ( O(n) runtime ).
- STAGE 2 - Potential Speedup:
Using Amdahl's Law:
P = 10883ms / 10903ms (using the last sample of the third test data from below...) P = 0.99817 n = 48 (processors reported by deviceQuery.exe) Sn = 1 / ( 1 - P + P / n ) S48 = 1 / ( 1 - 0.99817 + 0.99817 / 48 ) S48 = 44.19015
The maximum speedup on the test system is approximately 44 times.
- STAGE 3 - Test Runs
The following tests were done on the Matrix server:
At around 10K iterations, the first decimal is stable.
> time ./monte-carlo 10000 Init. - took - 0 millisecs Drop points - took - 1 millisecs Pi: 3.11679999999999999995940747066214271399076096713542938232421875 real 0m0.019s user 0m0.004s sys 0m0.008s > time ./monte-carlo 10000 Init. - took - 0 millisecs Drop points - took - 1 millisecs Pi: 3.16480000000000000009124645483638005316606722772121429443359375 real 0m0.018s user 0m0.008s sys 0m0.004s > time ./monte-carlo 10000 Init. - took - 0 millisecs Drop points - took - 1 millisecs Pi: 3.16639999999999999995108079797745403993758372962474822998046875 real 0m0.018s user 0m0.004s sys 0m0.008s
The next digit is stable at around 10M iterations
> time ./monte-carlo 10000000 Init. - took - 0 millisecs Drop points - took - 1096 millisecs Pi: 3.14150879999999999990685506379151092914980836212635040283203125 real 0m1.114s user 0m1.092s sys 0m0.008s > time ./monte-carlo 10000000 Init. - took - 0 millisecs Drop points - took - 1097 millisecs Pi: 3.14219679999999999993332000514101309818215668201446533203125000 real 0m1.114s user 0m1.092s sys 0m0.016s > time ./monte-carlo 10000000 Init. - took - 0 millisecs Drop points - took - 1097 millisecs Pi: 3.14158840000000000010696443730751070688711479306221008300781250 real 0m1.115s user 0m1.088s sys 0m0.012s
By 100M, the third digit appears to be stable.
> time ./monte-carlo 100000000 Init. - took - 1 millisecs Drop points - took - 10910 millisecs Pi: 3.14138611999999999989559296142971334120375104248523712158203125 real 0m10.930s user 0m10.881s sys 0m0.012s > time ./monte-carlo 100000000 Init. - took - 1 millisecs Drop points - took - 10847 millisecs Pi: 3.14185203999999999998835042980260823242133483290672302246093750 real 0m10.868s user 0m10.833s sys 0m0.016s > time ./monte-carlo 100000000 Init. - took - 1 millisecs Drop points - took - 10883 millisecs Pi: 3.14160056000000000009896028441147564080893062055110931396484375 real 0m10.903s user 0m10.865s sys 0m0.016s
The following tests were done on the test system; these results will be used for the comparison between the serial and parallel versions.
n = 10,000
Init. - took - 2 millisecs Drop points - took - 1 millisecs Pi: 3.12440000000000006608047442568931728601455688476562500000000000 Init. - took - 1 millisecs Drop points - took - 2 millisecs Pi: 3.14880000000000004334310688136611133813858032226562500000000000 Init. - took - 2 millisecs Drop points - took - 2 millisecs Pi: 3.12760000000000015774048733874224126338958740234375000000000000
n = 10,000,000
Init. - took - 2 millisecs Drop points - took - 2046 millisecs Pi: 3.14098960000000015924115359666757285594940185546875000000000000 Init. - took - 2 millisecs Drop points - took - 2059 millisecs Pi: 3.14231840000000017809611563279759138822555541992187500000000000 Init. - took - 2 millisecs Drop points - took - 2052 millisecs Pi: 3.14155439999999996913970790046732872724533081054687500000000000
n = 100,000,000
Init. - took - 2 millisecs Drop points - took - 21263 millisecs Pi: 3.14143352000000009027758096635807305574417114257812500000000000 Init. - took - 2 millisecs Drop points - took - 21101 millisecs Pi: 3.14146268000000006281879905145615339279174804687500000000000000 Init. - took - 3 millisecs Drop points - took - 20884 millisecs Pi: 3.14172063999999995331791069475002586841583251953125000000000000
Summary
Both programs calculate finite results of Pi, however the Leibniz implementation has higher accuracy and calculates more digits per iteration. Since both algorithms have O(n) runtime, the Leibniz implementation is also superior in terms of speed.
With regards to the hotspots, both programs contain one area that holds over 99% of their total execution time; these areas will be the focus for parallelization. The programs were revised to remove all data dependencies, thus there should be no conflicts between threads when the parallel codes are applied.
Maze
Testing Environment:
- Operating system: Linux
- Compiler: g++ 5.2.0
- Processor: Dual-Core AMD Opteron(tm) Processor 8220
- System memory: 3098088 kB
How to setup:
1. Download source file: https://github.com/corzani/maze
2. Create “Makefile” file:
# Makefile for GPU610/assigment1/maze # GCC_VERSION = 5.2.0 PREFIX = /usr/local/gcc/${GCC_VERSION}/bin/ CC = ${PREFIX}gcc CPP = ${PREFIX}g++ OBJS = Maze.o MazeDebug.o MazePng.o main.o SOURCE = Maze.cpp MazeDebug.cpp MazePng.cpp main.cpp LIBS = -lpng TARGET = maze $(TARGET): $(OBJS) $(CPP) -pg -o$(TARGET) $(OBJS) $(LIBS) $(OBJS): $(SOURCE) $(CPP) -c -O2 -g -pg -std=c++14 $(SOURCE) #all: $(TARGET) clean: rm *.o
3. Complile and run:
> make > maze <maze x axes> <maze y axes>
Example:
> maze 5000 5000
Analysis:
The program function named toPng() takes up an average of 42% (min: 35%; max: 50%) of the execution time. The reason for toPng() to take so much runtime is because of the for loop that building the walls of the maze.
for (int i = 0; i < height; ++i) { row_pointers[i] = new png_byte[width * 3]; for (int j = 0; j < width * 3; j += 3) { row_pointers[i][j] = WALL; row_pointers[i][j + 1] = WALL; row_pointers[i][j + 2] = WALL; } }
The function also has a runtime of O(n^2).
For n = 1000
Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 50.00 0.05 0.05 MazePng::toPng(unsigned int) 30.00 0.08 0.03 1999978 0.00 0.00 AbstractMaze::getNext(unsigned int, short*, unsigned int*) 10.00 0.09 0.01 1 10.00 10.00 MazePng::createImage(unsigned char**, unsigned int) 10.00 0.10 0.01 AbstractMaze::generate(int) 0.00 0.10 0.00 999999 0.00 0.00 AbstractMaze::knockWall(unsigned int, unsigned int, AbstractMaze::directions) 0.00 0.10 0.00 7798 0.00 0.00 _ZNSt5dequeIjSaIjEE16_M_push_back_auxIJjEEEvDpOT_ 0.00 0.10 0.00 2 0.00 0.00 std::_Deque_base<unsigned int, std::allocator<unsigned int> >::_M_initialize_map(unsigned int) 0.00 0.10 0.00 2 0.00 0.00 std::_Deque_base<unsigned int, std::allocator<unsigned int> >::~_Deque_base() 0.00 0.10 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN12AbstractMaze3mapE 0.00 0.10 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN7MazePngC2Ejj 0.00 0.10 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN9MazeDebugC2Ejj 0.00 0.10 0.00 1 0.00 0.00 _GLOBAL__sub_I_main 0.00 0.10 0.00 1 0.00 0.00 AbstractMaze::AbstractMaze(unsigned int, unsigned int)
For n = 2000
Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 36.21 0.21 0.21 7999992 0.00 0.00 AbstractMaze::getNext(unsigned int, short*, unsigned int*) 24.14 0.35 0.14 MazePng::toPng(unsigned int) 12.07 0.42 0.07 3999999 0.00 0.00 AbstractMaze::knockWall(unsigned int, unsigned int, AbstractMaze::directions) 12.07 0.49 0.07 1 70.00 70.00 MazePng::createImage(unsigned char**, unsigned int) 10.34 0.55 0.06 AbstractMaze::generate(int) 5.17 0.58 0.03 1 30.00 30.00 AbstractMaze::AbstractMaze(unsigned int, unsigned int) 0.00 0.58 0.00 31113 0.00 0.00 _ZNSt5dequeIjSaIjEE16_M_push_back_auxIJjEEEvDpOT_ 0.00 0.58 0.00 2 0.00 0.00 std::_Deque_base<unsigned int, std::allocator<unsigned int> >::_M_initialize_map(unsigned int) 0.00 0.58 0.00 2 0.00 0.00 std::_Deque_base<unsigned int, std::allocator<unsigned int> >::~_Deque_base() 0.00 0.58 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN12AbstractMaze3mapE 0.00 0.58 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN7MazePngC2Ejj 0.00 0.58 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN9MazeDebugC2Ejj 0.00 0.58 0.00 1 0.00 0.00 _GLOBAL__sub_I_main
For n = 3000
Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 33.08 0.44 0.44 MazePng::toPng(unsigned int) 24.06 0.76 0.32 17999970 0.00 0.00 AbstractMaze::getNext(unsigned int, short*, unsigned int*) 14.29 0.95 0.19 8999999 0.00 0.00 AbstractMaze::knockWall(unsigned int, unsigned int, AbstractMaze::directions) 12.78 1.12 0.17 1 170.00 170.00 MazePng::createImage(unsigned char**, unsigned int) 11.28 1.27 0.15 AbstractMaze::generate(int) 3.01 1.31 0.04 1 40.00 40.00 AbstractMaze::AbstractMaze(unsigned int, unsigned int) 1.50 1.33 0.02 70076 0.00 0.00 _ZNSt5dequeIjSaIjEE16_M_push_back_auxIJjEEEvDpOT_ 0.00 1.33 0.00 2 0.00 0.00 std::_Deque_base<unsigned int, std::allocator<unsigned int> >::_M_initialize_map(unsigned int) 0.00 1.33 0.00 2 0.00 0.00 std::_Deque_base<unsigned int, std::allocator<unsigned int> >::~_Deque_base() 0.00 1.33 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN12AbstractMaze3mapE 0.00 1.33 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN7MazePngC2Ejj 0.00 1.33 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN9MazeDebugC2Ejj 0.00 1.33 0.00 1 0.00 0.00 _GLOBAL__sub_I_main
For n = 4000
Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 34.51 0.78 0.78 MazePng::toPng(unsigned int) 28.32 1.42 0.64 31999957 0.00 0.00 AbstractMaze::getNext(unsigned int, short*, unsigned int*) 13.72 1.73 0.31 1 310.00 310.00 MazePng::createImage(unsigned char**, unsigned int) 10.18 1.96 0.23 15999999 0.00 0.00 AbstractMaze::knockWall(unsigned int, unsigned int, AbstractMaze::directions) 10.18 2.19 0.23 AbstractMaze::generate(int) 3.10 2.26 0.07 1 70.00 70.00 AbstractMaze::AbstractMaze(unsigned int, unsigned int) 0.00 2.26 0.00 124808 0.00 0.00 _ZNSt5dequeIjSaIjEE16_M_push_back_auxIJjEEEvDpOT_ 0.00 2.26 0.00 2 0.00 0.00 std::_Deque_base<unsigned int, std::allocator<unsigned int> >::_M_initialize_map(unsigned int) 0.00 2.26 0.00 2 0.00 0.00 std::_Deque_base<unsigned int, std::allocator<unsigned int> >::~_Deque_base() 0.00 2.26 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN12AbstractMaze3mapE 0.00 2.26 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN7MazePngC2Ejj 0.00 2.26 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN9MazeDebugC2Ejj 0.00 2.26 0.00 1 0.00 0.00 _GLOBAL__sub_I_main
For n = 5000
Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 35.16 1.35 1.35 MazePng::toPng(unsigned int) 27.08 2.39 1.04 49999831 0.00 0.00 AbstractMaze::getNext(unsigned int, short*, unsigned int*) 12.50 2.87 0.48 1 480.00 480.00 MazePng::createImage(unsigned char**, unsigned int) 11.72 3.32 0.45 AbstractMaze::generate(int) 10.68 3.73 0.41 24999999 0.00 0.00 AbstractMaze::knockWall(unsigned int, unsigned int, AbstractMaze::directions) 2.86 3.84 0.11 1 110.00 110.00 AbstractMaze::AbstractMaze(unsigned int, unsigned int) 0.00 3.84 0.00 195749 0.00 0.00 _ZNSt5dequeIjSaIjEE16_M_push_back_auxIJjEEEvDpOT_ 0.00 3.84 0.00 2 0.00 0.00 std::_Deque_base<unsigned int, std::allocator<unsigned int> >::_M_initialize_map(unsigned int) 0.00 3.84 0.00 2 0.00 0.00 std::_Deque_base<unsigned int, std::allocator<unsigned int> >::~_Deque_base() 0.00 3.84 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN12AbstractMaze3mapE 0.00 3.84 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN7MazePngC2Ejj 0.00 3.84 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN9MazeDebugC2Ejj 0.00 3.84 0.00 1 0.00 0.00 _GLOBAL__sub_I_main
Summary:
Function toPng() has an average of 42% execution time. It also has O(n^2) runtime.