Difference between revisions of "GPU621 Team 1"

[https://wiki.cdot.senecacollege.ca/wiki/DPS921_Team_1 Please go to this page title]
== What is VTune Amplifier? ==
 
 
 
An application that applies different kinds of analysis to your program<br />

Helps programmers debug and improve their programs <br />

Provides a GUI <br />

Available both as a standalone application and as an IDE add-on <br />
 
 
 
== Where To Get VTune? ==
 
 
 
Price: $899 <br/>
 
Windows / Linux <br/>

Free Software Tools are available to: <br/>
 
- Students <br/>
 
- Educators <br/>
 
- Academic researchers <br/>
 
- Open source contributors <br/>
 
 
 
 
 
[https://software.intel.com/en-us/intel-vtune-amplifier-xe/try-buy Download VTune here]
 
 
 
== Getting Started ==
 
 
 
 
 
 
 
== VTune Tutorial 1: Finding HotSpot ==
 
[[File:1.png|400px]]
 
 
 
This example program is installed along with VTune Amplifier. The sample code from Intel is located in the following directory:
 
[Program Files]\IntelSWTools\VTune Amplifier XE <version>\samples\en\C++\tachyon_vtune_amp_xe.zip
 
Open the project in Visual Studio. Then launch VTune Amplifier and click New Analysis. (VTune Amplifier must be installed for its tab to appear in Visual Studio.)
 
<br />
 
 
 
[[File:Tim Hotspt 2.png|400px]]
 
 
 
This is the next page you should see. Here you choose the type of analysis to run. We are going to do a Basic Hotspots analysis. Click Start to begin the analysis.
 
<br />
 
 
 
[[File:Tim Hotspt 3.png|400px]]
 
 
 
The program runs automatically once the analysis starts. You will notice that the image loads from the bottom to the top.
 
<br />
 
 
 
[[File:Tim Hotspt 4.png|400px]]
 
 
 
After the program finishes running, it will take a while for Amplifier to generate the report.
 
<br />
 
 
 
[[File:Tim Hotspt 5.png|400px]]
 
 
 
The first page shows a summary of the run: the elapsed time, the top hotspots, CPU usage, and so on. We will focus on the Hotspots table. Notice that "initialize_2D_buffer" uses the most CPU time. If you look at the code in ''find_hotspots.cpp'', you will see that it is a function defined in that file.
 
<br />
 
 
 
[[File:Tim Hotspt 7.png|400px]]
 
 
 
Next, go to the Bottom-up tab. It shows a graph of the same data as the Hotspots table. You can clearly see that "initialize_2D_buffer" uses the most time compared to the other functions.
 
<br />
 
 
 
[[File:Tim Hotspt 8.png|400px]]
 
 
 
If we double-click this function, VTune shows its source code and which lines use the most time within that function. Now we can tell that most of the time is spent in the while loop.
 
<br />
 
 
 
[[File:Tim Hotspt 9.png|400px]]
 
 
 
To compare against a parallel version of this code, I have prepared a program that uses Cilk Plus to parallelize it. The link below downloads that code. Simply replace ''find_hotspots.cpp'' with this file, rebuild, and run the analysis again.
 
 
 
Link: [[File:Find hotspots.zip]]
 
 
 
[[File:Tim hotspot 15.PNG|150px]] [[File:Tim hotspot 13.PNG|400px]] [[File:Tim hotspot 14.PNG|500px]]
 
 
 
This is the code we changed.

- First, we changed the header to enable Cilk Plus.

- Second, we commented out the slow method in "initialize_2D_buffer" and used the faster method.

- Third, we added Cilk Plus code so that the program runs in parallel.
 
<br />
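The sample's two fill methods are not reproduced here. As a rough illustration of why one way of filling a 2D buffer can be much slower than another, the sketch below assumes the difference is memory access order. The function names <code>fill_strided</code> and <code>fill_contiguous</code> are hypothetical and are not taken from Intel's code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not Intel's code): fills a rows x cols buffer stored in
// row-major order by walking down each column, so consecutive writes land
// `cols` elements apart in memory. This tends to be cache-unfriendly.
void fill_strided(std::vector<unsigned>& buf, std::size_t rows,
                  std::size_t cols, unsigned value) {
    assert(buf.size() == rows * cols);
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            buf[r * cols + c] = value;
}

// Produces the same buffer contents, but writes to consecutive addresses.
void fill_contiguous(std::vector<unsigned>& buf, std::size_t rows,
                     std::size_t cols, unsigned value) {
    assert(buf.size() == rows * cols);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            buf[r * cols + c] = value;
}
```

Both functions produce identical results, so on a large buffer a profiler such as VTune would be expected to show the difference only in the time attributed to the loop.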
 
 
 
 
 
[[File:Tim Hotspt 10.png|400px]]
 
 
 
When you run the analysis this time, you will see the image load at several levels at the same time instead of loading from the bottom to the top.
 
<br />
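The image fills in at several levels at once because different workers handle different rows. The sketch below shows the same idea using <code>std::thread</code> instead of Cilk Plus, which many current compilers no longer support. In Cilk Plus, a <code>cilk_for</code> loop over the rows would play a similar role. The names <code>shade</code> and <code>render_parallel</code> are hypothetical stand-ins, not the sample's functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for per-pixel work; the real sample traces rays.
unsigned shade(std::size_t row, std::size_t col) {
    return static_cast<unsigned>(row * 31 + col * 17);
}

// Splits the image into horizontal bands and renders each band on its own thread.
std::vector<unsigned> render_parallel(std::size_t rows, std::size_t cols,
                                      std::size_t workers) {
    assert(workers > 0);
    std::vector<unsigned> image(rows * cols);
    std::vector<std::thread> pool;
    const std::size_t band = (rows + workers - 1) / workers;  // rows per worker, rounded up
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(rows, w * band);
        const std::size_t end = std::min(rows, begin + band);
        // Each worker writes only to its own rows, so no locking is needed.
        pool.emplace_back([&image, cols, begin, end] {
            for (std::size_t r = begin; r < end; ++r)
                for (std::size_t c = 0; c < cols; ++c)
                    image[r * cols + c] = shade(r, c);
        });
    }
    for (auto& t : pool) t.join();
    return image;
}
```

Calling <code>render_parallel(rows, cols, 1)</code> gives the serial result, which is a convenient way to check that the parallel version computes the same image.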
 
 
 
 
 
[[File:Tim Hotspt 11.png|400px]]
 
 
 
This time we should see that the Elapsed Time is about 10 seconds shorter than before and that the top hotspot is no longer "initialize_2D_buffer".
 
<br />
 
 
 
 
 
[[File:Tim Hotspt 12.png|400px]]
 
 
 
If we go to the Bottom-up tab, we can see that "initialize_2D_buffer" no longer appears, and the Cilk worker graph shows that the program now runs in parallel.
 
<br />
 
 
 
== VTune Tutorial 2: Locks and Wait Tutorial ==
 
 
 
'''1. Prepare for Analysis'''
 
 
 
Note: the configuration step is skipped; the default application configuration is used.
 
 
 
Determine the baseline (the total execution time against which you will compare subsequent runs of the application).
 
 
 
Do this by running the application for the first time.
 
 
 
After running the application, the baseline for the first run is 6.063s.
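If you want to cross-check a baseline outside VTune, one option is to time the work directly. This is a minimal sketch, assuming the work can be wrapped in a callable. The helper name <code>elapsed_seconds</code> is hypothetical and is not part of the sample or of VTune.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Returns the wall-clock seconds taken by one call to `work`.
// steady_clock is used because it is not affected by system clock adjustments.
double elapsed_seconds(const std::function<void()>& work) {
    assert(work);  // an empty std::function cannot be called
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Running the measurement several times and comparing the values gives a sense of how much run-to-run variation to expect before judging an optimization.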
 
 
 
[[File:1.PNG]]
 
 
 
 
 
'''2. Find lock'''
 
 
 
'''2a.''' Choose and run locks and waits analysis:
 
 
 
In Visual Studio, click "New Analysis".
 
 
 
Choose your analysis target (the application executable).
 
 
 
Click the Analysis Type tab. Under Algorithm Analysis, select "Locks and Waits", then click Start to run the analysis.
 
 
 
[[File:3.PNG]]
 
 
 
 
 
You should see the Locks and Waits viewport and the summary of the results.
 
 
 
[[File:4.PNG]]
 
 
 
 
 
'''2b.''' Interpret result data:
 
 
 
To interpret the data on the sample code performance, do the following:
 
 
 
1. Analyze the basic performance metrics.
 
2. Identify locks.
 
 
 
 
 
Analyze the basic performance metrics:
 
 
 
The Result Summary section reports overall application performance through the following metrics:
 
 
 
[[File:5.PNG]]
 
 
 
1.) Elapsed Time is the total time the application ran, including data allocation and calculations.
 
 
 
2.) Wait Time occurs when software threads are waiting due to APIs that block or cause synchronization. Wait Time is calculated per thread, so the total Wait Time may exceed the application's Elapsed Time. Expand the Wait Time metric to view its distribution per processor utilization level. In the sample application, most of the Wait Time is characterized by ineffective processor usage.
 
 
 
3.) Wait Count is the overall number of times the system wait API was called for the analyzed application.
 
 
 
4.) Spin Time is the time a thread is active in a synchronization construct. The current value exceeds the threshold, so it is classified as a performance issue and highlighted in pink.
 
 
 
5.) CPU Time is the sum of CPU time for all threads.
 
 
 
6.) Total Thread Count is the number of threads in the application.
 
 
 
7.) Paused Time is the amount of Elapsed time during which the analysis was paused via GUI, CLI commands, or user API.
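The point that summed Wait Time can exceed Elapsed Time can be reproduced with a small experiment. The sketch below is not the analyze_locks sample. It assumes several threads that each take one shared <code>std::mutex</code> and hold it for a fixed time, and it measures only the time each thread spends blocked on that mutex. The names <code>ContentionResult</code> and <code>run_contended</code> are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

struct ContentionResult {
    double elapsed_ms;     // wall-clock time for the whole run
    double total_wait_ms;  // time spent blocked on the lock, summed over threads
};

// Starts `threads` threads that each acquire one shared mutex and hold it
// for `hold_ms` milliseconds, so the work is fully serialized.
ContentionResult run_contended(int threads, int hold_ms) {
    assert(threads > 0 && hold_ms >= 0);
    using clock = std::chrono::steady_clock;
    using ms = std::chrono::duration<double, std::milli>;
    std::mutex lock;
    std::vector<double> waits(threads, 0.0);
    std::vector<std::thread> pool;
    const auto start = clock::now();
    for (int i = 0; i < threads; ++i) {
        pool.emplace_back([&, i] {
            const auto before = clock::now();
            std::lock_guard<std::mutex> guard(lock);  // blocks while another thread holds it
            waits[i] = ms(clock::now() - before).count();
            // Simulated work done while holding the lock.
            std::this_thread::sleep_for(std::chrono::milliseconds(hold_ms));
        });
    }
    for (auto& t : pool) t.join();
    ContentionResult r{ms(clock::now() - start).count(), 0.0};
    for (double w : waits) r.total_wait_ms += w;
    return r;
}
```

With four threads, the later threads each wait for all earlier holders, so the summed wait grows faster than the elapsed time. Shortening the time spent under the lock is the usual way to bring both numbers down.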
 
 
 
For the analyze_locks application, the Wait Time is high. To identify the cause, you need to understand how this Wait Time is distributed across synchronization objects.
 
 
 
The Top Waiting Objects section provides the list of synchronization objects with the highest Wait Time and Wait Count, sorted by the Wait Time metric.
 
 
 
[[File:6.PNG]]
 
 
 
 
 
For the analyze_locks application, focus on the first three objects and explore the Bottom-up pane for more details.
 
 
 
The Thread Concurrency Histogram represents the Elapsed time and concurrency level for the specified number of running threads. Ideally, the highest bar of your chart should be within the OK or Ideal utilization range.
 
 
 
[[File:7.PNG]]
 
 
 
Note the Target Concurrency value. By default, this number is equal to the number of physical cores. Consider this number as your optimization goal.
 
 
 
For the sample code, the chart shows that analyze_locks is a multithreaded application running a maximum of 4 threads simultaneously on a machine with 4 cores. However, it is not using the available cores effectively.
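When choosing how many threads to run on a given machine, a program can ask the C++ standard library how many hardware threads are available. Note that <code>std::thread::hardware_concurrency()</code> reports hardware thread contexts (typically logical CPUs), which may differ from the physical core count VTune uses for Target Concurrency. The helper names below are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Number of hardware threads reported by the standard library.
// The call may return 0 when the value cannot be determined, so fall back to 1.
unsigned available_hardware_threads() {
    const unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1u : n;
}

// Caps a requested worker count at the number of hardware threads.
unsigned worker_count(unsigned requested) {
    assert(requested > 0);
    return std::min(requested, available_hardware_threads());
}
```

For example, <code>worker_count(4)</code> returns 4 on a machine that reports at least four hardware threads and a smaller number otherwise.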
 
 
 
Hover over the second bar to understand how long the application ran serially. The tooltip shows that the application ran one thread for almost 6.611 seconds, which is classified as Poor concurrency.
 
 
 
The CPU Usage Histogram represents the Elapsed time and usage level for the logical CPUs. Ideally, the highest bar of your chart should be within the OK or Ideal utilization range.
 
 
 
[[File:8.PNG]]
 
 
 
== VTune Tutorial 3: Disk Input and Output Analysis ==
 
 
 
 
 
 
 
== Resources ==
 

Latest revision as of 12:29, 5 December 2016