CDOT Wiki β

GPU621/Intel Parallel Studio VTune Amplifier

5,045 bytes added, 19:33, 8 December 2021
Features & Functionalities
VTune uses low-overhead (~5%) sampling and tracing collection to get the information it needs without significantly slowing down the application. The data collector uses the OS timer to profile the application, samples all active instruction addresses at 10 ms intervals, and captures a call sequence for each sample. Once collection is complete, the results are displayed in the results tab.
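The idea behind timer-based sampling can be illustrated with a small simulation. This is only a sketch, not VTune's collector: the timeline data and the function name <code>sample_timeline</code> are hypothetical, and the only value taken from the text above is the 10 ms interval. It shows how counting which function is active at each timer tick gives an estimate of where time is spent.

```python
from collections import Counter

SAMPLING_INTERVAL_MS = 10  # interval quoted in the text above


def sample_timeline(timeline, interval_ms=SAMPLING_INTERVAL_MS):
    """Estimate per-function time from periodic samples.

    timeline: list of (function_name, duration_ms) in execution order.
    This is hypothetical data standing in for a running program.
    """
    # Record the end time of every segment so that a sample time can be
    # mapped back to the function that was executing at that moment.
    boundaries = []
    elapsed = 0
    for name, duration in timeline:
        elapsed += duration
        boundaries.append((elapsed, name))

    samples = Counter()
    t = interval_ms
    idx = 0
    while t <= elapsed:
        # Advance to the segment that contains sample time t.
        while boundaries[idx][0] < t:
            idx += 1
        samples[boundaries[idx][1]] += 1
        t += interval_ms

    # Each sample stands for one interval of execution time.
    return {name: count * interval_ms for name, count in samples.items()}


timeline = [("load_data", 35), ("compute", 120), ("write_output", 25)]
estimate = sample_timeline(timeline)
```

Note that the estimate for <code>load_data</code> is 30 ms rather than the true 35 ms: sampling trades exactness for low overhead, and the error shrinks as the run gets longer relative to the interval.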
=====Hardware Event-Based Sampling=====
In this mode VTune analyzes not only the target application but every process running on the system during collection, and reports CPU time for the system as a whole. It still lists and times the functions that run in the target application, but it does not capture their call sequences.
----
For more information on Hotspots click [https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/analyze-performance/algorithm-group/basic-hotspots-analysis.html here]
====Anomaly Detection Analysis====
 
This feature analyzes your code and searches for anomalies caught during run-time. The types of anomalies are:
=====Context Switch Anomaly=====
This helps pinpoint threads that idle too long because of synchronization issues.
=====Kernel-Induced Anomaly=====
This provides insight into problems in the interaction between the operating system kernel and the application.
=====Frequency Drops=====
These can be rather concerning since CPU frequency drops are likely due to an issue with your hardware like inefficient cooling or other CPU related issues.
=====Control Flow Deviation Anomaly=====
This is reported when the Instructions Retired metric is exceptionally large for some threads, which may indicate that the code deviated from its expected control flow during execution.
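To make the last category more concrete, here is a minimal sketch of flagging threads whose Instructions Retired count stands out from the rest. It is not VTune's detection algorithm: the thread names, the counts, and the "three times the median" threshold are all assumptions chosen for illustration.

```python
from statistics import median


def flag_instruction_outliers(instructions_by_thread, factor=3.0):
    """Return the threads whose count exceeds factor * median.

    The factor is a hypothetical threshold, not a VTune setting.
    """
    typical = median(instructions_by_thread.values())
    return sorted(
        thread
        for thread, count in instructions_by_thread.items()
        if count > factor * typical
    )


# Hypothetical per-thread Instructions Retired counts.
counts = {
    "worker-0": 1_000_000,
    "worker-1": 1_050_000,
    "worker-2": 980_000,
    "worker-3": 9_400_000,
}
flagged = flag_instruction_outliers(counts)
```

Here only <code>worker-3</code> is flagged, since it retired roughly nine times as many instructions as the other threads doing the same job.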
 
----
For more information on Anomaly Detection click [https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/analyze-performance/algorithm-group/anomaly-detection-analysis.html here]
 
===Microarchitecture===
====Microarchitecture Exploration====
 
This analysis shows how efficiently your code moves through the core pipeline, and it calculates ratios that help identify issues at the hardware level.
 
[[File:microarchitecture.png]]
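One example of such a ratio is cycles per instruction (CPI): clock cycles divided by instructions retired. The sketch below computes it for two hypothetical functions; the function names and counter values are made up, and a real analysis reports many more ratios than this one.

```python
def cpi_rate(clockticks, instructions_retired):
    """Cycles per instruction: lower generally means better pipeline use."""
    if instructions_retired == 0:
        raise ValueError("no instructions retired; CPI is undefined")
    return clockticks / instructions_retired


# Hypothetical (clockticks, instructions retired) pairs per function.
functions = {
    "matmul": (8_000_000, 2_000_000),
    "reduce": (1_500_000, 3_000_000),
}
rates = {name: cpi_rate(c, i) for name, (c, i) in functions.items()}
worst = max(rates, key=rates.get)
```

A function with a high CPI, such as <code>matmul</code> in this made-up data, is a candidate for a closer look at stalls in the pipeline.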
 
====Memory Access====
 
This feature is used to locate memory-related issues, such as problems with memory allocations/de-allocations, high bandwidth usage, and NUMA (Non-Uniform Memory Access) problems.
 
Memory Access analysis type uses hardware event-based sampling to collect data for the following metrics:
 
*Loads and Stores metrics that show the total number of loads and stores
 
*LLC Miss Count metric that shows the total number of last-level cache misses
 
**Local DRAM Access Count metric that shows the total number of LLC misses serviced by the local memory
**Remote DRAM Access Count metric that shows the number of accesses to the remote socket memory
**Remote Cache Access Count metric that shows the number of accesses to the remote socket cache
*Memory Bound metric that shows a fraction of cycles spent waiting due to demand load or store instructions
**L1 Bound metric that shows how often the machine was stalled without missing the L1 data cache
**L2 Bound metric that shows how often the machine was stalled on L2 cache
**L3 Bound metric that shows how often the CPU was stalled on L3 cache, or contended with a sibling core
**L3 Latency metric that shows a fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)
**NUMA: % of Remote Accesses metric shows percentage of memory requests to remote DRAM. The lower its value is, the better.
**DRAM Bound metric that shows how often the CPU was stalled on the main memory (DRAM). This metric enables you to identify DRAM Bandwidth Bound and UPI Utilization Bound issues, as well as Memory Latency issues, with the following metrics:
***Remote / Local DRAM Ratio metric that is defined by the ratio of remote DRAM loads to local DRAM loads
***Local DRAM metric that shows how often the CPU was stalled on loads from the local memory
***Remote DRAM metric that shows how often the CPU was stalled on loads from the remote memory
***Remote Cache metric that shows how often the CPU was stalled on loads from the remote cache in other sockets
 
*Average Latency metric that shows an average load latency in cycles
 
[[File:memoryaccess.png]]
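Two of the NUMA metrics listed above are simple ratios and can be sketched from raw load counts. The Remote / Local DRAM Ratio follows the definition given in the list. For the percentage of remote accesses, this sketch assumes the denominator is the sum of local and remote DRAM loads, which may differ from how VTune computes it. The counts are hypothetical.

```python
def numa_ratios(local_dram_loads, remote_dram_loads):
    """Return (remote/local ratio, percent of DRAM loads that were remote)."""
    total = local_dram_loads + remote_dram_loads
    if local_dram_loads:
        remote_to_local = remote_dram_loads / local_dram_loads
    else:
        remote_to_local = float("inf")
    # Assumption: the percentage is taken over all DRAM loads.
    pct_remote = 100 * remote_dram_loads / total if total else 0.0
    return remote_to_local, pct_remote


# Hypothetical counts: three local loads for every remote load.
ratio, pct = numa_ratios(local_dram_loads=750_000, remote_dram_loads=250_000)
```

With these counts a quarter of the DRAM loads go to the remote socket. For both values, lower is better.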
 
----
For more information on Memory Access click [https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/analyze-performance/microarchitecture-analysis-group/memory-access-analysis.html here]
 
===Parallelism===
====Threading====
 
This feature analyzes your program and reports how well you are utilizing your cores, how many threads your program uses, and how much load each thread takes on. It also provides more in-depth information such as wait time, spin time, and overhead time.
 
[[File:threading.png]]
 
Like Hotspots, Threading has two modes of data collection: User-Mode Sampling and Tracing, and Hardware Event-Based Sampling and Context Switches.
 
=====User-Mode Sampling and Tracing=====
 
This mode recognizes synchronization objects and collects thread wait time per object. The data can help the user understand thread interactions and pinpoint where the code can be optimized.
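The kind of summary this mode produces can be sketched as grouping wait events by synchronization object. The event records, object names, and the function <code>wait_time_by_object</code> below are hypothetical and do not reflect VTune's data format; the sketch only shows why per-object totals make contended locks easy to spot.

```python
from collections import defaultdict


def wait_time_by_object(wait_events):
    """Sum wait time per synchronization object, longest wait first.

    wait_events: iterable of (thread_id, sync_object, wait_ms) tuples.
    """
    totals = defaultdict(float)
    for _thread_id, sync_object, wait_ms in wait_events:
        totals[sync_object] += wait_ms
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical wait events recorded during a run.
events = [
    (1, "queue_mutex", 12.0),
    (2, "queue_mutex", 30.5),
    (1, "log_lock", 1.5),
    (3, "queue_mutex", 7.5),
    (2, "log_lock", 0.5),
]
ranking = wait_time_by_object(events)
```

In this made-up data <code>queue_mutex</code> accounts for almost all of the wait time, so it would be the first place to look for an optimization.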
 
=====Hardware Event-Based Sampling and Context Switches=====
 
This mode collects thread idle wait time. Even though there are no object definitions, the problematic synchronization functions can be identified from the wait time attributed to call stacks.
 
----
For more information on Threading click [https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/analyze-performance/parallelism-analysis-group/threading-analysis.html here]
 
====HPC Performance====
===Accelerators===
====GPU Offload====
====CPU/FPGA Interaction====
===Platform Analyses===
====Platform Profiler====
====System Overview====
'''Versions of the software:'''