GPU621/VTuners
Intel VTune Profiler
Intel VTune Profiler, formerly called Intel VTune Amplifier, is an application performance analysis tool that runs on Microsoft Windows and Linux systems. Most of its features work on both Intel and AMD hardware, but some are available only on Intel CPUs or GPUs. It offers six main analysis areas: Algorithm Optimization, Microarchitecture and Memory Bottlenecks, Accelerators and XPUs, Parallelism, Platform and I/O, and Multi-Node. It can be downloaded for free from the Intel® VTune™ Profiler website as a stand-alone version or as part of the Intel® oneAPI Base Toolkit. However, parts of the advanced analysis in some features are paid services.
Group Members
VTune Profiler Features
The VTune Profiler has a variety of features that provide information to help optimize application and system performance. The profiler also assists in system configuration for HPC, cloud, IoT, media, storage, and other workloads.
The profiler provides compatibility for a variety of systems and platforms that include the following:
CPU, GPU, and FPGA
Any combination of the following languages: SYCL, C, C++, C#, Fortran, OpenCL, Python, Google Go, Java, .NET, Assembly
Optimized performance that avoids power or thermal throttling
Collection of coarse-grained data over extended periods, with detailed results that include mapping to source code
Algorithm Optimization
Analyzing Hot Code Paths
Flame Graphs
The Intel VTune Profiler provides flame graphs that visualize the stacks and stack frames of an application. Every function is plotted on the graph: the height on the y-axis represents stack depth, and the width of each bar represents the CPU time used. The “hottest” functions in an application are therefore the widest bars on the flame graph.
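To make the relationship between samples, depth, and width concrete, here is a minimal sketch of how sampled call stacks could be folded into flame-graph bars. It is an illustration only and not how VTune is implemented; the names `Stack` and `fold_stacks` are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One sampled call stack, outermost frame first (for example main, work, leaf).
// Hypothetical type used only for this illustration.
using Stack = std::vector<std::string>;

// Folds sampled stacks into "frame;frame;frame" -> sample count.
// Each key corresponds to one bar: the number of frames in the key is the
// bar's height (stack depth) and the count is its width (CPU time share).
std::map<std::string, int> fold_stacks(const std::vector<Stack>& samples) {
    std::map<std::string, int> widths;
    for (const Stack& stack : samples) {
        assert(!stack.empty());  // a sample always has at least one frame
        std::string path;
        for (const std::string& frame : stack) {
            if (!path.empty()) path += ';';
            path += frame;
            ++widths[path];  // every frame on the stack was active for this sample
        }
    }
    return widths;
}
```

Given three samples with stacks `main;a`, `main;a;b`, and `main;c`, the bar for `main` has width 3, `main;a` has width 2, and the two leaf bars have width 1 each, so the widest bars sit under the code that was on the CPU most often.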
Analyzing Hot Spots
The Hotspots analysis feature in the Intel VTune Profiler lets you dig deeper into your application and identify pieces of code that take a long time to execute. These hotspots point to problem areas in your application and help you improve performance.
User-Mode Sampling
User-mode sampling is the default option for the VTune Profiler. This sampling method has low overhead, which allows it to collect information without significantly affecting the run time of your application. Using a sampling interval of 10 ms, the profiler collects data in the following steps:
• Interrupts the process
• Collects samples of active instruction addresses
• Records a copy of the stack
The profiler then stores the sampled instruction pointers and stacks so that it can analyze and display the data. Together, the instruction pointers and stack data let the profiler build a top-down tree, which gives a better understanding of the control flow of important code blocks.
The user-mode sampling method gathers data only about your application, not about wider system performance. The results show the total time used by functions within the application. If many samples are collected in a specific process or thread, we can identify it as a hotspot and a potential bottleneck in the application's performance.
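The arithmetic behind sample-based timing can be sketched as follows. This is a simplified model under stated assumptions, not VTune's actual algorithm: each sample is reduced to the name of the function that owned the sampled instruction pointer, and every sample is assumed to stand for one full 10 ms interval. The names `estimate_cpu_time_ms` and `hottest_function` are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Assumed sampling interval, taken from the 10 ms figure quoted above.
constexpr int kSampleIntervalMs = 10;

// Estimated CPU time per function = (samples landing in it) * interval.
std::map<std::string, int> estimate_cpu_time_ms(
        const std::vector<std::string>& sampled_functions) {
    std::map<std::string, int> time_ms;
    for (const std::string& fn : sampled_functions) {
        time_ms[fn] += kSampleIntervalMs;
    }
    return time_ms;
}

// Returns the function with the largest estimated time, i.e. the hotspot.
std::string hottest_function(const std::map<std::string, int>& time_ms) {
    assert(!time_ms.empty());
    auto best = time_ms.begin();
    for (auto it = time_ms.begin(); it != time_ms.end(); ++it) {
        if (it->second > best->second) best = it;
    }
    return best->first;
}
```

With the samples `is_prime, is_prime, wmain, is_prime`, the estimate is 30 ms for `is_prime` and 10 ms for `wmain`, so `is_prime` is reported as the hotspot. Because the result is statistical, short-lived functions may be missed entirely between samples.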
Hardware Event-Based Sampling
Hardware event-based sampling relies on hardware events. It uses them to collect data on all the processes running on your CPU at a given moment and provides performance analysis for the whole system. As with user-mode sampling, the profiler generates a list of the functions used in your application and the time spent in each of them. By default, event-based sampling mode does not collect stacks as user-mode sampling does, but you can turn that option on.
Microarchitecture and Memory Bottlenecks
Main Benefits of The Microarchitecture and Memory Modules
The Intel VTune Profiler lets you use Microarchitecture Exploration analysis to improve application performance by pinpointing hardware-level issues. It can also identify memory-access-related problems, including cache misses and high-bandwidth problems.
Top-down Microarchitecture Analysis
The Intel VTune Profiler includes a tool for Microarchitecture Exploration (ME) analysis, which uses events collected in the top-down characterization to let the user pinpoint hardware issues in an application. The analysis also records other metrics important to performance, which are displayed in the Microarchitecture Exploration viewpoint. Using the hotspot analysis from the Algorithm Optimization section, we can identify areas where our code takes a lot of CPU time to run. We can then apply the ME analysis to those areas to determine how efficiently the code runs through the core pipeline. The ME analysis instructs the VTune Profiler to collect a list of events and computes metrics that make it easier to identify performance issues at the hardware level.
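As an illustration of the kind of memory-access problem this analysis is meant to surface, the sketch below sums the same matrix in two ways. Both functions return the same result, but the second strides through memory and would typically incur more cache misses on large inputs. The function names are hypothetical, and how strongly the difference shows up depends on matrix size and hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums an n x n matrix stored in row-major order, walking memory contiguously.
long long sum_row_major(const std::vector<int>& m, std::size_t n) {
    assert(m.size() == n * n);
    long long total = 0;
    for (std::size_t r = 0; r < n; ++r) {
        for (std::size_t c = 0; c < n; ++c) {
            total += m[r * n + c];  // consecutive addresses, cache friendly
        }
    }
    return total;
}

// Same sum, but the inner loop jumps n elements per access, so each access
// may touch a different cache line once n is large.
long long sum_column_major(const std::vector<int>& m, std::size_t n) {
    assert(m.size() == n * n);
    long long total = 0;
    for (std::size_t c = 0; c < n; ++c) {
        for (std::size_t r = 0; r < n; ++r) {
            total += m[r * n + c];  // stride of n elements between accesses
        }
    }
    return total;
}
```

Profiling both versions on a large matrix is one way to see a memory-bound pattern appear for the strided loop while the arithmetic work stays identical.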
Accelerators and XPUs
Why XPUs?
Computing has become irreversibly heterogeneous, driven by the fast-growing development of applications such as machine learning, video editing, and gaming. That means specialized hardware is preferred over multi-purpose hardware. Typical examples are the separation of GPUs from CPUs and the use of FPGAs. Among these parts, the GPU is becoming critical for compute-intensive applications. It is a highly parallel machine with many smaller processing cores that work together. Because single-core serial performance on a GPU is much slower than on a CPU, applications must take advantage of the massive parallelism available in a GPU. The growth of heterogeneous computing has also led developers to discover that different types of workloads perform best on different GPU hardware architectures. Intel VTune Profiler therefore lets us evaluate and analyze the overhead of offloading work onto an Intel GPU. There are three measurements in this feature:
GPU Offload
Explore code execution on various CPU and GPU cores on your platform, estimate how your code benefits from offloading to the GPU, and identify whether your application is CPU or GPU bound.
GPU Compute/Media Hotspot (preview)
Analyze the most time-consuming GPU kernels, characterize GPU utilization based on GPU hardware metrics, identify performance issues caused by memory latency or inefficient kernel algorithms, and analyze GPU instruction frequency per certain instruction types.
CPU/FPGA interaction
Analyze CPU/FPGA interaction issues in the following ways:
1. Focus on the kernels running on the FPGA.
2. Identify the most time-consuming kernels.
3. Look at the corresponding metrics on the device side (like Occupancy or Stalls).
4. Correlate with CPU and platform profiling data.
Parallelism
By evaluating compute-intensive or throughput-oriented high-performance computing (HPC) applications for CPU efficiency, vectorization, and memory allocation, the Parallelism feature lets users check how efficient their threaded code is and identify the threading issues that affect performance. The terms explained below are the most common statistics. In an advanced version, algorithm-specific analysis may be available (see Method for OpenMP Code Analysis and Schedule Overhead in Intel® oneAPI Threading Building Blocks Applications).
Main Analysis Features | Threading, HPC Performance Characterization |
Suggested Intel Compiler Version | Intel Composer XE 2013 Update 2 or higher (for CPU utilization analysis) |
Parallelism Pattern | OpenMP, OpenMP-MPI, TBB |
Total Thread Count: This section indicates the number of threads used when running the application. The term Thread Oversubscription refers to time spent in code where the number of simultaneously working threads exceeds the number of logical cores available on the system.
Wait Time with Poor CPU Utilization: This value is the accumulated wait time of each thread where APIs block or cause synchronization. Because it is summed across threads, this value can be higher than the application's Elapsed Time.
Top Waiting Objects: This section provides a table listing the objects that spent the most time waiting in the application. Reasons for waiting could be function calls or synchronization. The higher the wait time, the greater the reduction in parallelism.
Spin and Overhead Time
Spin time is wait time during which the CPU is busy. This often happens when a synchronization API causes the CPU to poll while the software thread is waiting. Overhead time is CPU time spent on the overhead of known synchronization and threading libraries, such as system synchronization APIs, Intel TBB, and OpenMP. This section lists the top functions in the application with the most spin and overhead time.
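The difference between spinning and blocking can be sketched with standard C++ threads. This is a hedged illustration, not code taken from the profiled application: `wait_by_spinning` polls a flag and therefore keeps a core busy while it waits, whereas `wait_by_blocking` sleeps on a condition variable until it is notified. Both function names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Spin wait: the waiting thread polls the flag, burning CPU time while it waits.
// A profiler would be expected to attribute this polling to spin time.
int wait_by_spinning() {
    std::atomic<bool> ready{false};
    int value = 0;
    std::thread producer([&] {
        value = 42;
        ready.store(true, std::memory_order_release);
    });
    while (!ready.load(std::memory_order_acquire)) {
        // busy polling: the core stays occupied but does no useful work
    }
    producer.join();
    assert(value == 42);  // the release/acquire pair makes the write visible
    return value;
}

// Blocking wait: the waiting thread sleeps until notified, leaving the core free.
int wait_by_blocking() {
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;
    int value = 0;
    std::thread producer([&] {
        {
            std::lock_guard<std::mutex> lock(m);
            value = 42;
            ready = true;
        }
        cv.notify_one();
    });
    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return ready; });  // predicate guards against spurious wakeups
    }
    producer.join();
    return value;
}
```

Both functions return 42. The trade-off is that spinning reacts quickly but wastes CPU time for as long as the wait lasts, while blocking frees the core at the cost of wake-up latency.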
The Bottom-up Tab
The Bottom-up tab enables us to investigate concurrency problems in the application and the performance of each thread over time. The timeline view is in the lower half of the window in the figure below, where CPU time is shown in brown. Not until about 12 seconds in was the master thread split into 8 threads. The first five were off-loaded, while the last three (TID: 14500, 16268, 28576) were waiting (shown in light green), and the last two even waited all the way to the end, which weakened parallelism. When brown bands (CPU time) appear concurrently across multiple threads, it indicates a high level of parallelism.
Platform and I/O
The VTune Profiler can show developers how efficiently Intel Xeon processors are being utilized by analyzing the I/O traffic handled by Data Direct I/O (DDIO), a hardware feature built into the processors. This functionality is always available and always on.
Essentially, when a Network Interface Controller (NIC) is fully utilized and a new packet comes in, the packet has to pass through a chain of components before a core can process it. If any component in the chain takes longer than expected, we get packet loss. This is the main bottleneck of the traditional Direct Memory Access (DMA) approach.
Intel's solution to this problem is its DDIO technology for Xeon processors. It allows PCIe devices to read from and write to the L3 cache directly. This gets incoming data packets as close to the cores as possible. When properly utilized, device interactions can be served solely by the L3 cache.
Advantages:
• Can remove the need to access Dynamic Random Access Memory (DRAM) for device I/O
• Low inbound read and write latencies that allow for high throughput
• Reduced DRAM bandwidth and power consumption
Depending on the implementation, code performance can still be non-optimal. The areas that can be tuned are the topology configuration and L3 cache management.
Multi-Node
The VTune Profiler helps analyze large-scale Message Passing Interface (MPI) and OpenMP workloads. It can help identify scalability issues, highlight threading implementation issues, and identify imbalances and communication issues in MPI applications. It provides in-depth analysis and recommendations to the user. This functionality extends to High Performance Computing (HPC).
By default, the profiler takes a snapshot of the whole application, although it can also focus on a particular area within an application. It provides a general program overview while highlighting specific problematic areas, which can then be analyzed further to improve performance.
VTune Profiler in Practice
The following is the code we used to test the features of the VTune Profiler.
The code is produced by Microsoft and is intended to demonstrate how to convert a basic loop that uses OpenMP to use a Concurrency Runtime algorithm.
The purpose of the code itself is to compute the number of prime numbers found in an array of randomly generated numbers.
// concrt-omp-count-primes.cpp
// compile with: /EHsc /openmp
#include <ppl.h>
#include <random>
#include <array>
#include <iostream>

using namespace concurrency;
using namespace std;

// Determines whether the input value is prime.
bool is_prime(int n)
{
    if (n < 2)
        return false;
    for (int i = 2; i < n; ++i)
    {
        if ((n % i) == 0)
            return false;
    }
    return true;
}

// Uses OpenMP to compute the count of prime numbers in an array.
void omp_count_primes(int* a, size_t size)
{
    if (size == 0)
        return;

    size_t count = 0;
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(size); ++i)
    {
        if (is_prime(a[i]))
        {
            #pragma omp atomic
            ++count;
        }
    }
    wcout << L"found " << count << L" prime numbers." << endl;
}

// Uses the Concurrency Runtime to compute the count of prime numbers in an array.
void concrt_count_primes(int* a, size_t size)
{
    if (size == 0)
        return;

    combinable<size_t> counts;
    parallel_for<size_t>(0, size, [&](size_t i)
    {
        if (is_prime(a[i]))
        {
            counts.local()++;
        }
    });
    wcout << L"found " << counts.combine(plus<size_t>()) << L" prime numbers." << endl;
}

int wmain()
{
    // The length of the array.
    const size_t size = 1000000;

    // Create an array and initialize it with random values.
    int* a = new int[size];
    mt19937 gen(42);
    for (size_t i = 0; i < size; ++i)
    {
        a[i] = gen();
    }

    // Count prime numbers by using OpenMP and the Concurrency Runtime.
    wcout << L"Using OpenMP..." << endl;
    omp_count_primes(a, size);

    wcout << L"Using the Concurrency Runtime..." << endl;
    concrt_count_primes(a, size);

    delete[] a;
}
The output of the code is fairly simple and only relays back the number of prime numbers found using the OpenMP and Concurrency Runtime methods and nothing else.
Using OpenMP...
found 107254 prime numbers.
Using the Concurrency Runtime...
found 107254 prime numbers.
Running the VTune Profiler on the above code produces the results below.
Here we have the hotspots in our code. Since it is a relatively simple application, only one main function makes up the majority of CPU time. If we ran the VTune Profiler on a more complex application, we would see other functions and more interesting results overall.
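Once a single function dominates the profile like this, it is the natural place to look for savings. As a hedged sketch of what acting on such a hotspot result could look like, the code below compares a naive trial-division check with a variant that stops at the square root of n. We did not profile this variant; `is_prime_naive` and `is_prime_bounded` are hypothetical names used for illustration.

```cpp
#include <cassert>

// Naive check: trial division by every i in [2, n), as in the sample above.
bool is_prime_naive(int n) {
    if (n < 2) return false;
    for (int i = 2; i < n; ++i) {
        if (n % i == 0) return false;
    }
    return true;
}

// Bounded check: a composite n always has a divisor no larger than sqrt(n),
// so the loop can stop once i * i exceeds n. The counter is a long long so
// that i * i cannot overflow for n close to the int maximum.
bool is_prime_bounded(int n) {
    if (n < 2) return false;
    for (long long i = 2; i * i <= n; ++i) {
        if (n % i == 0) return false;
    }
    return true;
}
```

Both functions agree on every input, but for a large prime the bounded loop runs roughly sqrt(n) iterations instead of n. Re-running the Hotspots analysis after such a change is how the improvement would be confirmed.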
This is our flame graph. Again, since we have a simple application that ran for only 10 seconds, there is little to see. What we can see is three chunks of CPU usage over the lifetime of the application. The first chunk appears to be the initialization of the code and the functions that start it running. The second chunk shows the Concurrency Runtime algorithm executing the is_prime function; similarly, the final chunk shows the OpenMP version executing is_prime.