GPU621/Intel oneMKL - Math Kernel Library

10,565 bytes added, 09:34, 2 December 2021


# Syed Muhammad Saad Bukhari

# Lin Xu

==Progress Report==

progress 100%

==Introduction==

Intel Math Kernel Library, now known as '''oneMKL''' (as part of Intel’s oneAPI), is a library of highly optimized and extensively parallelized math routines built to provide maximum performance across a variety of CPUs and accelerators. It ships with the Intel® oneAPI Base Toolkit and is used for building high-performance, scalable parallel code in C++ and Fortran, with OpenMP and MPI, from enterprise to cloud and from HPC to AI applications.

Its functions cover domains such as sparse and dense linear algebra, sparse solvers, fast Fourier transforms, random number generation, and basic statistics. Many of these routines are supported by the DPC++ interface on both CPU and GPU.

===Why is it important?===

* Accelerate performance on Intel® Xeon® & Core™ Processors and Accelerators.

* Deliver fast, scalable, reliable parallel code with less effort; built on industry standards.

[[File:Whatsinsidemkl.png|800px]]

==Important Areas Tackled by MKL==

MKL solves large-scale calculation problems and provides BLAS and LAPACK linear algebra routines, fast Fourier transforms, vector math functions, random number generation functions, and other functions.<br/>

1) BLAS and LAPACK

==MKL Testing==

[[File:Mkltestingimage.png]]

In this project I compare the running time of the serial version with that of the optimized MKL version under multithreading.

The dgemm routine can perform several calculations; here, both versions compute the same matrix product:

'''C = alpha * A * B + beta * C'''

'''m, n, k'''

Integers indicating the size of the matrices:

A: m rows by k columns

B: k rows by n columns

C: m rows by n columns

'''alpha'''

Real value used to scale the product of matrices A and B.

'''A'''

Array used to store matrix A.

'''k'''

Leading dimension of array A, or the number of elements between successive rows (for row major storage) in memory. In the case of this exercise the leading dimension is the same as the number of columns.

'''B'''

Array used to store matrix B.

'''beta'''

Real value used to scale matrix C.

'''C'''

Array used to store matrix C.

'''n'''

Leading dimension of array C, or the number of elements between successive rows (for row major storage) in memory. In the case of this exercise the leading dimension is the same as the number of columns.
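The parameters above can be tied together with a naive serial reference, given here only as a minimal sketch. The function name serial_dgemm and the extra leading-dimension argument for B (ldb) are assumptions made for illustration; this is not MKL's implementation and not necessarily the serial version used in the test.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive serial reference for C = alpha * A * B + beta * C, row-major storage.
// A is m x k (leading dimension lda), B is k x n (ldb), C is m x n (ldc).
// serial_dgemm is a hypothetical helper name, not an MKL routine.
void serial_dgemm(std::size_t m, std::size_t n, std::size_t k,
                  double alpha, const std::vector<double>& A, std::size_t lda,
                  const std::vector<double>& B, std::size_t ldb,
                  double beta, std::vector<double>& C, std::size_t ldc) {
    // For row-major storage each leading dimension must cover a full row.
    assert(lda >= k && ldb >= n && ldc >= n);
    assert(A.size() >= m * lda && B.size() >= k * ldb && C.size() >= m * ldc);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;  // dot product of row i of A and column j of B
            for (std::size_t p = 0; p < k; ++p) {
                sum += A[i * lda + p] * B[p * ldb + j];
            }
            C[i * ldc + j] = alpha * sum + beta * C[i * ldc + j];
        }
    }
}
```

With tightly packed row-major matrices the leading dimensions are lda = k, ldb = n and ldc = n, which matches the description above. The MKL counterpart is cblas_dgemm called with CblasRowMajor and CblasNoTrans for both inputs.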

serial version

<pre>

s_elapsed = (dsecnd() - s_initial) / LOOP_COUNT;

</pre>
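The timing line above averages the elapsed time over LOOP_COUNT runs. The same pattern can be sketched without MKL as follows; average_seconds is a hypothetical helper, and std::chrono is used here as a stand-in for dsecnd().

```cpp
#include <cassert>
#include <chrono>

// Average wall-clock seconds per call of `work` over `loop_count` runs.
// average_seconds is a hypothetical helper; std::chrono stands in for dsecnd().
template <typename Work>
double average_seconds(Work work, int loop_count) {
    assert(loop_count > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < loop_count; ++r) {
        work();  // the computation being measured
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / loop_count;
}
```

Averaging over several runs smooths out one-off costs such as cache warm-up, which is why a single measurement is usually less representative.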


==Output==

{| border="1" cellspacing="0" cellpadding="5" align="center"

! serial

| 7.5

|}

[[File:mklchart.jpeg]]

Here is my computer's number of logical processors.
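The logical-processor count can also be queried from code. The sketch below uses only the C++ standard library; both function names are hypothetical, and the number of hardware threads per core is an assumption that has to be supplied by the caller (2 is typical when hyper-threading is enabled).

```cpp
#include <cassert>
#include <thread>

// Logical processors reported by the standard library (0 if unknown).
unsigned logical_processors() {
    return std::thread::hardware_concurrency();
}

// Estimate of physical cores, ASSUMING a fixed number of hardware threads
// per core (2 with typical hyper-threading). Both names are hypothetical.
unsigned estimated_physical_cores(unsigned logical, unsigned threads_per_core) {
    assert(threads_per_core > 0);
    const unsigned cores = logical / threads_per_core;
    return cores > 0 ? cores : 1;  // never suggest zero threads
}
```

For 6 logical processors and 2 threads per core this gives 3. A value like this could then be passed to mkl_set_num_threads to cap the number of threads MKL uses.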

Performance is best when mkl_get_max_threads equals the number of physical cores rather than the number of logical processors (hardware threads); on this machine that is 3 instead of 6.

For matrix calculations (BLAS), Intel MKL can significantly improve performance and is optimized for multithreading.

==Source Code==

==Intel®-Optimized Math Library for Numerical Computing==

Data Parallel C++ (DPC++) APIs, with OpenMP acceleration, maximize performance and portability across architectures in science, engineering, and finance through enhanced math routines.

===Data Parallel C++===

DPC++ is an open alternative to single-architecture proprietary languages.

===Foundations===

The Vector Add sample demonstrates the following oneAPI concepts using the DPC++ programming language:

* Device selectors targeting different accelerators including GPU and FPGA

* Buffers and accessors

* Queues

* Data parallel kernel “parallel_for”

====A Code Walk-Through for DPC++ Foundations====

This walk-through uses vector_add, a sample written in Data Parallel C++, to demonstrate oneAPI concepts and functionality. The program adds two arrays of integers together using hardware acceleration. It covers:

* DPC++ headers

* Asynchronous exceptions from kernels

* Device selectors for different accelerators

* Buffers and accessors

* Queues

* parallel_for kernel

====DPC++ Headers====

DPC++ is based on familiar and industry-standard C++, plus it incorporates the SYCL* specification 1.2.1 from the Khronos Group* and includes language extensions developed using an open community process. The header file sycl.hpp, as specified in the SYCL specification, is also provided in the Intel® oneAPI DPC++/C++ Compiler. FPGA support is provided by a DPC++ extension through the fpga_extensions.hpp header file.

The vector_add code linked below illustrates the different headers needed when you are supporting different accelerators.

https://github.com/oneapi-src/oneAPI-samples/blob/master/DirectProgramming/DPC%2B%2B/DenseLinearAlgebra/vector-add/src/vector-add-buffers.cpp

====DPC++ Kernel Exceptions====

DPC++ kernels run asynchronously on accelerators in different stack frames. A kernel may raise asynchronous errors that cannot be propagated up the stack. To catch these asynchronous exceptions, the SYCL queue class accepts error handler functions.
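The handler pattern can be illustrated with a host-only analogy in standard C++. This is only a sketch: MiniQueue and ExceptionList are hypothetical names and are not the SYCL queue class or its exception list. The idea shown is that errors are stored as std::exception_ptr values and handed to a user-supplied handler later, instead of unwinding into the code that submitted the work.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

// Host-only analogy of an asynchronous error handler. MiniQueue is
// hypothetical and is NOT the SYCL queue class.
using ExceptionList = std::vector<std::exception_ptr>;

class MiniQueue {
public:
    explicit MiniQueue(std::function<void(const ExceptionList&)> handler)
        : handler_(std::move(handler)) {
        assert(handler_ != nullptr);
    }

    // Run a task; an error is stored instead of unwinding into the caller.
    void submit(const std::function<void()>& task) {
        try {
            task();
        } catch (...) {
            pending_.push_back(std::current_exception());
        }
    }

    // Hand all stored errors to the handler, then clear them.
    void wait_and_throw() {
        if (!pending_.empty()) {
            ExceptionList errors;
            errors.swap(pending_);
            handler_(errors);
        }
    }

private:
    std::function<void(const ExceptionList&)> handler_;
    ExceptionList pending_;
};
```

A handler typically loops over the list, rethrows each entry with std::rethrow_exception inside a try block, and reports or counts the error in the matching catch clause.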

====Selector for Accelerators====

SYCL and oneAPI selectors can discover and provide access to the hardware available in the host environment. The default_selector selects the most performant accelerator, while DPC++ provides additional selector classes for the FPGA accelerator.

====Queue and parallel_for Kernels====

A DPC++ queue encapsulates the context required by kernel execution. A queue can take a specific device selector and an asynchronous exception handler, as in vector_add.

Kernel execution supports three types of kernels: single-task, basic data-parallel, and hierarchical parallel. vector_add uses the basic data-parallel kernel, parallel_for.

The kernel body, captured in a lambda function, adds two arrays element by element: sum[i] = a[i] + b[i];

* The range of data the kernel processes is specified by the first parameter of h.parallel_for, num_items; here it is a 1-D range of size num_items. The two read-only inputs, a_array and b_array, are transferred to the accelerator by the runtime. After the kernel completes, the result in the sum_buf buffer is copied back to the host when sum_buf goes out of scope.
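The per-index kernel body can be sketched on the host without SYCL. In the sketch below, parallel_for_1d is a hypothetical serial stand-in that simply calls the kernel once per index; it is not the SYCL parallel_for and does no offloading, but it shows how the lambda is written against a single index i.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial stand-in for a 1-D data-parallel launch: calls kernel(i) for every
// index in [0, num_items). parallel_for_1d is hypothetical, not a SYCL API.
template <typename Kernel>
void parallel_for_1d(std::size_t num_items, Kernel kernel) {
    for (std::size_t i = 0; i < num_items; ++i) {
        kernel(i);
    }
}

// Element-wise vector addition expressed as a per-index kernel.
std::vector<int> vector_add(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    std::vector<int> sum(a.size());
    parallel_for_1d(a.size(), [&](std::size_t i) { sum[i] = a[i] + b[i]; });
    return sum;
}
```

Because each index writes only its own element of sum, the iterations are independent, which is what allows a real data-parallel runtime to execute them in any order or concurrently.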

oneAPI programs are built on device selectors, buffers, accessors, queues and kernels. DPC++ incorporates SYCL and community extensions to simplify data parallel programming. DPC++ allows code reuse across hardware targets, and enables high productivity and performance across CPU, GPU, and FPGA architectures, while permitting accelerator-specific tuning.

===Unified Shared Memory===

The Mandelbrot Set sample is a program that demonstrates the following oneAPI concepts and functionality using the DPC++ programming language:

* Unified shared memory

* Managing and accessing memory

* Parallel implementation

====A Code Walk-Through for DPC++ Using Unified Shared Memory====

Unified Shared Memory (USM) is an alternative to buffers for managing and accessing memory from the host and device. It offers three distinct allocation types: host memory, device memory, and shared memory that is accessible from both the host and the device. The program calculates whether each point in a two-dimensional complex plane belongs to the set, using parallel computing patterns and DPC++. This walk-through uses the Mandelbrot sample as a test case to show how familiar C/C++ patterns can be used to manage data within host and device memory.

https://github.com/oneapi-src/oneAPI-samples/blob/master/DirectProgramming/DPC%2B%2B/DenseLinearAlgebra/vector-add/src/vector-add-usm.cpp

https://github.com/oneapi-src/oneAPI-samples/blob/master/DirectProgramming/DPC%2B%2B/CombinationalLogic/mandelbrot/src/mandel.hpp

===Driver Functions: main.cpp===

The driver file, main.cpp, contains the infrastructure to execute and evaluate the computation of the Mandelbrot set.

====Queue Creation====

The queue is created in main using the default selector, which first attempts to launch the kernel code on the GPU and falls back to the host/CPU if no compatible device is found. It uses the dpc_common exception handler, which allows asynchronous exception handling of your kernel code.

====ShowDevice()====

The ShowDevice() function displays information about the chosen device.

====Execute()====

The Execute() function initializes the MandelParallelUsm object, uses it to evaluate the Mandelbrot set, and outputs the results.

===Mandelbrot USM Usage===

====MandelParameters Class====

The MandelParameters struct contains all the necessary functionality to calculate the Mandelbrot set.

====Datatype: ComplexF====

MandelParameters defines a datatype, ComplexF, which represents a complex floating-point number.

typedef std::complex<float> ComplexF;

====Point()====

The Point() function takes a complex point, c, as an argument and determines whether or not it belongs to the Mandelbrot set. The function counts how many iterations (up to an arbitrary max_iterations) the value z remains bounded under the recurrence z<sub>n+1</sub> = z<sub>n</sub><sup>2</sup> + c, where z<sub>0</sub> = 0, and returns that number of iterations.
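The iteration count described above can be sketched in standard C++ as follows. The free function point is a hypothetical stand-in for the Point() member function, and the escape radius of 2 used as the boundedness test is an assumption, not something taken from the sample.

```cpp
#include <cassert>
#include <complex>

typedef std::complex<float> ComplexF;

// Counts iterations of z = z*z + c (starting from z = 0) while |z| stays
// below 2, up to max_iterations. point is a hypothetical free-function
// version of Point(); the escape radius of 2 is an assumption.
int point(const ComplexF& c, int max_iterations) {
    assert(max_iterations >= 0);
    ComplexF z(0.0f, 0.0f);
    int count = 0;
    // std::norm returns |z|^2, so compare against 2^2 = 4.
    while (count < max_iterations && std::norm(z) < 4.0f) {
        z = z * z + c;
        ++count;
    }
    return count;
}
```

A point that reaches max_iterations without escaping is treated as belonging to the set, while a small return value means the sequence diverged quickly.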

====ScaleRow()/ScaleCol()====

The scale functions convert row and column array indices to their corresponding coordinates within the complex plane. Their use is shown in the MandelParallelUsm Class section below.
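An index-to-coordinate conversion of this kind is a linear map. The sketch below shows one way to write it; scale_index and its lo/hi bounds are hypothetical, and the actual bounds used by ScaleRow()/ScaleCol() are not reproduced here.

```cpp
#include <cassert>

// Linear map from an array index in [0, count) to a coordinate in [lo, hi).
// scale_index and the bounds passed to it are hypothetical.
float scale_index(int index, int count, float lo, float hi) {
    assert(count > 0);
    return lo + static_cast<float>(index) * ((hi - lo) / static_cast<float>(count));
}
```

For example, with count = 4 and the interval from -2 to 2, indices 0, 1, 2 and 3 map to -2, -1, 0 and 1. A row version would pass the row count and the imaginary-axis bounds, and a column version the column count and the real-axis bounds, or the other way round depending on the layout chosen.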

====Mandel Class====

The Mandel class is the parent class of MandelParallelUsm. It contains member functions for outputting the data visualization, addressed in the Other Functions section below.

'''Member Variables'''

* MandelParameters p_: A MandelParameters object

* int *data_: A pointer to the memory for storing the output data

===MandelParallelUsm Class===

This class is derived from the Mandel class, and handles all the device code for offloading the Mandelbrot calculation using USM.

====Device Initialization: Constructor====

The MandelParallelUsm constructor first calls the Mandel constructor, which assigns the values of the arguments to their corresponding member variables. It passes the address of the queue object to the member variable, q, so that it can later be used to launch the device code. Finally, it calls the Alloc() virtual member function.

====USM Initialization: Alloc()====

The Alloc() virtual member function is overridden in the MandelParallelUsm class to enable USM. It calls malloc_shared(), which allocates a block of memory shared across the host and device and returns its address.

====Launching the Kernel: Evaluate()====

The Evaluate() member function launches the kernel code and calculates the Mandelbrot set.

Inside parallel_for(), the work item id (index) is mapped to row and column coordinates, which are used to construct a point in the complex plane using the ScaleRow()/ScaleCol() functions. The MandelParameters Point() function is called to determine if the complex point belongs to the Mandelbrot set, with its result written to the corresponding location in shared memory.
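The mapping from a flat work-item index to row and column coordinates can be sketched as below. A row-major layout is assumed, and row_col and flat_index are hypothetical helper names; the sample may order rows and columns differently.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Map a flat work-item index to (row, column), ASSUMING row-major layout
// with `cols` entries per row. row_col is a hypothetical helper.
std::pair<std::size_t, std::size_t> row_col(std::size_t index, std::size_t cols) {
    assert(cols > 0);
    return {index / cols, index % cols};
}

// Inverse mapping: where the result for (row, col) lives in the flat buffer.
std::size_t flat_index(std::size_t row, std::size_t col, std::size_t cols) {
    assert(col < cols);
    return row * cols + col;
}
```

Each work item can then scale its row and column to a complex coordinate, compute the iteration count, and store it at its own flat index, so no two work items write to the same location.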

====Freeing Shared Memory: Destructor====

The destructor frees the shared memory by calling the Free() member function, ensuring no memory leaks in the program.
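The allocate-in-constructor, free-in-destructor pairing can be sketched in standard C++. In this sketch std::malloc and std::free are stand-ins for the USM allocation and free calls, and OutputBuffer is a hypothetical class, not part of the sample.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Owns a buffer of ints: allocated in the constructor, released in the
// destructor. std::malloc/std::free are STAND-INS for the USM allocation
// and free calls; OutputBuffer is a hypothetical class.
class OutputBuffer {
public:
    explicit OutputBuffer(std::size_t count)
        : count_(count),
          data_(static_cast<int*>(std::malloc(count * sizeof(int)))) {
        assert(count == 0 || data_ != nullptr);
    }
    ~OutputBuffer() { std::free(data_); }

    // Non-copyable so the buffer is freed exactly once.
    OutputBuffer(const OutputBuffer&) = delete;
    OutputBuffer& operator=(const OutputBuffer&) = delete;

    int* data() { return data_; }
    std::size_t size() const { return count_; }

private:
    std::size_t count_;
    int* data_;
};
```

Tying the release to the destructor means the memory is returned on every exit path, including early returns, without the caller having to remember a matching free call.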

===Other Functions===

====Producing a Basic Visualization of the Mandelbrot Set====

The Mandel class also contains member functions for data visualization. WriteImage() generates a PNG image representation of the data, where each pixel represents a point on the complex plane, and its luminosity represents the iteration depth calculated by Point().

====Example Image of Data Output====

The Mandel class’s Print() member function writes a similar visualization to stdout.

[[File:mandelbot.png]]

==References==

https://www.intel.com/content/www/us/en/developer/articles/technical/a-simple-example-to-measure-the-performance-of-an-intel-mkl-function.html

https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html#gs.hfplbb

https://www.youtube.com/watch?v=pzVaJgdN9Fw

==Presentation PDF File==

https://wiki.cdot.senecacollege.ca/wiki/File:Intel_mkl_syed_-_menglin_-_lin.pdf

Direct link: https://wiki.cdot.senecacollege.ca/w/imgs/Intel_mkl_syed_-_menglin_-_lin.pdf
