GPU621/MKL

12,566 bytes added, 15:22, 30 November 2022
This guide uses the online installer provided by Intel. Offline installation and installation through package managers, such as the NuGet Package Manager in Visual Studio, are also available.
'''1. [https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onemkl.html Download MKL]'''

[[File:1_download.png|800px]]

'''2. Open installer'''

[[File:2_run_installer.png]]

'''3. Follow installer instructions'''

[[File:3_follow_instruction.png|800px]]

[[File:4_follow_instruction.png|800px]]

[[File:5_follow_instruction.png|800px]]

[[File:6_wait_for_installation.png|800px]]

[[File:7_finish_installation.png|800px]]

'''4. Access project properties in Visual Studio'''

[[File:8_access_project_properties.png]]

'''5. Enable usage of MKL'''

[[File:9_enable_mkl.png]]

'''6. Include MKL header file "mkl.h"'''

[[File:10_include_mkl_header.png]]

Installing and compiling on Linux or macOS may require additional steps, such as explicitly linking against the MKL libraries.
 
 
 
== Effectiveness of MKL in Matrix Multiplication ==
 
To determine the effectiveness of the Math Kernel Library, the same matrix multiplication is timed twice: once as an unoptimized computation and once using MKL. The source code for both calculations is presented below.
 
Matrix Multiplication without MKL Optimization (nested loops)
 
 printf (" Making the first run of matrix product using triple nested loop\n"
         " to get stable run time measurements \n\n");
 for (i = 0; i < m; i++) {
     for (j = 0; j < n; j++) {
         sum = 0.0;
         for (k = 0; k < p; k++)
             sum += A[p*i+k] * B[n*k+j];
         C[n*i+j] = sum;
     }
 }
 printf (" Measuring performance of matrix product using triple nested loop \n\n");
 s_initial = dsecnd();
 for (r = 0; r < LOOP_COUNT; r++) {
     for (i = 0; i < m; i++) {
         for (j = 0; j < n; j++) {
             sum = 0.0;
             for (k = 0; k < p; k++)
                 sum += A[p*i+k] * B[n*k+j];
             C[n*i+j] = sum;
         }
     }
 }
 
Result of Unoptimized Matrix Multiplication (using nested loops)
 
[[File:11_no_mkl_computation.png]]
 
 
Matrix Multiplication with MKL Optimization (cblas_dgemm())
 
 printf(" Making the first run of matrix product using Intel(R) MKL dgemm function \n"
        " via CBLAS interface to get stable run time measurements \n\n");
 cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
             m, n, p, alpha, A, p, B, n, beta, C, n);
 printf(" Measuring performance of matrix product using Intel(R) MKL dgemm function \n"
        " via CBLAS interface \n\n");
 s_initial = dsecnd();
 for (r = 0; r < LOOP_COUNT; r++) {
     cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                 m, n, p, alpha, A, p, B, n, beta, C, n);
 }
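
The argument list of cblas_dgemm() can be hard to read at first. The following is a minimal sketch in plain C of what the call above computes for row-major, non-transposed matrices. The function name ref_dgemm_rowmajor is hypothetical and is not part of MKL; it only illustrates the meaning of the dimensions, the scaling factors and the row strides (the "leading dimensions" passed as p, n and n above), not how MKL implements the operation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference helper, not an MKL function.
 * Computes C = alpha * A * B + beta * C for row-major operands:
 *   A is m x k with row stride lda,
 *   B is k x n with row stride ldb,
 *   C is m x n with row stride ldc. */
void ref_dgemm_rowmajor(int m, int n, int k,
                        double alpha, const double *A, int lda,
                        const double *B, int ldb,
                        double beta, double *C, int ldc)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    assert(lda >= k && ldb >= n && ldc >= n);

    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int l = 0; l < k; l++)
                sum += A[(size_t)i * lda + l] * B[(size_t)l * ldb + j];

            /* When beta is zero the old contents of C are ignored. */
            if (beta == 0.0)
                C[(size_t)i * ldc + j] = alpha * sum;
            else
                C[(size_t)i * ldc + j] = alpha * sum + beta * C[(size_t)i * ldc + j];
        }
    }
}
```

With the values used on this page, the equivalent call would be ref_dgemm_rowmajor(m, n, p, alpha, A, p, B, n, beta, C, n), which matches the triple nested loop when alpha is 1.0 and beta is 0.0.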
 
 
Result of Optimized Matrix Multiplication (using cblas_dgemm())
 
[[File:12_mkl_computation.png]]
 
 
In this case, the effect of using MKL is clearly visible as a 167x increase in computation speed between the unoptimized and optimized calculations. While the proportional gain may not always be this drastic, MKL improves performance at both smaller and larger problem sizes.
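
To compare two measured run times, the speedup is simply the ratio of the two averages printed by the programs. The sketch below shows that calculation, plus a throughput estimate. The helper names speedup and gemm_gflops are hypothetical, and the throughput figure assumes the usual count of one multiplication and one addition per inner-loop step, that is 2*m*n*p floating-point operations per product.

```c
#include <assert.h>

/* Hypothetical helpers for interpreting the timings printed by the examples. */

/* Ratio of the baseline run time to the optimized run time. */
double speedup(double baseline_seconds, double optimized_seconds)
{
    assert(baseline_seconds > 0.0 && optimized_seconds > 0.0);
    return baseline_seconds / optimized_seconds;
}

/* Throughput in GFLOP/s for an (m x p) * (p x n) product that took
 * the given number of seconds, assuming 2*m*n*p operations. */
double gemm_gflops(int m, int n, int p, double seconds)
{
    assert(m > 0 && n > 0 && p > 0 && seconds > 0.0);
    double flops = 2.0 * (double)m * (double)n * (double)p;
    return flops / seconds / 1e9;
}
```

For the matrix sizes used here (m = 2000, p = 200, n = 1000), one product amounts to 8e8 operations under that assumption.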
 
 
== Matrix Multiplication Comparison Source Code ==
 
Full Source Code for Unoptimized Calculation
 
 #define min(x,y) (((x) < (y)) ? (x) : (y))
 #include <stdio.h>
 #include <stdlib.h>
 #include "mkl.h"
 /* Consider adjusting LOOP_COUNT based on the performance of your computer */
 /* to make sure that total run time is at least 1 second */
 #define LOOP_COUNT 10
 int main()
 {
     double *A, *B, *C;
     int m, n, p, i, j, k, r;
     double alpha, beta;
     double sum;
     double s_initial, s_elapsed;
     printf ("\n This example measures performance of computing the real matrix product \n"
             " C=alpha*A*B+beta*C using a triple nested loop, where A, B, and C are \n"
             " matrices and alpha and beta are double precision scalars \n\n");
     m = 2000, p = 200, n = 1000;
     printf (" Initializing data for matrix multiplication C=A*B for matrix \n"
             " A(%ix%i) and matrix B(%ix%i)\n\n", m, p, p, n);
     alpha = 1.0; beta = 0.0;
     printf (" Allocating memory for matrices aligned on 64-byte boundary for better \n"
             " performance \n\n");
     A = (double *)mkl_malloc( m*p*sizeof( double ), 64 );
     B = (double *)mkl_malloc( p*n*sizeof( double ), 64 );
     C = (double *)mkl_malloc( m*n*sizeof( double ), 64 );
     if (A == NULL || B == NULL || C == NULL) {
         printf( "\n ERROR: Can't allocate memory for matrices. Aborting... \n\n");
         mkl_free(A);
         mkl_free(B);
         mkl_free(C);
         return 1;
     }
     printf (" Initializing matrix data \n\n");
     for (i = 0; i < (m*p); i++) {
         A[i] = (double)(i+1);
     }
     for (i = 0; i < (p*n); i++) {
         B[i] = (double)(-i-1);
     }
     for (i = 0; i < (m*n); i++) {
         C[i] = 0.0;
     }
     /* Warm-up run, excluded from the timing */
     printf (" Making the first run of matrix product using triple nested loop\n"
             " to get stable run time measurements \n\n");
     for (i = 0; i < m; i++) {
         for (j = 0; j < n; j++) {
             sum = 0.0;
             for (k = 0; k < p; k++)
                 sum += A[p*i+k] * B[n*k+j];
             C[n*i+j] = sum;
         }
     }
     /* Timed runs: the product is repeated LOOP_COUNT times and averaged */
     printf (" Measuring performance of matrix product using triple nested loop \n\n");
     s_initial = dsecnd();
     for (r = 0; r < LOOP_COUNT; r++) {
         for (i = 0; i < m; i++) {
             for (j = 0; j < n; j++) {
                 sum = 0.0;
                 for (k = 0; k < p; k++)
                     sum += A[p*i+k] * B[n*k+j];
                 C[n*i+j] = sum;
             }
         }
     }
     s_elapsed = (dsecnd() - s_initial) / LOOP_COUNT;
     printf (" == Matrix multiplication using triple nested loop completed == \n"
             " == at %.5f milliseconds == \n\n", (s_elapsed * 1000));
     printf (" Deallocating memory \n\n");
     mkl_free(A);
     mkl_free(B);
     mkl_free(C);
     /* Suggest a larger LOOP_COUNT if the total run time was under about 1 second */
     if (s_elapsed < 0.9/LOOP_COUNT) {
         s_elapsed=1.0/LOOP_COUNT/s_elapsed;
         i=(int)(s_elapsed*LOOP_COUNT)+1;
         printf(" It is highly recommended to define LOOP_COUNT for this example on your \n"
                " computer as %i to have total execution time about 1 second for reliability \n"
                " of measurements\n\n", i);
     }
     printf (" Example completed. \n\n");
     return 0;
 }
 
 
Full Source Code for Optimized Calculations
 
 #include <stdio.h>
 #include <stdlib.h>
 #include "mkl.h"
 /* Consider adjusting LOOP_COUNT based on the performance of your computer */
 /* to make sure that total run time is at least 1 second */
 #define LOOP_COUNT 10
 int main()
 {
     double *A, *B, *C;
     int m, n, p, i, r;
     double alpha, beta;
     double s_initial, s_elapsed;
     printf("\n This example measures performance of Intel(R) MKL function dgemm \n"
            " computing real matrix C=alpha*A*B+beta*C, where A, B, and C \n"
            " are matrices and alpha and beta are double precision scalars\n\n");
     m = 2000, p = 200, n = 1000;
     printf(" Initializing data for matrix multiplication C=A*B for matrix \n"
            " A(%ix%i) and matrix B(%ix%i)\n\n", m, p, p, n);
     alpha = 1.0; beta = 0.0;
     printf(" Allocating memory for matrices aligned on 64-byte boundary for better \n"
            " performance \n\n");
     A = (double*)mkl_malloc(m * p * sizeof(double), 64);
     B = (double*)mkl_malloc(p * n * sizeof(double), 64);
     C = (double*)mkl_malloc(m * n * sizeof(double), 64);
     if (A == NULL || B == NULL || C == NULL) {
         printf("\n ERROR: Can't allocate memory for matrices. Aborting... \n\n");
         mkl_free(A);
         mkl_free(B);
         mkl_free(C);
         return 1;
     }
     printf(" Initializing matrix data \n\n");
     for (i = 0; i < (m * p); i++) {
         A[i] = (double)(i + 1);
     }
     for (i = 0; i < (p * n); i++) {
         B[i] = (double)(-i - 1);
     }
     for (i = 0; i < (m * n); i++) {
         C[i] = 0.0;
     }
     /* Warm-up call, excluded from the timing */
     printf(" Making the first run of matrix product using Intel(R) MKL dgemm function \n"
            " via CBLAS interface to get stable run time measurements \n\n");
     cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                 m, n, p, alpha, A, p, B, n, beta, C, n);
     /* Timed calls: the product is repeated LOOP_COUNT times and averaged */
     printf(" Measuring performance of matrix product using Intel(R) MKL dgemm function \n"
            " via CBLAS interface \n\n");
     s_initial = dsecnd();
     for (r = 0; r < LOOP_COUNT; r++) {
         cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                     m, n, p, alpha, A, p, B, n, beta, C, n);
     }
     s_elapsed = (dsecnd() - s_initial) / LOOP_COUNT;
     printf(" == Matrix multiplication using Intel(R) MKL dgemm completed == \n"
            " == at %.5f milliseconds == \n\n", (s_elapsed * 1000));
     printf(" Deallocating memory \n\n");
     mkl_free(A);
     mkl_free(B);
     mkl_free(C);
     /* Suggest a larger LOOP_COUNT if the total run time was under about 1 second */
     if (s_elapsed < 0.9 / LOOP_COUNT) {
         s_elapsed = 1.0 / LOOP_COUNT / s_elapsed;
         i = (int)(s_elapsed * LOOP_COUNT) + 1;
         printf(" It is highly recommended to define LOOP_COUNT for this example on your \n"
                " computer as %i to have total execution time about 1 second for reliability \n"
                " of measurements\n\n", i);
     }
     printf(" Example completed. \n\n");
     return 0;
 }
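
Both programs end with the same check that suggests a larger LOOP_COUNT when the measurement was too short. The sketch below isolates that arithmetic in a standalone function so it is easier to follow. The name recommended_loop_count and its parameters are hypothetical; the logic mirrors the final if block of the listings above, where s_elapsed is the average time of one multiplication.

```c
#include <assert.h>

/* Hypothetical helper mirroring the check at the end of both programs.
 *   avg_seconds: average time of one multiplication (s_elapsed)
 *   loop_count:  the LOOP_COUNT used for the measurement
 * Returns loop_count unchanged when the total run took at least about
 * 0.9 seconds, otherwise the larger count suggested by the programs. */
int recommended_loop_count(double avg_seconds, int loop_count)
{
    assert(avg_seconds > 0.0 && loop_count > 0);

    /* avg_seconds * loop_count is the total measured time. */
    if (avg_seconds >= 0.9 / loop_count)
        return loop_count;

    /* Factor by which the total time falls short of 1 second. */
    double scale = 1.0 / loop_count / avg_seconds;

    /* Scale the count up and round upwards, as the programs do. */
    return (int)(scale * loop_count) + 1;
}
```

In effect, the suggested value is roughly one divided by the average time per run, so that the whole timed loop lasts about one second.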
 
 
== How MKL Improves Efficiency ==
 
In this example, MKL improves the calculation time through DGEMM, which stands for '''D'''ouble-precision '''GE'''neral '''M'''atrix-'''M'''atrix multiplication. The example code multiplies two matrices and applies the scaling factors alpha and beta. Without MKL, the multiplication is done through nested loops; the MKL-optimized version calls cblas_dgemm() instead. Here dgemm refers to DGEMM as defined above, and cblas refers to CBLAS, the '''C''' interface to the '''B'''asic '''L'''inear '''A'''lgebra '''S'''ubprograms. Level 3 of BLAS is dedicated to matrix-matrix operations, which include the matrix multiplication performed here. While the math and logic behind cblas_dgemm() are fairly complicated, a simplified explanation is that it decomposes one or both of the matrices being multiplied into smaller blocks and takes advantage of cache memory to improve computation speed.
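
The idea of decomposing the matrices to make better use of cache can be illustrated with a blocked (tiled) multiplication in plain C. This is only a sketch of the general technique under simple assumptions: the function name blocked_matmul and the block parameter are hypothetical, and the code is not MKL's actual implementation, which is not shown in the sources used for this page.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration of cache blocking, not MKL code.
 * Accumulates C += A * B for row-major A (m x p), B (p x n), C (m x n).
 * The work is split into block x block tiles so that a tile of B is
 * reused several times while it is still likely to be in cache.
 * The caller must zero C beforehand to obtain C = A * B. */
void blocked_matmul(int m, int n, int p,
                    const double *A, const double *B, double *C,
                    int block)
{
    assert(m >= 0 && n >= 0 && p >= 0 && block > 0);

    for (int ii = 0; ii < m; ii += block) {
        int i_end = (ii + block < m) ? ii + block : m;
        for (int kk = 0; kk < p; kk += block) {
            int k_end = (kk + block < p) ? kk + block : p;
            for (int jj = 0; jj < n; jj += block) {
                int j_end = (jj + block < n) ? jj + block : n;

                /* Multiply one tile of A by one tile of B. */
                for (int i = ii; i < i_end; i++) {
                    for (int k = kk; k < k_end; k++) {
                        double a = A[(size_t)i * p + k];
                        for (int j = jj; j < j_end; j++)
                            C[(size_t)i * n + j] += a * B[(size_t)k * n + j];
                    }
                }
            }
        }
    }
}
```

The result is the same as that of the triple nested loop; only the order in which elements are visited changes. A suitable block size depends on the cache sizes of the machine and would have to be found by measurement.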
 
 
 
== Other Mathematical Functionality ==
 
While the example used to explore the effectiveness of MKL utilizes matrix multiplication, the library provides functionality in a variety of other mathematical categories.
 
[[File:13_mkl_math_functionality.png]]
 
Some notable functionality provided in the Math Kernel Library includes the following.

Basic Linear Algebra Subprograms (BLAS):

- Level 1 BLAS: vector operations

- Level 2 BLAS: matrix-vector operations

- Level 3 BLAS: matrix-matrix operations

Sparse BLAS Level 1, 2, and 3: Basic operations on sparse vectors and matrices

LAPACK: Linear equations, least squares problems, eigenvalues

ScaLAPACK: Computational, driver, and auxiliary routines for solving systems of linear equations across a compute cluster

PBLAS: Distributed vector, matrix-vector, and matrix-matrix operations

General and Cluster Fast Fourier Transform functions: Computation of Discrete Fourier Transforms and FFTs across compute clusters

Solver Routines: Nonlinear least squares through Trust-Region algorithms

Data Fitting: Spline-based approximation functions, derivatives, integrals and cell search
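
To make the three BLAS levels listed above more concrete, the sketch below shows plain C versions of two Level 1 operations and one Level 2 operation (a Level 3 operation is the matrix product used throughout this page). The ref_ names are hypothetical stand-ins written for illustration; the corresponding CBLAS routines are cblas_daxpy, cblas_ddot and cblas_dgemv, whose argument lists differ from these simplified versions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference versions for illustration, not MKL functions. */

/* Level 1 (vector-vector): y = alpha * x + y */
void ref_daxpy(int n, double alpha, const double *x, double *y)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Level 1 (vector-vector): dot product of x and y */
double ref_ddot(int n, const double *x, const double *y)
{
    assert(n >= 0);
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Level 2 (matrix-vector): y = alpha * A * x + beta * y,
 * where A is a row-major m x n matrix. */
void ref_dgemv(int m, int n, double alpha, const double *A,
               const double *x, double beta, double *y)
{
    assert(m >= 0 && n >= 0);
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A[(size_t)i * n + j] * x[j];
        y[i] = alpha * sum + beta * y[i];
    }
}
```

The levels differ in how much arithmetic is done per element loaded from memory: Level 1 works on vectors only, Level 2 combines a matrix with a vector, and Level 3 combines two matrices.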
 
 
 
== Advantages ==
 
- Wide variety of functions
 
- Highly optimized computation algorithms
 
- Wide range of compatibility
 
- Effective use of parallelization in conjunction with hardware
 
 
 
== Disadvantages ==
 
- Requires standalone installation and implementation
 
- Mathematical knowledge can be highly beneficial to efficient usage of library functionality
 
- Poorer compatibility and efficiency on AMD hardware
 
 
 
== Conclusion ==
 
The Intel Math Kernel Library is a powerful library that can greatly enhance a program's ability to perform mathematical computations. With proper knowledge of how to use MKL functionality effectively, the overall efficiency of a program can be drastically improved. MKL has already yielded real-world results in a variety of fields, and will continue to do so as the scope of programs, their data, and the problems they tackle grows.
 
 
 
== References ==
 
https://www.intel.com/content/www/us/en/develop/documentation/get-started-with-mkl-for-dpcpp/top.html
 
https://www.intel.com/content/www/us/en/develop/documentation/onemkl-tutorial-c/top/measuring-performance-onemkl-support-functions.html
 
https://www.usenix.org/legacy/publications/library/proceedings/als00/2000papers/papers/full_papers/thomas/thomas_html/node7.html
 
https://www.sciencedirect.com/topics/computer-science/intel-math-kernel-library