GPU621/Intel oneMKL - Math Kernel Library

From CDOT Wiki
Revision as of 16:07, 1 December 2021 by Menglinwu (talk | contribs) (Main Fountain)

Intel® oneAPI Math Kernel Library

Group Members

  1. Menglin Wu
  2. Syed Muhammad Saad Bukhari
  3. Lin Xu

Introduction

Intel Math Kernel Library, now known as oneMKL (part of Intel's oneAPI), is a library of highly optimized and extensively parallelized math routines built to deliver high performance across a variety of CPUs and accelerators.

It covers domains such as sparse and dense linear algebra, sparse solvers, fast Fourier transforms, random number generation, and basic statistics. Many of these routines are also available through the DPC++ interface on both CPU and GPU.

Progress Report

progress 100%

Main Functions

oneMKL targets large-scale computational problems. It provides BLAS and LAPACK linear algebra routines, fast Fourier transforms, vector math functions, random number generators, and other functions.

1) BLAS and LAPACK

The highly optimized BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra Package) routines provide significant performance improvements on Intel processors.


2) ScaLAPACK

ScaLAPACK is a parallel computing package for MIMD parallel machines with distributed memory. It provides linear algebra solvers that are efficient, portable, scalable, and reliable, and that serve as building blocks for parallel applications based on linear algebra operations.

The Intel® MKL implementation of ScaLAPACK can provide significant performance improvements far beyond what a standard NETLIB implementation can achieve.


3) PARDISO sparse matrix solver

Use the PARDISO direct sparse solver to solve large sparse systems of linear equations. The solver is licensed from the University of Basel. It is an easy-to-use, thread-safe, high-performance, memory-efficient software library. Intel® MKL also includes a conjugate gradient solver and an FGMRES iterative sparse solver.


4) Fast Fourier Transform (FFT)

Multi-dimensional FFT routines (from 1 to 7 dimensions) are available through an easy-to-use C/Fortran interface. Intel® MKL also supports distributed-memory clusters through the same API, so workloads can be spread across a large number of processors for substantial performance gains. In addition, Intel® MKL provides a set of C wrapper routines that emulate the FFTW 2.x and 3.0 interfaces, so existing FFTW users can integrate Intel® MKL into their applications.


5) Vector Math Library (VML)

The Vector Math Library uses vector implementations of computationally intensive core mathematical functions (power functions, trigonometric functions, exponential functions, hyperbolic functions, logarithmic functions, etc.) to significantly increase the application speed.


6) Vector Statistics Library-Random Number Generator (VSL)

Use the random number generators of the Vector Statistical Library (VSL) to accelerate simulations, with performance far higher than that of scalar random number generators.

Setting up MKL

  1. Download the MKL library from the official Intel website: https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html
  2. In Visual Studio, set the Additional Include Directories and Additional Library Directories for the project. Make sure the correct configuration and platform are selected before changing them.
  3. In the Intel Math Kernel Library property page, set the "Use Intel MKL" option to Sequential.
  4. Set the Additional Dependencies using the output of the Link Line Advisor: https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html

MKL Testing

In this project I want to compare the running time of a serial implementation with that of the optimized MKL implementation under multithreading.
The dgemm routine computes a matrix product of the form C = alpha*A*B + beta*C, so the same multiplication is computed here in two ways.

Serial version

clock_t startTime = clock();
    for (i = 0; i < m; i++) {
        for (j = 0; j < n; j++) {
            sum = 0.0;
            for (k = 0; k < p; k++)
                sum += A[p * i + k] * B[n * k + j];
            C[n * i + j] = sum;
        }
    }
    clock_t endTime = clock();
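
The two clock() readings above still have to be turned into a time. A minimal sketch of that conversion follows; the helper name elapsed_ms is an assumption made for illustration and does not appear in the original programs. Note that clock() counts in units of 1/CLOCKS_PER_SEC seconds, and that on POSIX systems it reports processor time rather than wall-clock time.

```c
#include <time.h>

/* Hypothetical helper: convert two clock() readings to elapsed milliseconds.
   The tick difference is cast to double before scaling so that no
   precision is lost to integer division. */
double elapsed_ms(clock_t start, clock_t end)
{
    return 1000.0 * (double)(end - start) / (double)CLOCKS_PER_SEC;
}
```

With the variables from the excerpt, the call would be elapsed_ms(startTime, endTime). Because the triple loop runs only once, the result should not be divided by a repetition count.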


MKL version
mkl_get_max_threads() returns the maximum number of threads MKL can use, and mkl_set_num_threads() sets the number of threads MKL runs with.

max_threads = mkl_get_max_threads();
    printf(" Finding max number %d of threads Intel(R) MKL can use for parallel runs \n\n", max_threads);

    printf(" Running Intel(R) MKL from 1 to %i threads \n\n", max_threads * 2);
    for (i = 1; i <= max_threads * 2; i++) {
        for (j = 0; j < (m * n); j++)
            C[j] = 0.0;

        mkl_set_num_threads(i);

        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
            m, n, p, alpha, A, p, B, n, beta, C, n);

        s_initial = dsecnd();
        for (r = 0; r < LOOP_COUNT; r++) {
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, p, alpha, A, p, B, n, beta, C, n);
        }
        s_elapsed = (dsecnd() - s_initial) / LOOP_COUNT;
    }
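
To make the argument list of the cblas_dgemm call easier to read, here is a plain-C sketch of what that call computes for row-major, non-transposed inputs. The function name ref_dgemm_rowmajor is hypothetical and is not part of MKL. It only mirrors the meaning of the dimension and leading-dimension arguments and makes no attempt at MKL's optimizations.

```c
#include <stddef.h>

/* Hypothetical reference routine (not an MKL function).
   Computes C = alpha * A * B + beta * C for row-major matrices:
   A is m x k with row stride lda, B is k x n with row stride ldb,
   C is m x n with row stride ldc. */
void ref_dgemm_rowmajor(int m, int n, int k,
                        double alpha, const double *A, int lda,
                        const double *B, int ldb,
                        double beta, double *C, int ldc)
{
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int l = 0; l < k; l++)
                sum += A[(size_t)i * lda + l] * B[(size_t)l * ldb + j];

            double *c = &C[(size_t)i * ldc + j];
            /* When beta is zero the old value of C is ignored entirely. */
            *c = (beta == 0.0) ? alpha * sum : alpha * sum + beta * (*c);
        }
    }
}
```

In the call above, m, n, p correspond to m, n, k here, and the leading dimensions are p for A, n for B, and n for C, because each one is the row length of a row-major matrix.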

https://raw.githubusercontent.com/MenglinWu9527/m3u/main/mkl.jpeg

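Version          Reported time (ms)
Serial           9000
MKL, 1 thread    15.7
MKL, 2 threads   7.7
MKL, 3 threads   6.4
MKL, 4 threads   8.1
MKL, 5 threads   7.4
MKL, 6 threads   7.5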

Here is my computer's number of logical processors.

wmic:root\cli>cpu get numberoflogicalprocessors
NumberOfLogicalProcessors
6 

Performance is best when the number of MKL threads equals the number of physical cores rather than the number of logical processors: on this machine the fastest run uses 3 threads, not 6.
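
One way to quantify this observation is to compute speedup and parallel efficiency from the measured times. The sketch below is illustrative only; the function names speedup and efficiency are assumptions and are not part of the original programs or of MKL.

```c
/* Speedup of a parallel run relative to a baseline run.
   Both times must be in the same unit. */
double speedup(double t_baseline, double t_parallel)
{
    return t_baseline / t_parallel;
}

/* Parallel efficiency: speedup divided by the number of threads used.
   A value of 1.0 would mean perfect scaling. */
double efficiency(double t_baseline, double t_parallel, int threads)
{
    return speedup(t_baseline, t_parallel) / threads;
}
```

Using the MKL timings reported above, going from 1 thread (15.7) to 3 threads (6.4) gives a speedup of about 2.45 and an efficiency of about 0.82, while 6 threads (7.5) gives a speedup of about 2.09 and an efficiency of about 0.35.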

Source Code

Serial

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
/* Consider adjusting LOOP_COUNT based on the performance of your computer */
/* to make sure that total run time is at least 1 second */
#define LOOP_COUNT 220  //220 for more accurate statistics
int main()
{
    double* A, * B, * C;
    int m, n, p, i, j, k, r;
    double alpha, beta;
    double sum;
    double s_initial, s_elapsed;
    printf("\n This example computes the real matrix product C=A*B using a serial \n"
        " triple nested loop, where A, B, and C are double precision matrices, \n"
        " as a baseline for the Intel(R) MKL dgemm version \n\n");
    m = 2000, p = 200, n = 1000;
    printf(" Initializing data for matrix multiplication C=A*B for matrix \n"
        " A(%ix%i) and matrix B(%ix%i)\n\n", m, p, p, n);
    alpha = 1.0; beta = 0.0;
    printf(" Allocating memory for matrices \n\n");
    /* malloc takes a single size argument; unlike mkl_malloc it has no alignment parameter */
    A = (double*)malloc(m * p * sizeof(double));
    B = (double*)malloc(p * n * sizeof(double));
    C = (double*)malloc(m * n * sizeof(double));
    if (A == NULL || B == NULL || C == NULL) {
        printf("\n ERROR: Can't allocate memory for matrices. Aborting... \n\n");
        free(A);
        free(B);
        free(C);
        return 1;
    }
    printf(" Intializing matrix data \n\n");
    for (i = 0; i < (m * p); i++) {
        A[i] = (double)(i + 1);
    }
    for (i = 0; i < (p * n); i++) {
        B[i] = (double)(-i - 1);
    }
    for (i = 0; i < (m * n); i++) {
        C[i] = 0.0;
    }
    clock_t startTime = clock();
    for (i = 0; i < m; i++) {
        for (j = 0; j < n; j++) {
            sum = 0.0;
            for (k = 0; k < p; k++)
                sum += A[p * i + k] * B[n * k + j];
            C[n * i + j] = sum;
        }
    }
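    clock_t endTime = clock();
    /* The loop above runs once, so convert clock ticks to seconds without dividing by LOOP_COUNT */
    s_elapsed = (double)(endTime - startTime) / CLOCKS_PER_SEC;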
    printf(" == Matrix multiplication using triple nested loop completed == \n"
        " == at %.5f milliseconds == \n\n", (s_elapsed * 1000));
    printf(" Deallocating memory \n\n");
    free(A);
    free(B);
    free(C);
    if (s_elapsed < 0.9 / LOOP_COUNT) {
        s_elapsed = 1.0 / LOOP_COUNT / s_elapsed;
        i = (int)(s_elapsed * LOOP_COUNT) + 1;
        printf(" It is highly recommended to define LOOP_COUNT for this example on your \n"
            " computer as %i to have total execution time about 1 second for reliability \n"
            " of measurements\n\n", i);
    }
    printf(" Example completed. \n\n");
    return 0;
}

MKL version

#include <stdio.h>
#include <stdlib.h>
#include "mkl.h"

/* Consider adjusting LOOP_COUNT based on the performance of your computer */
/* to make sure that total run time is at least 1 second */
#define LOOP_COUNT 220  // 220 for more accurate statistics

int main()
{
    double* A, * B, * C;
    int m, n, p, i, j, r, max_threads;
    double alpha, beta;
    double s_initial, s_elapsed;

    printf("\n This example demonstrates threading impact on computing real matrix product \n"
        " C=alpha*A*B+beta*C using Intel(R) MKL function dgemm, where A, B, and C are \n"
        " matrices and alpha and beta are double precision scalars \n\n");

    m = 2000, p = 200, n = 1000;
    printf(" Initializing data for matrix multiplication C=A*B for matrix \n"
        " A(%ix%i) and matrix B(%ix%i)\n\n", m, p, p, n);
    alpha = 1.0; beta = 0.0;

    printf(" Allocating memory for matrices aligned on 64-byte boundary for better \n"
        " performance \n\n");
    A = (double*)mkl_malloc(m * p * sizeof(double), 64);
    B = (double*)mkl_malloc(p * n * sizeof(double), 64);
    C = (double*)mkl_malloc(m * n * sizeof(double), 64);
    if (A == NULL || B == NULL || C == NULL) {
        printf("\n ERROR: Can't allocate memory for matrices. Aborting... \n\n");
        mkl_free(A);
        mkl_free(B);
        mkl_free(C);
        return 1;
    }

    printf(" Intializing matrix data \n\n");
    for (i = 0; i < (m * p); i++) {
        A[i] = (double)(i + 1);
    }

    for (i = 0; i < (p * n); i++) {
        B[i] = (double)(-i - 1);
    }

    for (i = 0; i < (m * n); i++) {
        C[i] = 0.0;
    }

    max_threads = mkl_get_max_threads();
    printf(" Finding max number %d of threads Intel(R) MKL can use for parallel runs \n\n", max_threads);

    printf(" Running Intel(R) MKL from 1 to %i threads \n\n", max_threads * 2);
    for (i = 1; i <= max_threads * 2; i++) {
        for (j = 0; j < (m * n); j++)
            C[j] = 0.0;

        mkl_set_num_threads(i);

        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
            m, n, p, alpha, A, p, B, n, beta, C, n);

        s_initial = dsecnd();
        for (r = 0; r < LOOP_COUNT; r++) {
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, p, alpha, A, p, B, n, beta, C, n);
        }
        s_elapsed = (dsecnd() - s_initial) / LOOP_COUNT;

        printf(" == Matrix multiplication using Intel(R) MKL dgemm completed ==\n"
            " == at %.5f milliseconds using %d thread(s) ==\n\n", (s_elapsed * 1000), i);
    }

    printf(" Deallocating memory \n\n");
    mkl_free(A);
    mkl_free(B);
    mkl_free(C);

    if (s_elapsed < 0.9 / LOOP_COUNT) {
        s_elapsed = 1.0 / LOOP_COUNT / s_elapsed;
        i = (int)(s_elapsed * LOOP_COUNT) + 1;
        printf(" It is highly recommended to define LOOP_COUNT for this example on your \n"
            " computer as %i to have total execution time about 1 second for reliability \n"
            " of measurements\n\n", i);
    }

    printf(" Example completed. \n\n");
    return 0;
}
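
The serial and MKL programs above are meant to compute the same product, so it can be worth confirming that their results agree before comparing timings. The following is a minimal sketch of such a check; the function name max_abs_diff is hypothetical, and it assumes both result matrices are kept in memory in the same row-major layout.

```c
#include <stddef.h>

/* Hypothetical helper: largest absolute element-wise difference
   between two arrays of length n. */
double max_abs_diff(const double *x, const double *y, size_t n)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = x[i] - y[i];
        if (d < 0.0)
            d = -d;           /* absolute value without needing math.h */
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

In general an optimized routine may sum terms in a different order than the triple loop, so the comparison should allow a small tolerance, scaled to the magnitude of the matrix entries, rather than require exact equality.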

References

https://www.intel.com/content/www/us/en/developer/articles/technical/a-simple-example-to-measure-the-performance-of-an-intel-mkl-function.html

https://www.youtube.com/watch?v=pzVaJgdN9Fw