GPU621/Group 3
== '''Optimizing Image Processing using Intel's Integrated Performance Primitives, Thread Building Blocks, and OpenMP w/ Comparison''' ==

'''Introduction:'''

In this project we compare Intel's Integrated Performance Primitives, Thread Building Blocks, and the OpenMP API for optimizing image processing through parallel computing and vectorization. We selected three tasks for this project: sharpening, brightening, and adjusting the saturation of an image. Our demo program records the run-time of each task so the toolsets can be compared directly, and we also compare the implementation effort each toolset requires.

In order to engage more easily with image files, we use the OpenCV library, leaning especially on its Mat class. The Mat class lets us access an image as an n-dimensional array. Furthermore, our implementation relies on our own parallelization choices rather than the parallelization built into the OpenCV library.
+ | |||
+ | We had originally intended to use Intel's Data Analytics Acceleration Library, but as work progressed on the project we realized that the library was not well suited to our needs. Intel's oneAPI DAL (Data Analytics Library) was our chosen library to complete this project. However, due to changes to our team and the nature of the project we wanted to pursue, we decided to use Intel IPP (Integrated Performance Primitives) instead of DAL. | ||
+ | DAL is a robust and capable library for data analytics and machine learning. It is designed with linear algebra and statistical operations in mind. | ||
+ | DAL offers parallelization capabilities but is not explicitly optimized for image processing operations. Image processing involves working with large arrays of pixel data, which requires specialized data structures and memory access patterns. DAL's focus on linear algebra and statistical operations may not be well-suited to these procedures. Other libraries specifically designed for image processing, such as OpenCV, offer better performance because they can take advantage of GPUs' parallel processing for Image processing. Also, they offer a more significant number of features than DAL. Libraries specifically designed for image processing, such as OpenCV, can take advantage of GPUs' parallel processing capabilities for image processing. | ||
+ | |||
+ | Incorporating Intel's oneAPI DAL into image processing applications is only efficient when massive datasets of Image Data need processing or very computationally intensive operations such as image compression, dimensionality reduction, or feature extraction is required. DAL's optimized algorithms provide a performance advantage over other libraries when performing heavy linear algebra and statistical functions. DAL also offers excellent flexibility for creating custom algorithms. | ||
+ | |||
+ | |||
+ | |||
+ | '''Intel oneAPI's IPP Library Overview:''' | ||
+ | |||
+ | The Intel oneAPI's IPP (Integrated Performance Primitives) library enhances signal and image processing by providing the necessary mathematical operations. IPP runs on Windows, Linux, and macOS, although it is optimized for Intel processors, taking advantage of Intel instruction sets like Streaming SIMD Extensions (SSE) to maximize runtime speeds on Intel CPUs. Because it provides an accessible API with multiple methods for tasks such as image filtering, audio processing, and even cryptography, IPP can be integrated into existing programs and projects. Intel IPP is included as part of Intel's oneAPI Base Toolkit. | ||
+ | |||
+ | '''OpenMP API Library Overview:''' | ||
+ | |||
+ | OpenMP (Open Multi-Processing) is a robust API for multi-platform shared-memory multi-processing programming in C and C++. It provides developers with compiler directives, library routines, and environment variables to use when writing parallel programs that can run on multiple processor cores. Some of the functionalities provided by OpenMP are as follows: | ||
+ | |||
+ | :-Parallel computing | ||
+ | :-Vectorization | ||
+ | :-Thread management | ||
+ | :-Memory management | ||
+ | :-Loop scheduling | ||
+ | :-etc. | ||
+ | |||
'''Data Analytics Library Overview:'''

Intel's Data Analytics Library offers a robust collection of tools and algorithms that can assist programmers in building high-performance applications tailored for Intel chips. These tools are designed to interact with various data sources, such as data stored in memory, on hard disk, or in distributed systems. The functions in Intel's Data Analytics Library are usable by a broad range of developers because it supports several programming languages, such as C++, Python, and Java.

Data Analytics Library offers functionalities for:

:- Parallel computing
:- Vectorization
:- Machine learning
:- Graph analytics
:- Statistical analysis
:- Data visualization
+ | |||
+ | == '''OpenMP Implementation Summary''' == | ||
+ | |||
+ | '''OpenMP Implementation''' | ||
+ | |||
+ | OpenMP provides extremely simple implementation, especially the process which we are using in our code. In this process we were able to simply use a ''#pragma parallel for'' declaration for the OpenMP API to parallelize the process. With this we saw at the operations being performed at a quarter of the time it took the serial version of these processes. Originally in the sharpen function we were only using the parallel for, but it was pointed out that we could avoid some false sharing issues and shave a few milliseconds off our processing time by using a reduction targeting the 'sum' variable. | ||
+ | |||
+ | ==='''Image Processing, parallelized with OpenMP'''=== | ||
+ | |||
+ | |||
+ | ===='''Class Declaration'''==== | ||
+ | |||
+ | In this class declaration for what will hold the OpenMP parallelized functionality we include a Laplacian kernel which will be applied to the sample images in order to sharpen details. How this is achieved is essentially highlighting areas on a greyscale version of the orignal image where the picture goes quickly from light to dark, and applies that highlight to the same locations on the original image. For those familiar with Laplacian filters, you may notice that ours is very much non-standard. Through testing we determined that this was the filter that created the best results across all use cases, though it should be noted that the scaling done when applying the highlight in the sharpening operation that is currently set to 0.99 in the code below would need to be reduced significantly if applied to artist illustrations. | ||
+ | |||
+ | The sharpening process is the most interesting of the processes as it provides a similar effect to an artist adding white lines around the outlines of an illustration, which is often a stylistic choice made in character illustration. | ||
+ | |||
<syntaxhighlight lang="cpp">
#pragma once
#include <vector>
#include <opencv2/opencv.hpp>

class openMP_imgProcessor {
    //Laplacian kernel used in sharpening
    std::vector<std::vector<double>> LapKernel_ = {
        {0, 0, 1},
        {0, 1, 2},
        {1, 2, -7}
    };

public:
    openMP_imgProcessor() { }
    void sharpenImg(cv::Mat& image);
    void brightenImg(cv::Mat& image, int brightnessLvl);
    void saturateImg(cv::Mat& image, double saturationLvl);
};
</syntaxhighlight>

<syntaxhighlight lang="cpp">
#include <iostream>
#include "openMP_imgProc.h"

void openMP_imgProcessor::sharpenImg(cv::Mat& image) {

    //suppressing OpenCV messages
    std::streambuf* coutbuf = std::cout.rdbuf();
    std::cout.rdbuf(nullptr);

    // Convert the image to grayscale
    cv::Mat grayscale;
    cv::cvtColor(image, grayscale, cv::COLOR_BGR2GRAY);

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int x = 1; x < image.cols - 1; x++) {
        for (int y = 1; y < image.rows - 1; y++) {
            // Apply the Laplacian kernel around this pixel
            double local_sum = 0.0;
            for (int i = -1; i <= 1; i++) {
                for (int j = -1; j <= 1; j++) {
                    local_sum += grayscale.at<uchar>(y + j, x + i) * LapKernel_[i + 1][j + 1];
                }
            }

            // Add the scaled highlight to each colour channel
            for (int c = 0; c < 3; c++) {
                image.at<cv::Vec3b>(y, x)[c] = cv::saturate_cast<uchar>(image.at<cv::Vec3b>(y, x)[c] + local_sum * .99);
            }

            // Accumulated via the reduction to avoid false sharing
            sum += local_sum;
        }
    }

    //stop suppressing
    std::cout.rdbuf(coutbuf);
}

void openMP_imgProcessor::brightenImg(cv::Mat& image, int brightnessLvl) {
    //suppressing OpenCV messages
    std::streambuf* coutbuf = std::cout.rdbuf();
    std::cout.rdbuf(nullptr);

    int width = image.cols;
    int height = image.rows;
    int channels = image.channels();

    #pragma omp parallel for
    for (int row = 0; row < height; row++) {
        for (int col = 0; col < width; col++) {
            for (int c = 0; c < channels; c++) {
                uchar& pixel = image.at<cv::Vec3b>(row, col)[c];
                pixel = cv::saturate_cast<uchar>(pixel + brightnessLvl);
            }
        }
    }

    //stop suppressing
    std::cout.rdbuf(coutbuf);
}

void openMP_imgProcessor::saturateImg(cv::Mat& image, double saturationLvl) {

    //suppressing OpenCV messages
    std::streambuf* coutbuf = std::cout.rdbuf();
    std::cout.rdbuf(nullptr);

    //HSV stands for hue, saturation, value
    cv::Mat hsv;
    cv::cvtColor(image, hsv, cv::COLOR_BGR2HSV);

    #pragma omp parallel for
    for (int y = 0; y < hsv.rows; ++y)
    {
        for (int x = 0; x < hsv.cols; ++x)
        {
            // Get pixel value
            cv::Vec3b color = hsv.at<cv::Vec3b>(cv::Point(x, y));

            // Scale saturation by saturationLvl; color[1] is the saturation channel
            color[1] = cv::saturate_cast<uchar>(color[1] * saturationLvl);

            // Set pixel value
            hsv.at<cv::Vec3b>(cv::Point(x, y)) = color;
        }
    }

    cv::cvtColor(hsv, image, cv::COLOR_HSV2BGR);

    //stop suppressing
    std::cout.rdbuf(coutbuf);
}
</syntaxhighlight>

+ | |||
+ | =='''IPP Implementation Summary'''== | ||
+ | |||
+ | IPP implementation class for this project is called IppImgProc. It performs three image processing tasks (sharpening, brighten, adjustSaturation) using the Intel Integrated Performance Primitives (IPP) library and OpenCV. | ||
+ | |||
+ | By using IPP we can leverage optimized functions that performs tasks in parallel without any additional library. In the constructor the number of Threads are defined using std::thread::hardware_concurrency() and the image is loaded using OpenCV function imread(). | ||
+ | |||
+ | In the main functions sharpening(), brighten(), adjustSaturation(), specialized functions is used for each specific task such as ippiFilterLaplaceBorder_8u_C3R(), ippiAddC_8u_C3RSfs(), ippiHSVToRGB_8u_C3R() in order. | ||
+ | |||
+ | Each function that contains C3R uses 3 channels for image processing meaning it processes the colors as well. Note that function ippiFilterLaplaceBorder_8u_C3R() needs Buffer memory allocations. | ||
+ | |||
+ | This allocation can be made utilizing ippiFilterLaplaceBorderGetBufferSize(). | ||
+ | |||
+ | Image data is captured using openCV methods like | ||
+ | img_.convertTo(gray8Img_, CV_8U); | ||
+ | width_ = gray8Img_.cols; | ||
+ | height_ = gray8Img_.rows; | ||
+ | |||
<syntaxhighlight lang="cpp">
void IppImgProc::brighten(int brightness, int scaleFactor)
{
    // Create a matrix for the output image
    outImg_ = cv::Mat::zeros(img_.size(), img_.type());

    // Convert the image to 8-bit format
    img_.convertTo(gray8Img_, CV_8U);

    // Get the image dimensions
    width_ = gray8Img_.cols;
    height_ = gray8Img_.rows;

    // Describe the input image data for IPP
    IppiSize roi = { width_, height_ };
    Ipp8u* pData = gray8Img_.data;
    int step = gray8Img_.step;

    // Set up the brightening parameters: the same constant is added
    // to each of the three channels, with saturation
    Ipp8u value[3] = { static_cast<Ipp8u>(brightness), static_cast<Ipp8u>(brightness), static_cast<Ipp8u>(brightness) };

    // Perform the brightening operation using IPP
    IppStatus status = ippiAddC_8u_C3RSfs(pData, step, value, outImg_.data, outImg_.step, roi, scaleFactor);

    // Check for errors
    if (status != ippStsNoErr) {
        throw std::runtime_error("IPP error");
    }
}
</syntaxhighlight>

+ | |||
+ | == '''TBB Implementation Summary''' == | ||
+ | |||
+ | The TBB implementation was relatively simple, though not quite as simple as the OpenMP implementation. It's class declaration is essentially the same as the OpenMP image processor, and uses the same Laplacian kernel. The primary difference is that instead of being able to simply use a #pragma to parallelize the code, we use the parallel_for functionality from TBB. We use the dimensions of the image to get the range, and then placed our functionality inside the lambda to be passed into the parallel_for call. | ||
+ | |||
<syntaxhighlight lang="cpp">
#include <iostream>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

void tbb_imgProcessor::saturateImg(cv::Mat& image, double saturationLvl) {
    //suppressing OpenCV messages
    std::streambuf* coutbuf = std::cout.rdbuf();
    std::cout.rdbuf(nullptr);

    //HSV stands for hue, saturation, value
    cv::Mat hsv;
    cv::cvtColor(image, hsv, cv::COLOR_BGR2HSV);

    //Set blocked range from the first row to the last
    tbb::parallel_for(tbb::blocked_range<int>(0, hsv.rows), [&](const tbb::blocked_range<int>& r) {
        for (int y = r.begin(); y < r.end(); ++y)
        {
            for (int x = 0; x < hsv.cols; ++x)
            {
                // Get pixel value
                cv::Vec3b color = hsv.at<cv::Vec3b>(cv::Point(x, y));

                // Scale saturation by saturationLvl; color[1] is the saturation channel
                color[1] = cv::saturate_cast<uchar>(color[1] * saturationLvl);

                // Set pixel value
                hsv.at<cv::Vec3b>(cv::Point(x, y)) = color;
            }
        }
    });

    //Convert image from HSV back to BGR
    cv::cvtColor(hsv, image, cv::COLOR_HSV2BGR);

    //stop suppressing
    std::cout.rdbuf(coutbuf);
}

void tbb_imgProcessor::brightenImg(cv::Mat& image, int brightnessLvl) {
    //suppressing OpenCV messages
    std::streambuf* coutbuf = std::cout.rdbuf();
    std::cout.rdbuf(nullptr);

    int width = image.cols;
    int height = image.rows;
    int channels = image.channels();

    tbb::parallel_for(0, height, [&](int row) {
        for (int col = 0; col < width; col++) {
            for (int c = 0; c < channels; c++) {
                uchar& pixel = image.at<cv::Vec3b>(row, col)[c];
                pixel = cv::saturate_cast<uchar>(pixel + brightnessLvl);
            }
        }
    });

    //stop suppressing
    std::cout.rdbuf(coutbuf);
}

void tbb_imgProcessor::sharpenImg(cv::Mat& image) {

    //suppressing OpenCV messages
    std::streambuf* coutbuf = std::cout.rdbuf();
    std::cout.rdbuf(nullptr);

    // Convert the image to grayscale
    cv::Mat grayscale;
    cv::cvtColor(image, grayscale, cv::COLOR_BGR2GRAY);

    tbb::parallel_for(1, image.cols - 1, [&](int x) {
        for (int y = 1; y < image.rows - 1; y++) {
            // Apply the Laplacian kernel around this pixel
            double sum = 0.0;
            for (int i = -1; i <= 1; i++) {
                for (int j = -1; j <= 1; j++) {
                    sum += grayscale.at<uchar>(y + j, x + i) * LapKernel_[i + 1][j + 1];
                }
            }

            // Add the scaled highlight to each colour channel
            for (int c = 0; c < 3; c++) {
                image.at<cv::Vec3b>(y, x)[c] = cv::saturate_cast<uchar>(image.at<cv::Vec3b>(y, x)[c] + sum * .99);
            }
        }
    });

    //stop suppressing
    std::cout.rdbuf(coutbuf);
}
</syntaxhighlight>

+ | |||
+ | =='''Testing and Demonstration Program'''== | ||
+ | We've kept our demo program quite simple. Below you'll find a version of our Demo.cpp. If you'd like to see our full code and tinker with it yourself, you can view our git repository here: https://github.com/GPU621-DAL-OpenMP-Comparison/Project-Demo | ||
+ | |||
+ | |||
<syntaxhighlight lang="cpp">
#include "Tester.h"

//argument is ../sample_images/test.jpg
int main(int argc, char* argv[]) {

    if (argc < 2) return 1; //an image path is required

    Tester demo(argv[1]);
    demo.display_img(0);

    //run omp
    //omp_set_num_threads(15); //Olivia- 15 was opt choice for my system
    demo.omp_brighten(50);
    demo.omp_sharpen();
    demo.omp_saturate(2.0);
    //disable OpenMP so it can't be incidentally used in the backend
    omp_set_num_threads(1);
    omp_set_dynamic(0);

    //run ipp
    demo.ipp_brighten(50);
    demo.ipp_sharpen();
    demo.ipp_saturate();

    //run serial
    cv::setNumThreads(0); //turn off all parallelization in the backend
    demo.serial_brighten(50);
    demo.serial_sharpen();
    demo.serial_saturate(2.0);

    return 0;
}
</syntaxhighlight>

+ | |||
+ | =='''Results'''== | ||
+ | |||
+ | Testing these libraries in image manipulation displays some interesting differences in their runtime. In everything but the saturation process, the IPP implementations had the fastest run times by considerable margins, though it took around 2.5x longer to alter the image saturation, it was more than twice as fast in the brightening and took around a fifth of the time needed for the OpenMP and TBB parallelized sharpening operations. | ||
+ | |||
+ | The OpenMP and TBB solutions were similar in runtime but the TBB solutions were slightly faster. This is likely due to needing less overhead for the threading than the OpenMP processes. Both are relatively simple to implement so we believe that TBB should generally be the preference between the two tools in these image manipulation applications. | ||
+ | |||
+ | Of course, as can be seen from the charts below, each parallelized option is far faster than the serial implementation of these processes. | ||
+ | |||
+ | [[File:ResultsSheet.PNG]] | ||
+ | |||
+ | [[File:ResultsChart_woutSerial.PNG]] | ||
+ | [[File:ResultsChart_wSerial.PNG]] | ||
+ | |||
+ | [[File:Release_Run_Output.PNG]] | ||
+ | |||
+ | |||
+ | =='''Hardware Used in Testing'''== | ||
+ | |||
+ | It's important to note that your results may be quite different from ours. Multithreading performance can depend heavily on the hardware of the machine the program has been run on. | ||
+ | |||
+ | Here is a brief bit of information about the hardware utilized in our testing: | ||
+ | |||
+ | '''Processor''' AMD Ryzen 7 3800X 8-Core Processor 3.89 GHz | ||
+ | |||
+ | '''Installed RAM''' 32.0 GB | ||
+ | |||
+ | '''System type''' 64-bit operating system, x64-based processor |
Latest revision as of 12:16, 12 April 2023
Contents
Optimizing Image Processing using Intel's Integrated Performance Primitives, Thread Building Blocks, and OpenMP w/ Comparison
Introduction:
In this project we will be comparing Intel's Integrated Performance Primitives, Thread Building Blocks, and OpenMP API to optimize image processing using parallel computing and vectorization. We selected three tasks for this project: image sharpening, brightening, and adjusting the saturation of an image. The run-time of each task is recorded and able to be compared by our demo program. We will also be comparing the implementation for each toolset we utilize.
In order to be able to more easily engage with image files, we will be utilizing the OpenCV library, leaning especially on the Mat class therein. The Mat class allows us to access the image as a n-dimensional array. Furthermore with our implementation we are able to rely on our parellelization choices instead of that built into the OpenCV library.
We had originally intended to use Intel's Data Analytics Acceleration Library, but as work progressed on the project we realized that the library was not well suited to our needs. Intel's oneAPI DAL (Data Analytics Library) was our chosen library to complete this project. However, due to changes to our team and the nature of the project we wanted to pursue, we decided to use Intel IPP (Integrated Performance Primitives) instead of DAL. DAL is a robust and capable library for data analytics and machine learning. It is designed with linear algebra and statistical operations in mind. DAL offers parallelization capabilities but is not explicitly optimized for image processing operations. Image processing involves working with large arrays of pixel data, which requires specialized data structures and memory access patterns. DAL's focus on linear algebra and statistical operations may not be well-suited to these procedures. Other libraries specifically designed for image processing, such as OpenCV, offer better performance because they can take advantage of GPUs' parallel processing for Image processing. Also, they offer a more significant number of features than DAL. Libraries specifically designed for image processing, such as OpenCV, can take advantage of GPUs' parallel processing capabilities for image processing.
Incorporating Intel's oneAPI DAL into image processing applications is only efficient when massive datasets of Image Data need processing or very computationally intensive operations such as image compression, dimensionality reduction, or feature extraction is required. DAL's optimized algorithms provide a performance advantage over other libraries when performing heavy linear algebra and statistical functions. DAL also offers excellent flexibility for creating custom algorithms.
Intel oneAPI's IPP Library Overview:
The Intel oneAPI's IPP (Integrated Performance Primitives) library enhances signal and image processing by providing the necessary mathematical operations. IPP runs on Windows, Linux, and macOS, although it is optimized for Intel processors, taking advantage of Intel instruction sets like Streaming SIMD Extensions (SSE) to maximize runtime speeds on Intel CPUs. Because it provides an accessible API with multiple methods for tasks such as image filtering, audio processing, and even cryptography, IPP can be integrated into existing programs and projects. Intel IPP is included as part of Intel's oneAPI Base Toolkit.
OpenMP API Library Overview:
OpenMP (Open Multi-Processing) is a robust API for multi-platform shared-memory multi-processing programming in C and C++. It provides developers with compiler directives, library routines, and environment variables to use when writing parallel programs that can run on multiple processor cores. Some of the functionalities provided by OpenMP are as follows:
- -Parallel computing
- -Vectorization
- -Thread management
- -Memory management
- -Loop scheduling
- -etc.
Data Analytics Library Overview:
Intel's Data Analytics Library offers a robust collection of tools and algorithms that can assist programmers in building high-performance applications tailored for Intel chips. These tools are designed to interact with various data sources, such as data stored in memory, hard disc, or distributed systems. These functions available in Intel's Data Analytics Library are usable by a broad range of developers because it supports various programming languages, such as C++, Python, and Java. Data Analytics Library offers functionalities for: • Parallel computing. • Vectorization. • Machine learning. • Graph analytics. • Statistical analysis. • Data visualization.
OpenMP Implementation Summary
OpenMP Implementation
OpenMP provides extremely simple implementation, especially the process which we are using in our code. In this process we were able to simply use a #pragma parallel for declaration for the OpenMP API to parallelize the process. With this we saw at the operations being performed at a quarter of the time it took the serial version of these processes. Originally in the sharpen function we were only using the parallel for, but it was pointed out that we could avoid some false sharing issues and shave a few milliseconds off our processing time by using a reduction targeting the 'sum' variable.
Image Processing, parallelized with OpenMP
Class Declaration
In this class declaration for what will hold the OpenMP parallelized functionality we include a Laplacian kernel which will be applied to the sample images in order to sharpen details. How this is achieved is essentially highlighting areas on a greyscale version of the orignal image where the picture goes quickly from light to dark, and applies that highlight to the same locations on the original image. For those familiar with Laplacian filters, you may notice that ours is very much non-standard. Through testing we determined that this was the filter that created the best results across all use cases, though it should be noted that the scaling done when applying the highlight in the sharpening operation that is currently set to 0.99 in the code below would need to be reduced significantly if applied to artist illustrations.
The sharpening process is the most interesting of the processes as it provides a similar effect to an artist adding white lines around the outlines of an illustration, which is often a stylistic choice made in character illustration.
class openMP_imgProcessor {
//laplacian kernel used in sharpening
std::vector<std::vector<double>> LapKernel_ = {
{0, 0, 1},
{0, 1, 2},
{1, 2, -7}
};
public:
openMP_imgProcessor() { }
void sharpenImg(cv::Mat& image);
void brightenImg(cv::Mat& image, int brightnessLvl);
void saturateImg(cv::Mat& image, double saturationLvl);
};
#include "openMP_imgProc.h"
void openMP_imgProcessor::sharpenImg(cv::Mat& image) {
//supressing OpenCV messages
std::streambuf* coutbuf = std::cout.rdbuf();
std::cout.rdbuf(nullptr);
// Convert the image to grayscale
cv::Mat grayscale;
cv::cvtColor(image, grayscale, cv::COLOR_BGR2GRAY);
double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (int x = 1; x < image.cols - 1; x++) {
for (int y = 1; y < image.rows - 1; y++) {
double local_sum = 0.0;
for (int i = -1; i <= 1; i++) {
for (int j = -1; j <= 1; j++) {
local_sum += grayscale.at<uchar>(y + j, x + i) * LapKernel_[i + 1][j + 1];
}
}
for (int c = 0; c < 3; c++) {
image.at<cv::Vec3b>(y, x)[c] = cv::saturate_cast<uchar>(image.at<cv::Vec3b>(y, x)[c] + local_sum * .99);
}
sum += local_sum;
}
}
//stop supressing
std::cout.rdbuf(coutbuf);
}
void openMP_imgProcessor::brightenImg(cv::Mat& image, int brightnessLvl) {
//supressing OpenCV messages
std::streambuf* coutbuf = std::cout.rdbuf();
std::cout.rdbuf(nullptr);
int width = image.cols;
int height = image.rows;
int channels = image.channels();
#pragma omp parallel for
for (int row = 0; row < height; row++) {
for (int col = 0; col < width; col++) {
for (int c = 0; c < channels; c++) {
uchar& pixel = image.at<cv::Vec3b>(row, col)[c];
pixel = cv::saturate_cast<uchar>(pixel + brightnessLvl);
}
}
}
//stop supressing
std::cout.rdbuf(coutbuf);
}
void openMP_imgProcessor::saturateImg(cv::Mat& image, double saturationLvl) {
//supressing OpenCV messages
std::streambuf* coutbuf = std::cout.rdbuf();
std::cout.rdbuf(nullptr);
//HSV stands for hue saturation value
cv::Mat hsv;
cv::cvtColor(image, hsv, cv::COLOR_BGR2HSV);
#pragma omp parallel for
for (int y = 0; y < hsv.rows; ++y)
{
for (int x = 0; x < hsv.cols; ++x)
{
// Get pixel value
cv::Vec3b color = hsv.at<cv::Vec3b>(cv::Point(x, y));
// Increase saturation by saturation Lvl color[1] is for saturation
color[1] = cv::saturate_cast<uchar>(color[1] * saturationLvl);
// Set pixel value
hsv.at<cv::Vec3b>(cv::Point(x, y)) = color;
}
}
cv::cvtColor(hsv, image, cv::COLOR_HSV2BGR);
//stop supressing
std::cout.rdbuf(coutbuf);
}
IPP Implementation Summary
IPP implementation class for this project is called IppImgProc. It performs three image processing tasks (sharpening, brighten, adjustSaturation) using the Intel Integrated Performance Primitives (IPP) library and OpenCV.
By using IPP we can leverage optimized functions that performs tasks in parallel without any additional library. In the constructor the number of Threads are defined using std::thread::hardware_concurrency() and the image is loaded using OpenCV function imread().
In the main functions sharpening(), brighten(), adjustSaturation(), specialized functions is used for each specific task such as ippiFilterLaplaceBorder_8u_C3R(), ippiAddC_8u_C3RSfs(), ippiHSVToRGB_8u_C3R() in order.
Each function that contains C3R uses 3 channels for image processing meaning it processes the colors as well. Note that function ippiFilterLaplaceBorder_8u_C3R() needs Buffer memory allocations.
This allocation can be made utilizing ippiFilterLaplaceBorderGetBufferSize().
Image data is captured using openCV methods like img_.convertTo(gray8Img_, CV_8U);
width_ = gray8Img_.cols; height_ = gray8Img_.rows;
void IppImgProc::brighten(int brightness, int scaleFactor)
{
// Create a matrix for the output image
outImg_ = cv::Mat::zeros(img_.size(), img_.type());
// Convert the image to 8-bit format
img_.convertTo(gray8Img_, CV_8U);
// Get the image dimensions
width_ = gray8Img_.cols;
height_ = gray8Img_.rows;
// Create an IPP image for the input image data
IppiSize roi = { width_, height_ };
Ipp8u* pData = gray8Img_.data;
int step = gray8Img_.step;
// Set up the brightening parameters
Ipp8u value[3] = { static_cast<Ipp8u>(brightness), static_cast<Ipp8u>(brightness), static_cast<Ipp8u>(brightness) };
IppStatus status = ippiAddC_8u_C3RSfs(pData, step, value, outImg_.data, outImg_.step, roi, scaleFactor);
// Measure the time it takes to perform the brightening operation using IPP
status = ippiAddC_8u_C3RSfs(pData, step, value, outImg_.data, outImg_.step, roi, scaleFactor);
// Check for errors
if (status != ippStsNoErr) {
throw std::runtime_error("IPP error");
}
}
TBB Implementation Summary
The TBB implementation was relatively simple, though not quite as simple as the OpenMP implementation. It's class declaration is essentially the same as the OpenMP image processor, and uses the same Laplacian kernel. The primary difference is that instead of being able to simply use a #pragma to parallelize the code, we use the parallel_for functionality from TBB. We use the dimensions of the image to get the range, and then placed our functionality inside the lambda to be passed into the parallel_for call.
void tbb_imgProcessor::saturateImg(cv::Mat& image, double saturationLvl) {
//supressing OpenCV messages
std::streambuf* coutbuf = std::cout.rdbuf();
std::cout.rdbuf(nullptr);
//HSV stands for hue saturation value
cv::Mat hsv;
cv::cvtColor(image, hsv, cv::COLOR_BGR2HSV);
//Set blocked range from the first entry to the last
tbb::parallel_for(tbb::blocked_range<int>(0, hsv.rows), [&](const tbb::blocked_range<int>& r) {
for (int y = r.begin(); y < r.end(); ++y)
{
for (int x = 0; x < hsv.cols; ++x)
{
// Get pixel value
cv::Vec3b color = hsv.at<cv::Vec3b>(cv::Point(x, y));
// Increase saturation by saturation Lvl color[1] is for saturation
color[1] = cv::saturate_cast<uchar>(color[1] * saturationLvl);
// Set pixel value
hsv.at<cv::Vec3b>(cv::Point(x, y)) = color;
}
}
});
//Convert image from HSV back to BGR
cv::cvtColor(hsv, image, cv::COLOR_HSV2BGR);
//stop suppressing
std::cout.rdbuf(coutbuf);
}
void tbb_imgProcessor::brightenImg(cv::Mat& image, int brightnessLvl) {
//suppressing OpenCV messages
std::streambuf* coutbuf = std::cout.rdbuf();
std::cout.rdbuf(nullptr);
int width = image.cols;
int height = image.rows;
int channels = image.channels();
tbb::parallel_for(0, height, [&](int row) {
for (int col = 0; col < width; col++) {
for (int c = 0; c < channels; c++) {
uchar& pixel = image.at<cv::Vec3b>(row, col)[c];
pixel = cv::saturate_cast<uchar>(pixel + brightnessLvl);
}
}
});
//stop suppressing
std::cout.rdbuf(coutbuf);
}
void tbb_imgProcessor::sharpenImg(cv::Mat& image) {
//suppressing OpenCV messages
std::streambuf* coutbuf = std::cout.rdbuf();
std::cout.rdbuf(nullptr);
// Convert the image to grayscale
cv::Mat grayscale;
cv::cvtColor(image, grayscale, cv::COLOR_BGR2GRAY);
tbb::parallel_for(1, image.cols - 1, [&](int x) {
for (int y = 1; y < image.rows - 1; y++) {
double sum = 0.0;
for (int i = -1; i <= 1; i++) {
for (int j = -1; j <= 1; j++) {
sum += grayscale.at<uchar>(y + j, x + i) * LapKernel_[i + 1][j + 1];
}
}
for (int c = 0; c < 3; c++) {
image.at<cv::Vec3b>(y, x)[c] = cv::saturate_cast<uchar>(image.at<cv::Vec3b>(y, x)[c] + sum * .99);
}
}
});
//stop suppressing
std::cout.rdbuf(coutbuf);
}
=== Testing and Demonstration Program ===
We've kept our demo program quite simple. Below you'll find a version of our Demo.cpp. If you'd like to see our full code and tinker with it yourself, you can view our git repository here: https://github.com/GPU621-DAL-OpenMP-Comparison/Project-Demo
#include "Tester.h"
//argument is ../sample_images/test.jpg
int main(int argc, char* argv[]) {
Tester demo(argv[1]);
demo.display_img(0);
//run omp
//omp_set_num_threads(15); //Olivia- 15 was opt choice for my system
demo.omp_brighten(50);
demo.omp_sharpen();
demo.omp_saturate(2.0);
//disable OpenMP so it can't be incidentally used in the backend
omp_set_num_threads(1);
omp_set_dynamic(0);
//run ipp
demo.ipp_brighten(50);
demo.ipp_sharpen();
demo.ipp_saturate();
//run serial
cv::setNumThreads(0); //turn all parallelization of the backend off
demo.serial_brighten(50);
demo.serial_sharpen();
demo.serial_saturate(2.0);
return 0;
}
=== Results ===
Testing these libraries on image manipulation reveals some interesting differences in run time. For every task except saturation, the IPP implementation was fastest by a considerable margin: it was more than twice as fast at brightening and took roughly a fifth of the time the OpenMP and TBB versions needed for sharpening, though it took around 2.5x longer than they did to alter the image saturation.
The OpenMP and TBB solutions were similar in run time, with the TBB solutions slightly faster, likely because TBB's task scheduler carries less threading overhead than the OpenMP runtime in these loops. Both are relatively simple to implement, so we believe TBB should generally be preferred between the two for these image manipulation applications.
Of course, as can be seen from the charts below, each parallelized option is far faster than the serial implementation of these processes.
=== Hardware Used in Testing ===
It's important to note that your results may be quite different from ours. Multithreading performance can depend heavily on the hardware of the machine the program is run on.
Here is a brief summary of the hardware used in our testing:
* Processor: AMD Ryzen 7 3800X 8-Core Processor, 3.89 GHz
* Installed RAM: 32.0 GB
* System type: 64-bit operating system, x64-based processor