Difference between revisions of "GPU621/NoName"

From CDOT Wiki

Revision as of 18:59, 26 November 2016

NoName

Our project: C++11 Threads Library Comparison to OpenMP

Group Members

  1. Saad Toor [1] Research etc.
  2. Danylo Medinski [2] Research etc.
  3. Ahmed Khan [3] Research etc.

Progress

Oct 17th:

  1. Picked topic
  2. Picked presentation date.
  3. Gathering information

Oct 20th:

  1. Created Wiki page

OpenMP vs C++11 Threads

What are C++11 Threads

With the introduction of C++11, major changes and additions were made to the C++ standard library. One of the most significant was the inclusion of multi-threading support. Before C++11, implementing multi-threading required external libraries or language extensions such as OpenMP. The C++11 thread support library includes these four headers to enable multi-threading:

  • <thread> - class and namespace for working with threads
  • <mutex> - provides support for mutual exclusion
  • <condition_variable> - a synchronization primitive that can be used to block one or more threads until another thread both modifies a shared variable (the condition) and notifies the condition_variable.
  • <future> - Describes components that a C++ program can use to retrieve in one thread the result (value or exception) from a function that has run in the same thread or another thread.
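
As a rough illustration of three of the headers listed above, the sketch below sums a vector with std::async from <future> and increments a shared counter from several std::thread objects guarded by a std::mutex. The helper names parallel_sum and locked_count are hypothetical and chosen only for this example; <condition_variable> is not covered here.

```cpp
#include <cassert>
#include <future>
#include <mutex>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sums the first half of v in a task started with
// std::async (<future>) and the second half on the calling thread.
long long parallel_sum(const std::vector<int>& v) {
    auto mid = v.begin() + v.size() / 2;
    std::future<long long> first = std::async(std::launch::async,
        [&v, mid] { return std::accumulate(v.begin(), mid, 0LL); });
    long long second = std::accumulate(mid, v.end(), 0LL);
    return first.get() + second;   // get() blocks until the task has finished
}

// Hypothetical helper: nthreads threads (<thread>) each increment a shared
// counter `iterations` times; a std::mutex (<mutex>) serialises the updates.
int locked_count(int nthreads, int iterations) {
    assert(nthreads > 0 && iterations >= 0);
    int counter = 0;
    std::mutex m;
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t)
        threads.emplace_back([&counter, &m, iterations] {
            for (int i = 0; i < iterations; ++i) {
                std::lock_guard<std::mutex> lock(m);   // released at end of scope
                ++counter;
            }
        });
    for (auto& th : threads) th.join();
    return counter;
}
```

Without the lock_guard the increments would race and the final count could come out lower than nthreads * iterations.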

Creating and executing Threads

Inside a declared OpenMP parallel region, unless the thread count is specified via the environment variable OMP_NUM_THREADS, the library routine omp_set_num_threads(), or a num_threads clause, OpenMP decides automatically how many threads execute the parallel code. The default is implementation-defined, so the team size is not guaranteed to match what the CPU can support; a runtime that created 4 threads on a single-core processor, for example, would oversubscribe the core and could degrade performance. C++11 threads, by contrast, always require the developer to create each thread explicitly, so the number of threads is always chosen by the program. If it is not supplied by user input or hardcoded, the number of threads supported by the CPU can be queried with std::thread::hardware_concurrency(), which returns a hint and may return 0 when the value cannot be determined. OpenMP also starts the threads of a team itself. With C++11 threads the developer launches the threads one by one, typically inside a for loop, although the operating system's scheduler still decides the order in which they actually run.

OpenMP

Automatic thread creation

#pragma omp parallel
     {
          int tid = omp_get_thread_num(); 
          std::cout << "Hi from thread "
          << tid << '\n';
     }

Programmer Specified thread creation

int numThreads = 4;
omp_set_num_threads(numThreads);
#pragma omp parallel
     {
          int tid = omp_get_thread_num(); 
          std::cout << "Hi from thread "
          << tid << '\n';
     }


STD Threads

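int numThreads = std::thread::hardware_concurrency();
if (numThreads == 0) numThreads = 1; // hardware_concurrency() may return 0 if unknown
std::vector<std::thread> threads(numThreads);
for (int ID = 0; ID < numThreads; ID++) {
     threads[ID] = std::thread(function); // function is the callable each thread runs
}
// a std::thread must be joined (or detached) before it is destroyed
for (auto& t : threads) t.join();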
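
C++11 threads have no built-in equivalent of omp_get_thread_num(), so each thread's ID usually has to be passed in by hand. The sketch below shows one way to do that, together with a fallback for the case where hardware_concurrency() returns 0. The names pick_thread_count and squares_by_thread, and the fallback value of 2, are assumptions made for this example only.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: picks a thread count, using `fallback` when
// hardware_concurrency() cannot determine one and returns 0.
unsigned pick_thread_count(unsigned fallback = 2) {
    assert(fallback > 0);
    unsigned n = std::thread::hardware_concurrency();
    return n != 0 ? n : fallback;
}

// Launches n threads and hands each one its own ID, the way
// omp_get_thread_num() would inside an OpenMP parallel region.
// Thread `id` writes id * id into slot `id`, so no two threads share a slot.
std::vector<int> squares_by_thread(unsigned n) {
    std::vector<int> out(n, 0);
    std::vector<std::thread> threads;
    threads.reserve(n);
    for (unsigned id = 0; id < n; ++id)
        threads.emplace_back([&out, id] { out[id] = static_cast<int>(id * id); });
    for (auto& t : threads) t.join();   // wait for every thread before reading out
    return out;
}
```

A typical call would be squares_by_thread(pick_thread_count()). Capturing id by value matters here: capturing the loop variable by reference would let the threads observe a value that has already changed.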


Programming Models

SPMD

An example of the SPMD programming model in STD Threads using an atomic variable to accumulate the result

 #include <iostream>
 #include <iomanip>
 #include <cstdlib>
 #include <chrono>
 #include <vector>
 #include <thread>
 #include <atomic>
 using namespace std::chrono;

 std::atomic<double> pi{0.0};

 void reportTime(const char* msg, steady_clock::duration span) {
     auto ms = duration_cast<milliseconds>(span);
     std::cout << msg << " - took - " <<
     ms.count() << " milliseconds" << std::endl;
 }
 void run(int ID, double stepSize, int nthrds, int n)
 {
     double x;
     double sum = 0.0;
     for (int i = ID; i < n; i = i + nthrds){
      	   x = (i + 0.5)*stepSize;
          sum += 4.0 / (1.0 + x*x);
     }
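     sum = sum * stepSize;
     // add the partial sum atomically; pi = pi + sum is a separate load and store, so updates could be lost
     double expected = pi.load();
     while (!pi.compare_exchange_weak(expected, expected + sum)) {}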
 }

 int main(int argc, char** argv) {
    if (argc != 3) {
         std::cerr << argv[0] << ": invalid number of arguments\n";
         return 1;
    }

    int n = atoi(argv[1]);
    int numThreads = atoi(argv[2]);

    steady_clock::time_point ts, te;

    // calculate pi by integrating the area under 1/(1 + x^2) in n steps 
    ts = steady_clock::now();

    std::vector<std::thread> threads(numThreads);

    double stepSize = 1.0 / (double)n;

    for (int ID = 0; ID < numThreads; ID++) {
         // every thread strides by the number of threads actually launched
         threads[ID] = std::thread(run, ID, stepSize, numThreads, n);
    }

    for (int i = 0; i < numThreads; i++){
         threads[i].join();
    }

    // stop the clock only after all threads have finished
    te = steady_clock::now();
	
    std::cout << "n = " << n << std::fixed << std::setprecision(15) << "\n pi(exact) = " << 3.141592653589793 << "\n pi(calcd) = " << pi << std::endl;

    reportTime("Integration", te - ts);

    // terminate
    char c;
    std::cout << "Press Enter key to exit ... ";
    std::cin.get(c);
 }
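
For comparison, the same integration can be sketched in OpenMP, where a reduction clause takes the place of the per-thread partial sums and the atomic accumulator. The function name pi_openmp is hypothetical. The sketch assumes the compiler's OpenMP support is enabled (for example with -fopenmp on GCC); without it the pragma is ignored and the loop simply runs serially with the same result.

```cpp
#include <cassert>

// Hypothetical OpenMP counterpart of the std::thread example:
// integrates 4 / (1 + x^2) over [0, 1] with the midpoint rule in n steps.
double pi_openmp(int n) {
    assert(n > 0);
    const double stepSize = 1.0 / static_cast<double>(n);
    double sum = 0.0;
    // each thread accumulates a private copy of sum;
    // OpenMP combines the copies when the loop ends
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; ++i) {
        const double x = (i + 0.5) * stepSize;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * stepSize;
}
```

Here OpenMP handles thread creation, work distribution and the final combination, all of which the std::thread version has to spell out by hand.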

Question & Answer

Can one safely use C++11 multi-threading as well as OpenMP in one and the same program but without interleaving them (i.e. no OpenMP statement in any code passed to C++11 concurrent features and no C++11 concurrency in threads spawned by OpenMP)?


On some platforms, an efficient implementation can only be achieved if the OpenMP run-time is the only one in control of the process threads. There are also certain aspects of OpenMP that might not play well with other threading constructs, for example the limit on the number of threads set by OMP_THREAD_LIMIT when forking two or more concurrent parallel regions. Since the OpenMP standard does not strictly forbid using other threading paradigms, but does not standardise interoperability with them either, supporting such functionality is up to the implementers. This means that some implementations might provide safe concurrent execution of top-level OpenMP regions and some might not. The x86 implementers pledge to support it, maybe because most of them are also proponents of other execution models (e.g. Intel with Cilk and TBB, GCC with C++11, etc.) and because x86 is usually considered an "experimental" platform (other vendors are usually much more conservative).
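
The non-interleaved pattern the question describes might look like the sketch below: one phase uses only C++11 threads and joins all of them before a second phase uses only OpenMP. The function names are hypothetical, and whether the combination is actually safe still depends on the OpenMP implementation, as discussed above.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Phase 1: C++11 threads only. Thread `id` fills elements id, id + nthreads, ...
void fill_with_threads(std::vector<int>& data, int nthreads) {
    assert(nthreads > 0);
    std::vector<std::thread> threads;
    for (int id = 0; id < nthreads; ++id)
        threads.emplace_back([&data, id, nthreads] {
            for (std::size_t i = static_cast<std::size_t>(id); i < data.size();
                 i += static_cast<std::size_t>(nthreads))
                data[i] = 1;
        });
    for (auto& t : threads) t.join();   // no std::thread is left running after this
}

// Phase 2: OpenMP only. No C++11 concurrency is used inside the parallel loop.
long long sum_with_openmp(const std::vector<int>& data) {
    long long total = 0;
    const long long n = static_cast<long long>(data.size());
    #pragma omp parallel for reduction(+ : total)
    for (long long i = 0; i < n; ++i)
        total += data[i];
    return total;
}
```

Calling fill_with_threads and then sum_with_openmp from the main thread keeps the two models strictly sequential: neither runtime ever sees threads created by the other while its own region is active.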


OpenMP code

//Workshop 3 using the scan and reduce with openMp

template <typename T, typename R, typename C, typename S>
int scan(
     const T* in,   // source data
     T* out,        // output data
     int size,      // size of source, output data sets
     R reduce,      // reduction expression
     C combine,     // combine expression
     S scan_fn,     // scan function (exclusive or inclusive)
     T initial      // initial value
)
{
     // each thread reduces its own tile, a single thread then scans the
     // per-tile results, and finally each thread scans its own tile
     int nthreads = 1;
     if (size > 0) {
          // requested number of tiles
          int max_threads = omp_get_max_threads();
          T* reduced = new T[max_threads];
          T* scanRes = new T[max_threads];
          #pragma omp parallel
          {
               int ntiles = omp_get_num_threads(); // Number of tiles
               int itile = omp_get_thread_num();   // this thread's tile
               int tile_size = (size - 1) / ntiles + 1;
               int last_tile = ntiles - 1;
               int last_tile_size = size - last_tile * tile_size;
               if (itile == 0)
                    nthreads = ntiles;
               // step 1 - each thread reduces its own tile
               reduced[itile] = reduce(in + itile * tile_size,
                    itile == last_tile ? last_tile_size : tile_size, combine, T(0));
               // wait until every tile has been reduced
               #pragma omp barrier
               // step 2 - one thread performs exclusive scan on all tiles using reduction outputs
               // store results in scanRes[]
               #pragma omp single
               excl_scan(reduced, scanRes, ntiles, combine, T(0));
               // step 3 - each thread scans its own tile using scanRes[]
               // (the implicit barrier after single guarantees scanRes[] is ready)
               scan_fn(in + itile * tile_size, out + itile * tile_size,
                    itile == last_tile ? last_tile_size : tile_size, combine,
                    scanRes[itile]);
          }
          delete[] reduced;
          delete[] scanRes;
     }
     return nthreads;
}
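
The same three steps can be sketched with C++11 threads, where joining the threads between steps plays the role of the OpenMP barrier. This is a simplified illustration rather than the workshop code: it is hard-wired to an inclusive prefix sum over int, and the names for_each_tile and tiled_incl_scan are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Runs fn(tile) on one std::thread per tile and joins them all.
template <typename F>
void for_each_tile(int ntiles, F fn) {
    std::vector<std::thread> threads;
    for (int t = 0; t < ntiles; ++t) threads.emplace_back(fn, t);
    for (auto& th : threads) th.join();
}

// Hypothetical three-step tiled inclusive prefix sum (addition only).
std::vector<int> tiled_incl_scan(const std::vector<int>& in, int ntiles) {
    assert(ntiles > 0);
    const int size = static_cast<int>(in.size());
    std::vector<int> out(size);
    if (size == 0) return out;
    const int tile_size = (size - 1) / ntiles + 1;
    // clamp the bounds so that tiles past the end of the data are empty
    auto tile_begin = [&](int t) { return std::min(t * tile_size, size); };
    auto tile_end = [&](int t) { return std::min((t + 1) * tile_size, size); };

    std::vector<int> reduced(ntiles, 0), offsets(ntiles, 0);
    // step 1 - reduce each tile on its own thread
    for_each_tile(ntiles, [&](int t) {
        int sum = 0;
        for (int i = tile_begin(t); i < tile_end(t); ++i) sum += in[i];
        reduced[t] = sum;
    });
    // step 2 - serial exclusive scan over the per-tile sums
    for (int t = 1; t < ntiles; ++t) offsets[t] = offsets[t - 1] + reduced[t - 1];
    // step 3 - scan each tile on its own thread, starting from its offset
    for_each_tile(ntiles, [&](int t) {
        int running = offsets[t];
        for (int i = tile_begin(t); i < tile_end(t); ++i) {
            running += in[i];
            out[i] = running;
        }
    });
    return out;
}
```

Unlike the OpenMP version, this sketch creates and joins a fresh set of threads for step 1 and again for step 3, which is simpler to read but pays the thread-creation cost twice.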

C++11 code

#include <iostream>
#include <omp.h>
#include <chrono>
#include <vector>
#include <thread>
using namespace std;
void doNothing() {}
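// times 99999 rounds of creating and joining 16 threads (algorithmToRun == 1)
// or of running a 16-thread OpenMP parallel for (algorithmToRun == 2);
// returns the elapsed time in seconds
double run(int algorithmToRun)
{
   // steady_clock is monotonic, which makes it the safer choice for measuring intervals
   auto startTime = std::chrono::steady_clock::now();
   for(int j=1; j<100000; ++j)
   {
       if(algorithmToRun == 1)
       {
           vector<thread> threads;
           for(int i=0; i<16; i++)
           {
               threads.push_back(thread(doNothing));
           }
           for(auto& t : threads) t.join();
       }
       else if(algorithmToRun == 2)
       {
           #pragma omp parallel for num_threads(16)
           for(unsigned i=0; i<16; i++)
           {
               doNothing();
           }
       }
   }
   auto endTime = std::chrono::steady_clock::now();
   std::chrono::duration<double> elapsed_seconds = endTime - startTime;
   // return a double so that fractions of a second are not truncated
   return elapsed_seconds.count();
}
int main()
{
   double cppt = run(1);
   double ompt = run(2);
   cout<<cppt<<endl;
   cout<<ompt<<endl;
   return 0;
}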