Difference between revisions of "GPU610/DPS915 Student Resources"

From CDOT Wiki
Jump to: navigation, search
(Student Resources)
(Visual Studio 2017 and CUDA 9.1 Problem)
 
(44 intermediate revisions by 11 users not shown)
Line 1: Line 1:
{{GPU610/DPS915 Index | 20123}}
+
{{GPU610/DPS915 Index | 20171}}
=Student Resources=
+
 
 
The purpose of this page is to share useful information that can help groups with their CUDA projects.
 
The purpose of this page is to share useful information that can help groups with their CUDA projects.
  
==BLAS Documentation --- This section is a work in progress -- please do not edit [jboelen] ==
+
= CUDA Enabled Cards =
 +
[http://en.wikipedia.org/wiki/CUDA#Supported_GPUs List @ CUDA Wiki]
  
Note: This information applies to the '''gsl_cblas''' library 
+
= Workshop Notes =
 +
==BLAS Documentation==
 +
See the [[GPU610/DPS915_BLAS_Documentation | BLAS Documentation Page]]
  
There are 2 main functions to use
+
For Documentation on Apple's implementation of BLAS see their
 +
[https://developer.apple.com/library/mac/documentation/Accelerate/Reference/BLAS_Ref/Reference/reference.html docs] which are very easy to read and navigate.
  
===segmv===
+
==Getting Started on Mac==
void '''cblas_sgemv''' (''const enum CBLAS_ORDER order, const enum CBLAS_TRANSPOSE TransA, const int M, const int N, const float alpha, const float * A, const int lda, const float * x, const int incx, const float beta, float * y, const int incy'')<ref>http://www.gnu.org/software/gsl/manual/html_node/Level-2-CBLAS-Functions.html</ref>
+
http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_Getting_Started_Mac.pdf
  
'''''order'''''
+
http://developer.nvidia.com/cuda/cuda-downloads
:Whether matrices are row major order (C-Style) for column major order (Fortran-style). One of enum CblasRowMajor or CblasColMajor.
 
  
'''''TransA'''''
+
==Makefile Documentation==
:Whether to transpose matrix A. One of enum CblasNoTrans, CBlasTrans.
+
See the [[GPU610/DPS915_Makefile_Documentation | Makefile Documentation Page]]
  
'''''M'''''
+
====Troubleshooting====
:represents:
+
Problem with CUDA driver version 5.0.24 on MacBook Pro 2012 [http://blogs.adobe.com/premiereprotraining/2012/08/known-issues-with-cuda-5-0-17-driver-including-crashes-and-kernel-panics.html Fix]
::the number of rows in input matrix a
 
:The number of rows must be greater than or equal to zero, and less than the leading dimension of the matrix a (specified in lda)
 
 
 
'''''N'''''
 
:represents:
 
::the number of columns in input matrix a
 
:The number of columns must be greater than or equal to zero.
 
  
'''''alpha'''''
+
==Visual Studio Common Problems & Solutions==
:is the scaling constant for matrix a
+
=== Cannot Open cublas.lib ===
  
'''''A'''''
+
"lnk1104 cannot open cublas.lib"
:is the input matrix of float (for sgemv) or double (for dgemv) values
 
  
'''''lda'''''
+
Add cublas.lib to Linker Input.
:is the leading dimension of the array specified by a. The leading dimension must be greater than zero. The leading dimension must be greater than or equal to 1 and greater than or equal to the value specified in m.
 
  
'''''x'''''
+
If you included cublas.lib in the linker configuration, but VS cannot find the file you may be building
:is the input vector of float (for sgemv) or double (for dgemv) values.
+
in Win32 instead of x64 (depending on CUDA installation directory). To build in x64:
 +
# Project Properties (alt+enter)
 +
# Click Configuration Options button
 +
# Active Solution Platform -> Change the dropdown box to x64
 +
# Close, OK
 +
# Try to build now
 +
(Boris Bershadsky + Yehoshua Ghitis)
  
'''''incx'''''
+
=== Cuda Win32/x64 Library ===
:is the stride for vector x. It can have any value.
 
  
'''''beta'''''
+
After following the instructions,,provided in today's lecture, to setting up the library and include files in the project properties to run Cuda on VS 2012 Express at home, I still encounter
:is the scaling constant for vector y
+
the linker error; "unable to find cuda_runtime.h". Googling around, there are two ways around this. By default, VS Studio uses the 32bit debugger, which you can change in project properties. You will have to
 +
use the Win32 version of the library directives (ie in my case "C:\Program Files\NVIDIA Corporation\NvToolsExt\lib\Win32") with the default debugger. If use the x64 library files, change the debugger to 64bit (which I neglected and lost a good portion of time). Cheers.
  
'''''y'''''
+
-- Peter Huang
:is the output vector of float (for sgemv) or double (for dgemv) values.
 
  
'''''incy'''''
+
=Ubuntu 12.04 LTS and CUDA 5 Toolkit Installation Guide=
:is the stride for vector y. It must not be zero.
+
[http://zenit.senecac.on.ca/wiki/index.php/GPU610/DPS915_Ubuntu_and_CUDA_Installation See the guide here; work in progress]
  
===segmm===
+
== SVGALIBS - Graphics Library ==
void '''cblas_sgemm''' (''const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_TRANSPOSE TransB, const int M, const int N, const int K, const float alpha, const float * A, const int lda, const float * B, const int ldb, const float beta, float * C, const int ldc'') <ref> http://www.gnu.org/software/gsl/manual/html_node/Level-3-CBLAS-Functions.html</ref>
+
This library is a Linux graphics library and thus will not work on windows (I have tried very briefly on finding a way but could not for the reason that Windows does not have X11/xorgs/linux tty devices). The program needs to be run on a Linux machine because it is using svgalibs which is an archaic way to display stuff on the linux screen (from quick google search on the svga library).
  
'''''M'''''
+
[http://www.svgalib.org/ svgalibs link]
: is the number of  Rows in matrices A and C
 
:M must be greater than or equal to zero.  
 
  
'''''N'''''
+
== nvcc cannot find header files ==
:is the number of Columns in Matrices B and C
+
a.k.a. Dun Goofing where nvcc locates its header files - as experienced by Neil Guzman
  
'''''K'''''
+
Find nvcc.profile (usually located in "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\bin") and replace everything inside it with this (if you have not changed it before):
:is the number of Columns in matrix A and Rows in matrix B
+
<pre>
  
'''''alpha'''''
+
TOP              = $(_HERE_)/..
:is the scaling constant for matrix a
 
  
'''''A'''''
+
PATH            += $(TOP)/open64/bin;$(TOP)/nvvm;$(_HERE_);$(TOP)/lib;
:is the input matrix a of float (for sgemm).
 
  
'''''lda'''''
+
INCLUDES        +=  "-I$(TOP)/include" "-I$(TOP)/include/cudart" "-IZ:/Program Files/Microsoft Visual Studio 11.0/VC/include" $(_SPACE_)
:is the leading dimension of the array specified by a. The leading dimension must be greater than zero. If transa is specified as 'N' or 'n', the leading dimension must be greater than or equal to 1. If transa is specified as 'T' or 't', the leading dimension must be greater than or equal to the value specified in m.
 
  
'''''B'''''
+
LIBRARIES        =+ $(_SPACE_) "/LIBPATH:$(TOP)/lib/$(_WIN_PLATFORM_)" cudart.lib
:is the input matrix b of float (for sgemm).
 
  
'''''ldb'''''
+
CUDAFE_FLAGS    +=
:is the leading dimension of the array specified by b. The leading dimension must be greater than zero. If transb is specified as 'N' or 'n', the leading dimension must be greater than or equal to the value specified in m. If transa is specified as 'T' or 't', the leading dimension must be greater than or equal to the value specified in n.
+
OPENCC_FLAGS    +=
 +
PTXAS_FLAGS    +=
  
'''''beta'''''
+
</pre>
:is the scaling constant for matrix c
 
  
'''''C'''''
+
The most important part to note is: "INCLUDES += ..."
:is the output matrix c of float (for sgemm) or double (for dgemm) values.
 
  
'''''ldc'''''
+
What you want to put is "-IC:/PATH/TO/THE/INCLUDE/FILES", which in my case was: "-IZ:/Program Files/Microsoft Visual Studio 11.0/VC/include".
:is the leading dimension of the array specified by c. The leading dimension must be greater than zero. If transb is specified as 'N' or 'n', the leading dimension must be greater than or equal to 0 and greater than or equal to the value specified in l.
 
  
https://scs.senecac.on.ca/~gpu610/pages/content/lines.html
+
Hope this helps anyone, as it insanely irritated me as changing up the environment path on windows did nothing.
  
==Getting Started on Mac==
+
== Dynamically Allocated Shared Memory ==
http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_Getting_Started_Mac.pdf
+
Here is a roundabout way of working around the shared memory limitations of your graphics card.
 +
The idea is to send in chunks that your kernel can handle, then keep on sending chunks until there are none to be sent. The address being sent is also being shifted based on the chunk size.
 +
<div style='color:#000000;background:#ffffff;'>
 +
    CHUNKSIZE <span style='color:#808030; '>=</span> <span style='color:#008c00; '>512</span><span style='color:#800080; '>;</span>
 +
    shared_ <span style='color:#808030; '>=</span> CHUNKSIZE <span style='color:#808030; '>*</span> <span style='color:#800000; font-weight:bold; '>sizeof</span><span style='color:#808030; '>(</span>SimBody<span style='color:#808030; '>)</span><span style='color:#800080; '>;</span>
 +
    <span style='color:#800000; font-weight:bold; '>while</span> <span style='color:#808030; '>(</span>chunks <span style='color:#808030; '>></span> <span style='color:#008c00; '>0</span><span style='color:#808030; '>)</span>
 +
    <span style='color:#800080; '>{</span>
 +
        BodyArray ba <span style='color:#808030; '>=</span> <span style='color:#800080; '>{</span> <span style='color:#808030; '>&amp;</span>arr<span style='color:#808030; '>.</span><span style='color:#603000; '>array</span><span style='color:#808030; '>[</span>index<span style='color:#808030; '>]</span><span style='color:#808030; '>,</span> CHUNKSIZE <span style='color:#800080; '>}</span><span style='color:#800080; '>;</span>
 +
        SimCalc <span style='color:#808030; '>&lt;</span><span style='color:#808030; '>&lt;</span><span style='color:#808030; '>&lt;</span> numBlocks_<span style='color:#808030; '>,</span> numThreads_<span style='color:#808030; '>,</span> shared_ <span style='color:#808030; '>></span><span style='color:#808030; '>></span><span style='color:#808030; '>></span><span style='color:#808030; '>(</span>ba<span style='color:#808030; '>)</span><span style='color:#800080; '>;</span>
 +
        cudaThreadSynchronize<span style='color:#808030; '>(</span><span style='color:#808030; '>)</span><span style='color:#800080; '>;</span>
 +
        SimTick <span style='color:#808030; '>&lt;</span><span style='color:#808030; '>&lt;</span><span style='color:#808030; '>&lt;</span> numBlocks_<span style='color:#808030; '>,</span> numThreads_<span style='color:#808030; '>,</span>  shared_ <span style='color:#808030; '>></span><span style='color:#808030; '>></span><span style='color:#808030; '>></span><span style='color:#808030; '>(</span>ba<span style='color:#808030; '>,</span> timeStep<span style='color:#808030; '>)</span><span style='color:#800080; '>;</span>
 +
        cudaThreadSynchronize<span style='color:#808030; '>(</span><span style='color:#808030; '>)</span><span style='color:#800080; '>;</span>
 +
        index <span style='color:#808030; '>+</span><span style='color:#808030; '>=</span> CHUNKSIZE<span style='color:#800080; '>;</span>
 +
        <span style='color:#808030; '>-</span><span style='color:#808030; '>-</span>chunks<span style='color:#800080; '>;</span> 
 +
    <span style='color:#800080; '>}</span>
 +
    chunks <span style='color:#808030; '>=</span>  arr<span style='color:#808030; '>.</span>size <span style='color:#808030; '>/</span> CHUNKSIZE <span style='color:#808030; '>+</span> <span style='color:#008c00; '>1</span><span style='color:#800080; '>;</span>
 +
    index <span style='color:#808030; '>=</span> <span style='color:#008c00; '>0</span><span style='color:#800080; '>;</span>
 +
</div>
  
http://developer.nvidia.com/cuda/cuda-downloads
+
= Converting Fortran Code to C Code =
 +
Sample code from the TOMO project - converted by James Boelen, Raymong Hung, and Stanley Tsang
 +
== Original Fortran Subroutine ==
 +
<pre>
 +
SUBROUTINE longtrack_self(direction,nrep,yp,xp,turnnow)
 +
!-------------------------------------------------------------------------
 +
! h: principal harmonic number
 +
! eta0: phase slip factor
 +
! E0: energy of synchronous particle m
 +
! beta0: relativistic beta of synchronous particle
 +
! phi0: synchronous phase
 +
! q: charge state of particles
 +
! dphi: phase difference between considered particle and synchronous one
 +
! denergy: energy difference between considered particle and synchronous one
 +
! nrep: pass cavity nrep times before returning data
 +
! direction: to inverse the time advance (rotation in the bucket), 1 or -1
 +
! xp and yp: time and energy in pixels
 +
! dtbin and dEbin: GLOBAL time and energy pixel size in s and MeV
 +
! omegarev0: revolution frequency
 +
! VRF1,VRF2,VRF1dot,VRF2dot: GLOBAL RF voltages and derivatives of volts
 +
! turnnow: present turn
 +
!---------------------------------------------------------------------------
 +
  IMPLICIT NONE
 +
  REAL(SP), DIMENSION(:), INTENT(INOUT) :: xp,yp
 +
  REAL(SP), DIMENSION(SIZE(xp)) :: dphi,denergy,selfvolt
 +
!HPF$ distribute dphi(block)
 +
!HPF$ align with dphi :: denergy,selfvolt,xp
 +
  INTEGER :: mm
 +
  INTEGER :: i,p,nrep,direction,turnnow
 +
  dphi=(xp+xorigin)*h*omegarev0(turnnow)*dtbin-phi0(turnnow)
 +
  denergy=(yp-yat0)*dEbin
 +
  IF (direction.GT.0) THEN
 +
    p=turnnow/dturns+1
 +
    DO i=1,nrep
 +
      forall(mm=1:size(xp)) dphi(mm)=dphi(mm)-c1(turnnow)*denergy(mm)
 +
      turnnow=turnnow+1
 +
      forall(mm=1:size(xp)) xp(mm)=dphi(mm)+phi0(turnnow)-&
 +
                                  xorigin*h*omegarev0(turnnow)*dtbin
 +
      forall(mm=1:size(xp)) xp(mm)=(xp(mm)-&
 +
        phiwrap*FLOOR(xp(mm)/phiwrap))/(h*omegarev0(turnnow)*dtbin)
 +
      forall(mm=1:size(xp)) selfvolt(mm)=vself(p,FLOOR(xp(mm))+1)
 +
      forall(mm=1:size(xp)) denergy(mm)=denergy(mm)+q*((&
 +
        (VRF1+VRF1dot*tatturn(turnnow))*SIN(dphi(mm)+phi0(turnnow))+&
 +
        (VRF2+VRF2dot*tatturn(turnnow))*&
 +
        SIN(hratio*(dphi(mm)+phi0(turnnow)-phi12)))+selfvolt(mm))-c2(turnnow)
 +
    END DO
 +
  ELSE
 +
    p=turnnow/dturns
 +
    DO i=1,nrep
 +
      forall(mm=1:size(xp)) selfvolt(mm)=vself(p,FLOOR(xp(mm))+1)
 +
      forall(mm=1:size(xp)) denergy(mm)=denergy(mm)-q*((&
 +
        (VRF1+VRF1dot*tatturn(turnnow))*SIN(dphi(mm)+phi0(turnnow))+&
 +
        (VRF2+VRF2dot*tatturn(turnnow))*&
 +
        SIN(hratio*(dphi(mm)+phi0(turnnow)-phi12)))+selfvolt(mm))+c2(turnnow)
 +
      turnnow=turnnow-1
 +
      forall(mm=1:size(xp)) dphi(mm)=dphi(mm)+c1(turnnow)*denergy(mm)
 +
      forall(mm=1:size(xp)) xp(mm)=dphi(mm)+phi0(turnnow)-&
 +
                                  xorigin*h*omegarev0(turnnow)*dtbin
 +
      forall(mm=1:size(xp)) xp(mm)=(xp(mm)-&
 +
        phiwrap*FLOOR(xp(mm)/phiwrap))/(h*omegarev0(turnnow)*dtbin)
 +
    END DO
 +
  END IF
 +
  yp=denergy/dEbin+yat0
 +
END SUBROUTINE longtrack_self
 +
</pre>
 +
== Modified Fortran Subroutine ==
 +
<pre>
 +
SUBROUTINE longtrack_self(direction,nrep,yp,xp,turnnow)
 +
!-------------------------------------------------------------------------
 +
! h: principal harmonic number
 +
! eta0: phase slip factor
 +
! E0: energy of synchronous particle
 +
! beta0: relativistic beta of synchronous particle
 +
! phi0: synchronous phase
 +
! q: charge state of particles
 +
! dphi: phase difference between considered particle and synchronous one
 +
! denergy: energy difference between considered particle and synchronous one
 +
! nrep: pass cavity nrep times before returning data
 +
! direction: to inverse the time advance (rotation in the bucket), 1 or -1
 +
! xp and yp: time and energy in pixels
 +
! dtbin and dEbin: GLOBAL time and energy pixel size in s and MeV
 +
! omegarev0: revolution frequency
 +
! VRF1,VRF2,VRF1dot,VRF2dot: GLOBAL RF voltages and derivatives of volts
 +
! turnnow: present turn
 +
!---------------------------------------------------------------------------
 +
  IMPLICIT NONE
 +
  REAL(SP), DIMENSION(:), INTENT(INOUT) :: xp,yp
 +
  REAL(SP), DIMENSION(SIZE(xp)) :: dphi,denergy,selfvolt
 +
!HPF$ distribute dphi(block)
 +
!HPF$ align with dphi :: denergy,selfvolt,xp
 +
  INTEGER :: mm
 +
  INTEGER :: i,p,nrep,direction,turnnow
 +
  CALL gputrack_self(direction,nrep,yp,xp,turnnow, &
 +
  SIZE(xp),dphi,denergy, &
 +
    c1, &
 +
    c2, &
 +
    dEbin, &
 +
    dtbin, &
 +
    h, &
 +
    hratio, &
 +
    omegarev0, &
 +
    phi0, &
 +
    phi12, &
 +
    q, &
 +
    tatturn, &
 +
    VRF1, &
 +
    VRF1dot, &
 +
    VRF2, &
 +
    VRF2dot, &
 +
    xorigin, &
 +
    yat0, &
 +
    p, &
 +
    dturns, &
 +
    phiwrap, &
 +
    selfvolt, &
 +
    profilecount-1, &
 +
    wraplength, &
 +
    vself )
 +
END SUBROUTINE longtrack_self
 +
</pre>
  
====Troubleshooting====
+
== New C Function ==
Problem with CUDA driver version 5.0.24 on MacBook Pro 2012 [http://blogs.adobe.com/premiereprotraining/2012/08/known-issues-with-cuda-5-0-17-driver-including-crashes-and-kernel-panics.html Fix]
+
<pre>
 +
#include <stdio.h>
 +
#include <math.h>
  
 +
void gputrack_self_ ( \
 +
    int  *direction, \
 +
    int  *nrep, \
 +
    float *yp, \
 +
    float *xp, \
 +
    int  *turnnow, \
 +
    int  *sizeofarrays, \
 +
    float *dphi, \
 +
    float *denergy, \
 +
    float *c1, \
 +
    float *c2, \
 +
    float *dEbin, \
 +
    float *dtbin, \
 +
    float *h, \
 +
    float *hratio, \
 +
    float *omegarev0, \
 +
    float *phi0, \
 +
    float *phi12, \
 +
    float *q, \
 +
    float *tatturn, \
 +
    float *VRF1, \
 +
    float *VRF1dot, \
 +
    float *VRF2, \
 +
    float *VRF2dot, \
 +
    float *xorigin, \
 +
    float *yat0, \
 +
    int *p, \
 +
    int *dturns, \
 +
    float *phiwrap, \
 +
    float *selfvolt, \
 +
    int *vselfDimRow, \
 +
    int *vselfDimCol, \
 +
    float *vself \
 +
)
 +
{
 +
    /* Local Variables */
 +
    int l,i,mm,t;
 +
    l = *sizeofarrays;
 +
    t = *turnnow;
 +
 
 +
 
 +
    // longtrack_self specific local variables
 +
    int cp;
 +
    cp = *p;
 +
 
 +
    /*  dphi=(xp+xorigin)*h*omegarev0(turnnow)*dtbin-phi0(turnnow) */
 +
    for(mm = 0; mm < l; mm++) {
 +
        dphi[mm] = (xp[mm] + *xorigin) * *h * omegarev0[t] * *dtbin - phi0[t];
 +
    }
 +
 
 +
    /*  denergy=(yp-yat0)*dEbin */
 +
    for(mm = 0; mm < l; mm++) {
 +
        denergy[mm] = (yp[mm] - *yat0) * *dEbin;
 +
    }
  
 +
    /*  IF (direction.GT.0) THEN */
 +
    if (*direction > 0) {
 +
        /* p=turnnow/dturns+1 */
 +
        cp = t / *dturns + 1;
 +
        /* DO i=1,nrep */
 +
        for(i = 1; i <= *nrep; i++) {
 +
            /* forall(mm=1:size(xp)) dphi(mm)=dphi(mm)-c1(turnnow)*denergy(mm) */
 +
            for(mm=0;mm<l;mm++) {
 +
                dphi[mm] = dphi[mm] - c1[t] *denergy[mm];
 +
            }
 +
            /* turnnow=turnnow+1 */
 +
            t=t+1;
 +
            /* forall(mm=1:size(xp)) xp(mm)=dphi(mm)+phi0(turnnow)-&
 +
                xorigin*h*omegarev0(turnnow)*dtbin */
 +
            for(mm=0;mm<l;mm++) {
 +
                xp[mm] = dphi[mm] + phi0[t] - \
 +
                *xorigin * *h * omegarev0[t] * *dtbin;
 +
            }
 +
            /* forall(mm=1:size(xp)) xp(mm)=(xp(mm)-&
 +
                phiwrap*FLOOR(xp(mm)/phiwrap))/(h*omegarev0(turnnow)*dtbin) */
 +
            for(mm = 0; mm < l; mm++) {
 +
                xp[mm] = (xp[mm] - \
 +
                *phiwrap * floor(xp[mm] / *phiwrap)) / (*h * omegarev0[t] * *dtbin);
 +
            }
 +
            /* forall(mm=1:size(xp)) selfvolt(mm)=vself(p,FLOOR(xp(mm))+1) */
 +
            for(mm = 0; mm < l; mm++) {
 +
                int itemp = floor(xp[mm]);
 +
                selfvolt[mm] = vself[(*vselfDimRow * (itemp)) + (cp-1)];
 +
            }
 +
            /* forall(mm=1:size(xp)) denergy(mm)=denergy(mm)+q*((&
 +
                (VRF1+VRF1dot*tatturn(turnnow))*SIN(dphi(mm)+phi0(turnnow))+&
 +
                (VRF2+VRF2dot*tatturn(turnnow))*&
 +
                SIN(hratio*(dphi(mm)+phi0(turnnow)-phi12)))+selfvolt(mm))-c2(turnnow) */
 +
            for(mm = 0; mm < l; mm++) {
 +
                denergy[mm] = denergy[mm] + *q *(( \
 +
                (*VRF1 + *VRF1dot * tatturn[t]) * sin(dphi[mm] + phi0[t]) + \
 +
                (*VRF2 + *VRF2dot * tatturn[t]) * \
 +
                sin(*hratio * (dphi[mm] + phi0[t] - *phi12))) + selfvolt[mm]) -c2[t];
 +
            }
 +
        /*    END DO */
 +
        }
 +
    }
 +
      else {
 +
        // p=turnnow/dturns
 +
        cp = t / *dturns;
 +
        // DO i=1,nrep
 +
        for (i=1;i<=*nrep;i++) {
 +
            // forall(mm=1:size(xp)) selfvolt(mm)=vself(p,FLOOR(xp(mm))+1)
 +
            for(mm = 0; mm < l; mm++) {
 +
                int itemp = (int)floor(xp[mm]);
 +
                selfvolt[mm] = vself[(*vselfDimRow*(itemp)) + (cp-1)];
 +
            }
 +
            /* forall(mm=1:size(xp)) denergy(mm)=denergy(mm)-q*((&
 +
                (VRF1+VRF1dot*tatturn(turnnow))*SIN(dphi(mm)+phi0(turnnow))+&
 +
                (VRF2+VRF2dot*tatturn(turnnow))*&
 +
                SIN(hratio*(dphi(mm)+phi0(turnnow)-phi12)))+selfvolt(mm))+c2(turnnow) */
 +
            for(mm = 0; mm < l; mm++) {
 +
                denergy[mm]=denergy[mm] - *q *(( \
 +
                (*VRF1 + *VRF1dot * tatturn[t]) *sin(dphi[mm] + phi0[t]) + \
 +
                (*VRF2 + *VRF2dot * tatturn[t]) * \
 +
                sin(*hratio * (dphi[mm] + phi0[t] - *phi12))) + selfvolt[mm]) + c2[t];
 +
            }
 +
            // turnnow=turnnow-1
 +
            t--;
 +
            /* forall(mm=1:size(xp)) dphi(mm)=dphi(mm)-c1(turnnow)*denergy(mm) */
 +
            for(mm = 0; mm < l; mm++) {
 +
                dphi[mm]=dphi[mm] + c1[t] * denergy[mm];
 +
            }
 +
            /* forall(mm=1:size(xp)) xp(mm)=dphi(mm)+phi0(turnnow)-&
 +
                xorigin*h*omegarev0(turnnow)*dtbin */
 +
            for(mm = 0; mm < l; mm++) {
 +
                xp[mm] = dphi[mm] + phi0[t] - \
 +
                *xorigin * *h * omegarev0[t] * *dtbin;
 +
            }
 +
            /* forall(mm=1:size(xp)) xp(mm)=(xp(mm)-&
 +
                phiwrap*FLOOR(xp(mm)/phiwrap))/(h*omegarev0(turnnow)*dtbin) */
 +
            for(mm = 0; mm < l; mm++) {
 +
                xp[mm] = (xp[mm] - \
 +
                *phiwrap * floor(xp[mm] / *phiwrap)) / (*h * omegarev0[t] * *dtbin);
 +
            }
 +
        }
 +
    }
 +
 
 +
    // yp=denergy/dEbin+yat0
 +
    for(mm=0; mm<l; mm++) {
 +
        yp[mm] = denergy[mm] / *dEbin + *yat0;
 +
    } 
  
==References==
+
    *turnnow = t;
 +
 
 +
  return;
 +
}
 +
</pre>
  
<references/>
+
= Visual Studio 2017 and CUDA 9.1 Problem =
 +
I ran into this problem when trying to build '''thrust_sort.cu''' in the Thrust lecture. The only way I was able to build and run successfully was to create a '''CUDA 9.1 project'''. However, in the current version of Visual Studio 2017, unless you set the '''Platform Toolset''' to '''Visual Studio 2015 (v140)''', you will not be able to build and run CUDA 9.1 projects. This can be done by going to project properties, then to the General section, then changing the '''Platform Toolset'''. However, this is where I ran into a problem where Visual Studio would display an error and would not let me change the platform toolset. So I came up with the following workaround and it works:
 +
*If you haven't already done so, install the optional '''Visual Studio 2015 (v140)''' component which is available from the Visual Studio 2017 installer.
 +
*From Visual Studio, create a CUDA 9.1 project, then close the solution.
 +
*Using a text editor, open <project name>.vcxproj
 +
*Add the following as the first element in the XML under the '''Project''' tag:
 +
  <nowiki><PropertyGroup>
 +
    <CUDAPropsPath Condition="'$(CUDAPropsPath)'==''">$(VCTargetsPath)\BuildCustomizations</CUDAPropsPath>
 +
</PropertyGroup></nowiki>
 +
*Replace all occurrences (there are 2 of them) of v141 with v140.
 +
*Search for "CUDA 9.1" (you will find 2 occurrences). Then replace the first entire line with <code><nowiki><Import Project="$(CUDAPropsPath)\CUDA 9.1.props" /></nowiki></code> and the second entire line with <code><nowiki><Import Project="$(CUDAPropsPath)\CUDA 9.1.targets" /></nowiki></code>.
 +
*Close the file in the text editor then re-open the solution in Visual Studio. You should now be able to add your .cu files, build and run.

Latest revision as of 18:40, 22 February 2018


GPU610/DPS915 | Student List | Group and Project Index | Student Resources | Glossary

The purpose of this page is to share useful information that can help groups with their CUDA projects.

CUDA Enabled Cards

List @ CUDA Wiki

Workshop Notes

BLAS Documentation

See the BLAS Documentation Page

For Documentation on Apple's implementation of BLAS see their docs which are very easy to read and navigate.

Getting Started on Mac

http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_Getting_Started_Mac.pdf

http://developer.nvidia.com/cuda/cuda-downloads

Makefile Documentation

See the Makefile Documentation Page

Troubleshooting

Problem with CUDA driver version 5.0.24 on MacBook Pro 2012 Fix

Visual Studio Common Problems & Solutions

Cannot Open cublas.lib

"lnk1104 cannot open cublas.lib"

Add cublas.lib to Linker Input.

If you included cublas.lib in the linker configuration, but VS cannot find the file you may be building in Win32 instead of x64 (depending on CUDA installation directory). To build in x64:

  1. Project Properties (alt+enter)
  2. Click Configuration Options button
  3. Active Solution Platform -> Change the dropdown box to x64
  4. Close, OK
  5. Try to build now

(Boris Bershadsky + Yehoshua Ghitis)

Cuda Win32/x64 Library

After following the instructions,,provided in today's lecture, to setting up the library and include files in the project properties to run Cuda on VS 2012 Express at home, I still encounter the linker error; "unable to find cuda_runtime.h". Googling around, there are two ways around this. By default, VS Studio uses the 32bit debugger, which you can change in project properties. You will have to use the Win32 version of the library directives (ie in my case "C:\Program Files\NVIDIA Corporation\NvToolsExt\lib\Win32") with the default debugger. If use the x64 library files, change the debugger to 64bit (which I neglected and lost a good portion of time). Cheers.

-- Peter Huang

Ubuntu 12.04 LTS and CUDA 5 Toolkit Installation Guide

See the guide here; work in progress

SVGALIBS - Graphics Library

This library is a Linux graphics library and thus will not work on windows (I have tried very briefly on finding a way but could not for the reason that Windows does not have X11/xorgs/linux tty devices). The program needs to be run on a Linux machine because it is using svgalibs which is an archaic way to display stuff on the linux screen (from quick google search on the svga library).

svgalibs link

nvcc cannot find header files

a.k.a. Dun Goofing where nvcc locates its header files - as experienced by Neil Guzman

Find nvcc.profile (usually located in "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\bin") and replace everything inside it with this (if you have not changed it before):


TOP              = $(_HERE_)/..

PATH            += $(TOP)/open64/bin;$(TOP)/nvvm;$(_HERE_);$(TOP)/lib;

INCLUDES        +=  "-I$(TOP)/include" "-I$(TOP)/include/cudart" "-IZ:/Program Files/Microsoft Visual Studio 11.0/VC/include" $(_SPACE_) 

LIBRARIES        =+ $(_SPACE_) "/LIBPATH:$(TOP)/lib/$(_WIN_PLATFORM_)" cudart.lib

CUDAFE_FLAGS    +=
OPENCC_FLAGS    +=
PTXAS_FLAGS     +=

The most important part to note is: "INCLUDES += ..."

What you want to put is "-IC:/PATH/TO/THE/INCLUDE/FILES", which in my case was: "-IZ:/Program Files/Microsoft Visual Studio 11.0/VC/include".

Hope this helps anyone, as it insanely irritated me as changing up the environment path on windows did nothing.

Dynamically Allocated Shared Memory

Here is a roundabout way of working around the shared memory limitations of your graphics card. The idea is to send in chunks that your kernel can handle, then keep on sending chunks until there are none to be sent. The address being sent is also being shifted based on the chunk size.

   CHUNKSIZE = 512;
   shared_ = CHUNKSIZE * sizeof(SimBody);
   while (chunks > 0)
   {
       BodyArray ba = { &arr.array[index], CHUNKSIZE };
       SimCalc <<< numBlocks_, numThreads_, shared_ >>>(ba);
       cudaThreadSynchronize();
       SimTick <<< numBlocks_, numThreads_,  shared_ >>>(ba, timeStep);
       cudaThreadSynchronize();
       index += CHUNKSIZE;
       --chunks;   
   }
   chunks =  arr.size / CHUNKSIZE + 1;
   index = 0;

Converting Fortran Code to C Code

Sample code from the TOMO project - converted by James Boelen, Raymong Hung, and Stanley Tsang

Original Fortran Subroutine

SUBROUTINE longtrack_self(direction,nrep,yp,xp,turnnow)
!-------------------------------------------------------------------------
! h: principal harmonic number
! eta0: phase slip factor
! E0: energy of synchronous particle m
! beta0: relativistic beta of synchronous particle
! phi0: synchronous phase
! q: charge state of particles
! dphi: phase difference between considered particle and synchronous one
! denergy: energy difference between considered particle and synchronous one
! nrep: pass cavity nrep times before returning data
! direction: to inverse the time advance (rotation in the bucket), 1 or -1
! xp and yp: time and energy in pixels
! dtbin and dEbin: GLOBAL time and energy pixel size in s and MeV
! omegarev0: revolution frequency
! VRF1,VRF2,VRF1dot,VRF2dot: GLOBAL RF voltages and derivatives of volts
! turnnow: present turn
!---------------------------------------------------------------------------
  IMPLICIT NONE
  REAL(SP), DIMENSION(:), INTENT(INOUT) :: xp,yp
  REAL(SP), DIMENSION(SIZE(xp)) :: dphi,denergy,selfvolt
!HPF$ distribute dphi(block)
!HPF$ align with dphi :: denergy,selfvolt,xp
  INTEGER :: mm
  INTEGER :: i,p,nrep,direction,turnnow
  dphi=(xp+xorigin)*h*omegarev0(turnnow)*dtbin-phi0(turnnow)
  denergy=(yp-yat0)*dEbin
  IF (direction.GT.0) THEN
    p=turnnow/dturns+1
    DO i=1,nrep
      forall(mm=1:size(xp)) dphi(mm)=dphi(mm)-c1(turnnow)*denergy(mm)
      turnnow=turnnow+1
      forall(mm=1:size(xp)) xp(mm)=dphi(mm)+phi0(turnnow)-&
                                   xorigin*h*omegarev0(turnnow)*dtbin
      forall(mm=1:size(xp)) xp(mm)=(xp(mm)-&
        phiwrap*FLOOR(xp(mm)/phiwrap))/(h*omegarev0(turnnow)*dtbin)
      forall(mm=1:size(xp)) selfvolt(mm)=vself(p,FLOOR(xp(mm))+1)
      forall(mm=1:size(xp)) denergy(mm)=denergy(mm)+q*((&
        (VRF1+VRF1dot*tatturn(turnnow))*SIN(dphi(mm)+phi0(turnnow))+&
        (VRF2+VRF2dot*tatturn(turnnow))*&
        SIN(hratio*(dphi(mm)+phi0(turnnow)-phi12)))+selfvolt(mm))-c2(turnnow)
    END DO
  ELSE
    p=turnnow/dturns
    DO i=1,nrep
      forall(mm=1:size(xp)) selfvolt(mm)=vself(p,FLOOR(xp(mm))+1)
      forall(mm=1:size(xp)) denergy(mm)=denergy(mm)-q*((&
        (VRF1+VRF1dot*tatturn(turnnow))*SIN(dphi(mm)+phi0(turnnow))+&
        (VRF2+VRF2dot*tatturn(turnnow))*&
        SIN(hratio*(dphi(mm)+phi0(turnnow)-phi12)))+selfvolt(mm))+c2(turnnow)
      turnnow=turnnow-1
      forall(mm=1:size(xp)) dphi(mm)=dphi(mm)+c1(turnnow)*denergy(mm)
      forall(mm=1:size(xp)) xp(mm)=dphi(mm)+phi0(turnnow)-&
                                   xorigin*h*omegarev0(turnnow)*dtbin
      forall(mm=1:size(xp)) xp(mm)=(xp(mm)-&
        phiwrap*FLOOR(xp(mm)/phiwrap))/(h*omegarev0(turnnow)*dtbin)
    END DO
  END IF
  yp=denergy/dEbin+yat0
END SUBROUTINE longtrack_self

Modified Fortran Subroutine

SUBROUTINE longtrack_self(direction,nrep,yp,xp,turnnow)
!-------------------------------------------------------------------------
! h: principal harmonic number
! eta0: phase slip factor
! E0: energy of synchronous particle
! beta0: relativistic beta of synchronous particle
! phi0: synchronous phase
! q: charge state of particles
! dphi: phase difference between considered particle and synchronous one
! denergy: energy difference between considered particle and synchronous one
! nrep: pass cavity nrep times before returning data
! direction: to inverse the time advance (rotation in the bucket), 1 or -1
! xp and yp: time and energy in pixels
! dtbin and dEbin: GLOBAL time and energy pixel size in s and MeV
! omegarev0: revolution frequency
! VRF1,VRF2,VRF1dot,VRF2dot: GLOBAL RF voltages and derivatives of volts
! turnnow: present turn
!---------------------------------------------------------------------------
  IMPLICIT NONE
  REAL(SP), DIMENSION(:), INTENT(INOUT) :: xp,yp
  REAL(SP), DIMENSION(SIZE(xp)) :: dphi,denergy,selfvolt
!HPF$ distribute dphi(block)
!HPF$ align with dphi :: denergy,selfvolt,xp
  INTEGER :: mm
  INTEGER :: i,p,nrep,direction,turnnow
  CALL gputrack_self(direction,nrep,yp,xp,turnnow, &
  SIZE(xp),dphi,denergy, &
     c1, &
     c2, &
     dEbin, &
     dtbin, &
     h, &
     hratio, &
     omegarev0, &
     phi0, &
     phi12, &
     q, &
     tatturn, &
     VRF1, &
     VRF1dot, &
     VRF2, &
     VRF2dot, &
     xorigin, &
     yat0, &
     p, &
     dturns, &
     phiwrap, &
     selfvolt, &
     profilecount-1, &
     wraplength, &
     vself )
END SUBROUTINE longtrack_self

New C Function

#include <stdio.h>
#include <math.h>

void gputrack_self_ ( \
    int  *direction, \
    int  *nrep, \
    float *yp, \
    float *xp, \
    int  *turnnow, \
    int  *sizeofarrays, \
    float *dphi, \
    float *denergy, \
    float *c1, \
    float *c2, \
    float *dEbin, \
    float *dtbin, \
    float *h, \
    float *hratio, \
    float *omegarev0, \
    float *phi0, \
    float *phi12, \
    float *q, \
    float *tatturn, \
    float *VRF1, \
    float *VRF1dot, \
    float *VRF2, \
    float *VRF2dot, \
    float *xorigin, \
    float *yat0, \
    int *p, \
    int *dturns, \
    float *phiwrap, \
    float *selfvolt, \
    int *vselfDimRow, \
    int *vselfDimCol, \
    float *vself \
)
{
    /* Local Variables */
    int l,i,mm,t;
    l = *sizeofarrays;
    t = *turnnow;
   
   
    // longtrack_self specific local variables
    int cp;
    cp = *p;
   
    /*  dphi=(xp+xorigin)*h*omegarev0(turnnow)*dtbin-phi0(turnnow) */
    for(mm = 0; mm < l; mm++) {
        dphi[mm] = (xp[mm] + *xorigin) * *h * omegarev0[t] * *dtbin - phi0[t];
    }
   
    /*  denergy=(yp-yat0)*dEbin */
    for(mm = 0; mm < l; mm++) {
        denergy[mm] = (yp[mm] - *yat0) * *dEbin;
    }

    /*   IF (direction.GT.0) THEN */
    if (*direction > 0) {
        /* p=turnnow/dturns+1 */
        cp = t / *dturns + 1;
        /* DO i=1,nrep */
        for(i = 1; i <= *nrep; i++) {
            /* forall(mm=1:size(xp)) dphi(mm)=dphi(mm)-c1(turnnow)*denergy(mm) */
            for(mm=0;mm<l;mm++) {
                dphi[mm] = dphi[mm] - c1[t] *denergy[mm];
            }
            /* turnnow=turnnow+1 */
            t=t+1;
            /* forall(mm=1:size(xp)) xp(mm)=dphi(mm)+phi0(turnnow)-&
                xorigin*h*omegarev0(turnnow)*dtbin */
            for(mm=0;mm<l;mm++) {
                xp[mm] = dphi[mm] + phi0[t] - \
                *xorigin * *h * omegarev0[t] * *dtbin;
            }
            /* forall(mm=1:size(xp)) xp(mm)=(xp(mm)-&
                phiwrap*FLOOR(xp(mm)/phiwrap))/(h*omegarev0(turnnow)*dtbin) */
            for(mm = 0; mm < l; mm++) {
                xp[mm] = (xp[mm] - \
                *phiwrap * floor(xp[mm] / *phiwrap)) / (*h * omegarev0[t] * *dtbin);
            }
            /* forall(mm=1:size(xp)) selfvolt(mm)=vself(p,FLOOR(xp(mm))+1) */
            for(mm = 0; mm < l; mm++) {
                int itemp = floor(xp[mm]);
                selfvolt[mm] = vself[(*vselfDimRow * (itemp)) + (cp-1)];
            }
            /* forall(mm=1:size(xp)) denergy(mm)=denergy(mm)+q*((&
                (VRF1+VRF1dot*tatturn(turnnow))*SIN(dphi(mm)+phi0(turnnow))+&
                (VRF2+VRF2dot*tatturn(turnnow))*&
                SIN(hratio*(dphi(mm)+phi0(turnnow)-phi12)))+selfvolt(mm))-c2(turnnow) */
            for(mm = 0; mm < l; mm++) {
                denergy[mm] = denergy[mm] + *q *(( \
                (*VRF1 + *VRF1dot * tatturn[t]) * sin(dphi[mm] + phi0[t]) + \
                (*VRF2 + *VRF2dot * tatturn[t]) * \
                sin(*hratio * (dphi[mm] + phi0[t] - *phi12))) + selfvolt[mm]) -c2[t];
            }
        /*     END DO */
        }
    }
      else {
        // p=turnnow/dturns
        cp = t / *dturns;
        // DO i=1,nrep
        for (i=1;i<=*nrep;i++) {
            // forall(mm=1:size(xp)) selfvolt(mm)=vself(p,FLOOR(xp(mm))+1)
            for(mm = 0; mm < l; mm++) {
                int itemp = (int)floor(xp[mm]);
                selfvolt[mm] = vself[(*vselfDimRow*(itemp)) + (cp-1)];
            }
            /* forall(mm=1:size(xp)) denergy(mm)=denergy(mm)-q*((&
                (VRF1+VRF1dot*tatturn(turnnow))*SIN(dphi(mm)+phi0(turnnow))+&
                (VRF2+VRF2dot*tatturn(turnnow))*&
                SIN(hratio*(dphi(mm)+phi0(turnnow)-phi12)))+selfvolt(mm))+c2(turnnow) */
            for(mm = 0; mm < l; mm++) {
                denergy[mm]=denergy[mm] - *q *(( \
                (*VRF1 + *VRF1dot * tatturn[t]) *sin(dphi[mm] + phi0[t]) + \
                (*VRF2 + *VRF2dot * tatturn[t]) * \
                sin(*hratio * (dphi[mm] + phi0[t] - *phi12))) + selfvolt[mm]) + c2[t];
            }
            // turnnow=turnnow-1
            t--;
            /* forall(mm=1:size(xp)) dphi(mm)=dphi(mm)-c1(turnnow)*denergy(mm) */
            for(mm = 0; mm < l; mm++) {
                dphi[mm]=dphi[mm] + c1[t] * denergy[mm];
            }
            /* forall(mm=1:size(xp)) xp(mm)=dphi(mm)+phi0(turnnow)-&
                xorigin*h*omegarev0(turnnow)*dtbin */
            for(mm = 0; mm < l; mm++) {
                xp[mm] = dphi[mm] + phi0[t] - \
                *xorigin * *h * omegarev0[t] * *dtbin;
            }
            /* forall(mm=1:size(xp)) xp(mm)=(xp(mm)-&
                phiwrap*FLOOR(xp(mm)/phiwrap))/(h*omegarev0(turnnow)*dtbin) */
            for(mm = 0; mm < l; mm++) {
                xp[mm] = (xp[mm] - \
                *phiwrap * floor(xp[mm] / *phiwrap)) / (*h * omegarev0[t] * *dtbin);
            }
        }
    }
   
    // yp=denergy/dEbin+yat0
    for(mm=0; mm<l; mm++) {
        yp[mm] = denergy[mm] / *dEbin + *yat0;
    }   

    *turnnow = t;
   
  return;
}

Visual Studio 2017 and CUDA 9.1 Problem

I ran into this problem when trying to build thrust_sort.cu in the Thrust lecture. The only way I was able to build and run successfully was to create a CUDA 9.1 project. However, in the current version of Visual Studio 2017, unless you set the Platform Toolset to Visual Studio 2015 (v140), you will not be able to build and run CUDA 9.1 projects. This can be done by going to project properties, then to the General section, then changing the Platform Toolset. However, this is where I ran into a problem where Visual Studio would display an error and would not let me change the platform toolset. So I came up with the following workaround and it works:

  • If you haven't already done so, install the optional Visual Studio 2015 (v140) component which is available from the Visual Studio 2017 installer.
  • From Visual Studio, create a CUDA 9.1 project, then close the solution.
  • Using a text editor, open <project name>.vcxproj
  • Add the following as the first element in the XML under the Project tag:
 <PropertyGroup>
    <CUDAPropsPath Condition="'$(CUDAPropsPath)'==''">$(VCTargetsPath)\BuildCustomizations</CUDAPropsPath>
 </PropertyGroup>
  • Replace all occurrences (there are 2 of them) of v141 with v140.
  • Search for "CUDA 9.1" (you will find 2 occurrences). Then replace the first entire line with <Import Project="$(CUDAPropsPath)\CUDA 9.1.props" /> and the second entire line with <Import Project="$(CUDAPropsPath)\CUDA 9.1.targets" />.
  • Close the file in the text editor then re-open the solution in Visual Studio. You should now be able to add your .cu files, build and run.