GPU621/Threadless Horsemen
# [mailto:nmmisener@myseneca.ca Nathan Misener]
<div style="font-size: 1.300em; width: 85%">
== Introduction: The Julia Programming Language ==
 
[[File:Julia_lang_logo.png|300px]]
=== A Bit of History ===
More Use Cases:
https://juliacomputing.com/case-studies/
 
* Julia Computing was co-founded by the co-creators of Julia to provide support, consulting, and other services to organizations using Julia
* The company raised $4.6M in seed funding in 2017 (http://www.finsmes.com/2017/06/julia-computing-raises-4-6m-in-seed-funding.html)
== Julia's Forms of Parallelism ==
* We focused on multi-threading in our quantitative testing, since at the time we wrote the code we only had experience with OpenMP; a minimal Julia threading sketch follows the diagram below
[[File:Omp_fork_join.png|800px]]
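Below is a minimal sketch of Julia's (then-experimental) threading interface, analogous to the OpenMP fork-join model pictured above. This example is ours, not from the course notes, and it assumes the process was started with the environment variable JULIA_NUM_THREADS set (e.g. JULIA_NUM_THREADS=4).

<source>
# Minimal sketch (our example): Julia's experimental multi-threading.
# Assumes the process was started with e.g. JULIA_NUM_THREADS=4.
using Base.Threads

partial = zeros(Int, nthreads())   # one accumulator per thread to avoid data races
@threads for i = 1:1000
    partial[threadid()] += i       # each thread writes only to its own slot
end
println(sum(partial))              # 500500
</source>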
=== Multi-core or Distributed Processing ===
* remotecall launches work on a worker and immediately returns a Future; you can call fetch on the Future to get the result (see the sketch below)
[[File:Julia_remote_call.png|800px]]
https://www.youtube.com/watch?v=RlogUNQTf-M (Introduction to Julia and its Parallel Implementation, 2:00)
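As a minimal sketch of the Future workflow described above (our example, not from the original page): remotecall returns immediately, and fetch blocks until the result arrives.

<source>
# Minimal sketch (our example): explicit remotecall + fetch
using Distributed
addprocs(1)                    # start one worker process (it gets id 2)

f = remotecall(sqrt, 2, 4.0)   # run sqrt(4.0) on worker 2; returns a Future at once
println(fetch(f))              # blocks until the result is ready; prints 2.0
</source>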
 
<source>
# Example
# Requires the Distributed standard library;
# @everywhere lets all worker processes define and call the function

using Distributed
addprocs(4)    # start 4 worker processes (ids 2-5), if not started with julia -p 4

@everywhere function whoami()
    println(myid(), " ", gethostname())
end

remotecall_fetch(whoami, 2)
remotecall_fetch(whoami, 4)

# remotecall_fetch(f, id) is the same as fetch(remotecall(f, id))
</source>
Source: https://www.dursi.ca/post/julia-vs-chapel.html#parallel-primitives
=== Coroutines (Green Threads) ===
More code here: https://github.com/tsarkarsc/parallel_prog
 
* If you don't need true multi-threading, Julia has macros that look similar to OpenMP's parallel constructs
* OpenMP of course uses multi-threading, whereas Julia uses Tasks, which is what it calls its coroutines / fibers (see the Task sketch at the end of this section)
* The following is a comparison of parallel reduction in OpenMP and Julia
 
{| class="wikitable"
|-
! OpenMP
! Julia
|-
|
<source>
template <typename T>
T reduce(
    const T* in,   // points to the data set
    int n,         // number of elements in the data set
    T identity     // initial value
) {
    T accum = identity;
    #pragma omp parallel for reduction(+:accum)
    for (int i = 0; i < n; i++)
        accum += in[i];
    return accum;
}
</source>
|
<source>
a = randn(1000)
@distributed (+) for i = 1:100000
    some_func(a[rand(1:end)])
end
</source>
|}
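The Julia snippet above is a fragment from the Julia docs; a self-contained version might look like the sketch below (our code: some_func is undefined in the docs' fragment, so we supply a stand-in).

<source>
# Self-contained sketch (our code); some_func is a stand-in we chose
using Distributed
addprocs(2)                          # start two worker processes

@everywhere some_func(x) = abs2(x)   # define the function on every process

a = randn(1000)                      # a is copied to each worker by the closure
total = @distributed (+) for i = 1:100000
    some_func(a[rand(1:end)])
end
println(total)
</source>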
 
 
* Note: Julia used to have an @parallel macro (e.g. @parallel for), but it was deprecated in favour of @distributed
 
https://scs.senecac.on.ca/~gpu621/pages/content/omp_3.html
 
https://docs.julialang.org/en/v1/stdlib/Distributed/#Distributed.@distributed
 
https://docs.julialang.org/en/v1/manual/parallel-computing/index.html
 
https://github.com/JuliaLang/julia/issues/19578
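To round out the heading itself, here is a minimal sketch of Julia's Tasks (our example, not from the original page). Tasks are cooperatively scheduled green threads, so everything below runs on a single OS thread.

<source>
# Minimal sketch (our example) of Tasks / coroutines; runs on one OS thread
@sync begin                    # wait for all enclosed tasks to finish
    for name in ["a", "b", "c"]
        @async begin           # spawn a lightweight task
            sleep(rand())      # sleeping yields control to the other tasks
            println("task ", name, " done")
        end
    end
end
</source>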
== OpenMP vs Julia Results ==
[[File:GPU621_JuliaRuntime.png]] [[File:GPU621_openMpRuntime.png]]

* We took Workshop 3 as our test example and looked at the differences in speed
* We decreased the size of the array to 512

From here we wrote code for Julia to match the base, loop-interchange and threading tests.

[[File:GPU621_O1Runtime_Julia_OpenMp.png]] [[File:GPU621_O2Runtime_Julia_OpenMp.png]]

* Recap: loop interchange benefits OpenMP because of locality of reference. C++ stores our array in "row-major" order, which lets the processor move across cached data along a row faster than down a column.
* Julia, by contrast, stores arrays in column-major order, so, as you might have seen, loop interchange made our Julia version worse, for exactly the opposite reason that it improves the OpenMP version.
* [https://docs.julialang.org/en/v1/manual/performance-tips/index.html Julia favours "Column-Major" layouts in cache memory.]

[[File:255px-Row_and_column_major_order.svg.png]]

* Julia has several levels of runtime optimization (0-3)
* julia -O2 scriptName.jl or julia --optimize=2
* This sets the optimization level (the default level is 2 if unspecified, or 3 if -O is used without a level)

== Vectorization ==

* We want to briefly touch on vectorization

{| class="wikitable"
|-
! Using Vectorization
! Expanded axpy function
|-
|
<source>
function axpy(a, x, y)
    @simd for i = 1:length(x)
        @inbounds y[i] += a * x[i]
    end
end

n = 1003
x = rand(Float32, n)
y = rand(Float32, n)
axpy(1.414f0, x, y)
</source>
|
<source>
function axpy(a::Float32, x::Array{Float32,1}, y::Array{Float32,1})
    n = length(x)
    i = 1
    @inbounds while i <= n
        t1 = x[i]
        t2 = y[i]
        t3 = a * t1
        t4 = t2 + t3
        y[i] = t4
        i += 1
    end
end
</source>
|}

* The @simd macro gives the compiler license to vectorize without checking whether it will change the program's visible behavior.
* The vectorized code will behave as if the code were written to operate on chunks of the arrays.
* @inbounds turns off subscript checking that might throw an exception.
* Make sure your subscripts are in bounds before using it or you might corrupt your Julia session.

[https://software.intel.com/en-us/articles/vectorization-in-julia More info on vectorization in Julia]
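One way to check whether @simd actually vectorized the loop (a quick check of ours, not part of the original notes) is to inspect the generated LLVM IR:

<source>
# Our sketch: inspect the generated code for vector instructions
using InteractiveUtils          # provides @code_llvm (auto-loaded in the REPL)

function axpy(a, x, y)
    @simd for i = 1:length(x)
        @inbounds y[i] += a * x[i]
    end
end

x = rand(Float32, 1003)
y = rand(Float32, 1003)
@code_llvm axpy(1.414f0, x, y)  # look for types like <8 x float> in the output
</source>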
== Conclusion ==
* JIT-compiled to native code thanks to LLVM: faster than interpreted languages like Python, but slower than ahead-of-time-compiled languages like C++
* Although slower than C++, Julia has simpler syntax (it looks similar to Python)
* The compiler takes care of some optimization tasks for you, so you don't need to worry as much about locality of reference (loop interchange) or vectorization
* Multi-threading is still experimental, and it's recommended to use distributed processing or coroutines (green threads) for parallelism
</div>