a[(r - i * c) + j] = b[i * c + j];
}
[[File:cuda_size_257kb.png|thumb|500px| ]] ----[[File:cuda_size_769kb.png]]
The results were the same as for the constructor.
The CUDA memory allocation call, cudaMalloc, is the most time-consuming operation when the kernel is run.
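One way to check where the allocation time goes is to time the first and second cudaMalloc calls separately and compare them with the kernel time. The following is a minimal sketch, not the code profiled above: the kernel <code>scale</code>, the helper <code>msSince</code>, and the element count are hypothetical placeholders. The first CUDA runtime call usually also creates the CUDA context, so the first cudaMalloc may appear far more expensive than later ones.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; it is NOT the kernel measured on this page.
__global__ void scale(float* d, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) d[idx] *= 2.0f;
}

// Wall-clock milliseconds elapsed since t0 (host-side timer).
static double msSince(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - t0).count();
}

int main() {
    const int n = 1 << 20;                    // assumed element count
    const size_t bytes = n * sizeof(float);
    float* d1 = nullptr;
    float* d2 = nullptr;

    // First allocation: typically also pays for lazy CUDA context creation.
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaMalloc((void**)&d1, bytes);
    double firstMs = msSince(t0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Second allocation: the context already exists, so this is closer
    // to the cost of the allocation itself.
    t0 = std::chrono::steady_clock::now();
    err = cudaMalloc((void**)&d2, bytes);
    double secondMs = msSince(t0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        cudaFree(d1);
        return 1;
    }

    cudaMemset(d1, 0, bytes);                 // give the kernel defined input

    // Time the kernel with CUDA events, which measure on the GPU timeline.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    scale<<<(n + 255) / 256, 256>>>(d1, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);               // wait until the kernel is done
    float kernelMs = 0.0f;
    cudaEventElapsedTime(&kernelMs, start, stop);

    printf("first cudaMalloc:  %.3f ms\n", firstMs);
    printf("second cudaMalloc: %.3f ms\n", secondMs);
    printf("kernel:            %.3f ms\n", kernelMs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d1);
    cudaFree(d2);
    return 0;
}
```

If the second allocation turns out to be much cheaper than the first, most of the reported cudaMalloc time is probably one-time initialization rather than the allocation itself. In that case, allocating once and reusing the buffers across kernel launches should reduce the overhead.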