GPU610/TeamLean
== Team Members ==
# [mailto:aadamico@myseneca.ca?subject=gpu610 Alex D'Amico]
# [mailto:btulchinsky@myseneca.ca?subject=gpu610 Barry Tulchinsky]

[mailto:aadamico@myseneca.ca,btulchinsky@myseneca.ca?subject=gpu610 Email All]

= Progress =
== Assignment 1 ==

'''For the first assignment we each profiled an open-source library: LAME, an audio compression utility, and Squish, an image compression utility.'''

'''Alex - results for LAME'''

I examined the LAME mp3 encoder to see whether the process of encoding a wav file into an mp3 can be parallelized.

The source code below, from psymodel.c, could potentially be parallelized.

From the sample runs and the profiling results, the share of total runtime spent in the most expensive function actually goes down as the wave file gets larger. This might mean it is not worth parallelizing; if my group chooses this project, we will have to examine this carefully.

'''<u>SOURCE CODE</u>'''
<pre>
int L3psycho_anal_vbr(lame_internal_flags * gfc, const sample_t * const buffer[2], int gr_out,
                      III_psy_ratio masking_ratio[2][2],
                      III_psy_ratio masking_MS_ratio[2][2],
                      FLOAT percep_entropy[2], FLOAT percep_MS_entropy[2],
                      FLOAT energy[4], int blocktype_d[2])
{
    SessionConfig_t const *const cfg = &gfc->cfg;
    PsyStateVar_t *const psv = &gfc->sv_psy;
    PsyConst_CB2SB_t const *const gdl = &gfc->cd_psy->l;
    PsyConst_CB2SB_t const *const gds = &gfc->cd_psy->s;
    plotting_data *plt = cfg->analysis ? gfc->pinfo : 0;
    III_psy_xmin last_thm[4];
    /* fft and energy calculation */
    FLOAT (*wsamp_l)[BLKSIZE];
    FLOAT (*wsamp_s)[3][BLKSIZE_s];
    FLOAT fftenergy[HBLKSIZE];
    FLOAT fftenergy_s[3][HBLKSIZE_s];
    FLOAT wsamp_L[2][BLKSIZE];
    FLOAT wsamp_S[2][3][BLKSIZE_s];
    FLOAT eb[4][CBANDS], thr[4][CBANDS];
    FLOAT sub_short_factor[4][3];
    FLOAT thmm;
    FLOAT const pcfact = 0.6f;
    FLOAT const ath_factor =
        (cfg->msfix > 0.f) ? (cfg->ATH_offset_factor * gfc->ATH->adjust_factor) : 1.f;
    const FLOAT (*const_eb)[CBANDS] = (const FLOAT (*)[CBANDS]) eb;
    const FLOAT (*const_fftenergy_s)[HBLKSIZE_s] = (const FLOAT (*)[HBLKSIZE_s]) fftenergy_s;
    /* block type */
    int ns_attacks[4][4] = { {0, 0, 0, 0}, {0, 0, 0, 0}, {0, 0, 0, 0}, {0, 0, 0, 0} };
    int uselongblock[2];
    /* usual variables like loop indices, etc.. */
    int chn, sb, sblock;
    /* chn=2 and 3 = Mid and Side channels */
    int const n_chn_psy = (cfg->mode == JOINT_STEREO) ? 4 : cfg->channels_out;

    memcpy(&last_thm[0], &psv->thm[0], sizeof(last_thm));

    vbrpsy_attack_detection(gfc, buffer, gr_out, masking_ratio, masking_MS_ratio, energy,
                            sub_short_factor, ns_attacks, uselongblock);
    vbrpsy_compute_block_type(cfg, uselongblock);

    /* LONG BLOCK CASE */
    {
        for (chn = 0; chn < n_chn_psy; chn++) {
            int const ch01 = chn & 0x01;
            wsamp_l = wsamp_L + ch01;
            vbrpsy_compute_fft_l(gfc, buffer, chn, gr_out, fftenergy, wsamp_l);
            vbrpsy_compute_loudness_approximation_l(gfc, gr_out, chn, fftenergy);
            vbrpsy_compute_masking_l(gfc, fftenergy, eb[chn], thr[chn], chn);
        }
        if (cfg->mode == JOINT_STEREO) {
            if ((uselongblock[0] + uselongblock[1]) == 2) {
                vbrpsy_compute_MS_thresholds(const_eb, thr, gdl->mld_cb, gfc->ATH->cb_l,
                                             ath_factor, cfg->msfix, gdl->npart);
            }
        }
        /* TODO: apply adaptive ATH masking here ?? */
        for (chn = 0; chn < n_chn_psy; chn++) {
            convert_partition2scalefac_l(gfc, eb[chn], thr[chn], chn);
            convert_partition2scalefac_l_to_s(gfc, eb[chn], thr[chn], chn);
        }
    }
    /* SHORT BLOCKS CASE */
    {
        int const force_short_block_calc = gfc->cd_psy->force_short_block_calc;
        for (sblock = 0; sblock < 3; sblock++) {
            for (chn = 0; chn < n_chn_psy; ++chn) {
                int const ch01 = chn & 0x01;
                if (uselongblock[ch01] && !force_short_block_calc) {
                    vbrpsy_skip_masking_s(gfc, chn, sblock);
                }
                else {
                    /* compute masking thresholds for short blocks */
                    wsamp_s = wsamp_S + ch01;
                    vbrpsy_compute_fft_s(gfc, buffer, chn, sblock, fftenergy_s, wsamp_s);
                    vbrpsy_compute_masking_s(gfc, const_fftenergy_s, eb[chn], thr[chn], chn,
                                             sblock);
                }
            }
            if (cfg->mode == JOINT_STEREO) {
                if ((uselongblock[0] + uselongblock[1]) == 0) {
                    vbrpsy_compute_MS_thresholds(const_eb, thr, gds->mld_cb, gfc->ATH->cb_s,
                                                 ath_factor, cfg->msfix, gds->npart);
                }
            }
            /* TODO: apply adaptive ATH masking here ?? */
            for (chn = 0; chn < n_chn_psy; ++chn) {
                int const ch01 = chn & 0x01;
                if (!uselongblock[ch01] || force_short_block_calc) {
                    convert_partition2scalefac_s(gfc, eb[chn], thr[chn], chn, sblock);
                }
            }
        }
        /**** short block pre-echo control ****/
        for (chn = 0; chn < n_chn_psy; chn++) {
            for (sb = 0; sb < SBMAX_s; sb++) {
                FLOAT new_thmm[3], prev_thm, t1, t2;
                for (sblock = 0; sblock < 3; sblock++) {
                    thmm = psv->thm[chn].s[sb][sblock];
                    thmm *= NS_PREECHO_ATT0;
                    t1 = t2 = thmm;
                    if (sblock > 0) {
                        prev_thm = new_thmm[sblock - 1];
                    }
                    else {
                        prev_thm = last_thm[chn].s[sb][2];
                    }
                    if (ns_attacks[chn][sblock] >= 2 || ns_attacks[chn][sblock + 1] == 1) {
                        t1 = NS_INTERP(prev_thm, thmm, NS_PREECHO_ATT1 * pcfact);
                    }
                    thmm = Min(t1, thmm);
                    if (ns_attacks[chn][sblock] == 1) {
                        t2 = NS_INTERP(prev_thm, thmm, NS_PREECHO_ATT2 * pcfact);
                    }
                    else if ((sblock == 0 && psv->last_attacks[chn] == 3)
                             || (sblock > 0 && ns_attacks[chn][sblock - 1] == 3)) { /* 2nd preceeding block */
                        switch (sblock) {
                        case 0:
                            prev_thm = last_thm[chn].s[sb][1];
                            break;
                        case 1:
                            prev_thm = last_thm[chn].s[sb][2];
                            break;
                        case 2:
                            prev_thm = new_thmm[0];
                            break;
                        }
                        t2 = NS_INTERP(prev_thm, thmm, NS_PREECHO_ATT2 * pcfact);
                    }
                    thmm = Min(t1, thmm);
                    thmm = Min(t2, thmm);
                    /* pulse like signal detection for fatboy.wav and so on */
                    thmm *= sub_short_factor[chn][sblock];
                    new_thmm[sblock] = thmm;
                }
                for (sblock = 0; sblock < 3; sblock++) {
                    psv->thm[chn].s[sb][sblock] = new_thmm[sblock];
                }
            }
        }
    }
    for (chn = 0; chn < n_chn_psy; chn++) {
        psv->last_attacks[chn] = ns_attacks[chn][2];
    }

    /***************************************************************
     * determine final block type
     ***************************************************************/
    vbrpsy_apply_block_type(psv, cfg->channels_out, uselongblock, blocktype_d);

    /*********************************************************************
     * compute the value of PE to return ... no delay and advance
     *********************************************************************/
    for (chn = 0; chn < n_chn_psy; chn++) {
        FLOAT *ppe;
        int type;
        III_psy_ratio const *mr;
        if (chn > 1) {
            ppe = percep_MS_entropy - 2;
            type = NORM_TYPE;
            if (blocktype_d[0] == SHORT_TYPE || blocktype_d[1] == SHORT_TYPE)
                type = SHORT_TYPE;
            mr = &masking_MS_ratio[gr_out][chn - 2];
        }
        else {
            ppe = percep_entropy;
            type = blocktype_d[chn];
            mr = &masking_ratio[gr_out][chn];
        }
        if (type == SHORT_TYPE) {
            ppe[chn] = pecalc_s(mr, gfc->sv_qnt.masking_lower);
        }
        else {
            ppe[chn] = pecalc_l(mr, gfc->sv_qnt.masking_lower);
        }
        if (plt) {
            plt->pe[gr_out][chn] = ppe[chn];
        }
    }
    return 0;
}
</pre>

'''Barry - results for Squish'''

I looked at an image compression library called Squish.

There are several possibilities for compression, depending on how many adjacent colours there are and how similar they are to one another. In other words, the more common the colours, the better the compression, as the library tries to fit the RGB values into a smaller vector object.

Below are my findings for the library:

There were several test files included, one of which tested PNG file compression. However, Linux was missing a required library, so I was only able to profile simple colour compression. If I get the PNG compression test to work in the future, I will profile it and discuss with my team member whether he would like to pursue it. Nonetheless, I believe this was sufficient to profile the area of the code that could potentially benefit from parallelization.

Looking at the 3 profiles (squishtest.select1.flt, squishtest.select2.flt, squishtest.select3.flt), it seems that compression and decompression of one colour is extremely fast (the select1 and select2 files). The problem comes when there are 2 (or more) colours involved (the select3 file). I also noticed that the FloatTo565 function is called more than any other function. Looking into it, however, it's a method that packs the RGB values into a single value using bitwise operations. There are only 3 of them, and bitwise operations are quite fast, so I don't think we should focus on it even though it's called many times.
+ | |||
+ | I believe the best place to offset the workload into the GPU would be the Compress3 and Compress4 function primarily, as well as the ComputeWeightedCovariance function. The profile is only for the Compress4 and not for the Compress3 function, but I presume that it may be called a lot of times as well depending on the compression format. | ||
+ | |||
+ | There is also a function that orders the vectors of colours but that can't be parralelized because it's dependent on a previous iteration. | ||
+ | |||
=== squishtest.select1.flt ===

Flat profile:

<pre>
Each sample counts as 0.01 seconds.
 no time accumulated

  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 0.00      0.00      0.00     3366     0.00     0.00  squish::FloatTo565(squish::Vec3 const&)
 0.00      0.00      0.00     2000     0.00     0.00  squish::SingleColourFit::ComputeEndPoints(squish::SingleColourLookup const* const*)
 0.00      0.00      0.00     2000     0.00     0.00  squish::FixFlags(int)
 0.00      0.00      0.00     2000     0.00     0.00  squish::Unpack565(unsigned char const*, unsigned char*)
 0.00      0.00      0.00     1683     0.00     0.00  squish::WriteColourBlock(int, int, unsigned char*, void*)
 0.00      0.00      0.00     1683     0.00     0.00  squish::ColourSet::RemapIndices(unsigned char const*, unsigned char*) const
 0.00      0.00      0.00     1000     0.00     0.00  GetColourError(unsigned char const*, unsigned char const*)
 0.00      0.00      0.00     1000     0.00     0.00  squish::Decompress(unsigned char*, void const*, int)
 0.00      0.00      0.00     1000     0.00     0.00  squish::CompressMasked(unsigned char const*, int, void*, int)
 0.00      0.00      0.00     1000     0.00     0.00  squish::SingleColourFit::Compress3(void*)
 0.00      0.00      0.00     1000     0.00     0.00  squish::SingleColourFit::Compress4(void*)
 0.00      0.00      0.00     1000     0.00     0.00  squish::SingleColourFit::SingleColourFit(squish::ColourSet const*, int)
 0.00      0.00      0.00     1000     0.00     0.00  squish::DecompressColour(unsigned char*, void const*, bool)
 0.00      0.00      0.00     1000     0.00     0.00  squish::WriteColourBlock3(squish::Vec3 const&, squish::Vec3 const&, unsigned char const*, void*)
 0.00      0.00      0.00     1000     0.00     0.00  squish::Compress(unsigned char const*, void*, int)
 0.00      0.00      0.00     1000     0.00     0.00  squish::ColourFit::Compress(void*)
 0.00      0.00      0.00     1000     0.00     0.00  squish::ColourFit::ColourFit(squish::ColourSet const*, int)
 0.00      0.00      0.00     1000     0.00     0.00  squish::ColourSet::ColourSet(unsigned char const*, int, int)
 0.00      0.00      0.00      683     0.00     0.00  squish::WriteColourBlock4(squish::Vec3 const&, squish::Vec3 const&, unsigned char const*, void*)
 0.00      0.00      0.00        1     0.00     0.00  _GLOBAL__sub_I__Z14GetColourErrorPKhS0_
</pre>

=== squishtest.select2.flt ===

Flat profile:

<pre>
Each sample counts as 0.01 seconds.
 no time accumulated

  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 0.00      0.00      0.00     2104     0.00     0.00  squish::FloatTo565(squish::Vec3 const&)
 0.00      0.00      0.00     1530     0.00     0.00  squish::SingleColourFit::ComputeEndPoints(squish::SingleColourLookup const* const*)
 0.00      0.00      0.00     1530     0.00     0.00  squish::FixFlags(int)
 0.00      0.00      0.00     1530     0.00     0.00  squish::Unpack565(unsigned char const*, unsigned char*)
 0.00      0.00      0.00     1052     0.00     0.00  squish::WriteColourBlock(int, int, unsigned char*, void*)
 0.00      0.00      0.00     1052     0.00     0.00  squish::ColourSet::RemapIndices(unsigned char const*, unsigned char*) const
 0.00      0.00      0.00      765     0.00     0.00  GetColourError(unsigned char const*, unsigned char const*)
 0.00      0.00      0.00      765     0.00     0.00  squish::Decompress(unsigned char*, void const*, int)
 0.00      0.00      0.00      765     0.00     0.00  squish::CompressMasked(unsigned char const*, int, void*, int)
 0.00      0.00      0.00      765     0.00     0.00  squish::SingleColourFit::Compress3(void*)
 0.00      0.00      0.00      765     0.00     0.00  squish::SingleColourFit::Compress4(void*)
 0.00      0.00      0.00      765     0.00     0.00  squish::SingleColourFit::SingleColourFit(squish::ColourSet const*, int)
 0.00      0.00      0.00      765     0.00     0.00  squish::DecompressColour(unsigned char*, void const*, bool)
 0.00      0.00      0.00      765     0.00     0.00  squish::WriteColourBlock3(squish::Vec3 const&, squish::Vec3 const&, unsigned char const*, void*)
 0.00      0.00      0.00      765     0.00     0.00  squish::Compress(unsigned char const*, void*, int)
 0.00      0.00      0.00      765     0.00     0.00  squish::ColourFit::Compress(void*)
 0.00      0.00      0.00      765     0.00     0.00  squish::ColourFit::ColourFit(squish::ColourSet const*, int)
 0.00      0.00      0.00      765     0.00     0.00  squish::ColourSet::ColourSet(unsigned char const*, int, int)
 0.00      0.00      0.00      287     0.00     0.00  squish::WriteColourBlock4(squish::Vec3 const&, squish::Vec3 const&, unsigned char const*, void*)
 0.00      0.00      0.00        1     0.00     0.00  _GLOBAL__sub_I__Z14GetColourErrorPKhS0_
</pre>

=== squishtest.select3.flt ===

Flat profile:

<pre>
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
26.09      0.06      0.06    97155     0.62     0.84  squish::ClusterFit::Compress4(void*)
21.74      0.11      0.05   292548     0.17     0.17  squish::FloatTo565(squish::Vec3 const&)
21.74      0.16      0.05    97155     0.51     1.01  squish::ClusterFit::Compress3(void*)
 4.35      0.17      0.01   194310     0.05     0.05  squish::ClusterFit::ConstructOrdering(squish::Vec3 const&, int)
 4.35      0.18      0.01   194310     0.05     0.05  squish::Unpack565(unsigned char const*, unsigned char*)
 4.35      0.19      0.01    97155     0.10     0.21  squish::DecompressColour(unsigned char*, void const*, bool)
 4.35      0.20      0.01    97155     0.10     0.44  squish::WriteColourBlock3(squish::Vec3 const&, squish::Vec3 const&, unsigned char const*, void*)
 4.35      0.21      0.01    97155     0.10     0.10  squish::ComputeWeightedCovariance(int, squish::Vec3 const*, float const*)
 4.35      0.22      0.01    97155     0.10     0.10  squish::ColourSet::ColourSet(unsigned char const*, int, int)
 4.35      0.23      0.01                             TestTwoColour(int)
 0.00      0.23      0.00   194310     0.00     0.00  squish::FixFlags(int)
 0.00      0.23      0.00   146274     0.00     0.00  squish::WriteColourBlock(int, int, unsigned char*, void*)
 0.00      0.23      0.00   146274     0.00     0.00  squish::ColourSet::RemapIndices(unsigned char const*, unsigned char*) const
 0.00      0.23      0.00    97155     0.00     0.00  GetColourError(unsigned char const*, unsigned char const*)
 0.00      0.23      0.00    97155     0.00     0.10  squish::ClusterFit::ClusterFit(squish::ColourSet const*, int)
 0.00      0.23      0.00    97155     0.00     0.21  squish::Decompress(unsigned char*, void const*, int)
 0.00      0.23      0.00    97155     0.00     2.06  squish::CompressMasked(unsigned char const*, int, void*, int)
 0.00      0.23      0.00    97155     0.00     0.00  squish::ComputePrincipleComponent(squish::Sym3x3 const&)
 0.00      0.23      0.00    97155     0.00     2.06  squish::Compress(unsigned char const*, void*, int)
 0.00      0.23      0.00    97155     0.00     1.01  squish::ColourFit::Compress(void*)
 0.00      0.23      0.00    97155     0.00     0.00  squish::ColourFit::ColourFit(squish::ColourSet const*, int)
 0.00      0.23      0.00    97155     0.00     0.00  squish::GetMultiplicity1Evector(squish::Sym3x3 const&, float)
 0.00      0.23      0.00    49119     0.00     0.34  squish::WriteColourBlock4(squish::Vec3 const&, squish::Vec3 const&, unsigned char const*, void*)
 0.00      0.23      0.00        1     0.00     0.00  _GLOBAL__sub_I__Z14GetColourErrorPKhS0_
</pre>

== Assignment 2 ==

DECISION: work on the squish library

We ran into a lot of difficulty with this assignment, from the beginning right up to the time of submission.

We decided to work on the squish project. The first problem we ran into was that we didn't know how to convert the Linux makefile to work on Windows with nvcc. We finally managed to devise a workaround: we edited the header files to point to squish.h in the current directory so that we wouldn't require dynamic linking and could compile everything in one line. Unfortunately, this took us several days.

The next issue was finding all the various class methods and making them available on both the device and the host via the __device__ __host__ modifiers. This was more difficult than we first anticipated: some classes had different versions depending on how the library is configured, and some methods called other (or nested) methods, which made them hard to follow and debug.

This brought further difficulty because standard library functions such as min and max were used. We looked into using thrust, but for simplicity we decided to work around the problem with conditional operators instead.
+ | |||
+ | The major problem was debugging the maths.cu code. On our first successful compilation with the kernel code, we noticed that there was an error somewhere as our third, and most important test, which was mixing 2 colours didn't produce any results. We initially assumed it was a kernel error. But when we tried to compile the maths.cu file from the original maths.cpp code (we just copied the original maths.cpp file and changed the extension to .cu) it was producing the same results. This made it virtually impossible to debug, even when using the visual profiler, as it only tells us that there was a non-zero (or 1) return from main. | ||
+ | |||
+ | It's due to this reason that we were unable to produce any visual chart or comparison with the original code, as we still need to debug the code. However, it appears that the kernel code is fine and the error is after it's execution, this is according to the profiler. | ||
+ | |||
+ | During our search for answers we noticed that CUDA uses squish for texture operations (https://developer.nvidia.com/gpu-accelerated-texture-compression). We will try to get in touch with Simon Brown who is the creator of squish and see if he can lead us in the proper path, as it can be clearly parallelized if it is used in some CUDA functionality. | ||
+ | |||
== Assignment 3 ==

This project had a lot of potential at first, but both Alex and I found it very disappointing and frustrating, as we couldn't manage to get it to run with much success. Because we didn't get it working, we decided to write about the theory behind our intended optimizations after explaining what we've done.

As we mentioned in our second assignment findings, we thought our kernel code was fine and that our issues were elsewhere. However, that was not the case.

It turns out there was an error in our kernel, and we noticed some logical errors that we had overlooked. We were able to get some profiling through the visual profiler, but the code crashes and the profile is therefore incomplete. We were unable to get the code to work for our 3rd and most important case, and we couldn't figure out why. Since we couldn't get it working with the two sources we mentioned, we decided to focus on one, the ComputeWeightedCovariance method and its kernel, shown below (please excuse the excessive comments):

=== KERNEL: ===
<pre>
__global__ void kernelWeightedConvariance(float* weights, Vec3* points, float* total, Vec3* cudaCentroid, int n) {
    //shared memory to reduce memory latency
    __shared__ float sharedTotal;
    //centroid members
    // __shared__ float cx;
    // __shared__ float cxTotal;
    // __shared__ float cy;
    // __shared__ float cyTotal;
    // __shared__ float cz;
    // __shared__ float czTotal;
    __shared__ Vec3 sharedCentroid;
    int idx = blockDim.x * blockIdx.x + threadIdx.x;

    if (idx == 0) {
        *total = 0.0f;
        *cudaCentroid = Vec3(0.0f);
    }
    if (threadIdx.x == 0) {
        sharedCentroid = Vec3(0.0f);
        sharedTotal = 0;
    }
    __syncthreads();

    // cx = 0;
    // cy = 0;
    // cz = 0;
    // cxTotal = 0.0f;
    // cyTotal = 0.0f;
    // czTotal = 0.0f;

    if (idx < n) {
        sharedTotal += weights[idx];
        // cxTotal += weights[idx] * points[idx
        sharedCentroid = weights[idx] * points[idx];
    }
    __syncthreads();

    //copy to global memory
    if (threadIdx.x == 0) {
        *total += sharedTotal;
        *cudaCentroid += sharedCentroid;
    }
}
</pre>
=== METHOD CONTAINING KERNEL CALL: ===

<pre>
Sym3x3 ComputeWeightedCovariance( int n, Vec3 const* points, const float * weights )
{
    // compute the centroid
    float total = 0.0f;
    Vec3 centroid( 0.0f );
    // float centroidX = 0.0f;
    // float centroidY = 0.0f;
    // float centroidZ = 0.0f;
    cudaError_t error;
    bool cudaContinue = true;

    // device memory addresses
    float* cudaWeights;
    Vec3* cudaPoints;
    float* cudaTotal;
    Vec3* cudaCentroid;
    // float* cudaCentroidX;
    // float* cudaCentroidY;
    // float* cudaCentroidZ;

    //calculate number of blocks
    int nblocks = (n + ntpb - 1) / ntpb;
    // int nblocks = n / ntpb + 1;

    //allocate device memory
    if (cudaContinue && (error = cudaMalloc((void**)&cudaWeights, n * sizeof(float))) != cudaSuccess) {
        cout << "unable to create device memory for cudaWeights: " << cudaGetErrorString(error) << endl;
        cudaContinue = false;
    }
    if (cudaContinue && (error = cudaMalloc((void**)&cudaPoints, n * sizeof(Vec3))) != cudaSuccess) {
        cout << "unable to create device memory for cudaPoints: " << cudaGetErrorString(error) << endl;
        cudaContinue = false;
    }
    if (cudaContinue && (error = cudaMalloc((void**)&cudaTotal, sizeof(float))) != cudaSuccess) {
        cout << "unable to create device memory for cudaTotal: " << cudaGetErrorString(error) << endl;
        cudaContinue = false;
    }
    if (cudaContinue && (error = cudaMalloc((void**)&cudaCentroid, sizeof(Vec3))) != cudaSuccess) {
        cout << "unable to create device memory for cudaCentroid: " << cudaGetErrorString(error) << endl;
        cudaContinue = false;
    }

    // cout<<"cudamemcpy: "<<cudaGetErrorString(cudaGetLastError())<<" "<<temp<<" "<< weights<<endl;
    // cout<<"cudamemcpyTotal: "<<cudaGetErrorString(cudaGetLastError())<<" "<<cudaTotal<<" "<< weights<<endl;
    // cout<<"cudamemcpyCentroid: "<<cudaGetErrorString(cudaGetLastError())<<" "<<cudaCentroid<<" "<< weights<<endl;
    // cudaMalloc((void**)&cudaCentroidX, sizeof(float));
    // cudaMalloc((void**)&cudaCentroidY, sizeof(float));
    // cudaMalloc((void**)&cudaCentroidZ, sizeof(float));

    //copy the weights and points to the device
    if ((error = cudaMemcpy(cudaWeights, weights, n * sizeof(float), cudaMemcpyHostToDevice)) != cudaSuccess) {
        cout << "failed to copy weights to device: " << cudaGetErrorString(error) << " " << weights << endl;
    }
    if ((error = cudaMemcpy(cudaPoints, points, n * sizeof(Vec3), cudaMemcpyHostToDevice)) != cudaSuccess) {
        cout << "failed to copy points to device: " << cudaGetErrorString(error) << " " << points << endl;
    }

    cout << "GOING INTO KERNEL" << endl;
    //OFFSET LOOP TO GPU
    // kernelWeightedConvariance<<<nblocks, ntpb>>>(cudaWeights, cudaPoints, cudaTotal, cudaCentroidX, cudaCentroidY, cudaCentroidZ);
    kernelWeightedConvariance<<<nblocks, ntpb>>>(cudaWeights, cudaPoints, cudaTotal, cudaCentroid, n);
    // kernelWeightedConvariance<<<1, ntpb>>>(cudaWeights, cudaPoints, cudaTotal, cudaCentroid, n);

    //ensure synchronization
    cudaDeviceSynchronize();

    cudaContinue = true;
    //copy back to host
    if (cudaContinue && (error = cudaMemcpy(&total, cudaTotal, sizeof(float), cudaMemcpyDeviceToHost))) {
        // cout<<"failed to copy total from device: "<<cudaGetErrorString(error)<<" "<<total<<endl;
    }
    if (cudaContinue && (error = cudaMemcpy(&centroid, cudaCentroid, sizeof(Vec3), cudaMemcpyDeviceToHost))) {
        // cout<<"failed to copy centroid from device: "<<cudaGetErrorString(error)<<" X:"<<centroid.X()
        //     <<" Y:"<<centroid.Y()<<" Z:"<<centroid.Z()<<endl;
    }
    // cudaMemcpy(&centroidX, cudaCentroidX, sizeof(float), cudaMemcpyDeviceToHost);
    // cudaMemcpy(&centroidY, cudaCentroidX, sizeof(float), cudaMemcpyDeviceToHost);
    // cudaMemcpy(&centroidZ, cudaCentroidX, sizeof(float), cudaMemcpyDeviceToHost);

    for( int i = 0; i < n; ++i )
    {
        total += weights[i];
        centroid += weights[i]*points[i];
    }

    //create centroid from kernel results
    // Vec3 centroid(centroidX, centroidY, centroidZ);

    centroid /= total;

    // accumulate the covariance matrix
    Sym3x3 covariance( 0.0f );
    for( int i = 0; i < n; ++i )
    {
        Vec3 a = points[i] - centroid;
        Vec3 b = weights[i]*a;

        covariance[0] += a.X()*b.X();
        covariance[1] += a.X()*b.Y();
        covariance[2] += a.X()*b.Z();
        covariance[3] += a.Y()*b.Y();
        covariance[4] += a.Y()*b.Z();
        covariance[5] += a.Z()*b.Z();
    }

    cudaFree(cudaTotal);
    cudaFree(cudaCentroid);
    // cudaFree(cudaCentroidX);
    // cudaFree(cudaCentroidY);
    // cudaFree(cudaCentroidZ);
    cudaFree(cudaWeights);
    cudaFree(cudaPoints);

    cudaDeviceReset();
    cout << "something outta nothing\n";
    // return it
    return covariance;
}
</pre>
+ | |||
+ | We added some error correction thinking an errpr was in memory allocation or wrong computation, but that was not the case. We tried also to get some debuggers to work, but | ||
+ | since we couldn't we relied on printf style debugging, which is only useful OUTSIDE the kernel, unfortunately. However, looking at the naive code it appears correct. | ||
+ | |||
+ | What we were able to profile for the test case is that it spends less than 2 microsecond copying to the device and less than 3 microseconds in the kernel. This doesn't | ||
+ | say much as we're dealing only with 2 random colours, but the amount of computation is dependent on the file size. | ||
+ | |||
+ | |||
+ | === OPTIMIZATION === | ||
+ | |||
+ | As far as what we planned to look on how to optimize this code, we noticed several options. | ||
+ | |||
+ | 1) as we learned in class, we can use a thread divergent reduction algorithm and store each blocks result separately instead of having everything flushed into one global location. This | ||
+ | will reduce the number of operations in the block and potentially even lead to reduction in threads required. | ||
+ | |||
+ | 2) Since we know that the number of computations is dependent on the size of the file (image file), we would be able to optimize the number of threads per blocks and number of blocks | ||
+ | required according to the file size and compute capability in order to reduce overhead and extra fragmentation of threads (ie. threads in a block at are not needed to complete the | ||
+ | computation in the last block) | ||
+ | |||
+ | |||
+ | === OTHER NOTES === | ||
+ | |||
+ | |||
+ | TO RUN ON WINDOWS: | ||
+ | |||
+ | nvcc alpha.cpp clusterfit.cpp colourblock.cpp colourfit.cpp colourset.cpp maths.cu rangefit.cpp singlecolourfit.cpp squish.cpp squishtest.cu alpha.cpp | ||
+ | |||
+ | *Note we excluded the clusterfit.cu from the build command because we didn't concentrate on it for this iteration of the assignment. |
Latest revision as of 03:38, 19 April 2013
Contents
Team Members
Progress
Assignment 1
For the first assignment we each profiled open source libraries, Lame - a audio compression utility and Squish an image compression utility.
Alex - results for Lame
I have examined the LAME mp3 encoder to see if the process of encoding a wav file into an mp3 can be parallelized.
The below source, code from the psymodel.c file, could potentially be parallelized.
From the sample runs and the given results in the profile, as the size of the wave file gets larger, the percentage of the function that takes the most time actually goes down. This might mean that it is not worth parallelizing. If my group chooses this project, we will have to examine this carefully.
SOURCE CODE
 int L3psycho_anal_vbr(lame_internal_flags * gfc, const sample_t * const buffer[2], int gr_out,
                       III_psy_ratio masking_ratio[2][2],
                       III_psy_ratio masking_MS_ratio[2][2],
                       FLOAT percep_entropy[2], FLOAT percep_MS_entropy[2],
                       FLOAT energy[4], int blocktype_d[2])
 {
     SessionConfig_t const *const cfg = &gfc->cfg;
     PsyStateVar_t *const psv = &gfc->sv_psy;
     PsyConst_CB2SB_t const *const gdl = &gfc->cd_psy->l;
     PsyConst_CB2SB_t const *const gds = &gfc->cd_psy->s;
     plotting_data *plt = cfg->analysis ? gfc->pinfo : 0;
     III_psy_xmin last_thm[4];
 
     /* fft and energy calculation */
     FLOAT(*wsamp_l)[BLKSIZE];
     FLOAT(*wsamp_s)[3][BLKSIZE_s];
     FLOAT   fftenergy[HBLKSIZE];
     FLOAT   fftenergy_s[3][HBLKSIZE_s];
     FLOAT   wsamp_L[2][BLKSIZE];
     FLOAT   wsamp_S[2][3][BLKSIZE_s];
     FLOAT   eb[4][CBANDS], thr[4][CBANDS];
     FLOAT   sub_short_factor[4][3];
     FLOAT   thmm;
     FLOAT const pcfact = 0.6f;
     FLOAT const ath_factor =
         (cfg->msfix > 0.f) ? (cfg->ATH_offset_factor * gfc->ATH->adjust_factor) : 1.f;
     const FLOAT(*const_eb)[CBANDS] = (const FLOAT(*)[CBANDS]) eb;
     const FLOAT(*const_fftenergy_s)[HBLKSIZE_s] =
         (const FLOAT(*)[HBLKSIZE_s]) fftenergy_s;
 
     /* block type */
     int ns_attacks[4][4] = { {0, 0, 0, 0}, {0, 0, 0, 0}, {0, 0, 0, 0}, {0, 0, 0, 0} };
     int uselongblock[2];
 
     /* usual variables like loop indices, etc.. */
     int chn, sb, sblock;
 
     /* chn=2 and 3 = Mid and Side channels */
     int const n_chn_psy = (cfg->mode == JOINT_STEREO) ? 4 : cfg->channels_out;
 
     memcpy(&last_thm[0], &psv->thm[0], sizeof(last_thm));
 
     vbrpsy_attack_detection(gfc, buffer, gr_out, masking_ratio, masking_MS_ratio,
                             energy, sub_short_factor, ns_attacks, uselongblock);
 
     vbrpsy_compute_block_type(cfg, uselongblock);
 
     /* LONG BLOCK CASE */
     {
         for (chn = 0; chn < n_chn_psy; chn++) {
             int const ch01 = chn & 0x01;
             wsamp_l = wsamp_L + ch01;
             vbrpsy_compute_fft_l(gfc, buffer, chn, gr_out, fftenergy, wsamp_l);
             vbrpsy_compute_loudness_approximation_l(gfc, gr_out, chn, fftenergy);
             vbrpsy_compute_masking_l(gfc, fftenergy, eb[chn], thr[chn], chn);
         }
         if (cfg->mode == JOINT_STEREO) {
             if ((uselongblock[0] + uselongblock[1]) == 2) {
                 vbrpsy_compute_MS_thresholds(const_eb, thr, gdl->mld_cb, gfc->ATH->cb_l,
                                              ath_factor, cfg->msfix, gdl->npart);
             }
         }
         /* TODO: apply adaptive ATH masking here ?? */
         for (chn = 0; chn < n_chn_psy; chn++) {
             convert_partition2scalefac_l(gfc, eb[chn], thr[chn], chn);
             convert_partition2scalefac_l_to_s(gfc, eb[chn], thr[chn], chn);
         }
     }
     /* SHORT BLOCKS CASE */
     {
         int const force_short_block_calc = gfc->cd_psy->force_short_block_calc;
         for (sblock = 0; sblock < 3; sblock++) {
             for (chn = 0; chn < n_chn_psy; ++chn) {
                 int const ch01 = chn & 0x01;
                 if (uselongblock[ch01] && !force_short_block_calc) {
                     vbrpsy_skip_masking_s(gfc, chn, sblock);
                 }
                 else {
                     /* compute masking thresholds for short blocks */
                     wsamp_s = wsamp_S + ch01;
                     vbrpsy_compute_fft_s(gfc, buffer, chn, sblock, fftenergy_s, wsamp_s);
                     vbrpsy_compute_masking_s(gfc, const_fftenergy_s, eb[chn], thr[chn],
                                              chn, sblock);
                 }
             }
             if (cfg->mode == JOINT_STEREO) {
                 if ((uselongblock[0] + uselongblock[1]) == 0) {
                     vbrpsy_compute_MS_thresholds(const_eb, thr, gds->mld_cb,
                                                  gfc->ATH->cb_s, ath_factor,
                                                  cfg->msfix, gds->npart);
                 }
             }
             /* TODO: apply adaptive ATH masking here ?? */
             for (chn = 0; chn < n_chn_psy; ++chn) {
                 int const ch01 = chn & 0x01;
                 if (!uselongblock[ch01] || force_short_block_calc) {
                     convert_partition2scalefac_s(gfc, eb[chn], thr[chn], chn, sblock);
                 }
             }
         }
         /**** short block pre-echo control ****/
         for (chn = 0; chn < n_chn_psy; chn++) {
             for (sb = 0; sb < SBMAX_s; sb++) {
                 FLOAT   new_thmm[3], prev_thm, t1, t2;
                 for (sblock = 0; sblock < 3; sblock++) {
                     thmm = psv->thm[chn].s[sb][sblock];
                     thmm *= NS_PREECHO_ATT0;
                     t1 = t2 = thmm;
                     if (sblock > 0) {
                         prev_thm = new_thmm[sblock - 1];
                     }
                     else {
                         prev_thm = last_thm[chn].s[sb][2];
                     }
                     if (ns_attacks[chn][sblock] >= 2 || ns_attacks[chn][sblock + 1] == 1) {
                         t1 = NS_INTERP(prev_thm, thmm, NS_PREECHO_ATT1 * pcfact);
                     }
                     thmm = Min(t1, thmm);
                     if (ns_attacks[chn][sblock] == 1) {
                         t2 = NS_INTERP(prev_thm, thmm, NS_PREECHO_ATT2 * pcfact);
                     }
                     else if ((sblock == 0 && psv->last_attacks[chn] == 3)
                              || (sblock > 0 && ns_attacks[chn][sblock - 1] == 3)) {
                         /* 2nd preceeding block */
                         switch (sblock) {
                         case 0:
                             prev_thm = last_thm[chn].s[sb][1];
                             break;
                         case 1:
                             prev_thm = last_thm[chn].s[sb][2];
                             break;
                         case 2:
                             prev_thm = new_thmm[0];
                             break;
                         }
                         t2 = NS_INTERP(prev_thm, thmm, NS_PREECHO_ATT2 * pcfact);
                     }
                     thmm = Min(t1, thmm);
                     thmm = Min(t2, thmm);
                     /* pulse like signal detection for fatboy.wav and so on */
                     thmm *= sub_short_factor[chn][sblock];
                     new_thmm[sblock] = thmm;
                 }
                 for (sblock = 0; sblock < 3; sblock++) {
                     psv->thm[chn].s[sb][sblock] = new_thmm[sblock];
                 }
             }
         }
     }
     for (chn = 0; chn < n_chn_psy; chn++) {
         psv->last_attacks[chn] = ns_attacks[chn][2];
     }
 
     /***************************************************************
      * determine final block type
      ***************************************************************/
     vbrpsy_apply_block_type(psv, cfg->channels_out, uselongblock, blocktype_d);
 
     /*********************************************************************
      * compute the value of PE to return ... no delay and advance
      *********************************************************************/
     for (chn = 0; chn < n_chn_psy; chn++) {
         FLOAT  *ppe;
         int     type;
         III_psy_ratio const *mr;
         if (chn > 1) {
             ppe = percep_MS_entropy - 2;
             type = NORM_TYPE;
             if (blocktype_d[0] == SHORT_TYPE || blocktype_d[1] == SHORT_TYPE)
                 type = SHORT_TYPE;
             mr = &masking_MS_ratio[gr_out][chn - 2];
         }
         else {
             ppe = percep_entropy;
             type = blocktype_d[chn];
             mr = &masking_ratio[gr_out][chn];
         }
         if (type == SHORT_TYPE) {
             ppe[chn] = pecalc_s(mr, gfc->sv_qnt.masking_lower);
         }
         else {
             ppe[chn] = pecalc_l(mr, gfc->sv_qnt.masking_lower);
         }
         if (plt) {
             plt->pe[gr_out][chn] = ppe[chn];
         }
     }
     return 0;
 }
Barry - results for Squish

I looked at an image compression library called squish.
There are several possibilities for compression. It depends on the number of adjacent colours and how similar they are to one another. In other words, the more common the colours, the better the compression, as the library tries to fit RGB values into a smaller vector object.
Below are my findings for the library:
There were several test files included, one of which tested PNG file compression. However, Linux didn't have a required library, so I was only able to profile simple colour compression. If I get the PNG compression test to work in the future, I will attempt to profile it and discuss with my team member whether he would like to pursue it. Nonetheless, I believe this was sufficient to profile the area of the code that could potentially benefit from parallelization.
Looking at the 3 profiles (squishtest.select1.flt, squishtest.select2.flt, squishtest.select3.flt), it seems that compression and decompression of one colour is extremely fast (select1 and select2 files). The problem comes when there are 2 (or more) colours involved (select3 file). I also noticed that the FloatTo565 function is called the most out of all the functions. Looking into that function, however, it's a method that compacts the RGB values into a single value using bitwise operations. There are only 3 of them, and bitwise operations are quite fast, so I don't think we should focus on it even though it's called many times.
I believe the best place to offload work onto the GPU would be primarily the Compress3 and Compress4 functions, as well as the ComputeWeightedCovariance function. The profile only shows Compress4 and not Compress3, but I presume Compress3 may also be called many times, depending on the compression format.
There is also a function that orders the vectors of colours, but that can't be parallelized because each iteration depends on the previous one.
squishtest.select1.flt
Flat profile:
Each sample counts as 0.01 seconds.
no time accumulated
   %   cumulative   self              self     total
  time   seconds   seconds    calls  Ts/call  Ts/call  name
  0.00      0.00     0.00     3366     0.00     0.00  squish::FloatTo565(squish::Vec3 const&)
  0.00      0.00     0.00     2000     0.00     0.00  squish::SingleColourFit::ComputeEndPoints(squish::SingleColourLookup const* const*)
  0.00      0.00     0.00     2000     0.00     0.00  squish::FixFlags(int)
  0.00      0.00     0.00     2000     0.00     0.00  squish::Unpack565(unsigned char const*, unsigned char*)
  0.00      0.00     0.00     1683     0.00     0.00  squish::WriteColourBlock(int, int, unsigned char*, void*)
  0.00      0.00     0.00     1683     0.00     0.00  squish::ColourSet::RemapIndices(unsigned char const*, unsigned char*) const
  0.00      0.00     0.00     1000     0.00     0.00  GetColourError(unsigned char const*, unsigned char const*)
  0.00      0.00     0.00     1000     0.00     0.00  squish::Decompress(unsigned char*, void const*, int)
  0.00      0.00     0.00     1000     0.00     0.00  squish::CompressMasked(unsigned char const*, int, void*, int)
  0.00      0.00     0.00     1000     0.00     0.00  squish::SingleColourFit::Compress3(void*)
  0.00      0.00     0.00     1000     0.00     0.00  squish::SingleColourFit::Compress4(void*)
  0.00      0.00     0.00     1000     0.00     0.00  squish::SingleColourFit::SingleColourFit(squish::ColourSet const*, int)
  0.00      0.00     0.00     1000     0.00     0.00  squish::DecompressColour(unsigned char*, void const*, bool)
  0.00      0.00     0.00     1000     0.00     0.00  squish::WriteColourBlock3(squish::Vec3 const&, squish::Vec3 const&, unsigned char const*, void*)
  0.00      0.00     0.00     1000     0.00     0.00  squish::Compress(unsigned char const*, void*, int)
  0.00      0.00     0.00     1000     0.00     0.00  squish::ColourFit::Compress(void*)
  0.00      0.00     0.00     1000     0.00     0.00  squish::ColourFit::ColourFit(squish::ColourSet const*, int)
  0.00      0.00     0.00     1000     0.00     0.00  squish::ColourSet::ColourSet(unsigned char const*, int, int)
  0.00      0.00     0.00      683     0.00     0.00  squish::WriteColourBlock4(squish::Vec3 const&, squish::Vec3 const&, unsigned char const*, void*)
  0.00      0.00     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z14GetColourErrorPKhS0_
squishtest.select2.flt
Flat profile:
Each sample counts as 0.01 seconds.
no time accumulated
   %   cumulative   self              self     total
  time   seconds   seconds    calls  Ts/call  Ts/call  name
  0.00      0.00     0.00     2104     0.00     0.00  squish::FloatTo565(squish::Vec3 const&)
  0.00      0.00     0.00     1530     0.00     0.00  squish::SingleColourFit::ComputeEndPoints(squish::SingleColourLookup const* const*)
  0.00      0.00     0.00     1530     0.00     0.00  squish::FixFlags(int)
  0.00      0.00     0.00     1530     0.00     0.00  squish::Unpack565(unsigned char const*, unsigned char*)
  0.00      0.00     0.00     1052     0.00     0.00  squish::WriteColourBlock(int, int, unsigned char*, void*)
  0.00      0.00     0.00     1052     0.00     0.00  squish::ColourSet::RemapIndices(unsigned char const*, unsigned char*) const
  0.00      0.00     0.00      765     0.00     0.00  GetColourError(unsigned char const*, unsigned char const*)
  0.00      0.00     0.00      765     0.00     0.00  squish::Decompress(unsigned char*, void const*, int)
  0.00      0.00     0.00      765     0.00     0.00  squish::CompressMasked(unsigned char const*, int, void*, int)
  0.00      0.00     0.00      765     0.00     0.00  squish::SingleColourFit::Compress3(void*)
  0.00      0.00     0.00      765     0.00     0.00  squish::SingleColourFit::Compress4(void*)
  0.00      0.00     0.00      765     0.00     0.00  squish::SingleColourFit::SingleColourFit(squish::ColourSet const*, int)
  0.00      0.00     0.00      765     0.00     0.00  squish::DecompressColour(unsigned char*, void const*, bool)
  0.00      0.00     0.00      765     0.00     0.00  squish::WriteColourBlock3(squish::Vec3 const&, squish::Vec3 const&, unsigned char const*, void*)
  0.00      0.00     0.00      765     0.00     0.00  squish::Compress(unsigned char const*, void*, int)
  0.00      0.00     0.00      765     0.00     0.00  squish::ColourFit::Compress(void*)
  0.00      0.00     0.00      765     0.00     0.00  squish::ColourFit::ColourFit(squish::ColourSet const*, int)
  0.00      0.00     0.00      765     0.00     0.00  squish::ColourSet::ColourSet(unsigned char const*, int, int)
  0.00      0.00     0.00      287     0.00     0.00  squish::WriteColourBlock4(squish::Vec3 const&, squish::Vec3 const&, unsigned char const*, void*)
  0.00      0.00     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z14GetColourErrorPKhS0_
squishtest.select3.flt
Flat profile:
Each sample counts as 0.01 seconds.
   %   cumulative   self              self     total
  time   seconds   seconds    calls  us/call  us/call  name
 26.09      0.06     0.06    97155     0.62     0.84  squish::ClusterFit::Compress4(void*)
 21.74      0.11     0.05   292548     0.17     0.17  squish::FloatTo565(squish::Vec3 const&)
 21.74      0.16     0.05    97155     0.51     1.01  squish::ClusterFit::Compress3(void*)
  4.35      0.17     0.01   194310     0.05     0.05  squish::ClusterFit::ConstructOrdering(squish::Vec3 const&, int)
  4.35      0.18     0.01   194310     0.05     0.05  squish::Unpack565(unsigned char const*, unsigned char*)
  4.35      0.19     0.01    97155     0.10     0.21  squish::DecompressColour(unsigned char*, void const*, bool)
  4.35      0.20     0.01    97155     0.10     0.44  squish::WriteColourBlock3(squish::Vec3 const&, squish::Vec3 const&, unsigned char const*, void*)
  4.35      0.21     0.01    97155     0.10     0.10  squish::ComputeWeightedCovariance(int, squish::Vec3 const*, float const*)
  4.35      0.22     0.01    97155     0.10     0.10  squish::ColourSet::ColourSet(unsigned char const*, int, int)
  4.35      0.23     0.01                             TestTwoColour(int)
  0.00      0.23     0.00   194310     0.00     0.00  squish::FixFlags(int)
  0.00      0.23     0.00   146274     0.00     0.00  squish::WriteColourBlock(int, int, unsigned char*, void*)
  0.00      0.23     0.00   146274     0.00     0.00  squish::ColourSet::RemapIndices(unsigned char const*, unsigned char*) const
  0.00      0.23     0.00    97155     0.00     0.00  GetColourError(unsigned char const*, unsigned char const*)
  0.00      0.23     0.00    97155     0.00     0.10  squish::ClusterFit::ClusterFit(squish::ColourSet const*, int)
  0.00      0.23     0.00    97155     0.00     0.21  squish::Decompress(unsigned char*, void const*, int)
  0.00      0.23     0.00    97155     0.00     2.06  squish::CompressMasked(unsigned char const*, int, void*, int)
  0.00      0.23     0.00    97155     0.00     0.00  squish::ComputePrincipleComponent(squish::Sym3x3 const&)
  0.00      0.23     0.00    97155     0.00     2.06  squish::Compress(unsigned char const*, void*, int)
  0.00      0.23     0.00    97155     0.00     1.01  squish::ColourFit::Compress(void*)
  0.00      0.23     0.00    97155     0.00     0.00  squish::ColourFit::ColourFit(squish::ColourSet const*, int)
  0.00      0.23     0.00    97155     0.00     0.00  squish::GetMultiplicity1Evector(squish::Sym3x3 const&, float)
  0.00      0.23     0.00    49119     0.00     0.34  squish::WriteColourBlock4(squish::Vec3 const&, squish::Vec3 const&, unsigned char const*, void*)
  0.00      0.23     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z14GetColourErrorPKhS0_
Assignment 2
DECISION: work on squish library
We ran into a lot of difficulty with this assignment from beginning until the time of submission.
We decided to work on the squish project. The first problem we ran into was that we didn't know how to convert the Linux makefile to work on Windows with nvcc. We eventually devised a workaround: we edited the header files to point to squish.h in the current directory, so dynamic linking was not required and everything could be compiled with a single command. Unfortunately, this took us several days.
The next issue was finding all the various class methods and making them available on both the device and the host via the __device__ __host__ modifiers. This was more difficult than we first anticipated: some classes have different versions depending on how the library is configured, and some methods call other (or nested) methods, which made the code hard to follow and debug, so it took quite a bit of time.
In addition, the port was complicated by the use of standard library functions such as min and max, which are not available in device code. We looked into using Thrust, but for simplicity we decided to work around the problem with conditional operations.
The major problem was debugging the maths.cu code. On our first successful compilation with the kernel code, we noticed that there was an error somewhere, as our third and most important test, mixing 2 colours, didn't produce any results. We initially assumed it was a kernel error. But when we compiled a maths.cu file made from the original maths.cpp code (we simply copied maths.cpp and changed the extension to .cu), it produced the same results. This made it virtually impossible to debug, even with the visual profiler, as it only told us that main returned a non-zero value.
For this reason we were unable to produce any visual chart or comparison with the original code, as we still need to debug it. However, according to the profiler, the kernel code appears to be fine and the error occurs after its execution.
During our search for answers we noticed that CUDA uses squish for texture operations (https://developer.nvidia.com/gpu-accelerated-texture-compression). We will try to get in touch with Simon Brown, the creator of squish, to see if he can point us in the right direction, since the library can clearly be parallelized if it is used in CUDA functionality.
Assignment 3
This project had a lot of potential at first, but both Alex and I found it disappointing and frustrating, as we couldn't get it to run with much success. Because we didn't get it working, we decided to write about the theory behind our intended optimizations after explaining what we've done.
As we originally mentioned in our second assignment findings, we thought that our kernel code was fine and that we had issues elsewhere. However, that was not the case.
It turns out that there was an error in our kernel, and we noticed some logical errors that we had overlooked. We were able to get some profiling through the visual profiler, but the code crashes, so the profile is incomplete. We were unable to get the code to work for our third and most important case, and we couldn't figure out why. Since we couldn't get both of the sources we mentioned working, we decided to focus on one: the ComputeWeightedCovariance method and its kernel, shown below (please excuse the excessive comments):
KERNEL:
 __global__ void kernelWeightedConvariance(float* weights, Vec3* points, float* total,
                                           Vec3* cudaCentroid, int n)
 {
     //shared memory to reduce memory latency
     __shared__ float sharedTotal;
     //centroid members
     // __shared__ float cx;
     // __shared__ float cxTotal;
     // __shared__ float cy;
     // __shared__ float cyTotal;
     // __shared__ float cz;
     // __shared__ float czTotal;
     __shared__ Vec3 sharedCentroid;
 
     int idx = blockDim.x * blockIdx.x + threadIdx.x;
 
     if (idx == 0) {
         *total = 0.0f;
         *cudaCentroid = Vec3(0.0f);
     }
     if (threadIdx.x == 0) {
         sharedCentroid = Vec3(0.0f);
         sharedTotal = 0;
     }
     __syncthreads();
 
     // cx = 0;
     // cy = 0;
     // cz = 0;
     // cxTotal = 0.0f;
     // cyTotal = 0.0f;
     // czTotal = 0.0f;
 
     if (idx < n) {
         sharedTotal += weights[idx];
         // cxTotal += weights[idx] * points[idx
         sharedCentroid = weights[idx] * points[idx];
     }
     __syncthreads();
 
     //copy to global memory
     if (threadIdx.x == 0) {
         *total += sharedTotal;
         *cudaCentroid += sharedCentroid;
     }
 }
METHOD CONTAINING KERNEL CALL
 Sym3x3 ComputeWeightedCovariance( int n, Vec3 const* points, const float * weights )
 {
     // compute the centroid
     float total = 0.0f;
     Vec3 centroid( 0.0f );
     // float centroidX = 0.0f;
     // float centroidY = 0.0f;
     // float centroidZ = 0.0f;
 
     cudaError_t error;
     bool cudaContinue = true;
 
     // device memory addresses
     float* cudaWeights;
     Vec3* cudaPoints;
     float* cudaTotal;
     Vec3* cudaCentroid;
     // float* cudaCentroidX;
     // float* cudaCentroidY;
     // float* cudaCentroidZ;
 
     //calculate number of blocks
     int nblocks = (n + ntpb - 1) / ntpb;
     // int nblocks = n / ntpb + 1;
 
     //allocate device memory
     if (cudaContinue && (error = cudaMalloc((void**)&cudaWeights, n * sizeof(float))) != cudaSuccess) {
         cout << "unable to create device memory for cudaWeights: " << cudaGetErrorString(error) << endl;
         cudaContinue = false;
     }
     if (cudaContinue && (error = cudaMalloc((void**)&cudaPoints, n * sizeof(Vec3))) != cudaSuccess) {
         cout << "unable to create device memory for cudaPoints: " << cudaGetErrorString(error) << endl;
         cudaContinue = false;
     }
     if (cudaContinue && (error = cudaMalloc((void**)&cudaTotal, sizeof(float))) != cudaSuccess) {
         cout << "unable to create device memory for cudaTotal: " << cudaGetErrorString(error) << endl;
         cudaContinue = false;
     }
     if (cudaContinue && (error = cudaMalloc((void**)&cudaCentroid, sizeof(Vec3))) != cudaSuccess) {
         cout << "unable to create device memory for cudaCentroid: " << cudaGetErrorString(error) << endl;
         cudaContinue = false;
     }
     // cout<<"cudamemcpy: "<<cudaGetErrorString(cudaGetLastError())<<" "<<temp<<" "<<weights<<endl;
     // cout<<"cudamemcpyTotal: "<<cudaGetErrorString(cudaGetLastError())<<" "<<cudaTotal<<" "<<weights<<endl;
     // cout<<"cudamemcpyCentroid: "<<cudaGetErrorString(cudaGetLastError())<<" "<<cudaCentroid<<" "<<weights<<endl;
     // cudaMalloc((void**)&cudaCentroidX, sizeof(float));
     // cudaMalloc((void**)&cudaCentroidY, sizeof(float));
     // cudaMalloc((void**)&cudaCentroidZ, sizeof(float));
 
     //copy the weights and points to the device
     if ((error = cudaMemcpy(cudaWeights, weights, n * sizeof(float), cudaMemcpyHostToDevice)) != cudaSuccess) {
         cout << "failed to copy weights to device: " << cudaGetErrorString(error) << " " << weights << endl;
     }
     if ((error = cudaMemcpy(cudaPoints, points, n * sizeof(Vec3), cudaMemcpyHostToDevice)) != cudaSuccess) {
         cout << "failed to copy points to device: " << cudaGetErrorString(error) << " " << points << endl;
     }
     // cout<<"GOING INTO KERNEL"<<endl;
 
     //OFFSET LOOP TO GPU
     // kernelWeightedConvariance<<<nblocks, ntpb>>>(cudaWeights, cudaPoints, cudaTotal, cudaCentroidX, cudaCentroidY, cudaCentroidZ);
     kernelWeightedConvariance<<<nblocks, ntpb>>>(cudaWeights, cudaPoints, cudaTotal, cudaCentroid, n);
     // kernelWeightedConvariance<<<1, ntpb>>>(cudaWeights, cudaPoints, cudaTotal, cudaCentroid, n);
 
     //ensure synchronization
     cudaDeviceSynchronize();
 
     cudaContinue = true;
     //copy back to host
     if (cudaContinue && (error = cudaMemcpy(&total, cudaTotal, sizeof(float), cudaMemcpyDeviceToHost))) {
         // cout<<"failed to copy total from device: "<<cudaGetErrorString(error)<<" "<<total<<endl;
     }
     if (cudaContinue && (error = cudaMemcpy(&centroid, cudaCentroid, sizeof(Vec3), cudaMemcpyDeviceToHost))) {
         // cout<<"failed to copy total from device: "<<cudaGetErrorString(error)<<" X:"<<centroid.X()
         //     <<" Y:"<<centroid.Y()<<" Z:"<<centroid.Z()<<endl;
     }
     // cudaMemcpy(&centroidX, cudaCentroidX, sizeof(float), cudaMemcpyDeviceToHost);
     // cudaMemcpy(&centroidY, cudaCentroidX, sizeof(float), cudaMemcpyDeviceToHost);
     // cudaMemcpy(&centroidZ, cudaCentroidX, sizeof(float), cudaMemcpyDeviceToHost);
 
     for( int i = 0; i < n; ++i ) {
         total += weights[i];
         centroid += weights[i]*points[i];
     }
     //create centroid from kernel results
     // Vec3 centroid(centroidX, centroidY, centroidZ);
     centroid /= total;
 
     // accumulate the covariance matrix
     Sym3x3 covariance( 0.0f );
     for( int i = 0; i < n; ++i ) {
         Vec3 a = points[i] - centroid;
         Vec3 b = weights[i]*a;
         covariance[0] += a.X()*b.X();
         covariance[1] += a.X()*b.Y();
         covariance[2] += a.X()*b.Z();
         covariance[3] += a.Y()*b.Y();
         covariance[4] += a.Y()*b.Z();
         covariance[5] += a.Z()*b.Z();
     }
 
     cudaFree(cudaTotal);
     cudaFree(cudaCentroid);
     // cudaFree(cudaCentroidX);
     // cudaFree(cudaCentroidY);
     // cudaFree(cudaCentroidZ);
     cudaFree(cudaWeights);
     cudaFree(cudaPoints);
     cudaDeviceReset();
 
     cout << "something outta nothing\n";
 
     // return it
     return covariance;
 }
We added some error checking, thinking the error was in memory allocation or a wrong computation, but that was not the case. We also tried to get some debuggers to work, but since we couldn't, we relied on printf-style debugging, which unfortunately is only useful outside the kernel. However, looking at the naive code, it appears correct.
What we were able to profile for the test case shows that it spends less than 2 microseconds copying to the device and less than 3 microseconds in the kernel. This doesn't say much, as we're dealing with only 2 random colours, but the amount of computation depends on the file size.
OPTIMIZATION
Looking at how this code could be optimized, we noticed several options.
1) As we learned in class, we can use a reduction algorithm that minimizes thread divergence and store each block's result separately, instead of having everything flushed into one global location. This will reduce the number of operations in the block and could even reduce the number of threads required.
2) Since we know that the number of computations depends on the size of the image file, we could tune the number of threads per block and the number of blocks according to the file size and compute capability, in order to reduce overhead and thread fragmentation (i.e., threads in the last block that are not needed to complete the computation).
OTHER NOTES
TO RUN ON WINDOWS:
 nvcc alpha.cpp clusterfit.cpp colourblock.cpp colourfit.cpp colourset.cpp maths.cu rangefit.cpp singlecolourfit.cpp squish.cpp squishtest.cu
- Note that we excluded clusterfit.cu from the build command because we didn't concentrate on it for this iteration of the assignment.