SPO600 Algorithm Selection Lab

Lab 6
Background
- Digital sound is typically represented, uncompressed, as signed 16-bit integer signal samples. There are two streams of samples, one each for the left and right stereo channels, at typical sample rates of 44.1 or 48 thousand samples per second per channel, for a total of 88.2 or 96 thousand samples per second (kHz). Since there are 16 bits (2 bytes) per sample, the data rate is 88.2 * 1000 * 2 = 176,400 bytes/second (~172 KiB/sec) or 96 * 1000 * 2 = 192,000 bytes/second (~187.5 KiB/sec).
- To change the volume of sound, each sample can be scaled (multiplied) by a volume factor, in the range of 0.00 (silence) to 1.00 (full volume). (A minimal sketch of this operation appears after this list.)
- On a mobile device, the amount of processing required to scale sound will affect battery life.
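
To make the scaling operation concrete, here is a minimal C sketch (not code from the lab archive) that scales an array of signed 16-bit samples by a floating-point volume factor, the way the naive approach does:

    #include <stdint.h>
    #include <stddef.h>

    /* Scale 'count' signed 16-bit samples in place by 'volume' (0.00 to 1.00).
       Naive approach: convert each sample to float, multiply, convert back. */
    void scale_samples(int16_t *samples, size_t count, float volume) {
        for (size_t i = 0; i < count; i++) {
            samples[i] = (int16_t)((float)samples[i] * volume);
        }
    }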
Multiple Approaches
Six programs are provided, each with a different approach to the problem, named vol0.c through vol5.c. A header file, vol.h, defines how much data (in number of samples) will be processed by each program, as well as the volume level to be used for scaling (50%).
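
As a rough illustration of such a header, here is a hypothetical sketch; the macro names, values, and the way the volume level is represented in the actual vol.h supplied with the lab may differ:

    /* vol.h -- hypothetical sketch only; the provided header may differ */
    #define SAMPLES 250000   /* number of samples to process (adjust as needed) */
    #define VOLUME  50.0     /* volume level used for scaling, as a percentage */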
These are the six programs:
- vol0.c is the basic or naive algorithm. This approach multiplies each sound sample by the volume scaling factor, casting from signed 16-bit integer to floating point and back again. Casting between integer and floating point can be an expensive operation.
- vol1.c does the math using fixed-point calculations. This avoids the overhead of casting between integer and floating point and back again (see the sketch after this list).
- vol2.c pre-calculates all 65536 different results, and then looks up the answer for each input value.
- vol3.c is a dummy program - it doesn't scale the volume at all. It can be used to determine some of the overhead of the rest of the processing (besides scaling the volume) done by the other programs.
- vol4.c uses Single Instruction, Multiple Data (SIMD) instructions accessed through inline assembly (assembly language code inserted into a C program). This program is specific to the AArch64 architecture and will not build for x86_64.
- vol5.c uses SIMD instructions accessed through compiler intrinsics. This program is also specific to AArch64.
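
To make the fixed-point (vol1.c) and lookup-table (vol2.c) ideas concrete, here is a hedged C sketch of both techniques. It is an illustration only, not the code in the provided programs, which may use different scaling constants and data layouts:

    #include <stdint.h>
    #include <stddef.h>

    /* Fixed-point scaling: represent the volume factor as a scaled integer
       (here 0.5 in Q15 format, i.e. 0.5 * 32768 = 16384), multiply, then
       shift the extra scaling back out. Right-shifting a negative value is
       implementation-defined in C, but gcc and clang shift arithmetically. */
    void scale_fixed(int16_t *samples, size_t count) {
        int32_t vol_fixed = (int32_t)(0.5 * 32768);
        for (size_t i = 0; i < count; i++) {
            samples[i] = (int16_t)(((int32_t)samples[i] * vol_fixed) >> 15);
        }
    }

    /* Table lookup: precompute the scaled result for every possible 16-bit
       input value, then scaling each sample is just an array index.
       In a real program the table would be built once, not on every call. */
    void scale_lookup(int16_t *samples, size_t count) {
        static int16_t table[65536];
        for (int32_t v = -32768; v <= 32767; v++) {
            table[(uint16_t)v] = (int16_t)(v * 0.5);
        }
        for (size_t i = 0; i < count; i++) {
            samples[i] = table[(uint16_t)samples[i]];
        }
    }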
Don't Compare Across Machines
In this lab, do not compare the relative performance across different machines, because various systems have a wide range of processor implementations, from server-class to mobile-class. However, do compare the relative performance of the various algorithms on the same machine.
Benchmarking
Get the files for this lab from one of the SPO600 Servers -- but you can perform the lab wherever you want (feel free to use your laptop or home system). Test on both an x86_64 and an AArch64 system.
The files for this lab are in the archive /public/spo600-volume-examples.tgz on each of the SPO600 servers. The archive contains:
- vol.h controls the number of samples to be processed and the volume level to be used
- vol0.c through vol5.c implement the various algorithms
- vol_createsample.c contains a function to create dummy samples
- The Makefile can be used to build the programs
Perform these steps:
- Unpack the archive /public/spo600-volume-examples.tgz
- Study each of the source code files and make sure that you understand what the code is doing.
- Make a prediction of the relative performance of each scaling algorithm.
- Build and test each of the programs.
- Do all of the algorithms produce the same output?
- How can you verify this?
- If there is a difference, is it significant enough to matter?
- Change the number of samples so that each program takes a reasonable amount of time to execute (suggested minimum is 20 seconds).
- Test the performance of each program.
- Find a way to measure performance without the time taken to perform the test setup pre-processing (generating the samples) and post-processing (summing the results) so that you can measure only the time taken to scale the samples. This is the hard part! (A timing sketch appears after this list.)
- How much time is spent scaling the sound samples?
- Do multiple runs take the same time? How much variation do you observe? What is the likely cause of this variation?
- Is there any difference in the results produced by the various algorithms?
- Does the difference between the algorithms vary depending on the architecture and implementation on which you test?
- What is the relative memory usage of each program?
- Was your prediction accurate?
- Find all of the questions, marked with Q:, in the program comments, and answer those questions.
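
One way to isolate the scaling time is to read a monotonic clock immediately before and after the scaling loop, and to fold the scaled output into a checksum so the optimizer cannot discard the work. The C sketch below shows the general shape; it is not the harness used in the provided programs, and scale_samples and SAMPLES are placeholder names standing in for the function and sample count defined by the lab code:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <sys/resource.h>

    #define SAMPLES 250000000   /* placeholder; adjust so scaling takes enough time to measure */

    void scale_samples(int16_t *s, size_t n, float vol);   /* the code under test */

    int main(void) {
        int16_t *data = malloc(SAMPLES * sizeof(int16_t));
        if (data == NULL) return 1;

        for (size_t i = 0; i < SAMPLES; i++)                /* setup: generate dummy samples */
            data[i] = (int16_t)((int)(i % 65536) - 32768);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);                /* start timing: scaling only */
        scale_samples(data, SAMPLES, 0.5f);
        clock_gettime(CLOCK_MONOTONIC, &t1);                /* stop timing: scaling only */

        long long sum = 0;
        for (size_t i = 0; i < SAMPLES; i++)                /* post-processing: checksum the output */
            sum += data[i];                                 /* so the optimizer keeps the scaling */

        double elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;

        struct rusage ru;                                   /* peak memory use (kilobytes on Linux) */
        getrusage(RUSAGE_SELF, &ru);

        printf("checksum: %lld\n", sum);
        printf("scaling time: %.6f s\n", elapsed);
        printf("max RSS: %ld kB\n", ru.ru_maxrss);
        free(data);
        return 0;
    }

The same pattern can wrap each of the provided scaling functions; run it several times and compare only runs made on the same machine.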
Deliverables
Blog about your experiments with a detailed analysis of your results, including memory usage, performance, accuracy, and trade-offs. Include answers to all of the questions marked with Q: in the source code.
Make sure you convincingly prove your results to your reader! Also be sure to explain what you're doing so that a reader coming across your blog post understands the context (in other words, don't just jump into a discussion of optimization results -- give your post some context).
Optional - Recommended: Compare results across several implementations of AArch64 and x86_64 systems. Note that on different CPU implementations, the relative performance of different algorithms will vary; for example, table lookup may outperform other algorithms on a system with a fast memory system (cache), but not on a system with a slower memory system.
- For AArch64, you could compare the performance on AArchie against a Raspberry Pi 4 (in 64-bit mode) or an ARM Chromebook.
- For x86_64, you could compare the performance of different processors, such as xerxes, your own laptop or desktop, and Seneca systems such as Matrix or lab desktops.
Things to consider
Design of Your Tests
- Most solutions for a problem of this type involve generating a large amount of data in an array, processing that array using the function being evaluated, and then storing that data back into an array. The test setup can take more time than the actual test! Make sure that you measure the time taken for the code in question (the part that scales the sound samples) ONLY -- you need to be able to remove the rest of the processing time from your evaluation.
- You may need to run a very large amount of sample data through the function to be able to detect its performance.
- If you do not use the output from your calculation (e.g., do something with the output array), the compiler may recognize that, and remove the code you're trying to test. Be sure to process the results in some way so that the optimizer preserves the code you want to test. It is a good idea to calculate some sort of verification value to ensure that both approaches generate the same results.
- Be aware of what other tasks the system is handling during your test run, including software running on behalf of other users.

Tips

Analysis: Do a thorough analysis of the results. Be certain (and prove!) that your performance measurement does not include the generation or summarization of the test data. Do multiple runs and discard the outliers. Decide whether to use mean, minimum, or maximum time values from the multiple runs, and explain why you made that decision. Control your variables well. Show relative performance as percentage change, e.g., "this approach was NN% faster than that approach".

Time and Memory Usage of a Program: You can get basic timing information for a program by running time programName -- the output will show the total time taken (real), the amount of CPU time used to run the application (user), and the amount of CPU time used by the operating system on behalf of the application (system).

SOX: If you want to try this with actual sound samples, you can convert a sound file of your choice to raw 16-bit signed integer PCM data using the sox utility (http://sox.sourceforge.net/) present on most Linux systems and available for a wide range of platforms.

stdint.h: The stdint.h header provides definitions for many specialized integer size types. Use int16_t for 16-bit signed integers and uint16_t for 16-bit unsigned integers.

Scripting: Use bash scripting capabilities to reduce tedious manual steps!