Benchmarking
A bench mark was originally a surveyor's mark used to record a reference elevation. A surveyor would literally mark a pillar, post, or stone with the reference elevation, which would correspond to the height at which they would place their bench (platform for measuring equipment) to measure other elevations. (See the etymology of benchmark at Wiktionary.)
Benchmarking Software
In the software industry, benchmarking means measuring performance in a reliable, repeatable way. This is done to compare the relative performance of:
- two versions of the same software (to gauge the effect of changes made to the software);
- two different pieces of software which do the same thing (e.g., two webservers);
- the same software running with different libraries or operating systems (e.g., Apache under Windows and OS X);
- the same software built in two different ways (e.g., using different compilers or optimization options); or
- the same software on two different computers (e.g., x86_64 vs. mainframe) or computer configurations (e.g., SSD vs. hard disk, or 8 GB RAM vs. 64 GB RAM).
Factors to Control
To produce reliable, repeatable results, variables must be controlled or eliminated. The most common variables affecting performance results on a system are:
- the data being processed;
- the state of caches;
- other activity on the system (other processes, users, network activity, and so forth).
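One way to hold the data being processed constant is to generate it from a pseudo-random number generator with a fixed seed. The following is a minimal sketch in Python, not a prescribed method; the function name make_dataset, the seed value, and the data set size are all illustrative assumptions.

```python
import random

def make_dataset(seed, size):
    """Return a reproducible list of pseudo-random integers (hypothetical helper)."""
    # A private generator is not affected by other code that uses the
    # module-level random functions.
    rng = random.Random(seed)
    return [rng.randrange(1_000_000) for _ in range(size)]

# The same seed yields the same data on every run, so each benchmark run
# processes identical input.
first = make_dataset(seed=42, size=1000)
second = make_dataset(seed=42, size=1000)
print(first == second)
```

Writing the generated data to a file once and reading it back for each run would serve the same purpose.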
Typical Benchmark Process
Execution-time Benchmarks
- Decide on the processing to be benchmarked. It is best to avoid all human interaction (user interfaces) and use data that is consistent (data sets should be provided by a file, random numbers should be generated by a PRNG given identical seeds, etc.). Pick a data set size that is
- Disable any unnecessary background processing (daemons, cron jobs, screen sessions, and so forth).
- Warm up disk and network caches by doing an initial program run and discarding the results.
- Execute the benchmark process several times, recording the execution time for each run. If the results are not consistent, determine why and eliminate the variation.
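The warm-up and repeated-run steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a complete harness: workload is a hypothetical stand-in for the processing being benchmarked, and the run and warm-up counts are arbitrary.

```python
import statistics
import time

def workload():
    """Hypothetical stand-in for the processing being benchmarked."""
    return sum(i * i for i in range(200_000))

def time_runs(func, runs=5, warmups=1):
    """Return the wall-clock time of each run, after discarding warm-up runs."""
    for _ in range(warmups):
        func()  # warms caches; the result and timing are thrown away
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings

timings = time_runs(workload)
mean = statistics.mean(timings)
spread = statistics.stdev(timings)
print("runs (s):", ", ".join(f"{t:.4f}" for t in timings))
print(f"mean {mean:.4f} s, stdev {spread:.4f} s ({spread / mean:.1%} of mean)")
```

A standard deviation that is large relative to the mean suggests that some variable is still uncontrolled.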
Volume-of-Work Benchmarks
For some operations, such as serving web pages or remote storage, it may be more appropriate to determine how much work (e.g., how many web requests) can be completed in a given amount of time.
- Decide on the processing to be benchmarked, and which program will be used to generate the test load (e.g., a program to request web pages, such as httpbench, or a program to generate storage requests, such as bonnie++).
- Decide whether the load generator should be run on the same system as the server, or on another network-connected system (ensure that the network connection is fast enough that it will not be the limiting factor).
- Set up the server.
- Run the benchmark several times. Discard the first result (it may be affected by cache state).
- If the results are not consistent, determine why and eliminate the variation.
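The idea of counting work completed in a fixed time window can be sketched in Python as below. It is only an in-process illustration: handle_request is a hypothetical stand-in for one unit of work, whereas a real volume-of-work benchmark would drive an actual server with a separate load generator as described above. The window length and run count are arbitrary assumptions.

```python
import time

def handle_request():
    """Hypothetical stand-in for one unit of work (e.g., serving one request)."""
    return sum(range(1000))

def count_operations(func, duration):
    """Call func repeatedly for about `duration` seconds; return (count, elapsed)."""
    count = 0
    start = time.perf_counter()
    deadline = start + duration
    while time.perf_counter() < deadline:
        func()
        count += 1
    elapsed = time.perf_counter() - start
    return count, elapsed

results = []
for run in range(4):
    count, elapsed = count_operations(handle_request, duration=0.2)
    results.append(count / elapsed)

# Discard the first run, which may be affected by cache state.
steady = results[1:]
print(", ".join(f"{r:.0f}" for r in steady), "operations/second")
```

Dividing by the measured elapsed time, rather than the requested duration, accounts for the last operation running slightly past the deadline.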
Comparing Different Systems
To compare benchmarks on different systems or with different software, it is important to configure the systems as similarly as possible. Doing so is left as an exercise for the reader :-)