Open main menu

CDOT Wiki β

Changes

Computer Architecture

2,420 bytes added, 12:09, 14 January 2014
no edit summary
* The ''work'' of a CPU is performed by '''Execution Units''', which perform operations such as loading and storing data from/to memory (load/store unit), performing integer math (integer unit), executing [[Bitwise Operations|bitwise operations]], and performing floating-point math (floating-point unit, or FPU). The length of time taken to perform an operation varies according to the sophistication of the execution unit and the complexity of the operation. For example, a multiplication can be performed in many different ways, ranging from repeated addition (very slow, but requiring very little hardware logic) to table lookup (very fast, but requiring a lot of silicon), with most operations falling somewhere in the middle. A multiplication will almost always take longer to perform than an addition, and may vary according to carry and overflow sub-operations required. The use of multiple units permits faster operations to be completed on some execution units while other (slower) operations are taking place on other execution units.
* As instructions are performed, special results are recorded as '''[[Flags]]''' within the CPU. For example, adding or multiplying two numbers will set a "Carry" flag when the result overflows the word width. Other flags may indicate zero or negative result values. These flags can then be used in later operations -- for example, a branch may be taken if the carry flag is set. The number of flags, their specific meanings, and the circumstances under which they are set (to binary "1") and cleared (to binary "0") vary from architecture to architecture.
* '''Cache''' is high-speed memory placed between RAM and the CPU. This memory is faster than main RAM but much smaller; it improves performance by enabling the CPU to continue to write data quickly and continue without waiting for the data to be written out to main RAM. It also provides fast access to instructions or data that are accessed repeatedly, such as when a small loop is being executed. The performance difference between a loop that fits into cache and a loop that does not fit into cache can be substantial. Cache memory is arranged in "lines" which are typically a multiple of the word size; requesting a memory address that is not in cache results in a "cache miss" which causes a stall while the cache contents are retrieved from main memory. Cache design varies in many details, especially including in write behaviour -- the cache can simply carry a write through to main memory (write-through), or it can hold the data in cache and write it back at a later time (write-back).
* '''Pre-fetching''' the process of retrieving instructions from memory before executing them. Done effectively, this avoids pipeline stalls due to cache misses.
* '''Branch prediction''' is used to guess whether a branch will be taken or not taken based on past history. For example, in most loops, the same branch is taken repeatedly until the loop exit condition is met, so a prediction that the loop will be taken will be correct most of the time. However, inside the loop, there may be a conditional statement ("if") which is usually executed, so predicting that the branch that skirts around the conditional code will ''not'' be taken will be correct most of the time. Branch prediction is used in conjunction with pre-fetching and pipelining to improve performance.
The [[Instruction Set Architecture]] specifies the encoding of instructions. This is specific to a particular architecture family and therefore dependent on certain architectural features, such as the register set, but independent of other features, such as the cache type -- because the cache type affects performance but not the instructions which can be executed by the CPU.
 
== Sub-word and Unaligned Access ==
 
Most processors use [[Word|word]] size that is multiple of width of some common data types -- for example, a system with a 32-bit [[Word#Hardware Word|hardware word size]] that is running applications which use UTF-8 character encoding may often need to read or write single [[Word#Byte|bytes]] of data. A byte-sized read will cause the CPU to perform a 32-bit read, followed by [[Bitwise Operations|masking and shifting operations]] to extract the desired byte from the 32-bit word. A single-byte write operation will cause the CPU to read the existing word, extract the unaffected bits within that word, [[Bitwise Operations#OR]] in the new value, and then write the word back to memory.
 
Unaligned memory access causes similar issues. For example, to read a 32-bit value from the byte address 0x2, most 32-bit CPUs will read a 32-bit value starting at byte address 0x0 and perform an [[Bitwise Operations|AND]] to extract highest 16 bits, then a shift to move those bits to the lowest 16 bit positions. The CPU will then read a 32-bit value starting at byte address 0x4, perform an AND to extract the lowest 16 bits, shift those bits to the highest 16 bit positions, and then OR the high 16 bits and the low 16 bits together into a single 32 bit value. (Writing an unaligned 32-bit value is even worse!)
 
Obviously, unaligned access is far slower than aligned access, and should be avoided whenever possible. However, aligning all storage may result in increased memory usage (e.g., aligning 24-bit pixel values on 32-bit boundaries), and some data such as compressed data streams or network packets will almost always contain unaligned data.
 
Some processors do not have alignment-fixup hardware, and an unaligned access causes a processor exception. The operating system may ignore the unaligned access (usually leading to incorrect results), stop the program performing the unaligned access, or fix up the access in software and then resume execution of the process which caused the exception.
 
On a Linux system, the control file /proc/cpu/alignment controls how the operating system will handle alignment exceptions (on machines which lack alignment-fixup hardware). The possible values are:
 
0 = ignore
1 = warn (via kernel message)
2 = fixup
3 = fixup + warn
 
Note that even on systems that perform alignment fixups in hardware, unaligned access is [[Expensive|expensive]].
== Processor-specific Optimizations ==