GPU621/Analyzing False Sharing

1,461 bytes added, 14:53, 7 November 2022

If you perform the same operation many times on a piece of data, it makes sense to keep that data close to the CPU while the operation runs. A loop counter is a good example: you do not want to go to main memory on every iteration just to fetch the counter and increment it.

[[File:CPUCacheArchitecture.png|800px]]

The L3 cache, which is common in modern multicore machines, is larger and slower than L2 and is shared by all CPU cores on a single socket. Finally, main memory, which holds all of the program's data, is larger and slower still and is shared by all CPU cores on all sockets.

When the CPU performs an operation, it first looks for the required data in L1, then in L2, then in L3. If none of these caches holds the data, it has to be fetched from main memory. The farther the CPU has to go, the longer the operation takes, so data that is used very frequently should stay in the L1 cache whenever possible.
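The lookup order described above can be sketched as a toy model. This is only an illustration, not how hardware is programmed: the names <code>kLevels</code> and <code>serving_level</code> are hypothetical, and the "present" flags stand in for whether a given cache level currently holds the data.

```cpp
#include <array>
#include <cstddef>

// Levels ordered from closest to the CPU to farthest (illustrative names only).
constexpr std::array<const char*, 4> kLevels = {"L1", "L2", "L3", "main memory"};

// present[i] says whether cache level i (L1, L2, L3) currently holds the data.
// Returns the index into kLevels of the first level that can serve the request.
// Main memory (the last index) is assumed to always hold the data.
constexpr std::size_t serving_level(const std::array<bool, 3>& present) {
    for (std::size_t i = 0; i < present.size(); ++i) {
        if (present[i]) {
            return i;  // hit: stop at the closest level that has the data
        }
    }
    return kLevels.size() - 1;  // miss in every cache: fall back to main memory
}
```

For example, a request that misses in L1 and L2 but hits in L3 returns index 2, and a request that misses in all three caches returns index 3 (main memory).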

== CPU cache line ==

A cache is made up of cache lines. A cache line is usually 64 bytes (most current processors use 64-byte cache lines, while older processors used 32-byte cache lines), and each one maps to a contiguous block of addresses in main memory.

A C++ double is typically 8 bytes, so a 64-byte cache line can hold 8 double values.
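The arithmetic can be checked with a small compile-time helper. This is a minimal sketch that assumes the 64-byte line size mentioned in the text. The names <code>kAssumedCacheLineBytes</code> and <code>elements_per_line</code> are made up for this example, and the actual line size depends on the processor.

```cpp
#include <cstddef>

// Assumption: 64-byte cache lines, as in the text. Real hardware may differ.
constexpr std::size_t kAssumedCacheLineBytes = 64;

// Number of whole objects of type T that fit in one cache line of line_bytes bytes.
template <typename T>
constexpr std::size_t elements_per_line(std::size_t line_bytes = kAssumedCacheLineBytes) {
    return line_bytes / sizeof(T);
}
```

On a platform where <code>sizeof(double)</code> is 8, <code>elements_per_line<double>()</code> evaluates to 8, and <code>elements_per_line<double>(32)</code> gives 4 for the older 32-byte lines.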

[[File:CPUCacheLines.png|800px]]

While a program runs, each cache fill loads a full line of 64 consecutive bytes from main memory. Thus, when one element of an array of type double is loaded into the cache, the neighbouring elements in the same cache line (up to 7 of them) are loaded along with it. However, if the items in a data structure are not adjacent to each other in memory, as in a linked list, this benefit of cache line loading is lost.
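To see how contiguous elements share cache lines, the following sketch counts how many lines a run of doubles covers. It assumes 64-byte lines and 8-byte doubles. The helper names <code>cache_line_index</code> and <code>lines_spanned</code> are hypothetical, and the code only does address arithmetic without inspecting the real cache.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, as in the text. Real hardware may differ.
constexpr std::uintptr_t kAssumedCacheLineBytes = 64;

// Index of the (assumed) cache line that contains address p.
inline std::uintptr_t cache_line_index(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) / kAssumedCacheLineBytes;
}

// Number of distinct cache lines covered by n contiguous doubles starting at data.
inline std::size_t lines_spanned(const double* data, std::size_t n) {
    if (n == 0) {
        return 0;
    }
    const auto* first = reinterpret_cast<const unsigned char*>(data);
    const unsigned char* last = first + n * sizeof(double) - 1;  // last byte used
    return static_cast<std::size_t>(cache_line_index(last) - cache_line_index(first)) + 1;
}
```

With an array declared as <code>alignas(64) double a[16]</code>, <code>lines_spanned(a, 8)</code> is 1, so a single cache fill brings in all 8 elements. Starting one element later, <code>lines_spanned(a + 1, 8)</code> is 2, because the run now crosses a line boundary. The nodes of a linked list are allocated separately, so there is no such guarantee that neighbouring nodes share a line.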

Ryan Leong
