GPU621/False Sharing

=== So why do we have so many types? ===
In an ideal world, the CPU would have lightning-fast access to all data on your machine at any given time to maximize the power and speed of modern-day CPUs. In reality, there are limitations and trade-offs when balancing speed, storage capacity, location, and cost. To compromise, a computer utilizes many types of memory, arranged in a hierarchy.
Near the bottom, we have secondary memory like hard drives, flash drives, and SSDs, which can permanently hold vast quantities of data but are relatively slow. Hard drives in particular have mechanical parts that must physically move. If the CPU relied solely on getting data from this type of memory, the slow access speeds would become a massive bottleneck in computation time. The CPU would spend the majority of its time idling, waiting for data. To the end user, your machine would appear to be extremely sluggish and non-responsive.
Next is Dynamic Random Access Memory (DRAM), which uses capacitors, transistors, and electricity to store data. It requires constant refreshing with electricity, making it volatile: its data disappears when power is cut. DRAM is more expensive than secondary memory, but smaller and faster. However, this is still not fast enough for the CPU. Last, we have Static Random Access Memory (SRAM), which is extremely fast, but even smaller and more expensive. This type of memory is what is used in the cache. When it needs data, the CPU looks in the cache first. If it is there, it is a cache hit. If it is not there, it is a cache miss, and the CPU must search main memory or even further down the hierarchy until it finds it. Minimizing the number of cache misses ensures the CPU has a steady flow of data it can quickly retrieve and compute. The cache operates on locality of reference, which refers to the tendency of programs to access the same set of memory locations repeatedly over a short period of time. There are two major types: temporal locality, where a memory location that has been accessed will likely be accessed again in the near future, and spatial locality, where if one memory location is accessed, nearby memory locations will likely be needed as well. Using these principles and complex algorithms, data is brought into the cache ahead of time to minimize the number of cache misses.
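To make these two principles concrete, the short snippet below (not from the original article) sums a matrix in row-major order, assuming the usual C++ layout where each row's elements are contiguous in memory:

<source lang="cpp">
#include <vector>

// Illustration of locality of reference (a sketch, not the article's code).
int sum_matrix(const std::vector<std::vector<int>>& m) {
    int sum = 0;
    for (const auto& row : m)   // temporal locality: 'sum' is reused on every iteration
        for (int x : row)       // spatial locality: row elements are contiguous,
            sum += x;           // so each cache line fetched is fully used
    return sum;
}
</source>

Traversing column by column instead would jump to a different row's memory on every access, wasting most of each cache line that was fetched.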
=== Cache Coherence and Cache Line ===
[[File:cache_coherence.png|right|500px|thumb|Cache structure in a multi-processor system with shared memory.]]
Each processor has its own local cache. When data is needed, fixed blocks of memory are transferred to the cache; these blocks are known as cache lines.
However, if multiple processors each modify their own cached copy of the same data, how do we know which one is correct? We can't store all the different versions or pick a random one to be the correct version; this would create havoc in our computer systems.
The answer is that we need cache coherence. Cache coherence is defined as the uniformity of shared resource data that ends up stored in multiple local caches. In other words, we must keep all the local caches synchronized. The challenge of doing so is known as the cache coherence problem. To solve this problem, multi-processor systems rely on cache coherence protocols to manage and maintain the cache. While the details of these protocols are beyond the scope of this assignment, more information on cache coherence protocols can be found here: https://en.wikipedia.org/wiki/Cache_coherency_protocols_(examples).
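As an aside, the cache-line size assumed throughout this article does not have to be hard-coded: since C++17 the standard library exposes std::hardware_destructive_interference_size as a hint for how far apart two objects should be to avoid landing on the same cache line (compiler support varies, and the value is a hint rather than a guarantee):

<source lang="cpp">
#include <iostream>
#include <new>

int main() {
    // C++17 hint for the minimum spacing that avoids false sharing;
    // on most x86-64 systems this reports 64 bytes.
    std::cout << std::hardware_destructive_interference_size << " bytes\n";
}
</source>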
<br><br><br><br><br><br>
[[File:naive_implementation.png|500px|thumb|Execution time of naive implementation without any optimization levels (Od).]]
The algorithm calculates the correct answer, but the performance is absolutely terrible. The reason is that an int array is a contiguous block of memory, with each integer taking up 4 bytes. Even though each thread modifies only its own indexed element, due to spatial locality the system will bring in the other elements of the array as part of the same cache line, opening up the possibility for multiple threads to share a cache line and resulting in false sharing. Although there were cases where a higher thread count produced better results, there were many cases that performed worse than a single thread. This is due to the scheduling of thread execution, which is out of the programmer's hands. It is possible that the selected schedule managed to minimize the frequency of false sharing, giving better performance. However, this is extremely unreliable, so we need a better solution to false sharing.
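The naive implementation itself is not reproduced in this excerpt; the sketch below is a minimal stand-in that exhibits the same pattern, assuming a 64-byte cache line and a hypothetical per-thread counting workload (the thread count, N, and the work done per element are all illustrative):

<source lang="cpp">
#include <iostream>
#include <omp.h>

constexpr int NUM_THREADS = 8;        // hypothetical values for illustration
constexpr long N = 100000000;

int main() {
    // 8 ints = 32 bytes: they typically all sit on a single 64-byte cache line.
    int partial[NUM_THREADS] = {0};

    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        // Each thread writes only partial[id], but every write invalidates
        // the copies of the shared line held by the other cores: false sharing.
        for (long i = id; i < N; i += NUM_THREADS)
            partial[id] += i % 2;     // arbitrary per-element work
    }

    long total = 0;
    for (int t = 0; t < NUM_THREADS; ++t) total += partial[t];
    std::cout << "total = " << total << '\n';
}
</source>

Every write to partial[id] forces the other cores to re-fetch the line, so the threads spend most of their time waiting on the cache coherence protocol instead of computing.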
<br><br><br><br><br><br><br><br><br><br><br><br>
'''- tons of wasted memory'''
For every 4 bytes used to store a thread's calculation, 60 bytes are empty padding, and this wasted space grows with every new thread that is added.
'''- the cache line only contains one piece of data, with the rest being padding'''
We want to minimize cache misses. In practice, the scheduler will be juggling many threads, each with their own data that may or may not be related to this program. By padding out the cache line, we are hogging valuable cache space that could have held tons of other data, reducing the effectiveness of the cache. A sketch of this padded layout follows below.
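For reference, here is a minimal sketch of the padded layout these drawbacks describe, assuming a 64-byte cache line and the same hypothetical workload as the earlier sketch (not the article's exact code):

<source lang="cpp">
#include <iostream>
#include <omp.h>

constexpr int NUM_THREADS = 8;
constexpr long N = 100000000;
constexpr int CACHE_LINE = 64;        // assumed cache-line size

// Each element is forced onto its own 64-byte cache line:
// 4 bytes hold data, the remaining 60 bytes are padding.
struct alignas(CACHE_LINE) PaddedInt {
    int value = 0;
};

int main() {
    PaddedInt partial[NUM_THREADS];

    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        for (long i = id; i < N; i += NUM_THREADS)
            partial[id].value += i % 2;   // no cache line is shared between threads now
    }

    long total = 0;
    for (int t = 0; t < NUM_THREADS; ++t) total += partial[t].value;
    std::cout << "total = " << total << '\n';
}
</source>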
Identifying false sharing requires detailed code inspection, as everything will appear to work normally on the surface. However, when left unchecked, false sharing will be detrimental to a program's performance and scalability. This article discussed the relationship between the cache and false sharing. In addition, using a simple example, we explained how you can use padding, synchronization, and thread-local variables to reduce the frequency of false sharing.
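The article's final version of the code is not part of this excerpt, but the thread-local-variable fix mentioned above typically looks like the sketch below: each thread accumulates into its own local variable and performs a single synchronized write at the end (OpenMP assumed, workload hypothetical):

<source lang="cpp">
#include <iostream>
#include <omp.h>

constexpr long N = 100000000;

int main() {
    long total = 0;

    #pragma omp parallel
    {
        // Each thread accumulates into its own stack variable,
        // so no cache line is contended during the hot loop.
        long local = 0;
        #pragma omp for
        for (long i = 0; i < N; ++i)
            local += i % 2;

        // One synchronized write per thread at the very end.
        #pragma omp atomic
        total += local;
    }

    std::cout << "total = " << total << '\n';
}
</source>

Because the accumulator lives on each thread's own stack, the cache line holding it is never shared; the only synchronization cost is a single atomic update per thread.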
 
== Presentation ==
PDF [[File:kchou_false_sharing.pdf]]
== References ==
* Gillis, A. S. (2019, November 7). ''What is DRAM (Dynamic Random Access Memory)? how does it work?''. SearchStorage. Retrieved December 6, 2021, from https://searchstorage.techtarget.com/definition/DRAM.
* Intel ISN. (2010, June 2). ''Avoiding and identifying false sharing among threads.'' CodeProject. Retrieved December 4, 2021, from https://www.codeproject.com/Articles/85356/Avoiding-and-Identifying-False-Sharing-Among-Threa.
* Jain, R. (2018, December 17). ''Memory hierarchy design and its characteristics.'' GeeksforGeeks. Retrieved December 4, 2021, from https://www.geeksforgeeks.org/memory-hierarchy-design-and-its-characteristics/.
* Jha, A. K. (2020, August 24). ''Cache Coherence.'' GeeksforGeeks. Retrieved December 4, 2021, from https://www.geeksforgeeks.org/cache-coherence/.
* Knerl, L. (2021, April 19). ''What is DRAM (Dynamic Random Access Memory)?'' HP® Tech Takes. Retrieved December 6, 2021, from https://www.hp.com/us-en/shop/tech-takes/what-is-dram-dynamic-random-access-memory.
* Neso Academy. (2021, September 26). ''Cache Coherence Problem & Cache Coherency Protocols'' [Video]. YouTube. https://www.youtube.com/watch?v=r_ZE1XVT8Ao&ab_channel=NesoAcademy
* OpenMP. (2013, December 6). ''Introduction to OpenMP: 06 Discussion 2'' [Video]. YouTube. https://www.youtube.com/watch?v=OuzYICZUthM
* OpenMP. (2013, December 6). ''Introduction to OpenMP: 08 Discussion 3'' [Video]. YouTube. https://www.youtube.com/watch?v=pLa972Rgl1I
* Oracle. (2015, February 17). ''8.2.1 What is False Sharing?'' Oracle® Solaris Studio 12.4: OpenMP API User's Guide. Retrieved December 4, 2021, from https://docs.oracle.com/cd/E37069_01/html/E37081/aewcy.html#scrolltoc.
* PCMag. (n.d.). ''Definition of Cache Line.'' Retrieved December 4, 2021, from https://www.pcmag.com/encyclopedia/term/cache-line.
* PCMag. (n.d.). ''Definition of locality of reference''. PCMag. Retrieved December 3, 2021, from https://www.pcmag.com/encyclopedia/term/locality-of-reference.
* PowerCert Animated Videos. (2016, November 27). ''CPU Cache Explained - What is Cache Memory?'' [Video]. YouTube. https://www.youtube.com/watch?v=yi0FhRqDJfo&ab_channel=PowerCertAnimatedVideos
* Roomi, M. (2020, March 6). ''5 advantages and disadvantages of Hard Disk Drive: Weaknesses & benefits of Hard Disk Drive''. HiTechWhizz. Retrieved December 6, 2021, from https://www.hitechwhizz.com/2020/03/5-advantages-and-disadvantages-drawbacks-benefits-of-hard-disk-drive.html.
* Singh, B. (2019, August 21). ''Locality of reference and cache operation in Cache Memory''. GeeksforGeeks. Retrieved December 6, 2021, from https://www.geeksforgeeks.org/locality-of-reference-and-cache-operation-in-cache-memory/.
* Techquickie. (2016, June 15). ''What is CPU Cache?'' [Video]. YouTube. https://www.youtube.com/watch?v=sHqNMHf2UNI
* Thampson. (2018, February 23). ''Types of SSDs and Which Ones to Buy''. Techbytes. Retrieved December 5, 2021, from https://blogs.umass.edu/Techbytes/2018/02/23/types-of-ssds-and-which-ones-to-buy/.