[https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-lineage.html Source]
==== D (Distributed) ====
[[File: spark-distribution.png| 600px]]
Describes how data resides on multiple nodes in a cluster across a network of machines. The data can be read from and written to distributed storage such as HDFS or S3 and, most importantly, can be cached in the memory of worker nodes for immediate reuse. Spark is designed as a framework that operates over a network infrastructure, so tasks are divided and executed across multiple nodes of the cluster, coordinated through a Spark Context.
==== D (Dataset) ====
[[File: partition-stages.png| 600px]]
[https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-partitions.html Source]
=== RDD Essential Properties ===
An RDD has three essential properties and two optional properties.
* Optional preferred locations (aka locality info), i.e. hosts for a partition where the data will have been loaded.
=== RDD Functions ===
RDDs support two kinds of operations: ''transformations'' and ''actions''.
The essential idea is that the programmer specifies a transformation, or a series of transformations, to perform on a data set, and finally performs an action that returns the result of those transformations. Transformations are lazy: each one only describes a new RDD, and no work is done until an action is called. The data returned by the action can then be used for analysis or as input to further transformations. Transformations can be thought of as the start of a parallel region of code, and the action as the end of the parallel region. Spark is designed to keep this simple for the programmer, so partitions, threads, etc. are generated automatically.
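The split between transformations and actions can be illustrated with a small sketch. This is a toy, single-process model written in plain Python, not Spark's implementation: the class <code>ToyRDD</code> and the helper <code>square</code> are hypothetical names chosen for the example, and only the method names (<code>parallelize</code>, <code>map</code>, <code>filter</code>, <code>collect</code>, <code>count</code>) are borrowed from the RDD API. It shows that chaining transformations only builds a recipe, and that the work happens when an action is called.

```python
from typing import Callable, Iterable, Iterator, List


class ToyRDD:
    """A toy, single-process stand-in for an RDD (not Spark's implementation)."""

    def __init__(self, compute: Callable[[], Iterator]):
        # The recipe for producing the elements; nothing runs until an action.
        self._compute = compute

    @classmethod
    def parallelize(cls, data: Iterable) -> "ToyRDD":
        items = list(data)
        return cls(lambda: iter(items))

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, f: Callable) -> "ToyRDD":
        return ToyRDD(lambda: (f(x) for x in self._compute()))

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)))

    # Actions: run the whole chain and return a plain Python value.
    def collect(self) -> List:
        return list(self._compute())

    def count(self) -> int:
        return sum(1 for _ in self._compute())


calls = []  # records every element that square() is actually applied to


def square(x: int) -> int:
    calls.append(x)
    return x * x


numbers = ToyRDD.parallelize(range(1, 6))
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)

work_before_action = len(calls)   # 0: the transformations have not run yet
result = evens_squared.collect()  # the action triggers the computation
print(work_before_action, result)
```

In this sketch <code>collect()</code> plays the role of the action that ends the "parallel region"; a real RDD would additionally split the data into partitions and run the chain on several worker nodes.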
=== Advantages ===
The main advantage of Spark is that data partitions can be kept in memory, so access is much faster than retrieving the data from a hard disk. This is also a disadvantage in some cases, because storing large datasets in memory requires a large amount of physical memory.
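The trade-off can be sketched with a toy model in plain Python. The class <code>ToyDataset</code>, its methods, and <code>slow_source</code> are hypothetical names for illustration only and are not part of Spark; the real RDD API exposes <code>cache()</code> and <code>persist()</code> for this purpose. The sketch counts how often the slow source is read with and without an in-memory copy: caching saves repeated loads, at the cost of holding the whole dataset in memory.

```python
from typing import Callable, List, Optional


class ToyDataset:
    """Toy model of in-memory caching (hypothetical, not Spark's API)."""

    def __init__(self, load: Callable[[], List[int]]):
        self._load = load                       # stands in for a slow disk read
        self._use_cache = False
        self._memory: Optional[List[int]] = None
        self.loads = 0                          # how often the source was read

    def cache(self) -> "ToyDataset":
        # Ask for the data to be kept in memory after the first load.
        self._use_cache = True
        return self

    def _data(self) -> List[int]:
        if self._memory is not None:
            return self._memory                 # served from memory
        self.loads += 1
        data = self._load()
        if self._use_cache:
            self._memory = data                 # costs memory, saves reloads
        return data

    def total(self) -> int:
        return sum(self._data())

    def largest(self) -> int:
        return max(self._data())


def slow_source() -> List[int]:
    # Assumption: pretend this reads the data from a hard disk.
    return list(range(1000))


uncached = ToyDataset(slow_source)
uncached.total()
uncached.largest()
uncached_loads = uncached.loads   # source read once per operation

cached = ToyDataset(slow_source).cache()
cached.total()
cached.largest()
cached_loads = cached.loads       # source read only once

print(uncached_loads, cached_loads)
```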