GPU621/Apache Spark

== Architecture ==
One of the distinguishing features of Spark is that it processes data in RAM using a concept known as the Resilient Distributed Dataset (RDD): an immutable, distributed collection of objects that can contain any Python, Java, or Scala objects, including user-defined classes. Each dataset is divided into logical partitions, which may be computed on different nodes of the cluster. Spark's RDDs function as a working set for distributed programs, offering a restricted form of distributed shared memory.

Another important abstraction in the Spark architecture is the Directed Acyclic Graph, or DAG. The DAG is the scheduling layer of Spark's architecture and implements stage-oriented scheduling: each job is broken into a graph of stages rather than forced through the rigid multi-stage execution model of Hadoop MapReduce, which gives Spark a performance advantage over Hadoop.
[[File: Cluster-overview.png|thumb|upright=1.1|right|alt=Spark cluster|4.1 Spark Cluster components]]
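The following is a minimal PySpark sketch, not taken from the course material, that illustrates these two ideas: transformations on an RDD are lazily recorded in the DAG, and work is only executed across the partitions when an action is called. The application name and the local master URL are placeholders for illustration.
<syntaxhighlight lang="python">
from pyspark.sql import SparkSession

# Start a local Spark session for illustration (assumes PySpark is installed).
spark = SparkSession.builder.appName("RddDagSketch").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Build an RDD from a Python collection, split into 4 logical partitions.
numbers = sc.parallelize(range(1, 1001), numSlices=4)

# Transformations are lazy: map and filter only add nodes to the DAG.
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# An action triggers the DAG scheduler to break the graph into stages
# and run tasks on each partition, possibly on different cluster nodes.
total = evens.reduce(lambda a, b: a + b)
print(total)

spark.stop()
</syntaxhighlight>
Because the intermediate RDDs (<code>squares</code>, <code>evens</code>) stay in memory rather than being written out between steps, a chain of transformations like this avoids the disk I/O that a sequence of Hadoop MapReduce jobs would incur.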