=== Components ===
[[File: cluster-overview.png|alt=Spark Components|link=https://gpu621.nickscherman.com/assets/images/cluster-overview.png]]
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).
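The sketch below is a minimal, hypothetical driver program showing where the SparkContext fits in; the application name and the <code>local[*]</code> master are placeholders (on a real cluster the master is normally supplied through <code>spark-submit</code>).
<syntaxhighlight lang="scala">
import org.apache.spark.{SparkConf, SparkContext}

// Minimal driver program sketch. App name and master URL are placeholders.
object SimpleDriver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SimpleDriver")
      .setMaster("local[*]")        // run locally, one thread per core

    val sc = new SparkContext(conf) // the SparkContext coordinates the executors

    val data = sc.parallelize(1 to 1000)
    println(data.sum())             // a simple job executed on the cluster

    sc.stop()
  }
}
</syntaxhighlight>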
=== R (Resilient) ===
Resilient refers to fault tolerance: the ability to recompute missing or damaged partitions, mainly through what is called a lineage graph. An RDD lineage graph (also known as an RDD operator graph) is a graph of all the parent RDDs of an RDD. It is built as a result of applying transformations to the RDD and describes a logical execution plan. The following RDD graph shows the result of a series of transformations. The lineage is only evaluated when an ''action'' is called. Actions are an essential property of Spark that we will cover shortly.
[[File: rdd-lineage-graph.png|alt=RDD GRaph|link=https://gpu621.nickscherman.com/assets/images/rdd-lineage-graph.png]]
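As a hedged sketch (assuming a spark-shell session where <code>sc</code> is predefined; the file path is illustrative), a chain of transformations followed by an action builds and then evaluates a lineage graph like the one above:
<syntaxhighlight lang="scala">
// Each transformation adds a node to the lineage (RDD operator) graph;
// nothing executes until an action is called. Path and names are illustrative.
val lines  = sc.textFile("hdfs:///data/input.txt")
val words  = lines.flatMap(line => line.split(" "))
val pairs  = words.map(word => (word, 1))
val counts = pairs.reduceByKey(_ + _)

println(counts.toDebugString) // prints the lineage graph built so far
counts.collect()              // action: triggers evaluation of the whole lineage
</syntaxhighlight>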
=== D (Distributed) ===
[[File: spark-distribution.png|alt=Spark Distribution|link=https://gpu621.nickscherman.com/assets/images/spark-distribution.png|600px]]
Distributed describes how the data of an RDD resides on multiple nodes in a cluster, across a network of machines. RDDs can be read from and written to distributed storage such as HDFS or S3 and, most importantly, can be cached in the memory of worker nodes for immediate reuse. Spark is designed as a framework that operates over a network infrastructure, so tasks are divided and executed across the multiple nodes registered with the SparkContext.
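As a rough sketch (the paths are placeholders and <code>sc</code> is an existing SparkContext), an RDD can be read from distributed storage, cached in executor memory, and written back out:
<syntaxhighlight lang="scala">
import org.apache.spark.storage.StorageLevel

// Read from distributed storage (HDFS here; an s3:// URI works the same way)
// and cache the partitions in the memory of the worker nodes.
val events = sc.textFile("hdfs:///data/events.log")   // placeholder path
events.persist(StorageLevel.MEMORY_ONLY)              // same effect as events.cache()

val errors = events.filter(_.contains("ERROR"))
println(errors.count())   // first action reads from HDFS and fills the cache
println(errors.count())   // later actions reuse the cached partitions of events

events.saveAsTextFile("hdfs:///data/events-copy")     // write back to distributed storage
</syntaxhighlight>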
=== D (Dataset) ===
[[File: partition-stages.png|alt=RDD Dataset|link=https://gpu621.nickscherman.com/assets/images/partition-stages.png|600px]]
The RDD dataset is a collection of automatically partitioned data. Spark manages data using partitions, which helps parallelize distributed data processing with minimal network traffic for sending data between executors. By default, Spark tries to read data into an RDD from the nodes that are close to it. Since Spark usually accesses distributed, partitioned data, it creates partitions to hold the data chunks so that transformation operations can be optimized.
[https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/ Source]
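A brief sketch of inspecting and controlling this partitioning (again assuming <code>sc</code> from a spark-shell session; the path and partition counts are illustrative):
<syntaxhighlight lang="scala">
// Spark splits an RDD into partitions; each partition is processed by one task.
val numbers = sc.parallelize(1 to 100000, 8)   // request 8 partitions explicitly
println(numbers.getNumPartitions)              // => 8

// Files read from HDFS default to roughly one partition per block;
// repartition() reshuffles the data into a new number of partitions.
val logs = sc.textFile("hdfs:///data/big.log") // placeholder path
val rebalanced = logs.repartition(16)
println(rebalanced.getNumPartitions)           // => 16
</syntaxhighlight>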
[[File: 4.jpg|alt=Logistic Regression Performance|link=http://cdn.edureka.co/blog/wp-content/uploads/2015/12/4.jpg]]
[http://www.edureka.co/blog/apache-spark-vs-hadoop-mapreduce Logistic Regression Performance Comparison]
=== Installation ===
[[File: download-spark.png|alt=Spark Installation|link=https://gpu621.nickscherman.com/assets/images/download-spark.png|600px]]
Spark is available for most UNIX platforms (including OS X) as well as Windows. Installation on Windows is more difficult since it usually requires building from source. This guide covers installing Spark on Linux. If you want to follow along, you can install it on your local Linux laptop or desktop, or on Seneca [https://www.matrix.senecac.on.ca Matrix], since the pre-built binaries can be extracted and run from your home directory.