GPU621/Spark

The presentation was given today. It is available at http://gpu621.nickscherman.com/. Presentation notes will be up shortly.
'''Nov 24th'''
Notes have been added. Introduction to Apache Spark and an elaboration of the presentation contents.
This is a supplement to the presentation for those who want an in-depth walkthrough of the concepts and code. The presentation focused on an introduction to the technical and practical aspects of Spark, and these notes will focus on the same.
=== Spark ===
Spark is a Big Data framework for large-scale data processing. It provides an API centred on a data structure called the Resilient Distributed Dataset (RDD): a read-only, fault-tolerant multiset of data items distributed over a cluster of machines. High-level APIs are available for Scala, Java, Python, and R. This tutorial focuses on Python code for its simplicity and popularity.
=== History ===
Spark was developed in 2009 at UC Berkeley's AMPLab. It was open sourced in 2010 under the BSD license. As of this writing (November 2016), it is at version 2.0.2.
Spark is one of the most active projects in the Apache Software Foundation and one of the most popular open-source big data projects overall. Spark had over 1000 contributors in 2015.
{| border="1"
|-
! Original release !! Latest version !! Release date
|-
| 0.5 || 2.0.2 || November 2016
|}
=== Components ===
[[File: cluster-overview.png|alt=Spark Components|link=https://gpu621.nickscherman.com/assets/images/cluster-overview.png]]
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).
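As a minimal sketch of how a driver program gets off the ground (assuming a local PySpark installation; the application name and the <code>local[4]</code> master URL are placeholder choices for this example):
<syntaxhighlight lang="python">
# Minimal driver-program sketch (assumes PySpark is installed locally).
# "local[4]" runs Spark in-process with 4 worker threads; on a real
# cluster this would instead point at the cluster manager.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("GPU621Demo").setMaster("local[4]")
sc = SparkContext(conf=conf)   # the SparkContext coordinates the executors

print(sc.version)              # verify the context is alive
sc.stop()                      # release cluster resources when finished
</syntaxhighlight>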
==== R (Resilient) ====
Resilient refers to fault tolerance: the ability to recompute missing or damaged partitions, mainly through what is called a lineage graph. An RDD lineage graph (aka RDD operator graph) is a graph of all the parent RDDs of an RDD. It is built as a result of applying transformations to the RDD, and it creates a logical execution plan. A lineage graph is generated as a result of an ''Action''; actions are an essential property of Spark that we will cover shortly. The following RDD graph shows the result of the transformations below.
[[File: rdd-lineage-graph.png|alt=RDD GRaph|link=https://gpu621.nickscherman.com/assets/images/rdd-lineage-graph.png]]
<syntaxhighlight lang="scala">
val r00 = sc.parallelize(0 to 9)        // parent RDD holding the integers 0 to 9
val r01 = sc.parallelize(0 to 90 by 10) // parent RDD holding multiples of 10
val r10 = r00 cartesian r01             // transformation deriving a child RDD from both parents
</syntaxhighlight>
[https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-lineage.html Source ]
==== D (Distributed) ====
[[File: spark-distribution.png|alt=Spark Distribution|link=https://gpu621.nickscherman.com/assets/images/spark-distribution.png|600px]]
Distributed describes how data resides on multiple nodes in a cluster, across a network of machines. RDD data can be read from and written to distributed storage such as HDFS or S3 and, most importantly, can be cached in the memory of worker nodes for immediate reuse. Spark is designed as a framework that operates over a network infrastructure, so tasks are divided and executed across multiple nodes in a Spark context.
==== D (Dataset) ====
[[File: partition-stages.png|alt=RDD Dataset|link=https://gpu621.nickscherman.com/assets/images/partition-stages.png|600px]]
The RDD dataset is a collection of automatically partitioned data. Spark manages data using partitions, which helps parallelize distributed data processing with minimal network traffic for sending data between executors. By default, Spark tries to read data into an RDD from the nodes that are close to it. Since Spark usually accesses distributed partitioned data, it creates partitions to hold the data chunks in order to optimize transformation operations.
[https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-partitions.html Source]
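To make partitioning concrete, here is a small sketch in PySpark (the <code>local[4]</code> context and the partition count of 8 are arbitrary choices for illustration):
<syntaxhighlight lang="python">
from pyspark import SparkContext

sc = SparkContext("local[4]", "PartitionDemo")

# Explicitly split the dataset into 8 partitions.
rdd = sc.parallelize(range(1000), 8)
print(rdd.getNumPartitions())          # -> 8

# glom() turns each partition into a list, exposing partition boundaries.
print(rdd.glom().map(len).collect())   # -> [125, 125, 125, 125, 125, 125, 125, 125]
sc.stop()
</syntaxhighlight>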
=== RDD Essential Properties ===
An RDD has three essential properties and two optional properties:
# A list of parent RDDs: the dependencies the RDD relies on for its records.
# An array of partitions that the dataset is divided into.
# A compute function to perform a computation on partitions.
# An optional partitioner that defines how keys are hashed and how key-value pairs are partitioned (for key-value RDDs); see the sketch after this list.
# Optional preferred locations (aka locality info), i.e. hosts for a partition where the data has been loaded.
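As a quick illustration of the optional partitioner property, a sketch in PySpark (the partition count of 2 and the sample pairs are illustrative):
<syntaxhighlight lang="python">
from pyspark import SparkContext

sc = SparkContext("local[2]", "PartitionerDemo")

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)])

# partitionBy() attaches a hash partitioner: all pairs sharing a key
# are routed to the same partition.
by_key = pairs.partitionBy(2)
print(by_key.glom().collect())   # pairs grouped by hashed key per partition
sc.stop()
</syntaxhighlight>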
=== RDD Functions ===
RDD supports two kinds of operations: ''Transformations'' and ''Actions''.
The essential idea is that the programmer specifies a transformation, or a series of transformations, to perform on a data set, and finally performs an action that returns new data resulting from the transformations. The resulting data can then be used for analysis or for further transformations. Transformations can be thought of as the start of a parallel region of code, and the action as the end of the parallel region. Everything in Spark is designed to be as simple as possible, so partitions, threads, etc. are generated automatically.
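A small sketch of the transformation/action split in PySpark (names and data are illustrative):
<syntaxhighlight lang="python">
from pyspark import SparkContext

sc = SparkContext("local[2]", "TransformVsAction")

nums = sc.parallelize(range(10))

# Transformations are lazy: they only extend the lineage graph.
evens   = nums.filter(lambda n: n % 2 == 0)
squares = evens.map(lambda n: n * n)

# Actions trigger the actual distributed computation and return results
# to the driver program.
print(squares.collect())                    # [0, 4, 16, 36, 64]
print(squares.reduce(lambda a, b: a + b))   # 120
sc.stop()
</syntaxhighlight>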
=== Advantages ===
The main advantage of Spark is that data partitions are stored in memory, so access to the data is much faster than if it were retrieved from a hard disk. In some cases this is also a disadvantage, since storing large datasets in memory necessitates a large amount of physical memory.
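The in-memory advantage shows up when an RDD is reused across actions. A sketch using <code>cache()</code> (the data here is synthetic):
<syntaxhighlight lang="python">
from pyspark import SparkContext

sc = SparkContext("local[2]", "CacheDemo")

lines = sc.parallelize(["error a", "info b", "error c"] * 1000)
errors = lines.filter(lambda line: line.startswith("error"))

errors.cache()          # keep the computed partitions in executor memory
print(errors.count())   # first action computes the RDD and caches it
print(errors.count())   # second action is served from memory, no recompute
sc.stop()
</syntaxhighlight>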
{| border="1"
|-
| | Apache Spark| Apache Hadoop|-
| Cores
| Memory
| Network
|-
| Apache Spark
| 8-16
| 8-100GB
| 10GB/s
|-
| Apache Hadoop
| 4
| 24GB
[https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/ Source]
[[File: 4.jpg|alt=Logistic Regression Performance|link=http://cdn.edureka.co/blog/wp-content/uploads/2015/12/4.jpg]]
[http://www.edureka.co/blog/apache-spark-vs-hadoop-mapreduce Logistic Regression Performance Comparison]
=== Installation ===
[[File: download-spark.png|alt=Spark Installation|link=https://gpu621.nickscherman.com/assets/images/download-spark.png|600px]]
Spark is available for most UNIX platforms (including OS X) as well as Windows. Windows installation is more difficult since it usually requires building from source. This guide will simply cover installing Spark on Linux. If you want to follow along, you can install Spark on your local Linux laptop or desktop, or use Seneca [https://www.matrix.senecac.on.ca Matrix], as the binaries can be executed in your home directory.