= Apache Hadoop =
[https://hadoop.apache.org/ '''Apache Hadoop'''] is an open-source framework that allows for the storage and distributed processing of large data sets across clusters of computers using simple programming models. Hadoop is an implementation of MapReduce, a programming model developed by Google. Hadoop is built in Java and accessible through many languages for writing MapReduce code, including Python through a Thrift client. Hadoop can process both structured and unstructured data, and scales up reliably from a single server to thousands of machines.
== Architecture ==
=== Hadoop MapReduce ===
Hadoop MapReduce is the processing component of the Hadoop ecosystem. It assigns data fragments from HDFS to separate map tasks in the cluster and processes the chunks in parallel, combining the pieces into the desired result. MapReduce has three basic operations: Map, Shuffle, and Reduce. In the Map step, each worker node applies a map function to its local data and writes the output to temporary storage. In the Shuffle step, worker nodes redistribute data based on output keys so that all data belonging to one key ends up on the same worker node. Finally, in the Reduce step, each worker node processes its group of output data in parallel.
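The classic WordCount job illustrates these three steps. The sketch below is written against Hadoop's Java MapReduce API; the class names and the input/output paths passed on the command line are illustrative assumptions, not part of this project.

<syntaxhighlight lang="java">
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every token in the node's local input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: after the shuffle has grouped all pairs for a word on one node,
  // sum the counts to produce the final result for that word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory (placeholder)
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (placeholder)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
</syntaxhighlight>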
== Applications ==
= Apache Spark =
== Architecture ==
One of the distinguishing features of Spark is that it processes data in RAM using a concept known as Resilient Distributed Datasets (RDDs): immutable distributed collections of objects that can contain any type of Python, Java, or Scala objects, including user-defined classes. Each dataset is divided into logical partitions, which may be computed on different nodes of the cluster. Spark's RDDs function as a working set for distributed programs and offer a restricted form of distributed shared memory. Another important abstraction in the Spark architecture is the Directed Acyclic Graph (DAG), the scheduling layer of Spark's architecture, which implements stage-oriented scheduling. The DAG abstraction generalizes the rigid map-and-reduce execution model of Hadoop MapReduce and provides performance enhancements over Hadoop.
[[File: Cluster-overview.png|thumb|upright=1.1|right|alt=Spark cluster|4.1 Spark Cluster components]]
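As a concrete sketch of how RDD transformations only build up the DAG, which is executed when an action runs, the example below uses the Spark 2.x+ Java RDD API; the application name, local master, and input file name are assumptions made for illustration.

<syntaxhighlight lang="java">
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class RddSketch {
  public static void main(String[] args) {
    // Application name, local master, and input path are placeholders.
    SparkConf conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {

      // Each transformation below returns a new immutable, partitioned RDD
      // and only extends the lineage DAG; nothing executes yet.
      JavaRDD<String> lines = sc.textFile("input.txt");
      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
          .mapToPair(word -> new Tuple2<>(word, 1))
          .reduceByKey(Integer::sum);

      // collect() is an action: the DAG scheduler splits the lineage into
      // stages and runs them, keeping intermediate data in memory.
      counts.collect().forEach(t -> System.out.println(t._1() + "\t" + t._2()));
    }
  }
}
</syntaxhighlight>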
=== Results ===
==== Hadoop Counters ====
* Number of splits: 66
* Total input files to process: 8
* GS: Number of MB read: 8,291
* GS: Number of MB written: 138
* Launched map tasks: 66
* Launched reduce tasks: 19
* Map input records (millions): 191.1
* Map output records (millions): 1,237.1
* Reduce input records (millions): 9.3
* Reduce output records (millions): 3.6
* CPU time spent (s): 3,597
=== Conclusion ===
== Progress ==