Open main menu

CDOT Wiki β

Changes

GPU621/Apache Spark

208 bytes removed, 15:48, 30 November 2020
m
no edit summary
= Apache Hadoop =
[https://hadoop.apache.org/ ''' Apache Hadoop'''] is an open-source framework that allows for the storage and distributed processing of large data sets across clusters of computers using simple programming models. Hadoop is an implementation of MapReduce, an application programming model developed by Google. Hadoop is built in Java MapReduce has three basic operations: Map, Shuffle and accessible through many languages for writing MapReduce code including Python through Reduce. Map, where each worker node applies a Thrift clientmap function to the local data and writes the output to temporary storage. Hadoop can process both structured and unstructured Shuffle, where worker nodes redistribute data based on output keys such that all databelonging to one key is located on the same worker node. Finally reduce, and scale up reliably from a single server to thousands where each worker node processes each group of machinesoutput in parallel.
== Architecture ==
=== Hadoop MapReduce ===
The processing component of Hadoop ecosystem. It assigns the data fragments from the HDFS to separate map tasks in the cluster and processes the chunks in parallel to combine the pieces into the desired result. MapReduce has three basic operations: Map, Shuffle and Reduce. Map, where each worker node applies a map function to the local data and writes the output to temporary storage. Shuffle, where worker nodes redistribute data based on output keys such that all data belonging to one key is located on the same worker node. Finally reduce, where each worker node processes each group of output in parallel.
== Applications ==
== Architecture ==
One of the distinguishing features of Spark is that it processes data in RAM using a concept known as Resilient Distributed Datasets (RDDs) - an immutable distributed collection of objects which can contain any type of Python, Java, or Scala objects, including user-defined classes. Each dataset is divided into logical partitions which may be computed on different nodes of the cluster. Spark's RDDs function as a working set for distributed programs that offer a restricted form of distributed shared memory. Another important abstraction in the Spark architecture is the Directed Acyclic Graph or DAG. DAG is the scheduling layer of Spark's architecture that implements stage-oriented scheduling. The DAG abstraction helps eliminate the Hadoop MapReduce multi-stage execution model and provides performance enhancements over Hadoop.
[[File: Cluster-overview.png|thumb|upright=1.1|right|alt=Spark cluster|4.1 Spark Cluster components]]
=== Results ===
==== h4 Hadoop Counters ====* Number of splits: 66* Total input files to process: 8* GS: Number of MB read: 8,291* GS: Number of MB written: 138* Launched map tasks: 66* Launched reduce tasks: 19* Map input records (millions): 191.1* Map output records (millions): 1,237.1* Reduce input records (millions): 9.3* Reduce output records (millions): 3.6* CPU time spent (s): 3,597  
=== Conclusion ===
 
 
== Progress ==
76
edits