
GPU621/Apache Spark

1,551 bytes added, 10:32, 30 November 2020
= Apache Hadoop =

== Introduction ==
[https://hadoop.apache.org/ '''Apache Hadoop'''] is an open-source framework for the storage and distributed processing of large data sets across clusters of computers using simple programming models. Hadoop is an implementation of MapReduce, a programming model developed by Google. MapReduce has three basic operations: Map, Shuffle, and Reduce. In the Map phase, each worker node applies a map function to its local data and writes the output to temporary storage. In the Shuffle phase, worker nodes redistribute data based on output keys so that all data belonging to one key ends up on the same worker node. Finally, in the Reduce phase, each worker node processes its group of output data in parallel.

== Architecture ==

== Components ==

=== Hadoop Common ===
The set of common libraries and utilities that the other modules depend on. It is also known as Hadoop Core, as it provides support for all other Hadoop components.

=== Hadoop Distributed File System (HDFS) ===
The file system that manages the storage of large data sets across a Hadoop cluster. HDFS can handle both structured and unstructured data, and the storage hardware can range from consumer-grade HDDs to enterprise drives.

=== Hadoop YARN ===
YARN (Yet Another Resource Negotiator) is responsible for managing computing resources and job scheduling.

=== Hadoop MapReduce ===
The processing component of the Hadoop ecosystem. It assigns data fragments from HDFS to separate map tasks in the cluster and processes the chunks in parallel, combining the pieces into the desired result.

== Applications ==
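The canonical MapReduce application is word counting. As a minimal illustration of the three phases described above (an in-memory Python sketch of the model only, not the real Hadoop API, which is Java-based and runs across a cluster):

```python
from collections import defaultdict
from itertools import chain

def map_phase(documents):
    """Map: each 'worker' emits (key, value) pairs from its local data."""
    return [[(word, 1) for word in doc.split()] for doc in documents]

def shuffle_phase(mapped):
    """Shuffle: redistribute pairs so all values for one key are grouped together."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped):
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: each group of values is processed independently (here: summed)."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["hadoop stores big data", "spark and hadoop process big data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["hadoop"], counts["big"])  # 2 2
```

Because the shuffle guarantees that all pairs for a given key land in the same group, each reduce task can run independently of the others, which is what lets Hadoop spread the work across many nodes.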
= Apache Spark =
