Revision as of 13:34, 30 November 2020

Group Members

Akhil Balachandran
Daniel Park

Project Description

vs

MapReduce was famously used by Google to process massive data sets in parallel on a distributed cluster in order to index the web for accurate and efficient search results. Apache Hadoop, the open-source platform inspired by Google’s early proprietary technology has been one of the most popular big data processing frameworks. However, in recent years its usage has been declining in favor of other increasingly popular technologies, namely Apache Spark.

This project will focus on demonstrating how a particular use case performs in Apache Hadoop versus Apache spark, and how this relates to the rising and waning adoption of Spark and Hadoop respectively. It will compare the advantages of Apache Hadoop versus Apache Spark for certain big data applications.

Apache Hadoop

Apache Hadoop is an open-source framework that allows for the storage and distributed processing of large data sets across clusters of computers using simple programming models. Hadoop is an implementation of MapReduce, an application programming model developed by Google. MapReduce has three basic operations: Map, Shuffle and Reduce. Map, where each worker node applies a map function to the local data and writes the output to temporary storage. Shuffle, where worker nodes redistribute data based on output keys such that all data belonging to one key is located on the same worker node. Finally reduce, where each worker node processes each group of output in parallel.

Architecture

Hadoop has a master-slave architecture as shown in figure 3.1. A small Hadoop cluster consists of a single master and multiple worker nodes. The master node consists of a Job Tracker, Task Tracker, NameNode, and DataNode. A worker node acts as both a task tracker and a DataNode. A file on HDFS is split into multiple blocks and each block is replicated within the Hadoop cluster. NameNode is the master server while the DataNodes store and maintain the blocks. The DataNodes are responsible for retrieving the blocks when requested by the NameNode. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

3.1 A multi-node Hadoop Cluster

Components

Hadoop Common

The set of common libraries and utilities that other modules depend on. It is also known as Hadoop Core as it provides support for all other Hadoop components.

Hadoop Distributed File System (HDFS)

This is the file system that manages the storage of large sets of data across a Hadoop cluster. HDFS can handle both structured and unstructured data. The storage hardware can range from any consumer-grade HDDs to enterprise drives.

Hadoop YARN

YARN (Yet Another Resource Negotiator) is responsible for managing computing resources and job scheduling.

Hadoop MapReduce

The processing component of Hadoop ecosystem. It assigns the data fragments from the HDFS to separate map tasks in the cluster and processes the chunks in parallel to combine the pieces into the desired result.

Applications

In the healthcare sector for curing diseases, reducing medical costs, predicting and managing epidemics, and maintaining the quality of human life by keeping track of large-scale health index and metrics.
In the financial industry to assess risk, build investment models, and create trading algorithms
In the retail industry to analyze structured and unstructured data to understand and serve their customers, fraud detection and prevention, inventory forecasting, etc.
In the telecom industry to store massive volumes of communications data which provides certain benefits like network traffic analytics, infrastructure planning, etc.

Apache Spark

Apache Spark is a unified analytics engine for large-scale data processing. It is an open-source, general-purpose cluster-computing framework that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Since its inception, Spark has become one of the biggest big data distributed processing frameworks in the world. It can be deployed in a variety of ways, provides high-level APIs in Java, Scala, Python, and R programming languages, and supports SQL, streaming data, machine learning, and graph processing.

Architecture

One of the distinguishing features of Spark is that it processes data in RAM using a concept known as Resilient Distributed Datasets (RDDs) - an immutable distributed collection of objects which can contain any type of Python, Java, or Scala objects, including user-defined classes. Each dataset is divided into logical partitions which may be computed on different nodes of the cluster. Spark's RDDs function as a working set for distributed programs that offer a restricted form of distributed shared memory.

4.1 Spark Cluster components

At a fundamental level, an Apache Spark application consists of two main components: a driver, which converts the user's code into multiple tasks that can be distributed across worker nodes, and executors, which run on those nodes and execute the tasks assigned to them. The processes are coordinated by the SparkContext object in the driver program. The SparkContext can connect to several types of cluster managers which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for the application. Next, it sends the application code to the executors and finally sends tasks to the executors to run.

Components

Spark Core

Spark Core is the basic building block of Spark, which includes all components for job scheduling, performing various memory operations, fault tolerance, task dispatching, basic input/output functionalities, etc.

Spark Streaming

Spart Streaming processes live streams of data. Data generated by various sources is processed at the very instant by Spark Streaming. Data can originate from different sources including Kafka, Kinesis, Flume, Twitter, ZeroMQ, TCP/IP sockets, etc.

Spark SQL

Spark SQL is a component on top of Spark Core that introduced a data abstraction called DataFrames, which provides support for structured and semi-structured data. Spark SQL allows querying data via SQL, as well as via Apache Hive's form of SQL called Hive Query Language (HQL). It also supports data from various sources like parse tables, log files, JSON, etc. Spark SQL allows programmers to combine SQL queries with programmable changes or manipulations supported by RDD in Python, Java, Scala, and R.

GraphX

GraphX is Spark's library for enhancing graphs and enabling graph-parallel computation. It is a distributed graph-processing framework built on top of Spark. Apache Spark includes a number of graph algorithms that help users in simplifying graph analytics.

MLlib (Machine Learning Library)

Spark MLlib is a distributed machine-learning framework on top of Spark Core. It provides various types of ML algorithms including regression, clustering, and classification, which can perform various operations on data to get meaningful insights out of it.

Overview: Spark vs Hadoop

Advantage and Disadvantages

Parallelism

Performance

Spark vs Hadoop Wordcount Performance

Methodology

Hadoop and Spark clusters can be deployed in cloud environments such as the Google Cloud Platform or Amazon EMR. The clusters are managed, scalable, and pay-per-usage and comparatively easier to setup and manage versus setting up a cluster locally on commodity hardware. We will use the Google Cloud Platform managed service to run experiments and observe possible expected performance differences between Hadoop and Spark.

We will use the Google Cloud Platform Dataproc to deploy a 6 virtual machine (VM) nodes (1 master, 5 workers) cluster that is automatically configured for both Hadoop and Spark.
Use Google Cloud Storage Connector which is compatible with Apache HDFS file system, instead of storing data on local disks of VMs.
Store .jar and .py wordcount files and input data in the Cloud Storage Bucket
Run a Dataproc Hadoop MapReduce and Spark jobs to count number of words in large text files and compare the performance between Hadoop and Spark in execution time.

Setting up Dataproc and Google Cloud Storage

Using a registered Google account navigate to the Google Cloud Console https://console.cloud.google.com/ and activate the free-trial credits.

Create a new project by clicking the project link in the GCP console header. A default project of 'My First Project' is created by default

Once you are registered create the data cluster of master and slave nodes These nodes will come pre-configured with Apache Hadoop and Spark components.

Go to Menu -> Big Data -> DataProc -> Clusters

We will create 5 worker nodes and 1 master node using the N1 series General-Purpose machine with 4vCPU and 15 GB memory and a disk size of 32-50 GB for all nodes. You can see the cost of your machine configuration per hour. Using machines with more memory, computing power, etc will cost more per hourly use.

Allow API access to all google Cloud services in the project.

To view the individual nodes in the cluster go to Menu -> Virtual Machines -> VM Instances

Ensure that Dataproc, Compute Engine, and Cloud Storage APIs are all enabled

Go to Menu -> API & Services -> Library.
Search for the API name and enable them if they are not already enabled.

'Create a Cloud Storage Bucket by going from Menu -> Storage -> Browser -> Create Bucket' Make a note of the bucket name.

Copy the Hadoop wordcount example available on every Dataproc cluster, from Master node VM to our Cloud Storage bucket

Open Secure Shell (SSH) from VM Instances list: Menu -> Compute -> Compute Engine.
To copy from the VM local disk to Cloud Storage bucket enter the following command in the shell:

gsutil cp /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar gs://<myBucketName>/

Save the Spark wordcount example into the Cloud Storage bucket by dragging and dropping it into the storage browswer

To open Browser: Menu -> Storage -> Browser
Drag and drop the below word-count.py into the browser, or use 'UPLOAD FILES' to upload.

# word-count.py
#!/usr/bin/env python
import pyspark
import sys
if len(sys.argv) != 3:
 raise Exception("Exactly 2 arguments are required: <inputUri> <outputUri>")
inputUri=sys.argv[1]
outputUri=sys.argv[2]
sc = pyspark.SparkContext()
lines = sc.textFile(sys.argv[1])
words = lines.flatMap(lambda line: line.split())
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda count1, count2: count1 + count2)
wordCounts.saveAsTextFile(sys.argv[2])

Finally, add the input files containing the text the word count jobs will be processing

Go to Cloud Storage Bucket: Menu -> Storage -> Browser
Create a new folder 'input' and open it
Drag and Drop input files, or use 'UPLOAD FILES' or 'UPLOAD FOLDER'

For this analysis we are using archive text files of walkthroughs from https://gamefaqs.gamespot.com/ The files range in size from 4MB to 2.8GB for a total size of 7.7 GB of plain text.

Running the Jobs in Dataproc

Now that we have our project code, input files and Dataproc cluster setup we can proceed to run the Hadoop MapReduce and Spark wordcount jobs.

Run the Hadoop MapReduce Job

Go to Menu -> Big Data -> Dataproc -> Jobs
Select 'SUBMIT JOB' and name your job ID
Choose Region that the cluster was created on
Select your cluster
Specify Hadoop as Job Type
Specify JAR which contains the Hadoop MapReduce algorithm, give 3 arguments to wordcount, and submit job.

     gs://<myBucketName>/hadoop-mapreduce-examples.jar

     wordcount gs://<myBucketName>/inputFolder gs://<myBucketName>output

note: Running the job will create the output folder, However for subsequent jobs be sure to delete the output folder else Hadoop or Spark will not run. This limitation is done to prevent existing output from being overwritten

Retrieve the Results

You can observe the progress of each of the map and reduce jobs in the Job output console. When the jobs have completed and all the input files have been processed, Hadoop provides counters, statistics on the executed Job Also you can navigate back the the Jobs tab to see the total Elapsed Time of the job.

Some counters of note:

number of splits: splits from all input data. A split is the amount of data in one map task (Hadoop default block size is 128 MB)
Launched map tasks: number of total map tasks. Note that it matches the number of splits
Launched reduce tasks: number of total reduce tasks
GS: Number of bytes read: the total number of bytes read from Google Cloud Storage by both map and reduce tasks
GS: Number of bytes written: the total number of bytes writtenfrom Google Cloud Storage by both map and reduce tasks
Map input records: number of records (words) processed by all map tasks
Reduce output records: number of records (word) output by all reduce tasks.
CPU time spent (ms): total CPU processing time by all tasks

@@ Line 33: / Line 33: @@
 == Applications ==
-- In the healthcare sector for curing diseases, reducing medical costs, predicting and managing epidemics, and maintaining the quality of human life by keeping track of large-scale health index and metrics.
+* In the healthcare sector for curing diseases, reducing medical costs, predicting and managing epidemics, and maintaining the quality of human life by keeping track of large-scale health index and metrics.
-- In the financial industry to assess risk, build investment models, and create trading algorithms
+* In the financial industry to assess risk, build investment models, and create trading algorithms
-- In the retail industry to analyze structured and unstructured data to understand and serve their customers, fraud detection and prevention, inventory forecasting, etc.
+* In the retail industry to analyze structured and unstructured data to understand and serve their customers, fraud detection and prevention, inventory forecasting, etc.
-- In the telecom industry to store massive volumes of communications data which provides certain benefits like network traffic analytics, infrastructure planning, etc.
+* In the telecom industry to store massive volumes of communications data which provides certain benefits like network traffic analytics, infrastructure planning, etc.
 = Apache Spark =

Difference between revisions of "GPU621/Apache Spark"

Revision as of 13:34, 30 November 2020

Contents

Group Members

Project Description

Apache Hadoop

Architecture

Components

Hadoop Common

Hadoop Distributed File System (HDFS)

Hadoop YARN

Hadoop MapReduce

Applications

Apache Spark

Architecture

Components

Spark Core

Spark Streaming

Spark SQL

GraphX

MLlib (Machine Learning Library)

Overview: Spark vs Hadoop

Advantage and Disadvantages

Parallelism

Performance

Spark vs Hadoop Wordcount Performance

Methodology

Setting up Dataproc and Google Cloud Storage

Running the Jobs in Dataproc

Results

Conclusion

Progress

References

Navigation menu

Search