GPU621/Apache Spark

== Methodology ==
# We will use Google Cloud Platform '''Dataproc''' to deploy a six-node virtual machine (VM) cluster (1 master, 5 workers) that is automatically configured for both Hadoop and Spark (see the cluster-creation sketch after this list).
# Use the '''Google Cloud Storage Connector''', which is compatible with the Apache HDFS file system, instead of storing data on the local disks of the VMs.
# Store the .jar and .py wordcount files and the input data in a '''Cloud Storage Bucket'''.
# Run '''Dataproc''' Hadoop MapReduce and Spark jobs to count the number of words in large text files, and compare the execution times of Hadoop and Spark (a minimal PySpark wordcount is sketched below).
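
As a rough sketch of step 1, a cluster like this can also be created programmatically with the google-cloud-dataproc Python client rather than through the Cloud Console. The project ID, region, cluster name, and machine types below are placeholders, not the exact configuration used in this experiment.

<pre>
# Sketch: create a 1-master / 5-worker Dataproc cluster with the
# google-cloud-dataproc client (pip install google-cloud-dataproc).
# Project ID, region, cluster name, and machine types are placeholders.
from google.cloud import dataproc_v1

project_id = "my-project"          # placeholder
region = "us-central1"             # placeholder
cluster_name = "wordcount-cluster" # placeholder

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 5, "machine_type_uri": "n1-standard-4"},
    },
}

# create_cluster returns a long-running operation; result() blocks until the
# cluster is ready.
operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
operation.result()
print(f"Cluster {cluster_name} created")
</pre>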
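For step 4, a minimal PySpark wordcount along the following lines could serve as the .py job; the gs:// paths are placeholders passed in as job arguments, and thanks to the Cloud Storage Connector from step 2 they can be read and written exactly like HDFS paths.

<pre>
# Minimal PySpark wordcount sketch for the .py job in steps 3-4.
# The gs:// bucket paths are placeholders supplied as job arguments.
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    input_path, output_path = sys.argv[1], sys.argv[2]

    sc = SparkContext(appName="wordcount")
    counts = (
        sc.textFile(input_path)                # read text from Cloud Storage
          .flatMap(lambda line: line.split())  # split lines into words
          .map(lambda word: (word, 1))         # pair each word with a count of 1
          .reduceByKey(lambda a, b: a + b)     # sum the counts per word
    )
    counts.saveAsTextFile(output_path)         # write results back to the bucket
    sc.stop()
</pre>

The script can then be submitted against the cluster with something like <code>gcloud dataproc jobs submit pyspark gs://&lt;bucket&gt;/wordcount.py --cluster=&lt;cluster&gt; --region=&lt;region&gt; -- gs://&lt;bucket&gt;/input gs://&lt;bucket&gt;/output</code>, and the Hadoop MapReduce counterpart with <code>gcloud dataproc jobs submit hadoop</code>; the bucket, cluster, and region names here are again placeholders.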