GPU621/Apache Spark

== Methodology ==
# Use the Google Cloud Platform '''Dataproc''' service to deploy a six-node virtual machine (VM) cluster (1 master, 5 workers) that comes automatically configured for both Hadoop and Spark (a cluster-creation sketch follows this list).
# Use the '''Google Cloud Storage Connector''', which is compatible with the Apache Hadoop Distributed File System (HDFS), instead of storing the data on the local disks of the VMs.
# Store the .jar and .py wordcount files and the input data in a '''Cloud Storage Bucket'''; the jobs read their input directly from gs:// paths, as shown in the word-count sketch below.
# Run '''Dataproc''' Hadoop MapReduce and Spark jobs that count the number of words in large text files, and compare the execution times of Hadoop and Spark (see the job-submission sketch below).
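The cluster can be created from the Cloud Console, the gcloud CLI, or programmatically. Below is a minimal sketch using the google-cloud-dataproc Python client; the project ID, region, cluster name, and machine types are placeholders, not values from this project.

<source lang="python">
# Sketch: create the 1-master / 5-worker Dataproc cluster with the
# google-cloud-dataproc client library. All identifiers are hypothetical.
from google.cloud import dataproc_v1

PROJECT_ID = "my-gcp-project"        # hypothetical project ID
REGION = "us-central1"               # hypothetical region
CLUSTER_NAME = "wordcount-cluster"   # hypothetical cluster name


def create_cluster():
    # The client must point at the regional Dataproc endpoint.
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
    )
    cluster = {
        "project_id": PROJECT_ID,
        "cluster_name": CLUSTER_NAME,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 5, "machine_type_uri": "n1-standard-4"},
        },
    }
    # create_cluster returns a long-running operation; result() blocks until
    # the cluster is ready.
    operation = client.create_cluster(
        request={"project_id": PROJECT_ID, "region": REGION, "cluster": cluster}
    )
    print(f"Cluster created: {operation.result().cluster_name}")


if __name__ == "__main__":
    create_cluster()
</source>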
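Because Dataproc clusters ship with the Cloud Storage connector, a Spark job can read and write gs:// paths directly, the same way it would use HDFS paths. The following PySpark word count is a minimal sketch assuming a hypothetical bucket and input file name.

<source lang="python">
# Sketch: PySpark word count reading its input from Cloud Storage.
# Bucket and file names are hypothetical.
import sys
from pyspark.sql import SparkSession


def main(input_path, output_path):
    spark = SparkSession.builder.appName("wordcount").getOrCreate()
    lines = spark.sparkContext.textFile(input_path)
    counts = (
        lines.flatMap(lambda line: line.split())   # split each line into words
             .map(lambda word: (word, 1))          # pair each word with a count of 1
             .reduceByKey(lambda a, b: a + b)      # sum the counts per word
    )
    counts.saveAsTextFile(output_path)
    spark.stop()


if __name__ == "__main__":
    # e.g. gs://my-wordcount-bucket/input/big.txt gs://my-wordcount-bucket/output
    main(sys.argv[1], sys.argv[2])
</source>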
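To compare the two frameworks on the same cluster, both jobs can be submitted through the Dataproc job API and timed from submission to completion. The sketch below uses the google-cloud-dataproc Python client; the jar and script URIs and the bucket name are hypothetical, and Dataproc's own job details page also reports each job's elapsed time.

<source lang="python">
# Sketch: submit the Hadoop MapReduce and Spark word-count jobs and compare
# wall-clock times. All file URIs and names are hypothetical.
import time
from google.cloud import dataproc_v1

PROJECT_ID = "my-gcp-project"
REGION = "us-central1"
CLUSTER_NAME = "wordcount-cluster"
BUCKET = "gs://my-wordcount-bucket"   # hypothetical bucket


def run_job(client, job):
    start = time.time()
    # submit_job_as_operation returns a long-running operation; result() waits
    # for the job to finish on the cluster.
    operation = client.submit_job_as_operation(
        request={"project_id": PROJECT_ID, "region": REGION, "job": job}
    )
    operation.result()
    return time.time() - start


def main():
    client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
    )
    hadoop_job = {
        "placement": {"cluster_name": CLUSTER_NAME},
        "hadoop_job": {
            "main_jar_file_uri": f"{BUCKET}/wordcount.jar",    # hypothetical jar
            "args": [f"{BUCKET}/input/big.txt", f"{BUCKET}/output-hadoop"],
        },
    }
    spark_job = {
        "placement": {"cluster_name": CLUSTER_NAME},
        "pyspark_job": {
            "main_python_file_uri": f"{BUCKET}/wordcount.py",  # hypothetical script
            "args": [f"{BUCKET}/input/big.txt", f"{BUCKET}/output-spark"],
        },
    }
    print(f"Hadoop MapReduce: {run_job(client, hadoop_job):.1f} s")
    print(f"Spark:            {run_job(client, spark_job):.1f} s")


if __name__ == "__main__":
    main()
</source>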