GPU621/Apache Spark

== Methodology ==
[[File:Google-cloud-dataproc.png]]
# We will use Google Cloud Platform '''Dataproc''' to deploy a six-node virtual machine (VM) cluster (1 master, 5 workers) that is automatically configured for both Hadoop and Spark.
# Use the '''Google Cloud Storage Connector''', which is compatible with the Apache HDFS file system, instead of storing data on the local disks of the VMs.
# Run '''Dataproc''' Hadoop MapReduce and Spark jobs that count the number of words in large text files, and compare the execution times of Hadoop and Spark (a minimal Spark word-count sketch follows this list).
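As an illustration of step 3, here is a minimal Spark word-count job in Scala. The bucket paths and application name are hypothetical placeholders, not part of the course setup; reading <code>gs://</code> paths directly works because the Cloud Storage Connector exposes Cloud Storage through the Hadoop file-system API.

<syntaxhighlight lang="scala">
import org.apache.spark.sql.SparkSession

// Minimal word-count sketch; the gs:// paths below are hypothetical placeholders.
object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("WordCount").getOrCreate()
    val sc = spark.sparkContext

    // The Cloud Storage Connector lets Spark read gs:// paths as if they were HDFS.
    val counts = sc.textFile("gs://your-bucket/input/*.txt")
      .flatMap(_.split("\\s+"))     // split each line into words
      .filter(_.nonEmpty)           // drop empty tokens
      .map(word => (word, 1))       // pair each word with a count of 1
      .reduceByKey(_ + _)           // sum the counts per word across the cluster

    counts.saveAsTextFile("gs://your-bucket/output/wordcount")
    spark.stop()
  }
}
</syntaxhighlight>

A job like this could be packaged as a JAR and submitted to the cluster with <code>gcloud dataproc jobs submit spark</code>; the equivalent Hadoop MapReduce word count would be submitted as a Hadoop job and timed the same way for the comparison.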
=== Setup ===