GPU621/Apache Spark

== Spark vs Hadoop Wordcount Performance ==
# Store the .jar and .py wordcount files and the input data in the '''Cloud Storage Bucket'''
# Run '''Dataproc''' Hadoop MapReduce and Spark jobs to count the number of words in large text files, and compare the execution times of Hadoop and Spark.
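The wordcount logic that both the Hadoop MapReduce and Spark jobs implement can be sketched locally in plain Python. The stages below deliberately mirror the transformations a PySpark wordcount would distribute across the cluster (roughly `sc.textFile(...).flatMap(split).map(word -> 1).reduceByKey(add)`); the function and variable names are illustrative, not taken from the actual job files.

```python
from collections import Counter
from itertools import chain

def wordcount(lines):
    """Mimic Spark's flatMap -> map -> reduceByKey wordcount stages locally.

    lines: any iterable of text lines; in the real job this would come
    from reading the input files in the Cloud Storage Bucket.
    """
    # flatMap stage: split every line into individual words
    words = chain.from_iterable(line.split() for line in lines)
    # map + reduceByKey stages: tally occurrences of each word
    return Counter(words)

# Example with a small in-memory "file" standing in for the large text input
sample = ["to be or not to be", "that is the question"]
counts = wordcount(sample)
```

On Dataproc the same per-word counting runs in parallel across the VM nodes, which is where the Spark vs Hadoop execution-time difference shows up.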
 
=== What is Dataproc? ===
 
Dataproc is a managed Spark and Hadoop service that automates cluster creation and management tasks. Users can run large-scale data processing with either Spark or Hadoop on the same cluster. Virtual Machine (VM) nodes in the cluster are created in minutes, with every node pre-configured and pre-installed with Hadoop, Spark, and other tools. Usage is charged per virtual CPU per hour, with standard and higher-performance hardware configurations available at different rates.
 
=== Setting up Dataproc and Google Cloud Storage ===