GPU621/Apache Spark

== Spark vs Hadoop Wordcount Performance ==
# Store the .jar and .py wordcount files and the input data in the '''Cloud Storage Bucket'''
# Run '''Dataproc''' Hadoop MapReduce and Spark jobs to count the number of words in large text files, and compare the execution times of Hadoop and Spark.
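The wordcount logic that both the Hadoop MapReduce and Spark jobs implement can be sketched locally in plain Python. The stages below deliberately mirror the transformations a PySpark wordcount would distribute across the cluster (roughly `sc.textFile(...).flatMap(split).map(word -> 1).reduceByKey(add)`); the function and variable names are illustrative, not taken from the actual job files.

```python
from collections import Counter
from itertools import chain

def wordcount(lines):
    """Mimic Spark's flatMap -> map -> reduceByKey wordcount stages locally.

    lines: any iterable of text lines; in the real job this would come
    from reading the input files in the Cloud Storage Bucket.
    """
    # flatMap stage: split every line into individual words
    words = chain.from_iterable(line.split() for line in lines)
    # map + reduceByKey stages: tally occurrences of each word
    return Counter(words)

# Example with a small in-memory "file" standing in for the large text input
sample = ["to be or not to be", "that is the question"]
counts = wordcount(sample)
```

On Dataproc the same per-word counting runs in parallel across the VM nodes, which is where the Spark vs Hadoop execution-time difference shows up.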
 
=== What is Dataproc? ===
 
Dataproc is a managed Spark and Hadoop service that automates cluster creation and management tasks. Users can run large-scale data processing with either Spark or Hadoop on the same cluster. Virtual Machine (VM) nodes in the cluster are created in minutes, with every node pre-configured and pre-installed with Hadoop, Spark, and other tools. Usage is charged per virtual CPU per hour, with standard and higher-performance hardware configurations available at different rates.
 
=== Setting up Dataproc and Google Cloud Storage ===