GPU621/Apache Spark

== Spark vs Hadoop Wordcount Performance ==
[[File:Googlecloud-setup-11b.jpg]]
 
'''Ensure that Dataproc, Compute Engine, and Cloud Storage APIs are all enabled'''
# Search for each API by name and enable it if it is not already enabled.
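The APIs can also be enabled from the Cloud Shell instead of the console; a minimal sketch, assuming the standard service names for Dataproc, Compute Engine, and Cloud Storage:
<Code> gcloud services enable dataproc.googleapis.com compute.googleapis.com storage.googleapis.com </Code>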
'''Create a Cloud Storage bucket by going to Menu -> Storage -> Browser -> Create Bucket'''
# Make a note of the bucket name; it is referred to below as <myBucketName>.
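The bucket can also be created from the Cloud Shell; a sketch, where <myBucketName> is a placeholder (bucket names must be globally unique):
<Code> gsutil mb gs://<myBucketName>/ </Code>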
 
'''Copy the Hadoop wordcount example, available on every Dataproc cluster, from the Master node VM to our Cloud Storage bucket'''
# To copy from the VM's local disk to the Cloud Storage bucket, enter the following command in the shell:
<Code> gsutil cp /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar gs://<myBucketName>/ </Code>
 
'''Save the Spark wordcount example into the Cloud Storage bucket by dragging and dropping it into the storage browser'''
# To open Browser: '''Menu -> Storage -> Browser'''
# Drag and drop the below word-count.py into the browser, or use 'UPLOAD FILES' to upload.
 
# word-count.py
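The word-count.py used in this walkthrough is not reproduced on this page; the sketch below is a minimal PySpark word count along the same lines (the app name and argument handling are assumptions, with the input and output paths passed in as job arguments):
<Code>
# word-count.py - minimal PySpark word count sketch
import sys
from operator import add
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Usage: word-count.py <inputPath> <outputPath>
    spark = SparkSession.builder.appName("WordCount").getOrCreate()

    # Read all text files under the input path as an RDD of lines
    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])

    # Split each line into words, then count occurrences of each word
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(add))

    # Write the (word, count) pairs to the output folder
    counts.saveAsTextFile(sys.argv[2])
    spark.stop()
</Code>
When submitted as a PySpark job, the two paths can point at the same gs://<myBucketName>/inputFolder and output locations used for the Hadoop job.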
=== Running the Jobs in Dataproc ===
 
Now that we have our project code, input files, and Dataproc cluster set up, we can proceed to run the Hadoop MapReduce and Spark wordcount jobs.
 
 
'''Run the Hadoop MapReduce Job'''
# Go to '''Menu -> Big Data -> Dataproc -> Jobs'''
# Select 'SUBMIT JOB' and name your job ID
# Choose Region that the cluster was created on
# Select your cluster
# Specify Hadoop as Job Type
# Specify the JAR file which contains the Hadoop MapReduce wordcount algorithm
** gs://<myBucketName>/hadoop-mapreduce-examples.jar
# Pass 3 arguments to the MapReduce program
** wordcount gs://<myBucketName>/inputFolder gs://<myBucketName>/output
** '''Note: running the job will create the output folder. However, for subsequent jobs, be sure to delete the output folder first, or Hadoop/Spark will not run; this limitation exists to prevent existing output from being overwritten.'''
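The same job can also be submitted from the Cloud Shell rather than the console; a sketch of the equivalent command, where <myClusterName> and <myRegion> are placeholders for the cluster and region chosen above:
<Code> gcloud dataproc jobs submit hadoop --cluster=<myClusterName> --region=<myRegion> --jar=gs://<myBucketName>/hadoop-mapreduce-examples.jar -- wordcount gs://<myBucketName>/inputFolder gs://<myBucketName>/output </Code>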
 
[[File:Dataproc-hadoop.jpg]]
 
 
=== Results ===