=== Running the Hadoop MapReduce Wordcount Job in Dataproc ===
Now that we have our project code, input files, and Dataproc cluster set up, we can proceed to run the Hadoop MapReduce and Spark wordcount jobs.
[[File:Dataproc-hadoop-2.jpeg]]
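For readers who prefer automation over the Cloud Console, the same Hadoop wordcount job can also be submitted programmatically. The following is only a minimal sketch using the google-cloud-dataproc Python client; the project, region, cluster, and bucket names are placeholders, and passing "wordcount" as the first argument reflects how the examples jar selects which example program to run:

<syntaxhighlight lang="python">
from google.cloud import dataproc_v1 as dataproc

PROJECT_ID = "my-project"      # placeholder: your GCP project ID
REGION = "us-central1"         # placeholder: region the cluster was created in
CLUSTER_NAME = "my-cluster"    # placeholder: your Dataproc cluster name
BUCKET = "myBucketName"        # placeholder: your Cloud Storage bucket

# Client must point at the regional Dataproc endpoint
job_client = dataproc.JobControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": CLUSTER_NAME},
    "hadoop_job": {
        "main_jar_file_uri": f"gs://{BUCKET}/hadoop-mapreduce-examples.jar",
        # The examples jar dispatches on its first argument ("wordcount"),
        # followed by the input folder and a not-yet-existing output folder.
        "args": [
            "wordcount",
            f"gs://{BUCKET}/inputFolder",
            f"gs://{BUCKET}/output",
        ],
    },
}

# Submit the job and block until it finishes
operation = job_client.submit_job_as_operation(
    request={"project_id": PROJECT_ID, "region": REGION, "job": job}
)
response = operation.result()
print("Job finished. Driver output:", response.driver_output_resource_uri)
</syntaxhighlight>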
=== Running the Apache Spark Wordcount Job in Dataproc ===
'''Create and Submit Dataproc Job'''
# Go to '''Menu -> Big Data -> Dataproc -> Jobs'''
# Select 'SUBMIT JOB' and give the job an ID
# Choose the region in which the cluster was created
# Select your cluster
# Specify PySpark as the Job Type
# Specify the .py file that contains the Apache Spark wordcount algorithm (a sketch of such a script is shown at the end of this section)
# Pass 2 arguments to word-count.py: the input folder and the output folder
word-count.py:
gs://<myBucketName>/word-count.py
2 arguments:
gs://<myBucketName>/inputFolder gs://<myBucketName>/output
'''Note: running the job will create the output folder. For subsequent jobs, be sure to delete the output folder first, otherwise Hadoop or Spark will refuse to run. This limitation exists to prevent existing output from being overwritten.'''
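As a point of reference, a PySpark wordcount script of this shape might look as follows. This is only a minimal sketch assuming the classic RDD API, not necessarily the exact script used in the benchmark; it takes the input and output folders as its two arguments:

<syntaxhighlight lang="python">
import sys
from pyspark import SparkContext

def main():
    # Input and output GCS folders are passed as the two job arguments
    input_path, output_path = sys.argv[1], sys.argv[2]

    sc = SparkContext(appName="word-count")

    counts = (
        sc.textFile(input_path)                 # read all files in the input folder
          .flatMap(lambda line: line.split())   # split each line into words
          .map(lambda word: (word, 1))          # pair each word with a count of 1
          .reduceByKey(lambda a, b: a + b)      # sum the counts per word
    )

    # The output folder must not already exist (see the note above)
    counts.saveAsTextFile(output_path)
    sc.stop()

if __name__ == "__main__":
    main()
</syntaxhighlight>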
=== Results ===