CDOT Wiki β

GPU621/Apache Spark

381 bytes added, 13:45, 30 November 2020
=== Running the Hadoop MapReduce Job in Dataproc ===
 
Now that we have our project code, input files and Dataproc cluster set up, we can proceed to run the Hadoop MapReduce and Spark wordcount jobs.
# Specify Hadoop as the job type.
# Specify the JAR that contains the Hadoop MapReduce algorithm, give three arguments to wordcount, and submit the job.
mapreduce jar:
gs://<myBucketName>/hadoop-mapreduce-examples.jar
3 arguments:
wordcount gs://<myBucketName>/inputFolder gs://<myBucketName>/output
[[File:Dataproc-hadoop-2.jpeg]]
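The same job can also be submitted from a terminal instead of the console form. The following is a minimal sketch using <code>gcloud dataproc jobs submit hadoop</code>, not the method used on this page. The bucket name, cluster name and region are placeholder assumptions (<code>myBucketName</code>, <code>my-cluster</code>, <code>us-central1</code>) and must be replaced with your own values. By default the script only prints the command; set <code>RUN=1</code> to actually submit the job.

```shell
#!/bin/sh
# Placeholder values (assumptions): override them via the environment.
BUCKET="${BUCKET:-myBucketName}"
CLUSTER="${CLUSTER:-my-cluster}"
REGION="${REGION:-us-central1}"

submit_hadoop_wordcount() {
  # Everything after the bare "--" is passed to the JAR as job arguments:
  # the program name (wordcount), the input folder and the output folder.
  set -- gcloud dataproc jobs submit hadoop \
    --cluster="$CLUSTER" \
    --region="$REGION" \
    --jar="gs://$BUCKET/hadoop-mapreduce-examples.jar" \
    -- wordcount "gs://$BUCKET/inputFolder" "gs://$BUCKET/output"

  if [ "${RUN:-0}" = "1" ]; then
    "$@"          # submit the job for real (requires the gcloud CLI)
  else
    echo "$*"     # dry run: print the command that would be executed
  fi
}

submit_hadoop_wordcount
```

Note that a MapReduce job normally fails if the output folder already exists, so choose a new output path (or delete the old one) before resubmitting.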
'''To output the files to a .txt file'''
# Open an SSH session on the master VM node: '''Menu -> Compute -> Compute Engine -> VM Instances -> SSH (of 'm' master node) -> Open in Browser Window'''
# Run the following command in the shell to aggregate the results into an 'output.txt' file:
 gsutil cat gs://rinsereduce/output/* | gsutil cp - gs://rinsereduce/output/output.txt

=== Running the Apache Spark Wordcount Job in Dataproc ===
'''Create and Submit Dataproc Job'''