Changes

GPU621/Apache Spark

381 bytes added, 13:45, 30 November 2020


=== Running the Hadoop MapReduce Job in Dataproc ===

Now that we have our project code, input files, and Dataproc cluster set up, we can proceed to run the Hadoop MapReduce and Spark wordcount jobs.

# Specify Hadoop as the Job Type.
# Specify the JAR that contains the Hadoop MapReduce algorithm, give it the 3 arguments listed below, and submit the job.

mapreduce jar:

gs://<myBucketName>/hadoop-mapreduce-examples.jar

3 arguments:

wordcount gs://<myBucketName>/inputFolder gs://<myBucketName>/output
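The same job can probably also be submitted from a shell with the <code>gcloud</code> CLI instead of the console form. The following is a minimal sketch, not the method used on this page: the cluster name <code>my-cluster</code> and the region <code>us-central1</code> are hypothetical placeholders, <code>myBucketName</code> stands in for your own bucket, and <code>submit_wordcount</code> and <code>RUNNER</code> are helper names introduced here for illustration. By default it only prints the command so it can be reviewed first.

```shell
# Hypothetical placeholders (assumptions): replace with your own values.
BUCKET="${BUCKET:-myBucketName}"
CLUSTER="${CLUSTER:-my-cluster}"
REGION="${REGION:-us-central1}"

submit_wordcount() {
  # RUNNER defaults to "echo", so the command is printed, not executed.
  # Call with RUNNER="" to really submit the job (requires gcloud).
  ${RUNNER-echo} gcloud dataproc jobs submit hadoop \
    --cluster="$CLUSTER" \
    --region="$REGION" \
    --jar="gs://$BUCKET/hadoop-mapreduce-examples.jar" \
    -- wordcount "gs://$BUCKET/inputFolder" "gs://$BUCKET/output"
}

# Dry run: show the command that would be submitted.
submit_wordcount
```

Everything after the bare <code>--</code> is passed to the JAR, so it matches the 3 arguments above. Note that Hadoop normally refuses to start if the output folder already exists, so delete it or pick a new name before re-running.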

[[File:Dataproc-hadoop-2.jpeg]]
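Once the job has finished, the output folder should hold one or more part files, with one word and its count per line. A minimal sketch for listing the most frequent words, assuming the usual tab-separated <code>word&lt;TAB&gt;count</code> layout of the Hadoop wordcount example; <code>top_words</code> is a hypothetical helper introduced here, and the sample data is made up for the demonstration.

```shell
top_words() {
  # Sort on the second tab-separated field (the count), numerically and
  # in descending order, then keep the first N lines (default 10).
  sort -t "$(printf '\t')" -k2,2nr | head -n "${1:-10}"
}

# Local demonstration with made-up sample data (not real job output).
printf 'spark\t3\nhadoop\t7\ndataproc\t5\n' | top_words 2

# Against real job output (assumes gsutil access to your bucket):
#   gsutil cat "gs://$BUCKET/output/part-*" | top_words 10
```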

'''To output the files to a .txt file'''

# Open an SSH session on the master VM node: '''Menu -> Compute -> Compute Engine -> VM Instances -> SSH (of 'm' master node) -> Open in Browser Window'''
# Run the following command in the shell to aggregate the results into a single 'output.txt' file in the bucket:

gsutil cat gs://rinsereduce/output/* | gsutil cp - gs://rinsereduce/output/output.txt

=== Running the Apache Spark Wordcount Job in Dataproc ===

'''Create and Submit Dataproc Job'''

DanielPark

76

edits

CDOT Wiki β
