GPU621/Apache Spark

Spark vs Hadoop Wordcount Performance
[[File:Googlecloud-setup-9.jpg]]
 
'''To view the individual nodes in the cluster, go to Menu -> Virtual Machines -> VM Instances'''
# Map each word to a (word, 1) pair, then sum the counts for each distinct word
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda count1, count2: count1 + count2)
# Write the (word, count) results to the output path given as the second command-line argument
wordCounts.saveAsTextFile(sys.argv[2])
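For context, a minimal self-contained version of the wordcount script might look like the sketch below; the SparkContext setup and the whitespace split are assumptions about the parts of the script not shown here.
import sys
from pyspark import SparkContext

# Usage (paths are passed on the command line):
#   wordcount.py gs://<myBucketName>/inputFolder gs://<myBucketName>/output
sc = SparkContext(appName="wordcount")
lines = sc.textFile(sys.argv[1])
# Split each line into words, pair each word with 1, then sum the counts per word
words = lines.flatMap(lambda line: line.split())
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda count1, count2: count1 + count2)
wordCounts.saveAsTextFile(sys.argv[2])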
 
'''Finally, add the input files containing the text the word count jobs will be processing'''
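As a rough sketch, the input text files can be uploaded to the Cloud Storage bucket with gsutil; the local file names below are placeholders, not files from this walkthrough.
gsutil cp input1.txt input2.txt gs://<myBucketName>/inputFolder/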
Now that we have our project code, input files, and Dataproc cluster set up, we can proceed to run the Hadoop MapReduce and Spark wordcount jobs.
 
'''Run the Hadoop MapReduce Job'''
The job takes 3 arguments:
wordcount gs://<myBucketName>/inputFolder gs://<myBucketName>/output
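If you prefer the command line to the Cloud Console, an equivalent submission could look like the following sketch; the cluster name, region, and the path to the Hadoop examples jar are assumptions and may differ on your cluster.
gcloud dataproc jobs submit hadoop --cluster=<myClusterName> --region=<myRegion> --jar=file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar -- wordcount gs://<myBucketName>/inputFolder gs://<myBucketName>/output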
 
'''Note: Running the job will create the output folder. However, for subsequent jobs be sure to delete the output folder first, otherwise Hadoop or Spark will not run. This limitation exists to prevent existing output from being overwritten.'''
When the jobs have completed and all the input files have been processed, Hadoop provides '''counters''': statistics on the executed job.
You can also navigate back to the '''Jobs''' tab to see the total elapsed time of the job.
 
'''Some counters of note:'''
[[File:Dataproc-hadoop-2.jpeg]]
 
'''To combine the output files into a single .txt file'''
# Open an SSH session to the master VM node: '''Menu -> Compute -> Compute Engine -> VM Instances -> SSH (of 'm' master node) -> Open in Browser Window'''
# Run the following command in the shell to aggregate the results into an 'output.txt' file:
gsutil cat gs://rinsereduce/output/* > output.txt
# You can then download 'output.txt' from the VM's local storage to your local machine: press the dropdown from the Gear icon in the SSH window and select '''Download File'''
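Alternatively, if the Cloud SDK is installed on your local machine (an assumption, not covered in this walkthrough), the same aggregation can be run locally against the bucket, skipping the master VM entirely:
gsutil cat gs://rinsereduce/output/* > output.txt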