== Spark vs Hadoop Wordcount Performance ==
[[File:Googlecloud-setup-11b.jpg]]
'''Ensure that the Dataproc, Compute Engine, and Cloud Storage APIs are all enabled'''
# Go to '''Menu -> API & Services -> Library'''.
# Search for each API by name and enable it if it is not already enabled.
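Alternatively, the same APIs can be enabled from Cloud Shell with a single gcloud command. This is a sketch; the exact service names are assumptions based on the standard Google Cloud service naming:
<Code>
gcloud services enable dataproc.googleapis.com compute.googleapis.com storage.googleapis.com
</Code>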
Create a Cloud Storage bucket by going to '''Menu -> Storage -> Browser -> Create Bucket'''
Make a note of the bucket name.
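If you prefer the command line, the bucket can also be created with gsutil; here <myBucketName> is a placeholder for your own bucket name:
<Code>
gsutil mb gs://<myBucketName>/
</Code>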
'''Copy the Hadoop wordcount example, available on every Dataproc cluster, from the master node VM to the Cloud Storage bucket'''
# Open a Secure Shell (SSH) session from the VM Instances list: '''Menu -> Compute -> Compute Engine'''.
# To copy the example jar from the VM's local disk to the Cloud Storage bucket, enter the following command in the shell:
<Code> gsutil cp /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar gs://<myBucketName>/ </Code>
'''Save the Spark wordcount example into the Cloud Storage bucket by dragging and dropping it into the storage browser'''
# To open the browser: '''Menu -> Storage -> Browser'''
# Drag and drop the word-count.py shown below into the browser, or use 'UPLOAD FILES' to upload it.
<Code>
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda count1, count2: count1 + count2)
wordCounts.saveAsTextFile(sys.argv[2])
</Code>
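The two lines above are only an excerpt of word-count.py. A complete script in the standard PySpark pattern might look like the sketch below; the SparkContext setup, the lines/words RDD names, the appName, and reading the input path from sys.argv[1] are assumptions, not confirmed parts of the original file:
<Code>
#!/usr/bin/env python
# Minimal PySpark wordcount sketch (assumed structure, not the original file).
import sys
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

# Read the input path given as the first argument and split each line into words.
lines = sc.textFile(sys.argv[1])
words = lines.flatMap(lambda line: line.split())

# Pair each word with 1, then sum the counts per word.
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda count1, count2: count1 + count2)

# Write the (word, count) pairs to the output path given as the second argument.
wordCounts.saveAsTextFile(sys.argv[2])

sc.stop()
</Code>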
 
'''Finally, add the input files containing the text the word count jobs will be processing'''
* Go to the Cloud Storage bucket: '''Menu -> Storage -> Browser'''
* Create a new folder named 'input' and open it
* Drag and drop the input files, or use 'UPLOAD FILES' or 'UPLOAD FOLDER' (a command-line alternative is sketched below)
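The same upload can be done with gsutil from a local shell; <myBucketName> is a placeholder, and the *.txt pattern assumes the input files are plain-text files in the current directory:
<Code>
gsutil cp *.txt gs://<myBucketName>/input/
</Code>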
 
For this analysis we are using archived text files of game walkthroughs from https://gamefaqs.gamespot.com/.
The files range in size from 4 MB to 2.8 GB, for a total of 7.7 GB of plain text.
[[File:Googlecloud-wordcountfiles.jpg]]
 