=== Setting up Dataproc and Google Cloud Storage ===
Copy the Hadoop wordcount example, available on every Dataproc cluster, from the master node VM to our Cloud Storage bucket:
# Open Secure Shell (SSH) from the VM Instances list: Menu -> Compute -> Compute Engine.
# To copy from the VM local disk to the Cloud Storage bucket, enter the following command in the shell:

<Code>
gsutil cp /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar gs://<myBucketName>/
</Code>

Replace <myBucketName> with your bucket name (rinsereduce in this example).

Save the Spark wordcount example into the Cloud Storage bucket by dragging and dropping it into the Storage browser:
# To open the Browser: Menu -> Storage -> Browser.
# Drag and drop the word-count.py below into the browser, or use 'UPLOAD FILES' to upload.

<Code>
#!/usr/bin/env python

import pyspark
import sys

if len(sys.argv) != 3:
    raise Exception("Exactly 2 arguments are required: <inputUri> <outputUri>")

inputUri = sys.argv[1]
outputUri = sys.argv[2]

# Read the input file, split each line into words, count each word, and write the result
sc = pyspark.SparkContext()
lines = sc.textFile(inputUri)
words = lines.flatMap(lambda line: line.split())
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda count1, count2: count1 + count2)
wordCounts.saveAsTextFile(outputUri)
</Code>
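As an optional check (not part of the steps above), the script can also be uploaded from the shell instead of the drag-and-drop Storage browser, and the bucket contents listed to confirm both files arrived. This is a minimal sketch, assuming word-count.py has been saved in the current directory and <myBucketName> is your bucket:

<Code>
# Upload the PySpark script from the shell (alternative to drag-and-drop in the Storage browser)
gsutil cp word-count.py gs://<myBucketName>/

# List the bucket to confirm hadoop-mapreduce-examples.jar and word-count.py are both present
gsutil ls gs://<myBucketName>/
</Code>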
=== Results ===