
GPU621/Apache Spark Fall 2022

===Run the application on the cluster===
Unlike a transformation, triggering an action does not form a new RDD. Actions are Spark RDD operations that return non-RDD values; their results are returned to the driver program or written to an external storage system. Calling an action is what sets the lazy evaluation of RDDs in motion.
2.1. reduce(func)
Aggregate the elements of the dataset using a function func (which takes two arguments and returns one).
long result = numbersRDD.reduce((a, b) -> a + b); // assuming a JavaRDD<Long> numbersRDD
System.out.println(result);
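A minimal, self-contained sketch of reduce(). The local[*] master and the sample data are illustrative assumptions (on EMR, spark-submit supplies the master):

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReduceExample {
    // Sums a list of integers with RDD.reduce()
    public static int sum(List<Integer> data) {
        // local[*] runs Spark in-process, using all cores; illustrative only
        SparkConf conf = new SparkConf().setAppName("ReduceExample").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> rdd = sc.parallelize(data);
            // func takes two elements and returns one; Spark applies it until a single value remains
            return rdd.reduce((a, b) -> a + b);
        }
    }

    public static void main(String[] args) {
        System.out.println(sum(Arrays.asList(1, 2, 3, 4, 5))); // prints 15
    }
}
```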
2.2. count()
The action count() returns the number of elements in the RDD.
2.3. take(n)
The action take(n) returns n elements from the RDD. It tries to minimize the number of partitions it accesses, so the result may be a biased collection, and we cannot presume the order of the elements.
2.4. collect()
The action collect() is the most common and simplest operation; it returns the entire content of the RDD to the driver program.
2.5. foreach()
When we want to apply an operation to each element of an RDD without returning a value to the driver, the foreach() action is useful.
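The actions above can be seen together on one small RDD; a minimal sketch, again assuming an illustrative local[*] master and sample data:

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ActionsExample {
    // Returns {count, take(2).size(), collect().size()} for the given data
    public static long[] run(List<String> data) {
        SparkConf conf = new SparkConf().setAppName("ActionsExample").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> words = sc.parallelize(data);
            long n = words.count();                    // number of elements in the RDD
            List<String> firstTwo = words.take(2);     // two elements; order is not guaranteed
            List<String> all = words.collect();        // the entire RDD content, back on the driver
            words.foreach(w -> System.out.println(w)); // runs on the executors; returns nothing to the driver
            return new long[] { n, firstTwo.size(), all.size() };
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(Arrays.asList("spark", "rdd", "action", "spark"))));
    }
}
```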
//map to only words
JavaRDD<String> wordsRDD = removeBlankLineRDD.flatMap(sentence -> Arrays.asList(sentence.split(" ")).iterator());
 
//create pair RDD
JavaPairRDD<String, Long> pairRDD = wordsRDD.mapToPair(word -> new Tuple2<>(word, 1L));
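To finish the word count, the pair RDD can be reduced by key and collected. A self-contained sketch of the whole pipeline; the class name, sample lines, and local[*] master are illustrative assumptions:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class WordCount {
    // Builds word counts from lines of text using a local Spark context
    public static Map<String, Long> countWords(List<String> lines) {
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> linesRDD = sc.parallelize(lines);
            // map to only words
            JavaRDD<String> wordsRDD = linesRDD.flatMap(s -> Arrays.asList(s.split(" ")).iterator());
            // create pair RDD
            JavaPairRDD<String, Long> pairRDD = wordsRDD.mapToPair(w -> new Tuple2<>(w, 1L));
            // reduceByKey sums the counts per word; collectAsMap() is the action that triggers computation
            return pairRDD.reduceByKey((a, b) -> a + b).collectAsMap();
        }
    }

    public static void main(String[] args) {
        System.out.println(countWords(Arrays.asList("hello spark", "hello world")));
    }
}
```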
SSH into the master node of the cluster, then copy the jar file to the master node using:
aws s3 cp <s3://yourbucket/jarFileName.jar> .
Then you can run the app using:
spark-submit <jarFileName.jar>
[[File: output spark.png | 800px]]
Congrats!

===Check cluster status===
Spark provides a simple dashboard to check the status of the cluster. Visit <your_cluster_master_DNS>:18080 and you will see the dashboard.

[[File: Dashboard spark.png | 800px]]

Click the application id to see more details, such as the job descriptions.

[[File: Spark jobs.png | 800px]]

Or the stage descriptions.

[[File: Spark stages.png | 800px]]

===Conclusion===
With Amazon EMR you can set up a cluster to deploy, process, and analyze data with big data frameworks in just a few minutes. You can install Spark on an Amazon EMR cluster along with other Hadoop applications, and Spark can also leverage the EMR File System (EMRFS) to directly access data in Amazon S3.

==References==
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-gs.html

https://www.databricks.com/glossary/what-is-rdd

https://www.oreilly.com/library/view/apache-spark-2x/9781787126497/d0ae45f4-e8a1-4ea7-8036-606b7e27ddfd.xhtml

https://data-flair.training/blogs/spark-rdd-operations-transformations-actions/