
GPU621/Apache Spark Fall 2022

===Run the application on the cluster===
Unlike a transformation, triggering an action does not form a new RDD. Actions are Spark RDD operations that return non-RDD values; their results are returned to the driver program or written to an external storage system. Calling an action is what sets the lazy evaluation of RDDs in motion.
2.1. reduce(func)
Aggregate the elements of the dataset using a function func (which takes two arguments and returns one).
long result = numbersRDD.reduce((a, b) -> a + b); // assuming a JavaRDD<Long> numbersRDD
System.out.println(result);
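A minimal, self-contained sketch of reduce(). The local[*] master and the sample data are illustrative assumptions (on EMR, spark-submit supplies the master):

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReduceExample {
    // Sums a list of integers with RDD.reduce()
    public static int sum(List<Integer> data) {
        // local[*] runs Spark in-process, using all cores; illustrative only
        SparkConf conf = new SparkConf().setAppName("ReduceExample").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> rdd = sc.parallelize(data);
            // func takes two elements and returns one; Spark applies it until a single value remains
            return rdd.reduce((a, b) -> a + b);
        }
    }

    public static void main(String[] args) {
        System.out.println(sum(Arrays.asList(1, 2, 3, 4, 5))); // prints 15
    }
}
```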
2.2. count()
The action count() returns the number of elements in the RDD.
2.3. take(n)
The action take(n) returns n elements from the RDD. It tries to minimize the number of partitions it accesses, so the result may be a biased collection, and we cannot presume the order of the elements.
2.4. collect()
The action collect() is the most common and simplest operation; it returns the entire content of the RDD to the driver program.
2.5. foreach()
When we want to apply an operation to each element of an RDD without returning a value to the driver, the foreach() action is useful.
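The actions above can be seen together on one small RDD; a minimal sketch, again assuming an illustrative local[*] master and sample data:

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ActionsExample {
    // Returns {count, take(2).size(), collect().size()} for the given data
    public static long[] run(List<String> data) {
        SparkConf conf = new SparkConf().setAppName("ActionsExample").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> words = sc.parallelize(data);
            long n = words.count();                    // number of elements in the RDD
            List<String> firstTwo = words.take(2);     // two elements; order is not guaranteed
            List<String> all = words.collect();        // the entire RDD content, back on the driver
            words.foreach(w -> System.out.println(w)); // runs on the executors; returns nothing to the driver
            return new long[] { n, firstTwo.size(), all.size() };
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(Arrays.asList("spark", "rdd", "action", "spark"))));
    }
}
```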
//map to only words
JavaRDD<String> wordsRDD = removeBlankLineRDD.flatMap(sentence -> Arrays.asList(sentence.split(" ")).iterator());
 
//create pair RDD
JavaPairRDD<String, Long> pairRDD = wordsRDD.mapToPair(word -> new Tuple2<>(word, 1L));
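To finish the word count, the pair RDD can be reduced by key and collected. A self-contained sketch of the whole pipeline; the class name, sample lines, and local[*] master are illustrative assumptions:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class WordCount {
    // Builds word counts from lines of text using a local Spark context
    public static Map<String, Long> countWords(List<String> lines) {
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> linesRDD = sc.parallelize(lines);
            // map to only words
            JavaRDD<String> wordsRDD = linesRDD.flatMap(s -> Arrays.asList(s.split(" ")).iterator());
            // create pair RDD
            JavaPairRDD<String, Long> pairRDD = wordsRDD.mapToPair(w -> new Tuple2<>(w, 1L));
            // reduceByKey sums the counts per word; collectAsMap() is the action that triggers computation
            return pairRDD.reduceByKey((a, b) -> a + b).collectAsMap();
        }
    }

    public static void main(String[] args) {
        System.out.println(countWords(Arrays.asList("hello spark", "hello world")));
    }
}
```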
SSH into the master node of the cluster, then copy the jar file to the master node using:
aws s3 cp <s3://yourbucket/jarFileName.jar> .
Then you can run the app using:
spark-submit <jarFileName.jar>
[[File: output spark.png | 800px]]
Congrats!

===Check cluster status===
Spark provides a simple dashboard to check the status of the cluster. Visit <your_cluster_master_DNS>:18080 and you will see the dashboard.

[[File: Dashboard spark.png | 800px]]

Click the application id to see more details, such as the job descriptions.

[[File: Spark jobs.png | 800px]]

Or the stage descriptions.

[[File: Spark stages.png | 800px]]

===Conclusion===
With Amazon EMR you can set up a cluster to deploy, process, and analyze data with big data frameworks in just a few minutes. You can install Spark on an Amazon EMR cluster along with other Hadoop applications, and Spark can also leverage the EMR File System (EMRFS) to directly access data in Amazon S3.

==References==
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-gs.html

https://www.databricks.com/glossary/what-is-rdd

https://www.oreilly.com/library/view/apache-spark-2x/9781787126497/d0ae45f4-e8a1-4ea7-8036-606b7e27ddfd.xhtml

https://data-flair.training/blogs/spark-rdd-operations-transformations-actions/