GPU621/Apache Spark Fall 2022

==RDD Overview==
One of the most important concepts in Spark is the resilient distributed dataset (RDD). An RDD is a collection of elements partitioned across the nodes of a cluster that can be operated on in parallel. RDDs are created by starting with a file, or with an existing Java collection in the driver program. Spark is normally used to handle huge amounts of data, and transforming RDDs is what makes it possible for Spark to split the input data across different nodes. RDDs also provide useful APIs for the programmer to call. We will introduce some key APIs provided by Spark Core 2.2.1 using Java 8. You can find more information about RDDs here: https://spark.apache.org/docs/2.2.1/rdd-programming-guide.html
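
As a quick illustration of the two creation paths mentioned above, here is a minimal sketch in Java 8 (the class name and the "input.txt" path are placeholders of our own, not part of the course material):

<syntaxhighlight lang="java">
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class RddCreation {
    public static void main(String[] args) {
        // local[*] runs Spark on all local cores, convenient for trying things out.
        SparkConf conf = new SparkConf().setAppName("RddCreation").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // (1) Create an RDD from an existing Java collection in the driver program.
        JavaRDD<Integer> fromCollection = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // (2) Create an RDD from a file; each element is one line of the file.
        //     "input.txt" is a hypothetical path for this sketch.
        JavaRDD<String> fromFile = sc.textFile("input.txt");

        System.out.println(fromCollection.count() + " elements from the collection");
        sc.close();
    }
}
</syntaxhighlight>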
===Spark Library Installation Using Maven===
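Spark Core 2.2.1 is published to Maven Central as the spark-core_2.11 artifact (built against Scala 2.11), so a dependency entry along these lines in your pom.xml should be enough to pull the library into a Java project:

<syntaxhighlight lang="xml">
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.2.1</version>
</dependency>
</syntaxhighlight>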
===RDD APIs===
 
[[file: Actions_RDD.png|800px]]
Basically, there are two types of RDD APIs: transformations and actions. Transformations are functions that return another RDD; using them, we can create a child RDD from a parent RDD. Actions are the functions we want to perform on the actual dataset; they trigger the computation and return a result to the driver program rather than a new RDD. A minimal, self-contained sketch of this distinction follows (the class name and sample data are our own, not from the course page):
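
<syntaxhighlight lang="java">
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class TransformationsAndActions {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("TransformationsAndActions").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> parent = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));

        // Transformations: each returns a new child RDD and is evaluated lazily.
        JavaRDD<Integer> evens   = parent.filter(n -> n % 2 == 0);
        JavaRDD<Integer> squared = evens.map(n -> n * n);

        // Actions: these run the computation on the dataset and return plain values.
        long count = squared.count();
        int sum    = squared.reduce((a, b) -> a + b);

        System.out.println("count = " + count + ", sum = " + sum); // count = 3, sum = 56
        sc.close();
    }
}
</syntaxhighlight>

Note that neither filter nor map does any work by itself; the computation only runs when an action such as count or reduce is called on the child RDD.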