

GPU621/Apache Spark Fall 2022

75 bytes added, 15:36, 30 November 2022
RDD Overview
===RDD Overview===
One of the most important concepts in Spark is the resilient distributed dataset (RDD). An RDD is a collection of elements partitioned across the nodes of a cluster that can be operated on in parallel. RDDs are created by starting with a file or an existing Java collection in the driver program and transforming it.
We will introduce some key APIs provided by Spark Core 2.2.1 using Java 8.
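As a rough illustration of the idea above, the following sketch creates an RDD from an existing Java collection in the driver program and transforms it. It assumes Spark Core 2.2.1 is on the classpath and runs Spark in local mode; the class name <code>RddOverviewSketch</code>, the helper <code>sumOfSquares</code>, and the sample numbers are hypothetical and chosen only for this example. An RDD could be created from a file in a similar way with <code>JavaSparkContext.textFile</code>.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical example class, not part of Spark itself.
public class RddOverviewSketch {

    // Builds an RDD from an in-memory Java collection, squares each element
    // in parallel, and sums the squares into a single value on the driver.
    static int sumOfSquares(JavaSparkContext sc, List<Integer> numbers) {
        JavaRDD<Integer> rdd = sc.parallelize(numbers);   // collection -> RDD
        JavaRDD<Integer> squares = rdd.map(x -> x * x);   // transformation
        return squares.reduce((a, b) -> a + b);           // action
    }

    public static void main(String[] args) {
        // "local[*]" is an assumption for this sketch: it runs Spark inside
        // this JVM using all available cores instead of a real cluster.
        SparkConf conf = new SparkConf()
                .setAppName("RddOverviewSketch")
                .setMaster("local[*]");

        // JavaSparkContext is Closeable, so try-with-resources stops it cleanly.
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            int total = sumOfSquares(sc, Arrays.asList(1, 2, 3, 4, 5));
            System.out.println(total);
        }
    }
}
```

The call to <code>map</code> only describes the transformation; nothing is computed until the <code>reduce</code> action asks for a result.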
===Spark Library Installation Using Maven===