GPU621/Apache Spark Fall 2022

===RDD Overview===
One of the most important concepts in Spark is the resilient distributed dataset (RDD): a collection of elements partitioned across the nodes of a cluster so that they can be operated on in parallel. RDDs are created either by loading an external dataset, such as a file, or by parallelizing an existing Java collection in the driver program, and new RDDs are then derived from them through transformations.
We will introduce some key APIs provided by Spark Core 2.2.1 using Java 8.
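As a minimal sketch of the ideas above, the following program creates an RDD from a Java collection with `parallelize`, applies a `map` transformation, and triggers execution with a `reduce` action. The class name `RDDExample` and the use of a `local[*]` master are illustrative choices, not part of the original text:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RDDExample {
    public static void main(String[] args) {
        // Run locally on all cores; on a real cluster the master URL would differ
        SparkConf conf = new SparkConf().setAppName("RDDExample").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Create an RDD from an existing Java collection in the driver program
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // An RDD could also be created from a file, one element per line:
        // JavaRDD<String> lines = sc.textFile("data.txt");

        // Transformation: applied in parallel across the RDD's partitions
        JavaRDD<Integer> squares = numbers.map(x -> x * x);

        // Action: triggers the computation and returns a result to the driver
        int sum = squares.reduce(Integer::sum);
        System.out.println("Sum of squares: " + sum); // 1+4+9+16+25 = 55

        sc.close();
    }
}
```

Note that transformations such as `map` are lazy; nothing is computed until an action such as `reduce` is invoked.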
===Spark Library Installation Using Maven===
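As a sketch of what this section presumably covers, the Spark Core 2.2.1 library mentioned above can be pulled in by adding a dependency to the project's `pom.xml`; the `spark-core_2.11` artifact (built against Scala 2.11) is the one published for this Spark version:

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.2.1</version>
</dependency>
```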