CDOT Wiki β

GPU621/Apache Spark Fall 2022

2 bytes added, 20:49, 3 December 2022
Create RDDs
1. Parallelized Collections
 
The first way is to parallelize an existing Java collection by calling JavaSparkContext’s parallelize method on a Collection in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.
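As a minimal sketch of the parallelize call described here, the snippet below distributes a small list and sums it with reduce. The class name ParallelizeExample, the helper sumWithSpark, and the local[*] master setting are assumptions chosen for illustration; they are not taken from the original page.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical example class; the name is not from the original page.
public class ParallelizeExample {

    // Copies the driver-side list into an RDD and sums its elements in parallel.
    static int sumWithSpark(JavaSparkContext sc, List<Integer> data) {
        JavaRDD<Integer> distData = sc.parallelize(data);
        return distData.reduce((a, b) -> a + b);
    }

    public static void main(String[] args) {
        // Assumption: run locally using all available cores.
        SparkConf conf = new SparkConf()
                .setAppName("ParallelizeExample")
                .setMaster("local[*]");

        // JavaSparkContext is Closeable, so try-with-resources stops it cleanly.
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
            System.out.println("sum = " + sumWithSpark(sc, data));
        }
    }
}
```

Once the RDD exists, the original list in the driver is no longer involved; operations such as reduce run on the distributed copy.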
2. External Datasets
 
The other way is to create an RDD from any storage source supported by Hadoop, including your local file system, HDFS, Amazon S3, etc. Text file RDDs can be created using SparkContext’s textFile method. This method takes a URI for the file (either a local path on the machine, or an hdfs://, s3n://, etc. URI) and reads it as a collection of lines.
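A rough sketch of reading a text file into an RDD follows. To stay self-contained it writes a temporary file first and then passes its file: URI to textFile. The class name TextFileExample, the helper countLines, the temporary file, and the local[*] master are all assumptions for illustration, not part of the original page.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical example class; the name is not from the original page.
public class TextFileExample {

    // Reads the file at the given URI as an RDD of lines and counts them.
    static long countLines(JavaSparkContext sc, String uri) {
        JavaRDD<String> lines = sc.textFile(uri);
        return lines.count();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a throwaway local file stands in for real input data.
        Path tmp = Files.createTempFile("spark-lines", ".txt");
        Files.write(tmp, Arrays.asList("first", "second", "third"));

        SparkConf conf = new SparkConf()
                .setAppName("TextFileExample")
                .setMaster("local[*]");

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // A local path is passed as a file: URI; an hdfs:// URI would be used the same way.
            System.out.println("lines = " + countLines(sc, tmp.toUri().toString()));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Each element of the resulting RDD is one line of the file, so count returns the number of lines.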
//From local file