GPU621/Apache Spark Fall 2022

587 bytes added, 16:25, 30 November 2022
Create RDDs
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
# Parallelized Collections
A parallelized collection is created by calling JavaSparkContext’s parallelize method on an existing Collection in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.
// Create an RDD from an existing collection (inputData is a List<Integer> in the driver program)
JavaRDD<Integer> javaRDD = sc.parallelize(inputData);
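
To make the idea of copying a collection into a dataset that is processed in parallel more concrete, here is a plain-Java sketch. It is an analogy only: it does not use Spark and is not Spark's actual implementation. The helpers `slice` and `parallelSum` and the sample values in `inputData` are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SliceSketch {
    // Hypothetical helper: cut the list into numSlices contiguous, near-equal parts.
    public static <T> List<List<T>> slice(List<T> data, int numSlices) {
        if (numSlices < 1) {
            throw new IllegalArgumentException("numSlices must be >= 1");
        }
        List<List<T>> slices = new ArrayList<>();
        int n = data.size();
        for (int i = 0; i < numSlices; i++) {
            int start = (int) ((long) i * n / numSlices);
            int end = (int) ((long) (i + 1) * n / numSlices);
            // Copy the elements so each slice is independent of the source list.
            slices.add(new ArrayList<>(data.subList(start, end)));
        }
        return slices;
    }

    // Hypothetical helper: sum every slice independently, then combine the partial sums.
    public static int parallelSum(List<Integer> data, int numSlices) {
        return slice(data, numSlices).parallelStream()
                .mapToInt(part -> part.stream().mapToInt(Integer::intValue).sum())
                .sum();
    }

    public static void main(String[] args) {
        List<Integer> inputData = Arrays.asList(35, 12, 90, 20); // made-up sample values
        System.out.println(slice(inputData, 2));       // [[35, 12], [90, 20]]
        System.out.println(parallelSum(inputData, 2)); // 157
    }
}
```

In Spark itself, the number of partitions can be requested through the optional second argument of parallelize, for example `sc.parallelize(inputData, 2)`.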
 
# External Datasets
The other way is to create an RDD from any storage source supported by Hadoop, including your local file system, HDFS, Amazon S3, etc. Text file RDDs can be created using JavaSparkContext’s textFile method. This method takes a URI for the file (either a local path on the machine, or a URI such as hdfs:// or s3://) and reads it as a collection of lines.
// From a local file
JavaRDD<String> sentences = sc.textFile("src/main/resources/subtitles/input.txt");
// From an S3 file (an alternative to the line above; use one or the other)
JavaRDD<String> sentences = sc.textFile("s3://gpu621-demo/input.txt");
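
To show what "a collection of lines" means for the elements of such an RDD, here is a plain-Java sketch that needs no Spark installation. It is an analogy only and does not call textFile. The helper names `readLines` and `roundTrip` and the sample sentences are hypothetical. Each line of the file becomes one String element, which mirrors the element type of the JavaRDD<String> above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class LinesSketch {
    // Hypothetical helper: read a text file into one String per line (UTF-8).
    public static List<String> readLines(Path file) {
        try {
            return Files.readAllLines(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: write the lines to a temporary file, then read them back.
    public static List<String> roundTrip(List<String> lines) {
        try {
            Path tmp = Files.createTempFile("input", ".txt");
            try {
                Files.write(tmp, lines);
                return readLines(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample content standing in for input.txt.
        List<String> sentences = roundTrip(Arrays.asList("first line", "second line"));
        System.out.println(sentences.size()); // 2
        System.out.println(sentences.get(0)); // first line
    }
}
```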
==Deploy Apache Spark Application On AWS==