Changes

GPU621/Apache Spark Fall 2022

587 bytes added, 16:25, 30 November 2022

==Create RDDs==

There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

# Parallelized Collections

The first way is to call JavaSparkContext’s parallelize method on an existing Java collection (a java.util.List) in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.

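//Create an RDD from an existing list.
//Here sc is a JavaSparkContext and inputData is a java.util.List<Integer> defined in the driver program.

JavaRDD<Integer> javaRDD = sc.parallelize(inputData);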
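To make "copied to form a distributed dataset" more concrete, the following plain-Java sketch shows one way a list could be cut into contiguous partitions that workers then process independently. It does not use Spark and is not Spark's implementation. The class name PartitionSketch, the helper slice, and the sample values are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PartitionSketch {
    // Split a list into numSlices contiguous chunks of near-equal size.
    // Illustrative only: a stand-in for the idea of partitioning, not Spark code.
    static <T> List<List<T>> slice(List<T> data, int numSlices) {
        if (numSlices < 1) {
            throw new IllegalArgumentException("numSlices must be positive");
        }
        List<List<T>> partitions = new ArrayList<>();
        int n = data.size();
        for (int i = 0; i < numSlices; i++) {
            // Integer arithmetic spreads any remainder across the chunks.
            int start = (int) ((long) i * n / numSlices);
            int end = (int) ((long) (i + 1) * n / numSlices);
            partitions.add(new ArrayList<>(data.subList(start, end)));
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for inputData.
        List<Integer> inputData = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(slice(inputData, 2)); // prints [[1, 2], [3, 4, 5]]
    }
}
```

In Spark itself, the number of partitions can be requested with the two-argument overload, for example sc.parallelize(inputData, 2).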

# External Datasets

The other way is to create an RDD from any storage source supported by Hadoop, including your local file system, HDFS, and Amazon S3. Text file RDDs can be created using JavaSparkContext’s textFile method. This method takes a URI for the file (either a local path on the machine or a URI such as hdfs:// or s3a://; on Amazon EMR the s3:// scheme is also available) and reads it as a collection of lines, with one RDD element per line.

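//Option 1: read from a local file

JavaRDD<String> sentences = sc.textFile("src/main/resources/subtitles/input.txt");

//Option 2: read from an S3 file (an alternative to the line above; use only one of the two declarations)

JavaRDD<String> sentences = sc.textFile("s3://gpu621-demo/input.txt");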
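As a rough picture of what "reads it as a collection of lines" means, the plain-Java sketch below loads a small text file into a list with one element per line, which is the same shape of data that the elements of the sentences RDD take. It runs on a single machine without Spark. The class LinesSketch, its helper methods, and the sample text are assumptions made for illustration only.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class LinesSketch {
    // Write the given lines to a temporary file and return its path.
    static Path writeSample(List<String> lines) {
        try {
            Path tmp = Files.createTempFile("input", ".txt");
            Files.write(tmp, lines, StandardCharsets.UTF_8);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read a text file into a list with one element per line (line terminators removed).
    static List<String> readLines(Path file) {
        try {
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample content standing in for input.txt.
        Path file = writeSample(Arrays.asList("first line", "second line", "third line"));
        List<String> sentences = readLines(file);
        System.out.println(sentences.size() + " lines, first = " + sentences.get(0));
    }
}
```

Unlike this sketch, Spark does not load the whole file into the driver's memory: the lines are spread across partitions and read by the executors.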

==Deploy Apache Spark Application On AWS==

RobinYu


CDOT Wiki β
