Changes

GPU621/Apache Spark Fall 2022

837 bytes added, 15:33, 30 November 2022

→‎Apache Spark Core API

JavaSparkContext sc = new JavaSparkContext(conf);

sc.setLogLevel("WARN");

===Create RDDs===

There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

# Parallelized Collections

A parallelized collection is created by calling JavaSparkContext’s parallelize method on an existing Collection in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.

//create input data list

List<Integer> inputData = new ArrayList<>();

inputData.add(11);

inputData.add(22);

inputData.add(33);

inputData.add(44);

//create an RDD from the input list

JavaRDD<Integer> javaRDD = sc.parallelize(inputData);
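To make "copied to form a distributed dataset that can be operated on in parallel" more concrete, here is a minimal conceptual sketch in plain Java with no Spark dependency. It is not Spark's implementation. The class name PartitionSketch and the helpers slice and parallelSum are hypothetical, introduced only for illustration. The sketch copies a list into contiguous slices, processes each slice independently, and combines the partial results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Conceptual sketch only: plain Java, no Spark. All names here are hypothetical.
class PartitionSketch {

    // Copy `data` into `numSlices` contiguous slices of near-equal size.
    static <T> List<List<T>> slice(List<T> data, int numSlices) {
        if (numSlices < 1) {
            throw new IllegalArgumentException("numSlices must be >= 1");
        }
        List<List<T>> slices = new ArrayList<>();
        int n = data.size();
        for (int i = 0; i < numSlices; i++) {
            int start = (int) ((long) i * n / numSlices);
            int end = (int) ((long) (i + 1) * n / numSlices);
            // Copy the elements so each slice is independent of the input list.
            slices.add(new ArrayList<>(data.subList(start, end)));
        }
        return slices;
    }

    // Sum each slice independently (possibly in parallel), then combine the partial sums.
    static int parallelSum(List<List<Integer>> slices) {
        return slices.parallelStream()
                .mapToInt(s -> s.stream().mapToInt(Integer::intValue).sum())
                .sum();
    }

    public static void main(String[] args) {
        List<Integer> inputData = Arrays.asList(11, 22, 33, 44);
        List<List<Integer>> slices = slice(inputData, 2);
        System.out.println(slices);              // [[11, 22], [33, 44]]
        System.out.println(parallelSum(slices)); // 110
    }
}
```

In Spark itself the slicing is handled for you. JavaSparkContext also offers an overload, parallelize(list, numSlices), that lets you request a specific number of partitions.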

==Deploy Apache Spark Application On AWS==

RobinYu


CDOT Wiki β
