GPU621/Apache Spark Fall 2022

837 bytes added, 15:33, 30 November 2022
Apache Spark Core API
JavaSparkContext sc = new JavaSparkContext(conf);
sc.setLogLevel("WARN");
 
===Create RDDs===
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
 
# Parallelized Collections
To create a parallelized collection, call JavaSparkContext’s parallelize method on an existing Collection in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.
 
//create input data list
List<Integer> inputData = new ArrayList<>();
inputData.add(11);
inputData.add(22);
inputData.add(33);
inputData.add(44);
 
//create an RDD by parallelizing the input list
JavaRDD<Integer> javaRDD = sc.parallelize(inputData);
==Deploy Apache Spark Application On AWS==