==Apache Spark Core API==
//configure and create the Spark context (local mode shown here)
SparkConf conf = new SparkConf().setAppName("SparkCoreExample").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);
sc.setLogLevel("WARN");
===Create RDDs===
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
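For the second approach, referencing an external dataset, a minimal sketch looks like the following; the HDFS URI and file path are hypothetical placeholders, and in practice you would point at a location reachable from your cluster:

```java
//reference an external text file; each line of the file becomes one element of the RDD
//(the namenode host, port, and path below are illustrative, not real)
JavaRDD<String> lines = sc.textFile("hdfs://namenode:9000/data/input.txt");
```

The same `textFile` call also accepts local paths and glob patterns, and Spark will only read the data when an action is eventually invoked on the RDD.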
====Parallelized Collections====
Let’s start with a Java collection, calling JavaSparkContext’s parallelize method on an existing Collection in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.
//create the input data list
List<Integer> inputData = new ArrayList<>();
inputData.add(11);
inputData.add(22);
inputData.add(33);
inputData.add(44);
//distribute the collection to form an RDD
JavaRDD<Integer> javaRDD = sc.parallelize(inputData);
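Once created, the distributed dataset can be operated on in parallel. As a minimal sketch using the standard Spark actions, and assuming the four-element RDD built above:

```java
//sum the elements in parallel with the reduce action
Integer sum = javaRDD.reduce((a, b) -> a + b);
System.out.println("sum = " + sum); //11 + 22 + 33 + 44 = 110

//bring the distributed data back to the driver as a local List
List<Integer> collected = javaRDD.collect();
System.out.println("collected = " + collected);
```

`reduce` is an action, so it triggers actual computation on the cluster; `collect` should be used with care on large datasets, since it pulls every element into the driver's memory.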
==Deploy Apache Spark Application On AWS==