GPU621/Apache Spark Fall 2022
Apache Spark
Apache Spark Core API
RDD Overview
One of the most important concepts in Spark is the resilient distributed dataset (RDD). An RDD is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file or an existing Java collection in the driver program and transforming it. We will introduce some of the key APIs provided by Spark Core 2.2.1 using Java 8. You can find more information about RDDs in the official programming guide: https://spark.apache.org/docs/2.2.1/rdd-programming-guide.html
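To give an idea of what operating on an RDD in parallel looks like, here is a minimal sketch of a transformation (map) followed by an action (reduce). It assumes a JavaSparkContext named sc, which is created in the sections below, plus imports of java.util.Arrays and org.apache.spark.api.java.JavaRDD; the values are illustrative only.

//build a small RDD from an in-memory list (sc is created in "Create And Set Up Spark" below)
JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4));

//transformation: produce a new RDD by doubling every element
JavaRDD<Integer> doubled = numbers.map(n -> n * 2);

//action: combine the elements and return the result (20) to the driver program
Integer sum = doubled.reduce((a, b) -> a + b);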
Spark Library Installation Using Maven
An Apache Spark application can be easily set up using Maven. To add the required libraries, copy and paste the following code into the "pom.xml".
<properties>
    <maven.compiler.source>8</maven.compiler.source>
    <maven.compiler.target>8</maven.compiler.target>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>2.2.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.10</artifactId>
        <version>2.2.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.2.0</version>
    </dependency>
</dependencies>
Create And Set Up Spark
Spark needs to know how to access a cluster, so first we need to create a JavaSparkContext object, which holds that information. To create a JavaSparkContext you first need to build a SparkConf object that contains information about your application. We will talk about how to set up Spark on a cluster later; for now, let's create a Spark context that runs locally. To do that, we need the following code:
//create and set up spark
SparkConf conf = new SparkConf().setAppName("HelloSpark").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);
sc.setLogLevel("WARN");
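The master URL "local[*]" tells Spark to run in a single JVM, using as many worker threads as there are cores on the machine, which is convenient for local development. When the application has finished its work, it is good practice to release the context. A minimal sketch, placed at the end of the driver program:

//release cluster resources once the work is done
sc.close();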
Create RDDs
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
1. Parallelized Collections

Parallelized collections are created by calling JavaSparkContext's parallelize method on an existing Collection in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. Let's start with a list of integers:
//create the input data list
List<Integer> inputData = new ArrayList<>();
inputData.add(11);
inputData.add(22);
inputData.add(33);
inputData.add(44);

//parallelize the list to create an RDD
JavaRDD<Integer> javaRDD = sc.parallelize(inputData);
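2. External Datasets

The second way is to reference a dataset in external storage. For text files, the context's textFile method takes a URI for the file (a local path, or a URI such as hdfs://) and reads it in as a collection of lines. A minimal sketch, assuming a file named input.txt exists in the working directory (the file name is illustrative):

//create an RDD from an external text file; each element is one line of the file
JavaRDD<String> lines = sc.textFile("input.txt");

//action: count the number of lines as a simple check
long lineCount = lines.count();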