GPU621/Apache Spark Fall 2022
Apache Spark
Apache Spark Core API
RDD Overview
One of the most important concepts in Spark is a resilient distributed dataset (RDD). RDD is a collection of elements partitioned across the nodes of the cluster that can be operated in parallel. RDDs are created by starting with a file, or an existing Java collection in the driver program, and transforming it.
Spark Library Installation Using Maven
An Apache Spark application can be easily instantiated using Maven. To add the required libraries, you can copy and paste the following code into the "pom.xml".
<properties> <maven.compiler.source>8</maven.compiler.source> <maven.compiler.target>8</maven.compiler.target> </properties> <dependencies> <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.10</artifactId> <version>2.2.0</version> </dependency> <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-sql_2.10</artifactId> <version>2.2.0</version> </dependency> <dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-hdfs</artifactId> <version>2.2.0</version> </dependency> </dependencies>
Create And Set Up Spark
Spark needs to be set up in a cluster. But you can also run it locally and act as a cluster. We will talk about how to set up a spark in a cluster later. Now let's try to create a spark locally. To do that, we will need the following code:
//create and set up spark SparkConf conf = new SparkConf().setAppName("HelloSpark").setMaster("local[*]"); JavaSparkContext sc = new JavaSparkContext(conf); sc.setLogLevel("WARN");