
GPU621/Apache Spark Fall 2022

1,681 bytes added, 22:14, 29 November 2022
==Apache Spark==
==Apache Spark Core API==
===RDD Overview===
One of the most important concepts in Spark is the resilient distributed dataset (RDD). An RDD is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting from a file in external storage, or from an existing Java collection in the driver program, and transforming it.
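As a brief sketch of the idea above (assuming a local Spark 2.x setup and the Java Core API; the class name `RDDExample` is just an illustration), an RDD can be created from an existing Java collection with `parallelize` and then transformed in parallel:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;

public class RDDExample {
    public static void main(String[] args) {
        // Run locally, using as many worker threads as there are cores.
        SparkConf conf = new SparkConf().setAppName("RDDExample").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Create an RDD from an existing Java collection in the driver program.
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
        JavaRDD<Integer> distData = sc.parallelize(data);

        // Transform the RDD; the map runs in parallel across partitions.
        JavaRDD<Integer> squares = distData.map(x -> x * x);

        // collect() gathers the partitioned results back to the driver.
        System.out.println(squares.collect());
        sc.close();
    }
}
```

Note that `collect()` should only be used on small results, since it pulls every partition back to the driver.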
 
===Spark Installation Using Maven===
An Apache Spark application can easily be set up using Maven. To add the required libraries, copy and paste the following code into the project's "pom.xml".
 
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>Spark11</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>11</maven.compiler.source>
        <maven.compiler.target>11</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>2.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.10</artifactId>
            <version>2.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.2.0</version>
        </dependency>
    </dependencies>
</project>
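With the dependencies in place, the project can be built and the resulting jar submitted to Spark. A minimal sketch, assuming the jar name produced by the artifactId above and a hypothetical main class `org.example.Main`:

```shell
# Package the application jar (skipping tests for speed).
mvn -q package -DskipTests

# Run the driver locally; replace org.example.Main with your main class.
spark-submit \
  --class org.example.Main \
  --master "local[*]" \
  target/Spark11-1.0-SNAPSHOT.jar
```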
==Deploy Apache Spark Application On AWS==