GPU621/Apache Spark Fall 2022
==Apache Spark==
==Apache Spark Core API==
===RDD Overview===
One of the most important concepts in Spark is the resilient distributed dataset (RDD). An RDD is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file, or an existing Java collection in the driver program, and transforming it.
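To make this concrete, here is a minimal sketch in Java (the class and app names are illustrative, not part of the course project) that creates an RDD from a driver-side collection, applies a transformation, and runs an action in local mode:

 import java.util.Arrays;
 import java.util.List;
 
 import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;
 
 public class RddExample {
     public static void main(String[] args) {
         // Illustrative app name; local[*] runs Spark on all local cores.
         SparkConf conf = new SparkConf().setAppName("RddExample").setMaster("local[*]");
         try (JavaSparkContext sc = new JavaSparkContext(conf)) {
             // Create an RDD from an existing Java collection in the driver program.
             List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
             JavaRDD<Integer> numbers = sc.parallelize(data);
 
             // Transformations (map) are lazy; the action (reduce) triggers execution.
             int sumOfSquares = numbers.map(x -> x * x).reduce(Integer::sum);
             System.out.println("Sum of squares: " + sumOfSquares);
 
             // An RDD can also be created from a file, e.g.:
             // JavaRDD<String> lines = sc.textFile("input.txt");
         }
     }
 }
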
+ | |||
===Spark Installation Using Maven===
An Apache Spark application can be set up easily with Maven. To add the required libraries, copy the following into the project's "pom.xml":
+ | |||
 <?xml version="1.0" encoding="UTF-8"?>
 <project xmlns="http://maven.apache.org/POM/4.0.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
     <modelVersion>4.0.0</modelVersion>
 
     <groupId>org.example</groupId>
     <artifactId>Spark11</artifactId>
     <version>1.0-SNAPSHOT</version>
 
     <properties>
         <!-- Spark 2.2.x runs on Java 8; Java 11 is only supported from Spark 3.x. -->
         <maven.compiler.source>8</maven.compiler.source>
         <maven.compiler.target>8</maven.compiler.target>
     </properties>
 
     <dependencies>
         <!-- Spark core (RDD API), built for Scala 2.10 -->
         <dependency>
             <groupId>org.apache.spark</groupId>
             <artifactId>spark-core_2.10</artifactId>
             <version>2.2.0</version>
         </dependency>
         <!-- Spark SQL (DataFrame/Dataset API) -->
         <dependency>
             <groupId>org.apache.spark</groupId>
             <artifactId>spark-sql_2.10</artifactId>
             <version>2.2.0</version>
         </dependency>
         <!-- HDFS client, for reading from and writing to Hadoop file systems -->
         <dependency>
             <groupId>org.apache.hadoop</groupId>
             <artifactId>hadoop-hdfs</artifactId>
             <version>2.2.0</version>
         </dependency>
     </dependencies>
 </project>
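With these dependencies in place, a short driver class can confirm that the setup resolves and runs. This is a minimal sketch (the class and app names are assumptions, not part of the course material), using the SparkSession API provided by the spark-sql dependency above:

 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
 import org.apache.spark.sql.SparkSession;
 
 public class SparkInstallCheck {
     public static void main(String[] args) {
         // Build a local SparkSession; local[*] is enough for a smoke test.
         SparkSession spark = SparkSession.builder()
                 .appName("SparkInstallCheck") // illustrative name
                 .master("local[*]")
                 .getOrCreate();
 
         // Build a tiny DataFrame and print it; five rows means the setup works.
         Dataset<Row> df = spark.range(5).toDF("id");
         df.show();
 
         spark.stop();
     }
 }

Running "mvn package" and then launching the class (from the IDE in local mode, or with spark-submit) should print a five-row table.
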
==Deploy Apache Spark Application On AWS==