GPU621/Apache Spark Fall 2022

Apache Spark

Apache Spark Core API

RDD Overview

One of the most important concepts in Spark is the resilient distributed dataset (RDD). An RDD is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file or an existing Java collection in the driver program and transforming it. We will introduce some key APIs provided by Spark Core 2.2.1 using Java 8. You can find more information about RDDs in the official programming guide: https://spark.apache.org/docs/2.2.1/rdd-programming-guide.html

Spark Library Installation Using Maven

An Apache Spark application can be easily set up using Maven. To add the required libraries, copy and paste the following code into the "pom.xml".

   <properties>
       <maven.compiler.source>8</maven.compiler.source>
       <maven.compiler.target>8</maven.compiler.target>
   </properties>
   <dependencies>
       <dependency>
           <groupId>org.apache.spark</groupId>
           <artifactId>spark-core_2.10</artifactId>
           <version>2.2.0</version>
       </dependency>
       <dependency>
           <groupId>org.apache.spark</groupId>
           <artifactId>spark-sql_2.10</artifactId>
           <version>2.2.0</version>
       </dependency>
       <dependency>
           <groupId>org.apache.hadoop</groupId>
           <artifactId>hadoop-hdfs</artifactId>
           <version>2.2.0</version>
       </dependency>
   </dependencies>

Create And Set Up Spark

Before Spark can do any work, we need to create a JavaSparkContext object, which tells Spark how to access a cluster. To create a SparkContext, you first need to build a SparkConf object that contains information about your application. We will talk about how to set up Spark on a cluster later; for now, let's create a Spark context that runs locally. To do that, we need the following code:

  //create and set up spark
  SparkConf conf = new SparkConf().setAppName("HelloSpark").setMaster("local[*]");
  JavaSparkContext sc = new JavaSparkContext(conf);
  sc.setLogLevel("WARN");

Create RDDs

There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

1. Parallelized Collections. Let's start with a Java collection by calling JavaSparkContext's parallelize method on an existing Collection in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.

       //create input data list
       List<Integer> inputData = new ArrayList<>();
       inputData.add(11);
       inputData.add(22);
       inputData.add(33);
       inputData.add(44);
       //create an RDD by distributing the list across the cluster
       JavaRDD<Integer> javaRDD = sc.parallelize(inputData);
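
Once the RDD exists, it can be operated on in parallel. As a minimal sketch (not part of the original example), summing the numbers with the reduce action could look like this:

       //sum all elements of the RDD in parallel using the reduce action
       Integer sum = javaRDD.reduce((a, b) -> a + b);
       System.out.println("Sum: " + sum); //prints "Sum: 110"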

2. External Datasets. The other way is to create an RDD from any storage source supported by Hadoop, including your local file system, HDFS, Amazon S3, etc. Text file RDDs can be created using SparkContext's textFile method. This method takes a URI for the file (either a local path on the machine, or an hdfs://, s3n://, etc. URI) and reads it as a collection of lines.

       //From a local file
       JavaRDD<String> sentences = sc.textFile("src/main/resources/subtitles/input.txt");
       //Or from an S3 file
       JavaRDD<String> sentences = sc.textFile("s3://gpu621-demo/input.txt");
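
The resulting RDD can then be transformed like any other. As a minimal sketch (not part of the original example), counting word occurrences in the loaded text could look like this (requires java.util.Arrays and java.util.Map):

       //split each line into words, then count how often each word appears
       Map<String, Long> wordCounts = sentences
               .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
               .countByValue();
       wordCounts.forEach((word, count) -> System.out.println(word + ": " + count));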

Deploy Apache Spark Application On AWS

Amazon EMR is AWS's cloud big data platform for petabyte-scale data processing, interactive analytics, and machine learning using open-source frameworks such as Apache Spark, Apache Hive, and Presto. EMR is easy to use and relatively inexpensive, so it is a great starting point for Spark beginners.

Prerequisite

From here, I will assume you have an AWS account and basic knowledge of AWS services, such as how to use an S3 bucket or how to add a role or policy to a service. You will also need basic knowledge of SSH and Linux commands.

Create an EMR cluster

Search for and choose EMR in the AWS services panel.

-IMAGE-

Click the Create Cluster button.

-IMAGE-

Enter a cluster name and choose a release version. Here I will choose EMR-5.11.1 as the release version. For the application, you can see that there are many options; we will choose Spark, as this is our main topic.

-IMAGE-

Next, we need to choose an instance type. As you may know, the cluster will run on multiple EC2 instances, and different EC2 instance types have different features and different prices; please refer to the EC2 pricing table. Here I will choose the c4.large type as it is one of the least expensive. For the number of instances, I will choose 3, that is, one master node and two worker nodes.

For the security options, choose an EC2 key pair you have already used for other services, or create a new one.

Click the Create cluster button and wait for the cluster to be set up.
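
If you prefer the command line, a roughly equivalent cluster can be created with the AWS CLI. This is only a sketch, not the exact setup above; the key pair name MyKeyPair is a placeholder and the default EMR roles must already exist in your account:

aws emr create-cluster --name "gpu621-demo" --release-label emr-5.11.1 --applications Name=Spark --instance-type c4.large --instance-count 3 --ec2-attributes KeyName=MyKeyPair --use-default-roles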

-IMAGE-

You will see a page like this. Next, we need to edit the security group for the master node, which acts like a firewall, to add an inbound rule.

-IMAGE-

We need to open port 22 and port 18080 to your IP address so that you can connect to the master EC2 instance.
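
This can be done from the EC2 console by editing the master node's security group. For reference, the equivalent AWS CLI call looks roughly like the following, where the security group ID and IP address are placeholders (repeat with --port 18080 for the Spark history server):

aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 22 --cidr 203.0.113.25/32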

Then, you can try to SSH to the master node using:

ssh -i <private_key.pem> hadoop@<MasterPublicDNS>

You should see a welcome page like this:

-IMAGE-

Create an S3 bucket

Unlike the previous case where we ran on a single computer, we now need to run the application on different nodes. It makes no sense to read the file from a local hard disk because, most of the time, the file will be too big for one node to handle. We need to put the file somewhere all nodes can share, and we can use an S3 object as the input file. S3 is another service AWS provides; the size of a single object on S3 can be as large as 5 TB. I will skip the details here; please look up how to create a new bucket to hold both the input file and the application package. Make the bucket publicly accessible so you will not run into permission issues later on.
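
For reference, creating the bucket and uploading the input file can also be done with the AWS CLI; the bucket name gpu621-demo below is simply the one used in the code example above:

aws s3 mb s3://gpu621-demo
aws s3 cp input.txt s3://gpu621-demo/input.txt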

-IMAGE-

Build the application

1. Change code

In order to run the application on the cloud cluster, we need to make some modifications to the code. First, let's change the file path from a local path to the S3 location.

From

JavaRDD<String> sentences = sc.textFile("src/main/resources/subtitles/input.txt");

To

JavaRDD<String> sentences = sc.textFile("s3://gpu621-demo/input.txt"); 

Also, add the entry point class to the pom file:

-IMAGE-
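
Since the image is not reproduced here, the following is a minimal sketch of that pom change using the maven-jar-plugin; the main class name com.example.WordCount is a placeholder for your actual entry point class:

   <build>
       <plugins>
           <plugin>
               <groupId>org.apache.maven.plugins</groupId>
               <artifactId>maven-jar-plugin</artifactId>
               <configuration>
                   <archive>
                       <manifest>
                           <!-- placeholder: replace with your application's entry point class -->
                           <mainClass>com.example.WordCount</mainClass>
                       </manifest>
                   </archive>
               </configuration>
           </plugin>
       </plugins>
   </build>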

2. Build the package

Build the package using the command line or your IDE. If you are using IntelliJ IDEA, click "package" under the Lifecycle tab of the Maven tool window.
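
From the command line, the same build can be done with:

mvn package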

-IMAGE-

You will get a jar file under the target folder.

Upload the jar file to your S3 bucket

Upload the jar file into the S3 bucket you created before.

Run the application on the cluster

SSH into the master node of the cluster, then copy the jar file from S3 to the master node using:

aws s3 cp <s3://yourbucket/jarFileName.jar> .

Then you can run the app using:

spark-submit <jarFileName.jar>
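
If the main class was not set in the jar's manifest through the pom change above, it can be passed explicitly instead; the class name here is a placeholder:

spark-submit --class com.example.WordCount <jarFileName.jar>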

Then you should see the log info and the output of your application.

-IMAGE-

That's how we deploy a Spark application on AWS EMR.