Deploy Apache Spark Application On AWS
-IMAGE-
===Create an S3 bucket===
Unlike the previous case, where we ran the application on a single computer, we now need to run it on multiple nodes. It makes no sense to read the input file from a local hard disk, because most of the time the file will be too big for one node to handle. We need to put the file somewhere that all nodes can share, and we can use an object in S3 as the input file. S3 is another service AWS provides; a single object in S3 can be as large as 5 TB. I will skip the details here; please search for how to create a new bucket to hold both the input file and the application package. Make the bucket open to the public so you will not run into permission issues later on.

-IMAGE-

===Build the application===

1. Change the code

In order to run on the cloud cluster, we need to make some modifications to the code. First, change the file path from a local location to S3.

From:
 JavaRDD<String> sentences = sc.textFile("src/main/resources/subtitles/input.txt");
To:
 JavaRDD<String> sentences = sc.textFile("s3://gpu621-demo/input.txt");

Also, add the entry point class to the pom file:

-IMAGE-

2. Build the package

Build the package using the command line or the IDE. If you are using IntelliJ IDEA, click "package" under the Lifecycle tab of the Maven panel.

-IMAGE-

You will get a jar file under the target folder.

===Upload the jar file to your S3 bucket===
Upload the jar file into the S3 bucket you created before.

===Run the application on the cluster===
SSH into the master node of the cluster. Then copy the jar file to the master node using:
 aws s3 cp s3://<yourbucket>/<jarFileName>.jar .
Then you can run the app using:
 spark-submit <jarFileName>.jar
You should then see the log info and the output of your application.

-IMAGE-

That's how we deploy a Spark application on AWS EMR.
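The entry-point change to the pom file could look like the following sketch. It assumes the standard Maven JAR plugin is used and the main class is named WordCountApp (a hypothetical name); adjust both to match your project.

```xml
<!-- Sketch: set the jar's Main-Class so spark-submit can launch it
     without a separate class flag. Plugin and class names are assumptions. -->
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-jar-plugin</artifactId>
      <configuration>
        <archive>
          <manifest>
            <mainClass>WordCountApp</mainClass>
          </manifest>
        </archive>
      </configuration>
    </plugin>
  </plugins>
</build>
```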
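To put the "change the code" step in context, here is a minimal sketch of what the complete application class might look like after the path change. The class name and the surrounding Spark setup are assumptions, not taken from the original; only the S3 read line comes from the tutorial.

```java
// Minimal sketch (class name and setup are assumptions, not from the original).
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class WordCountApp {
    public static void main(String[] args) {
        // On EMR, the cluster supplies the master URL via spark-submit,
        // so no master is hard-coded here.
        SparkConf conf = new SparkConf().setAppName("WordCountApp");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Read the shared input from S3 instead of a local path,
            // so every node in the cluster can reach it.
            JavaRDD<String> sentences = sc.textFile("s3://gpu621-demo/input.txt");
            System.out.println("Number of lines: " + sentences.count());
        }
    }
}
```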
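The build, upload, and run steps above can be sketched as one shell sequence. The bucket and jar names are placeholders for illustration only.

```shell
# On your local machine: build the package and upload it to S3
# (bucket and jar names are placeholders).
mvn package
aws s3 cp target/spark-demo-1.0.jar s3://your-bucket/spark-demo-1.0.jar

# On the EMR master node, after SSHing in:
# download the jar from S3 and submit it to the cluster.
aws s3 cp s3://your-bucket/spark-demo-1.0.jar .
spark-submit spark-demo-1.0.jar
```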