Deploy Apache Spark Application On AWS
Click the Create Cluster button.
-IMAGE-
Enter a cluster name and choose a release version. Here I will choose EMR-5.11.1 as the release version. For the applications, you can see that there are many options; we will choose Spark, as this is our main topic.
-IMAGE-
Next, we need to choose an instance type. As you may know, the cluster will run on multiple EC2 instances, and different EC2 instance types have different capabilities. Please note that different EC2 types also have different prices; refer to the EC2 pricing table to check the costs. Here I will choose the c4.large type, as it is one of the least expensive. For the number of instances, I will choose 3, that is, one master node and two core nodes.
For the security part, choose an EC2 key pair you have already used for other services, or create a new one.
Click the Create Cluster button and wait for the cluster to be set up.
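If you prefer the command line, the same cluster can also be created with the AWS CLI. This is a sketch, not the console flow described above; it assumes the default EMR roles already exist in your account, and "My Spark Cluster" and my-key-pair are placeholders you should replace with your own values:

```shell
# Create an EMR 5.11.1 cluster running Spark: 1 master + 2 core nodes.
# "My Spark Cluster" and my-key-pair are placeholder names.
aws emr create-cluster \
  --name "My Spark Cluster" \
  --release-label emr-5.11.1 \
  --applications Name=Spark \
  --instance-type c4.large \
  --instance-count 3 \
  --ec2-attributes KeyName=my-key-pair \
  --use-default-roles
```

The command prints the new cluster's ID, which you can pass to `aws emr describe-cluster` to watch it come up.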
-IMAGE-
You will see a page like this. Next, we need to add an inbound rule to the security group for the master node, which acts like a firewall.
-IMAGE-
We need to open port 22 and port 18080 for your IP so that you can access the master EC2 instance.
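The same inbound rules can be added from the AWS CLI. In this sketch, sg-0123456789abcdef0 stands for the master node's security group ID and 203.0.113.5/32 for your own IP address; both are placeholders:

```shell
# Open SSH (22) and the Spark history server port (18080) to one IP.
# sg-0123456789abcdef0 and 203.0.113.5/32 are placeholder values.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 --protocol tcp --port 22 --cidr 203.0.113.5/32
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 --protocol tcp --port 18080 --cidr 203.0.113.5/32
```

Using a /32 CIDR restricts each rule to a single address rather than opening the port to the whole internet.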
Then, you can try to SSH into the master node using:
ssh -i <private_key.pem> hadoop@<MasterPublicDNS>
You should see a welcome page like this:
-IMAGE-
===Create an S3 bucket===
Unlike the previous case, where we ran on a single computer, we now need to run the application on multiple nodes. It makes no sense to read the input from a local hard disk, because most of the time the file will be too big for one node to handle. We need to put the file somewhere all the nodes can share, and we can use an S3 object as the input file. S3 is another service AWS provides; a single file on S3 can be as large as 5 TB. I will skip the details here, so please look up how to create a new bucket to hold both the input file and the application package. Please make the bucket publicly readable so you will not run into permission issues later on.
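As a quick sketch of the bucket setup with the AWS CLI: my-spark-bucket, input.txt, and my-app.jar below are placeholder names for your bucket, input file, and application package.

```shell
# Create a bucket, then upload the input file and the application jar.
# my-spark-bucket, input.txt, and my-app.jar are placeholder names.
aws s3 mb s3://my-spark-bucket
aws s3 cp input.txt s3://my-spark-bucket/input.txt
aws s3 cp my-app.jar s3://my-spark-bucket/my-app.jar
```

Bucket names are globally unique across all AWS accounts, so you will need to pick one that is not already taken.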