CDOT Wiki β

GPU621/Apache Spark

16:28, 30 November 2020
[[File:Google-cloud-dataproc.png]]
 
# We will use the Google Cloud Platform '''Dataproc''' service to deploy a cluster of 6 virtual machine (VM) nodes (1 master, 5 workers) that is automatically configured for both Hadoop and Spark.
Dataproc is a managed Spark and Hadoop service that automates cluster creation and management. Users can run large-scale data processing jobs with either Spark or Hadoop on the same cluster. The Virtual Machine (VM) nodes in the cluster are created in minutes, with every node pre-configured and pre-installed with Hadoop, Spark, and other tools. Usage is charged by virtual CPU per hour, with standard and higher-performance hardware configurations available at different rates.
 
=== Setting up Dataproc and Google Cloud Storage===
[[File:Googlecloud-setup-6b.jpg]]
 
'''We will create 5 worker nodes and 1 master node using the N1-series General-Purpose machine type with 4 vCPUs, 15 GB of memory, and a disk size of 32-50 GB for all nodes.
The console shows the hourly cost of your machine configuration; machines with more memory, computing power, etc. cost more per hour of use.'''
'''Create a cluster with 1 standard master node and 5 worker nodes'''
* Name your cluster and choose a region and zone
* Select a low-cost machine configuration
** I.e. General Purpose N1 4vCPU, 15 GB memory for all nodes
** 32 GB Standard Persistent Disk
[[File:Googlecloud-dataprocsetup-19.jpg]]
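The same cluster can also be created non-interactively with the <code>gcloud</code> CLI. This is a sketch of the configuration described above; the cluster name, region, and zone are placeholder values to substitute with your own:

```shell
# Create a Dataproc cluster with 1 master and 5 workers using
# n1-standard-4 machines (4 vCPUs, 15 GB memory) and 32 GB boot disks.
# "spark-cluster", the region, and the zone are example values.
gcloud dataproc clusters create spark-cluster \
    --region=us-central1 \
    --zone=us-central1-a \
    --master-machine-type=n1-standard-4 \
    --master-boot-disk-size=32GB \
    --num-workers=5 \
    --worker-machine-type=n1-standard-4 \
    --worker-boot-disk-size=32GB
```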
'''Allow API access to all Google Cloud services in the project.'''
[[File:Googlecloud-setupdataproc-91.jpg]]
'''To view the individual nodes in the cluster, go to Menu -> Virtual Machines -> VM Instances'''
[[File:Googlecloud-setup-11b.jpg]]
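The same node list can be retrieved from Cloud Shell or a local terminal with the <code>gcloud</code> CLI. Dataproc names the nodes after the cluster, so for an example cluster named <code>spark-cluster</code> you would expect entries like the following:

```shell
# List all VM instances in the current project.
# Dataproc nodes appear as <cluster-name>-m (master)
# and <cluster-name>-w-0 ... <cluster-name>-w-4 (workers).
gcloud compute instances list
```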
'''Note:''' Running the job will create the output folder.<br/>For subsequent jobs, '''be sure to delete the output folder''', or Hadoop and Spark will not run.<br/>This limitation exists to prevent existing output from being overwritten.
[[File:Dataproc-hadoop.jpg]]
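One way to remove the previous run's output before resubmitting a job is from the master node's shell or Cloud Shell. The paths below are placeholders for wherever your job actually writes its output:

```shell
# If the job wrote its output to HDFS on the cluster:
hadoop fs -rm -r /user/example/output

# If the job wrote its output to a Cloud Storage bucket instead:
gsutil rm -r gs://example-bucket/output
```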
== Results ==