CDOT Wiki β

GPU621/Apache Spark

16:28, 30 November 2020
[[File:Google-cloud-dataproc.png]]
 
# We will use the Google Cloud Platform '''Dataproc''' service to deploy a cluster of 6 virtual machine (VM) nodes (1 master, 5 workers) that is automatically configured for both Hadoop and Spark.
Dataproc is a managed Spark and Hadoop service that automates cluster creation and management. Users can run large-scale data processing jobs with either Spark or Hadoop on the same cluster. The Virtual Machine (VM) nodes in the cluster are created in minutes, with every node pre-configured and pre-installed with Hadoop, Spark, and other tools. Usage is charged by virtual CPU per hour, with standard and higher-performance hardware configurations available at different rates.
 
=== Setting up Dataproc and Google Cloud Storage===
[[File:Googlecloud-setup-6b.jpg]]
 
'''We will create 5 worker nodes and 1 master node using the N1-series General-Purpose machine type with 4 vCPUs, 15 GB of memory, and a disk size of 32-50 GB for all nodes.
The console shows the hourly cost of your machine configuration; machines with more memory, computing power, etc. cost more per hour of use.'''
'''Create a cluster with 1 standard master node and 5 worker nodes'''
* Name your cluster and choose a region and zone
* Select a low-cost machine configuration
** I.e. General Purpose N1 4vCPU, 15 GB memory for all nodes
** 32 GB Standard Persistent Disk
[[File:Googlecloud-dataprocsetup-19.jpg]]
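The same cluster can also be created non-interactively with the <code>gcloud</code> CLI. This is a sketch of the configuration described above; the cluster name, region, and zone are placeholder values to substitute with your own:

```shell
# Create a Dataproc cluster with 1 master and 5 workers using
# n1-standard-4 machines (4 vCPUs, 15 GB memory) and 32 GB boot disks.
# "spark-cluster", the region, and the zone are example values.
gcloud dataproc clusters create spark-cluster \
    --region=us-central1 \
    --zone=us-central1-a \
    --master-machine-type=n1-standard-4 \
    --master-boot-disk-size=32GB \
    --num-workers=5 \
    --worker-machine-type=n1-standard-4 \
    --worker-boot-disk-size=32GB
```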
'''Allow API access to all Google Cloud services in the project.'''
[[File:Googlecloud-setupdataproc-91.jpg]]
'''To view the individual nodes in the cluster, go to Menu -> Virtual Machines -> VM Instances'''
[[File:Googlecloud-setup-11b.jpg]]
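The same node list can be retrieved from Cloud Shell or a local terminal with the <code>gcloud</code> CLI. Dataproc names the nodes after the cluster, so for an example cluster named <code>spark-cluster</code> you would expect entries like the following:

```shell
# List all VM instances in the current project.
# Dataproc nodes appear as <cluster-name>-m (master)
# and <cluster-name>-w-0 ... <cluster-name>-w-4 (workers).
gcloud compute instances list
```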
'''Note:''' Running the job will create the output folder.<br/>For subsequent jobs, '''be sure to delete the output folder''', or Hadoop and Spark will not run.<br/>This limitation exists to prevent existing output from being overwritten.
[[File:Dataproc-hadoop.jpg]]
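One way to remove the previous run's output before resubmitting a job is from the master node's shell or Cloud Shell. The paths below are placeholders for wherever your job actually writes its output:

```shell
# If the job wrote its output to HDFS on the cluster:
hadoop fs -rm -r /user/example/output

# If the job wrote its output to a Cloud Storage bucket instead:
gsutil rm -r gs://example-bucket/output
```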
== Results ==