GPU621/Apache Spark Fall 2022

==Group 3 Information==
*Alan Huang
*Jianchang Yu
*Tim Lin
 
==Apache Spark Introduction==
 
[[file: Spark_2022.png|600px]]
 
Apache Spark is an open source cluster computing framework pioneered by Matei Zaharia at the University of California, Berkeley's AMPLab in 2009 and open-sourced in 2010 under the BSD license. Spark uses in-memory computing technology to analyze data in memory instead of repeatedly writing intermediate results to disk. Spark allows users to load data into cluster memory and query it multiple times, making it ideal for machine learning algorithms.
 
==Spark features==
Spark's main features include:
*Scalability: a Spark cluster can scale to over 8,000 nodes.
*Spark Streaming: scalable, high-throughput, fault-tolerant processing of live data streams.
*Spark SQL: support for structured and relational query processing with SQL.
*MLlib: a high-level library of machine learning algorithms.
*GraphX: graph processing algorithms.
 
==Spark Ecosystem==
 
[[file: Spark_component_2022.png|800px]]
 
 
===1. Spark Core===
 
Spark Core is the project's foundation, providing distributed task scheduling and basic I/O functionality. The underlying programming abstraction is called the Resilient Distributed Dataset (RDD), a collection of data that can be operated on in parallel with fault tolerance. The RDD abstraction is exposed through language-integrated APIs in Scala, Java, and Python, simplifying programming complexity and allowing applications to manipulate RDDs much as they would manipulate native collections.
 
===2. Spark SQL===
Spark SQL brings a data abstraction called SchemaRDD (known as DataFrame/Dataset in Spark 2.x) to Spark Core to provide support for structured and semi-structured data. Spark SQL provides a domain-specific language for manipulating these datasets in Scala, Java, or Python. It also supports running SQL directly, through the command line interface and an ODBC/JDBC server.
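A minimal sketch of this with the Spark 2.x Java API (the spark-sql artifact on the classpath and the people.json input path are assumptions):

 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
 import org.apache.spark.sql.SparkSession;
 
 public class SparkSqlSketch {
     public static void main(String[] args) {
         SparkSession spark = SparkSession.builder()
                 .appName("SparkSqlSketch")
                 .master("local[*]")
                 .getOrCreate();
 
         // load semi-structured JSON into a Dataset<Row> (the DataFrame abstraction)
         Dataset<Row> people = spark.read().json("people.json");
 
         // register a temporary view and query it with plain SQL
         people.createOrReplaceTempView("people");
         spark.sql("SELECT name FROM people WHERE age >= 18").show();
 
         spark.stop();
     }
 }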
 
===3. Spark Streaming===
 
Spark Streaming takes advantage of Spark Core's fast scheduling capabilities to perform streaming analysis. It ingests data in small batches (micro-batches) and performs RDD transformations on them. This design allows streaming analysis to use the same application code written for batch analysis, within the same engine.
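A minimal sketch of the micro-batch model with the Java streaming API (the spark-streaming artifact and the socket source on localhost:9999, e.g. fed by nc -lk 9999, are assumptions):

 import org.apache.spark.SparkConf;
 import org.apache.spark.streaming.Durations;
 import org.apache.spark.streaming.api.java.JavaDStream;
 import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
 import org.apache.spark.streaming.api.java.JavaStreamingContext;
 
 public class StreamingSketch {
     public static void main(String[] args) throws InterruptedException {
         SparkConf conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]");
         // each micro-batch covers one second of incoming data
         JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
 
         JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
         JavaDStream<Long> counts = lines.count(); // RDD-style transformation per micro-batch
         counts.print();
 
         jssc.start();            // start receiving and processing
         jssc.awaitTermination(); // run until stopped
     }
 }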
 
===4. MLlib===
MLlib is a distributed machine learning framework on Spark.
 
===5. GraphX===
GraphX is a distributed graph processing framework on Spark. It provides a set of APIs for expressing graph computations and can emulate the Pregel abstraction. GraphX also provides an optimized runtime for this abstraction.
 
==Spark Application==
===1. Iterative operations and repeated operations on specific data sets===
Spark is built on a memory-based iterative computing framework: because intermediate results are cached in memory, data does not have to be re-read from disk as the number of iterations grows. In cases where iterative operations are applied, or a specific data set needs to be operated on multiple times, Spark is very effective.
===2. Real-time calculation===
With the micro-batch processing capability of Spark Streaming, Spark achieves high throughput in real-time statistical analysis calculations.
===3. Batch data processing===
Spark has the ability to process massive amounts of data.
*With the high throughput of Spark Streaming, Spark can also process massive data in near real time.
 
==Spark cluster mode==
===1. Local mode===
Used for local development and testing; it is usually divided into local single-threaded and local-cluster multi-threaded modes.
===2. Standalone cluster mode===
Runs on the Standalone cluster manager: Standalone is responsible for resource management, while Spark is responsible for task scheduling and computation.
===3. Hadoop YARN cluster mode===
Runs on the Hadoop YARN cluster manager: YARN is responsible for resource management, while Spark is responsible for task scheduling and computation.
===4. Apache Mesos cluster mode===
Runs on the Apache Mesos cluster manager: Mesos is responsible for resource management, while Spark is responsible for task scheduling and computation.
===5. Kubernetes cluster mode===
Runs on the Kubernetes cluster manager: Kubernetes is responsible for resource management, while Spark is responsible for task scheduling and computation.
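In practice the cluster manager is selected with the --master option of spark-submit; a sketch (host names and ports below are placeholders):

 spark-submit --master local[*] app.jar                              # local mode
 spark-submit --master spark://<master-host>:7077 app.jar            # standalone
 spark-submit --master yarn --deploy-mode cluster app.jar            # Hadoop YARN
 spark-submit --master mesos://<mesos-master>:5050 app.jar           # Apache Mesos
 spark-submit --master k8s://https://<k8s-apiserver>:6443 app.jar    # Kubernetes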
 
==Spark Implementation==
===Work Flow Chart===
[[File:Cluster-overview.png]]
===The implementation of Spark requires the following Components===
====1.Driver Program (SparkContext)====
SparkContext is the main entry point for all Spark functions.
====2.Cluster Manager====
The cluster manager is used for resource management of applications.
====3. Worker node====
Worker nodes are used to submit tasks to executors and to report executor status, CPU, and memory information to the cluster manager.
====4. Executor====
The component that performs computational tasks. An executor is a process responsible for running tasks, storing data, and returning result data.
 
===The implementation of Spark has the following steps===
*1. The SparkContext applies for computing resources from the Cluster Manager.
*2. The Cluster Manager receives the request and starts allocating resources. (It creates and activates executors on the worker nodes.)
*3. The SparkContext sends the application code (a jar package or Python file, etc.) and tasks to the executors. The executors run the tasks and save/return the results.
*4. The SparkContext will collect the results.
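To make the flow concrete, here is a minimal sketch of a driver program in Java (the class name and data are made up; the cluster manager is selected when the jar is launched with spark-submit):

 import java.util.Arrays;
 import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;
 
 public class MinimalDriver {
     public static void main(String[] args) {
         // 1. The driver creates the SparkContext, which contacts the cluster manager
         SparkConf conf = new SparkConf().setAppName("MinimalDriver");
         JavaSparkContext sc = new JavaSparkContext(conf);
 
         // 3./4. RDD operations are shipped as tasks to the executors,
         // and the action brings the result back to the driver
         JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
         System.out.println("count = " + nums.count());
 
         // release the executors allocated in step 2
         sc.stop();
     }
 }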
==Apache Spark Core API==
===RDD Overview===
[[file:RDD_Spark.jpg|800px]]

One of the most important concepts in Spark is the resilient distributed dataset (RDD). An RDD is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file, or with an existing Java collection in the driver program. Spark is normally used to handle huge amounts of data, and transforming RDDs is what makes it possible for Spark to split the input data across different nodes. RDDs also provide useful APIs for the programmer to call. We will introduce some key APIs provided by Spark Core 2.2.1 using Java 8. You can find more information about RDDs here: https://spark.apache.org/docs/2.2.1/rdd-programming-guide.html
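For example, an RDD can be created from a collection or from a file (a sketch that assumes a JavaSparkContext named sc created as in the driver sketch above; the input.txt path is hypothetical):

 // from an existing Java collection in the driver program
 JavaRDD<Integer> fromCollection = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
 
 // from a file: each line of input.txt becomes one element of the RDD
 JavaRDD<String> fromFile = sc.textFile("input.txt");
 
 // the data is split into partitions that can be processed on different nodes
 System.out.println(fromCollection.getNumPartitions());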
===Spark Library Installation Using Maven===
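A minimal pom.xml dependency block matching the Spark Core 2.2.1 / Java 8 combination used in this page (the Scala 2.11 build is an assumption; spark-sql and spark-streaming are only needed for the corresponding examples):

 <dependencies>
     <dependency>
         <groupId>org.apache.spark</groupId>
         <artifactId>spark-core_2.11</artifactId>
         <version>2.2.1</version>
     </dependency>
     <!-- optional, only for the Spark SQL and Spark Streaming examples -->
     <dependency>
         <groupId>org.apache.spark</groupId>
         <artifactId>spark-sql_2.11</artifactId>
         <version>2.2.1</version>
     </dependency>
     <dependency>
         <groupId>org.apache.spark</groupId>
         <artifactId>spark-streaming_2.11</artifactId>
         <version>2.2.1</version>
     </dependency>
 </dependencies>

When the application is deployed to a cluster that already ships Spark (such as Amazon EMR below), these dependencies are commonly marked with <scope>provided</scope> so the jar stays small.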
===RDD APIs===
 
[[file: Actions_RDD.png|800px]]
Basically, there are two types of RDD APIs: transformations and actions. Transformation APIs are functions that return another RDD; using them, we can create a child RDD from a parent RDD. Actions are the functions we want to perform on the actual dataset; they do not return new RDDs.
When an action is triggered, no new RDD is formed (unlike a transformation). Thus, actions are Spark RDD operations that give non-RDD values. The values produced by an action are returned to the driver or stored in an external storage system. Actions are what set the laziness of RDDs into motion.
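A small sketch of this laziness (same imports as the driver sketch above; the values are made up):

 JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("LazyDemo").setMaster("local[*]"));
 JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
 
 // transformation: returns a child RDD, nothing is computed yet
 JavaRDD<Integer> doubled = nums.map(x -> x * 2);
 
 // action: triggers the computation and returns a plain value to the driver
 long howMany = doubled.count();
 System.out.println(howMany); // 5
 sc.stop();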
====reduce(func)====
Aggregates the elements of the dataset using a function func, which takes two arguments and returns one.
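A minimal sketch (same imports as above; the data values are made up):

 JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("ReduceDemo").setMaster("local[*]"));
 JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
 
 // func takes two elements and returns one; here it sums the dataset
 Integer result = nums.reduce((a, b) -> a + b);
 System.out.println(result); // 15
 sc.stop();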
====count()====
count() returns the number of elements in the RDD.
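For example (same assumptions as above):

 JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("CountDemo").setMaster("local[*]"));
 JavaRDD<String> words = sc.parallelize(Arrays.asList("a", "b", "c"));
 long n = words.count(); // 3, returned to the driver as a plain long
 sc.stop();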
====take(n)====
The take(n) action returns n elements from the RDD. It tries to minimize the number of partitions it accesses, so the result may be a biased collection, and we cannot presume the order of the elements.
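For example (same assumptions as above; java.util.List is also needed):

 JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("TakeDemo").setMaster("local[*]"));
 JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(5, 1, 4, 2, 3));
 List<Integer> firstThree = nums.take(3); // order across partitions is not guaranteed
 System.out.println(firstThree);
 sc.stop();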
====collect()====
The collect() action is the most common and simplest operation; it returns the entire content of the RDD to the driver program, so it should only be used when the result is small enough to fit in the driver's memory.
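For example (same assumptions as above):

 JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("CollectDemo").setMaster("local[*]"));
 JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3));
 List<Integer> all = nums.collect(); // the whole RDD comes back to the driver
 System.out.println(all); // [1, 2, 3]
 sc.stop();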
====foreach()====
When we want to apply an operation to each element of an RDD without returning a value to the driver, the foreach() action is useful. The word-count snippet below builds a pair RDD; the sketch after it finishes the count and prints each result with foreach().
 //removeBlankLineRDD is assumed to be a JavaRDD<String> of non-empty lines read from the input file
 //map each line to individual words
 JavaRDD<String> wordsRDD = removeBlankLineRDD.flatMap(sentence -> Arrays.asList(sentence.split(" ")).iterator());
 
 //create a pair RDD of (word, 1) pairs so the counts can be aggregated per word
 JavaPairRDD<String, Long> pairRDD = wordsRDD.mapToPair(word -> new Tuple2<>(word, 1L));
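The snippet above stops before any action runs. One way to finish the word count is with reduceByKey() (a transformation) followed by the foreach() action described earlier; note that the printed output goes to the executors' stdout, not back to the driver (a sketch, assuming scala.Tuple2 is imported):

 // sum the 1s for each word
 JavaPairRDD<String, Long> countsRDD = pairRDD.reduceByKey((a, b) -> a + b);
 
 // foreach(): apply an operation to every element without returning a value to the driver
 countsRDD.foreach(pair -> System.out.println(pair._1() + ": " + pair._2()));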
Package the application into a jar (for example, with mvn package):

[[File: build package.png | 800px]]
You will get a jar file under the target folder.
Upload the jar file to an S3 bucket, then SSH into the master node of the cluster and copy the jar file to it using:
aws s3 cp <s3://yourbucket/jarFileName.jar> .
Then you can run the app using:
spark-submit <jarFileName.jar>

[[File: run jar.png | 800px]]

You should see the log info and the output of your application.

[[File: output spark.png | 800px]]

That is how we deploy a Spark app on AWS EMR.

===Check cluster status===
Spark provides a simple dashboard to check the status of the cluster. Visit <your_cluster_master_DNS>:18080 and you will see the dashboard.

[[File: Dashboard spark.png | 800px]]

Click the application id to see more details, such as the job descriptions.

[[File: Spark jobs.png | 800px]]

Or the stage descriptions.

[[File: Spark stages.png | 800px]]

===Conclusion===
With Amazon EMR you can set up a cluster to process and analyze data with big data frameworks in just a few minutes. You can install Spark on an Amazon EMR cluster along with other Hadoop applications, and it can also leverage the EMR file system (EMRFS) to directly access data in Amazon S3.

==References==
*https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html
*https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-gs.html
*https://www.databricks.com/glossary/what-is-rdd
*https://www.oreilly.com/library/view/apache-spark-2x/9781787126497/d0ae45f4-e8a1-4ea7-8036-606b7e27ddfd.xhtml
*https://data-flair.training/blogs/spark-rdd-operations-transformations-actions/
*https://www.databricks.com/glossary/what-is-spark-streaming
*https://spark.apache.org/docs/latest/streaming-programming-guide.html
*https://hevodata.com/learn/spark-batch-processing/
*https://spark.apache.org/docs/latest/cluster-overview.html