GPU621/Apache Spark Fall 2022

1,341 bytes added, 14:44, 7 December 2022
==Group 3 Information==
Alan Huang;
Jianchang Yu;
==Apache Spark Introduction==
 
[[file: Spark_2022.png|600px]]
Apache Spark is an open-source cluster computing framework pioneered by Matei Zaharia at the University of California, Berkeley's AMPLab in 2009 and released as open source in 2010 under the BSD license. Spark uses in-memory computing technology, which lets it analyze data in memory before that data is written to the hard disk. Spark allows users to load data into cluster memory and query it multiple times, making it well suited to machine learning algorithms.
 
==Spark features==
==Spark Ecosystem==
 
[[file: Spark_component_2022.png|800px]]
 
===1. Spark Core===
Spark Core is the project's foundation, providing distributed task dispatching, scheduling, and basic I/O functionality. The underlying program abstraction is called the Resilient Distributed Dataset, or RDD, which is a fault-tolerant collection of data that can be manipulated in parallel. The RDD abstraction is exposed through language-integrated APIs in Scala, Java, and Python, which reduces programming complexity and lets applications manipulate RDDs in much the same way as they manipulate local collections.
===2. Spark SQL===
===4. MLlib===
MLlib is a distributed machine learning framework on Spark. Thanks to Spark's distributed memory-based architecture, it is reported to be up to 10 times faster than the disk-based Apache Mahout on Hadoop, and to scale better than Vowpal Wabbit.
===5. GraphX===
==Spark Application==
 
===1. The iterative operations and the multiple operations of the specific data sets ===
Spark is built on a memory-based iterative computing framework, so its advantage grows as the number of iterations, and with it the amount of data that is read repeatedly, increases. Spark is therefore very effective where iterative operations are applied or where a specific data set needs to be operated on multiple times.
===Work Flow Chart===
[[File:Cluster-overview.png]]
===The implementation of Spark requires the following Components===
====1.Driver Program (SparkContext)====
SparkContext is the main entry point for all Spark functions.
====2.Cluster Manager====
The cluster manager is used for resource management of applications.
====3. Worker node====
Worker nodes submit tasks to executors and report executor status, CPU, and memory information to the cluster manager.
====4. Executor====
The executor is the component that performs computational tasks. It is a process responsible for running tasks, saving data, and returning result data.
===The implementation of Spark has the following steps===
*1. The SparkContext applies for computing resources from the Cluster Manager.
*2. The Cluster Manager receives the request and starts allocating the resources. (It creates and activates the executor on the worker node.)
*3. The SparkContext sends the program/application code (jar package, Python file, etc.) and the tasks to the Executor. The Executor executes the tasks and saves/returns the results.
*4. The SparkContext collects the results.
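The four steps above can be pictured with a small toy model in plain Java 8. This is only a sketch and does not use the Spark API: the class <code>DriverFlowSketch</code> and the method <code>runJob</code> are hypothetical names, a thread pool stands in for the executors that the Cluster Manager would allocate, and summing each partition stands in for the task.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy model of the driver/executor flow. NOT the Spark API; all names are illustrative.
public class DriverFlowSketch {

    // Plays the role of the driver: obtains executors, ships tasks, collects results.
    public static List<Integer> runJob(List<List<Integer>> partitions, int executorCount)
            throws Exception {
        // Steps 1-2: request resources; the thread pool stands in for allocated executors.
        ExecutorService executors = Executors.newFixedThreadPool(executorCount);
        try {
            // Step 3: send one task per partition to the executors.
            List<Future<Integer>> pending = new ArrayList<>();
            for (List<Integer> partition : partitions) {
                Callable<Integer> task =
                        () -> partition.stream().mapToInt(Integer::intValue).sum();
                pending.add(executors.submit(task));
            }
            // Step 4: the driver collects the results, in submission order.
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : pending) {
                results.add(future.get());
            }
            return results;
        } finally {
            executors.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<Integer>> partitions = Arrays.asList(
                Arrays.asList(1, 2, 3), Arrays.asList(4, 5), Arrays.asList(6));
        System.out.println(runJob(partitions, 2)); // prints [6, 9, 6]
    }
}
```

In a real cluster the executors are separate processes on worker nodes and the code is shipped over the network; the sketch only mirrors the order of the steps.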
==Apache Spark Core API==
===RDD Overview===
[[file:RDD_Spark.jpg|800px]]

One of the most important concepts in Spark is the resilient distributed dataset (RDD). An RDD is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file, or with an existing Java collection in the driver program. Spark is normally used to handle huge data sets, and turning the input into RDDs is what makes it possible for Spark to split the input data across different nodes. RDDs also provide useful APIs for the programmer to call. We will introduce some key APIs provided by Spark Core 2.2.1 using Java 8. You can find more information about RDDs here: https://spark.apache.org/docs/2.2.1/rdd-programming-guide.html
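To make the idea of a partitioned collection concrete, here is a minimal sketch in plain Java 8. It does not use Spark: <code>PartitionSketch</code>, <code>partition</code> and <code>parallelSumOfSquares</code> are hypothetical names, and a parallel stream over the slices stands in for nodes working on their own partitions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration of "a collection split into partitions and processed in parallel".
// NOT the Spark API; names are made up for this example.
public class PartitionSketch {

    // Split the data into numPartitions contiguous slices of near-equal size.
    public static <T> List<List<T>> partition(List<T> data, int numPartitions) {
        List<List<T>> parts = new ArrayList<>();
        int size = data.size();
        for (int i = 0; i < numPartitions; i++) {
            int from = (int) ((long) i * size / numPartitions);
            int to = (int) ((long) (i + 1) * size / numPartitions);
            parts.add(new ArrayList<>(data.subList(from, to)));
        }
        return parts;
    }

    // Each partition is processed independently; the partial results are then combined.
    public static int parallelSumOfSquares(List<Integer> data, int numPartitions) {
        return partition(data, numPartitions)
                .parallelStream()
                .mapToInt(part -> part.stream().mapToInt(x -> x * x).sum())
                .sum();
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(partition(data, 2));            // prints [[1, 2], [3, 4, 5]]
        System.out.println(parallelSumOfSquares(data, 2)); // prints 55
    }
}
```

Because each slice is processed without looking at the others, the work can in principle be placed on different machines, which is the property the RDD abstraction relies on.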
===Spark Library Installation Using Maven===
===RDD APIs===
 
[[file: Actions_RDD.png|800px]]
There are two types of RDD APIs: transformations and actions. Transformations are functions that return another RDD; using them, we can create a child RDD from a parent RDD. Actions are the functions that we want to perform on the actual dataset; they do not return new RDDs.
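The split between the two kinds of API can be illustrated with an analogy from the Java 8 standard library. This sketch does not use Spark: a <code>Stream</code> stands in for an RDD, an intermediate operation such as <code>map</code> plays the role of a transformation (it returns another stream), and a terminal operation such as <code>collect</code> plays the role of an action (it returns a plain value). The class and method names are hypothetical. The call counter shows that, in this analogy, the mapping function does not run until the terminal operation is invoked.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Analogy only: java.util.stream, NOT the Spark RDD API.
public class TransformationVsActionSketch {

    // Counts how many times the mapping function has actually been executed.
    public static final AtomicInteger CALLS = new AtomicInteger();

    // "Transformation": builds a child stream from the parent data and returns it.
    public static Stream<Integer> doubled(List<Integer> parent) {
        return parent.stream().map(x -> {
            CALLS.incrementAndGet();
            return x * 2;
        });
    }

    // "Action": consumes the stream and returns a concrete result, not another stream.
    public static List<Integer> collect(Stream<Integer> child) {
        return child.collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> parent = Arrays.asList(1, 2, 3);
        Stream<Integer> child = doubled(parent);
        System.out.println(CALLS.get());   // prints 0: nothing has been computed yet
        System.out.println(collect(child)); // prints [2, 4, 6]
        System.out.println(CALLS.get());   // prints 3: the work ran during the "action"
    }
}
```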
https://data-flair.training/blogs/spark-rdd-operations-transformations-actions/
 
https://www.databricks.com/glossary/what-is-spark-streaming#:~:text=Spark%20Streaming%20is%20an%20extension,%2C%20databases%2C%20and%20live%20dashboards
 
https://spark.apache.org/docs/latest/streaming-programming-guide.html
https://hevodata.com/learn/spark-batch-processing/
https://spark.apache.org/docs/latest/cluster-overview.html