Changes

GPU621/Apache Spark Fall 2022

2,584 bytes added, 13:50, 6 December 2022

→‎Group Information

Jianchang Yu;

Tim Lin;

==Apache Spark Introduction==

Apache Spark is an open-source cluster computing framework. It was started by Matei Zaharia at the University of California, Berkeley's AMPLab in 2009 and open-sourced in 2010 under the BSD license. Spark uses in-memory computing, which lets it analyze data in memory before that data is written to disk. Users can load data into cluster memory and query it repeatedly, which makes Spark well suited to iterative workloads such as machine learning algorithms.

==Spark Features==

Spark can scale to clusters of more than 8,000 nodes. Spark Streaming provides scalable, high-throughput, fault-tolerant processing of live data streams. Spark SQL supports structured data and relational (SQL) query processing. MLlib is a library of machine learning algorithms, and GraphX provides graph processing algorithms.

==Spark Ecosystem==

===1. Spark Core===

Spark Core is the foundation of the project. It provides distributed task dispatching, scheduling, and basic I/O functionality. Its fundamental programming abstraction is the Resilient Distributed Dataset (RDD), a fault-tolerant collection of data, partitioned across the cluster, that can be operated on in parallel. RDDs are exposed through language-integrated APIs in Scala, Java, and Python, which reduces programming complexity: applications manipulate RDDs in much the same way they manipulate local collections.
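To make the RDD idea more concrete, here is a minimal single-process sketch in plain Python. It is not Spark's API: the `ToyRDD` class and its methods are hypothetical names chosen for illustration. It assumes only what the paragraph above describes, namely data split into partitions, transformations that are recorded rather than run immediately, and the ability to rebuild any one partition by replaying those recorded steps.

```python
class ToyRDD:
    """Hypothetical single-process stand-in for an RDD (not Spark's API)."""

    def __init__(self, partitions, lineage=()):
        self._partitions = partitions  # source data, split into chunks
        self._lineage = lineage        # recorded transformations, not yet applied

    @classmethod
    def parallelize(cls, data, num_partitions):
        data = list(data)
        # Ceiling division so every element lands in some partition.
        size = max(1, -(-len(data) // num_partitions))
        return cls([data[i:i + size] for i in range(0, len(data), size)])

    def map(self, fn):
        # Lazy: only remember the step; nothing is computed here.
        return ToyRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._partitions, self._lineage + (("filter", predicate),))

    def compute_partition(self, index):
        # Replay the lineage on one partition. A lost partition could be
        # rebuilt the same way, without touching the other partitions.
        rows = iter(self._partitions[index])
        for kind, fn in self._lineage:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)

    def collect(self):
        # Action: triggers the actual computation over all partitions.
        result = []
        for index in range(len(self._partitions)):
            result.extend(self.compute_partition(index))
        return result


numbers = ToyRDD.parallelize(range(1, 11), num_partitions=3)
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
result = even_squares.collect()
```

In this sketch `map` and `filter` return immediately, and work happens only when `collect` or `compute_partition` is called. A real cluster would run the partitions on different machines in parallel; the loop in `collect` is a sequential simplification.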

===2. Spark SQL===

Spark SQL adds a data abstraction called SchemaRDD (renamed DataFrame in later Spark releases) on top of Spark Core to support structured and semi-structured data. Spark SQL provides a domain-specific language for manipulating SchemaRDDs in Scala, Java, or Python. It also supports SQL queries through a command-line interface and an ODBC/JDBC server.

===3. Spark Streaming===

Spark Streaming uses Spark Core's fast scheduling to perform streaming analytics. It ingests data in small batches and applies RDD transformations to each batch. With this design, application code written for batch analysis can be reused for streaming analysis on the same engine.
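The mini-batch idea can be sketched in plain Python. This is an illustration, not Spark Streaming's API: `word_counts` and `micro_batches` are hypothetical helpers, and the sketch cuts batches by record count for simplicity, whereas a real streaming engine would typically cut them by time interval. The point it tries to show is that one batch function serves both the whole dataset and each small slice of a stream.

```python
from itertools import islice


def word_counts(lines):
    """Batch logic: count words in any iterable of text lines."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def micro_batches(stream, batch_size):
    """Cut a (possibly unbounded) iterator into small lists of records."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Assumed sample input standing in for a live stream.
incoming = ["spark streaming", "spark core", "streaming data", "spark"]

# The same function runs once per mini-batch ...
per_batch = [word_counts(batch) for batch in micro_batches(incoming, batch_size=2)]
# ... and unchanged over the full dataset as an ordinary batch job.
whole = word_counts(incoming)
```

Because `micro_batches` is a generator, it would also work on an iterator that never ends; each batch is processed as soon as it fills.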

===4. MLlib===

MLlib is a distributed machine learning framework on top of Spark. Because of Spark's distributed, memory-based architecture, it is reported to be up to about 10 times faster than the disk-based Apache Mahout on Hadoop, and to scale better than Vowpal Wabbit.

===5. GraphX===

GraphX is a distributed graph processing framework on top of Spark. It provides an API for expressing graph computations that can model the Pregel abstraction, and it provides an optimized runtime for that abstraction.
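For readers unfamiliar with Pregel, the following plain-Python sketch shows the general shape of the model: computation proceeds in supersteps, each vertex reads its incoming messages, updates its own value, and sends messages along its outgoing edges, and the run stops when no messages remain. It is not the GraphX API; `pregel_shortest_paths` is a hypothetical function, and the example (single-source shortest paths) assumes non-negative edge weights.

```python
import math


def pregel_shortest_paths(edges, source):
    """Return the shortest distance from source to every vertex.

    edges maps a vertex to a list of (neighbor, weight) pairs.
    Weights are assumed to be non-negative.
    """
    vertices = set(edges)
    for targets in edges.values():
        vertices.update(neighbor for neighbor, _ in targets)

    distance = {vertex: math.inf for vertex in vertices}
    inbox = {source: [0.0]}  # messages delivered at the start of a superstep

    while inbox:  # halt when no vertex receives a message
        outbox = {}
        for vertex, messages in inbox.items():
            best = min(messages)
            # Vertex program: keep the smallest distance seen so far and
            # notify neighbors only when the value improved.
            if best < distance[vertex]:
                distance[vertex] = best
                for neighbor, weight in edges.get(vertex, []):
                    outbox.setdefault(neighbor, []).append(best + weight)
        inbox = outbox  # messages become visible in the next superstep

    return distance


# Assumed sample graph for illustration.
graph = {
    "a": [("b", 1.0), ("c", 4.0)],
    "b": [("c", 2.0)],
    "c": [("d", 1.0)],
}
distances = pregel_shortest_paths(graph, "a")
```

In a distributed runtime the inner loop over vertices would run in parallel across partitions of the graph; here it is sequential to keep the sketch short.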

==Apache Spark Core API==

RobinYu

92

edits

CDOT Wiki β
