GPU621/Apache Spark
Group Members
- Akhil Balachandran
- Daniel Park
Apache Spark vs Apache Hadoop
MapReduce was famously used by Google to process massive data sets in parallel on a distributed cluster in order to index the web for accurate and efficient search results. Apache Hadoop, the open-source platform inspired by Google’s early proprietary technology, has been one of the most popular big data processing frameworks. In recent years, however, its usage has declined in favor of other increasingly popular technologies, namely Apache Spark.
This project will demonstrate how a particular use case performs in Apache Hadoop versus Apache Spark, and how that performance relates to the rising adoption of Spark and the waning adoption of Hadoop. It will also compare the advantages of each framework for certain big data applications.
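As a rough illustration of the MapReduce model mentioned in the description above, the following is a minimal single-process sketch in plain Python. It is not Hadoop or Spark code. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are hypothetical and chosen only for illustration. In a real cluster the map and reduce steps would run in parallel on many machines, and the framework would handle the grouping step between them.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(documents):
    """Run map, shuffle, and reduce sequentially over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    # Hypothetical sample input; each string stands in for one input split.
    docs = ["Spark and Hadoop", "Hadoop MapReduce", "Spark"]
    print(word_count(docs))
```

Because each call to `map_phase` depends only on its own document, the map step could be distributed across workers without coordination. That independence is what lets the model scale out on a cluster.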
Introduction
Apache Hadoop
What is Apache Hadoop?
Applications
Apache Spark
Apache Spark is a unified analytics engine for large-scale data processing. It is an open-source, general-purpose cluster-computing framework that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Since its inception, Spark has become one of the most widely used distributed processing frameworks for big data. It can be deployed in a variety of ways, provides high-level APIs in the Java, Scala, Python, and R programming languages, and supports SQL, streaming data, machine learning, and graph processing.
Architecture
Applications
Overview: Spark vs Hadoop
Advantages and Disadvantages
Parallelism
Performance
Analysis: Spark vs Hadoop
Methodology
Setup
Results
Conclusion
Progress
- Nov 9, 2020 - Added project description
- Nov 20, 2020 - Added outline and subsections