CDOT Wiki β

GPU621/Apache Spark

851 bytes added, 14:58, 30 November 2020
* In the travel industry. For example, TripAdvisor uses Spark to plan trips and provide personalized customer recommendations.
= Comparison: Spark vs Hadoop MapReduce =
=== Performance ===
Spark processes data in RAM, while Hadoop MapReduce writes data back to disk after each map or reduce step. Spark has been reported to run up to '''100 times faster in memory''' and up to '''10 times faster on disk'''. Spark also won the 2014 Gray Sort Benchmark, sorting 100 TB of data in 23 minutes on 206 machines and beating the previous world record of 72 minutes, which was set by a Hadoop MapReduce cluster of 2100 nodes.
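The benchmark figures above can be put side by side with a little arithmetic. The sketch below uses only the numbers quoted in this section; "machine-minutes" is an illustrative measure introduced here, not part of the benchmark, and the comparison assumes that the two runs sorted data sets of comparable size on comparable hardware.

```python
# Figures quoted in this section for the 2014 Gray Sort Benchmark.
SPARK_MACHINES = 206
SPARK_MINUTES = 23
HADOOP_NODES = 2100
HADOOP_MINUTES = 72

# How much sooner the Spark run finished (wall-clock time).
time_speedup = HADOOP_MINUTES / SPARK_MINUTES

# How many times fewer machines the Spark run used.
machine_ratio = HADOOP_NODES / SPARK_MACHINES

# Machine-minutes: machines multiplied by elapsed minutes.
# This is a rough, assumed measure of total cluster time consumed.
spark_machine_minutes = SPARK_MACHINES * SPARK_MINUTES
hadoop_machine_minutes = HADOOP_NODES * HADOOP_MINUTES
resource_ratio = hadoop_machine_minutes / spark_machine_minutes

print(f"Wall-clock speedup:      {time_speedup:.2f}x")
print(f"Fewer machines:          {machine_ratio:.2f}x")
print(f"Fewer machine-minutes:   {resource_ratio:.2f}x")
```

Under these assumptions the Spark run finished roughly 3 times sooner while using roughly 10 times fewer machines.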
=== Ease of Use ===
Spark is easier to program and includes an interactive mode. It provides pre-built APIs for Java, Scala, and Python. Hadoop MapReduce is harder to program, but there are tools available to make it easier.
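To give a feel for what a MapReduce-style program has to spell out, here is a minimal, framework-free sketch of a word count in plain Python. It does not use the Spark or Hadoop APIs; the function names (<code>map_phase</code>, <code>shuffle_phase</code>, <code>reduce_phase</code>) and the sample input are hypothetical and only mirror the map, shuffle, and reduce stages of the model.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle_phase(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines):
    """Run the three stages in order on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical sample input, for illustration only.
sample = ["spark is fast", "hadoop is reliable", "spark is easy"]
counts = word_count(sample)
print(counts)
```

In Spark, a comparable pipeline is typically written as a short chain of transformations such as <code>flatMap</code>, <code>map</code>, and <code>reduceByKey</code>, and it can be tried line by line in the interactive shell.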
=== Cost ===
According to benchmarks, Spark is more cost-effective as it requires less hardware to perform the same tasks faster.
=== Compatibility ===
Spark can run as a standalone application or on top of Hadoop YARN or Apache Mesos. Spark supports data sources that implement the Hadoop input format, so it can integrate with all the same data sources and file formats that Hadoop supports.
=== Data Processing ===
In addition to plain data processing, Spark can process graphs, and it includes the MLlib machine learning library. Thanks to its high performance, Spark can do both real-time and batch processing, whereas Hadoop MapReduce is well suited only to batch processing.
=== Fault Tolerance ===
Both support per-task retries and speculative execution. However, because Hadoop MapReduce persists intermediate results to disk, it is slightly more fault tolerant than Spark.
=== Security ===
Both Spark and Hadoop support Kerberos authentication, but Hadoop has more fine-grained security controls for HDFS.
== Spark vs Hadoop Wordcount Performance ==