Open main menu

CDOT Wiki β

Changes

GPU621/Apache Spark

477 bytes added, 16:44, 30 November 2020
m
Conclusion
[[File:Googlecloud-hdfs.jpg|thumb|upright=2|right|alt=Spark cluster|5.1 Spark vs Hadoop MapReduce]]
Using the same hardware (RAM, CPUs, HDD) across a 6 node cluster and processing the same data (8 .txt files for total size of 7.77 GB) we can only see an approximately 12% performance improvement between Hadoop Mapreduce and Spark using a word count algorithm. This falls far short of the 10 times faster on disk and 100 times faster in-memory.
 
Spark does require more memory than Hadoop to cache data in memory, however that should not be a limitation in this case as the worker nodes have 15 GB of memory and none of the input files exceed 2GB. This is more than enough space for Spark to store input data in memory in the resilient distributed datasets (RDDs). It is worth noting that Spark performs best when iterating over the same data many times, while MapReduce was designed for single pass jobs. Furthermore these
Further testing and analyzing Spark internal data could be done to determine if any bottlenecks exists which are limiting the performance of Spark. For example, how well is the cluster utilizing the hardware, namely the RAM?
One possible explanation is the use of Google Cloud Storage Bucket to store the data rather than in Hadoop Distributed File System (HDFS). Both jobs are operating directly on data in the Cloud Strage rather than the HDFS. This may be reducing data access time for Hadoop MapReduce or introducing data access time to Apache Spark, as opposed to having the input data stored directly on the VM data nodes.
 
 
== Progress ==
76
edits