
GPU621/ApacheSpark

1,100 bytes added, 08:54, 26 November 2018
=== Why Apache Spark ===
Data has exploded in volume, velocity and variety. <br />
Getting analytic results faster has become increasingly important. <br />
Near real-time analytics are needed to answer business questions. <br />
=== Spark and Hadoop ===
Hadoop = HDFS (Hadoop Distributed File System) + MapReduce (data processing model) <br />
Spark is an advanced data processing and analysis engine that is replacing MapReduce. <br />
Spark does not have its own file system, so it is often run on top of HDFS. <br />
=== Spark vs MapReduce ===
== Features ==
<b>Easy to use</b> <br />
Supports Python, Java and Scala <br />
Libraries for SQL, machine learning and streaming <br />
<b>General-purpose</b> <br />
Batch processing, as in MapReduce, is included <br />
Iterative algorithms <br />
Interactive queries and streaming, which return results immediately <br />
<b>Speed</b> <br />
In-memory computation <br />
Faster than MapReduce for complex applications, even when the data is on disk <br />
== Resilient Distributed Datasets (RDDs) ==
Spark revolves around RDDs, its fundamental data structure. <br />
An RDD is an immutable distributed collection of objects that can be operated on in parallel. <br />
There are two ways to create RDDs: <br />
1) Parallelizing an existing collection <br />
2) Referencing a data set in an external storage system
=== Operations ===
<b>Transformations</b> <br />
Create a new data set from an existing one <br />
<b>Actions</b> <br />
Return a value to the driver program after running a computation on the data set <br />
== Examples & Use Case ==
Spark is used in healthcare, media, finance, retail and travel.
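As a small illustration of the transformation and action model described under Operations, the sketch below uses a toy, pure-Python stand-in for an RDD. It is not Spark's API: <code>MiniRDD</code>, the module-level <code>parallelize</code> helper and the transaction amounts are all hypothetical names and made-up values chosen for this example. The method names only mirror PySpark's <code>SparkContext.parallelize</code>, <code>RDD.map</code>, <code>RDD.filter</code>, <code>RDD.collect</code> and <code>RDD.reduce</code>. The sketch runs on a single machine and shows only three ideas: transformations return a new data set, nothing is computed until an action is called, and the original data set is never modified.

```python
from functools import reduce


class MiniRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark): immutable and lazy."""

    def __init__(self, source, steps=()):
        self._source = tuple(source)  # frozen copy of the input collection
        self._steps = tuple(steps)    # recorded transformations, not yet run

    # Transformations: return a NEW MiniRDD and do no work yet.
    def map(self, fn):
        return MiniRDD(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return MiniRDD(self._source, self._steps + (("filter", pred),))

    # Actions: run the recorded steps and return a plain value to the caller.
    def collect(self):
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def reduce(self, fn):
        return reduce(fn, self.collect())


def parallelize(collection):
    """Build a MiniRDD from an existing in-memory collection."""
    return MiniRDD(collection)


# Made-up transaction amounts, purely for illustration.
amounts = parallelize([120, 15, 980, 42, 5000])
large = amounts.filter(lambda a: a >= 100)   # transformation (lazy)
with_fee = large.map(lambda a: a + 1)        # transformation (lazy)

print(with_fee.collect())                    # action: [121, 981, 5001]
print(with_fee.reduce(lambda x, y: x + y))   # action: 6103
print(amounts.collect())                     # original is unchanged
```

In real Spark the same chain would be distributed across a cluster, and the values returned by <code>collect</code> and <code>reduce</code> would arrive at the driver program.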
=== Finance and Fraud Detection ===