Changes

GPU621/ApacheSpark

1,100 bytes added, 08:54, 26 November 2018


=== Why Apache Spark ===

* Data is exploding in volume, velocity, and variety.
* The need for faster analytic results is becoming increasingly important.
* Spark supports near-real-time analytics for answering business questions.

=== Spark and Hadoop ===
* Hadoop = HDFS (Hadoop Distributed File System) + MapReduce (data processing model).
* Spark is an advanced data processing and analysis engine that is replacing MapReduce.
* Spark does not have its own file system, so it commonly runs on top of HDFS.

=== Spark vs MapReduce ===

== Features ==

* Easy to use
** Supports Python, Java, and Scala
** Libraries for SQL, machine learning, and streaming
* General-purpose
** Batch processing, as in MapReduce, is included
** Iterative algorithms
** Interactive queries and streaming, which return results immediately
* Speed
** In-memory computation
** Faster than MapReduce for complex applications on disk

== Resilient Distributed Datasets (RDDs) ==
Spark revolves around RDDs, its fundamental data structure. An RDD is an immutable, distributed collection of objects that can be operated on in parallel.

There are two ways to create an RDD:
# Parallelizing an existing collection
# Referencing a data set in an external storage system

=== Operations ===
* Transformations: create a new data set from an existing one.
* Actions: return a value to the driver program after running a computation on the data set.

== Examples & Use Case ==
Spark is used in healthcare, media, finance, retail, and travel.
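The split between transformations and actions on an RDD can be illustrated with a small plain-Python sketch. This is a toy, single-process model and not Spark's API: the class <code>ToyRDD</code> and its internals are hypothetical names chosen for illustration. Its method names mirror the common RDD operations (<code>parallelize</code>, <code>map</code>, <code>filter</code>, <code>collect</code>, <code>count</code>, <code>reduce</code>), but nothing here is distributed. It only shows that transformations build a new data set without running anything, while actions trigger the computation and return a value to the caller.

```python
from functools import reduce as _reduce


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; not Spark's API)."""

    def __init__(self, source):
        # `source` is a zero-argument callable that yields the elements,
        # so no work happens until an action asks for the data.
        self._source = source

    @classmethod
    def parallelize(cls, collection):
        # Creation route 1: start from an existing in-memory collection.
        data = tuple(collection)  # snapshot, so the data set stays immutable
        return cls(lambda: iter(data))

    # Transformations: return a new ToyRDD and compute nothing yet.
    def map(self, fn):
        return ToyRDD(lambda: (fn(x) for x in self._source()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._source() if pred(x)))

    # Actions: run the pipeline and return a plain value.
    def collect(self):
        return list(self._source())

    def count(self):
        return sum(1 for _ in self._source())

    def reduce(self, fn):
        return _reduce(fn, self._source())


numbers = ToyRDD.parallelize([1, 2, 3, 4, 5])

# Two chained transformations; nothing has been evaluated at this point.
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

# Actions trigger evaluation.
result = evens_squared.collect()
total = evens_squared.reduce(lambda a, b: a + b)
```

Because each transformation returns a new object and leaves its parent untouched, <code>numbers</code> can be reused for other pipelines, which mirrors the immutability of RDDs described above. In real Spark the same chain would be built from a SparkContext and executed across a cluster.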

=== Finance and Fraud Detection ===

Sathia

33

edits
