Difference between revisions of "GPU621/ApacheSpark"
Latest revision as of 12:36, 26 November 2018
Team Members
Introduction
What is Apache Spark?
An open-source distributed general-purpose cluster-computing framework for Big Data.
History of Apache Spark
2009: Started as a distributed computing framework at UC Berkeley's AMPLab by Matei Zaharia
2010: Open sourced under a BSD license
2013: The project was donated to the Apache Software Foundation and the license was changed to Apache 2.0
2014: Became an Apache Top-Level Project. Used by Databricks to set a world record in large-scale sorting in November
2014-present: Continues as a next-generation framework for real-time and batch processing
Why Apache Spark
Data has exploded in volume, velocity and variety
Fast analytic results are becoming increasingly important
Spark supports near-real-time analytics to answer business questions
Spark and Hadoop
Hadoop = HDFS (Hadoop Distributed File System) + MapReduce (data processing model)
Spark is a more advanced data processing/analysis engine that is replacing MapReduce
Spark does not have its own file system, so it typically runs on top of HDFS
Spark vs MapReduce
Features
Easy to use
Supports Python, Java and Scala
Libraries for SQL, machine learning (ML) and streaming
General-purpose
Batch processing, as in MapReduce
Iterative algorithms
Interactive queries and streaming which return results immediately
Speed
In-memory computations
Faster than MapReduce for complex applications, even when the data is on disk
Resilient Distributed Datasets (RDDs)
Spark revolves around RDDs, its fundamental data structure.
An RDD is an immutable, distributed collection of objects that can be operated on in parallel.
There are two ways to create RDDs:
1) Parallelizing an existing collection
2) Referencing a data set in an external storage system
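The idea of an immutable collection split into partitions that are processed in parallel can be sketched in plain Python. This is only an illustration and does not use Spark: the helper names `parallelize` and `map_partitions` below are hypothetical stand-ins, and threads on one machine replace a cluster. In PySpark, the two creation routes listed above correspond to `SparkContext.parallelize` and `SparkContext.textFile`.

```python
from concurrent.futures import ThreadPoolExecutor


def parallelize(collection, num_partitions):
    """Split a collection into roughly equal, immutable partitions (tuples)."""
    items = tuple(collection)
    size, extra = divmod(len(items), num_partitions)
    partitions = []
    start = 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(items[start:end])
        start = end
    return tuple(partitions)


def map_partitions(partitions, func):
    """Apply func to every element, one worker task per partition.

    The input is never modified; a new partitioned collection is returned.
    """
    with ThreadPoolExecutor() as pool:
        return tuple(
            pool.map(lambda part: tuple(func(x) for x in part), partitions)
        )


numbers = parallelize([1, 2, 3, 4, 5], num_partitions=2)
squares = map_partitions(numbers, lambda x: x * x)
print(numbers)   # the original partitions are unchanged
print(squares)
```

Because the partitions are tuples, every operation has to produce a new collection rather than change the old one, which mirrors the immutability described above.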
Operations
Transformations
Create a new data set from an existing one
Actions
Return a value to the driver program after running a computation on the data set
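In Spark, transformations are lazy: they only describe a new data set, and nothing is computed until an action asks for a result. The plain-Python sketch below imitates that split. It does not use Spark; the class `LazyDataset` is a hypothetical stand-in for an RDD, and only its method names (`map`, `filter`, `collect`, `count`, `reduce`) are borrowed from the RDD API.

```python
import functools


class LazyDataset:
    """Hypothetical stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = tuple(source)
        self._steps = steps  # each step maps an iterable to an iterable

    # Transformations: return a new dataset and compute nothing.
    def map(self, func):
        step = lambda items: (func(x) for x in items)
        return LazyDataset(self._source, self._steps + (step,))

    def filter(self, pred):
        step = lambda items: (x for x in items if pred(x))
        return LazyDataset(self._source, self._steps + (step,))

    # Actions: run the recorded steps and return a value to the caller.
    def collect(self):
        data = iter(self._source)
        for step in self._steps:
            data = step(data)
        return list(data)

    def count(self):
        return len(self.collect())

    def reduce(self, func):
        return functools.reduce(func, self.collect())


calls = []  # records every element that actually gets processed


def double(x):
    calls.append(x)
    return x * 2


ds = LazyDataset([1, 2, 3, 4])
evens_doubled = ds.filter(lambda x: x % 2 == 0).map(double)
calls_before_action = list(calls)  # still empty: no work has been done

total = evens_doubled.reduce(lambda a, b: a + b)  # the action triggers the work
print(calls_before_action, total, calls)
```

Chaining `filter` and `map` only builds up a recipe; `double` is first called when `reduce` runs.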
These examples and more are found at https://spark.apache.org/docs/latest/rdd-programming-guide.html
Examples
Word Count
Uses transformations (flatMap, map, reduceByKey) to build a data set of (String, Int) pairs, which is then saved to a file.
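The same three-step pipeline can be followed in plain Python. This sketch does not use Spark: `flat_map` and `reduce_by_key` are hypothetical helpers written to imitate the RDD operations of the same name, and the input lines and output file name are made up for illustration. In PySpark the corresponding calls would be `flatMap`, `map`, `reduceByKey` and `saveAsTextFile` on an RDD.

```python
import os
import tempfile
from itertools import chain


def flat_map(func, items):
    """Apply func to each item and flatten the resulting sequences."""
    return chain.from_iterable(func(x) for x in items)


def reduce_by_key(func, pairs):
    """Merge the values of each key using func."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


# Illustrative input; a real job would read lines from a file.
lines = ["to be or not to be", "to see or not to see"]

words = flat_map(lambda line: line.split(" "), lines)   # lines -> words
pairs = map(lambda word: (word, 1), words)              # word -> (word, 1)
counts = reduce_by_key(lambda a, b: a + b, pairs)       # sum the 1s per word

# Save the (word, count) pairs to a file, one pair per line.
out_path = os.path.join(tempfile.mkdtemp(), "counts.txt")
with open(out_path, "w") as f:
    for word, n in sorted(counts.items()):
        f.write(f"{word} {n}\n")

print(counts)
```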
Finance and Stock trading Use Case
Imagine that you work for a financial company and your job is to buy and sell stocks for profit. Your decisions depend heavily on the predictions produced by your financial model, so how long the model takes to make a prediction is critical. Stock prices change quickly; within a couple of seconds a price can move drastically. If your model cannot give you a near-real-time response, you might lose the opportunity to trade at the best price. Apache Spark can be used to build financial models that make predictions in near real time.