==PreAssignment==
This assignment goes over how to do a simple word count in Scala using Spark.
 
===Introduction To Spark===
To introduce myself to Spark, I watched a series of YouTube videos that were filmed at a Spark conference. The videos can be found here: https://www.youtube.com/watch?v=nxCm-_GdTl8 The series includes seven videos that go over what Spark is and how it is meant to be used.
 
In summary, Spark is built to process big data across many machines. It is not meant for high-volume transactional workloads, but for the analysis of data.
 
Spark is built around the RDD (Resilient Distributed Dataset). This means that Spark does not edit the data that is passed in, but rather uses the data to perform transformations (filters, joins, maps, etc.) and then actions (reductions, counts, etc.).
 
The results are stored in new datasets instead of altering existing ones.
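As a minimal sketch of this (assuming a SparkContext named sc already exists), the snippet below builds an RDD, derives new RDDs through transformations, and only computes a result when an action is called:

<source lang="scala">
// Assumes an existing SparkContext named sc.
val numbers = sc.parallelize(1 to 100)   // creation: build an RDD from a local collection

// Transformations are lazy and return new RDDs; `numbers` itself is never modified.
val evens   = numbers.filter(_ % 2 == 0)
val squared = evens.map(n => n * n)

// Actions trigger the actual computation and return a result to the driver.
val total = squared.reduce(_ + _)
println("Sum of squared evens: " + total)
</source>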
 
RDDs are meant to be stored in memory for quick access; however, Spark is built so that, if necessary, RDDs can be written to disk (at a reduced I/O speed).
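A sketch of how that choice is expressed (assuming an RDD named lines already exists): Spark's StorageLevel controls where partitions are kept.

<source lang="scala">
import org.apache.spark.storage.StorageLevel

// Keep partitions in memory when they fit, and spill the rest to disk.
lines.persist(StorageLevel.MEMORY_AND_DISK)

// lines.cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), i.e. memory only.
</source>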
 
As mentioned above, there are three main steps in a Spark program: creation, transformation, and action.
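The word count for this assignment is a small example of all three steps. A minimal sketch (the input path and app name are placeholders; adjust them to your setup):

<source lang="scala">
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount").setMaster("local")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("input.txt")   // creation: load a text file as an RDD
    val counts = lines
      .flatMap(_.split("\\s+"))            // transformation: split lines into words
      .map(word => (word, 1))              // transformation: pair each word with a count of 1
      .reduceByKey(_ + _)                  // transformation: sum the counts per word

    counts.collect().foreach(println)      // action: bring the results back to the driver

    sc.stop()
  }
}
</source>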
====Setting Up The Scala Environment For Spark====
To run a standalone application on Windows using the Scala IDE, you need to create a Maven project. You do this by clicking File > New > Project > Maven Project. Once the project is created, to use Scala instead of Java, the source folder should be refactored from src/main/java to src/main/scala.
From here you need to edit the pom.xml file to include Spark.
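As a sketch of what that might look like (the versions below are examples from the Spark 1.6 / Scala 2.10 era; match them to your own installation), a spark-core dependency goes in the <dependencies> section:

<source lang="xml">
<!-- Inside the <dependencies> section of pom.xml.
     Versions are examples only; match your Spark and Scala installation. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.6.1</version>
</dependency>
</source>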
 
{| class="wikitable"
| Large || 157 || 50 || 14135
|}
 
[[Image:Graph_Spark_Vs_CPP.png|500px| ]]
 
What this shows is that the overhead of each parallelization technique is important to consider. Since Scala with Spark is meant to run across multiple computers, there is no optimization for running on a single machine, and as the chart above shows, the time taken to complete the program increases with the file size.
==How Does This Help You?==
There are a lot of jobs, and a lot of money, in big data. The average salary of a senior Scala developer is $152,000, while the average salary of a senior Java developer is $88,000 and the average salary of a senior C/C++ developer is $89,000. Is it worth learning Scala?