The Real A Team
Using Apache's Spark
This assignment is dedicated to learning how to use Apache's Spark. I chose to use the programming language Scala because it is the native language Spark was created for.
A Team Members
- Adrian Sauvageot, All
- ...
Assignment
This assignment will be going over how to do a simple word count in Scala using Spark.
Introduction To Scala
Scala is a programming language that is based off of Java, but is meant for more scalable applications. Information on scala can be found at: http://www.scala-lang.org/
IDE Used
For this asignment the Scala IDE was used. It can be found here: http://scala-ide.org/ The IDE is a version of eclipse with Scala built in.
Setting Up The Scala Environment
To run an application on windows, using the Scala IDE, you need to create a Marven project. You do this by clicking File > new > Project > Marven Project. Once the project is created, to use Scala instead of Java, the name of your source file should be refacotred from src/main/java to src/main/scala.
From here you need to edit the pom.xml file to include Spark.
You can do this by adding the following code:
<dependencies> <dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.10</artifactId> <version>1.6.0</version> </dependency> </dependencies>
When the project is built it will include the Apache Spark files.
Imports In Object
Like Java, Scala uses both classes and objects, to use Spark inside these you must import it by adding the following two lines to the top of the file.
import org.apache.spark.SparkConf import org.apache.spark.SparkContext
Windows Bug
There is a known bug on windows for Scala where it will not run unless hadoop is installed. This happens even though hadoop is not required for Scala or Spark. To fix this issue, a workaround is to add one of the hadoop exe files to your program.
To do this, I had to download the pre-compiled hadoop file and put it in my C directory. From there I had to add the following line to my file to tell Spark where to find the exe.
System.setProperty("hadoop.home.dir", "c:\\winutil\\")
Introduction To Spark
To introduce myself to spark I watched a couple of youtube videos that were filmed at a Spark conference. The videos can be found here: https://www.youtube.com/watch?v=nxCm-_GdTl8 They include 7 videos that go over what Spark is, and how it is meant to be used.
In summery Spark is built to run big data across many machines. It is not meant for many transactions, but is meant for the analysis of data.
Spark is built using a RDD (Resilient Distributed Dataset). This means that Spark does not edit the data that is passed in, but rather used the data to preform transformations (filters, joins, maps, etc.) and then actions (reductions, counts, etc.)
The results are stored into new datasets instead of altering existing ones.
The RDD's are meant to be stored in memory for quick access, however Spark is built so that if necessary the RDD's can be written to the drive. (At a reduced IO speed).
As mentioned above, there are three main steps in a Spark program. Creation, Transformation, Action.
Creation
The creation step is the step in which the data is loaded into the first RDD. This data can be loaded from multiple sources, but for the purpose of this assignment it is loaded from a local text file. In scala, the following line will load a text file into a variable.
val test = sc.textFile("food.txt")
Transformation
The next step is to transform the data that is in the first RDD into something you want to use. You can do this by using filter, map, union, join, sort, etc. In this example we would like to count the number of times each word is used in a piece of text. The following code snip-it will split the RDD by spaces, and map each word with a key value. The value is set to 1 to count each word as 1.
test.flatMap { line => line.split(" ") { .map { word => (word,1) }
Action
Finally an action is taken to give the programmer the desired output, for example count, collect, reduce, lookup, save. In this example we want to reduce the RDD on each word by adding the value of the word. To do this we can use:
.reduceByKey(_ + _)
Full Program
To run the program on windows, the following is the main.
object WordCount { def main(args: Array[String]) ={ System.setProperty("hadoop.home.dir", "c:\\winutil\\") val conf = new SparkConf() .setAppName("WordCount") .setMaster("local") val sc = new SparkContext(conf) val test = sc.textFile("food.txt") test.flatMap { line => line.split(" ") } .map { word => (word,1) } .reduceByKey(_ + _) .saveAsTextFile("food.count.txt") sc.stop } }