1. Interactive programming with spark-shell
(1) The total number of students in the department
scala> val lines = sc.textFile("file:///usr/local/spark/sparklab/Data01.txt")
lines: org.apache.spark.rdd.RDD[String] = file:///usr/local/spark/sparklab/Data01.txt MapPartitionsRDD[4] at textFile at <console>:24

scala> val info = lines.map(row => row.split(",")(0))
info: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[5] at map at <console>:25

scala> val latest = info.distinct()
latest: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[8] at distinct at <console>:25

scala> latest.count
res0: Long = 265
(2) How many courses does the department offer in total?
scala> val lines = sc.textFile("file:///usr/local/spark/sparklab/Data01.txt")
lines: org.apache.spark.rdd.RDD[String] = file:///usr/local/spark/sparklab/Data01.txt MapPartitionsRDD[4] at textFile at <console>:24

scala> val course = lines.map(row => row.split(",")(1))
course: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at map at <console>:25

scala> val course_num = course.distinct()
course_num: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at distinct at <console>:25

scala> course_num.count
res1: Long = 8
(3) What is Tom's average score?
scala> val tom = lines.filter(row => row.split(",")(0)=="Tom")
tom: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[14] at filter at <console>:25

scala> tom.foreach(println)
Tom,DataBase,26
Tom,Algorithm,12
Tom,OperatingSystem,16
Tom,Python,40
Tom,Software,60

scala> tom.map(row => (row.split(",")(0),row.split(",")(2).toInt)).mapValues(x => (x,1)).reduceByKey((x,y) => (x._1+y._1,x._2+y._2)).mapValues(x => (x._1/x._2)).collect()
res6: Array[(String, Int)] = Array((Tom,30))
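Note that x._1/x._2 divides two Int values, so res6 truncates Tom's true average (26+12+16+40+60)/5 = 30.8 down to 30; the same truncation affects the per-course averages in step (6) below. A minimal variant that keeps the fractional part, reusing the same tom RDD:

scala> tom.map(row => (row.split(",")(0),row.split(",")(2).toInt)).mapValues(x => (x,1)).reduceByKey((x,y) => (x._1+y._1,x._2+y._2)).mapValues(x => x._1.toDouble/x._2).collect()

which should yield Array((Tom,30.8)).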
(4) The number of courses each student has taken
scala> val c_num = lines.map(row=>(row.split(",")(0),row.split(",")(1)))
c_num: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[21] at map at <console>:25

scala> c_num.mapValues(x => (x,1)).reduceByKey((x,y) => (" ",x._2 + y._2)).mapValues(x => x._2).foreach(println)
(Ford,3) (Lionel,4) (Verne,3) (Lennon,4) (Joshua,4) (Marvin,3) (Marsh,4) (Bartholomew,5) (Conrad,2) (Armand,3) (Jonathan,4) (Broderick,3) (Brady,5) (Derrick,6) (Rod,4) (Willie,4) (Walter,4) (Boyce,2) (Duncann,5) (Elvis,2) (Elmer,4) (Bennett,6) (Elton,5) (Dec,5) (Jim,4) (Adonis,5) (Abel,4) (Peter,4) (Alvis,6) (Joseph,3) (Raymondt,6) (Kerwin,3) (Wright,4) (Adam,3) (Borg,4) (Sandy,1) (Ben,4) (Miles,6) (Clyde,7) (Francis,4) (Dempsey,4) (Ellis,4) (Edward,4) (Mick,4) (Cleveland,4) (Luther,5) (Virgil,5) (Ivan,4) (Alvin,5) (Dick,3) (Proof,4) (Leo,5) (Saxon,7) (Armstrong,2) (Hogan,4) (Sid,3) (Blair,4) (Colbert,4) (Lucien,5) (Kerr,4) (Montague,3) (Giles,7) (Kevin,4) (Uriah,1) (Jeffrey,4) (Simon,2) (Elijah,4) (Greg,4) (Colin,5) (Arlen,4) (Maxwell,4) (Payne,6) (Kennedy,4) (Spencer,5) (Kent,4) (Griffith,4) (Jeremy,6) (Alan,5) (Andrew,4) (Jerry,3) (Donahue,5) (Gilbert,3) (Bishop,2) (Bernard,2) (Egbert,4) (George,4) (Noah,4) (Bruce,3) (Mike,3) (Frank,3) (Boris,6) (Tony,3) (Christ,2) (Ken,3) (Milo,2) (Victor,2) (Clare,4) (Nigel,3) (Christopher,4) (Robin,4) (Chad,6) (Alfred,2) (Woodrow,3) (Rory,4) (Dennis,4) (Ward,4) (Chester,6) (Emmanuel,3) (Stan,3) (Jerome,3) (Corey,4) (Harvey,7) (Herbert,3) (Maurice,2) (Merle,3) (The,6) (Bing,6) (Charles,3) (Clement,5) (Leopold,7) (Brian,6) (Horace,5) (Sebastian,6) (Bernie,3) (Basil,4) (Michael,5) (Ernest,5) (Tom,5) (Vic,3) (Eli,5) (Duke,4) (Alva,5) (Lester,4) (Hayden,3) (Bertram,3) (Bart,5) (Adair,3) (Sidney,5) (Bowen,5) (Roderick,4) (Colby,4) (Jay,6) (Meredith,4) (Harold,4) (Max,3) (Scott,3) (Barton,1) (Elliot,3) (Matthew,2) (Alexander,4) (Todd,3) (Wordsworth,4) (Geoffrey,4) (Devin,4) (Donald,4) (Roy,6) (Harry,4) (Abbott,3) (Baron,6) (Mark,7) (Lewis,4) (Rock,6) (Eugene,1) (Aries,2) (Samuel,4) (Glenn,6) (Will,3) (Gerald,4) (Henry,2) (Jesse,7) (Bradley,2) (Merlin,5) (Monroe,3) (Hobart,4) (Ron,6) (Archer,5) (Nick,5) (Louis,6) (Just,5) (Randolph,3) (Benson,4) (John,6) (Abraham,3) (Benedict,6) (Marico,6) (Berg,4) (Aldrich,3) (Lou,2) (Brook,4) (Ronald,3) (Pete,3) (Nicholas,5) (Bill,2) (Harlan,6) (Tracy,3) (Gordon,4) (Alston,4) (Andy,3) (Bruno,5) (Beck,4) (Phil,3) (Barry,5) (Nelson,5) (Antony,5) (Rodney,3) (Truman,3) (Marlon,4) (Don,2) (Philip,2) (Sean,6) (Webb,7) (Solomon,5) (Aaron,4) (Blake,4) (Amos,5) (Chapman,4) (Jonas,4) (Valentine,8) (Angelo,2) (Boyd,3) (Benjamin,4) (Winston,4) (Allen,4) (Evan,3) (Albert,3) (Newman,2) (Jason,4) (Hilary,4) (William,6) (Dean,7) (Claude,2) (Booth,6) (Channing,4) (Jeff,4) (Webster,2) (Marshall,4) (Cliff,5) (Dominic,4) (Upton,5) (Herman,3) (Levi,2) (Clark,6) (Hiram,6) (Drew,5) (Bert,3) (Alger,5) (Brandon,5) (Antonio,3) (Elroy,5) (Leonard,2) (Adolph,4) (Blithe,3) (Kenneth,3) (Perry,5) (Matt,4) (Eric,4) (Archibald,5) (Martin,3) (Kim,4) (Clarence,7) (Vincent,5) (Winfred,3) (Christian,2) (Bob,3) (Enoch,3)
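Since each line of Data01.txt is one (student, course) record, the same counts can be obtained more directly, assuming the file contains no duplicate records, by mapping every record to (name, 1) and summing; this also avoids the throwaway " " placeholder in the reduceByKey above. A minimal equivalent sketch:

scala> lines.map(row => (row.split(",")(0),1)).reduceByKey(_+_).foreach(println)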
(5) The total number of students taking the DataBase course
scala> val lines = sc.textFile("file:///usr/local/spark/sparklab/Data01.txt")
lines: org.apache.spark.rdd.RDD[String] = file:///usr/local/spark/sparklab/Data01.txt MapPartitionsRDD[4] at textFile at <console>:24

scala> val database_num = lines.filter(row => row.split(",")(1)=="DataBase")
database_num: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[20] at filter at <console>:25

scala> database_num.count
res7: Long = 126
(6) What is the average score for each course?
scala> val ave = lines.map(row=>(row.split(",")(1),row.split(",")(2).toInt))
ave: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[26] at map at <console>:25

scala> ave.mapValues(x=>(x,1)).reduceByKey((x,y) => (x._1+y._1,x._2 + y._2)).mapValues(x => (x._1/x._2)).collect()
res9: Array[(String, Int)] = Array((CLanguage,50), (Software,50), (Python,57), (Algorithm,48), (DataStructure,47), (DataBase,50), (ComputerNetwork,51), (OperatingSystem,54))
(7) Use an accumulator to compute the number of students taking the DataBase course
scala> val lines = sc.textFile("file:///usr/local/spark/sparklab/Data01.txt")
lines: org.apache.spark.rdd.RDD[String] = file:///usr/local/spark/sparklab/Data01.txt MapPartitionsRDD[1] at textFile at <console>:24

scala> val database_num = lines.filter(row=>row.split(",")(1)=="DataBase").map(row=>(row.split(",")(1),1))
database_num: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:25

scala> val accum = sc.longAccumulator("My Accumulator")
accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, name: Some(My Accumulator), value: 0)

scala> database_num.values.foreach(x => accum.add(x))

scala> accum.value
res1: Long = 126
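Accumulator updates are only guaranteed to be applied exactly once when they happen inside an action such as foreach, as above; updates performed inside transformations may be re-applied if a task or stage is re-executed. For comparison, the same figure can be obtained without an accumulator at all (a sketch reusing the lines RDD):

scala> lines.filter(row => row.split(",")(1)=="DataBase").count()

which should again return 126, matching step (5).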
2. Writing a standalone application for data deduplication
For the two input files A and B below, write a standalone Spark application that merges the two files, eliminates duplicate content, and produces a new file C. Sample input and output files follow for reference.
Sample contents of input file A:
20170101 x
20170102 y
20170103 x
20170104 y
20170105 z
20170106 z
Sample contents of input file B:
20170101 y
20170102 y
20170103 x
20170104 z
20170105 y
The merged output file C obtained from inputs A and B is as follows:
20170101 x
20170101 y
20170102 y
20170103 x
20170104 y
20170104 z
20170105 y
20170105 z
20170106 z
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object lab04 {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("RemDup")
    val sc = new SparkContext(conf)
    // Read both input files into one RDD (textFile accepts comma-separated paths).
    val dataFile = "file:///usr/local/spark/sparklab/a.txt,file:///usr/local/spark/sparklab/b.txt"
    val data = sc.textFile(dataFile, 2)
    // Merging the files and removing duplicate lines is a single distinct().
    val da = data.distinct()
    da.foreach(println)
  }
}
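A variant sketch that reads the two files separately, merges them with union, sorts the result so it matches the order of sample C, and actually writes file C to disk; the output path is an assumption, and saveAsTextFile produces a directory of part files rather than a single file:

    val a = sc.textFile("file:///usr/local/spark/sparklab/a.txt")
    val b = sc.textFile("file:///usr/local/spark/sparklab/b.txt")
    // union keeps duplicates; distinct removes them; sortBy orders lines as in sample C
    val c = a.union(b).distinct().sortBy(identity)
    c.saveAsTextFile("file:///usr/local/spark/sparklab/c_output")  // assumed output path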
3. Writing a standalone application to compute average scores
Each input file records the scores of the students in a class for one subject; each line contains two fields, the student's name followed by the student's score. Write a standalone Spark application that computes every student's average score across all input files and writes the result to a new file. Sample input and output files follow for reference.
Algorithm scores:
Xiaoming 92
Xiaohong 87
Xiaoxin 82
Mary 90
Database scores:
Xiaoming 95
Xiaohong 81
Xiaoxin 89
Mary 85
Python scores:
Xiaoming 82
Xiaohong 83
Xiaoxin 94
Mary 91
The averaged output is as follows:
(Xiaohong,83.67)
(Xiaoxin,88.33)
(Xiaoming,89.67)
(Mary,88.67)
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object lab043 {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("AvgScore")
    val sc = new SparkContext(conf)
    // Read the three subject files into one RDD (textFile accepts comma-separated paths).
    val dataFile = "file:///usr/local/spark/sparklab/lab043/1.txt,file:///usr/local/spark/sparklab/lab043/2.txt,file:///usr/local/spark/sparklab/lab043/3.txt"
    val data = sc.textFile(dataFile, 3)
    // Sum score and count per student, then divide as Double so the average
    // keeps its fractional part; format to two decimals as in the sample output.
    val res = data.map(line => (line.split(" ")(0), line.split(" ")(1).toInt))
      .mapValues(x => (x, 1))
      .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
      .mapValues(x => f"${x._1.toDouble / x._2}%.2f")
    res.foreach(println)
    res.saveAsTextFile("result")  // write the averages to a new file (a "result" directory)
  }
}
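The same aggregation can also be written with aggregateByKey, which makes the (sum, count) accumulator explicit; a sketch under the same data layout (name and score separated by a single space):

    val avg = data
      .map(line => { val f = line.split(" "); (f(0), f(1).toDouble) })
      .aggregateByKey((0.0, 0))(
        (acc, v) => (acc._1 + v, acc._2 + 1),  // fold one score into (sum, count)
        (a, b) => (a._1 + b._1, a._2 + b._2))  // merge partial (sum, count) pairs
      .mapValues { case (sum, n) => f"${sum / n}%.2f" }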