RDDs support two types of operations: transformations (which return a new RDD) and actions (which return values).
1. Transformation: creates a new RDD from an existing dataset.
(1) map(func): passes each element of the RDD through func and returns a new RDD; the result is again a distributed dataset.
(2) filter(func): applies func to each element of the RDD and returns a new RDD made up of the elements for which func returns true.
(3) flatMap(func): similar to map, but each input element can produce zero or more output elements.
(4) mapPartitions(func): similar to map, but while map runs on each element, mapPartitions runs on each partition.
(5) mapPartitionsWithSplit(func): similar to mapPartitions, but func also receives the index of the partition (split) it runs on.
(6) sample(withReplacement, fraction, seed): samples the data with the given fraction and random seed, with or without replacement.
(7) union(otherDataset): returns a new dataset containing the union of the elements of the source dataset and the given dataset.
(8) distinct([numTasks]): returns a new dataset containing the distinct elements of the source dataset.
(9) groupByKey([numTasks]): returns (K, Seq[V]) pairs, analogous to the key and value list a Hadoop reduce function receives.
(10) reduceByKey(func, [numTasks]): applies the given reduce func to the (K, Seq[V]) pairs produced by groupByKey, e.g. to compute a sum or average.
(11) sortByKey([ascending], [numTasks]): sorts by key in ascending or descending order; ascending is a boolean.
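Most of these transformations have direct analogues on plain Scala collections, which makes their semantics easy to try without a cluster. The sketch below uses plain Scala lists rather than RDDs (in a real Spark job the data would come from sc.parallelize or sc.textFile; reduceByKey has no collection method of that name, so groupBy plus a per-group sum stands in for it):

```scala
object TransformationSketch {
  // map: apply a function to every element
  def doubled(xs: List[Int]): List[Int] = xs.map(_ * 2)

  // filter: keep the elements for which the predicate returns true
  def evens(xs: List[Int]): List[Int] = xs.filter(_ % 2 == 0)

  // flatMap: each input element may produce zero or more outputs
  def words(lines: List[String]): List[String] =
    lines.flatMap(_.split(" "))

  // reduceByKey analogue: group by key, then reduce each group's values
  def sumByKey(pairs: List[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
}
```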
2. Action: runs a computation over the RDD and returns a value or writes the result to external storage.
(1) reduce(func): aggregates the dataset using func, which takes two arguments and returns one value; func must be commutative and associative.
(2) collect(): returns the whole dataset as an array; generally used only when the dataset is small enough, e.g. after a filter.
(3) count(): returns the number of elements in the dataset.
(4) first(): returns the first element of the dataset.
(5) take(n): returns the first n elements.
(6) takeSample(withReplacement, num, seed): returns a random sample of num elements of the dataset, using random seed seed.
(7) saveAsTextFile(path): writes the dataset as a text file to the local file system, HDFS, or another Hadoop-supported file system; Spark converts each record into one line and writes it to the file.
(8) saveAsSequenceFile(path): only usable on key-value datasets; writes the dataset as a SequenceFile to the local file system or Hadoop.
(9) countByKey(): for a key-value RDD, returns a map from each key to its count.
(10) foreach(func): applies func to each element of the dataset.
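The actions likewise mirror Scala collection methods. A plain-Scala sketch (no SparkContext; countByKey is emulated with groupBy, since collections have no method of that name):

```scala
object ActionSketch {
  // reduce: aggregate with a commutative, associative function
  def total(xs: List[Int]): Int = xs.reduce(_ + _)

  // take(n): the first n elements of the dataset
  def firstN(xs: List[Int], n: Int): List[Int] = xs.take(n)

  // countByKey analogue: number of occurrences of each key
  def countByKey(pairs: List[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (k, vs) => (k, vs.size) }
}
```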
An existing file buyer_favorite records user favorites data: user id (buyer_id), product id (goods_id), and favorite date (dt):
user id    product id    favorite date
10181 1000481 2010-04-04 16:54:31
20001 1001597 2010-04-07 15:07:52
20001 1001560 2010-04-07 15:08:27
20042 1001368 2010-04-08 08:20:30
20067 1002061 2010-04-08 16:45:33
20056 1003289 2010-04-12 10:50:55
20056 1003290 2010-04-12 11:57:35
20056 1003292 2010-04-12 12:05:29
20054 1002420 2010-04-14 15:24:12
20055 1001679 2010-04-14 19:46:04
20054 1010675 2010-04-14 15:23:53
20054 1002429 2010-04-14 17:52:45
20076 1002427 2010-04-14 19:35:39
20054 1003326 2010-04-20 12:54:44
20056 1002420 2010-04-15 11:24:49
20064 1002422 2010-04-15 11:35:54
20056 1003066 2010-04-15 11:43:01
20056 1003055 2010-04-15 11:43:06
20056 1010183 2010-04-15 11:45:24
20056 1002422 2010-04-15 11:45:49
20056 1003100 2010-04-15 11:45:54
20056 1003094 2010-04-15 11:45:57
20056 1003064 2010-04-15 11:46:04
20056 1010178 2010-04-15 16:15:20
20076 1003101 2010-04-15 16:37:27
20076 1003103 2010-04-15 16:37:05
20076 1003100 2010-04-15 16:37:18
20076 1003066 2010-04-15 16:37:31
20054 1003103 2010-04-15 16:40:14
20054 1003100 2010-04-15 16:40:16
Now count, for each user, how many products they have favorited.
1. On Linux, create the directory /data/spark3/wordcount for storing the experimental data.
mkdir -p /data/spark3/wordcount
Switch to the /data/spark3/wordcount directory and download the experimental data from http://192.168.1.100:60000/allfiles/spark3/wordcount/buyer_favorite.
cd /data/spark3/wordcount
wget http://192.168.1.100:60000/allfiles/spark3/wordcount/buyer_favorite
2. Use jps to check whether the Hadoop and Spark processes are running; if not, execute the start commands.
jps
/apps/hadoop/sbin/start-all.sh
/apps/spark/sbin/start-all.sh
Upload the local file /data/spark3/wordcount/buyer_favorite to the /myspark3/wordcount directory on HDFS. If the /myspark3 directory does not exist on HDFS, create it first.
hadoop fs -mkdir -p /myspark3/wordcount
hadoop fs -put /data/spark3/wordcount/buyer_favorite /myspark3/wordcount
3. Start spark-shell
spark-shell
4. Write Scala statements to count how many products each user has favorited.
First, load the data in spark-shell:
val rdd = sc.textFile("hdfs://localhost:9000/myspark3/wordcount/buyer_favorite");
Then compute and output the statistics:
rdd.map(line=> (line.split('\t')(0),1)).reduceByKey(_+_).collect
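The same per-user counting logic can be sanity-checked on plain Scala collections with a few sample lines (the values below are taken from the data above; groupBy plus a per-group sum stands in for reduceByKey):

```scala
object FavoriteCount {
  // Emit (buyer_id, 1) per line, then sum the ones per buyer -- the
  // collection-API equivalent of map(...).reduceByKey(_ + _)
  def countPerUser(lines: List[String]): Map[String, Int] =
    lines.map(line => (line.split('\t')(0), 1))
      .groupBy(_._1)
      .map { case (k, vs) => (k, vs.map(_._2).sum) }
}
```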
Distinct (de-duplication): using spark-shell and the user favorites file from the experiment above, deduplicate by product id to find which products have been favorited.
1. On Linux, create the directory /data/spark3/distinct for storing the experimental data.
mkdir -p /data/spark3/distinct
Switch to the /data/spark3/distinct directory and download the experimental data from http://192.168.1.100:60000/allfiles/spark3/distinct/buyer_favorite.
cd /data/spark3/distinct
wget http://192.168.1.100:60000/allfiles/spark3/distinct/buyer_favorite
2. Use jps to check the Hadoop and Spark processes and make sure the related framework processes are running.
3. Upload the file /data/spark3/distinct/buyer_favorite to the /myspark3/distinct directory on HDFS. If the directory does not exist on HDFS, create it first.
hadoop fs -mkdir -p /myspark3/distinct
hadoop fs -put /data/spark3/distinct/buyer_favorite /myspark3/distinct
4. In the spark-shell window, write Scala statements to find which products have been favorited in the user favorites data.
Load the data and create an RDD:
val rdd = sc.textFile("hdfs://localhost:9000/myspark3/distinct/buyer_favorite");
Run the statistics on the RDD and print the result:
rdd.map(line => line.split('\t')(1)).distinct.collect
Sort: an e-commerce site keeps visit statistics for its products. An existing file goods_visit stores each product on the site together with its number of clicks.
Product id (goods_id), clicks (click_num):
product id    clicks
1010037 100
1010102 100
1010152 97
1010178 96
1010280 104
1010320 103
1010510 104
1010603 96
1010637 97
Now sort the products by number of clicks and output all of them.
The output format:
clicks    product id
96 1010603
96 1010178
97 1010637
97 1010152
100 1010102
100 1010037
103 1010320
104 1010510
104 1010280
1. On Linux, create the directory /data/spark3/sort for storing the experimental data.
mkdir -p /data/spark3/sort
Switch to the /data/spark3/sort directory and download the experimental data from http://192.168.1.100:60000/allfiles/spark3/sort/goods_visit.
cd /data/spark3/sort
wget http://192.168.1.100:60000/allfiles/spark3/sort/goods_visit
2. Upload the file /data/spark3/sort/goods_visit to the /myspark3/sort directory on HDFS. If the directory does not exist on HDFS, create it first.
hadoop fs -mkdir -p /myspark3/sort
hadoop fs -put /data/spark3/sort/goods_visit /myspark3/sort
3. In the spark-shell window, load the data and convert it into an RDD:
val rdd1 = sc.textFile("hdfs://localhost:9000/myspark3/sort/goods_visit");
Run the statistics on the RDD and print the result:
rdd1.map(line => ( line.split('\t')(1).toInt, line.split('\t')(0) ) ).sortByKey(true).collect
4. The output follows the format shown above.
Join: we have partial transaction data from an e-commerce site for December 15, 2011. The data consists of an orders table (orders) and an order details table (order_items); the table structures and data are as follows:
orders table: (order id order_id, order number order_number, buyer id buyer_id, order date create_dt)
order id    order number    buyer id    order date
52304    111215052630    176474    2011-12-15 04:58:21
52303    111215052629    178350    2011-12-15 04:45:31
52302    111215052628    172296    2011-12-15 03:12:23
52301    111215052627    178348    2011-12-15 02:37:32
52300    111215052626    174893    2011-12-15 02:18:56
52299    111215052625    169471    2011-12-15 01:33:46
52298    111215052624    178345    2011-12-15 01:04:41
52297    111215052623    176369    2011-12-15 01:02:20
52296    111215052622    178343    2011-12-15 00:38:02
52295    111215052621    178342    2011-12-15 00:18:43
52294    111215052620    178341    2011-12-15 00:14:37
52293    111215052619    178338    2011-12-15 00:13:07
order_items table: (detail id item_id, order id order_id, product id goods_id)
detail id    order id    product id
252578    52293    1016840
252579    52293    1014040
252580    52294    1014200
252581    52294    1001012
252582    52294    1022245
252583    52294    1014724
252584    52294    1010731
252586    52295    1023399
252587    52295    1016840
252592    52296    1021134
252593    52296    1021133
252585    52295    1021840
252588    52295    1014040
252589    52296    1014040
252590    52296    1019043
The orders table and the order_items table are related by order id; the relationship is one-to-many.
Next, open spark-shell and query which users bought which products on the site that day.
1. On Linux, create the directory /data/spark3/join for storing the experimental data.
mkdir -p /data/spark3/join
Switch to the /data/spark3/join directory and download the experimental data from http://192.168.1.100:60000/allfiles/spark3/join/order_items and http://192.168.1.100:60000/allfiles/spark3/join/orders.
cd /data/spark3/join
wget http://192.168.1.100:60000/allfiles/spark3/join/order_items
wget http://192.168.1.100:60000/allfiles/spark3/join/orders
2. Create the /myspark3/join directory on HDFS, and upload the data under /data/spark3/join on Linux to it.
hadoop fs -mkdir -p /myspark3/join
hadoop fs -put /data/spark3/join/orders /myspark3/join
hadoop fs -put /data/spark3/join/order_items /myspark3/join
3. In the spark-shell window, create two RDDs by loading the orders and order_items data files:
val rdd1 = sc.textFile("hdfs://localhost:9000/myspark3/join/orders");
val rdd2 = sc.textFile("hdfs://localhost:9000/myspark3/join/order_items");
4. Our goal is to query which products each user bought, so map rdd1 and rdd2 to pull out the two relevant columns from each.
val rdd11 = rdd1.map(line => (line.split('\t')(0), line.split('\t')(2)))
val rdd22 = rdd2.map(line => (line.split('\t')(1), line.split('\t')(2)))
5. Join rdd11 and rdd22 on the key to get the final result:
val rddresult = rdd11 join rdd22
6. Finally, output the result and inspect it:
rddresult.collect
7. The final output format is:
(52294,(178341,1014200)),
(52294,(178341,1001012)),
(52294,(178341,1022245)),
(52294,(178341,1014724)),
(52294,(178341,1010731)),
(52296,(178343,1021134)),
(52296,(178343,1021133)),
(52296,(178343,1014040)),
(52296,(178343,1019043)),
(52295,(178342,1023399)),
(52295,(178342,1016840)),
(52295,(178342,1021840)),
(52295,(178342,1014040)),
(52293,(178338,1016840)),
(52293,(178338,1014040))
As the joined data above shows, each record now has three fields: the order id, the buyer id, and the product id.
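This one-to-many expansion (each order id repeated once per matching item) is exactly what a pair join produces. A minimal plain-Scala sketch of the behavior, with no SparkContext needed (the generic join helper here is illustrative, not a Spark API):

```scala
object JoinSketch {
  // Inner join of two key-value lists on the key, like an RDD join:
  // each left record pairs with every right record sharing its key.
  def join[K, A, B](left: List[(K, A)], right: List[(K, B)]): List[(K, (A, B))] =
    for {
      (k, a)  <- left
      (k2, b) <- right
      if k == k2
    } yield (k, (a, b))
}
```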
Averaging: the e-commerce site keeps visit statistics for its products. An existing file goods_visit stores all products and their click counts. Another file, goods, records basic product information. The two tables' structures are as follows:
goods table: product id (goods_id), product status (goods_status), category id (cat_id), score (goods_score)
goods_visit table: product id (goods_id), product clicks (click_num)
The goods table and the goods_visit table can be joined by product id. Now compute, for each category, the average number of clicks per product.
1. On Linux, create the directory /data/spark3/avg for storing the experimental data.
mkdir -p /data/spark3/avg
Switch to the /data/spark3/avg directory and download the experimental data from http://192.168.1.100:60000/allfiles/spark3/avg/goods and http://192.168.1.100:60000/allfiles/spark3/avg/goods_visit.
cd /data/spark3/avg
wget http://192.168.1.100:60000/allfiles/spark3/avg/goods
wget http://192.168.1.100:60000/allfiles/spark3/avg/goods_visit
2. Create the /myspark3/avg directory on HDFS, and upload the data under /data/spark3/avg on Linux to /myspark3/avg on HDFS.
hadoop fs -mkdir -p /myspark3/avg
hadoop fs -put /data/spark3/avg/goods /myspark3/avg
hadoop fs -put /data/spark3/avg/goods_visit /myspark3/avg
3. In the spark-shell window, create two RDDs by loading the goods and goods_visit data files:
val rdd1 = sc.textFile("hdfs://localhost:9000/myspark3/avg/goods")
val rdd2 = sc.textFile("hdfs://localhost:9000/myspark3/avg/goods_visit")
4. Our goal is to compute, for each category, the average number of clicks per product. This can be done in three steps.
First, map rdd1 and rdd2 to pull out the two relevant columns from each:
val rdd11 = rdd1.map(line=> (line.split('\t')(0), line.split('\t')(2)) )
val rdd22 = rdd2.map(line=> (line.split('\t')(0), line.split('\t')(1)) )
Trigger the computation with the collect() method:
rdd11.collect
The rdd11 results are as follows:
res2: Array[(String, String)] = Array((1000002,52137), (1000003,52137), (1000004,52137), (1000006,52137),
(1000007,52137), (1000008,52137), (1000010,52137), (1000011,52137), (1000015,52137), (1000018,52137),
(1000020,52137), (1000021,52137), (1000025,52137), (1000028,52137), (1000030,52137), (1000033,52137),
(1000035,52137), (1000037,52137), (1000041,52137), (1000044,52137), (1000048,52137), (1000050,52137),
(1000053,52137), (1000057,52137), (1000059,52137), (1000063,52137), (1000065,52137), (1000067,52137),
(1000071,52137), (1000073,52137), (1000076,52137), (1000078,52137), (1000080,52137), (1000082,52137),
(1000084,52137), (1000086,52137), (1000087,52137), (1000088,52137), (1000090,52137), (1000091,52137),
(1000094,52137), (1000098,52137), (1000101,52137), (1000103,52137), (1000106,52...
Trigger the computation with the collect() method:
rdd22.collect
The rdd22 results are as follows:
res3: Array[(String, String)] = Array((1010000,4), (1010001,0), (1010002,0), (1010003,0), (1010004,0),
(1010005,0), (1010006,74), (1010007,0), (1010008,0), (1010009,1081), (1010010,0), (1010011,0), (1010012,0),
(1010013,44), (1010014,1), (1010018,0), (1010019,542), (1010020,1395), (1010021,18), (1010022,13), (1010023,27),
(1010024,22), (1010025,295), (1010026,13), (1010027,1), (1010028,410), (1010029,2), (1010030,8), (1010031,6),
(1010032,729), (1010033,72), (1010034,3), (1010035,328), (1010036,153), (1010037,100), (1010038,4), (1010039,3),
(1010040,69), (1010041,1), (1010042,1), (1010043,21), (1010044,268), (1010045,11), (1010046,1), (1010047,1),
(1010048,59), (1010049,15), (1010050,19), (1010051,424), (1010052,462), (1010053,9), (1010054,41), (1010055,64),
(1010056,10), (1010057,3), (...
Then join rdd11 and rdd22 by product id, the key, to produce one combined table. Each record now has the structure: (product id, (category id, product clicks)).
val rddjoin = rdd11 join rdd22
Trigger the computation with the collect() method:
rddjoin.collect
The rddjoin results are as follows:
rddjoin.collect
res4: Array[(String, (String, String))] = Array((1013900,(52137,0)), (1010068,(52007,1316)), (1018970,(52006,788)),
(1020975,(52091,68)), (1019960,(52111,0)), (1019667,(52045,16)), (1010800,(52137,6)), (1019229,(52137,20)), (1022649,
(52119,90)), (1020382,(52137,0)), (1022667,(52021,150)), (1017258,(52086,0)), (1021963,(52072,83)), (1015809,(52137,285)),
(1024340,(52084,0)), (1011043,(52132,0)), (1011762,(52137,2)), (1010976,(52132,34)), (1010512,(52090,8)), (1023965,(52095,0)),
(1017285,(52069,41)), (1020212,(52026,46)), (1010743,(52137,0)), (1020524,(52064,52)), (1022577,(52090,13)), (1021974,(52069,22)),
(1010543,(52137,0)), (1010598,(52136,53)), (1017212,(52108,45)), (1010035,(52006,328)), (1010947,(52089,8)), (1020964,(52071,86)),
(1024001, (52063,0)), (1,020,191, (52046,0)), (1,015,739, (...
Finally, run the statistics on the combined table to get, for each category, the average number of clicks per product.
rddjoin.map(x => (x._2._1, (x._2._2.toLong, 1))).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)).map(x => (x._1, x._2._1 * 1.0 / x._2._2)).collect
Output the result and view it:
scala> rddjoin.map(x=>{(x._2._1, (x._2._2.toLong, 1))}).reduceByKey((x,y)=>{(x._1+y._1, x._2+y._2)}).map(x=>
{(x._1, x._2._1*1.0/x._2._2)}).collect
res40: Array[(String, Double)] = Array((52009,463.3642857142857), (52135,36.69230769230769), (52128,9.0), (52072,42.8),
(52078,16.5), (52137,34.735241502683365), (52047,20.96551724137931), (52050,0.0), (52056,24.57894736842105),
(52087,17.008928571428573), (52085,31.17142857142857), (52007,547.3076923076923), (52052,19.6), (52081,50.833333333333336),
(52016,106.75), (52058,34.23170731707317), (52124,0.0), (52092,28.453703703703702), (52065,8.644444444444444), (52106,22.5),
(52120,96.7843137254902), (52027,114.7), (52089,17.81159420289855), (52098,57.793103448275865), (52038,74.2), (52061,52.609375),
(52104,49.0), (52014,45.4), (52012 , 53.26), (52100,22.0), (52043,23.0), (52030,532.48), (52023,150.0), (52083,57.857142857142854),
(52041,40.0), (52049,18.058823529411764), (52074,33.17647058 ...
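The (sum, count) pattern used here, pairing each value with 1, adding sums and counts per key, then dividing, works the same way on plain Scala collections, which is a convenient way to sanity-check the logic without a cluster (groupBy plus a per-group reduce stands in for reduceByKey):

```scala
object AvgSketch {
  // Average clicks per category: turn (category, clicks) into
  // (category, (sum, count)), reduce per key, then divide sum by count.
  def avgByKey(pairs: List[(String, Long)]): Map[String, Double] =
    pairs.map { case (k, v) => (k, (v, 1)) }
      .groupBy(_._1)
      .map { case (k, vs) =>
        val (sum, cnt) = vs.map(_._2).reduce((x, y) => (x._1 + y._1, x._2 + y._2))
        (k, sum * 1.0 / cnt)
      }
}
```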