Spark shell operations

RDDs support two types of operations: transformations (which return a new RDD) and actions (which return a value to the driver); a short spark-shell sketch after the list below illustrates a few of them.

1. Transformations: create a new RDD from an existing dataset.

(1) map(func): applies func to each element of the RDD and returns a new RDD built from the results; the returned dataset is a distributed dataset.

(2) filter(func): applies func to each element of the RDD and returns a new RDD consisting of the elements for which func returns true.

(3) flatMap(func): similar to map, but each input element can produce multiple output elements.

(4) mapPartitions(func): similar to map, but while map runs on each element, mapPartitions runs on each partition.

(5) mapPartitionsWithSplit(func): similar to mapPartitions, but func operates on one split (partition) at a time and therefore also receives the partition index.

(6) sample(withReplacement, fraction, seed): samples the dataset, with or without replacement, using the given fraction and random seed.

(7) union(otherDataset): returns a new dataset containing the union of the elements in the source dataset and the given dataset.

(8) distinct([numTasks]): returns a new dataset containing the distinct elements of the source dataset.

(9) groupByKey([numTasks]): returns a dataset of (K, Seq[V]) pairs, analogous to the key and value list that a reduce function receives in Hadoop.

(10) reduceByKey(func, [numTasks]): applies the given reduce func to the (K, Seq[V]) groups that groupByKey would produce, e.g. to compute a sum or an average.

(11) sortByKey([ascending], [numTasks]): sorts by key in ascending or descending order; ascending is a boolean.

2. Actions: run a computation on the RDD and return a result value or write it to external storage.

(1) reduce(func): aggregates the elements of the dataset; func takes two input parameters and returns one value, and it must be commutative and associative.

(2) collect(): returns the whole dataset to the driver as an array; generally used when the result is small enough, for example after a filter.

(3) count(): returns the number of elements in the dataset.

(4) first(): returns the first element of the dataset.

(5) take(n): returns the first n elements.

(6) takeSample(withReplacement, num, seed): returns a sample of num elements of the dataset, using the random seed seed.

(7) saveAsTextFile(path): writes the dataset as a text file to the local file system, HDFS, or another file system supported by Hadoop; Spark converts each record into a line of text and writes it to the file.

(8) saveAsSequenceFile(path): can only be used on key-value RDDs; generates a SequenceFile and writes it to the local file system or to Hadoop (HDFS).

(9) countByKey(): for a key-value RDD, returns a map from each key to its count.

(10) foreach(func): applies func to each element of the dataset.
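
As a minimal illustration (a sketch assuming a running spark-shell where the SparkContext is available as sc; not part of the original experiments), the following exercises a few of the operations above on a small in-memory collection:

val nums = sc.parallelize(List(1, 2, 3, 4, 5))   // build an RDD from a local collection  
val doubled = nums.map(_ * 2)                    // transformation: map each element to twice its value  
val evens = doubled.filter(_ % 4 == 0)           // transformation: keep only multiples of 4  
evens.collect                                    // action: returns Array(4, 8)  
nums.reduce(_ + _)                               // action: returns 15  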

The user favorites data has three columns: user ID (buyer_id), product ID (goods_id), favorite date (dt).

buyer_id   goods_id   dt  
10181   1000481   2010-04-04 16:54:31  
20001   1001597   2010-04-07 15:07:52  
20001   1001560   2010-04-07 15:08:27  
20042   1001368   2010-04-08 08:20:30  
20067   1002061   2010-04-08 16:45:33  
20056   1003289   2010-04-12 10:50:55  
20056   1003290   2010-04-12 11:57:35  
20056   1003292   2010-04-12 12:05:29  
20054   1002420   2010-04-14 15:24:12  
20055   1001679   2010-04-14 19:46:04  
20054   1010675   2010-04-14 15:23:53  
20054   1002429   2010-04-14 17:52:45  
20076   1002427   2010-04-14 19:35:39  
20054   1003326   2010-04-20 12:54:44  
20056   1002420   2010-04-15 11:24:49  
20064   1002422   2010-04-15 11:35:54  
20056   1003066   2010-04-15 11:43:01  
20056   1003055   2010-04-15 11:43:06  
20056   1010183   2010-04-15 11:45:24  
20056   1002422   2010-04-15 11:45:49  
20056   1003100   2010-04-15 11:45:54  
20056   1003094   2010-04-15 11:45:57  
20056   1003064   2010-04-15 11:46:04  
20056   1010178   2010-04-15 16:15:20  
20076   1003101   2010-04-15 16:37:27  
20076   1003103   2010-04-15 16:37:05  
20076   1003100   2010-04-15 16:37:18  
20076   1003066   2010-04-15 16:37:31  
20054   1003103   2010-04-15 16:40:14  
20054   1003100   2010-04-15 16:40:16  

Now count, in the user favorites data, the number of products each user has favorited.

1. On Linux, create the directory /data/spark3/wordcount for storing the experimental data.

mkdir -p /data/spark3/wordcount  

Switch to the /data/spark3/wordcount directory and download the experimental data from http://192.168.1.100:60000/allfiles/spark3/wordcount/buyer_favorite.

cd /data/spark3/wordcount  
wget http://192.168.1.100:60000/allfiles/spark3/wordcount/buyer_favorite  

2. Use jps to check whether the Hadoop and Spark processes have been started; if not, run the start commands.

jps  
/apps/hadoop/sbin/start-all.sh  
/apps/spark/sbin/start-all.sh  

Upload the local Linux file /data/spark3/wordcount/buyer_favorite to the /myspark3/wordcount directory on HDFS. If the /myspark3 directory does not exist on HDFS, create it first.

hadoop fs -mkdir -p /myspark3/wordcount  
hadoop fs -put /data/spark3/wordcount/buyer_favorite /myspark3/wordcount  

3. Start spark-shell

spark-shell  

4. Write Scala statements to count, in the user favorites data, the number of products each user has favorited.

First, load the data in spark-shell.

val rdd = sc.textFile("hdfs://localhost:9000/myspark3/wordcount/buyer_favorite");  

Then compute and output the statistics.

rdd.map(line=> (line.split('\t')(0),1)).reduceByKey(_+_).collect  
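
As an alternative sketch (same result, using nothing beyond the actions listed above), countByKey returns the per-user counts directly to the driver as a Map:

rdd.map(line => (line.split('\t')(0), 1)).countByKey  // counts occurrences of each buyer_id key  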

De-duplication: use spark-shell to process the user favorites data file from the experiment above. De-duplicate by product ID to find out which products have been favorited in the user favorites data.

1. On Linux, create the directory /data/spark3/distinct for storing the experimental data.

mkdir -p /data/spark3/distinct  

Switch to the /data/spark3/distinct directory and download the experimental data from http://192.168.1.100:60000/allfiles/spark3/distinct/buyer_favorite.

cd /data/spark3/distinct  
wget http://192.168.1.100:60000/allfiles/spark3/distinct/buyer_favorite  

2. Use jps to check the Hadoop and Spark processes and make sure the Hadoop and Spark framework processes are running.

3. Upload the /data/spark3/distinct/buyer_favorite file to the /myspark3/distinct directory on HDFS. If the directory does not exist on HDFS, create it first.

hadoop fs -mkdir -p /myspark3/distinct  
hadoop fs -put /data/spark3/distinct/buyer_favorite /myspark3/distinct  

4. In the Spark window, write Scala statements to find which products have been favorited in the user favorites data.

Load the data and create an RDD.

val rdd = sc.textFile("hdfs://localhost:9000/myspark3/distinct/buyer_favorite");  

Compute the statistics on the RDD and print the result.

rdd.map(line => line.split('\t')(1)).distinct.collect  
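
A small follow-up (an optional check, not part of the original steps): the count action returns how many distinct products appear in the favorites data:

rdd.map(line => line.split('\t')(1)).distinct.count  // number of distinct goods_id values  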

Sorting: e-commerce sites keep statistics on how their products are visited. An existing file, goods_visit, stores the site's products and the number of clicks each product received.

Product ID (goods_id), clicks (click_num)

goods_id  click_num  
1010037 100  
1010102 100  
1010152 97  
1010178 96  
1010280 104  
1010320 103  
1010510 104  
1010603 96  
1010637 97  

Now sort the products by number of clicks and output all of them.

The expected output format:

click_num  goods_id  
96  1010603  
96  1010178  
97  1010637  
97  1010152  
100 1010102  
100 1010037  
103 1010320  
104 1010510  
104 1010280  

1. On Linux, create the directory /data/spark3/sort for storing the experimental data.

mkdir -p /data/spark3/sort  

Switch to the /data/spark3/sort directory and download the experimental data from http://192.168.1.100:60000/allfiles/spark3/sort/goods_visit.

cd /data/spark3/sort  
wget http://192.168.1.100:60000/allfiles/spark3/sort/goods_visit  

2. Upload the /data/spark3/sort/goods_visit file to the /myspark3/sort directory on HDFS. If the directory does not exist on HDFS, create it first.

hadoop fs -mkdir -p /myspark3/sort  
hadoop fs -put /data/spark3/sort/goods_visit /myspark3/sort  

3. In the Spark window, load the data and convert it into an RDD.

val rdd1 = sc.textFile("hdfs://localhost:9000/myspark3/sort/goods_visit");  

Compute the statistics on the RDD and print the result.

rdd1.map(line => ( line.split('\t')(1).toInt, line.split('\t')(0) ) ).sortByKey(true).collect  
4. The output follows the format shown above: (click count, product ID) pairs sorted by click count in ascending order.
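
For a descending sort (a variant sketch, not one of the original steps), pass false to sortByKey, since its ascending parameter is a boolean as noted in the transformation list:

rdd1.map(line => (line.split('\t')(1).toInt, line.split('\t')(0))).sortByKey(false).collect  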


Join: here is part of an e-commerce site's transaction data from December 15, 2011. The data consists of an orders table (orders) and an order detail table (order_items); their structures and data are as follows:

orders table: (order ID order_id, order number order_number, buyer ID buyer_id, order date create_dt)

order_id  order_number  buyer_id  create_dt  
52304  111215052630  176474  2011-12-15 04:58:21  
52303  111215052629  178350  2011-12-15 04:45:31  
52302  111215052628  172296  2011-12-15 03:12:23  
52301  111215052627  178348  2011-12-15 02:37:32  
52300  111215052626  174893  2011-12-15 02:18:56  
52299  111215052625  169471  2011-12-15 01:33:46  
52298  111215052624  178345  2011-12-15 01:04:41  
52297  111215052623  176369  2011-12-15 01:02:20  
52296  111215052622  178343  2011-12-15 00:38:02  
52295  111215052621  178342  2011-12-15 00:18:43  
52294  111215052620  178341  2011-12-15 00:14:37  
52293  111215052619  178338  2011-12-15 00:13:07  
order_items table: (item ID item_id, order ID order_id, product ID goods_id)

item_id  order_id  goods_id  
252578  52293  1016840  
252579  52293  1014040  
252580  52294  1014200  
252581  52294  1001012  
252582  52294  1022245  
252583  52294  1014724  
252584  52294  1010731  
252586  52295  1023399  
252587  52295  1016840  
252592  52296  1021134  
252593  52296  1021133  
252585  52295  1021840  
252588  52295  1014040  
252589  52296  1014040  
252590  52296  1019043  

The orders table and the order_items table are related through the order ID in a one-to-many relationship.

Now open spark-shell and query which users bought which products on the site that day.

1. On Linux, create the directory /data/spark3/join for storing the experimental data.

mkdir -p /data/spark3/join  

Switch to the /data/spark3/join directory and download the experimental data from http://192.168.1.100:60000/allfiles/spark3/join/order_items and http://192.168.1.100:60000/allfiles/spark3/join/orders.

cd /data/spark3/join  
wget http://192.168.1.100:60000/allfiles/spark3/join/order_items  
wget http://192.168.1.100:60000/allfiles/spark3/join/orders  

2. Create the /myspark3/join directory on HDFS and upload the data from the Linux /data/spark3/join directory to it.

hadoop fs -mkdir -p /myspark3/join  
hadoop fs -put /data/spark3/join/orders /myspark3/join  
hadoop fs -put /data/spark3/join/order_items /myspark3/join  

3. In the Spark window, create two RDDs by loading the data in the orders file and the order_items file.

val rdd1 = sc.textFile("hdfs://localhost:9000/myspark3/join/orders");  
val rdd2 = sc.textFile("hdfs://localhost:9000/myspark3/join/order_items");  

4. Our goal is to query which products each user bought, so map rdd1 and rdd2 to extract the two key columns from each.

val rdd11 = rdd1.map(line => (line.split('\t')(0), line.split('\t')(2)))  
val rdd22 = rdd2.map(line => (line.split('\t')(1), line.split('\t')(2)))  
5. Join the data in rdd11 and rdd22 by key (the order ID) to obtain the final result.

val rddresult = rdd11 join rdd22  

6. Finally, output the result and inspect it.

rddresult.collect  

7. The final output data looks as follows:

(52294,(178341,1014200)),  
(52294,(178341,1001012)),  
(52294,(178341,1022245)),  
(52294,(178341,1014724)),  
(52294,(178341,1010731)),  
(52296,(178343,1021134)),  
(52296,(178343,1021133)),  
(52296,(178343,1014040)),  
(52296,(178343,1019043)),  
(52295,(178342,1023399)),  
(52295,(178342,1016840)),  
(52295,(178342,1021840)),  
(52295,(178342,1014040)),  
(52293,(178338,1016840)),  
(52293,(178338,1014040)) 

From the joined data above we can see that each record has three parts: the order ID, the buyer ID, and the product ID.
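
As an optional extra step (not in the original), since the values of rddresult already pair the buyer ID with the product ID, one more map drops the order ID and leaves just "which user bought which product":

rddresult.map(x => x._2).collect  // (buyer_id, goods_id) pairs  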

Averaging: the e-commerce site keeps statistics on product visits. An existing file, goods_visit, stores all products and each product's click count. There is also a file, goods, that records basic product information. The two tables have the following structures:

goods table: product ID (goods_id), product status (goods_status), category ID (cat_id), rating (goods_score)

goods_visit table: product ID (goods_id), product clicks (click_num)

The goods table and the goods_visit table can be joined on the product ID. Now compute, for each category, the average number of clicks per product.

1. On Linux, create the directory /data/spark3/avg for storing the experimental data.

mkdir -p /data/spark3/avg  

Switch to the /data/spark3/avg directory and download the two experimental data files from http://192.168.1.100:60000/allfiles/spark3/avg/goods and http://192.168.1.100:60000/allfiles/spark3/avg/goods_visit.

cd /data/spark3/avg  
wget http://192.168.1.100:60000/allfiles/spark3/avg/goods  
wget http://192.168.1.100:60000/allfiles/spark3/avg/goods_visit  

2. Create the directory /myspark3/avg on HDFS and upload the data from the Linux /data/spark3/avg directory to /myspark3/avg on HDFS.

hadoop fs -mkdir -p /myspark3/avg  
hadoop fs -put /data/spark3/avg/goods /myspark3/avg  
hadoop fs -put /data/spark3/avg/goods_visit /myspark3/avg  

3. In the Spark window, create two RDDs by loading the data in the goods file and the goods_visit file.

val rdd1 = sc.textFile("hdfs://localhost:9000/myspark3/avg/goods")  
val rdd2 = sc.textFile("hdfs://localhost:9000/myspark3/avg/goods_visit")  

4. Our goal is to compute, for each category, the average number of clicks per product; this can be done in three steps.

First, map rdd1 and rdd2 to extract the two key columns from each.

val rdd11 = rdd1.map(line=> (line.split('\t')(0), line.split('\t')(2)) )  
val rdd22 = rdd2.map(line=> (line.split('\t')(0), line.split('\t')(1)) )  

Trigger the computation with the collect() method.

rdd11.collect  

The contents of rdd11 are as follows:

rdd11.collect  
res2: Array[(String, String)] = Array((1000002,52137), (1000003,52137), (1000004,52137), (1000006,52137),  
(1000007,52137), (1000008,52137), (1000010,52137), (1000011,52137), (1000015,52137), (1000018,52137),  
(1000020,52137), (1000021,52137), (1000025,52137), (1000028,52137), (1000030,52137), (1000033,52137),  
(1000035,52137), (1000037,52137), (1000041,52137), (1000044,52137), (1000048,52137), (1000050,52137),  
(1000053,52137), (1000057,52137), (1000059,52137), (1000063,52137), (1000065,52137), (1000067,52137),  
(1000071,52137), (1000073,52137), (1000076,52137), (1000078,52137), (1000080,52137), (1000082,52137),  
(1000084,52137), (1000086,52137), (1000087,52137), (1000088,52137), (1000090,52137), (1000091,52137),  
(1000094,52137), (1000098,52137), (1000101,52137), (1000103,52137), (1000106,52...  
scala>  

Trigger the computation with the collect() method.

rdd22.collect  

The contents of rdd22 are as follows:

rdd22.collect  

res3: Array[(String, String)] = Array((1010000,4), (1010001,0), (1010002,0), (1010003,0), (1010004,0),
(1010005,0), (1010006,74), (1010007,0), (1010008,0), (1010009,1081), (1010010,0), (1010011,0), (1010012,0),
(1010013,44), (1010014,1), (1010018,0), (1010019,542), (1010020,1395), (1010021,18), (1010022,13), (1010023,27),
(1010024,22), (1010025,295), (1010026,13), (1010027,1), (1010028,410), (1010029,2), (1010030,8), (1010031,6),
(1010032,729), (1010033,72), (1010034,3), (1010035,328), (1010036,153), (1010037,100), (1010038,4), (1010039,3),
(1010040,69), (1010041,1), (1010042,1), (1010043,21), (1010044,268), (1010045,11), (1010046,1), (1010047,1),
(1010048,59), (1010049,15), (1010050,19), (1010051,424), (1010052,462), (1010053,9), (1010054,41), (1010055,64),
(1010056,10), (1010057,3), (...
scala>  
Then join rdd11 and rdd22 on the product ID (the key) to obtain one large table. Its structure becomes: (product ID, (category ID, product clicks)).

val rddjoin = rdd11 join rdd22  
Trigger the computation with the collect() method.

rddjoin.collect
The contents of rddjoin are as follows:

rddjoin.collect

res4: Array[(String, (String, String))] = Array((1013900,(52137,0)), (1010068,(52007,1316)), (1018970,(52006,788)),
(1020975,(52091,68)), (1019960,(52111,0)), (1019667,(52045,16)), (1010800,(52137,6)), (1019229,(52137,20)), (1022649,
(52119,90)), (1020382,(52137,0)), (1022667,(52021,150)), (1017258,(52086,0)), (1021963,(52072,83)), (1015809,(52137,285)),
(1024340,(52084,0)), (1011043,(52132,0)), (1011762,(52137,2)), (1010976,(52132,34)), (1010512,(52090,8)), (1023965,(52095,0)),
(1017285,(52069,41)), (1020212,(52026,46)), (1010743,(52137,0)), (1020524,(52064,52)), (1022577,(52090,13)), (1021974,(52069,22)),
(1010543,(52137,0)), (1010598,(52136,53)), (1017212,(52108,45)), (1010035,(52006,328)), (1010947,(52089,8)), (1020964,(52071,86)),
(1024001,(52063,0)), (1020191,(52046,0)), (1015739,(...
scala>  
Finally, compute statistics over this large table to obtain, for each category, the average number of clicks per product.

rddjoin.map(x=>{(x._2._1, (x._2._2.toLong, 1))}).reduceByKey((x,y)=>{(x._1+y._1, x._2+y._2)}).map(x=>{(x._1, x._2._1*1.0/x._2._2)}).collect  
Execute the statement and view the output.

scala> rddjoin.map(x=>{(x._2._1, (x._2._2.toLong, 1))}).reduceByKey((x,y)=>{(x._1+y._1, x._2+y._2)}).map(x=>
{(x._1, x._2._1*1.0/x._2._2)}).collect
res40: Array[(String, Double)] = Array((52009,463.3642857142857), (52135,36.69230769230769), (52128,9.0), (52072,42.8),
(52078,16.5), (52137,34.735241502683365), (52047,20.96551724137931), (52050,0.0), (52056,24.57894736842105),
(52087,17.008928571428573), (52085,31.17142857142857), (52007,547.3076923076923), (52052,19.6), (52081,50.833333333333336),
(52016,106.75), (52058,34.23170731707317), (52124,0.0), (52092,28.453703703703702), (52065,8.644444444444444), (52106,22.5),
(52120,96.7843137254902), (52027,114.7), (52089,17.81159420289855), (52098,57.793103448275865), (52038,74.2), (52061,52.609375),
(52104,49.0), (52014,45.4), (52012,53.26), (52100,22.0), (52043,23.0), (52030,532.48), (52023,150.0), (52083,57.857142857142854),
(52041,40.0), (52049,18.058823529411764), (52074,33.17647058 ...
scala>  
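
As an optional follow-up sketch (an assumption, not part of the original steps; the output path /myspark3/avg_result is hypothetical), the per-category averages could also be written back to HDFS with the saveAsTextFile action listed earlier:

val avgByCat = rddjoin.map(x => (x._2._1, (x._2._2.toLong, 1))).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)).map(x => (x._1, x._2._1 * 1.0 / x._2._2))  
avgByCat.saveAsTextFile("hdfs://localhost:9000/myspark3/avg_result")  // writes one text file per partition under the hypothetical path  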

Origin: www.cnblogs.com/hannahzhao/p/11959999.html