Processing text files in Spark

Business description:

In this project, a user's holdings file needs to be processed and converted into an internal format, with an internal ID generated for each key business item (such as a security). For simplicity, the ID here is a UUID. A sample of the file is shown below; fields are delimited by "|":

20170630|c003a949bce2ed94346c8579a33891b2|123456790|A000AD7| 5620.88000|00000001.00000000|||
20170630|c003a949bce2ed94346c8579a33891b2|23355L106|D043158| 10.0000|00000076.72000000|||
20170630|c003a949bce2ed94346c8579a33891b2|03524A108|A027192| 126.00000|00000017.48000000|||
20170630|478abaeebf564df0cb0b4232053e5129|29278N103|E019306| 474.47000|00000001.00000000|||
20170630|478abaeebf564df0cb0b4232053e5129|219350105|C695958| 50.0000|00000030.05000000|||
20170630|db34e5a988b322a32e9a54607126e10b|123456790|A000AD7| 105773.99000|00000001.00000000|||
20170630|db34e5a988b322a32e9a54607126e10b|29278N103|E019306| 750.0000|00000020.39000000|||
20170630|db34e5a988b322a32e9a54607126e10b|35472P406|F001419| 3813.46300|00000015.36000000|||
20170630|db34e5a988b322a32e9a54607126e10b|345370860|F004464| 1500.0000|00000011.19000000|||
20170630|db34e5a988b322a32e9a54607126e10b|33616C860|F018217| 1000.00000|00000026.85000000|||
20170630|d4efe3d884712369e3fa0d0ebeec1264|33616C860|F018217| 1267.48000|00000001.00000000|||
20170630|d4efe3d884712369e3fa0d0ebeec1264|254709108|D010597|          116.00000|00000062.19000000|||
20170630|d4efe3d884712369e3fa0d0ebeec1264|617446448|M004728|          233.00000|00000044.56000000|||
20170630|93404e788eb4dc9ae8367a96149b86cd|608919726|A000CV9| 17145.68000|00000001.00000000|||
20170630|93404e788eb4dc9ae8367a96149b86cd|045519402|A007023| 280.0000|00000038.13700000|||
20170630|93404e788eb4dc9ae8367a96149b86cd|35472P406|F001419| 1668.00000|00000010.97300000|||
20170630|93404e788eb4dc9ae8367a96149b86cd|G1151C101|A024853| 155.00000|00000123.68000000|||
20170630|93404e788eb4dc9ae8367a96149b86cd|03524A108|A027192| 154.00000|00000110.36000000|||

For this file, we only care about the 3rd and 4th columns, which hold the cusip and the symbol of the security, respectively.
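Note that Scala arrays are 0-based, so after a split the 3rd and 4th columns land at indices 2 and 3. A quick sanity check in a plain Scala REPL, using the first sample line above:

val line = "20170630|c003a949bce2ed94346c8579a33891b2|123456790|A000AD7| 5620.88000|00000001.00000000|||"
val cols = line.split("\\|")
cols(2)  // "123456790" -> cusip (3rd column)
cols(3)  // "A000AD7"   -> symbol (4th column)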

 

1: Redis preparation

Since Redis will be used from Spark, first install the Redis server on the local machine; see the article Redis installation for details.

Find the Redis installation directory and copy jedis-2.9.0.jar into Spark's jars directory, then start the Redis service from a command-line window: redis-server redis.windows.conf

Also open a Redis client in a second command-line window: redis-cli -h 127.0.0.1 -p 6379
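A quick ping in the client window confirms the server is reachable:

127.0.0.1:6379> ping
PONG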

2: Start the Spark shell: spark-shell --jars ..\jars\jedis-2.9.0.jar

3: Program code

Execute the following code in the Spark shell:

import java.util.UUID
import redis.clients.jedis.Jedis

// read the holdings file into 2 partitions
val txtFile = sc.textFile("D:\\temp\\holdings.txt", 2)

// split each line on "|"
val rdd1 = txtFile.map(line => line.split("\\|"))
// keep a (hash of "cusip_symbol", cusip, symbol) tuple per line
val rdd2 = rdd1.map(x => ((x(2) + "_" + x(3)).hashCode, x(2), x(3)))
// de-duplicate the securities (this triggers a shuffle)
val rdd3 = rdd2.distinct

Each of the above RDD operations produces a new RDD. For simplicity and ease of understanding, assume here that the file is read into two partitions; when reading from HDFS, a partition typically corresponds to one 128 MB block.
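The partition count can be verified directly in the same spark-shell session; getNumPartitions is a standard RDD method:

txtFile.getNumPartitions   // 2, as requested in sc.textFile above
rdd3.getNumPartitions      // distinct keeps its parent's partition count by default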

In rdd3, the security tuples are spread across the two partitions, and because the data was shuffled by rdd2.distinct there is no overlap between them. Now, for each security tuple, we need to check whether a record already exists in Redis; if not, generate a UUID and create a key-value record in Redis for it:

val jd = new Jedis("127.0.0.1", 6379)
rdd3.foreach(x => if (jd.get(x._2) == null) jd.set(x._2, UUID.randomUUID().toString()))

Executing this code, we expect key-value pairs to be created in Redis, with the second element of each tuple (the cusip) as the key and the generated UUID as the value. Instead, execution fails with org.apache.spark.SparkException: Task not serializable.

In the Spark shell, a default SparkContext sc is started, and the jd above is created in the Driver. RDD operations, however, run distributed on the Executors, not in the Driver, so for foreach Spark must serialize the closure and ship it to the corresponding Worker nodes. The Jedis object holds a TCP connection bound to the Driver process and cannot be serialized and shipped, hence the error; the Spark architecture diagram of Driver, Cluster Manager, and Worker nodes illustrates this split.

The fix is to create the Redis connection inside each partition, replacing foreach with foreachPartition:

rdd3.foreachPartition(it => {
  val jd = new Jedis("127.0.0.1", 6379)  // created on the Executor, one connection per partition
  it.foreach(x => if (jd.get(x._2) == null) jd.set(x._2, UUID.randomUUID().toString()))
  jd.close()  // release the connection once the partition is processed
})

Execution now succeeds.
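As a follow-up, the assigned IDs can be read back into an RDD using the same per-partition connection pattern; a minimal sketch, assuming the same local Redis instance:

val idRdd = rdd3.mapPartitions(it => {
  val jd = new Jedis("127.0.0.1", 6379)
  // materialize the results before closing the connection, since map is lazy
  val res = it.map(x => (x._2, x._3, jd.get(x._2))).toList  // (cusip, symbol, uuid)
  jd.close()
  res.iterator
})
idRdd.collect().foreach(println)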

In the Redis client window, you can scan the database and use the get command to fetch the corresponding key-value pairs.
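For example (the keys are cusips from the sample file; the UUID shown is illustrative, since values differ on every run):

127.0.0.1:6379> scan 0
1) "0"
2) 1) "123456790"
   2) "29278N103"
   ...
127.0.0.1:6379> get 123456790
"3f2504e0-4f89-41d3-9a0c-0305e82c3301"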

 
