Use PySpark to process data



1. Data preparation

This article is an introductory PySpark example; the data comes from the internet. Two data files are used: action.txt and document.txt. The first file, action.txt, has the format userid~docid~behavior~time~ip, that is: user ID ~ document ID ~ behavior ~ date ~ IP address.
The second file, document.txt, has the format docid~channelname~source~keyword:score, that is: document ID ~ channel (broad category) ~ topic (fine category) ~ keyword:weight.

2. User click-through rate

The user click rate is, for each user in action.txt, the number of 1s (clicks) divided by the number of 0s (browses) in the behavior column.

1. Create a SparkSession object

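A minimal sketch of this step (the application name is an arbitrary choice):

```python
from pyspark.sql import SparkSession

# Entry point for the DataFrame/SQL API
spark = (SparkSession.builder
         .appName("pyspark_demo")
         .getOrCreate())
```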

2. Read and split the data

Split each line on '~' to get the userid and behavior columns.
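A sketch of the read-and-split step, assuming action.txt sits in the working directory and behavior is the third field:

```python
# Read action.txt, split each line on '~', and keep
# userid (field 0) and behavior (field 2).
action_rdd = (spark.sparkContext.textFile("action.txt")
              .map(lambda line: line.split("~"))
              .map(lambda f: (f[0], f[2])))
```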

3. Count each behavior per user

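One way to do the counting, working from the (userid, behavior) pairs of the previous step:

```python
# Count how many times each (userid, behavior) pair occurs;
# each result element has the shape ((userid, behavior), count).
counts_rdd = (action_rdd
              .map(lambda ub: (ub, 1))
              .reduceByKey(lambda a, b: a + b))
```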

4. Convert to DataFrame format

Take userid, behavior, and count as three columns and convert them to a DataFrame.
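A sketch of the conversion (column names are assumptions):

```python
# Flatten ((userid, behavior), count) into three columns
# and build a DataFrame from the RDD.
counts_df = (counts_rdd
             .map(lambda kv: (kv[0][0], kv[0][1], kv[1]))
             .toDF(["userid", "behavior", "cnt"]))
counts_df.show(5)  # handy while developing; drop in batch runs
```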

5. Process the behavior column

Group by userid, pivot the behavior values into column names with cnt as the cell values, and replace 0 and 1 in behavior with "browse" and "click".
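A sketch of the relabel-and-pivot step using groupBy().pivot():

```python
from pyspark.sql import functions as F

# Map behavior 0 -> "browse" and 1 -> "click", then pivot the
# behavior values into columns, filled with the cnt counts.
pivot_df = (counts_df
            .withColumn("behavior",
                        F.when(F.col("behavior") == "0", "browse")
                         .otherwise("click"))
            .groupBy("userid")
            .pivot("behavior", ["browse", "click"])
            .sum("cnt"))
```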

6. Fill in missing values

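Users who only ever browsed (or only ever clicked) end up with null in the other column after the pivot, so fill those with 0:

```python
# Replace nulls produced by the pivot with 0
pivot_df = pivot_df.fillna(0)
```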

7. Add the calculated click rate as a new column

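A sketch of the new column, following the clicks-divided-by-browses definition above; the column name and the zero guard are my own additions:

```python
# click_rate = clicks / browses; left null when there are no browses
result_df = pivot_df.withColumn(
    "click_rate",
    F.when(F.col("browse") > 0, F.col("click") / F.col("browse")))
```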

8. Save and close

Save the final result locally and close the SparkSession. Because Spark writes one file per partition, the saved output ends up as multiple part files.
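A sketch of the save-and-stop step (the output path is an assumption):

```python
# Write the result as CSV; Spark emits one part file per partition.
result_df.write.mode("overwrite").csv("user_click_rate", header=True)
spark.stop()
```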

3. User tagging

Tag each user with topics (fine categories).

1. Read the data

Read document.txt and extract the docid and source columns, i.e. document ID and topic (fine category).
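A sketch of the read step, assuming source is the third field of document.txt; the userid/docid pairs from action.txt are read here as well, since the later join needs them (paths and column positions are assumptions):

```python
# docid and source (topic) from document.txt
doc_df = (spark.sparkContext.textFile("document.txt")
          .map(lambda line: line.split("~"))
          .map(lambda f: (f[0], f[2]))
          .toDF(["docid", "source"]))

# userid and docid pairs from action.txt, used in the join below
action_df = (spark.sparkContext.textFile("action.txt")
             .map(lambda line: line.split("~"))
             .map(lambda f: (f[0], f[1]))
             .toDF(["userid", "docid"]))
```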

2. Create temporary views for the two DataFrames

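Register both DataFrames so they can be queried with SQL (the view names are assumptions):

```python
# Views referenced from spark.sql() below
action_df.createOrReplaceTempView("action")
doc_df.createOrReplaceTempView("document")
```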

3. Run the join query

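One plausible form of the join, counting how often each user read each topic (the exact SQL in the original is not shown):

```python
# Join user actions with document topics to tag each user
tag_df = spark.sql("""
    SELECT a.userid, d.source, COUNT(*) AS cnt
    FROM action a
    JOIN document d
      ON a.docid = d.docid
    GROUP BY a.userid, d.source
""")
```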

4. Save and close

Save the final result locally and close the SparkSession.
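A sketch of the final save-and-stop step (the output path is an assumption):

```python
tag_df.write.mode("overwrite").csv("user_tags", header=True)
spark.stop()
```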

Pitfalls

1. During development, each transformation can be followed by an action (such as show()) so intermediate data is easy to inspect. In a batch run, only the final action is needed; triggering an action after every step adds a lot of unnecessary work for the cluster.
2. A DataFrame produced in an intermediate step must be registered as a temporary view before it can be referenced in a later SQL query, otherwise an error is raised.


Origin: blog.csdn.net/wh672843916/article/details/111824205