Use PySpark to process data
1. Data preparation
This article is a first worked example of PySpark; the data comes from the internet. Two data files are used: action.txt and document.txt. action.txt has the format userid-docid-behavior-time-ip, that is: user ID - document ID - behavior - date - IP address.
document.txt has the format docid-channelname-source-keyword:score, that is: document ID - channel (broad category) - topic (fine-grained category) - keyword:weight.
2. User click-through rate
The user click-through rate is, for each user in action.txt, the number of 1s (clicks) divided by the number of 0s (browses) in the behavior column.
1. Create a SparkSession object
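A minimal sketch of this step; the application name is an arbitrary choice:

```python
from pyspark.sql import SparkSession

# Entry point for the DataFrame and SQL APIs
spark = SparkSession.builder.appName("pyspark_example").getOrCreate()
sc = spark.sparkContext
```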
2. Read and split the data
Split each line on '~' to obtain the userid and behavior columns, as sketched below.
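A sketch of the read-and-split step; the file path is an assumption, and fields 0 and 2 follow the userid-docid-behavior-time-ip layout described above:

```python
# Read action.txt as an RDD of raw lines (adjust the path as needed)
lines = sc.textFile("action.txt")

# Each line looks like userid~docid~behavior~time~ip;
# keep field 0 (userid) and field 2 (behavior)
pairs = lines.map(lambda line: line.split("~")) \
             .map(lambda f: (f[0], f[2]))
```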
3. Count the number of various behaviors of users
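One way to do this count on the RDD from the previous step:

```python
# Count how many times each (userid, behavior) combination occurs
counts = pairs.map(lambda p: (p, 1)).reduceByKey(lambda a, b: a + b)
```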
4. Convert to DataFrame format
Take userid, behavior, and the count as three columns and convert them into a DataFrame.
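A sketch of the conversion, flattening the keyed counts from the previous step:

```python
# Flatten ((userid, behavior), cnt) into three columns
rows = counts.map(lambda kv: (kv[0][0], kv[0][1], kv[1]))
df = spark.createDataFrame(rows, ["userid", "behavior", "cnt"])
```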
5. Process the behavior column
Group by userid and pivot the distinct behavior values into columns whose cell values are the counts (cnt), replacing 0 with "browse" and 1 with "click".
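A sketch of the relabel-and-pivot step on the DataFrame built above:

```python
from pyspark.sql import functions as F

# Map the 0/1 codes to readable labels, then pivot the behavior
# values into columns with cnt as the cell values, one row per user
labeled = df.withColumn(
    "behavior",
    F.when(F.col("behavior") == "0", "browse").otherwise("click"))
pivoted = labeled.groupBy("userid") \
                 .pivot("behavior", ["browse", "click"]) \
                 .sum("cnt")
```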
6. Fill in missing values
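A user with only one kind of behavior gets null in the other column after the pivot, so missing counts are treated as zero:

```python
# Replace the nulls introduced by the pivot with 0
pivoted = pivoted.fillna(0)
```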
7. Add the click-through rate as a new column
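Following the definition above, a sketch of the new column:

```python
# Click-through rate: clicks divided by browses
# (Spark SQL yields null where the browse count is zero)
result = pivoted.withColumn("ctr", F.col("click") / F.col("browse"))
```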
8. Save and close
Save the processed data locally and close the SparkSession. Spark writes the saved data as multiple part files, one per partition.
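A sketch of the save-and-close step; the output path and CSV format are assumptions:

```python
# Spark writes one part file per partition under the output directory
result.write.mode("overwrite").csv("user_ctr_output", header=True)

# Release the resources held by the session
spark.stop()
```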
3. User topic tagging
Tag users with topics (fine-grained categories).
1. Read the data
Read document.txt to get the two columns docid and source, that is, the document ID and the topic (fine-grained category).
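A sketch of the read step; the '~' delimiter is an assumption, mirroring action.txt:

```python
# Each line is docid~channelname~source~keyword:score;
# field 0 is docid and field 2 is source (the topic)
doc_df = spark.createDataFrame(
    sc.textFile("document.txt")
      .map(lambda line: line.split("~"))
      .map(lambda f: (f[0], f[2])),
    ["docid", "source"])
```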
2. Create temporary views for the two DataFrames
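A sketch of the view registration; the action DataFrame with userid and docid is assumed to be built from action.txt in the same way as doc_df:

```python
# Build an action DataFrame with the columns needed for the join
action_df = spark.createDataFrame(
    sc.textFile("action.txt")
      .map(lambda line: line.split("~"))
      .map(lambda f: (f[0], f[1])),  # userid, docid
    ["userid", "docid"])

# Register both DataFrames so Spark SQL can reference them by name
action_df.createOrReplaceTempView("action")
doc_df.createOrReplaceTempView("document")
```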
3. Run the join query
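A sketch of the query; the exact columns and aggregation are assumptions about the intended output:

```python
# Join the two views on docid so each user is linked to the
# topics of the documents they acted on
user_topics = spark.sql("""
    SELECT a.userid, d.source AS topic, COUNT(*) AS cnt
    FROM action a
    JOIN document d ON a.docid = d.docid
    GROUP BY a.userid, d.source
""")
```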
4. Save and close
As in the previous section, save the processed data locally and then close the SparkSession.
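A sketch of this final step; the output path is an assumption:

```python
# Persist the per-user topic tags, then shut the session down
user_topics.write.mode("overwrite").csv("user_topics_output", header=True)
spark.stop()
```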
Pitfalls
1. During development, each transformation can be followed by an action (such as show()) to make the intermediate data easy to inspect. When running a batch job, do not keep those actions; only the final action is needed, since every extra action adds a lot of unnecessary work for the cluster.
2. A DataFrame produced in an intermediate step must be registered as a temporary view before it can be referenced in a later Spark SQL query; otherwise the query fails with an error.