The previous article laid out the basic ideas. A training data table has now been prepared following the data preprocessing scheme from the first article.
Training data | Behavior data for the UI (user-item) set, 11.22~11.27 |
Corresponding big data table name: temp_fin.temp_tianchi_train1_data
Validation data | Behavior data for the UI (user-item) set, 11.29~12.04 |
Corresponding big data table name:
Data preprocessing ideas:
1. Filter out some abnormal training data (UI pairs that were bought but never viewed, and UI pairs that were viewed but never bought)
2. Adjust the proportion of positive and negative samples in the training data
Model building ideas:
1. Try different classification algorithms. Random forest and gradient boosting trees are chosen here because they are relatively insensitive to the ratio of positive to negative samples. Train models with different hyperparameters, record the accuracy and F1 score of each, and tune until the accuracy reaches a local optimum.
2. Check the model against the validation data. If the accuracy does not change much, the model is usable, and it can then be run on the prediction data to produce the final result.
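Since both accuracy and F1 are tracked, it helps to see how they can diverge on imbalanced data. The following is a minimal sketch in plain Python, not the evaluation code used in this series. The function name `binary_metrics` and the toy labels are made up for illustration.

```python
def binary_metrics(labels, predictions):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical toy data: 98 negatives followed by 2 positives.
labels = [0] * 98 + [1] * 2

# A model that always predicts 0 looks accurate but finds no positives.
all_zero = binary_metrics(labels, [0] * 100)

# A model that finds one positive and raises one false alarm.
some_hits = binary_metrics(labels, [1] + [0] * 97 + [1, 0])

print(all_zero)
print(some_hits)
```

Both toy models reach the same accuracy, but only the second has a non-zero F1. This is why accuracy alone is a weak signal when negatives vastly outnumber positives.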
----------------------------------
First attempt:
Train a random forest model directly on the unadjusted training data:
    import os

    import numpy as np
    import pandas as pd
    from pyspark.conf import SparkConf
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.feature import MaxAbsScaler, VectorAssembler
    from pyspark.sql import SparkSession

    # Random forest (RF) and gradient boosting trees (GBDT) are relatively
    # insensitive to the ratio of positive and negative examples, but too many
    # negative samples still affect training and consume resources.
    sparkconf = (
        SparkConf()
        .setAppName("ronaldo0412")
        .set("spark.cores.max", str(8))
        .setExecutorEnv("JAVA_HOME", os.environ["JAVA_HOME"])
        .setExecutorEnv("HADOOP_HDFS_HOME", os.environ["HADOOP_HOME"])
        .setExecutorEnv(
            "LD_LIBRARY_PATH",
            os.environ["JAVA_HOME"] + "/jre/lib/amd64/server:"
            + os.environ["HADOOP_HOME"] + "/lib/native",
        )
    )

    # Create a Spark session object with Hive support enabled
    spark = SparkSession.builder.enableHiveSupport().config(conf=sparkconf).getOrCreate()

    pydf = spark.sql("select * from temp_fin.temp_tianchi_train1_data")
    pydf2 = spark.sql("select * from temp_fin.temp_tianchi_train2_data")

    # Optional: pull the training data into pandas for inspection
    # results = pydf.collect()
    # array_data = np.array(results, dtype=object)
    # columns = ['user_id', 'item_id', 'item_category', 'u_b_count', 'u_b1_count', 'u_b2_count',
    #            'u_b3_count', 'u_b4_count', 'u_b4_rate', 'i_u_count', 'i_b4_rate', 'c_u_count',
    #            'c_b4_rate', 'ui_b_count', 'uc_b_count', 'flag']
    # df = pd.DataFrame(array_data, columns=columns)

    # Combine the feature columns into a single vector column.
    # Both tables share the same columns, so one assembler serves both.
    feature_cols = ['u_b_count', 'u_b1_count', 'u_b2_count', 'u_b3_count',
                    'u_b4_count', 'u_b4_rate', 'i_u_count', 'i_b4_rate',
                    'c_u_count', 'c_b4_rate', 'ui_b_count', 'uc_b_count']
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    assembled = assembler.transform(pydf)
    assembled2 = assembler.transform(pydf2)

    # Feature normalization.
    # MaxAbsScaler keeps zero entries at zero. The scaler is fitted on the
    # training data only and then applied to both tables.
    maScaler = MaxAbsScaler(inputCol="features", outputCol="scaled")
    scaler_model = maScaler.fit(assembled)
    df_train = scaler_model.transform(assembled)
    df_test = scaler_model.transform(assembled2)
    print('Feature processing completed')

    # Build the model
    rf = RandomForestClassifier(numTrees=100, maxDepth=10, seed=42,
                                featuresCol='scaled', labelCol='flag')
    rf_model = rf.fit(df_train)
    print('Model has been created')

    # Predict on the test data and save the results to a table
    resultDF = rf_model.transform(df_test)
    resultDF.select('user_id', 'item_id', 'scaled', 'flag', 'prediction') \
        .write.mode("overwrite").saveAsTable('temp_fin.temp_tianchi_train_test_result')
    print('Test data has been processed')

    # evaluator = MulticlassClassificationEvaluator().setLabelCol("flag") \
    #     .setPredictionCol("prediction").setMetricName("accuracy")
    # predictionAccuracy = evaluator.evaluate(resultDF)
    # print("Testing Accuracy is %s " % (predictionAccuracy * 100) + "%")
The run above neither adjusted the ratio of positive to negative samples nor filtered out abnormal training data. The predictions for the test data were saved to the big data table, and every prediction turned out to be 0. The next step is to adjust the positive-to-negative ratio to 1:30 (using k-means-based sampling) and test again.
----------------------------------------------------
Detailed steps:
1. In temp_fin.temp_tianchi_train1_data there are 1707539 rows with flag=0 (negative examples) and 1445 rows with flag=1 (positive examples), a positive-to-negative ratio of roughly 1:1182. There are clearly far too many negative examples, so their number needs to be reduced. Other articles describe a variety of sampling methods for this.
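To make the scale of the reduction concrete, the arithmetic can be worked out from the two counts above. This is a small illustrative sketch: the helper names `negatives_to_keep` and `sampling_fraction` are made up here, and the target ratios 1:10 and 1:30 are used only as example inputs.

```python
# Class counts reported for temp_fin.temp_tianchi_train1_data.
n_negative = 1707539
n_positive = 1445


def negatives_to_keep(n_pos, negatives_per_positive):
    """Number of negatives needed for a 1:negatives_per_positive ratio."""
    return n_pos * negatives_per_positive


def sampling_fraction(n_pos, n_neg, negatives_per_positive):
    """Fraction of the negatives to keep, capped at 1.0."""
    return min(1.0, negatives_to_keep(n_pos, negatives_per_positive) / n_neg)


current_ratio = n_negative / n_positive
print("current ratio is about 1:%d" % round(current_ratio))

for k in (10, 30):
    keep = negatives_to_keep(n_positive, k)
    frac = sampling_fraction(n_positive, n_negative, k)
    print("target 1:%d -> keep %d negatives (fraction %.4f)" % (k, keep, frac))
```

In PySpark, a fraction like this could be passed to `DataFrame.sample` on the flag=0 rows for plain random sampling. Note that `sample` returns approximately, not exactly, that share of rows.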
Two sampling methods are considered here: 1. k-means-based sampling (partition the negative samples into clusters, then randomly draw a certain number of negative samples from each cluster); 2. plain random sampling.
The first approach is the more principled one, since drawing from every cluster keeps the different kinds of negative samples represented. Even so, the ratio of positive to negative examples after sampling still has to be appropriate, and choosing it is a fairly large question in its own right. In this experiment, two ratios, 1:10 and 1:30, were tested.
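The sampling step that follows clustering can be sketched independently of how the clusters are produced. The example below is a minimal plain-Python illustration, not the implementation used in this series. It assumes each negative sample already has a cluster id, and it assumes each cluster contributes in proportion to its size, which the text above does not specify. The name `sample_negatives_by_cluster` is hypothetical.

```python
import random
from collections import defaultdict


def sample_negatives_by_cluster(cluster_ids, n_keep, seed=42):
    """Return indices of negatives to keep, drawn per cluster.

    cluster_ids[i] is the cluster of negative sample i. Each cluster
    contributes roughly n_keep * (cluster size / total) samples
    (an assumption). Because of rounding, the total may differ
    slightly from n_keep.
    """
    total = len(cluster_ids)
    if total == 0 or n_keep <= 0:
        return []

    rng = random.Random(seed)
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    chosen = []
    for cid in sorted(members):
        rows = members[cid]
        # At least one sample per cluster, never more than the cluster holds.
        quota = min(len(rows), max(1, round(n_keep * len(rows) / total)))
        chosen.extend(rng.sample(rows, quota))
    return sorted(chosen)


# Hypothetical cluster assignment for 1000 negatives in three clusters.
cluster_ids = [0] * 600 + [1] * 300 + [2] * 100
kept = sample_negatives_by_cluster(cluster_ids, n_keep=100)
per_cluster = {c: sum(1 for i in kept if cluster_ids[i] == c) for c in (0, 1, 2)}
print(len(kept), per_cluster)
```

Guaranteeing at least one sample per cluster keeps small clusters represented, at the cost of slightly overshooting `n_keep` when there are many tiny clusters.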
K-means clustering algorithm implementation: