[Offline Competition] An Attempt at the Tianchi Newbie Practice Competition (2)

The previous article listed the basic ideas; a training data table has now been produced following the data preprocessing scheme described in the first article.

Training data: user-item (UI) behavior data collected from 11.22 to 11.27

Corresponding big data table name: temp_fin.temp_tianchi_train1_data

Validation data: user-item (UI) behavior data collected from 11.29 to 12.04

Corresponding big data table name: temp_fin.temp_tianchi_train2_data

 

Data preprocessing ideas:

1. Filter out some abnormal training data (user-item combinations that were only bought but never browsed, and combinations that were only browsed but never bought); a rough sketch follows this list

2. Adjust the proportion of positive and negative samples in the training data
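
As a rough sketch of step 1 (the actual screening was done during the preprocessing described in the first article), abnormal combinations of the first kind (bought but never browsed) could be located with Spark SQL. The table name tianchi_user_behavior, the behavior_type encoding (1 = browse, 4 = buy), and the SparkSession variable spark are assumptions for illustration only, not the tables actually used:

# Hypothetical sketch: find user-item pairs that were bought but never browsed,
# assuming a raw behavior-log table with columns user_id, item_id, behavior_type
abnormal_ui = spark.sql("""
    SELECT user_id, item_id
    FROM tianchi_user_behavior
    GROUP BY user_id, item_id
    HAVING SUM(CASE WHEN behavior_type = 4 THEN 1 ELSE 0 END) > 0
       AND SUM(CASE WHEN behavior_type = 1 THEN 1 ELSE 0 END) = 0
""")
# These pairs would then be excluded from the training table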

Model building ideas:

1. Try different classification algorithms; random forest and gradient boosted trees were chosen (both are relatively insensitive to the positive/negative sample ratio). Train models with different hyperparameters and record accuracy and the F1 score, tuning until the accuracy reaches a local optimum (a gradient-boosted-tree sketch follows this list)

2. Verify with the validation data. If the accuracy does not change much, the model is considered usable, and the prediction data is then used to produce the final result.
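
For the gradient boosted tree branch of idea 1, a minimal sketch with pyspark.ml's GBTClassifier is given below. It reuses the same 'scaled' feature column and 'flag' label as the random forest code further down; the hyperparameter values here are assumptions, not the settings actually run:

from pyspark.ml.classification import GBTClassifier

# Sketch only: gradient boosted trees on the same features/label as the random forest below
gbt = GBTClassifier(maxIter=50, maxDepth=5, seed=42, featuresCol='scaled', labelCol='flag')
# gbt_model = gbt.fit(df_train)
# gbt_result = gbt_model.transform(df_test)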

----------------------------------

First coding attempt:

Use the training data directly with a random forest model. The training code:

 

import os
from pyspark.context import SparkContext
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
import pandas as pd
import numpy as np
from pyspark.ml.feature import MaxAbsScaler
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Random Forest (RF) and Gradient Boosted Trees (GBDT)
# Both are relatively insensitive to the positive/negative ratio, but too many negative samples still slow training and consume resources

sparkconf = SparkConf()
sparkconf.setAppName("ronaldo0412") \
  .set("spark.cores.max", str(8)) \
  .setExecutorEnv("JAVA_HOME", os.environ["JAVA_HOME"]) \
  .setExecutorEnv("HADOOP_HDFS_HOME", os.environ["HADOOP_HOME"]) \
  .setExecutorEnv("LD_LIBRARY_PATH", os.environ["JAVA_HOME"] + "/jre/lib/amd64/server:" + os.environ["HADOOP_HOME"] + "/lib/native")
# Create a spark session object with Hive support enabled
spark = SparkSession.builder.enableHiveSupport().config(conf=sparkconf).getOrCreate()
pydf =spark.sql("select * from temp_fin.temp_tianchi_train1_data")
pydf2 =spark.sql("select * from temp_fin.temp_tianchi_train2_data")
# results=pydf.collect()
# array_data = np.array(results, dtype=object)
# columns =['user_id','item_id','item_category','u_b_count','u_b1_count','u_b2_count',\
#           'u_b3_count','u_b4_count','u_b4_rate','i_u_count','i_b4_rate','c_u_count',\
#           'c_b4_rate','ui_b_count','uc_b_count','flag']
# df = pd.DataFrame(array_data,columns=columns)

# Assemble the feature columns into a single vector column
assembler = VectorAssembler(inputCols=['u_b_count',\
        'u_b1_count','u_b2_count','u_b3_count','u_b4_count','u_b4_rate','i_u_count',\
          'i_b4_rate','c_u_count','c_b4_rate','ui_b_count','uc_b_count'], \
                            outputCol="features")
assembled = assembler.transform(pydf)

assembler2 = VectorAssembler(inputCols=['u_b_count',\
        'u_b1_count','u_b2_count','u_b3_count','u_b4_count','u_b4_rate','i_u_count',\
          'i_b4_rate','c_u_count','c_b4_rate','ui_b_count','uc_b_count'], \
                            outputCol="features")
assembled2 = assembler2.transform(pydf2)

# Feature value normalization
# Use MaxAbsScaler, which preserves zero (sparse) entries
maScaler = MaxAbsScaler(inputCol="features", outputCol="scaled")
model = maScaler.fit(assembled)
df_train = model.transform(assembled)

maScaler2 = MaxAbsScaler(inputCol="features", outputCol="scaled")
model2 = maScaler2.fit(assembled2)
df_test = model2.transform(assembled2)

print('Feature processing completed')
# Build the random forest model
rf = RandomForestClassifier(numTrees=100, maxDepth=10, seed=42, featuresCol='scaled', labelCol='flag')
rf_model = rf.fit(df_train)
print('Model has been created')

# Score the validation data and persist the predictions
resultDF = rf_model.transform(df_test)
resultDF.select('user_id','item_id','scaled','flag','prediction').write.mode("overwrite").saveAsTable('temp_fin.temp_tianchi_train_test_result')
print('Test data has been processed')
# evaluator = MulticlassClassificationEvaluator().setLabelCol("flag").setPredictionCol("prediction").setMetricName("accuracy")
# predictionAccuracy = evaluator.evaluate(resultDF)
# print("Testing Accuracy is %s " % (predictionAccuracy * 100) + "%")




The above run did not adjust the proportion of positive and negative samples and did not filter the abnormal training data. The predictions on the test data were saved to the big data table, and it turned out that every prediction was 0. Next, adjust the positive/negative ratio to 1:30 (using the k-means sampling method) and test again.
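
A quick way to see the "all predictions are 0" problem is to look at the prediction distribution directly (a sketch, reusing resultDF from the code above):

# Sketch: inspect how many rows fall into each predicted class
resultDF.groupBy('prediction').count().show()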


----------------------------------------------------

Specific operation:

1. In temp_fin.temp_tianchi_train1_data there are 1,707,539 rows with flag=0 (negative examples) and 1,445 rows with flag=1 (positive examples), a positive-to-negative ratio of roughly 1:1181. There are clearly far too many negative examples, and their number needs to be reduced. Referring to other articles, there are various sampling methods, including random ones.

Two are listed here: 1. k-means clustering (partition the negative samples into clusters, then randomly draw a certain number of negatives from each cluster); 2. pure random sampling.

The first approach is obviously more principled. Still, the positive/negative ratio after sampling has to be chosen appropriately, which is a fairly big question in itself. In this experiment two ratios, 1:10 and 1:30, were tested.

K-means clustering algorithm implementation:
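
The implementation itself is not reproduced here; below is a minimal sketch of how it could be done with pyspark.ml's KMeans, clustering only the negative samples on the scaled feature vector and then drawing the same fraction from every cluster so that the overall ratio comes out around 1:30. The variable names follow the training code above, and the cluster count k = 10 is an assumption:

from pyspark.ml.clustering import KMeans
from pyspark.sql import functions as F

# Sketch only: k-means based undersampling of the negative class
pos_df = df_train.filter(F.col('flag') == 1)
neg_df = df_train.filter(F.col('flag') == 0)

# Cluster the negative samples on the scaled feature vector (k = 10 is an assumption)
kmeans = KMeans(k=10, seed=42, featuresCol='scaled', predictionCol='cluster')
neg_clustered = kmeans.fit(neg_df).transform(neg_df)

# Keep roughly 30 negatives per positive by sampling the same fraction from every cluster
frac = (pos_df.count() * 30) / float(neg_df.count())
fractions = {row['cluster']: frac for row in neg_clustered.select('cluster').distinct().collect()}
neg_sampled = neg_clustered.sampleBy('cluster', fractions=fractions, seed=42).drop('cluster')

# Recombine into a rebalanced training set
train_balanced = pos_df.unionByName(neg_sampled)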

