I previously released a web crawler that collects job listing information. After that, I did some machine learning exploration on the collected data. This post gives a basic introduction to that project, tidied up and shared here for discussion.
3 Design of job classifier based on logistic regression
3.1 Introduction to the logistic regression algorithm
Suppose the dataset has n independent features, and x1 to xn are the n features of a sample. The goal of a conventional regression algorithm is to fit a function of these features, here a linear one, that minimizes the error between the predicted value and the true value: f(x) = c1*x1 + c2*x2 + ... + cn*xn
We would also like f(x) to behave well as a classifier: ideally it directly expresses the probability that a sample with features x belongs to a given class. For example, f(x) > 0.5 would indicate that x is assigned to the positive class, and f(x) < 0.5 that it is assigned to the negative class. For that to work, f(x) must always lie between 0 and 1. Is there such a function?
This is where the sigmoid function comes in. It is defined as follows: sigmoid(z) = 1 / (1 + e^(-z))
Intuitively, its graph is an S-shaped curve that rises smoothly from values near 0 for large negative inputs to values near 1 for large positive inputs.
The sigmoid function has all the properties we need: its domain is all real numbers, its values lie strictly between 0 and 1, and its value at 0 is 0.5.
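The properties listed above are easy to check numerically. The following is a minimal sketch using only the standard library; the function name sigmoid and the sample inputs are chosen here for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z) so that
    # exp() is never called with a large positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)


if __name__ == "__main__":
    for z in (-6.0, -2.0, 0.0, 2.0, 6.0):
        print(f"sigmoid({z:+.1f}) = {sigmoid(z):.4f}")
```

Values for negative inputs fall below 0.5 and values for positive inputs rise above it, which is what makes 0.5 a natural decision threshold.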
So how do we turn f(x) into a sigmoid function? Let p(x) be the probability that a sample with features x is assigned to class 1; then p(x)/[1-p(x)] is called the odds. Take the logarithm of the odds and set it equal to f(x): ln( p(x) / [1 - p(x)] ) = f(x)
This equation is easily solved for p(x), which gives: p(x) = 1 / (1 + e^(-f(x)))
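The step from the log-odds equation to p(x) is only a few lines of algebra. The following is a sketch of that derivation, assuming (as the text implies with its n parameters c) that f(x) is a linear combination of the features with no separate intercept term:

```latex
\begin{align*}
\ln\frac{p(x)}{1 - p(x)} &= f(x) = c_1 x_1 + c_2 x_2 + \dots + c_n x_n \\
\frac{p(x)}{1 - p(x)} &= e^{f(x)} \\
p(x) &= e^{f(x)} \bigl(1 - p(x)\bigr) \\
p(x) \bigl(1 + e^{f(x)}\bigr) &= e^{f(x)} \\
p(x) &= \frac{e^{f(x)}}{1 + e^{f(x)}} = \frac{1}{1 + e^{-f(x)}}
\end{align*}
```

The last line is the sigmoid function applied to f(x).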
Now we have the sigmoid function we need. Next, just as in ordinary linear regression, we fit the n parameters c in the formula.
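One common way to fit the parameters c is batch gradient descent on the log-loss. The text does not say which optimizer the project relies on, so the sketch below is only an illustration under that assumption; the function names, learning rate and epoch count are hypothetical.

```python
import math


def sigmoid(z):
    """Overflow-safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(c, x):
    """Probability that sample x belongs to class 1 under coefficients c."""
    return sigmoid(sum(cj * xj for cj, xj in zip(c, x)))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit coefficients c by batch gradient descent on the mean log-loss."""
    n_features = len(X[0])
    c = [0.0] * n_features
    for _ in range(epochs):
        grad = [0.0] * n_features
        for xi, yi in zip(X, y):
            # Gradient of the log-loss for one sample is (p - y) * x.
            error = predict_proba(c, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += error * xj
        c = [cj - lr * gj / len(X) for cj, gj in zip(c, grad)]
    return c


if __name__ == "__main__":
    # Toy data: the first feature is a constant 1.0 so that its coefficient
    # plays the role of an intercept.
    X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]]
    y = [0, 0, 0, 1, 1, 1]
    c = fit_logistic(X, y)
    print("coefficients:", c)
    print("p(class 1 | x=1.0):", predict_proba(c, [1.0, 1.0]))
    print("p(class 1 | x=3.5):", predict_proba(c, [1.0, 3.5]))
```

In practice a library solver, such as the one in scikit-learn used later in this post, is preferable; the loop above only makes the fitting step concrete.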
3.2 Algorithm flow
3.3 Algorithm application
1. Data cleaning
Data cleaning is an indispensable part of the data mining process, and its quality directly affects the later model performance and the final results. The data crawled from 51job and Zhaopin.com is noisy, incomplete and inconsistent. During data cleaning we fill in missing data, remove abnormal data, smooth noisy data and correct inconsistent data.
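As a rough illustration of these steps, the sketch below drops incomplete rows, normalizes whitespace and removes exact duplicates. It uses plain Python dictionaries rather than the project's actual pipeline, and the field names category and requirements are hypothetical stand-ins for the crawled columns.

```python
import re


def clean_records(records, required=("category", "requirements")):
    """Drop incomplete rows, normalize whitespace, and remove exact duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        # Skip rows where a required field is missing or blank.
        if any(not (rec.get(key) or "").strip() for key in required):
            continue
        # Collapse runs of whitespace (including newlines) into single spaces.
        row = {key: re.sub(r"\s+", " ", rec[key]).strip() for key in required}
        fingerprint = tuple(row[key] for key in required)
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(row)
    return cleaned


if __name__ == "__main__":
    raw = [
        {"category": "big data", "requirements": " Hadoop   Spark\n Hive "},
        {"category": "big data", "requirements": "Hadoop Spark Hive"},
        {"category": "cloud computing", "requirements": None},
        {"category": "cloud computing", "requirements": "OpenStack VMware"},
    ]
    for row in clean_records(raw):
        print(row)
```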
2. Chinese word segmentation
In English, a computer can easily separate words by the spaces between them. Chinese is different: it is written as a sequence of characters, and a word is usually made up of two or more characters with no delimiter between words. The computer cannot tell directly where one word ends and the next begins, so the character sequence has to be divided into meaningful words. This is Chinese word segmentation. The tools we considered for word segmentation are as follows:
Word segmentation tool | Segmentation granularity | Errors | Part-of-speech tagging | Authentication method | Interface
NLTK | Multiple options | Yes | Yes | None | Multilingual tools
jieba | Multiple options | No | Yes | None | Python library
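To make the idea of segmentation concrete, here is a sketch of forward maximum matching, a deliberately simple dictionary-based technique. It is not the method jieba uses and is shown only to illustrate the problem; the vocabulary and the sample sentence are made up for the example.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Greedy dictionary segmentation: at each position take the longest known word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing in the vocabulary matches.
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


if __name__ == "__main__":
    # "数据" = data, "挖掘" = mining, "数据挖掘" = data mining, "工程师" = engineer
    vocabulary = {"数据", "挖掘", "数据挖掘", "工程师"}
    print(forward_max_match("数据挖掘工程师", vocabulary))
```

The longest match wins, so the sample sentence is split into "数据挖掘" and "工程师" rather than into three shorter words.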
3. Stop word removal
In text mining, words that are too common or carry no meaning are generally removed to make the algorithm more efficient and to save storage space. These words are called stop words. In our system, a Chinese stop word list is used to remove them.
In addition, when counting word frequencies we found that many of the high-frequency words contribute little to classification, so we manually built an additional stop word list.
Category | Example document after word segmentation and stop word removal
Big data | Hadoop data mining Hive HBase Spark Storm Flume hadoop Map Reduce...
Cloud computing | Openstack TCP IP socket CCNA CCNP Citrix VMware SDN NFV ...
Artificial intelligence | Machine Learning Deep Learning Computer Vision Natural Language Processing Speech Processing Neural Networks...
Internet of Things | Smart IoT manufacturing socket FTP communication DLNA AireKiss mqtt 5G era...
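The two-list approach described above can be sketched as follows. The word lists here are tiny stand-ins, not the project's real stop word files; 'data' and 'company' are used as the hand-picked domain words because the post names them as frequent but uninformative.

```python
from collections import Counter

# Stand-in for a published general-purpose stop word list.
GENERAL_STOP_WORDS = {"the", "and", "of", "with"}
# Stand-in for the hand-built list of frequent but uninformative domain words.
CUSTOM_STOP_WORDS = {"data", "company"}


def most_common_tokens(documents, top_n=5):
    """Count tokens across all documents to help spot candidate stop words."""
    counts = Counter(tok.lower() for doc in documents for tok in doc.split())
    return counts.most_common(top_n)


def remove_stop_words(document, stop_words):
    """Return the document with every stop word dropped (case-insensitive)."""
    return " ".join(tok for tok in document.split() if tok.lower() not in stop_words)


if __name__ == "__main__":
    docs = [
        "the company needs Hadoop and Spark data skills",
        "company data platform with OpenStack and VMware",
    ]
    print(most_common_tokens(docs, top_n=3))
    stop_words = GENERAL_STOP_WORDS | CUSTOM_STOP_WORDS
    for doc in docs:
        print(remove_stop_words(doc, stop_words))
```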
4. TF-IDF vectorization
After segmenting the job postings and removing stop words, we use the TF-IDF feature extraction algorithm to convert the segmented documents into a vector matrix for logistic regression training.
TF-IDF (Term Frequency-Inverse Document Frequency) is a common weighting technique in information retrieval and text mining. The TF-IDF value represents the relative importance of a word to a document within the corpus.
Word | Number of documents containing the word (K) | IDF | TF-IDF
hadoop | 37.8 | 3.421 | 0.0898
Openstack | 14.3 | 2.713 | 0.0643
computer vision | 2.4 | 0.603 | 0.0121
5G communication | 5.5 | 2.410 | 0.0482
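For readers who want to see how such weights arise, the sketch below computes the textbook variant tf * ln(N / df) on a toy corpus. It is an illustration only: scikit-learn's TfidfVectorizer applies IDF smoothing and L2 normalization by default, so its numbers differ, and the values in the table above come from the author's data, not from this code.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * ln(N / df)."""
    tokenised = [doc.split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: the number of documents in which each term appears.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


if __name__ == "__main__":
    docs = [
        "hadoop spark engineer",
        "openstack vmware engineer",
        "hadoop hive engineer",
    ]
    for doc, w in zip(docs, tf_idf(docs)):
        print(doc, "->", {term: round(value, 4) for term, value in w.items()})
```

A word that appears in every document, such as "engineer" here, gets weight 0, while a word unique to one document gets the highest weight in that document.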
5. Logistic regression
We use logistic regression to train a classifier on the resulting job posting matrix and save the training result. On the web page, job seekers then enter their professional skills, and the model classifies that input automatically. In the feature engineering step we set max_df=0.95 to remove words such as 'data' and 'company' that occur very often but contribute little to classification, and min_df=2 to remove words that appear in fewer than 2 documents. The vocabulary is limited to 30,000 features (max_features=30000), and the feature terms are n-grams of 1 to 4 tokens (ngram_range=(1, 4)).
Here is the example code:
NBmain module (main module):
import os

import nltk
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
from nltk.text import TextCollection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

from NBtools import proc_text, split_train_test, get_word_list_from_data, \
    extract_feat_from_data, cal_acc

dataset_path = './dataset'

# csv file of raw data
output_text_filename = '51job.csv'

# Cleaned text data file
output_cln_text_filename = 'clean_job_text.csv'

# Processing and cleaning the text data takes a long time; control it with is_first_run.
# Set it to True on the first run, when the raw text data still has to be processed and cleaned.
# Set it to False once the text data has been processed and the cleaned data has been saved.
is_first_run = False


def run_main():
    """
    main function
    """
    # 1. Data reading, processing, cleaning, preparation
    if is_first_run:
        print('Processing and cleaning text data...', end=' ')
        # On the first run the raw text data has to be processed and cleaned.
        # Read the raw csv file, keeping only the label and text columns.
        text_df = pd.read_csv(os.path.join(dataset_path, output_text_filename),
                              usecols=['job classification', 'job requirements'],
                              encoding='gbk').dropna()

        # # Map category names to integer labels. split_train_test expects the
        # # labels 0-3, so enable this if the raw file stores category names.
        # title = {'big data': 0, 'cloud computing': 1, 'artificial intelligence': 2, 'internet of things': 3}
        # text_df['job classification'] = text_df['job classification'].map(title)

        # Process the text data (segmentation + stop word removal)
        text_df['job requirements'] = text_df['job requirements'].apply(proc_text)

        # Filter out empty strings
        text_df = text_df[text_df['job requirements'] != '']

        # Save the processed text data
        text_df.to_csv(os.path.join(dataset_path, output_cln_text_filename),
                       index=False, encoding='utf-8')
        print('Done, result saved.')

    # 2. Split training set and test set
    print('Load processed text data')
    # The cleaned file was written as utf-8, so it must be read back as utf-8
    clean_text_df = pd.read_csv(os.path.join(dataset_path, output_cln_text_filename),
                                encoding='utf-8')

    # Split training set and test set
    train_text_df, test_text_df = split_train_test(clean_text_df)

    # View the basic information of the training set and test set
    print('Number of samples per class in the training set:',
          train_text_df.groupby('job classification').size())
    print('Number of samples per class in the test set:',
          test_text_df.groupby('job classification').size())

    vectorizer = TfidfVectorizer(
        max_df=0.90, min_df=2,
        sublinear_tf=True,
        token_pattern=r'\w{1,}',
        ngram_range=(1, 4),
        max_features=30000)
    # The vocabulary and IDF weights are fitted on all cleaned text;
    # fit on the training split only for a stricter evaluation.
    vectorizer.fit(clean_text_df['job requirements'])
    train_features = vectorizer.transform(train_text_df['job requirements'])
    test_features = vectorizer.transform(test_text_df['job requirements'])
    print('TfidfVectorizer done.... start train')

    train_target = train_text_df['job classification']
    classifier = LogisticRegression(solver='sag')
    # Mean accuracy over 3-fold cross-validation on the training set
    cv_score = np.mean(cross_val_score(
        classifier, train_features, train_target, cv=3))
    print('The accuracy of {} is {}'.format('job classification', cv_score))

    # # 3. Feature extraction
    # # Calculate word frequency
    # n_common_words = 200
    #
    # # Take out the words in the training set and count the word frequency
    # print('Count word frequency...')
    # all_words_in_train = get_word_list_from_data(train_text_df)
    # fdist = nltk.FreqDist(all_words_in_train)
    # common_words_freqs = fdist.most_common(n_common_words)
    # print('The most frequent {} words are: '.format(n_common_words))
    # for word, count in common_words_freqs:
    #     print('{}: {} times'.format(word, count))
    # print()
    #
    # # Extract features on the training set
    # text_collection = TextCollection(train_text_df['job requirements'].values.tolist())
    # print('Training sample extraction features...', end=' ')
    # train_X, train_y = extract_feat_from_data(train_text_df, text_collection, common_words_freqs)
    # print('Done')
    # print()
    #
    # print('Test sample extraction features...', end=' ')
    # test_X, test_y = extract_feat_from_data(test_text_df, text_collection, common_words_freqs)
    # print('Done')
    #
    # # 4. Train the model (originally Naive Bayes)
    # print('Training model...', end=' ')
    # # gnb = GaussianNB()
    # gnb = LogisticRegression(solver='sag')
    # gnb.fit(train_X, train_y)
    # print('Done')
    # print()
    #
    # # 5. Predict
    # print('Test model...', end=' ')
    # test_pred = gnb.predict(test_X)
    # print('Done')
    #
    # # Output accuracy
    # print('Logistic regression accuracy:', cal_acc(test_y, test_pred))


if __name__ == '__main__':
    run_main()
NBtools module (helper functions imported by the main module):
import re
import math

import jieba.posseg as pseg
import numpy as np
import pandas as pd

# Load common stop words
with open('./Chinese stopwords.txt', 'r', encoding='utf-8') as f:
    stopwords1 = [line.rstrip() for line in f]
with open('./Harbin Institute of Technology stopwords.txt', 'r', encoding='utf-8') as f:
    stopwords2 = [line.rstrip() for line in f]
# A set keeps the membership test in proc_text fast
stopwords = set(stopwords1 + stopwords2)


def proc_text(raw_line):
    """
    Process one line of text data
    Return the word segmentation result as a space-separated string
    """
    # # 1. Use a regular expression to remove non-Chinese characters
    # filter_pattern = re.compile('[^\u4E00-\u9FD5]+')
    # chinese_only = filter_pattern.sub('', raw_line)

    # 2. jieba word segmentation + part-of-speech tagging
    words_lst = pseg.cut(raw_line)

    # 3. Remove stop words
    meaningful_words = []
    for word, flag in words_lst:
        # if (word not in stopwords) and (flag == 'v'):
        # Words can also be filtered by part of speech, e.g. to drop non-verbs
        if word not in stopwords:
            meaningful_words.append(word)

    return ' '.join(meaningful_words)


def split_train_test(text_df, size=0.8):
    """
    Split training and test sets
    """
    # To keep each class in the same proportion in the training set and the
    # test set, each class is processed in turn
    train_parts = []
    test_parts = []

    # labels = ['big data', 'cloud computing', 'artificial intelligence', 'internet of things']
    labels = [0, 1, 2, 3]
    for label in labels:
        # Find the records with this label
        text_df_w_label = text_df[text_df['job classification'] == label]
        # Reset the index so that the records of each class are indexed from 0,
        # which makes the split below easier
        text_df_w_label = text_df_w_label.reset_index(drop=True)

        # By default, split into 80% training set and 20% test set.
        # To keep things simple, the first 80% go into the training set and the
        # last 20% into the test set.
        # A random 80/20 split is also possible (try implementing it on the DataFrame).

        # Number of rows of this class
        n_lines = text_df_w_label.shape[0]
        split_line_no = math.floor(n_lines * size)
        text_df_w_label_train = text_df_w_label.iloc[:split_line_no, :]
        text_df_w_label_test = text_df_w_label.iloc[split_line_no:, :]

        # Add to the overall training set and test set
        train_parts.append(text_df_w_label_train)
        test_parts.append(text_df_w_label_test)

    # DataFrame.append was removed in pandas 2.0, so concatenate the parts instead
    train_text_df = pd.concat(train_parts).reset_index(drop=True)
    test_text_df = pd.concat(test_parts).reset_index(drop=True)
    return train_text_df, test_text_df


def get_word_list_from_data(text_df):
    """
    Put the words in the dataset into a list
    """
    word_list = []
    for _, r_data in text_df.iterrows():
        word_list += r_data['job requirements'].split(' ')
    return word_list


def extract_feat_from_data(text_df, text_collection, common_words_freqs):
    """
    Feature extraction
    """
    # Only TF-IDF features are used here as an example.
    # Word frequency or other text features could be added as extra features.
    n_sample = text_df.shape[0]
    n_feat = len(common_words_freqs)
    common_words = [word for word, _ in common_words_freqs]

    # Initialize
    X = np.zeros([n_sample, n_feat])
    y = np.zeros(n_sample)

    print('Extract features...')
    # The row index i is used as the array position, so text_df must be indexed from 0
    for i, r_data in text_df.iterrows():
        if (i + 1) % 5000 == 0:
            print('The feature extraction of {} samples has been completed'.format(i + 1))

        text = r_data['job requirements']

        feat_vec = []
        for word in common_words:
            if word in text:
                # If the high-frequency word occurs in the text, calculate its TF-IDF value
                tf_idf_val = text_collection.tf_idf(word, text)
            else:
                tf_idf_val = 0

            feat_vec.append(tf_idf_val)

        # Assign
        X[i, :] = np.array(feat_vec)
        y[i] = int(r_data['job classification'])
    return X, y


def cal_acc(true_labels, pred_labels):
    """
    Calculate accuracy
    """
    n_total = len(true_labels)
    correct_list = [true_labels[i] == pred_labels[i] for i in range(n_total)]

    acc = sum(correct_list) / n_total
    return acc

The code probably has many shortcomings and room for optimization. Pointers are welcome.