Machine Learning Model for Zhaopin Recruitment Information

I previously released a web crawler for job-posting data. Since then I have done some machine learning exploration on top of it. Now that the project is largely complete, I have organized the work and am publishing it here for discussion.

3 Design of a job classifier based on logistic regression

 

3.1 Introduction to the logistic regression algorithm

Suppose the dataset has n independent features, and x1 to xn are the n feature values of a sample. The goal of a conventional regression algorithm is to fit a function of these features, typically a linear combination, that minimizes the error between the predicted value and the true value:

f(x) = c_0 + c_1 x_1 + c_2 x_2 + ... + c_n x_n
We would like f(x) to have good discriminative properties; ideally it should directly express the probability that a sample with feature vector x belongs to a given class. For example, f(x) > 0.5 would indicate that x belongs to the positive class, and f(x) < 0.5 that it belongs to the negative class. We also want f(x) to always lie between 0 and 1. Is there such a function?

This is exactly what the sigmoid function provides. It is defined as follows:

g(z) = 1 / (1 + e^(-z))

Intuitively, the graph of the sigmoid function is an S-shaped curve rising from 0 towards 1.

The sigmoid function has all the properties we need: its domain is all real numbers, its values lie strictly between 0 and 1, and its value at 0 is 0.5.

So how do we connect f(x) to the sigmoid function? Let p(x) be the probability that a sample with feature vector x belongs to class 1; then p(x) / [1 - p(x)] is the odds. Taking the logarithm of the odds and equating it with the linear function above gives:

ln( p(x) / (1 - p(x)) ) = c_0 + c_1 x_1 + ... + c_n x_n

Solving this equation for p(x) yields:

p(x) = 1 / (1 + e^(-(c_0 + c_1 x_1 + ... + c_n x_n)))
Now we have the sigmoid form we need. The remaining task, just as in ordinary linear regression, is to fit the parameters c_0, ..., c_n in the formula.
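As a small illustration (not part of the project code), here is a minimal Python sketch of the sigmoid decision rule, assuming some already-fitted coefficients; the numbers are made up:

import numpy as np

def sigmoid(z):
    # maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical fitted parameters: intercept c0 plus one weight per feature
c0 = -1.0
c = np.array([0.8, -0.3, 1.2])

x = np.array([1.5, 0.2, 0.7])        # one sample with 3 features
p = sigmoid(c0 + np.dot(c, x))       # p(x): probability of belonging to class 1
label = 1 if p > 0.5 else 0          # classify with the 0.5 threshold
print(p, label)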

 

3.2 Algorithm Flow

 

3.3 Algorithm application

 

1. Data cleaning

Data cleaning is an indispensable part of the data mining process, and its quality directly affects the subsequent models and the final results. The data crawled from 51job and Zhaopin.com is noisy, incomplete, and inconsistent. During cleaning we fill in missing values, remove abnormal records, smooth noisy data, and correct inconsistencies.
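As a rough sketch of this step (the column names follow the CSV used in the code below; the specific rules are illustrative rather than the project's exact cleaning pipeline):

import pandas as pd

df = pd.read_csv('./dataset/51job.csv', encoding='gbk')

# remove records with no job description and drop exact duplicates
df = df.dropna(subset=['job requirements']).drop_duplicates()

# fill missing category labels with a placeholder value
df['job classification'] = df['job classification'].fillna('unknown')

# remove abnormal records, e.g. descriptions that are implausibly short
df = df[df['job requirements'].str.len() > 10]

df.to_csv('./dataset/clean_job_text.csv', index=False, encoding='utf-8')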

 

2. Chinese word segmentation  

In English, a computer can easily identify words by the spaces between them. Chinese is different: text is a sequence of characters, and words are made up of one or more characters with no delimiters, so it is hard for a computer to tell where one word ends and the next begins. Splitting a Chinese character sequence into meaningful words is what Chinese word segmentation does. The segmentation tools we compared are listed below, followed by a short jieba example:

Word segmentation tool | Segmentation granularity | Error | Part-of-speech tagging | Authentication method | Interface
NLTK                   | multiple options         | yes   | yes                    | no                    | multilingual toolkit
jieba                  | multiple options         | no    | yes                    | no                    | Python library
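For reference, a minimal jieba example (pseg.cut, the same interface used by the NBtools module below, yields word/part-of-speech pairs; the text is an illustrative fragment of a job description):

import jieba.posseg as pseg

text = '负责Hadoop集群的搭建与数据挖掘'
for word, flag in pseg.cut(text):
    print(word, flag)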

 

3. Removing stop words

In text mining, words that are too common or carry little meaning are usually removed to improve the efficiency of the algorithm and save storage space. These words are stop words. In our system, a Chinese stop word list is used to remove them.

In addition, when examining the word frequency statistics we found, among the high-frequency words, many words that contribute little to classification, so we manually added a supplementary stop word list.

Category                | Example document after word segmentation and stop word removal
Big data                | Hadoop data mining Hive HBase Spark Storm Flume hadoop Map Reduce ...
Cloud computing         | Openstack TCP IP socket CCNA CCNP Citrix VMware SDN NFV ...
Artificial intelligence | machine learning deep learning computer vision natural language processing speech processing neural networks ...
Internet of Things      | smart IoT manufacturing socket FTP communication DLNA AireKiss mqtt 5G era ...
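A minimal sketch of adding such manually identified words to a supplementary stop word list and filtering a segmented document (the added words here are only illustrative, not the project's actual list):

# general Chinese stop word list, one word per line
with open('./Chinese stopwords.txt', encoding='utf-8') as f:
    stopwords = {line.rstrip() for line in f}

# manually added words: frequent in the job corpus but useless for classification
stopwords |= {'工作', '相关', '经验'}

tokens = ['负责', 'Hadoop', '相关', '的', '数据挖掘', '工作']
print([t for t in tokens if t not in stopwords])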

 

4. TF-IDF word vectorization

After segmenting the recruitment information and removing stop words, we use the TF-IDF feature extraction algorithm to convert the word documents into a vector matrix for logistic regression training.

TF-IDF (Term Frequency - Inverse Document Frequency) is a common weighting technique in information retrieval and text mining. The TF-IDF value reflects the relative importance of a word within a document in the corpus.
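For reference, the textbook definitions are given below; the exact normalization behind the figures in the table that follows is not specified, so these are the standard forms rather than the project's precise computation:

TF(t, d) = (occurrences of term t in document d) / (total number of terms in d)
IDF(t) = ln( N / (number of documents containing t) ), where N is the total number of documents
TF-IDF(t, d) = TF(t, d) × IDF(t)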

 

 

Word             | Number of documents containing the word (K) | IDF   | TF-IDF
hadoop           | 37.8                                        | 3.421 | 0.0898
Openstack        | 14.3                                        | 2.713 | 0.0643
computer vision  | 2.4                                         | 0.603 | 0.0121
5G communication | 5.5                                         | 2.410 | 0.0482
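In the project these values come from scikit-learn's TfidfVectorizer rather than a hand-rolled computation; a minimal sketch on a toy corpus (the documents are made up for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus of already-segmented documents (space-separated tokens)
docs = [
    'hadoop hive spark 数据挖掘',
    'openstack vmware sdn 网络',
    'hadoop spark 机器学习',
]

vectorizer = TfidfVectorizer(token_pattern=r'\w{1,}')
matrix = vectorizer.fit_transform(docs)    # sparse (n_docs, n_terms) TF-IDF matrix
print(vectorizer.get_feature_names_out())
print(matrix.toarray().round(3))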

 

5. Logistic regression

We use logistic regression to train a classifier on the resulting recruitment information matrix and keep the trained model; on the web page, a job seeker enters their vocational skills and the model classifies that input automatically. In the feature engineering for the logistic regression algorithm, we set max_df=0.95 to remove words such as 'data' and 'company' that occur very frequently but contribute little to classification, and min_df=2 to remove words that appear fewer than 2 times. A vocabulary of 30,000 feature words was built, using n-grams of 1 to 4 words as features.


Here is the example code:

NBmain module (main module):

import os
import pandas as pd
import nltk
from NBtools import proc_text, split_train_test, get_word_list_from_data, \
    extract_feat_from_data, cal_acc
from nltk.text import TextCollection
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
import matplotlib.pylab as plt


dataset_path = './dataset'
# csv file of raw data
output_text_filename = '51job.csv'

# Cleaned text data file
output_cln_text_filename = 'clean_job_text.csv'

# It takes a long time to process and clean text data, configure it by setting is_first_run
# If the first run needs to process and clean the original text data, it needs to be set to True
# If the text data has been processed before and the cleaned text data has been saved, set it to False
is_first_run = False

def run_main():
    """
        main function
    """
    # 1. Data reading, processing, cleaning, preparation
    if is_first_run:
        print('Processing cleaning text data...', end=' ')
        # If it is the first run, the original text data needs to be processed and cleaned

        # Read raw text data, save labels and text data as csv

        # Read the processed csv file and construct the dataset
        text_df = pd.read_csv(os.path.join(dataset_path, output_text_filename),usecols = ['job classification','job requirements'],encoding = 'gbk').dropna()
		
        # Dictionary mapping from category name to numeric label
        # title = {'big data': 0, 'cloud computing': 1, 'artificial intelligence': 2, 'internet of things': 3}
        # text_df['job classification'] = text_df['job classification'].map(title)
        
        # process text data
        text_df['job requirements'] = text_df['job requirements'].apply(proc_text)

        # filter empty strings
        text_df = text_df[text_df['job requirements'] != '']
        


        # save the processed text data
        text_df.to_csv(os.path.join(dataset_path, output_cln_text_filename),
                       index=None, encoding='utf-8')
        print('Complete and save the result.')

    # 2. Split training set and test set
    print('Load processed text data')
    clean_text_df = pd.read_csv(os.path.join(dataset_path, output_cln_text_filename),
                                encoding='utf-8')  # the cleaned file was saved as utf-8 above


    # split training set and test set
    train_text_df, test_text_df = split_train_test(clean_text_df)
    # View the basic information of the training set and test set
    print('Number of samples per class in the training set:', train_text_df.groupby('job classification').size())
    print('Number of samples per class in the test set:', test_text_df.groupby('job classification').size())

    
    


    vectorizer = TfidfVectorizer(
                max_df=0.90, min_df=2,
                sublinear_tf=True,
                token_pattern=r'\w{1,}',
                ngram_range=(1, 4),
                max_features=30000)


    vectorizer.fit(clean_text_df['job requirements'])
    train_features = vectorizer.transform(train_text_df['job requirements'])
    test_features = vectorizer.transform(test_text_df['job requirements'])
    print('TfidfVectorizer done.... start train')
    train_target = train_text_df['job classification']
    classifier = LogisticRegression(solver='sag')

    cv_score = np.mean(cross_val_score(
            classifier, train_features, train_target, cv=3))

         
    print('The accuracy of {} is {}'.format('job classification', cv_score))
    # # 3. Feature extraction
    # # Calculate word frequency
    # n_common_words = 200

    # # Take out the words in the training set and count the word frequency
    # print('Count word frequency...')
    # all_words_in_train = get_word_list_from_data(train_text_df)
    # fdisk = nltk.FreqDist(all_words_in_train)
    # common_words_freqs = fdisk.most_common(n_common_words)
    # print('The most frequent {} words are: '.format(n_common_words))
    # for word, count in common_words_freqs:
    #     print('{}: {}次'.format(word, count))
    # print()

    # # Extract features on the training set
    # text_collection = TextCollection(train_text_df['job requirements'].values.tolist())
    # print('Training sample extraction features...', end=' ')
    # train_X, train_y = extract_feat_from_data(train_text_df, text_collection, common_words_freqs)
    # print('Done')
    # print()

    # print('Test sample extraction features...', end=' ')
    # test_X, test_y = extract_feat_from_data(test_text_df, text_collection, common_words_freqs)
    # print('Done')

    # #4. Train the model Naive Bayes
    # print('Training model...', end=' ')
    # #gnb = GaussianNB()
    # gnb =  LogisticRegression(solver='sag')
    # gnb.fit(train_X, train_y)
    # print('Done')
    # print()

    # # 5. Forecast
    # print('Test model...', end=' ')
    # test_pred = gnb.predict(test_X)
    # print('Done')

    # # output accuracy
    # print('Logistic regression accuracy:', cal_acc(test_y, test_pred))

NBtools module code (imported by NBmain):

import re
import jieba.posseg as pseg
import pandas as pd
import math
import numpy as np

# Load common stop words
stopwords1 = [line.rstrip() for line in open('./Chinese stopwords.txt', 'r', encoding='utf-8')]
stopwords2 = [line.rstrip() for line in open('./Harbin Institute of Technology stopwords.txt', 'r', encoding='utf-8')]
stopwords = stopwords1 + stopwords2



def proc_text(raw_line):
    """
        Process each line of text data
        Return word segmentation results
    """
    # # 1. Use regular expressions to remove non-Chinese characters
    # filter_pattern = re.compile('[^\u4E00-\u9FD5]+')
    # chinese_only = filter_pattern.sub('', raw_line)

    # 2. jieba word segmentation + part-of-speech tagging
    words_lst = pseg.cut(raw_line)

    # 3. Remove stop words
    meaningful_words = []
    for word, flag in words_lst:
        # words could also be filtered by part of speech here, e.g.
        # if (word not in stopwords) and (flag == 'v'):
        if word not in stopwords:
            meaningful_words.append(word)

    return ' '.join(meaningful_words)


def split_train_test(text_df, size=0.8):
    """
        Split training and test sets
    """
    # Process each class in turn so that every class keeps the same proportion in the training and test sets
    train_text_df = pd.DataFrame()
    test_text_df = pd.DataFrame()
    
    #labels = ['big data', 'cloud computing', 'artificial intelligence', 'internet of things']
    labels = [0, 1, 2, 3]
    for label in labels:
        # Find the record of label
        text_df_w_label = text_df[text_df['job classification'] == label]
        # Reset the index to ensure that the records of each class are indexed from 0, which is convenient for subsequent splitting
        text_df_w_label = text_df_w_label.reset_index()

        # By default, split 80% into the training set and 20% into the test set
        # To keep things simple, the first 80% of rows go to the training set and the last 20% to the test set
        # A random split could be used instead (e.g. DataFrame.sample or train_test_split)

        # The number of rows of this type of data
        n_lines = text_df_w_label.shape[0]
        split_line_no = math.floor(n_lines * size)
        text_df_w_label_train = text_df_w_label.iloc[:split_line_no, :]
        text_df_w_label_test = text_df_w_label.iloc[split_line_no:, :]

        # Put into the overall training set, test set
        # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
        train_text_df = pd.concat([train_text_df, text_df_w_label_train])
        test_text_df = pd.concat([test_text_df, text_df_w_label_test])

    train_text_df = train_text_df.reset_index()
    test_text_df = test_text_df.reset_index()
    return train_text_df, test_text_df


def get_word_list_from_data(text_df):
    """
        Put the words in the dataset into a list
    """
    word_list = []
    for _, r_data in text_df.iterrows():
        word_list += r_data['job requirements'].split(' ')
    return word_list





def extract_feat_from_data(text_df, text_collection, common_words_freqs):
    """
        Feature extraction
    """
    # Here only TF-IDF features are selected as examples
    # Consider using word frequency or other text features as additional features

    n_sample = text_df.shape[0]
    n_feat = len(common_words_freqs)
    common_words = [word for word, _ in common_words_freqs]

    # initialize
    X = np.zeros([n_sample, n_feat])
    y = np.zeros(n_sample)

    print('Extract features...')
    for i, r_data in text_df.iterrows():
        if (i + 1) % 5000 == 0:
            print('The feature extraction of {} samples has been completed'.format(i + 1))

        text = r_data['job requirements']

        feat_vec = []
        for word in common_words:
            if word in text:
                # If in high frequency words, calculate TF-IDF value
                tf_idf_val = text_collection.tf_idf(word, text)
            else:
                tf_idf_val = 0

            feat_vec.append(tf_idf_val)

        # assign
        X[i, :] = np.array(feat_vec)
        y[i] = int(r_data['job classification'])

    return X, y


def cal_acc(true_labels, pred_labels):
    """
        Calculate accuracy
    """
    n_total = len(true_labels)
    correct_list = [true_labels[i] == pred_labels[i] for i in range(n_total)]

    acc = sum(correct_list) / n_total
    return acc
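Finally, a rough sketch of how a trained vectorizer and classifier could be used on the web page to classify a job seeker's input, as described in step 5 above (this assumes classifier.fit has been called on the training features, which the cross-validation call in NBmain does not do by itself; the input string is illustrative):

# assumes: vectorizer fitted as in NBmain, and classifier.fit(train_features, train_target) has been run
user_input = 'Hadoop Spark 数据挖掘 机器学习'      # skills entered by a job seeker
cleaned = proc_text(user_input)                     # same segmentation and stop word removal
features = vectorizer.transform([cleaned])          # TF-IDF vector using the training vocabulary
predicted_class = classifier.predict(features)[0]
print('Predicted job category:', predicted_class)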

The code still has many shortcomings and room for optimization; pointers and suggestions are welcome.
