Developing a web app in Python that predicts gender from names with machine learning

foreword

I started this project because CSDN once recommended a Q&A thread to me: developing a mobile app that predicts gender from names using machine learning. I clicked in and found that someone had already answered it, so I followed the link and took a look. Well, isn't this just looking up a table and computing probabilities? It has next to nothing to do with machine learning. I also thought that predicting gender from a name was nonsense, but when I checked, I found that a well-known patriotic company and a foreign company both offer APIs that predict gender from names. So it seemed worth a try.

With the complaints out of the way, here is the final result first:
[image: the finished web page]
The accuracy is acceptable, batch queries of several names are supported, and the whole page responds quickly.

environment

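I am using Python 3.10. The following packages need to be installed:
numpy
pandas
pypinyin
tensorflow-cpu
scikit-learn (train/test split in the training code)
matplotlib (accuracy plot in the training code)
plotly (web page)
dash (web page)
dash-bootstrap-components (web page)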

method

1. How to get a dataset of names and genders. At first I looked for the source of the database behind the table-lookup article. I found its name frequency table on GitHub, but not the original database. It is said to have been taken from some leaked hotel check-in records (kaifangjilu). I thought, isn't that illegal? Is there no legal way to obtain this kind of dataset? In the end I did find a dataset of 1.2 million names with genders on GitHub:
click through and find the file Chinese_Names_Corpus_Gender (120W).txt
2. How to extract features from Chinese names and turn them into something a machine can understand. After much deliberation, I decided to first convert the Chinese characters to pinyin without tone marks, and then convert each letter to its position in the alphabet. This turned out to work well.

The conversion is shown in the figure below:
[image: a name converted to pinyin and then to numbers]
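
As a minimal sketch of this encoding, the snippet below starts from pinyin syllables that are already available (in the real pipeline they come from pypinyin's lazy_pinyin) and maps each letter to its position in the alphabet. The function name encode_syllables is hypothetical and only used for illustration.

```python
def encode_syllables(syllables):
    """Map pinyin syllables such as ["li", "ze"] to alphabet positions."""
    letters = "".join(syllables)  # ["li", "ze"] -> "lize"
    # 'a' -> 1, 'b' -> 2, ..., 'z' -> 26
    return [ord(ch) - ord("a") + 1 for ch in letters]


codes = encode_syllables(["li", "ze"])
print(codes)  # [12, 9, 26, 5]
```

The value 0 is deliberately left unused here, so it can later serve as the padding value for short names.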

the code

The code is divided into three parts: data preparation, model training, and the web app.

data preparation code

First, convert the downloaded txt file into a csv file:
[image]
Remove the leading part of the file and change the file extension; the result is a classic comma-separated csv file.
[image]
Next, read and process the data:

import pandas as pd
df = pd.read_csv("test.csv")
df

I ran this in a notebook; the result is as follows:

[image: the DataFrame as displayed in the notebook]

We first need to convert the gender to numbers: 0 for male, 1 for female, and 2 for unknown:

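# Assign the result back instead of calling replace(..., inplace=True) on a
# column selection, which is not guaranteed to modify df itself
df['sex'] = df['sex'].replace(['男', '女', '未知'],
                              [0, 1, 2])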

Then batch convert the names and save to a new csv file:

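from pypinyin import lazy_pinyin
import time

encoded = []  # collect the vectors here and assign the column once at the end
a1 = time.time()
for count, x in enumerate(df['dict'], start=1):
    list_pinyin = lazy_pinyin(x)  # e.g. ["a", "zuo"]
    c = ''.join(list_pinyin)  # e.g. "azuo"
    # 'a'..'z' -> 1.0..26.0; characters below 'a' become 0.0
    num_pinyin = [max(0.0, ord(char) - 96.0) for char in c]
    # pad short names with 0.0
    num_pinyin_pad = num_pinyin + [0.0] * max(0, 20 - len(num_pinyin))
    # keep the first 15 values so that every input vector has the same length
    encoded.append(num_pinyin_pad[:15])
    if count % 10000 == 0:
        print(time.time() - a1)  # elapsed seconds, printed every 10,000 names
# Replaces the chained assignment df['dict'][count] = ..., which is slow and
# unreliable in recent pandas versions
df['dict'] = encoded
df.to_csv('after_2.csv')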

This step was slow for me: with this much data it took about half an hour. I print the elapsed time every 10,000 names; that can be removed. One detail: the model needs input vectors of a fixed length, so short names are padded with 0 and long names are cut off, always keeping the first fifteen letters. Once the csv is saved you can exit.
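
The fixed-length rule can be sketched in isolation as follows. The helper names to_fixed_length and in_embedding_range are hypothetical. The range check reflects an assumption worth verifying on your own data: the model defined later uses an embedding with 27 entries, so every code should lie between 0 and 26, and a character outside a to z that slips through the pinyin conversion would produce a larger value.

```python
NAME_LENGTH = 15  # fixed vector length used in this article
NUM_CODES = 27    # 0 for padding plus 1..26 for 'a'..'z'


def to_fixed_length(codes, length=NAME_LENGTH):
    """Pad with 0.0 on the right, then keep only the first `length` values."""
    padded = list(codes) + [0.0] * max(0, length - len(codes))
    return padded[:length]


def in_embedding_range(codes, num_codes=NUM_CODES):
    """True if every code is a valid index for an embedding of size num_codes."""
    return all(0 <= c < num_codes for c in codes)


short = to_fixed_length([12.0, 9.0, 26.0, 5.0])  # 4 letters -> padded to 15
long_ = to_fixed_length([1.0] * 20)              # 20 letters -> cut to 15
print(len(short), len(long_))

# A letter outside a..z, e.g. 'é', gets a code far above 26
odd = [max(0.0, ord("é") - 96.0)]
print(in_embedding_range(short), in_embedding_range(odd))
```

Rows that fail such a check could be dropped or clamped before training; which of the two is appropriate depends on the data.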

data training code

First read the data back in. Because binary classification turned out to perform better, we exclude the names whose gender is 2, that is, unknown.

import pandas as pd
import numpy as np
df = pd.read_csv('after_2.csv')
df_binary = df[df['sex']!=2]

Prepare the input vector:

import json
test_list = df_binary['dict'].values.tolist()
# Each cell was saved as text such as "[12.0, 9.0, ...]";
# parse it with json.loads instead of eval
test_list = [json.loads(item) for item in test_list]
X = np.array(test_list, dtype=np.float32)
y = np.asarray(df_binary['sex'].values.tolist())

The shape of X is (1050353, 15), and the shape of y is (1050353,).

Divide the training set and test set:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,
y,test_size=0.2,random_state=0)

Prepare the model:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.optimizers import Adam

def lstm_model(num_alphabets=27, name_length=15, embedding_dim=256):
    model = Sequential([
        # use the name_length argument instead of a hard-coded 15
        Embedding(num_alphabets, embedding_dim, input_length=name_length),
        Bidirectional(LSTM(units=128, recurrent_dropout=0.2, dropout=0.2)),
        Dense(1, activation="sigmoid")
    ])

    model.compile(loss='binary_crossentropy',
                  optimizer=Adam(learning_rate=0.001),
                  metrics=['accuracy'])

    return model

There is only a single LSTM layer, so the model can be trained "easily" even on a CPU (meaning that one epoch takes about half an hour).
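
To get a feel for how small this network is, the parameter count can be worked out by hand. The sketch below is plain arithmetic and not taken from the original post; it assumes the standard layout of an embedding table, an LSTM with four gates and one bias vector per gate, and a dense layer on the concatenated outputs of both directions. If the model has been built, model.summary() is the authoritative figure.

```python
num_alphabets, embedding_dim, units = 27, 256, 128

# Embedding: one vector of size embedding_dim per code
embedding_params = num_alphabets * embedding_dim

# One LSTM direction: 4 gates, each with input weights, recurrent weights, bias
lstm_params = 4 * (units * (embedding_dim + units) + units)

# Bidirectional wraps two independent LSTMs
bilstm_params = 2 * lstm_params

# Dense(1) on the concatenated forward/backward outputs, plus one bias
dense_params = 2 * units * 1 + 1

total_params = embedding_params + bilstm_params + dense_params
print(embedding_params, bilstm_params, dense_params, total_params)
# 6912 394240 257 401409
```

With roughly 400 thousand parameters, most of the training time comes from the number of names and the sequential nature of the LSTM rather than from the model size.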

train:

import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping
# Step 1: Instantiate the model
model = lstm_model(num_alphabets=27, name_length=15, embedding_dim=256)

# Step 2: Split Training and Test Data

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=0)

# Step 3: Train the model
callbacks = [
    EarlyStopping(monitor='val_accuracy',
                  min_delta=1e-3,
                  patience=5,
                  mode='max',
                  restore_best_weights=True,
                  verbose=1),
]

history = model.fit(x=X_train,
                    y=y_train,
                    batch_size=64,
                    epochs=3,
                    validation_data=(X_test, y_test),
                    callbacks=callbacks)

# Step 4: Save the model
model.save('boyorgirl.h5')

# Step 5: Plot accuracies
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='val')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

The network is a wheel that someone else built, and I used it as is. Because training is slow, I only trained for three epochs here. The accuracy on both the training set and the test set is still rising and is close to 0.88, which suggests that further training still has room to improve the result:

[image: training and validation accuracy per epoch]

web development code

This part is fairly long, so I won't explain it in detail. Note that at the end the app is set to run on port 3000 on 0.0.0.0. That means that if you deploy it on a server, you can open the page by visiting the server's IP address followed by the port number. To view it only locally, set the host to 127.0.0.1. You also need to prepare an explanatory file, faq.md, in the same directory; its content is shown at the bottom of the web page, as in the picture below:
[image: the FAQ section at the bottom of the page]

import os
import pandas as pd
import numpy as np
import re
from tensorflow.keras.models import load_model
from pypinyin import lazy_pinyin
import plotly.express as px
import dash
from dash import dash_table
import dash_bootstrap_components as dbc
from dash import dcc
from dash import html
from dash.dependencies import Input, Output, State

pred_model = load_model('boyorgirl.h5')

# Setup the Dash App
external_stylesheets = [dbc.themes.LITERA]
app = dash.Dash(__name__, external_stylesheets=external_stylesheets)

# Server
server = app.server

# FAQ section
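# Read as UTF-8 explicitly so that non-ASCII text in faq.md loads on any platform
with open('faq.md', 'r', encoding='utf-8') as file:
    faq = file.read()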

# App Layout
app.layout = html.Table([
    html.Tr([
        # Title: "Boy or girl?"
        html.H1(html.Center(html.B('男孩或者女孩?'))),
        # Subtitle: "Predict gender from a name"
        html.Div(
            html.Center("根据名字预测性别"),
            style={'fontSize': 20}),
        html.Br(),
        # Input box; placeholder: "separate multiple names with commas or spaces"
        html.Div(
            dbc.Input(id='names',
                      value='李泽,李倩',
                      placeholder='输入多个名字请用逗号或者空格分开',
                      style={'width': '700px'})),
        html.Br(),
        # "Submit" and "Reset" buttons
        html.Center(children=[
            dbc.Button('提交',
                       id='submit-button',
                       n_clicks=0,
                       color='primary',
                       type='submit'),
            dbc.Button('重置',
                       id='reset-button',
                       color='secondary',
                       type='submit',
                       # Dash html components expect camelCase style keys
                       style={"marginLeft": "50px"})
        ]),
        html.Br(),
        dcc.Loading(id='table-loading',
                    type='default',
                    children=html.Div(id='predictions',
                                      children=[],
                                      style={'width': '700px'})),
        dcc.Store(id='selected-names'),
        html.Br(),
        dcc.Loading(id='chart-loading',
                    type='default',
                    children=html.Div(id='bar-plot', children=[])),
        html.Br(),
        # Section title: "About this project"
        html.Div(html.Center(html.B('关于该项目')),
                 style={'fontSize': 20}),
        dcc.Markdown(faq, style={'width': '700px'})
    ])
],
                        style={
                            'marginLeft': 'auto',
                            'marginRight': 'auto'
                        })


# Callbacks
@app.callback([Output('submit-button', 'n_clicks'),
               Output('names', 'value')], Input('reset-button', 'n_clicks'),
              State('names', 'value'))
def update(n_clicks, value):
    if n_clicks is not None and n_clicks > 0:
        return -1, ''
    else:
        return 0, value


@app.callback(
    [Output('predictions', 'children'),
     Output('selected-names', 'data')], Input('submit-button', 'n_clicks'),
    State('names', 'value'))
def predict(n_clicks, value):
    if n_clicks >= 0:
        # Split the input on every non-word character (commas, spaces, ...)
        names = re.findall(r"\w+", value)

        # Convert to dataframe
        pred_df = pd.DataFrame({'name': names})
        list_list = []
        # Preprocess: same pinyin encoding as in the data preparation step
        for x in names:
            list_pinyin = lazy_pinyin(x)
            c = ''.join(list_pinyin)
            num_pinyin = [max(0.0, ord(char)-96.0) for char in c]
            num_pinyin_pad = num_pinyin + [0.0] * max(0, 20 - len(num_pinyin))
            list_list.append(num_pinyin_pad[:15])
        # Predictions: pass an explicit (n_names, 15) float array to the model.
        # The sigmoid output is the probability of label 1 (female).
        result = pred_model.predict(
            np.array(list_list, dtype=np.float32)).squeeze(axis=1)
        # Column '男还是女' = "male or female"
        pred_df['男还是女'] = [
            '女' if logit > 0.5 else '男' for logit in result
        ]
        # Column '可能性' = "likelihood" of the predicted label
        pred_df['可能性'] = [
            logit if logit > 0.5 else 1.0 - logit for logit in result
        ]

        # Format the output ('名字' = "name")
        pred_df['name'] = names
        pred_df.rename(columns={'name': '名字'}, inplace=True)
        pred_df['可能性'] = pred_df['可能性'].round(2)
        pred_df.drop_duplicates(inplace=True)

        return [
            dash_table.DataTable(
                id='pred-table',
                columns=[{
                    'name': col,
                    'id': col,
                } for col in pred_df.columns],
                data=pred_df.to_dict('records'),
                filter_action="native",
                filter_options={"case": "insensitive"},
                sort_action="native",  # give user capability to sort columns
                sort_mode="single",  # sort across 'multi' or 'single' columns
                page_current=0,  # page number that user is on
                page_size=10,  # number of rows visible per page
                style_cell={
                    'fontFamily': 'Open Sans',
                    'textAlign': 'center',
                    'padding': '10px',
                    'backgroundColor': 'rgb(255, 255, 204)',
                    'height': 'auto',
                    'font-size': '16px'
                },
                style_header={
                    'backgroundColor': 'rgb(128, 128, 128)',
                    'color': 'white',
                    'textAlign': 'center'
                },
                export_format='csv')
        ], names
    else:
        return [], ''


@app.callback(Output('bar-plot', 'children'), [
    Input('submit-button', 'n_clicks'),
    Input('predictions', 'children'),
    Input('selected-names', 'data')
])
def bar_plot(n_clicks, data, selected_names):
    if n_clicks >= 0:
        # Bar Chart
        data = pd.DataFrame(data[0]['props']['data'])
        fig = px.bar(data,
                     x="可能性",
                     y="名字",
                     color='男还是女',
                     orientation='h',
                     color_discrete_map={
                         '男': 'dodgerblue',
                         '女': 'lightcoral'
                     })

        fig.update_layout(title={
            'text': '预测正确的可能性',
            'x': 0.5
        },
                          yaxis={
                              'categoryorder': 'array',
                              'categoryarray': selected_names,
                              'autorange': 'reversed',
                          },
                          xaxis={'range': [0, 1]},
                          font={'size': 14},
                          width=700)

        return [dcc.Graph(figure=fig)]
    else:
        return []


if __name__ == '__main__':
    # 0.0.0.0 listens on all interfaces; use 127.0.0.1 for local-only access
    app.run_server(host='0.0.0.0', port=3000, proxy=None, debug=False)
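
The two small pieces of logic inside the predict callback, splitting the input box into names and turning the sigmoid output into a label with a confidence, can be tried on their own without Dash or TensorFlow. This is only a sketch: the function names split_names and label_and_confidence are hypothetical, and the probabilities below are made-up inputs rather than model outputs.

```python
import re


def split_names(text):
    # In Python 3, \w matches Unicode word characters, so Chinese names stay
    # intact while commas, spaces and other punctuation act as separators
    return re.findall(r"\w+", text)


def label_and_confidence(p_female):
    # Label 1 meant female during training, so the sigmoid output is read as
    # the probability of female; the confidence is that of the chosen label
    if p_female > 0.5:
        return "女", p_female
    return "男", 1.0 - p_female


names = split_names("李泽,李倩 王芳")
print(names)                       # ['李泽', '李倩', '王芳']
print(label_and_confidence(0.75))  # ('女', 0.75)
print(label_and_confidence(0.25))  # ('男', 0.75)
```

Because the confidence is always that of the predicted label, it never falls below 0.5, which is why the bar chart in the app only ever shows bars in the upper half of the 0 to 1 range.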

Origin blog.csdn.net/weixin_43945848/article/details/130223519