foreword
I started this project because CSDN recommended a Q&A thread to me: developing a mobile app that predicts gender from a name using machine learning. I clicked in and found that someone had already answered it, so I followed the link and took a look. Well, isn't this just a table lookup that computes probabilities? It has nothing to do with machine learning. I also thought that predicting gender from a name was nonsense, but when I looked around I found that a well-known "patriotic" company and a foreign company both offer APIs that predict gender from names. So it seemed worth a try.
Rant over. The final result first:
The accuracy is acceptable, batch queries of names are supported, and the whole page responds quickly.
environment
I am using Python 3.10 and need to install the following packages:
numpy
pandas
pypinyin
tensorflow-cpu
scikit-learn (train/test split)
matplotlib (accuracy plot)
plotly (webpage)
dash (webpage)
dash-bootstrap-components (webpage)
method
1. How to get a dataset of names and genders. At first I tried to find out where the database behind the table-lookup article came from. I found its name frequency table on GitHub, but not the original database. It is said to have been taken from leaked hotel check-in records (kaifangjilu). I thought: isn't that illegal? Is there no legal way to obtain this kind of dataset? In the end I did find a dataset of 1.2 million names with genders on GitHub:
Click in and find the file Chinese_Names_Corpus_Gender (120W).txt.
2. How to extract features from Chinese names and turn them into something a machine can work with. After much deliberation, I decided to first convert the Chinese characters to pinyin without tone marks, and then convert each letter to its position in the alphabet (a is 1, z is 26). This turned out to work well.
The figure below shows the details:
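In code, the same encoding can be sketched in a few lines. This is only an illustration, not the exact code used later: the helper name `encode_pinyin` is hypothetical, and it takes an already romanized string (for example the joined output of `pypinyin.lazy_pinyin`) so that it runs without extra packages. It also maps every character outside a-z to 0, which is an assumption of this sketch.

```python
def encode_pinyin(pinyin: str, length: int = 15) -> list[float]:
    """Map a-z to 1..26 (anything else to 0), then pad or cut to `length`."""
    codes = [float(ord(ch) - 96) if "a" <= ch <= "z" else 0.0 for ch in pinyin]
    codes = codes[:length]                          # cut long names
    return codes + [0.0] * (length - len(codes))    # pad short names with zeros


# "lize" is the toneless pinyin of 李泽
print(encode_pinyin("lize"))
```

Every name ends up as a vector of the same length, which is what the model input requires.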
the code
The code is divided into three parts: data preparation, model training, and the web app.
data preparation code
First, convert the downloaded txt file into a csv file:
delete the leading part of the file, then change the file extension to .csv. The result is an ordinary comma-separated csv file.
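If you would rather not edit the file by hand, the same cleanup can be sketched in Python. Everything specific here is an assumption: the helper name `strip_preamble` is hypothetical, the header line `dict,sex` is inferred from the column names used in the code that follows, and the sample lines are made up for illustration.

```python
def strip_preamble(lines, header="dict,sex"):
    """Return the lines starting at the first line equal to `header`."""
    for i, line in enumerate(lines):
        if line.strip() == header:
            return [l.rstrip("\n") for l in lines[i:]]
    raise ValueError(f"header {header!r} not found")


# Made-up sample standing in for the downloaded txt file
sample = ["Some description line\n", "\n", "dict,sex\n", "李泽,男\n", "李倩,女\n"]
cleaned = strip_preamble(sample)
print(cleaned)
```

For the real file you would read the lines from the txt file and write the returned lines to a file ending in .csv.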
Next, start reading and processing data:
import pandas as pd
df = pd.read_csv("test.csv")
df
I ran this in a notebook, and the result is as follows:
We first need to encode gender as numbers: 0 for male, 1 for female, and 2 for unknown:
df['sex'] = df['sex'].replace(['男', '女', '未知'], [0, 1, 2])
Then batch-convert the names and save the result to a new csv file:
from pypinyin import lazy_pinyin
import time

count = 0
a1 = time.time()
for x in df['dict']:
    list_pinyin = lazy_pinyin(x)  # e.g. ["a", "zuo"]
    c = ''.join(list_pinyin)  # e.g. "azuo"
    # Letters a-z become 1-26; characters below 'a' become 0
    num_pinyin = [max(0.0, ord(char) - 96.0) for char in c]
    # Pad short names with zeros
    num_pinyin_pad = num_pinyin + [0.0] * max(0, 20 - len(num_pinyin))
    # Keep the first 15 values so every input vector has the same length
    df.at[count, 'dict'] = num_pinyin_pad[:15]
    count += 1
    a2 = time.time()
    if count % 10000 == 0:
        print(a2 - a1)  # elapsed seconds, printed every 10,000 names
df.to_csv('after_2.csv')
This step takes a long time, about half an hour, because the dataset is large. I print the elapsed time every 10,000 names; that can be removed. One detail: the model needs input vectors of a fixed length, so short names are padded with 0 and long names are cut off, always keeping the first fifteen letters. You can close the notebook once the csv is saved.
data training code
First read the data back in. Binary classification turned out to perform better, so we exclude names whose gender is 2, that is, unknown.
import pandas as pd
import numpy as np
df = pd.read_csv('after_2.csv')
df_binary = df[df['sex']!=2]
Prepare the input vector:
import ast

test_list = df_binary['dict'].values.tolist()
for i in range(len(test_list)):
    # Each cell was saved as the text of a list; parse it back into a list
    test_list[i] = ast.literal_eval(test_list[i])
X = np.array(test_list, dtype=np.float32)
y = np.asarray(df_binary['sex'].values.tolist())
The shape of X is (1050353, 15), and the shape of y is (1050353,).
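The parsing step is needed because `to_csv` writes each list as text, so `read_csv` hands back strings rather than lists. A small sketch of turning one such cell back into numbers with `ast.literal_eval`, which only accepts Python literals and is therefore a safer choice than `eval` for this job. The sample string is shortened and made up for illustration.

```python
import ast

# How one stored vector looks after the CSV round trip (shortened)
cell = "[12.0, 9.0, 26.0, 5.0, 0.0]"

vector = ast.literal_eval(cell)   # parse the text back into a list of floats
print(type(vector).__name__, vector)
```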
Split the data into a training set and a test set:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
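As a sanity check on the shapes after the split, the set sizes can be worked out by hand. The sketch assumes that scikit-learn rounds the test share up and gives the remainder to the training set, which is my understanding of its behaviour; printing `X_train.shape` and `X_test.shape` gives the authoritative numbers.

```python
import math

n_samples = 1_050_353   # number of rows in X, as reported above
test_size = 0.2

n_test = math.ceil(test_size * n_samples)   # test share, rounded up
n_train = n_samples - n_test                # everything else is used for training
print(n_train, n_test)
```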
Prepare the model:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.optimizers import Adam


def lstm_model(num_alphabets=27, name_length=15, embedding_dim=256):
    model = Sequential([
        # One embedding vector for each of the 27 symbols (0 = padding, 1-26 = a-z)
        Embedding(num_alphabets, embedding_dim, input_length=name_length),
        Bidirectional(LSTM(units=128, recurrent_dropout=0.2, dropout=0.2)),
        # A single sigmoid output: the probability of class 1 (female)
        Dense(1, activation="sigmoid")
    ])
    model.compile(loss='binary_crossentropy',
                  optimizer=Adam(learning_rate=0.001),
                  metrics=['accuracy'])
    return model
There is only one LSTM layer, so the model can be trained on a CPU without trouble (although one epoch takes about half an hour).
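To get a feel for how small the network is, the number of trainable weights can be estimated by hand. The sketch below assumes the usual Keras LSTM layout (four gates, each with input weights, recurrent weights and a bias); `lstm_params` is a hypothetical helper, and `model.summary()` remains the authoritative source.

```python
def lstm_params(input_dim: int, units: int) -> int:
    """Weights of one LSTM layer: 4 gates with input, recurrent and bias terms."""
    return 4 * (units * (input_dim + units) + units)


num_alphabets, embedding_dim, units = 27, 256, 128

embedding = num_alphabets * embedding_dim        # one vector per symbol
bilstm = 2 * lstm_params(embedding_dim, units)   # forward and backward LSTM
dense = 2 * units + 1                            # 256 inputs plus one bias
total = embedding + bilstm + dense
print(embedding, bilstm, dense, total)
```

Under these assumptions the model has roughly 400,000 weights, almost all of them in the bidirectional LSTM.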
train:
import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping

# Step 1: Instantiate the model
model = lstm_model(num_alphabets=27, name_length=15, embedding_dim=256)

# Step 2: Split Training and Test Data
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=0)

# Step 3: Train the model
callbacks = [
    EarlyStopping(monitor='val_accuracy',
                  min_delta=1e-3,
                  patience=5,
                  mode='max',
                  restore_best_weights=True,
                  verbose=1),
]
history = model.fit(x=X_train,
                    y=y_train,
                    batch_size=64,
                    epochs=3,
                    validation_data=(X_test, y_test),
                    callbacks=callbacks)

# Step 4: Save the model
model.save('boyorgirl.h5')

# Step 5: Plot accuracies
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='val')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
The network itself is a wheel someone else built; I used it as is. Because training is slow, I only trained for three epochs here. The accuracy on both the training set and the test set is still rising and is close to 0.88, which suggests that further training would still improve the result:
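One consequence of the settings above is that early stopping cannot trigger in a three-epoch run, because it waits for five epochs without improvement. The following is a simplified pure-Python sketch of the patience rule, not Keras's actual implementation; `early_stop_epoch` is a hypothetical helper and the accuracy curves are made up.

```python
def early_stop_epoch(val_acc, patience=5, min_delta=1e-3):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("-inf")
    wait = 0
    for epoch, acc in enumerate(val_acc, start=1):
        if acc - min_delta > best:   # counts as an improvement
            best, wait = acc, 0
        else:
            wait += 1                # one more epoch without improvement
            if wait >= patience:
                return epoch
    return None


# Made-up accuracy curves, for illustration only
short_run = early_stop_epoch([0.86, 0.87, 0.88])
plateau_run = early_stop_epoch([0.80, 0.88, 0.88, 0.88, 0.88, 0.88, 0.88])
print(short_run, plateau_run)
```

If you train for more epochs, the callback becomes useful: it stops the run once validation accuracy has stalled and restores the best weights.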
web development code
This part is relatively long, so I won't explain it in detail. Note that at the end I set the app to run on port 3000 on 0.0.0.0. That means that if you deploy it on a server, you can open the page by visiting the server's IP address followed by the port number. To view it only on your local machine, set the host to 127.0.0.1. In addition, you need to prepare a faq.md file in the same directory; its contents are shown at the bottom of the webpage as an explanation, as in the following picture:
import os
import pandas as pd
import numpy as np
import re
from tensorflow.keras.models import load_model
from pypinyin import lazy_pinyin
import plotly.express as px
import dash
from dash import dash_table
import dash_bootstrap_components as dbc
from dash import dcc
from dash import html
from dash.dependencies import Input, Output, State
pred_model = load_model('boyorgirl.h5')
# Setup the Dash App
external_stylesheets = [dbc.themes.LITERA]
app = dash.Dash(__name__, external_stylesheets=external_stylesheets)
# Server
server = app.server
# FAQ section
with open('faq.md', 'r', encoding='utf-8') as file:
    faq = file.read()
# App Layout
app.layout = html.Table([
    html.Tr([
        # Title: "Boy or girl?"
        html.H1(html.Center(html.B('男孩或者女孩?'))),
        # Subtitle: "Predict gender from a name"
        html.Div(html.Center("根据名字预测性别"),
                 style={'fontSize': 20}),
        html.Br(),
        # Placeholder: "Separate multiple names with commas or spaces"
        html.Div(
            dbc.Input(id='names',
                      value='李泽,李倩',
                      placeholder='输入多个名字请用逗号或者空格分开',
                      style={'width': '700px'})),
        html.Br(),
        html.Center(children=[
            # "Submit" button
            dbc.Button('提交',
                       id='submit-button',
                       n_clicks=0,
                       color='primary',
                       type='submit'),
            # "Reset" button
            dbc.Button('重置',
                       id='reset-button',
                       color='secondary',
                       type='submit',
                       style={"margin-left": "50px"})
        ]),
        html.Br(),
        dcc.Loading(id='table-loading',
                    type='default',
                    children=html.Div(id='predictions',
                                      children=[],
                                      style={'width': '700px'})),
        dcc.Store(id='selected-names'),
        html.Br(),
        dcc.Loading(id='chart-loading',
                    type='default',
                    children=html.Div(id='bar-plot', children=[])),
        html.Br(),
        # Section title: "About this project"
        html.Div(html.Center(html.B('关于该项目')),
                 style={'fontSize': 20}),
        dcc.Markdown(faq, style={'width': '700px'})
    ])
],
    style={
        'marginLeft': 'auto',
        'marginRight': 'auto'
    })
# Callbacks
@app.callback([Output('submit-button', 'n_clicks'),
               Output('names', 'value')],
              Input('reset-button', 'n_clicks'),
              State('names', 'value'))
def update(n_clicks, value):
    if n_clicks is not None and n_clicks > 0:
        return -1, ''
    else:
        return 0, value
@app.callback(
    [Output('predictions', 'children'),
     Output('selected-names', 'data')],
    Input('submit-button', 'n_clicks'),
    State('names', 'value'))
def predict(n_clicks, value):
    if n_clicks >= 0:
        # Split the input on anything that is not a word character
        # (commas, spaces, ...); the number of names is not limited
        names = re.findall(r"\w+", value)
        # Convert to dataframe
        pred_df = pd.DataFrame({'name': names})
        list_list = []
        # Preprocess the names the same way as in data preparation
        for x in names:
            list_pinyin = lazy_pinyin(x)
            c = ''.join(list_pinyin)
            num_pinyin = [max(0.0, ord(char) - 96.0) for char in c]
            num_pinyin_pad = num_pinyin + [0.0] * max(0, 20 - len(num_pinyin))
            list_list.append(num_pinyin_pad[:15])
        # Predictions: the model outputs the probability of class 1 (female)
        result = pred_model.predict(
            np.array(list_list, dtype=np.float32)).squeeze(axis=1)
        pred_df['男还是女'] = [  # column "male or female"
            '女' if logit > 0.5 else '男' for logit in result
        ]
        pred_df['可能性'] = [  # column "probability"
            logit if logit > 0.5 else 1.0 - logit for logit in result
        ]
        # Format the output
        pred_df['name'] = names
        pred_df.rename(columns={'name': '名字'}, inplace=True)  # column "name"
        pred_df['可能性'] = pred_df['可能性'].round(2)
        pred_df.drop_duplicates(inplace=True)
        return [
            dash_table.DataTable(
                id='pred-table',
                columns=[{
                    'name': col,
                    'id': col,
                } for col in pred_df.columns],
                data=pred_df.to_dict('records'),
                filter_action="native",
                filter_options={"case": "insensitive"},
                sort_action="native",  # give user capability to sort columns
                sort_mode="single",  # sort across 'multi' or 'single' columns
                page_current=0,  # page number that user is on
                page_size=10,  # number of rows visible per page
                style_cell={
                    'fontFamily': 'Open Sans',
                    'textAlign': 'center',
                    'padding': '10px',
                    'backgroundColor': 'rgb(255, 255, 204)',
                    'height': 'auto',
                    'font-size': '16px'
                },
                style_header={
                    'backgroundColor': 'rgb(128, 128, 128)',
                    'color': 'white',
                    'textAlign': 'center'
                },
                export_format='csv')
        ], names
    else:
        return [], ''
@app.callback(Output('bar-plot', 'children'), [
    Input('submit-button', 'n_clicks'),
    Input('predictions', 'children'),
    Input('selected-names', 'data')
])
def bar_plot(n_clicks, data, selected_names):
    if n_clicks >= 0:
        # Bar Chart
        data = pd.DataFrame(data[0]['props']['data'])
        fig = px.bar(data,
                     x="可能性",
                     y="名字",
                     color='男还是女',
                     orientation='h',
                     color_discrete_map={
                         '男': 'dodgerblue',
                         '女': 'lightcoral'
                     })
        # Chart title: "Probability that the prediction is correct"
        fig.update_layout(title={
            'text': '预测正确的可能性',
            'x': 0.5
        },
                          yaxis={
                              'categoryorder': 'array',
                              'categoryarray': selected_names,
                              'autorange': 'reversed',
                          },
                          xaxis={'range': [0, 1]},
                          font={'size': 14},
                          width=700)
        return [dcc.Graph(figure=fig)]
    else:
        return []
if __name__ == '__main__':
    # The port is passed as an integer
    app.run_server(host='0.0.0.0', port=3000, proxy=None, debug=False)
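Two small pieces of logic in the `predict` callback are easier to follow in isolation: splitting the input box into names, and turning the model's sigmoid output into a label with a confidence. The sketch below mirrors that logic without Dash or TensorFlow; the helper names `split_names` and `label_and_confidence` are hypothetical, and the probabilities passed in are made up.

```python
import re


def split_names(text: str) -> list[str]:
    """Split user input on anything that is not a word character."""
    return re.findall(r"\w+", text)


def label_and_confidence(p_female: float) -> tuple[str, float]:
    """Turn the sigmoid output into a label and the probability of that label."""
    if p_female > 0.5:
        return "女", p_female        # female
    return "男", 1.0 - p_female      # male


# Full-width commas, ASCII commas and spaces all act as separators
print(split_names("李泽,李倩, 李泽"))
print(label_and_confidence(0.9))
print(label_and_confidence(0.2))
```

Because the confidence is always the probability of the predicted label, it never drops below 0.5, which is why the bars in the chart all start from the middle of the 0 to 1 range or higher.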