Simple machine learning engineering process

1. Confirm requirements (frame the problem)

What do we need to do?

For example, given some input data, predict a certain value?

For example, given some features, determine what kind of animal this is?

Here we should analyze: what kind of problem are we dealing with?

A classification problem? A regression problem?

What solutions already exist for this kind of problem? Logistic regression? SVMs? Neural networks? Random forests?

2. Confirm features (get the data)

Decide which features we need, and how the data for those features should be obtained.

The most important question is: when we actually run model predictions in production, what data can we get?

For example, database access? Reading from a file (txt, Excel, etc.)? Then do some simple processing on the data, such as removing missing values; a minimal sketch follows.
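A minimal sketch of this step (the file name data.csv and the use of dropna are placeholders, not from the original post):

# Load raw data from a CSV file; it could equally come from a database or Excel
import pandas as pd

df = pd.read_csv("data.csv")
df = df.dropna()       # simple cleaning: drop rows with missing values
print(df.head())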

3. Feature processing

Feature encoding (why encode? Because many features are strings, and we have to convert them to numbers or binary vectors before they can be used in computation.)

Commonly used encodings:

One-hot encoding

# One-hot encoding with pandas
import pandas as pd

df = pd.DataFrame([
    ["green", "M", 20, "class1"],
    ["red", "L", 21, "class2"],
    ["blue", "XL", 30, "class3"],
])
df.columns = ["color", "size", "weight", "class label"]
df2 = pd.get_dummies(df["class label"])
print(df2)


# One-hot encoding with sklearn's DictVectorizer
from sklearn.feature_extraction import DictVectorizer

alist = [
    {"city": "beijing", "temp": 33},
    {"city": "GZ", "temp": 42},
    {"city": "SH", "temp": 40},
]
d = DictVectorizer(sparse=False)
feature = d.fit_transform(alist)
print(d.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0
print(feature)

Label Encoding

LabelEncoder handles only one column at a time, so multiple columns have to be encoded in a for loop (as in the snippet further below).

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])  # df is a pandas DataFrame

Note: the inputs and outputs of your model are the encoded values. Both encodings above are fitted on the categories present in the column, so you need to save the fitted categories every time you train, and encode the input data with the same category mapping at prediction time:

We can save the mapping between the original values and the encoded values directly, either in a dictionary or in the CSV format shown below.

for col in beat_sparse_cols:                   # encode every sparse feature column
    lbe = LabelEncoder()
    # modify the original table in place
    beat_data[col] = lbe.fit_transform(beat_data[col])
    # method 2 (selected): keep a dict mapping original value -> code for each encoder
    name = "encoding_" + str(col) + "_dict"
    locals()[name] = {}
    for i in list(lbe.classes_):
        locals()[name][i] = lbe.transform([i])[0]
    # save the encoder dict to CSV; note the index
    df = pd.DataFrame(locals()[name], index=[0])
    # df = pd.DataFrame(list(my_dict.items()), columns=['key', 'value'])   # otherwise the keys are saved as str by default
    df.to_csv(save_dir + "/" + str(col) + "lbe_dict.csv", index=False)

When predicting on new data, load the saved mapping, look up each category, and encode the new input. Categories never seen during training need special handling, for example:

# train and test are pandas DataFrames and c is whatever column
import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(train[c])
# map categories unseen during training to a dedicated '<unknown>' class
test[c] = test[c].map(lambda s: '<unknown>' if s not in le.classes_ else s)
le.classes_ = np.append(le.classes_, '<unknown>')
train[c] = le.transform(train[c])
test[c] = le.transform(test[c])

Normalization (min-max scaling; used when all features should carry the same weight)

# Normalization: scale each feature to [0, 1]
from sklearn.preprocessing import MinMaxScaler

mm = MinMaxScaler(feature_range=(0, 1))
data = [
    [90, 2, 10, 40],
    [60, 5, 15, 45],
    [73, 3, 13, 45],
]
data = mm.fit_transform(data)
print(data)

Standardization (better suited when the data contains large outliers)

# Standardization: zero mean and unit variance per feature
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
data = [
    [90, 2, 10, 40],
    [60, 5, 15, 45],
    [73, 3, 13, 45],
]
data = ss.fit_transform(data)
print(data)

Variance filtering and PCA

# Filter-style feature selection: drop features whose variance is below the threshold
from sklearn.feature_selection import VarianceThreshold

v = VarianceThreshold(threshold=2)
a = v.fit_transform([[0, 2, 4, 3], [0, 3, 7, 3], [0, 9, 6, 3]])
print(a)


# PCA: reduce the features to 2 principal components
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
a = pca.fit_transform([[0, 2, 4, 3], [0, 3, 7, 3], [0, 9, 6, 3]])
print(a)

Like encoding, PCA, normalization, and standardization are fitted transforms: save the fitted transformer at training time, and apply the same one when a single new sample comes in at prediction time.

That also answers how to denormalize: keep the scaler around and call its inverse transform, as in the sketch below.
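A minimal sketch of this (file names are placeholders): persist a fitted MinMaxScaler with joblib, reuse it on one new sample, and call inverse_transform to denormalize.

import joblib
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# training time: fit on the training data and save the scaler
mm = MinMaxScaler()
train = np.array([[90, 2, 10, 40], [60, 5, 15, 45], [73, 3, 13, 45]])
mm.fit(train)
joblib.dump(mm, "scaler.pkl")

# prediction time: load the same scaler and transform a single sample
mm2 = joblib.load("scaler.pkl")
x = np.array([[80, 4, 12, 42]])
x_scaled = mm2.transform(x)
x_back = mm2.inverse_transform(x_scaled)   # denormalize back to the original scale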

4. Select the algorithm and train the model

There is not much more to say about algorithm selection itself.

What matters is to cross-validate over the candidate parameters, etc., so it is easy to see which algorithm, with which parameters, performs best.

model_selection.cross_val_score

[sklearn] Cross-validation in sklearn (L Whale and Sea Blog, CSDN)
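A minimal sketch of cross-validation with cross_val_score (the dataset and model here are just examples):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)   # 5-fold cross-validation
print(scores.mean(), scores.std())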

After training, save the model:

[Sklearn] 3 file formats and calling methods for saving models (Artificial Intelligence Blog, CSDN)
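One common option is joblib (a sketch; the file name model.pkl is a placeholder):

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier().fit(X, y)

joblib.dump(clf, "model.pkl")     # save the trained model to disk
clf2 = joblib.load("model.pkl")   # load it again later for prediction
print(clf2.predict(X[:3]))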

5. Engineering (application)

Choose a web framework, such as Django or Flask, to turn the model into a web service.

[python] Django (Artificial Intelligence Blog, CSDN)
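As a minimal sketch of wrapping the model in a web service, here is a Flask version (the model file model.pkl and the request format are assumptions, not from the original post):

import joblib
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("model.pkl")   # the model saved in step 4

@app.route("/predict", methods=["POST"])
def predict():
    # expects JSON like {"features": [[90, 2, 10, 40]]}
    features = request.get_json()["features"]
    pred = model.predict(features)
    return jsonify({"prediction": pred.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)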

6. Deploy and go online

Django's built-in single-threaded server is relatively slow; it can be deployed behind a web container such as gunicorn.

[Django] How to use gunicorn to deploy Django programs (Programmer Sought)
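A rough example invocation (the module path myproject.wsgi is a placeholder for your own Django project):

gunicorn myproject.wsgi:application --workers 4 --bind 0.0.0.0:8000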


Origin blog.csdn.net/qq_35789269/article/details/131775476