Simple CAPTCHA recognition with a decision tree

principle

The core idea: similar input will produce similar output.

Principle: first, pick a feature from the training sample matrix and split on it, so that within each resulting sub-table every sample has the same value for that feature (for example, if the first feature is gender, the data can be split into two sub-tables, a male table and a female table). Then, within each sub-table, pick the next feature and split into smaller sub-tables by the same rule (for example, if the second feature is age, each sub-table might be split into three, say under 18, 18 to 60, and over 60, the exact brackets depending on the situation; the male and female tables then each hold three sub-tables, and within every sub-table the values of both features are identical). Repeat this until all features have been used up; at that point every leaf-level sub-table contains samples whose feature values are all identical.
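To make the sub-table idea concrete, here is a toy sketch (not from the original post; the gender/age data is made up) that splits a small sample table first by gender and then by age bracket:

samples = [
    {"gender": "male",   "age": 15, "label": "A"},
    {"gender": "male",   "age": 40, "label": "B"},
    {"gender": "female", "age": 70, "label": "C"},
    {"gender": "female", "age": 30, "label": "B"},
]

def age_bracket(age):
    # Three brackets, as in the example above: <18, 18-60, >60.
    if age < 18:
        return "<18"
    return "18-60" if age <= 60 else ">60"

# First split: one sub-table per gender value.
tables = {}
for s in samples:
    tables.setdefault(s["gender"], []).append(s)

# Second split: within each gender sub-table, one sub-table per age bracket.
leaves = {}
for gender, rows in tables.items():
    for s in rows:
        leaves.setdefault((gender, age_bracket(s["age"])), []).append(s["label"])

print(leaves)
# Every leaf now holds samples whose (gender, age-bracket) values are identical;
# a prediction is a majority vote (or an average) inside the matching leaf.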

Explanation: a decision tree is a classification method; it sorts samples into classes. Once the splitting is done, all samples in the same class (you can also think of it as a table) have essentially the same features. The output is then obtained either by averaging the samples in that class (regression) or by a majority vote over them (classification). So when a new sample needs a prediction, we only have to work out which class (table) it belongs to.

Engineering optimization (pruning): not every feature has to be used up. A leaf sub-table is allowed to mix different feature values, which reduces the number of layers in the tree and improves the performance of the model at the cost of an acceptable loss of accuracy. In general, the feature that reduces entropy the most is preferred as the basis for splitting into sub-tables. (Put simply, some feature values are not worth distinguishing: for the gender feature, say, we might keep one table instead of splitting into two, because gender often has little effect on the output.) How do we tell useful features from useless ones, or from ones with little effect? By information entropy or the Gini index. The feature dimension can also be reduced with methods such as PCA and ICA.
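As a rough illustration of how information entropy and the Gini index are computed, and how a feature can be judged by how much it lowers the impurity, here is a small sketch (my own, with made-up labels; not the sklearn internals):

import numpy as np

def entropy(labels):
    # Information entropy: -sum(p * log2(p)) over the class proportions.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini impurity: 1 - sum(p^2) over the class proportions.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(parent_labels, split_groups, measure=entropy):
    # How much splitting the parent table into the given sub-tables
    # lowers the (sample-weighted) impurity.
    n = len(parent_labels)
    weighted = sum(len(g) / n * measure(g) for g in split_groups)
    return measure(parent_labels) - weighted

labels = ["A", "A", "B", "B", "B", "C"]
print(entropy(labels), gini(labels))
# Impurity decrease for a hypothetical split into two sub-tables:
print(impurity_decrease(labels, [["A", "A", "B"], ["B", "B", "C"]]))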

sklearn api

class sklearn.tree.DecisionTreeClassifier()
Parameters

  • criterion: one of {"gini", "entropy"}, i.e. the Gini index or information entropy; default 'gini'
  • splitter: one of {'best', 'random'}; default 'best'; 'random' helps prevent overfitting
  • max_depth: the maximum depth of the tree; if not given, the tree is grown over all features, or until the min_samples_split condition stops it
  • min_samples_split: the minimum number of samples required to split a node; can be int or float, where a float means ceil(min_samples_split * n_samples), i.e. a fraction of the total number of samples
  • min_samples_leaf: the minimum number of samples required at each leaf node; can be int or float
  • min_weight_fraction_leaf: float, default=0.0; the minimum weighted fraction of the total sample weights required at a leaf node; if sample_weight is not given, all samples have equal weight
  • max_features: the maximum number of features to consider; int, float, or one of {"auto", "sqrt", "log2"}
    1. If int, consider max_features features at each split.
    2. If float, max_features is a fraction, and int(max_features * n_features) features are considered at each split.
    3. If "auto", then max_features = sqrt(n_features).
    4. If "sqrt", then max_features = sqrt(n_features).
    5. If "log2", then max_features = log2(n_features).
    6. If None, then max_features = n_features.
  • random_state: random seed, int or RandomState; used to control randomness and help prevent overfitting (I am not sure of the exact principle)
  • max_leaf_nodes: the maximum number of leaf nodes; the specific value has to be tuned case by case
  • min_impurity_decrease: a threshold on the impurity decrease; a split does not happen if its information gain is below the set value
  • min_impurity_split: used before version 0.19; replaced by min_impurity_decrease
  • class_weight: class weights
  • ccp_alpha: I could not make sense of this one

Attributes
  • classes_: the array of class labels
  • feature_importances_: feature importances (based on information entropy or the Gini index)
  • max_features_: the inferred value of max_features used by the model
  • n_classes_: the number of classes
  • n_features_: the number of features
  • n_outputs_: the number of outputs
  • tree_: the underlying tree object

Methods
  • apply(X[, check_input]): return the index of the leaf that each sample in X ends up in
  • cost_complexity_pruning_path(X, y[, ...]): I did not understand this one
  • decision_path(X[, check_input]): return the decision path of X through the tree
  • fit(X, y[, sample_weight, ...]): train the model
  • get_depth(): get the depth of the tree
  • get_n_leaves(): get the number of leaves of the tree
  • get_params([deep]): get the model parameters
  • predict(X[, check_input]): predict the output for X
  • predict_log_proba(X): predict the log probabilities for X
  • predict_proba(X[, check_input]): predict the probabilities for X
  • score(X, y[, sample_weight]): return the proportion of predictions on X that match y, i.e. the accuracy
  • set_params(**params): set the model parameters
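Before applying it to the CAPTCHA, here is a minimal sketch of the basic fit/predict/score workflow (my own example, using sklearn's built-in iris dataset as stand-in data):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small toy dataset and hold out 25% of it for testing.
data = load_iris()
train_x, test_x, train_y, test_y = train_test_split(
    data.data, data.target, test_size=0.25, random_state=0)

# Fit with the default Gini criterion and a capped depth.
model = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
model.fit(train_x, train_y)

print(model.get_depth(), model.get_n_leaves())  # structure of the fitted tree
print(model.predict(test_x[:5]))                # predicted labels for a few samples
print(model.score(test_x, test_y))              # mean accuracy on the test set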

Recognizing the CAPTCHA

The CAPTCHA used before had features that corresponded too obviously to the classes, so this time we pick the CAPTCHA from a different interface, 70x25 pixels in size, as follows:
(sample CAPTCHA image)
It is still very simple, but letters have been added to the characters.
Preprocessing is the same as for the digit-only CAPTCHA: normalization -> grayscale -> binarization -> cutting -> labeling. However, tests showed that no matter how I tuned the parameters, the accuracy stayed relatively low. After going through all the characters I found that, although the character images are not skewed or distorted, they come in bold and thin variants, and when labeling I had not strictly kept the numbers of bold and thin samples equal. On top of that, the characters are not centered in the image and are not all the same size: some sit high, some sit low, some are small, some are too large. Even relabeling would make it hard to reach the accuracy I want.
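For reference, the grayscale -> binarization -> cutting steps mentioned above look roughly like the sketch below. This is not the code used for this post: the number of characters and the equal-width slice positions are assumptions and have to be measured from the real 70x25 CAPTCHA.

from PIL import Image
import numpy as np

def cut_captcha(path, n_chars=4):
    # n_chars and the even spacing are assumptions for illustration only.
    img = Image.open(path).convert('L')                  # grayscale
    pix = np.array(img)
    pix = np.where(pix > 180, 255, 0).astype(np.uint8)   # binarization
    h, w = pix.shape
    step = w // n_chars
    # Cut into equal-width slices, one per character.
    return [Image.fromarray(pix[:, i * step:(i + 1) * step]) for i in range(n_chars)]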

For a CAPTCHA like this, where the character edges and boundaries are obvious, we can extract the character itself from each cut-out image, i.e. trim away the blank border, and then resize everything to the same size. That removes the interference from character position and size; as for the bold and thin variants, we just need the same number of training samples for each. The code is as follows:

from PIL import Image
import numpy as np


def img_preprocess(file):
    """Trim the blank border around a character image and resize it to 8x10."""
    img1 = Image.open(file)
    pix = np.array(img1)
    # Binarize: anything brighter than 180 becomes white (255), the rest black (0).
    pix = ((pix > 180) * 255).astype(np.uint8)
    height, width = pix.shape
    # Find the first and last rows that contain a black pixel.
    for i in range(height):
        if np.any(pix[i] == 0):
            xstart = i
            break
    for i in range(height - 1, 0, -1):
        if np.any(pix[i] == 0):
            xend = i + 1
            break
    # Find the first and last columns that contain a black pixel.
    for i in range(width):
        if np.any(pix[:, i] == 0):
            ystart = i
            break
    for i in range(width - 1, 0, -1):
        if np.any(pix[:, i] == 0):
            yend = i + 1
            break
    # Crop to the bounding box of the character, then normalise the size.
    new_pix = pix[xstart:xend, ystart:yend]
    img = Image.fromarray(new_pix).convert('L')
    if img.size != (8, 10):
        img = img.resize((8, 10), resample=Image.NEAREST)
    img.save(file)
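Assuming the cut character images are saved under train/<label>/<file> (the same layout the training code below reads from), the trimming function can be applied in place like this (a usage sketch, not from the original post):

import os

for label in os.listdir('train'):
    for file in os.listdir(f'train/{label}'):
        # Trim the border and resize each labeled character image in place.
        img_preprocess(f'train/{label}/{file}')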

Then we retrain the decision tree on the reprocessed samples and tune the parameters. Let us look at the max_depth parameter first, as follows:

from sklearn.tree import DecisionTreeClassifier
import os
from PIL import Image
import numpy as np
import matplotlib.pyplot as mp


def func(k):
    """Train a tree with max_depth=k and return its accuracy on the test set."""
    # Build the training matrix: each character image becomes a flattened 0/1 vector.
    x = []
    y = []
    for label in os.listdir('train'):
        for file in os.listdir(f'train/{label}'):
            im = Image.open(f'train/{label}/{file}')
            pix = np.array(im)
            pix = (pix > 180) * 1
            pix = pix.ravel()
            x.append(list(pix))
            y.append(label)
    train_x = np.array(x)
    train_y = np.array(y)
    model = DecisionTreeClassifier(max_depth=k)
    model.fit(train_x, train_y)

    # Build the test matrix the same way and score the fitted model on it.
    x = []
    y = []
    for label in os.listdir('test'):
        for file in os.listdir(f'test/{label}'):
            im = Image.open(f'test/{label}/{file}')
            pix = np.array(im)
            pix = (pix > 180) * 1
            pix = pix.ravel()
            x.append(list(pix))
            y.append(label)
    test_x = np.array(x)
    test_y = np.array(y)

    score = model.score(test_x, test_y)
    return score
    

if __name__ == "__main__":
    os.chdir('G:\\knn\\字符验证码\\')
    # Scan max_depth from 1 to 14 and plot the test accuracy for each depth.
    x = list(range(1, 15))
    y = [func(i) for i in x]
    mp.scatter(x, y)
    mp.show()

The result:
(scatter plot of accuracy against max_depth)
You can see that when max_depth = 8 the accuracy is already very close to 1, so we can simply take max_depth = 8. Since the recognition accuracy is close to 1, tuning the other parameters does not seem to matter much here, but that is because this particular CAPTCHA does not overfit easily. In other cases, if the accuracy is close to 1, you would usually go on to tune the randomness parameters (random_state and splitter) and the pruning parameters (min_samples_leaf and so on) to prevent overfitting. I also tried adjusting the other parameters afterwards and found that the model's accuracy barely changed, so I left them at their defaults.
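If you did want to tune several parameters at once instead of scanning max_depth by hand, a cross-validated grid search is one option. The sketch below is my own and uses sklearn's built-in digits dataset as stand-in data rather than the CAPTCHA characters:

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: sklearn's 8x8 digit images instead of the CAPTCHA characters.
x, y = load_digits(return_X_y=True)

# Search depth, leaf size and splitter jointly with 5-fold cross-validation.
param_grid = {
    'max_depth': list(range(4, 13)),
    'min_samples_leaf': [1, 2, 4],
    'splitter': ['best', 'random'],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(x, y)
print(search.best_params_)
print(search.best_score_)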

Training test data set: https://www.lanzous.com/i8joo0f

Finally: I have been learning some machine learning algorithms recently, and I will share the notes I need to keep on this blog and on my WeChat official account (python PATHS); you are welcome to follow it. Normally I share content about crawlers or Python in general.
