Python and Data Science Experiments (Exp9)

Experiment: Multi-class handwritten digit recognition

1. Experimental data

(1) Training set

The training data consists of 42,000 grayscale images (resolution 28*28), provided in the train_data.csv file. The images cover the 10 handwritten digits 0-9. An example image is shown here:

The first 10 rows of the train_data.csv file are shown in the figure.

(Figure: the first 10 rows of data in the training set file. The label column gives the value of the digit, and pixel0 to pixel783 are the gray values of the pixels.)
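To make the column layout concrete, here is a minimal sketch of how one row of 784 pixel values could be turned back into a 28*28 image. It is not the render.py routine from the appendix. The helper names `row_to_image` and `pixel_position` are hypothetical, and the row-by-row (row-major) pixel order is an assumption that the text does not state.

```python
import numpy as np

IMAGE_SIDE = 28  # resolution given in the text: 28*28 = 784 pixels


def row_to_image(pixel_row, side=IMAGE_SIDE):
    """Reshape a flat vector of side*side gray values into a 2-D image.

    Assumption: pixels are stored row by row (row-major order).
    """
    pixel_row = np.asarray(pixel_row)
    if pixel_row.size != side * side:
        raise ValueError(f"expected {side * side} pixels, got {pixel_row.size}")
    return pixel_row.reshape(side, side)


def pixel_position(index, side=IMAGE_SIDE):
    """Map a column index i (as in pixel<i>) to (row, column) under the same assumption."""
    return divmod(index, side)


# Stand-in data: the values 0..783 instead of a real image row.
demo_row = np.arange(IMAGE_SIDE * IMAGE_SIDE)
demo_image = row_to_image(demo_row)
```

With a real row from the file, the resulting array could be passed to a plotting function such as matplotlib's `imshow` to check that the digit looks right.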

The original data is in CSV format. Each row is one image: the first column is the digit value (the label), and the remaining columns are the gray values of the image's pixels. Note that for recognition problems in general, it may be necessary to standardize the gray value range across images, for example by giving every image the same gray value range. The data provided here has not had this step applied.
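The following is a minimal sketch of reading data in this layout and scaling each image to a common gray value range. The function names `load_labeled_csv` and `normalize_per_image` are hypothetical, the tiny in-memory CSV with 4 pixel columns is a stand-in for train_data.csv, and per-image min-max scaling to [0, 1] is only one possible way to standardize the range.

```python
import io

import numpy as np


def load_labeled_csv(source):
    """Read a CSV whose first column is the label and whose other columns are pixels."""
    table = np.loadtxt(source, delimiter=",", skiprows=1)  # skip the header row
    table = np.atleast_2d(table)  # keep a 2-D shape even for a single data row
    labels = table[:, 0].astype(int)
    pixels = table[:, 1:].astype(np.float64)
    return labels, pixels


def normalize_per_image(pixels):
    """Min-max scale every row (image) to the range [0, 1]."""
    lowest = pixels.min(axis=1, keepdims=True)
    highest = pixels.max(axis=1, keepdims=True)
    # A completely flat image has highest == lowest; divide by 1 to avoid 0/0.
    span = np.where(highest > lowest, highest - lowest, 1.0)
    return (pixels - lowest) / span


# Stand-in for train_data.csv: same column layout, but only 4 pixels per image.
demo_csv = io.StringIO(
    "label,pixel0,pixel1,pixel2,pixel3\n"
    "7,0,50,100,200\n"
    "3,10,10,10,10\n"
)
demo_labels, demo_pixels = load_labeled_csv(demo_csv)
demo_scaled = normalize_per_image(demo_pixels)
```

For the real file, a path such as "train_data.csv" could be passed in place of the in-memory buffer. Whether per-image scaling helps compared with simply dividing by a fixed maximum is something to check on a validation split.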

(2) Test set

There are another 1,000 test images with unknown labels (also 28*28 resolution), saved in the "test_data.csv" file. Each row contains the gray values of one image, and the digit shown in each image is to be identified by your model.

2. Purpose of the experiment

(1) Design the feature vectors of the samples and develop basic feature engineering skills. For example, consider applying dimensionality reduction (PCA, etc.) or other processing to the gray values of the image pixels;

(2) Use a machine learning classification algorithm to train a classifier for handwritten digit recognition on the training set;

(3) Apply the trained classifier to the test set and give the classification results for all samples with unknown labels.
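The three goals above can be sketched end to end as follows. This is only an illustration under stated assumptions: scikit-learn's bundled 8*8 digits dataset (`load_digits`) stands in for train_data.csv so the example runs without the course files, and the choices of PCA keeping 95% of the variance and a 3-nearest-neighbour classifier are arbitrary starting points, not the required method.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Stand-in data: 1,797 images of 8*8 pixels with gray values 0..16.
X, y = load_digits(return_X_y=True)
X = X / 16.0  # scale gray values to [0, 1]

# Hold out part of the labeled data to estimate accuracy before touching the test set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Feature engineering (PCA) and the classifier are chained so that the PCA
# projection is fitted on the training part only.
model = make_pipeline(
    PCA(n_components=0.95, svd_solver="full"),  # keep 95% of the variance
    KNeighborsClassifier(n_neighbors=3),
)
model.fit(X_train, y_train)

val_accuracy = model.score(X_val, y_val)  # fraction of correct validation predictions
val_predictions = model.predict(X_val)    # same call would be used on the unlabeled test set
```

For the real task, the same fitted pipeline would be applied to the rows of test_data.csv, after the same gray value scaling that was used for training.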

3. Experiment ideas

(1) A visualization routine, render.py, is provided in the appendix (to test it, place the program and the data file train_data.csv in the same folder). You can learn how to read the data from this program.

(2) The machine learning algorithm is not restricted. The goal is the best possible prediction performance: the higher the accuracy, the better. You can try ensembles of multiple learning models.

(3) Apply feature-engineering preprocessing, such as data transformation and dimensionality reduction, to the gray values of the given images; the implementation method is not restricted.

(4) Study multi-class classifiers on your own, such as KNN, GNB, Logistic Regression, decision trees, and the SVC class in svm (from sklearn.svm import SVC).
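One possible way to compare the classifiers listed above, and to try combining several of them, is sketched below. The assumptions are the same as in any stand-alone example: scikit-learn's bundled 8*8 digits dataset replaces train_data.csv, all hyperparameters are defaults or arbitrary picks, and hard (majority) voting is just one of several ensemble strategies.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: bundled 8*8 digit images, gray values scaled to [0, 1].
X, y = load_digits(return_X_y=True)
X = X / 16.0

candidates = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "GNB": GaussianNB(),
    "LogisticRegression": LogisticRegression(max_iter=2000),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "SVC": SVC(),
}

# A simple ensemble: majority vote over three of the candidates.
candidates["Voting"] = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=3)),
        ("lr", LogisticRegression(max_iter=2000)),
        ("svc", SVC()),
    ],
    voting="hard",
)

# Mean 3-fold cross-validated accuracy for every candidate.
cv_scores = {
    name: cross_val_score(estimator, X, y, cv=3, scoring="accuracy").mean()
    for name, estimator in candidates.items()
}

for name, score in sorted(cv_scores.items(), key=lambda item: -item[1]):
    print(f"{name:20s} {score:.3f}")
```

The ranking on this small stand-in dataset will not necessarily carry over to the 28*28 course data, so the comparison should be rerun there before choosing a model.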

4. Experimental requirements

(1) Save the prediction results in a text file named "preds.txt". The file must contain 1,000 lines, each holding a single digit from 0-9 that is your algorithm's prediction for the corresponding test sample. The order of the predictions must match the order of the samples in the test set "test_data.csv".

(2) Package the result file "preds.txt" together with the code and submit it to Xuexuetong as an attachment. No experiment report file is required.

(3) The experimental results are evaluated competitively. Since this experiment is a multi-class problem, we will compute the accuracy of each student's predictions and then rank the results from high to low.

Note: Accuracy refers to the ratio of the number of correctly classified test samples to the total number of test samples.
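The output format and the accuracy definition above can be sketched in a few lines of plain Python. The helper names `save_predictions` and `accuracy` are hypothetical, and the example writes a short five-line file to a temporary folder rather than a real 1,000-line "preds.txt".

```python
import os
import tempfile


def save_predictions(predictions, path):
    """Write one digit (0-9) per line, in the same order as the test samples."""
    with open(path, "w", encoding="utf-8") as handle:
        for value in predictions:
            digit = int(value)
            if not 0 <= digit <= 9:
                raise ValueError(f"prediction out of range: {value!r}")
            handle.write(f"{digit}\n")


def accuracy(true_labels, predicted_labels):
    """Correctly classified samples divided by the total number of samples."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if len(true_labels) == 0:
        raise ValueError("cannot compute accuracy of an empty list")
    correct = sum(int(t) == int(p) for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Demo with made-up predictions; the real file would hold 1,000 lines.
demo_path = os.path.join(tempfile.mkdtemp(), "preds.txt")
save_predictions([3, 1, 4, 1, 5], demo_path)
with open(demo_path, encoding="utf-8") as handle:
    demo_lines = handle.read().splitlines()

# Four of the five made-up predictions match the made-up labels.
demo_accuracy = accuracy([3, 1, 4, 1, 9], [3, 1, 4, 1, 5])
```

Since the test labels are unknown to students, a function like `accuracy` is only useful on a held-out part of the training set. Checking that the written file has exactly 1,000 lines before submitting is a cheap safeguard.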

Origin blog.csdn.net/qq_51314244/article/details/130690515