Starfruit Python Machine Learning 4-Making and Loading Dataset Files

My CSDN blog column: https://blog.csdn.net/yty_7

Github address: https://github.com/yot777/

 

Data set file

In the first few lessons, we entered the data set by hand with S = np.array().

That is barely workable for a small amount of data, but it quickly becomes impractical as the amount of data grows.

Instead, we can load the data set from a file, using the Python file-reading method covered in the advanced lecture.

The data set we used in the previous section is as follows:

Sample  Feature x1  Feature x2  Label
1       1           2           1
2       4           5           0
3       2           1           1
4       4           2           1
5       6           1           0
6       3           3           1
7       5           2           0
8       4           5           0
9       2           7           0
10      2           6           1

Generally speaking, a data set file needs neither a header row nor a serial number for each row. It only needs to contain all the feature columns and the label column.

The columns (the features and the label) are generally separated by a tab, a space, or a comma.

Therefore, we can rewrite the data set as the following test.txt data set file, with the columns separated by tabs:


1    2    1
4    5    0
2    1    1
4    2    1
6    1    0
3    3    1
5    2    0
4    5    0
2    7    0
2    6    1
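
A file like this does not have to be typed by hand. The sketch below is one possible way to generate it from Python, assuming tab-separated columns (the separator the loader in the next section splits on). The helper name write_dataset and the variable rows are made up for illustration and are not part of the original lesson.

```python
# Rows copied from the table above: feature x1, feature x2, label.
rows = [
    (1, 2, 1), (4, 5, 0), (2, 1, 1), (4, 2, 1), (6, 1, 0),
    (3, 3, 1), (5, 2, 0), (4, 5, 0), (2, 7, 0), (2, 6, 1),
]


def write_dataset(file_name, rows, sep='\t'):
    """Write one sample per line, columns joined by sep (hypothetical helper)."""
    with open(file_name, 'w') as f:
        for row in rows:
            f.write(sep.join(str(value) for value in row) + '\n')


if __name__ == '__main__':
    # Creates test.txt in the current directory
    write_dataset('test.txt', rows)
```

Passing sep=',' or sep=' ' instead would produce a comma-separated or space-separated file; the loader then has to split on the same character.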

 

Data set file loading

We can create a loadDataSet() function to load the data set file.

This function needs to do the following:

(1) Open the data set file. (The Python code and the data set file need to be in the same directory; if they are not, specify the full path of the data set file.)

(2) Traverse all lines of the file, with i running from the first line to the last

(3) Remove the carriage return and line feed characters at the end of each line

(4) Split each line into elements at the separator used in the data set file

(5) Add every element of the i-th line except the last to the i-th row of the feature matrix

(6) Add the last element of the i-th line as the i-th element of the label vector

(7) Close the file

The code is as follows:

def loadDataSet(fileName):
    # Create an empty feature matrix
    featureMat = []
    # Create an empty label vector
    labelMat = []
    # Open the file
    fr = open(fileName)
    # Read the file line by line
    for line in fr.readlines():
        # Strip the trailing newline from the line, then split it into elements using Tab as the separator
        lineArr = line.strip().split('\t')
        print("Current line:", lineArr)
        # Append the 0th and 1st elements of lineArr (the current line) to the feature matrix featureMat
        # featureMat is really a two-dimensional list, so appending to it differs slightly from a one-dimensional list
        featureMat.append([lineArr[0], lineArr[1]])
        print("Current feature matrix featureMat:", featureMat)
        # Append the last element of lineArr (the current line) to the label vector labelMat
        labelMat.append(lineArr[-1])
        print("Current label vector labelMat:", labelMat)
        # The current line is now in featureMat and labelMat; move on to the next line
    # Close the file once all lines have been read
    fr.close()
    # loadDataSet() returns the feature matrix featureMat and the label vector labelMat
    return featureMat, labelMat

if __name__ == '__main__':
    # Call loadDataSet()
    X, y = loadDataSet('test.txt')
    print("Final feature matrix X:", X)
    print("Final label vector y:", y)

Please study the above code carefully. The loadDataSet() function implements all 7 steps listed above.

To make the code easier to follow, many print() calls have been added to show the results of the intermediate steps. In real code, these print() calls are unnecessary.

The results are as follows:

Current line: ['1', '2', '1']
Current feature matrix featureMat: [['1', '2']]
Current label vector labelMat: ['1']
Current line: ['4', '5', '0']
Current feature matrix featureMat: [['1', '2'], ['4', '5']]
Current label vector labelMat: ['1', '0']
Current line: ['2', '1', '1']
Current feature matrix featureMat: [['1', '2'], ['4', '5'], ['2', '1']]
Current label vector labelMat: ['1', '0', '1']
Current line: ['4', '2', '1']
Current feature matrix featureMat: [['1', '2'], ['4', '5'], ['2', '1'], ['4', '2']]
Current label vector labelMat: ['1', '0', '1', '1']
Current line: ['6', '1', '0']
Current feature matrix featureMat: [['1', '2'], ['4', '5'], ['2', '1'], ['4', '2'], ['6', '1']]
Current label vector labelMat: ['1', '0', '1', '1', '0']
Current line: ['3', '3', '1']
Current feature matrix featureMat: [['1', '2'], ['4', '5'], ['2', '1'], ['4', '2'], ['6', '1'], ['3', '3']]
Current label vector labelMat: ['1', '0', '1', '1', '0', '1']
Current line: ['5', '2', '0']
Current feature matrix featureMat: [['1', '2'], ['4', '5'], ['2', '1'], ['4', '2'], ['6', '1'], ['3', '3'], ['5', '2']]
Current label vector labelMat: ['1', '0', '1', '1', '0', '1', '0']
Current line: ['4', '5', '0']
Current feature matrix featureMat: [['1', '2'], ['4', '5'], ['2', '1'], ['4', '2'], ['6', '1'], ['3', '3'], ['5', '2'], ['4', '5']]
Current label vector labelMat: ['1', '0', '1', '1', '0', '1', '0', '0']
Current line: ['2', '7', '0']
Current feature matrix featureMat: [['1', '2'], ['4', '5'], ['2', '1'], ['4', '2'], ['6', '1'], ['3', '3'], ['5', '2'], ['4', '5'], ['2', '7']]
Current label vector labelMat: ['1', '0', '1', '1', '0', '1', '0', '0', '0']
Current line: ['2', '6', '1']
Current feature matrix featureMat: [['1', '2'], ['4', '5'], ['2', '1'], ['4', '2'], ['6', '1'], ['3', '3'], ['5', '2'], ['4', '5'], ['2', '7'], ['2', '6']]
Current label vector labelMat: ['1', '0', '1', '1', '0', '1', '0', '0', '0', '1']
Final feature matrix X: [['1', '2'], ['4', '5'], ['2', '1'], ['4', '2'], ['6', '1'], ['3', '3'], ['5', '2'], ['4', '5'], ['2', '7'], ['2', '6']]
Final label vector y: ['1', '0', '1', '1', '0', '1', '0', '0', '0', '1']
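
As the quotes in the output show, every value comes back as a string, because split() returns strings. Numerical code usually needs numbers, so a conversion step is likely to be wanted before the data is used. The following is only a sketch of one way to do that; the helper name to_numeric is hypothetical, and treating features as float and labels as int is an assumption that fits this particular data set.

```python
def to_numeric(featureMat, labelMat):
    """Convert string features to float and string labels to int (hypothetical helper)."""
    X_num = [[float(value) for value in row] for row in featureMat]
    y_num = [int(label) for label in labelMat]
    return X_num, y_num


# The first three samples, in the form loadDataSet() returns them
X = [['1', '2'], ['4', '5'], ['2', '1']]
y = ['1', '0', '1']

X_num, y_num = to_numeric(X, y)
print(X_num)  # [[1.0, 2.0], [4.0, 5.0], [2.0, 1.0]]
print(y_num)  # [1, 0, 1]
```

The converted nested lists can then be passed to np.array(), as in the earlier lessons.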

With that, we have finished making and loading the data set file.

Using the method from the previous section, we can then easily obtain:

Training set features X_train, training set labels y_train

Test set features X_test, test set labels y_test

The code is as follows:

def loadDataSet(fileName):
    # Create an empty feature matrix
    featureMat = []
    # Create an empty label vector
    labelMat = []
    # Open the file
    fr = open(fileName)
    # Read the file line by line
    for line in fr.readlines():
        # Strip the trailing newline from the line, then split it into elements using Tab as the separator
        lineArr = line.strip().split('\t')
        # Append the 0th and 1st elements of lineArr (the current line) to the feature matrix featureMat
        # featureMat is really a two-dimensional list, so appending to it differs slightly from a one-dimensional list
        featureMat.append([lineArr[0], lineArr[1]])
        # Append the last element of lineArr (the current line) to the label vector labelMat
        labelMat.append(lineArr[-1])
        # The current line is now in featureMat and labelMat; move on to the next line
    # Close the file once all lines have been read
    fr.close()
    # loadDataSet() returns the feature matrix featureMat and the label vector labelMat
    return featureMat, labelMat

if __name__ == '__main__':
    # Call loadDataSet()
    X, y = loadDataSet('test.txt')
    # 80% of the data set is the training set, 20% is the test set
    X_train = X[:8]
    print('Training set feature matrix X_train:', X_train)
    y_train = y[:8]
    print('Training set labels y_train:', y_train)
    X_test = X[8:]
    print('Test set features X_test:', X_test)
    y_test = y[8:]
    print('Test set labels y_test:', y_test)

The results are as follows:

Training set feature matrix X_train: [['1', '2'], ['4', '5'], ['2', '1'], ['4', '2'], ['6', '1'], ['3', '3'], ['5', '2'], ['4', '5']]
Training set labels y_train: ['1', '0', '1', '1', '0', '1', '0', '0']
Test set features X_test: [['2', '7'], ['2', '6']]
Test set labels y_test: ['0', '1']
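
The split index 8 above is hard-coded for a file of 10 samples. For files of other sizes, the index can be computed from the ratio instead. The following is a minimal sketch under that assumption; split_dataset and train_ratio are hypothetical names, and the split is taken in file order, as in the code above.

```python
def split_dataset(X, y, train_ratio=0.8):
    """Split X and y in order; the first train_ratio share becomes the training set (hypothetical helper)."""
    # int() truncates, so any fractional sample goes to the test set
    split_index = int(len(X) * train_ratio)
    return X[:split_index], y[:split_index], X[split_index:], y[split_index:]


X = [['1', '2'], ['4', '5'], ['2', '1'], ['4', '2'], ['6', '1'],
     ['3', '3'], ['5', '2'], ['4', '5'], ['2', '7'], ['2', '6']]
y = ['1', '0', '1', '1', '0', '1', '0', '0', '0', '1']

X_train, y_train, X_test, y_test = split_dataset(X, y)
print(X_test)  # [['2', '7'], ['2', '6']]
print(y_test)  # ['0', '1']
```

With the default ratio of 0.8 and 10 samples, this reproduces the 8/2 split shown above.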

Summary

A data set file needs neither a header row nor a serial number for each row. It only needs to contain all the feature columns and the label column.

The columns (the features and the label) are generally separated by a tab, a space, or a comma.

You can create a loadDataSet() function to load the data set file. This function needs to do the following:

(1) Open the data set file. (The Python code and the data set file need to be in the same directory; if they are not, specify the full path of the data set file.)

(2) Traverse all lines of the file, with i running from the first line to the last

(3) Remove the carriage return and line feed characters at the end of each line

(4) Split each line into elements at the separator used in the data set file

(5) Add every element of the i-th line except the last to the i-th row of the feature matrix

(6) Add the last element of the i-th line as the i-th element of the label vector

(7) Close the file
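
The seven steps can also be written more generally. The sketch below is a variant, not the lesson's own function: it takes the separator as a parameter, treats every column except the last as a feature (so it is not limited to two features), skips blank lines, and uses a with-statement so the file is closed automatically. The names load_dataset and sep are assumptions made for this illustration.

```python
import os
import tempfile


def load_dataset(file_name, sep='\t'):
    """Load features and labels from a delimited text file (hypothetical variant of loadDataSet)."""
    feature_mat = []
    label_mat = []
    # Steps 1 and 7: the with-statement opens the file and closes it automatically
    with open(file_name) as f:
        # Step 2: traverse every line of the file
        for line in f:
            # Step 3: remove the line ending
            line = line.strip()
            if not line:
                continue  # skip blank lines (an addition, not in the original code)
            # Step 4: split the line at the separator
            line_arr = line.split(sep)
            # Step 5: every element except the last is a feature
            feature_mat.append(line_arr[:-1])
            # Step 6: the last element is the label
            label_mat.append(line_arr[-1])
    return feature_mat, label_mat


# Demo on a small comma-separated temporary file that contains one blank line
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('1,2,1\n4,5,0\n\n2,1,1\n')
X, y = load_dataset(tmp.name, sep=',')
os.remove(tmp.name)

print(X)  # [['1', '2'], ['4', '5'], ['2', '1']]
print(y)  # ['1', '0', '1']
```

For the tab-separated test.txt used in this lesson, the call would be load_dataset('test.txt') with the default separator.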

 


Origin blog.csdn.net/yty_7/article/details/105162669