Label extraction and data classification in data preprocessing

The .csv file is processed. The first thing to do is to read the table data in the .csv file, using the read_csv function in pandas.

So the question is, what is the return value of reading the file in this way?

Let’s output it:

It is found that it is DataFrame type data, so what exactly is this data type?

After searching for information, the novice tutorial explained this: DataFrame is a tabular data structure, which contains a set of ordered columns, and each column can be a different value type (numeric value, string, Boolean value). DataFrame has both row and column indexes, and it can be viewed as a dictionary composed of Series (shared with one index).

Pandas data structure – DataFrame | Novice Tutorial (runoob.com)

Then the problem comes, I don’t remember the two data structures of Series and dictionary very well, so I will continue to learn.

Data structure - Series:

Pandas Data Structure – Series | Novice Tutorial (runoob.com)

Data Structure - Dictionary: Curly Brace

Python Dictionary (Dictionary) | Newbie Tutorial (runoob.com)

Next, we need to separate the data in the table according to labels, because some labels have too little data and need to expand the label data.

Since the label data of the data set is in the last column, shape[1] in numpy is used to read the number of columns, and shape[0] reads the number of rows, for two-dimensional data. shape[1]-1 is the column index number, because the index number starts from 0.

It is necessary to extract the tags into an unordered set of non-repeating elements. First use the iloc function to extract all the tag columns, remove the index from values, ravel converts the multi-dimensional array into a one-dimensional array, and then use the set function to create an unordered set of non-repeating elements. .

Separating the data by labels is:

The index of the label data group is the index. Add all the data of the label in the new array, and then resave the new data into a .csv file with the DataFrame data type according to the label number.

おすすめ

転載: blog.csdn.net/qq_46012097/article/details/129280474