pandas read_csv() turns the first row of data into column names: how to deal with it

Preface

Sometimes we come across data like this: the first row of the CSV file is not a row of column names, as we would expect, but already data. pandas still treats that row as the header, so the first record goes missing from the data and the column names are meaningless.


Solution 1

Use the csv library and write the file-reading code yourself.

The csv library is part of the Python standard library.

If all the data are strings

In this case the problem is simple: iterate over csv.reader() to read the rows.
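As a minimal sketch of the all-strings case, the snippet below reads rows with csv.reader(). The sample data is hypothetical and is held in an io.StringIO object so the example runs on its own; with a real file you would pass the object returned by open(filename, newline='') instead.

```python
import csv
import io

# Hypothetical sample standing in for a CSV file without a header row.
sample = io.StringIO("alice,paris\nbob,berlin\n")

# csv.reader is an iterator that yields each row as a list of strings.
rows = list(csv.reader(sample))
print(rows)  # [['alice', 'paris'], ['bob', 'berlin']]
```

Every field stays a string, so no row is mistaken for a header.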

If the data contain numbers as well as strings

Here is one solution.

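import csv


def float_test(data: str):
    """Return data as a float if it parses as one, otherwise unchanged."""
    try:
        return float(data)
    except ValueError:
        return data


def read(filename):
    """
    Read a CSV file that has no header row.

    :param filename: path to the CSV file
    :return: (rows, label) where rows is a list of tuples holding every
        column except the last, and label is the last column as a list
    """
    values = []
    # newline='' is what the csv module documentation recommends
    with open(filename, newline='') as f:
        r = csv.reader(f)
        for row in r:
            # Convert numeric fields to float; leave other fields as strings
            values.append(list(map(float_test, row)))
    # Transpose rows into columns, then split off the last column
    *data, label = list(map(list, zip(*values)))
    # Transpose the remaining columns back into rows
    return list(zip(*data)), label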

This comes from an earlier article of mine on a machine learning algorithm, [perceptron algorithm PLA] [read in 5 minutes]. In the code above I needed to read the training data for the perceptron, but the data I was given had no column names and I did not want to modify the data, so I wrapped the reading logic in a function first. In this data every column is a floating-point number except the last, which may not be numeric. That is why I call float_test here to try the conversion.

What are the last two lines, including the return, doing? I want to split off the last column and restore the remaining columns to a two-dimensional matrix in which each row is one sample X.
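To make the transpose-and-split step concrete, here is a small sketch on hypothetical values (two float features followed by a label). The names values, data and label match the function above; X is only an illustrative name for the returned matrix.

```python
# Hypothetical parsed rows: two float features followed by a label.
values = [
    [1.0, 2.0, "yes"],
    [3.0, 4.0, "no"],
    [5.0, 6.0, "yes"],
]

# zip(*values) transposes rows into columns; the starred target collects
# every column except the last, and the last column becomes the label.
*data, label = list(map(list, zip(*values)))
# data  == [[1.0, 3.0, 5.0], [2.0, 4.0, 6.0]]  (feature columns)
# label == ['yes', 'no', 'yes']

# Transposing the feature columns again restores one tuple per sample.
X = list(zip(*data))
print(X)      # [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
print(label)  # ['yes', 'no', 'yes']
```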


Solution 2

Set the parameters!

Refer to the API documentation for read_csv given by pandas:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

It says:

  • header : int or list of ints, default ‘infer’

    • Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file, if column names are passed explicitly then the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.
  • names : array-like, default None

    • List of column names to use. If file contains no header row, then you should explicitly pass header=None. Duplicates in this list will cause a UserWarning to be issued.

In other words, when the file has no header row and you pass the names parameter, the documentation asks you to state this explicitly by passing header=None.
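The difference between the default header='infer' and header=None can be seen in a short sketch. The data here is hypothetical and is passed through io.StringIO, which read_csv accepts in place of a file path.

```python
import io

import pandas as pd

raw = "1.0,2.0,yes\n3.0,4.0,no\n"  # hypothetical data without a header row

# Default header='infer': the first data row is consumed as column names.
inferred = pd.read_csv(io.StringIO(raw))
print(inferred.shape)          # (1, 3)
print(list(inferred.columns))  # ['1.0', '2.0', 'yes']

# header=None: every row is kept as data; columns get integer labels.
no_header = pd.read_csv(io.StringIO(raw), header=None)
print(no_header.shape)          # (2, 3)
print(list(no_header.columns))  # [0, 1, 2]
```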

With that understanding, the correct call is (taking a file named 1.csv as an example):

import pandas as pd

df = pd.read_csv('1.csv', header=None, names=['test'])

The column that had no name is now the test column, and the first row is kept as data.
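The same call extends to files with several columns. The sketch below assumes hypothetical data shaped like the perceptron input from Solution 1 (two features and a label); the column names x1, x2 and label are made up for illustration.

```python
import io

import pandas as pd

raw = "1.0,2.0,yes\n3.0,4.0,no\n"  # hypothetical data without a header row

# One name per column; header=None keeps the first row as data.
df = pd.read_csv(io.StringIO(raw), header=None, names=["x1", "x2", "label"])

# Split features and label, mirroring what read() returns in Solution 1.
X = df.drop(columns="label").values.tolist()
label = df["label"].tolist()
print(X)      # [[1.0, 2.0], [3.0, 4.0]]
print(label)  # ['yes', 'no']
```

Compared with Solution 1, pandas infers the numeric types itself, so no float_test helper is needed.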

Origin blog.csdn.net/u013066730/article/details/108573187