A quick start | common pandas data-processing operations after reading a table

This blog post grew out of my own data mining course homework. Completing the assignments set by the teacher gave me a deeper understanding of how to read and manipulate tabular data with Python pandas, so here is a summary.

This article summarizes some common data-processing operations that pandas can perform after reading a table. For more detailed parameter descriptions, consult the official documentation.

Read data

1. Read the first 10 rows of data

header: The row to use as the column names. The default is 0, meaning the first row supplies the column names and the data starts on the row below it; if the data has no column-name row, set header=None.
sep: The separator. This is a read_csv parameter whose default is a comma; read_excel does not need it, and recent pandas versions reject it there, so it is not passed in the call here.
nrows: The number of rows to read (counted from the beginning of the file).

import pandas

tabledata = pandas.read_excel("./hotel.xlsx", header=None, nrows=10)

print(tabledata)
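To see how header and nrows interact, here is a minimal sketch. It is an illustration under stated assumptions, not the homework code: the inline CSV text and its column names are made up, and read_csv is used with io.StringIO so the example runs without hotel.xlsx. The header and nrows parameters have the same meaning in read_excel.

```python
import io
import pandas

# Hypothetical inline data standing in for the hotel table.
csv_text = "name,type,score\nA,business,4.5\nB,leisure,4.0\nC,business,3.8\n"

# header=0 (the default): the first line supplies the column names,
# and nrows=2 keeps the two data lines that follow it.
with_header = pandas.read_csv(io.StringIO(csv_text), sep=",", header=0, nrows=2)

# header=None: the first line is treated as data, the columns are
# numbered 0, 1, 2, and nrows=2 now counts the old header line too.
no_header = pandas.read_csv(io.StringIO(csv_text), sep=",", header=None, nrows=2)

print(with_header)
print(no_header)
```

With header=None the column-name line becomes the first data row, which is why the first example in this section prints the names as row 0.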

2. Redefine the column names of the data being read

names: A list of column names to use for the result. If the data file has no header row, you also need to set header=None; with header=0, the existing header row is replaced by names.

import pandas

name_columns = [' ','名字','类型', '城市', '地区', '地点', '评分', '评分人数', '价格']

tabledata = pandas.read_excel("./hotel.xlsx", header=0, names=name_columns, nrows=10)

print(tabledata)
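The difference between header=0 and header=None when names is given can be sketched as follows. The inline data and the new_names list are hypothetical, and read_csv stands in for read_excel so that no file is needed.

```python
import io
import pandas

# Hypothetical data whose header line uses short names we want to replace.
csv_text = "n,t,s\nA,business,4.5\nB,leisure,4.0\n"
new_names = ["name", "type", "score"]

# header=0 with names: the original header line is dropped and replaced.
replaced = pandas.read_csv(io.StringIO(csv_text), header=0, names=new_names)

# header=None with names: every line is kept as data, so the old header
# line shows up as the first row.
kept = pandas.read_csv(io.StringIO(csv_text), header=None, names=new_names)

print(replaced)
print(kept)
```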

3. Extract all rows with a specified value in a column

Here we can do a simple traversal. Values are read by position with iloc; older pandas versions offered ix for this, but it was removed in pandas 1.0. For more detailed instructions, please refer to my other blog post, "ix | pandas reads the row and column value change operation after reading the table".

import pandas

name_columns = [' ','名字','类型', '城市', '地区', '地点', '评分', '评分人数', '价格']

tabledata = pandas.read_excel("./hotel.xlsx", header=0, names=name_columns)

hotel_name_list = []

# Column 2 is '类型' (type) and column 1 is '名字' (name).
for i in range(len(tabledata)):
    if tabledata.iloc[i, 2] == "商务出行":
        hotel_name_list.append(tabledata.iloc[i, 1])

print(hotel_name_list)
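The same selection can also be written without an explicit loop by using a boolean mask. This is a sketch of an alternative, not the approach from the homework; the three-row DataFrame is made-up sample data that borrows the article's column names.

```python
import pandas

# Hypothetical sample rows using the article's column names.
tabledata = pandas.DataFrame({
    "名字": ["酒店A", "酒店B", "酒店C"],
    "类型": ["商务出行", "休闲度假", "商务出行"],
})

# Comparing a column with a value yields one True/False per row.
mask = tabledata["类型"] == "商务出行"

# loc keeps the rows where the mask is True and selects the name column.
hotel_name_list = tabledata.loc[mask, "名字"].tolist()

print(hotel_name_list)
```

The mask version does not depend on the number of rows or on column positions, so it keeps working if the table grows or the columns are reordered.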

4. Extract the rows with missing values in a certain column

Missing values start to appear here, so let me mention two parameters related to them.

na_values: By default, the strings '-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'NA', '#NA', 'NULL', 'NaN', '-NaN', 'nan', '-nan' and '' are converted to NaN. The na_values parameter lets you define additional values that should be treated as missing.

Original explanation:

na_values : scalar, str, list-like, or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: '-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'NA', '#NA', 'NULL', 'NaN', '-NaN', 'nan', '-nan', ''

keep_default_na: bool, default True. Decides whether the default NaN strings listed above are included when parsing.

import pandas

name_columns = [' ','名字','类型', '城市', '地区', '地点', '评分', '评分人数', '价格']

tabledata = pandas.read_excel("./hotel.xlsx", header=0, names=name_columns)

tableline = tabledata[tabledata['类型'].isnull()]

print(tableline)
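A small sketch of how na_values and keep_default_na change what counts as missing. The inline data and the placeholder string "unknown" are assumptions made for illustration, and read_csv is used so the example runs without a file; read_excel accepts the same two parameters.

```python
import io
import pandas

# Hypothetical data: one cell uses the default marker 'NA',
# another uses a custom placeholder 'unknown'.
csv_text = "name,type\nA,NA\nB,unknown\nC,business\n"

# Defaults only: 'NA' becomes NaN, 'unknown' stays an ordinary string.
default = pandas.read_csv(io.StringIO(csv_text))

# na_values adds 'unknown' to the default list, so both become NaN.
extra = pandas.read_csv(io.StringIO(csv_text), na_values=["unknown"])

# keep_default_na=False drops the default list, so only 'unknown' is NaN
# and 'NA' is kept as a string.
strict = pandas.read_csv(
    io.StringIO(csv_text), na_values=["unknown"], keep_default_na=False
)

print(default["type"].isnull().tolist())
print(extra["type"].isnull().tolist())
print(strict["type"].isnull().tolist())
```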

Change the data

5. Fill the missing values of only one column

The fillna function is used to replace missing values. Common parameters are as follows:

  • value: the value used to fill in the missing values
  • axis: the axis along which to fill, by row or by column
  • limit: the maximum number of values to fill, int type

limit and axis are usually combined to control how many values are filled and in which direction.

For our needs, the easiest way is to take out the column that needs to be modified, fill it, and then assign it back to the original data.

import pandas

name_columns = [' ','名字','类型', '城市', '地区', '地点', '评分', '评分人数', '价格']

tabledata = pandas.read_excel("./hotel.xlsx", header=0, names=name_columns)

tableline = tabledata['类型'].fillna(value='其他')

tabledata['类型'] = tableline

print(tabledata)
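Two variations on fillna, sketched on made-up data: passing a dict as value restricts the fill to the named columns in a single call, and limit caps how many missing values are filled. The sample rows and the fill value 0.0 are assumptions for illustration only.

```python
import pandas

nan = float("nan")

# Hypothetical sample with missing values in both columns.
df = pandas.DataFrame({
    "类型": ["商务出行", None, None, "休闲度假", None],
    "评分": [4.5, nan, nan, 4.0, nan],
})

# A dict maps column name -> fill value, so only '类型' is filled
# and the missing values in '评分' are left alone.
only_type = df.fillna(value={"类型": "其他"})

# limit=2 fills at most two missing values, counted from the top.
limited = df["评分"].fillna(value=0.0, limit=2)

print(only_type)
print(limited)
```

fillna returns a new object in both cases, so df itself is unchanged unless the result is assigned back.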

6. Modify a column, replacing missing values with the average

The idea is basically the same as above; the difference is that we first need the average value of the column. Missing values must not take part in the average, so we first take out all rows where the column is not missing, then select the column, and get the average directly with the mean function.

Functions used in the same way:

  • mean(): mean
  • median(): median
  • max(): maximum
  • min(): minimum
  • sum(): sum
  • std(): standard deviation
  • Series-only methods: argmax() gives the position of the maximum value, argmin() the position of the minimum value

import pandas

name_columns = [' ','名字','类型', '城市', '地区', '地点', '评分', '评分人数', '价格']

tabledata = pandas.read_excel("./hotel.xlsx", header=0, names=name_columns)

tableline = tabledata[tabledata['评分'].notnull()]

score_avg = tableline['评分'].mean()

tableline = tabledata['评分'].fillna(value=score_avg)

tabledata['评分'] = tableline

print(tabledata)
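A compact sketch of the same idea on a made-up Series of scores. It relies on the fact that mean() and the other statistics listed in this section skip NaN by default, so filtering the rows first is optional; the sample values are hypothetical.

```python
import pandas

# Hypothetical scores with one missing value.
scores = pandas.Series([4.0, None, 5.0, 3.0], name="评分")

# NaN is skipped by default: (4.0 + 5.0 + 3.0) / 3
score_avg = scores.mean()

# Replace the missing score with the average.
filled = scores.fillna(value=score_avg)

# The other statistics are called the same way.
print(score_avg, scores.median(), scores.max(), scores.min(), scores.sum(), scores.std())

# argmax / argmin return integer positions within the Series.
print(filled.argmax(), filled.argmin())
```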

Origin blog.csdn.net/wy_97/article/details/104787074