This post grew out of the homework for my data mining course. Working through the assignments gave me a much better understanding of how to read and manipulate tabular data with Python's pandas library, so here is a summary.
This article covers reading tables with pandas and performing some common data-processing operations. For the full list of parameters, refer to the official pandas documentation.
Read data
1. Read 10 rows of data
header
: The row to use as the column names. The default is 0, meaning the first row supplies the column names and the data starts on the row below it. If the file has no column-name row, set header=None.
sep
: The separator between fields. This is a parameter of pandas.read_csv for delimited text files, where it defaults to a comma. pandas.read_excel has no sep parameter, because Excel cells are already separated, so it is not passed to read_excel here.
nrows
: The number of rows to read, counted from the beginning of the file.
import pandas

# header=None: the first row of the sheet is read as data, not as column names
tabledata = pandas.read_excel("./hotel.xlsx", header=None, nrows=10)
print(tabledata)
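As a small sketch of how header, sep and nrows interact, the snippet below uses pandas.read_csv on an in-memory string instead of hotel.xlsx. The sample data and the names sample_text, with_header and without_header are made up for illustration; read_csv is used because sep only applies to delimited text.

```python
import io
import pandas

# Made-up sample data standing in for a real file; ';' is the separator.
sample_text = "name;city;price\nA;Beijing;300\nB;Shanghai;450\nC;Guangzhou;280\n"

# header=0: the first line becomes the column names; nrows=2 keeps two data rows.
with_header = pandas.read_csv(io.StringIO(sample_text), sep=";", header=0, nrows=2)
print(with_header)

# header=None: no line is treated as column names, so the columns are numbered
# 0, 1, 2 and the original header line is read as an ordinary data row.
without_header = pandas.read_csv(io.StringIO(sample_text), sep=";", header=None, nrows=2)
print(without_header)
```

With header=None the header line counts towards nrows, which is why only one real data row shows up in the second result.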
2. Redefine the column name for the read data
names
: A list of column names to use for the result. If the file has no header row, also pass header=None; if it does have one, pass header=0 so that the existing header row is replaced by the new names.
import pandas

name_columns = [' ','名字','类型', '城市', '地区', '地点', '评分', '评分人数', '价格']
# header=0 marks the first row as the old header, which names then replaces
tabledata = pandas.read_excel("./hotel.xlsx", header=0, names=name_columns, nrows=10)
print(tabledata)
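The effect of combining header=0 with names can be checked without the Excel file. This is a minimal sketch on made-up CSV text; new_names and renamed are hypothetical names chosen for the example.

```python
import io
import pandas

# Made-up data with an existing header line.
sample_text = "名字,类型,评分\n酒店A,商务出行,4.5\n酒店B,休闲度假,4.1\n"
new_names = ["name", "type", "score"]  # hypothetical replacement names

# header=0 tells pandas the first line is an existing header to discard,
# and names supplies the labels to use instead.
renamed = pandas.read_csv(io.StringIO(sample_text), header=0, names=new_names)
print(renamed.columns.tolist())
print(renamed)
```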
3. Take out all the data with a specified value in a column
Here a simple traversal is enough. Values are fetched by position with iloc (older tutorials use ix, which was deprecated in pandas 0.20 and removed in 1.0). For a more detailed explanation, see my other post, "ix | changing row and column values after reading a table with pandas".
import pandas

name_columns = [' ','名字','类型', '城市', '地区', '地点', '评分', '评分人数', '价格']
tabledata = pandas.read_excel("./hotel.xlsx", header=0, names=name_columns)
hotel_name_list = []
# walk every row; column 2 is '类型' (type) and column 1 is '名字' (name)
for i in range(len(tabledata)):
    if tabledata.iloc[i, 2] == "商务出行":
        hotel_name_list.append(tabledata.iloc[i, 1])
print(hotel_name_list)
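The same selection can also be written without an explicit loop by using a boolean mask. The sketch below runs on a small made-up DataFrame (hotels and business_names are illustrative names, not part of the homework data) and is offered as an alternative, not as the method used above.

```python
import pandas

# Made-up rows that reuse two of the article's column names.
hotels = pandas.DataFrame({
    "名字": ["酒店A", "酒店B", "酒店C"],
    "类型": ["商务出行", "休闲度假", "商务出行"],
})

# The comparison yields a boolean Series; .loc keeps the matching rows
# and selects the name column from them.
business_names = hotels.loc[hotels["类型"] == "商务出行", "名字"].tolist()
print(business_names)
```

This form does not need the row count, so it keeps working if the table grows or shrinks.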
4. Take out the data with missing values in a certain column
This is where missing values first come up, so here are two related parameters.
na_values
: By default, the strings '-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A','#N/A', 'N/A', 'NA', '#NA', 'NULL', 'NaN', '-NaN', 'nan', '-nan', '' are converted to NaN when the file is read (the exact list can vary between pandas versions). The na_values parameter lets you name additional values that should be treated as missing.
Original explanation:
na_values
: scalar, str, list-like, or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: '-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A','#N/A', 'N/A', 'NA', '#NA', 'NULL', 'NaN', '-NaN', 'nan', '-nan', ''
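keep_default_na
: bool. Whether the default list of NaN strings above is applied when parsing. If it is set to False, those strings are kept as they are and only the values given in na_values are treated as missing.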
import pandas

name_columns = [' ','名字','类型', '城市', '地区', '地点', '评分', '评分人数', '价格']
tabledata = pandas.read_excel("./hotel.xlsx", header=0, names=name_columns)
# isnull() marks the rows whose '类型' (type) value is missing
tableline = tabledata[tabledata['类型'].isnull()]
print(tableline)
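To see what na_values and keep_default_na change, here is a minimal sketch on made-up CSV text. The marker "unknown" and the variable names are assumptions for the example; read_csv is used so that no Excel file is needed.

```python
import io
import pandas

# Made-up data: "NA" is in the default missing-value list, "unknown" is not.
sample_text = "name,type\nA,NA\nB,unknown\nC,business\n"

# Default behaviour: "NA" is recognised as missing, "unknown" is kept.
default_df = pandas.read_csv(io.StringIO(sample_text))

# na_values adds "unknown" to the strings treated as missing.
extra_df = pandas.read_csv(io.StringIO(sample_text), na_values=["unknown"])

# keep_default_na=False switches the built-in list off, so "NA" stays a string.
raw_df = pandas.read_csv(io.StringIO(sample_text), keep_default_na=False)

print(default_df["type"].isnull().tolist())
print(extra_df["type"].isnull().tolist())
print(raw_df["type"].isnull().tolist())
```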
Change the data
5. Fill the missing values of one column only
fillna
: The function used to replace missing values. Its common parameters are:
value
: The value used to fill in the missing entries.
axis
: The axis along which to fill, that is, by row or by column.
limit
: The maximum number of values to fill (int).
Together, limit and axis control how many values are filled and in which direction.
For our purpose, the simplest approach is to take out the column that needs to be changed, fill it, and then assign it back to the original data.
import pandas

name_columns = [' ','名字','类型', '城市', '地区', '地点', '评分', '评分人数', '价格']
tabledata = pandas.read_excel("./hotel.xlsx", header=0, names=name_columns)
# fill only the '类型' (type) column, then write it back
tableline = tabledata['类型'].fillna(value='其他')
tabledata['类型'] = tableline
print(tabledata)
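As a hedged alternative to taking the column out and assigning it back, fillna also accepts a dict that maps column names to fill values. The sketch below uses a made-up DataFrame and also shows limit on a numeric column; hotels, filled and limited are illustrative names.

```python
import pandas

# Made-up data with gaps in both columns.
hotels = pandas.DataFrame({
    "类型": ["商务出行", None, "休闲度假"],
    "评分": [4.5, None, None],
})

# A dict maps column name -> fill value; columns not listed are left alone.
filled = hotels.fillna(value={"类型": "其他"})
print(filled)

# limit=1 fills at most one missing value in the column.
limited = hotels["评分"].fillna(0.0, limit=1)
print(limited)
```

Both calls return new objects, so hotels itself is unchanged until the result is assigned back.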
6. Modify a column, replacing missing values with the average
The idea is basically the same as above; the difference is that we first need the average of the column. Missing values must not take part in the average, so we first select the rows where this column is not missing, then take the column and call mean() on it to get the average directly. (mean() also skips NaN by default, so the filtering step is optional.)
Other functions used in the same way:
- mean(): mean
- median(): median
- max(): maximum
- min(): minimum
- sum(): sum
- std(): standard deviation
- Series-only methods: argmax() gives the position of the maximum value, argmin() gives the position of the minimum value
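To illustrate the functions listed above, here is a minimal sketch on a made-up Series with one missing value. The data and variable names are assumptions for the example only.

```python
import pandas

# Made-up scores; the None becomes NaN.
scores = pandas.Series([4.5, None, 3.0, 5.0])

mean_score = scores.mean()      # NaN is skipped by default (skipna=True)
median_score = scores.median()
total = scores.sum()
spread = scores.std()           # sample standard deviation
max_pos = scores.argmax()       # integer position of the largest value
min_pos = scores.argmin()       # integer position of the smallest value

print(mean_score, median_score, scores.max(), scores.min(), total, spread)
print(max_pos, min_pos)
```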
import pandas

name_columns = [' ','名字','类型', '城市', '地区', '地点', '评分', '评分人数', '价格']
tabledata = pandas.read_excel("./hotel.xlsx", header=0, names=name_columns)
# keep only the rows whose '评分' (score) is present, then average them
tableline = tabledata[tabledata['评分'].notnull()]
score_avg = tableline['评分'].mean()
# fill the gaps with the average and write the column back
tabledata['评分'] = tabledata['评分'].fillna(value=score_avg)
print(tabledata)
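Because mean() skips NaN by default, the average can also be computed on the column directly. The following is a minimal sketch on made-up data (hotels is an illustrative DataFrame, not the homework file) showing that shorter variant.

```python
import pandas

# Made-up scores with one gap.
hotels = pandas.DataFrame({"评分": [4.0, None, 5.0]})

# mean() ignores the NaN, so no filtering of rows is needed first.
score_avg = hotels["评分"].mean()

# Fill the gap with the average and assign the column back.
hotels["评分"] = hotels["评分"].fillna(score_avg)
print(hotels)
```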