Learning Python: reading and writing files with pandas

The previous post gave a short introduction to the pandas data structures and the basic create, read, update, and delete operations. This post covers reading and writing files with pandas.

1. Reading files

pandas reads files mainly through the read_xx() family of functions, and the data comes back as a DataFrame. The sections below go through the read_xx() functions one by one.

1.1 Excel files

pd.read_excel() reads Excel files. Its main parameters are:

(1) sheet_name: which sheet of the Excel file to read

(2) index_col: which column to use as the row index, counted from 0

(3) usecols: which columns of the sheet to read, given as position indexes, column names, or Excel column letters such as "A:C"

(4) header: which row to use as the column index; the default is the first row, i.e. header = 0

(5) date_parser: function used to parse dates (deprecated since pandas 2.0 in favor of date_format)

(6) parse_dates: whether to try parsing dates, default False. If True, pandas tries to parse the index. A list of column positions or column names to parse can also be given

(7) names: list of column names to use as the column index

(8) engine: for read_csv the default parser is C, and engine = "python" can be set when, for example, the file path contains Chinese characters; for read_excel this parameter instead selects the Excel reading library, such as "openpyxl"

(9) encoding: a read_csv parameter rather than a read_excel one; the default is utf-8, and gbk can also be used

(10) skiprows: rows to skip at the start of the file; an integer skips that many rows, and a list gives row numbers counted from 0

(11) nrows: how many rows of data to read

(12) converters: a dictionary mapping column names to the functions applied to them

import pandas as pd
df = pd.read_excel(r"D:\迅雷下载\示例.xlsx",sheet_name = 0,index_col = 0,nrows = 5)
print(df)

性别  年龄  省内省外   消费金额  贷款与否
用户id                            
1       男  60     1  311.0     0
2     NaN  25     1  220.0     1
3       男  47     1  246.0     0
4       女  52     0    NaN     0
5       女  21     0  916.0     0

sheet_name specifies which sheet of the file to read. It can be the sheet name or the position of the sheet, counted from 0;

index_col specifies which column to use as the row index. It can be a column name or a column position, counted from 0;

nrows is simply the number of rows to read. Only 5 rows are read in this example so that the data structure is easy to see.

df = pd.read_excel(r"D:\迅雷下载\示例.xlsx",index_col = "用户id",usecols = [0,1,2,4,5],nrows = 5)
print(df)

性别  年龄   消费金额  贷款与否
用户id                      
1       男  60  311.0     0
2     NaN  25  220.0     1
3       男  47  246.0     0
4       女  52    NaN     0
5       女  21  916.0     0

usecols specifies which columns to read. Position indexes, counted from 0, are used here;

header specifies which row to use as the column index. The default is 0; it can also be 1 or None;

# header = 0: use the first row as the column index
df = pd.read_excel(r"D:\迅雷下载\示例.xlsx",header = 0,nrows = 5)  
print(df)

用户id   性别  年龄  省内省外   消费金额  贷款与否
0     1    男  60     1  311.0     0
1     2  NaN  25     1  220.0     1
2     3    男  47     1  246.0     0
3     4    女  52     0    NaN     0
4     5    女  21     0  916.0     0

# header = 1: use the second row as the column index
df = pd.read_excel(r"D:\迅雷下载\示例.xlsx",header = 1,nrows = 5)   
print(df)

1    男  60  1.1    311  0
0  2  NaN  25    1  220.0  1
1  3    男  47    1  246.0  0
2  4    女  52    0    NaN  0
3  5    女  21    0  916.0  0
4  6    男  37    0  980.0  1

header = None means that no row of the sheet is used as the column index, i.e. there is no header, and the columns are numbered from 0. If the original sheet does have a header row, that row becomes the first row of data;

df = pd.read_excel(r"D:\迅雷下载\示例.xlsx",header = None,nrows = 5) 
print(df)

0    1   2     3     4     5
0  用户id   性别  年龄  省内省外  消费金额  贷款与否
1     1    男  60     1   311     0
2     2  NaN  25     1   220     1
3     3    男  47     1   246     0
4     4    女  52     0   NaN     0

To set the column names when header = None, either pass the names parameter or assign to df.columns after reading;

skiprows skips the first few rows when reading;

df = pd.read_excel(r"D:\迅雷下载\示例.xlsx",header = None,nrows = 5,skiprows = 1,names = ["用户id","性别","年龄","省内外情况","消费情况","贷款情况"]) 
print(df)

用户id   性别  年龄  省内外情况   消费情况  贷款情况
0     1    男  60      1  311.0     0
1     2  NaN  25      1  220.0     1
2     3    男  47      1  246.0     0
3     4    女  52      0    NaN     0
4     5    女  21      0  916.0     0

Because the original sheet has a header, header = None turns that header row into the first row of data. Setting skiprows = 1 skips it when reading.
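The interplay of header = None, skiprows, and names can also be tried without the example workbook. The following is a minimal sketch, assuming that these three parameters behave the same way in read_csv as in read_excel (both readers accept them). The in-memory data and the English column names are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data: one header row followed by two data rows.
raw = "id,gender,age\n1,M,60\n2,F,25\n"

# header=None treats the header row as ordinary data.
with_header_as_data = pd.read_csv(io.StringIO(raw), header=None)

# skiprows=1 drops that row, and names supplies replacement column names.
renamed = pd.read_csv(
    io.StringIO(raw),
    header=None,
    skiprows=1,
    names=["user_id", "sex", "years"],
)

print(with_header_as_data)
print(renamed)
```

The first frame has three rows because the header line is counted as data; the second has two rows under the new names.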

df = pd.read_excel(r"D:\迅雷下载\示例.xlsx",header = None,nrows = 5,skiprows = 1) 
df.columns = ["用户id","性别","年龄","省内外情况","消费情况","贷款情况"]
print(df)

用户id   性别  年龄  省内外情况   消费情况  贷款情况
0     1    男  60      1  311.0     0
1     2  NaN  25      1  220.0     1
2     3    男  47      1  246.0     0
3     4    女  52      0    NaN     0
4     5    女  21      0  916.0     0
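The parameter list at the start of this section also mentions parse_dates and converters, and usecols can take column names as well as positions. None of these appear in the examples above, so here is a small sketch of how they might be used. It is an illustration under assumptions: read_csv on in-memory text stands in for read_excel (both accept these parameters), and the column names and values are invented.

```python
import io

import pandas as pd

# Hypothetical sample data with a date column and an extra column to drop.
raw = (
    "user_id,signup,amount,note\n"
    "1,2019-03-01,311,a\n"
    "2,2019-03-02,220,b\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    usecols=["user_id", "signup", "amount"],  # select columns by name
    parse_dates=["signup"],                   # parse this column as dates
    converters={"amount": lambda s: int(s) * 100},  # applied to each raw cell
)

print(df)
print(df.dtypes)
```

With read_csv the converter receives each cell as a string, which is why the sketch calls int() first.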

1.2 CSV files

A CSV file is a comma-delimited file. Its read parameters are largely the same as for Excel; the difference is that when the file path contains Chinese characters, the engine parameter has to be set;

If the CSV file is gbk-encoded and the encoding parameter is not set, an error is raised; likewise, if the file path contains Chinese characters and the engine parameter is not set, an error is raised;

df = pd.read_csv(r"D:\迅雷下载\示例gbk.csv",encoding = "gbk",engine = "python")
print(df)

用户id   性别    年龄  省内省外   消费金额  贷款与否
0      1    男  60.0     1  311.0   0.0
1      2  NaN  25.0     1  220.0   1.0
2      3    男  47.0     1  246.0   0.0
3      4    女  52.0     0    NaN   0.0
4      5    女  21.0     0  916.0   0.0
5      6    男  37.0     0  980.0   1.0
6      7    男  34.0     0  482.0   1.0
7      8    男   NaN     0  267.0   0.0
8      9    女  50.0     1    NaN   0.0
9     10    男  20.0     1  265.0   1.0
10    11    男  51.0     1  612.0   0.0
11    12    男  31.0     0  704.0   0.0
12    13    女   NaN     0  529.0   1.0
13    14    女  18.0     1  528.0   1.0
14    15    女  22.0     0  328.0   NaN
15    16    女  45.0     0  647.0   0.0
16    17  NaN  52.0     0  860.0   0.0
17    18    男  50.0     1  779.0   0.0
18    19    男  59.0     0  750.0   1.0
19    20    男  23.0     0  597.0   0.0

encoding defaults to utf-8 and can be set to gbk when needed; the example file is gbk-encoded;

engine specifies the parser engine. The default is the C engine; if the path contains Chinese characters it should be set to python, and the python engine is also more feature-complete;
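The encoding behavior described above can be reproduced with a throwaway file. The sketch below writes a small gbk-encoded CSV into a temporary directory (the file name is hypothetical), shows that the default utf-8 read fails, and then reads it again with encoding = "gbk".

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_gbk.csv")  # hypothetical file name

    # Write a small CSV encoded as gbk.
    with open(path, "w", encoding="gbk", newline="") as f:
        f.write("用户id,性别\n1,男\n2,女\n")

    # Reading with the default utf-8 encoding fails on the gbk bytes.
    try:
        pd.read_csv(path)
        default_read_failed = False
    except UnicodeDecodeError:
        default_read_failed = True

    # Passing the matching encoding reads the file correctly.
    df = pd.read_csv(path, encoding="gbk")

print(default_read_failed)
print(df)
```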

1.3 TXT files

A txt file here means a tab (\t) delimited file. It can be read with read_table, or with read_csv as in the example below. The parameters are largely the same as for Excel and CSV; the difference is that with read_csv the sep parameter must be specified.

df = pd.read_csv(r"D:\迅雷下载\示例txt.txt",encoding = "gbk",engine = "python",sep = "\t",nrows= 5,index_col = "用户id")
print(df)

性别  年龄  省内省外   消费金额  贷款与否
用户id                            
1       男  60     1  311.0     0
2     NaN  25     1  220.0     1
3       男  47     1  246.0     0
4       女  52     0    NaN     0
5       女  21     0  916.0     0
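To make the role of sep concrete, the following sketch reads the same tab-delimited text three ways. The in-memory data is invented for illustration. read_table uses "\t" as its default separator, while read_csv defaults to a comma.

```python
import io

import pandas as pd

# Hypothetical tab-delimited sample data.
raw = "user_id\tage\n1\t60\n2\t25\n"

via_table = pd.read_table(io.StringIO(raw))        # sep defaults to "\t"
via_csv = pd.read_csv(io.StringIO(raw), sep="\t")  # sep given explicitly
no_sep = pd.read_csv(io.StringIO(raw))             # default sep="," is used

print(via_table)
print(no_sep.shape)
```

Without sep, read_csv finds no commas and puts each whole line into a single column.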

2. Writing files

2.1 Writing to a file

Excel, CSV, and txt files are written in much the same way, using the pandas to_xx() methods. Because the three are so similar, only the Excel format is shown here.

(1) index: whether to keep the row index

(2) columns: which columns to write, given by column name

(3) sheet_name: name of the sheet

(4) encoding: encoding format, utf-8 or gbk (used by to_csv; to_excel no longer accepts it as of pandas 2.0)

(5) na_rep: what to write for missing values

(6) inf_rep: what to write for infinite values

(7) index_label: label for the row index

(8) header: default True; if False, no column index is written. To change the column names, pass a list such as header = ["Column 1", "Column 2", "Column 3"]

df = pd.read_excel(r"D:\迅雷下载\示例.xlsx",sheet_name = 0,nrows = 5)
# encoding is not passed here: to_excel no longer accepts it as of pandas 2.0
df.to_excel(r"C:\Users\wenjianhua\Desktop\示例20190322.xlsx",index = False,columns = ["用户id","性别","消费金额","贷款与否"],
           sheet_name = "示例",na_rep = "Na",inf_rep = "Na")

When writing with to_csv, remember to set the sep parameter if the file should not be comma-delimited (for example, sep = "\t" for a txt file)!
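As a small illustration of the to_csv parameters, the sketch below writes a made-up DataFrame as tab-delimited text. Calling to_csv without a path returns the text as a string, which keeps the example free of file paths.

```python
import pandas as pd

# Hypothetical data with one missing value.
df = pd.DataFrame({"user_id": [1, 2], "amount": [311.0, None]})

# sep sets the delimiter, index=False drops the row index,
# and na_rep is written in place of missing values.
text = df.to_csv(sep="\t", index=False, na_rep="NA")

print(text)
```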

2.2 Writing multiple sheets to one file

When several tables need to be written into one file, pandas' ExcelWriter is required. The steps are as follows:

# Template: path and df1, df2, df3 are placeholders
writer = pd.ExcelWriter(path,engine = "xlsxwriter")
df1.to_excel(writer,sheet_name = "XX1")
df2.to_excel(writer,sheet_name = "XX2")
df3.to_excel(writer,sheet_name = "XX3")
writer.close()  # writer.save() was removed in pandas 2.0
# A concrete example
df1 = pd.read_excel(r"D:\迅雷下载\示例.xlsx",sheet_name = 0,nrows = 5)
df2 = pd.read_excel(r"D:\迅雷下载\示例.xlsx",sheet_name = 0,nrows = 5)
df3 = pd.read_excel(r"D:\迅雷下载\示例.xlsx",sheet_name = 0,nrows = 5)
writer = pd.ExcelWriter(r"D:\迅雷下载\示例1.xlsx",engine = "xlsxwriter")
df1.to_excel(writer,sheet_name = "示例1")
df2.to_excel(writer,sheet_name = "示例2")
df3.to_excel(writer,sheet_name = "示例3")
writer.close()

3. Reading files in batches

We often run into data tables with similar, or even identical, structure. Reading them one at a time is too slow and not in keeping with the Python style, so we can use a loop to read multiple tables in one go, for example:

import os
import pandas as pd
frame = []
path = r"C:\Users\wenjianhua\Desktop\example"
for file in os.listdir(path):
    filepath = os.path.join(path, file)
    print(filepath)
    frame.append(pd.read_csv(filepath,usecols = [0,1,2,3]))

df = pd.concat(frame,ignore_index = True)
print(df.head(10))

C:\Users\wenjianhua\Desktop\example\order-14.3.csv
C:\Users\wenjianhua\Desktop\example\order.csv
       商品ID       类别ID  门店编号     单价
0  30006206  915000003  CDNL  25.23
1  30163281  914010000  CDNL   2.00
2  30200518  922000000  CDNL  19.62
3  29989105  922000000  CDNL   2.80
4  30179558  915000100  CDNL  47.41
5  30022232  960000000  CDNL   0.30
6  30179520  915000100  CDNL  77.52
7  30184351  915000106  CDNL  15.57
8  30184351  915000106  CDNL  15.58
9  29989059  922000003  CDNL   1.98
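os.listdir returns every entry in the folder, so a stray non-CSV file would be passed to read_csv as well. One possible variation, sketched here under assumptions, filters by extension with pathlib's glob. The temporary folder, file names, and values are all invented so the example runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)

    # Hypothetical input: two CSV files with the same columns and one unrelated file.
    (folder / "order_a.csv").write_text("item,price\n1,2.5\n2,3.0\n", encoding="utf-8")
    (folder / "order_b.csv").write_text("item,price\n3,4.0\n", encoding="utf-8")
    (folder / "notes.txt").write_text("not a table", encoding="utf-8")

    # glob("*.csv") skips notes.txt; sorted() makes the read order predictable.
    frames = [pd.read_csv(p) for p in sorted(folder.glob("*.csv"))]

# ignore_index=True renumbers the rows of the combined table from 0.
df = pd.concat(frames, ignore_index=True)
print(df)
```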

Origin blog.csdn.net/d345389812/article/details/88780624