The previous post introduced pandas' basic data structures and the basic create, delete, update, and query operations; this one covers reading and writing files with pandas.
1. Reading files
pandas reads files mainly through the read_xx() family of functions, each of which returns a DataFrame. The sections below go through them one by one.
1.1 Excel files
pd.read_excel() reads Excel files. The main parameters are:
(1) sheet_name: which sheet of the Excel file to read
(2) index_col: which column to use as the row index; positions count from 0
(4) usecols: which columns to read, given as column positions or column names
(5) header: which row to use as the column labels; the default is the first row, i.e. header = 0
(6) date_parser: the function used to parse dates (deprecated since pandas 2.0 in favor of date_format)
(7) parse_dates: whether to try parsing dates; the default is False. If True, pandas tries to parse the index. A list of column positions or column names to parse can also be given
(8) names: the column labels to use
(9) engine: for read_csv, the parser engine; the default is C, and older pandas versions needed engine = "python" when the file path contained Chinese characters. read_excel also has an engine parameter, but there it selects the Excel library, e.g. openpyxl
(10) encoding: for read_csv, the file encoding; the default is utf-8, and gbk is also common. read_excel does not take this parameter
(11) skiprows: how many rows to skip at the top of the file; counting starts from 0
(12) nrows: how many rows of data to read
(13) converters: a dictionary mapping column names to conversion functions
import pandas as pd
df = pd.read_excel(r"D:\迅雷下载\示例.xlsx",sheet_name = 0,index_col = 0,nrows = 5)
print(df)
性别 年龄 省内省外 消费金额 贷款与否
用户id
1 男 60 1 311.0 0
2 NaN 25 1 220.0 1
3 男 47 1 246.0 0
4 女 52 0 NaN 0
5 女 21 0 916.0 0
sheet_name specifies which sheet of the file to read; it can be a sheet name or the sheet's position, counting from 0;
index_col specifies which column to use as the row index; it can be a column name or a column position, counting from 0;
nrows is simply the number of rows to read; the examples read only 5 rows so that the structure of the data is easy to see.
df = pd.read_excel(r"D:\迅雷下载\示例.xlsx",index_col = "用户id",usecols = [0,1,2,4,5],nrows = 5)
print(df)
性别 年龄 消费金额 贷款与否
用户id
1 男 60 311.0 0
2 NaN 25 220.0 1
3 男 47 246.0 0
4 女 52 NaN 0
5 女 21 916.0 0
usecols specifies which columns to read; here they are given as column positions, counting from 0;
header specifies which row to use as the column labels; the default is 0, and it can also be 1 or None;
# header = 0: use the first row as the column labels
df = pd.read_excel(r"D:\迅雷下载\示例.xlsx",header = 0,nrows = 5)
print(df)
用户id 性别 年龄 省内省外 消费金额 贷款与否
0 1 男 60 1 311.0 0
1 2 NaN 25 1 220.0 1
2 3 男 47 1 246.0 0
3 4 女 52 0 NaN 0
4 5 女 21 0 916.0 0
# header = 1: use the second row as the column labels
df = pd.read_excel(r"D:\迅雷下载\示例.xlsx",header = 1,nrows = 5)
print(df)
1 男 60 1.1 311 0
0 2 NaN 25 1 220.0 1
1 3 男 47 1 246.0 0
2 4 女 52 0 NaN 0
3 5 女 21 0 916.0 0
4 6 男 37 0 980.0 1
With header = None the table is treated as having no header row, and the columns are labeled with integers starting from 0. If the original table does have a header row, that row becomes the first row of data;
df = pd.read_excel(r"D:\迅雷下载\示例.xlsx",header = None,nrows = 5)
print(df)
0 1 2 3 4 5
0 用户id 性别 年龄 省内省外 消费金额 贷款与否
1 1 男 60 1 311 0
2 2 NaN 25 1 220 1
3 3 男 47 1 246 0
4 4 女 52 0 NaN 0
Setting column names: with header = None, the column names can be set either with the names parameter or by assigning a list to df.columns;
skiprows skips the given number of rows at the top of the file;
df = pd.read_excel(r"D:\迅雷下载\示例.xlsx",header = None,nrows = 5,skiprows = 1,names = ["用户id","性别","年龄","省内外情况","消费情况","贷款情况"])
print(df)
用户id 性别 年龄 省内外情况 消费情况 贷款情况
0 1 男 60 1 311.0 0
1 2 NaN 25 1 220.0 1
2 3 男 47 1 246.0 0
3 4 女 52 0 NaN 0
4 5 女 21 0 916.0 0
Because the original table has a header row, header = None turns that row into the first row of data; skiprows = 1 skips it when reading. The same result can be obtained by assigning to df.columns after reading:
df = pd.read_excel(r"D:\迅雷下载\示例.xlsx",header = None,nrows = 5,skiprows = 1)
df.columns = ["用户id","性别","年龄","省内外情况","消费情况","贷款情况"]
print(df)
用户id 性别 年龄 省内外情况 消费情况 贷款情况
0 1 男 60 1 311.0 0
1 2 NaN 25 1 220.0 1
2 3 男 47 1 246.0 0
3 4 女 52 0 NaN 0
4 5 女 21 0 916.0 0
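The parse_dates and converters parameters from the list above are not used in the examples so far. Below is a minimal sketch of how they behave. To keep it self-contained it uses read_csv on an in-memory string instead of an Excel file, and the column names (user_id, signup_date, amount) are made up for illustration. read_excel accepts the same two parameters, although there the converter receives the cell value rather than a raw string.

```python
import io
import pandas as pd

# Hypothetical in-memory data standing in for a file on disk.
csv_text = (
    "user_id,signup_date,amount\n"
    "1,2019-03-01,311\n"
    "2,2019-03-02,220\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    index_col="user_id",
    # Parse this column into datetimes instead of leaving it as text.
    parse_dates=["signup_date"],
    # Each raw cell string of "amount" is passed through this function.
    converters={"amount": lambda s: float(s) / 100},
)
print(df.dtypes)
print(df)
```

A converter takes precedence over automatic type inference for its column, so it is a convenient place for unit conversions or for stripping stray characters before the data reaches the DataFrame.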
1.2 CSV files
A CSV file is a comma-delimited file. Its read parameters are largely the same as for Excel; the difference is that when the path contains Chinese characters, the engine parameter has to be set (at least in older pandas versions);
If the CSV file is GBK-encoded, reading it without the encoding parameter raises an error. If the file path also contains Chinese characters, engine needs to be set as well, otherwise older pandas versions raise an error;
df = pd.read_csv(r"D:\迅雷下载\示例gbk.csv",encoding = "gbk",engine = "python")
print(df)
用户id 性别 年龄 省内省外 消费金额 贷款与否
0 1 男 60.0 1 311.0 0.0
1 2 NaN 25.0 1 220.0 1.0
2 3 男 47.0 1 246.0 0.0
3 4 女 52.0 0 NaN 0.0
4 5 女 21.0 0 916.0 0.0
5 6 男 37.0 0 980.0 1.0
6 7 男 34.0 0 482.0 1.0
7 8 男 NaN 0 267.0 0.0
8 9 女 50.0 1 NaN 0.0
9 10 男 20.0 1 265.0 1.0
10 11 男 51.0 1 612.0 0.0
11 12 男 31.0 0 704.0 0.0
12 13 女 NaN 0 529.0 1.0
13 14 女 18.0 1 528.0 1.0
14 15 女 22.0 0 328.0 NaN
15 16 女 45.0 0 647.0 0.0
16 17 NaN 52.0 0 860.0 0.0
17 18 男 50.0 1 779.0 0.0
18 19 男 59.0 0 750.0 1.0
19 20 男 23.0 0 597.0 0.0
encoding defaults to utf-8 and can be set to gbk when needed; the example file is GBK-encoded;
engine specifies the parser engine. The default is the C engine; if the path contains Chinese characters, set it to python (needed in older pandas versions). The python engine is more feature-complete, though slower;
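To see the effect of encoding without needing a file on disk, here is a small sketch that encodes a two-row table as GBK bytes and reads it from memory. The table contents are made up for illustration; the point is only that the default utf-8 decoding fails on GBK bytes while encoding = "gbk" succeeds.

```python
import io
import pandas as pd

# Hypothetical two-row table, encoded as GBK bytes to mimic a GBK file on disk.
raw = "用户id,性别\n1,男\n2,女\n".encode("gbk")

# Reading with the default utf-8 codec fails on these bytes.
try:
    pd.read_csv(io.BytesIO(raw))
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Passing the matching encoding reads the table correctly.
df = pd.read_csv(io.BytesIO(raw), encoding="gbk")
print(decoded_as_utf8)
print(df)
```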
1.3 txt files
A txt file here means a tab-delimited (\t) file. It can be read with read_table, or with read_csv as in the example below. The parameters are largely the same as for Excel and CSV; the difference is that with read_csv the sep parameter must be specified.
df = pd.read_csv(r"D:\迅雷下载\示例txt.txt",encoding = "gbk",engine = "python",sep = "\t",nrows= 5,index_col = "用户id")
print(df)
性别 年龄 省内省外 消费金额 贷款与否
用户id
1 男 60 1 311.0 0
2 NaN 25 1 220.0 1
3 男 47 1 246.0 0
4 女 52 0 NaN 0
5 女 21 0 916.0 0
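As a quick check of the relationship between the two functions, the sketch below reads the same tab-delimited text with read_table (whose default separator is a tab) and with read_csv plus sep = "\t". The sample data and column names are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical tab-delimited text standing in for a .txt file.
text = "user_id\tgender\tage\n1\tM\t60\n2\tF\t25\n"

# read_table splits on tabs by default.
via_table = pd.read_table(io.StringIO(text), index_col="user_id")
# read_csv needs the separator spelled out.
via_csv = pd.read_csv(io.StringIO(text), sep="\t", index_col="user_id")

print(via_table.equals(via_csv))
```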
2. Writing files
2.1 Writing a file
Excel, CSV, and txt files are written in much the same way, through pandas' to_xx() methods. Because the three are similar, only the Excel format is shown here. The main parameters are:
(1) index: whether to keep the row index
(2) columns: which columns to write, given as a list of column labels
(3) sheet_name: the name of the sheet
(4) encoding: the encoding, e.g. utf-8 or gbk (used by to_csv; to_excel no longer accepts it as of pandas 2.0)
(5) na_rep: the text written for missing values
(6) inf_rep: the text written for infinite values
(7) index_label: the label for the row index column
(8) header: the default is True; False writes no column labels. To rename the columns, pass a list, e.g. header = ["Column 1", "Column 2", "Column 3"]
df = pd.read_excel(r"D:\迅雷下载\示例.xlsx",sheet_name = 0,nrows = 5)
# Write the selected columns without the row index; missing and infinite values are written as "Na"
df.to_excel(r"C:\Users\wenjianhua\Desktop\示例20190322.xlsx",index = False,columns = ["用户id","性别","消费金额","贷款与否"],
sheet_name = "示例",na_rep = "Na",inf_rep = "Na")
When using to_csv, remember to set the sep parameter whenever the delimiter is not a comma, e.g. sep = "\t" for a tab-delimited txt file!
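Since only the Excel case is shown above, here is a minimal to_csv sketch for comparison. It writes a small, made-up DataFrame to an in-memory buffer rather than to a file, so that the resulting text can be inspected directly; sep, index, and na_rep play the same roles as described in the parameter list.

```python
import io
import pandas as pd

# Hypothetical data with one missing value.
df = pd.DataFrame(
    {"user_id": [1, 2], "gender": ["M", None], "amount": [311.0, 220.0]}
)

buffer = io.StringIO()
# Tab-delimited output, no row index, missing values written as "NA".
df.to_csv(buffer, sep="\t", index=False, na_rep="NA")
text = buffer.getvalue()
print(text)
```

Passing a file path instead of the buffer writes the same text to disk.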
2.2 Writing multiple sheets to one file
When several tables need to be written to the same file, use pandas' ExcelWriter. The steps are:
# XX stands for each sheet's name; every sheet needs a different name
writer = pd.ExcelWriter(path,engine = "xlsxwriter")
df1.to_excel(writer,sheet_name = "XX")
df2.to_excel(writer,sheet_name = "XX")
df3.to_excel(writer,sheet_name = "XX")
# close() saves the file; the older writer.save() was removed in pandas 2.0
writer.close()
# Let's illustrate with an example
df1 = pd.read_excel(r"D:\迅雷下载\示例.xlsx",sheet_name = 0,nrows = 5)
df2 = pd.read_excel(r"D:\迅雷下载\示例.xlsx",sheet_name = 0,nrows = 5)
df3 = pd.read_excel(r"D:\迅雷下载\示例.xlsx",sheet_name = 0,nrows = 5)
writer = pd.ExcelWriter(r"D:\迅雷下载\示例1.xlsx",engine = "xlsxwriter")
df1.to_excel(writer,sheet_name = "示例1")
df2.to_excel(writer,sheet_name = "示例2")
df3.to_excel(writer,sheet_name = "示例3")
writer.close()
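ExcelWriter can also be used as a context manager, which closes and saves the workbook automatically when the block ends. The sketch below assumes an Excel library such as openpyxl is installed; the DataFrames, sheet names, and the temporary output path are made up for illustration.

```python
import os
import tempfile
import pandas as pd

# Hypothetical tables to write to separate sheets.
df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"b": [3, 4]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.xlsx")  # hypothetical output path
    # Leaving the with block closes the writer, which saves the workbook.
    with pd.ExcelWriter(path) as writer:
        df1.to_excel(writer, sheet_name="first", index=False)
        df2.to_excel(writer, sheet_name="second", index=False)
    # sheet_name=None reads every sheet into a dict keyed by sheet name.
    sheets = pd.read_excel(path, sheet_name=None)

print(list(sheets))
```

Reading the file back with sheet_name = None is a handy way to confirm that every sheet was written.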
3. Reading files in batches
We often run into many tables with a similar, or even identical, structure. Reading them one at a time is too slow and is not the Python way; instead, we can loop over the files and read them all in one go, for example:
import os
import pandas as pd
frame = []
path = r"C:\Users\wenjianhua\Desktop\example"
# Read the first four columns of every file in the folder
for file in os.listdir(path):
    filepath = os.path.join(path, file)
    print(filepath)
    frame.append(pd.read_csv(filepath, usecols=[0, 1, 2, 3]))
# Stack all the tables into one and renumber the rows
df = pd.concat(frame, ignore_index=True)
print(df.head(10))
C:\Users\wenjianhua\Desktop\example\order-14.3.csv
C:\Users\wenjianhua\Desktop\example\order.csv
商品ID 类别ID 门店编号 单价
0 30006206 915000003 CDNL 25.23
1 30163281 914010000 CDNL 2.00
2 30200518 922000000 CDNL 19.62
3 29989105 922000000 CDNL 2.80
4 30179558 915000100 CDNL 47.41
5 30022232 960000000 CDNL 0.30
6 30179520 915000100 CDNL 77.52
7 30184351 915000106 CDNL 15.57
8 30184351 915000106 CDNL 15.58
9 29989059 922000003 CDNL 1.98
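The loop above reads every file in the folder, so a stray non-CSV file would cause an error. A variation, sketched below under made-up file names and columns, selects only *.csv files with pathlib and records which file each row came from. It builds its own temporary folder so that it can run anywhere.

```python
import tempfile
from pathlib import Path
import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Two hypothetical CSV files with the same columns, plus one unrelated file.
    (folder / "order_a.csv").write_text("item_id,price\n1,2.5\n2,3.0\n", encoding="utf-8")
    (folder / "order_b.csv").write_text("item_id,price\n3,4.5\n", encoding="utf-8")
    (folder / "notes.txt").write_text("not a table", encoding="utf-8")

    frames = []
    # glob("*.csv") skips notes.txt; sorted() makes the order predictable.
    for csv_path in sorted(folder.glob("*.csv")):
        part = pd.read_csv(csv_path)
        part["source_file"] = csv_path.name  # remember where each row came from
        frames.append(part)

# Stack the tables and renumber the rows from 0.
df = pd.concat(frames, ignore_index=True)
print(df)
```

Keeping a source_file column makes it easier to trace a suspicious row back to the file it was read from.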