Working with Excel spreadsheets using pandas

1, Basic reading and writing of Excel spreadsheets with pandas

1.1, Writing an Excel spreadsheet with pandas

import pandas as pd
# Create a DataFrame instance from a dict (a DataFrame is a two-dimensional data frame)
df = pd.DataFrame({'id':[1,2,3],'name':['张三','李四','王五']})
# By default pandas creates an index automatically; use set_index() to make a column the index
df = df.set_index('id')
df.to_excel("d:/output.xlsx")
print('Done!')

1.2, Reading an Excel spreadsheet with pandas

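import pandas as pd
# index_col specifies the index column at read time
# The raw string (r"...") keeps the backslashes in the Windows path from being read as escape sequences
people = pd.read_excel(r"D:\py_workspace\pandas_excel\People.xlsx",index_col='ID')

# header=1 skips the first row and takes the column names from the second row; use it when the first row holds dirty data. If every cell in the first row is empty, pandas starts from the second row automatically
# people = pd.read_excel(r"D:\py_workspace\pandas_excel\People.xlsx",header=1)

# If the Excel sheet has no column names, set header=None; the columns are then named 0,1,2... by default
# people = pd.read_excel(r"D:\py_workspace\pandas_excel\People.xlsx",header=None)
# Set the column names
# people.columns= ['ID','Type','Title','FirstName','MiddleName','LastName']
# Set the ID column as the index; inplace=True modifies this DataFrame directly
# people.set_index('ID',inplace=True) # compare with people = people.set_index('ID'), which returns a new DataFrame
# shape is an attribute that returns a tuple: the first element is the row count, the second the column count
print(people.shape) # e.g. (19972, 6) while ID is still an ordinary column; the index is not counted in shape
# columns holds the column names; its type is <class 'pandas.core.indexes.base.Index'>
print(people.columns) # e.g. Index(['ID', 'Type', 'Title', 'FirstName', 'MiddleName', 'LastName'], dtype='object') while ID is still an ordinary column; an index column is not listed here
print("First five rows\n",people.head(5))
print("Last three rows\n",people.tail(3))
# Because index_col was given when reading the People file, ID is the index and no extra index column is generated on output
people.to_excel(r"D:\outpepole.xlsx")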

[Note]

(1) The index_col parameter of read_excel() specifies the index column when reading a spreadsheet, so pandas does not generate an index column automatically. The header parameter specifies which row holds the column names; if the sheet has no column names, set header=None and pandas generates the names 0, 1, 2, ... automatically.

(2) The shape attribute of a DataFrame is a tuple of the row count and the column count; the columns attribute lists all column names; the index attribute holds the index labels.
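
The attributes in the note can be checked without an Excel file. The following is a minimal sketch on a small in-memory DataFrame; the sample rows and the shortened column list are made up for illustration and stand in for what read_excel(..., header=None) would return.

```python
import pandas as pd

# Hypothetical sample data standing in for a sheet read with header=None,
# so the columns start out named 0, 1, 2.
people = pd.DataFrame([[1, "Mr.", "Ken"], [2, "Ms.", "Terri"], [3, "Mr.", "Rob"]])
default_columns = list(people.columns)

# Assign column names, then make ID the index in place.
people.columns = ["ID", "Title", "FirstName"]
people.set_index("ID", inplace=True)

print(default_columns)        # the automatically generated column names
print(people.shape)           # (rows, columns); the index is not counted as a column
print(list(people.columns))   # ID is no longer listed
print(list(people.index))     # the ID values are now the index labels
```

Calling people = people.set_index("ID") instead would leave the original object untouched and bind the name to a new DataFrame.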


2, Working with Excel columns, rows and cells in pandas

Note that when pandas operates on an Excel spreadsheet it is actually operating on a DataFrame, which is two-dimensional. Each column of the spreadsheet can be understood as a Series object, and a DataFrame is made up of Series objects.

2.1, Operating on columns, rows and cells

[Example 1]

import pandas as pd
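# Create an empty Series object
# s1 = pd.Series()
# s1.index is the index of the Series and s1.name is its name; the Series data attribute has been removed from current pandas versions
d = {"x":100,"y":200,"z":300}
s1 = pd.Series(d) # x/y/z are the index labels, 100/200/300 are the values
print(s1)
print(s1.index,s1.name) # index x/y/z, name None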

[Example 2]

import pandas as pd
# Lists
L1 = [400,500,600]
L2 = ['a','b','c']
# L1 is the data, L2 is the index
s2 = pd.Series(L1,index=L2)
print(s2)

[Example 3]

import pandas as pd
s_a = pd.Series([1,2,3],index=[1,2,3],name='A')
s_b = pd.Series([10,20,30],index=[1,2,3],name='B')
s_c = pd.Series([100,200,300],index=[1,2,3],name='C')

# Pass the Series themselves; Series.data is not available in current pandas versions
df = pd.DataFrame({s_a.name:s_a,s_b.name:s_b,s_c.name:s_c})
print(df)

[Example 4]

import pandas as pd
s_a = pd.Series([1,2,3],index=[1,2,3],name='A')
s_b = pd.Series([10,20,30],index=[1,2,3],name='B')
s_c = pd.Series([100,200,300],index=[1,2,3],name='C')

# This approach puts each Series into the DataFrame as one row
df_1 = pd.DataFrame([s_a,s_b,s_c])
print(df_1)
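
Examples 3 and 4 differ only in orientation. As a minimal sketch reusing the same three Series, a dict of Series turns each Series into a column, a list of Series turns each Series into a row, and the two results hold the same values transposed:

```python
import pandas as pd

s_a = pd.Series([1, 2, 3], index=[1, 2, 3], name='A')
s_b = pd.Series([10, 20, 30], index=[1, 2, 3], name='B')
s_c = pd.Series([100, 200, 300], index=[1, 2, 3], name='C')

# Dict of Series: the Series names become column names.
by_column = pd.DataFrame({s.name: s for s in (s_a, s_b, s_c)})
# List of Series: the Series names become row labels.
by_row = pd.DataFrame([s_a, s_b, s_c])

print(by_column)
print(by_row)

# Transposing the row-wise frame gives the same values as the column-wise one.
same_values = bool((by_row.T.to_numpy() == by_column.to_numpy()).all())
print(same_values)
```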

2.2, Auto-filling an Excel spreadsheet with pandas

import pandas as pd
from datetime import date,timedelta
def add_month(d,md):
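    """
    Add md months to the date d, keeping the day of the month unchanged.
    :param d: start date (datetime.date)
    :param md: number of months to add
    :return: the resulting datetime.date
    """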
    yd = md//12
    m = d.month +md%12
    if m!=12:
        yd +=m//12
        m = m % 12
    return date(d.year+yd,m,d.day)

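# skiprows=3 skips the first three blank rows
# usecols="C:F" reads columns C through F; it can also be written usecols="C,D,E,F"
# dtype={'ID':str} reads the ID column as strings


books = pd.read_excel("./Books.xlsx",skiprows=3,usecols="C:F",index_col=None,dtype={'ID':str,'InStore':str,'Date':str}) # index_col=None: no column is used as the index
# Each column of books is a Series object
# type(books['ID']) is a pandas Series
# books.at[0,'ID']=100
start_date = date(2018,1,13)
for i in books.index: # books.index here runs from 0 to 19
    # Use books.at[i,'ID'] rather than books['ID'].at[i]: the chained form may assign to a temporary copy and leave books unchanged
    books.at[i,'ID'] = i+1
    books.at[i,'InStore'] = 'Yes' if i%2==0 else 'No'
    # Advance the date by one day per row; timedelta has no month or year units
    # books.at[i,'Date'] = start_date + timedelta(days=i)
    # Advance by one year per row
    # books.at[i,'Date'] = date(start_date.year+i,start_date.month,start_date.day)
    books.at[i,'Date'] = add_month(start_date,i)
# Empty Excel cells are read as NaN, which is float64; that is why dtype is set to str above

print(books)
# Make the ID column the index instead of the automatically generated one
books = books.set_index("ID") # books.set_index("ID",inplace=True) modifies books directly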
books.to_excel("./output.xlsx")
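
add_month() above keeps the day of the month unchanged, so a start date such as January 31 would raise ValueError once the target month is shorter. One possible variant, sketched here under the assumption that clamping to the last valid day of the month is acceptable, uses calendar.monthrange from the standard library. The name add_month_clamped is hypothetical and not part of the original code.

```python
import calendar
from datetime import date

def add_month_clamped(d, md):
    """Return d shifted by md months, clamping the day to the target month's length."""
    # Count months from year 0 so that divmod carries into the year.
    total = d.year * 12 + (d.month - 1) + md
    year, month0 = divmod(total, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))

print(add_month_clamped(date(2018, 1, 13), 11))  # 2018-12-13
print(add_month_clamped(date(2018, 1, 13), 12))  # 2019-01-13
print(add_month_clamped(date(2018, 1, 31), 1))   # 2018-02-28
```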

[Note]

(1) When the data in a spreadsheet does not start on the first row, use the skiprows parameter of read_excel() to skip the leading rows. The usecols parameter specifies which columns to read. The dtype parameter is a dictionary that maps each column name to the data type that column should be read as.

(2) The index attribute of a DataFrame holds the row labels, numbered from 0 by default; at[] gets or sets the value at a single position; loc[] selects data by label; set_index() sets the index of the DataFrame.
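
As a minimal sketch of point (2), the following uses a small in-memory DataFrame with made-up values instead of Books.xlsx to show at[], loc[] and set_index() side by side:

```python
import pandas as pd

# Hypothetical sample data; the column names mirror the ones used above.
books = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Book A", "Book B", "Book C"]})

first_name = books.at[0, "Name"]      # read one cell by row label and column name
books.at[1, "Name"] = "Book B (2nd)"  # at[] can also set a single cell
row = books.loc[1]                    # loc[] with one label returns that row as a Series
subset = books.loc[0:1, ["Name"]]     # label slices in loc[] include both end points

indexed = books.set_index("ID")       # returns a new DataFrame; books itself is unchanged
print(first_name, row["Name"], len(subset), list(indexed.index))
```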

2.3, Calculations on Excel data

import pandas as pd
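books = pd.read_excel("./Books.xlsx",index_col="ID") # index_col specifies the index column
# Compute Price as the ListPrice column multiplied by the Discount column
# pandas aligns the cells row by row automatically
# books['Price'] = books['ListPrice'] * books['Discount']
for i in books.index: # books.index holds the index labels, so this processes every row
    # Equivalent to the column-wise calculation above; books.at[i,'Price'] avoids chained assignment through books['Price'].at[i]
    books.at[i,'Price'] = books.at[i,'ListPrice'] * books.at[i,'Discount']

# Raise the price of every book by 2 yuan
def add_price(x):
    return x+2
# print(books['ListPrice'].dtype)
# books['ListPrice'] = books['ListPrice'] +2
# books['ListPrice'] = books['ListPrice'].apply(add_price)
# apply() calls the given function on every element of the selected column
books['ListPrice'] = books['ListPrice'].apply(lambda x:x+2)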
print(books)
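
The equivalences claimed in the comments above can be verified on a small in-memory DataFrame. This is a minimal sketch with made-up prices; the PriceLoop column is a hypothetical name used only to hold the row-by-row result.

```python
import pandas as pd

# Hypothetical sample data in place of Books.xlsx.
books = pd.DataFrame(
    {"ListPrice": [10.0, 20.0, 30.0], "Discount": [0.5, 0.75, 1.0]},
    index=pd.Index([1, 2, 3], name="ID"),
)

# Column-wise multiplication; rows are matched by index label.
books["Price"] = books["ListPrice"] * books["Discount"]

# The same calculation cell by cell with at[].
books["PriceLoop"] = 0.0
for i in books.index:
    books.at[i, "PriceLoop"] = books.at[i, "ListPrice"] * books.at[i, "Discount"]

# Two equivalent ways to add 2 to every list price.
plus_vector = books["ListPrice"] + 2
plus_apply = books["ListPrice"].apply(lambda x: x + 2)
print(books)
```

The column-wise forms are usually preferable to the explicit loop because they are shorter and let pandas do the iteration internally.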

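2.4, Sorting Excel data by one or more columns

pandas sorts Excel data with the DataFrame sort_values() method. For comparison, the related sort_index() method, which sorts by index labels rather than by column values, has the following signature in older pandas releases:

## Parameters
sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, by=None)
#### Parameter descriptions
axis: 0 sorts by row labels; 1 sorts by column labels
level: default None; otherwise sorts on the given index level (check the documentation for the exact behavior)
ascending: default True, ascending order; False gives descending order
inplace: default False; if True, the sorted data replaces the original DataFrame
kind: default quicksort; the sorting algorithm
na_position: where missing values are placed, {"first","last"}; the default is "last"
by: the column to sort by; this parameter is discouraged in sort_index(), use sort_values(by=...) instead
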
import pandas as pd
products  = pd.read_excel("./List.xlsx", index_col="ID")
# sort_values() sorts the DataFrame; by names the column to sort on
# The ascending parameter of sort_values() defaults to True, which sorts from smallest to largest
products.sort_values(by='Price',inplace=True,ascending=False) # inplace=True modifies the current DataFrame instead of creating a new one
print(products)

print("="*30,"Multi-column sort","="*30)
products_2  = pd.read_excel("./List.xlsx", index_col="ID")
# For a multi-column sort, by is a list rather than a single string
# ascending is then also a list, matched one-to-one with the entries of by
products_2.sort_values(by=['Worthy','Price'],inplace=True,ascending=[True,False])
print(products_2)

[Note] For a single-column sort, pass the column name to the by parameter of sort_values() as a string. For a multi-column sort, by must be a list of column names, and ascending is a list whose entries correspond one-to-one with the columns in by.
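
A minimal sketch of both cases on an in-memory DataFrame; the product rows are made up and stand in for List.xlsx, with prices chosen to be distinct so the result does not depend on how ties are ordered:

```python
import pandas as pd

# Hypothetical sample data in place of List.xlsx.
products = pd.DataFrame(
    {
        "Name": ["p1", "p2", "p3", "p4"],
        "Price": [9.5, 3.0, 7.25, 4.0],
        "Worthy": ["Yes", "No", "No", "Yes"],
    },
    index=pd.Index([1, 2, 3, 4], name="ID"),
)

# Single-column sort: by is a string; without inplace=True a new DataFrame is returned.
by_price = products.sort_values(by="Price", ascending=False)

# Multi-column sort: Worthy ascending first, then Price descending within each group.
multi = products.sort_values(by=["Worthy", "Price"], ascending=[True, False])

print(list(by_price.index))  # [1, 3, 4, 2]
print(list(multi.index))     # [3, 2, 1, 4]
```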

2.5, Filtering Excel data

The loc indexer of a DataFrame selects data by index labels and column names.

import pandas as pd
students = pd.read_excel("./Students.xlsx",index_col="ID")
def age18_to_30(age):
    # return age>=18 and age<30
    return 18<=age<30 # chained comparison, a shorthand supported by Python

def level_score(score):
    return 85<= score <=100
# loc takes a list-like selector, here a boolean Series built from one column; loc returns a new DataFrame
# apply is a method of Series objects
# students = students.loc[students['Age'].apply(age18_to_30)].loc[students['Score'].apply(level_score)]
# Second form of the line above, using pandas attribute access to the columns
# students = students.loc[students.Age.apply(age18_to_30)].loc[students.Score.apply(level_score)]
# Third form, with lambda expressions
students = students.loc[students['Age'].apply( lambda x: 18 <= x < 30)]\
    .loc[students['Score'].apply(lambda y: 85 <= y <= 100)]

print(students)
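
The same filter can also be written with vectorized comparisons instead of apply(). The following is a minimal sketch on made-up student rows, not the contents of Students.xlsx; it combines the two conditions with & and uses Series.between() for the inclusive score range.

```python
import pandas as pd

# Hypothetical sample data in place of Students.xlsx.
students = pd.DataFrame(
    {
        "Name": ["s1", "s2", "s3", "s4"],
        "Age": [17, 18, 29, 30],
        "Score": [90, 84, 100, 95],
    },
    index=pd.Index([1, 2, 3, 4], name="ID"),
)

# Boolean masks built column-wise; parentheses are required around each comparison.
age_ok = (students["Age"] >= 18) & (students["Age"] < 30)
score_ok = students["Score"].between(85, 100)  # both end points included by default
selected = students.loc[age_ok & score_ok]

# The apply()-based masks from the code above select the same rows.
mask_apply = students["Age"].apply(lambda x: 18 <= x < 30) & \
    students["Score"].apply(lambda y: 85 <= y <= 100)
via_apply = students.loc[mask_apply]

print(selected)
```

Combining the masks first and calling loc once avoids relying on how a second loc call aligns a mask built from the unfiltered DataFrame.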

 


Origin blog.csdn.net/u013089490/article/details/91422685