A collection of common pandas commands

1. Reading files
1. When reading csv or txt files:
sep is the field separator (it can be more than one character); the default is a comma. If the data does not split into columns correctly after reading, the separator is usually the problem; just swap in the correct one.
encoding is the file encoding; files sometimes use a non-default encoding, and setting it explicitly so the text displays without garbled characters prevents a lot of accidents.
engine selects the parsing engine; without this argument some files cannot be read correctly (the 'python' engine handles cases the default 'c' engine cannot, such as multi-character separators).

path1 = r'C:\Users\MAYAN\Downloads\人工对话详情8.1.csv'
data1 = pd.read_csv(path1,sep=',',encoding='utf-8',engine='python')
data1.head()

2. When reading excel files: to read a particular sheet of a workbook, just pass sheet_name set to the name of the sheet to be read.

path1 = r'C:\Users\MAYAN\Downloads\人工对话详情8.1.xlsx'
data1 = pd.read_excel(path1,sheet_name='Sheet1')
data1.head()
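
If several sheets are needed, sheet_name also accepts a list of names, and sheet_name=None reads every sheet at once and returns a dict of DataFrames keyed by sheet name:

# read all sheets at once; returns {sheet name: DataFrame}
all_sheets = pd.read_excel(path1, sheet_name=None)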

3. Merging multiple files in the same folder into one file
txt format:

import os

# path of the target folder
path = "F:/bj/新建文件夹/内容"
# list of the file names in that folder
filenames = os.listdir(path)
result = "result/merge.txt"
# make sure the output directory exists, then open (or create) result/merge.txt
os.makedirs("result", exist_ok=True)
file = open(result, 'w+', encoding="utf-8")

# iterate over the file names
for filename in filenames:
    filepath = os.path.join(path, filename)
    # iterate over each file line by line and append it to the output
    for line in open(filepath, encoding="utf-8"):
        file.write(line)
    file.write('\n')
# close the output file
file.close()

excel format:

import os
import pandas as pd

# read every file in the folder into a list of DataFrames

pwd = r'F:/bj/新建文件夹/内容'  # directory that holds the files

# list that collects the file paths
file_list = []

# list that collects one DataFrame per file (the files share the same structure)
dfs = []
for root, dirs, files in os.walk(pwd):  # root: current path, dirs: sub-folders, files: file names
    for file in files:
        file_path = os.path.join(root, file)  # os.path.join(root, name) gives the full path
        file_list.append(file_path)
        df = pd.read_excel(file_path)  # read the excel file into a DataFrame
        dfs.append(df)

# concatenate the DataFrames into one
df = pd.concat(dfs)

# write to an excel file, without the index
df.to_excel('test\\result.xls', index=False)
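
One caveat: the legacy .xls format is written with the xlwt package, which pandas 2.0 and later no longer support; writing .xlsx (handled by openpyxl) avoids the extra dependency:

df.to_excel('test\\result.xlsx', index=False)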

2. Removing duplicates
1. duplicated() tests whether each value is a duplicate: duplicated entries show True, non-duplicated entries show False.
2. drop_duplicates() deletes the duplicated values. With keep='first' (the default) the first of each set of duplicates is kept and the remaining rows are deleted; with keep=False all duplicates are deleted and none are kept. A concrete example:

import pandas as pd

s = pd.Series([1,1,1,1,2,2,2,3,4,5,5,5,5])
print(s.duplicated())
# duplicated() returns a boolean mask of the duplicates; selecting the False entries keeps the non-duplicates
print(s[s.duplicated()==False])
# or call drop_duplicates() directly to drop the duplicates and return the unique values
print(s.drop_duplicates())
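
The same methods work on a DataFrame; a minimal sketch (with made-up data) showing the keep and subset parameters:

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})
print(df.drop_duplicates())               # keep='first' by default: keeps the first of each duplicate pair
print(df.drop_duplicates(keep=False))     # drops every row that is duplicated
print(df.drop_duplicates(subset=['a']))   # only column 'a' is used to decide what counts as a duplicate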

3. unique() shows the unique values of a column.
4. nunique() counts the number of unique values in a column.
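
For example, with a small made-up Series:

import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c'])
print(s.unique())   # ['a' 'b' 'c']
print(s.nunique())  # 3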
3. Replacing values
1. df.replace(to_replace, value, inplace=False)

df.replace(['A','29.54'],['B',100])  # replace across the whole DataFrame
df['aa'].replace('花花','草草',inplace=True)  # replace within one column; set inplace=True or the change is not kept
df.replace('[A-Z]','变电站',regex=True,inplace=True)  # replace by regular expression
df['aa'].str.replace('ab','AB')  # replace part of a string

4. Filling missing values
fillna() replaces missing values.

df.fillna("0")  # replace missing values with 0, across the whole DataFrame

import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.random.randint(0,10,(5,5)))
df2.iloc[1:4,3] = np.nan; df2.iloc[2:4,4] = np.nan
df2.fillna(method='ffill')  # fill each gap with the previous value
df2.fillna(method='bfill',limit=2)  # fill with the next value, at most 2 consecutive NaNs per gap
df2.fillna(method="ffill",limit=1,axis=1)  # fill horizontally with the value to the left, at most 1 per gap

5. Filtering on multiple conditions

ask1 = ask[(ask['队列'].isin(duilie))&(ask['排队目标']!='针对会话')]
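
Each condition must be wrapped in parentheses and combined with & (and) or | (or). A minimal runnable sketch of the same pattern, with made-up data (the column names mirror the line above):

import pandas as pd

df = pd.DataFrame({'队列': ['a', 'b', 'c'], '排队目标': ['针对会话', '其他', '其他']})
queues = ['a', 'c']  # hypothetical list of queues to keep
subset = df[(df['队列'].isin(queues)) & (df['排队目标'] != '针对会话')]
print(subset)  # only row 'c' satisfies both conditions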

6. Checking whether the same user enters on consecutive days
shift(), used together with groupby(), moves a column up or down so that adjacent rows can be compared.
The argument to shift() controls the direction: a positive number shifts the column down (each row sees the previous row's value), a negative number shifts it up. The vacated positions are filled with NaN, and the field passed to groupby() controls the grouping within which the shift happens.

# shift down
df['value_shift'] = df.groupby('name')['value'].shift(1)
df
	name	value	value_shift
0	a	1	NaN
1	a	2	1.0
2	a	3	2.0
3	b	4	NaN
4	b	5	4.0
5	c	6	NaN

# shift up
df['value_shift_1'] = df.groupby('name')['value'].shift(-1)
df
	name	value	value_shift	value_shift_1
0	a	1	NaN	2.0
1	a	2	1.0	3.0
2	a	3	2.0	NaN
3	b	4	NaN	5.0
4	b	5	4.0	NaN
5	c	6	NaN	NaN
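
To actually test for consecutive days, the same pattern can be applied to a date column: shift the dates within each user group and compare the difference to one day. A minimal sketch, assuming a hypothetical log with name and date columns:

import pandas as pd

log = pd.DataFrame({
    'name': ['a', 'a', 'a', 'b'],
    'date': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-05', '2020-01-01']),
})
# difference between each visit and the same user's previous visit
log['gap'] = log['date'] - log.groupby('name')['date'].shift(1)
log['consecutive'] = log['gap'] == pd.Timedelta(days=1)
print(log)  # True where the user also appeared on the previous day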

7. Pivot tables

pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')

# index -- the index; it stays fixed on the left and can be a multi-level index
pd.pivot_table(df,index=[u'对手',u'主客场'])
# values -- the values to aggregate
pd.pivot_table(df,index=[u'主客场',u'胜负'],values=[u'得分',u'助攻',u'篮板'])
# aggfunc aggregates the data; the default is aggfunc='mean'
# e.g. aggfunc=[np.sum,np.mean]
# columns, like index, sets a column-level grouping; it is not required, just an optional way to split the data

# fill_value fills missing values; margins=True adds the totals row/column
pd.pivot_table(df,index=[u'主客场'],columns=[u'对手'],values=[u'得分'],aggfunc=[np.sum],fill_value=0,margins=True)
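
A self-contained sketch with made-up match data (the column names mirror the ones above: 对手 = opponent, 主客场 = home/away, 得分 = points):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    '对手':  ['勇士', '勇士', '国王', '国王'],
    '主客场': ['主', '客', '主', '客'],
    '得分':  [28, 25, 30, 22],
})
print(pd.pivot_table(df, index=['主客场'], columns=['对手'],
                     values=['得分'], aggfunc=[np.sum], fill_value=0, margins=True))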

8. Time conversion and calculation
1. Converting a string to a date

 pd.to_datetime('2020-01-01')

2. Converting a timestamp to a date

df['gtime']=pd.to_datetime(df['gtime'],unit='s')

The optional origin parameter is worth knowing:
origin: scalar, default 'unix'
Defines the reference date. Values are interpreted as a number of units (set by unit) since this reference date.
If 'unix' (or POSIX) time, the origin is set to 1970-01-01.
If 'julian', the unit must be 'D' and the origin is set to the beginning of the Julian Calendar: Julian day number 0 begins at noon on January 1, 4713 BC.
If a convertible timestamp is given, the origin is set to that timestamp.
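
For example, interpreting integers as a number of days since a custom origin:

import pandas as pd

# interpret 1, 2, 3 as days since 1960-01-01
print(pd.to_datetime([1, 2, 3], unit='D', origin=pd.Timestamp('1960-01-01')))
# DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'], dtype='datetime64[ns]', freq=None)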

For details of time conversion, please refer to this article:
https://blog.csdn.net/XiaoMaEr66/article/details/104349986

9. Reshaping: stack() and melt()

df.stack()
df.set_index(['产品类型','产品名称']).stack().reset_index()
The effect is similar to a pivot table: after setting a multi-level index with set_index, stack() produces the pivoted shape.

df.melt()
pd.melt(df,id_vars=['产品类型','产品名称'],value_vars=['一月销售','二月销售','三月销售'],value_name='销售金额')
id_vars are the ordinary columns that are kept as-is and not unpivoted; value_vars are the columns to unpivot, and value_name names the resulting value column.
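
A minimal runnable sketch of melt() with made-up sales data matching the column names above:

import pandas as pd

df = pd.DataFrame({
    '产品类型': ['饮料', '零食'],
    '产品名称': ['可乐', '薯片'],
    '一月销售': [100, 80],
    '二月销售': [120, 90],
    '三月销售': [110, 95],
})
long_df = pd.melt(df, id_vars=['产品类型','产品名称'],
                  value_vars=['一月销售','二月销售','三月销售'],
                  value_name='销售金额')
print(long_df)  # one row per (product, month) pair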


Problems encountered in day-to-day use will be summarized and added here over time; everyone is welcome to discuss them together.


Original post: blog.csdn.net/XiaoMaEr66/article/details/108934139