After some sorting and organizing, this article shares 100 practical functions that I find are used most often. They fall roughly into six categories: statistical summary functions, data cleaning functions, data filtering functions, drawing and element-level functions, time series functions, and other functions.
1. Statistical summary functions
Data analysis routinely involves computing statistics and summaries of the data. Which functions can help with this kind of operation? See the following table for details.
import pandas as pd
import numpy as np
x = pd.Series(np.random.normal(2,3,1000))
y = 3*x + 10 + pd.Series(np.random.normal(1,2,1000))
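# Compute the correlation coefficient between x and y
print(x.corr(y))
# Compute the skewness of y
print(y.skew())
# Compute the descriptive statistics of y
print(y.describe())
z = pd.Series(['A','B','C']).sample(n = 1000, replace = True)
# Reset the row index of z so that it aligns with y
z.index = range(1000)
# Group by z and compute the within-group mean of y
print(y.groupby(by = z).mean())
# Count the frequency of each element in z
print(z.value_counts())
a = pd.Series([1,5,10,15,25,30])
# Compute the cumulative percentage of each element in a
print(a.cumsum() / a.cumsum().iloc[-1])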
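The random data above makes the output hard to check by eye. As a complement, here is a minimal sketch on a small hand-made series (the `scores` and `group` values are hypothetical, chosen only for illustration) that computes several group summaries in one call and relative frequencies with `value_counts(normalize=True)`.

```python
import pandas as pd

# Hypothetical toy data, small enough to verify by hand
scores = pd.Series([80, 90, 70, 60, 100, 80])
group = pd.Series(['A', 'B', 'A', 'B', 'A', 'B'])

# Several summaries per group in one call, using string names for the aggregations
summary = scores.groupby(group).agg(['mean', 'max', 'count'])
print(summary)

# Relative frequencies instead of raw counts
share = group.value_counts(normalize=True)
print(share)

# Cumulative percentage; .iloc[-1] picks the last value by position, not by label
cum_pct = scores.cumsum() / scores.cumsum().iloc[-1]
print(cum_pct)
```

Passing aggregation names as strings keeps the call short and avoids handing NumPy functions to `aggregate`.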
2. Data cleaning functions
Data cleaning is an equally essential task. Commonly used data cleaning functions are listed in the following table.
x = pd.Series([10,13,np.nan,17,28,19,33,np.nan,27])
# Check whether the series contains missing values
print(x.hasnans)
# Fill missing values with the mean
print(x.fillna(value = x.mean()))
# Forward-fill missing values
print(x.ffill())
income = pd.Series(['12500元','8000元','8500元','15000元','9000元'])
# Drop the trailing currency unit and convert income to integers
print(income.str[:-1].astype(int))
gender = pd.Series(['男','女','女','女','男','女'])
# Encode gender as integer codes (factorization)
print(gender.factorize())
house = pd.Series(['大宁金茂府 | 3室2厅 | 158.32平米 | 南 | 精装',
                   '昌里花园 | 2室2厅 | 104.73平米 | 南 | 精装',
                   '纺大小区 | 3室1厅 | 68.38平米 | 南 | 简装'])
# Extract the floor area of each second-hand house and convert it to float
print(house.str.split('|').str[2].str.strip().str[:-2].astype(float))
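The split-and-slice chain above depends on the area always being the third field. As an alternative sketch (not the original author's approach), `str.extract` with a regular expression can pull out the number directly, and `str.split(..., expand=True)` can turn every field into its own column. The column names below are hypothetical labels added for readability.

```python
import pandas as pd

house = pd.Series(['大宁金茂府 | 3室2厅 | 158.32平米 | 南 | 精装',
                   '昌里花园 | 2室2厅 | 104.73平米 | 南 | 精装',
                   '纺大小区 | 3室1厅 | 68.38平米 | 南 | 简装'])

# Capture the number that appears directly before "平米" (square meters)
area = house.str.extract(r'([\d.]+)平米', expand=False).astype(float)
print(area)

# Split once into columns, then strip the surrounding spaces in every column
fields = house.str.split('|', expand=True).apply(lambda col: col.str.strip())
fields.columns = ['name', 'layout', 'area', 'facing', 'decoration']  # hypothetical names
print(fields)
```

The regular expression keeps working if the fields are reordered, at the cost of being slightly harder to read.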
3. Data filtering
When you need to take a subset of the values in a variable, the functions in the table below are handy. Some of them work on a Series, and they can largely be used on DataFrame objects as well.
np.random.seed(1234)
x = pd.Series(np.random.randint(10,20,10))
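# Select the elements greater than 16
print(x.loc[x > 16])
# Series.compress was removed in pandas 1.0; boolean indexing is the equivalent
print(x[x > 16])
# Select the elements between 13 and 16 (both ends inclusive)
print(x[x.between(13,16)])
# Take the three largest elements
print(x.nlargest(3))
y = pd.Series(['ID:1 name:张三 age:24 income:13500',
               'ID:2 name:李四 age:27 income:25000',
               'ID:3 name:王二 age:21 income:8000'])
# Extract the age and convert it to an integer (raw string avoids an invalid escape)
print(y.str.findall(r'age:(\d+)').str[0].astype(int))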
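A few other filtering idioms are worth seeing next to the ones above. The following is a small sketch on made-up values (the series `x` here is hypothetical, not the random one generated earlier) showing `isin`, combined boolean conditions, `where`, and `nsmallest`.

```python
import pandas as pd

x = pd.Series([12, 17, 15, 19, 13, 16])

# Keep only the elements that belong to a given set of values
print(x[x.isin([13, 16, 19])])

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
print(x[(x > 12) & (x < 17)])

# where() keeps the original length and puts NaN where the condition fails
print(x.where(x > 16))

# The counterpart of nlargest
print(x.nsmallest(2))
```

Unlike boolean indexing, `where` preserves the shape of the input, which helps when the result must stay aligned with other columns.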
4. Drawing and element-level functions
np.random.seed(123)
import matplotlib.pyplot as plt
x = pd.Series(np.random.normal(10,3,1000))
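# Draw a histogram of x
x.hist()
# Show the figure
plt.show()
# Draw a box plot of x
x.plot(kind='box')
plt.show()
installs = pd.Series(['1280万','6.7亿','2488万','1892万','9877','9877万','1.2亿'])
# Convert every install count to units of 万 (ten thousand)
def transform(x):
    if x.find('亿') != -1:
        # 1 亿 (hundred million) = 10000 万
        res = float(x[:-1])*10000
    elif x.find('万') != -1:
        res = float(x[:-1])
    else:
        # No suffix: a plain count, so divide by ten thousand
        res = float(x)/10000
    return res
print(installs.apply(transform))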
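The element-level `apply` above calls a Python function once per value. As an alternative sketch (a swapped-in technique, not the original author's method), the same conversion can be done with vectorized string methods: `str.extract` separates the number from the unit suffix, and `map` turns the suffix into a multiplier. The names `parts`, `number`, `factor`, and `in_wan` are introduced here for illustration.

```python
import pandas as pd

installs = pd.Series(['1280万', '6.7亿', '2488万', '1892万', '9877', '9877万', '1.2亿'])

# Split each value into a numeric part and an optional unit suffix
parts = installs.str.extract(r'^([\d.]+)(亿|万)?$')
number = parts[0].astype(float)

# Multiplier that converts each unit to 万; a missing suffix means a plain count
factor = parts[1].map({'亿': 10000, '万': 1}).fillna(1 / 10000)

in_wan = number * factor
print(in_wan)
```

Adding another unit only requires a new dictionary entry and a new alternative in the regular expression, rather than another `elif` branch.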
5. Time series functions
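No example survives for this category, so the following is only an illustrative sketch of a few common pandas time series operations, not a reconstruction of the original list. The `dates` and `sales` values are hypothetical.

```python
import pandas as pd

# Hypothetical daily sales data, for illustration only
dates = pd.Series(['2021-01-30', '2021-01-31', '2021-02-01', '2021-02-02'])
sales = pd.Series([10, 20, 30, 40])

# Parse the strings into datetime values
d = pd.to_datetime(dates)

# The .dt accessor exposes date components
print(d.dt.year.tolist())
print(d.dt.month.tolist())
print(d.dt.day_name().tolist())

# Gap between consecutive dates, in days (the first entry has no predecessor)
print(d.diff().dt.days.tolist())

# Use the dates as the index and total the sales per calendar month
sales.index = d
monthly = sales.groupby(sales.index.month).sum()
print(monthly)
```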
6. Other functions
import numpy as np
import pandas as pd
np.random.seed(112)
x = pd.Series(np.random.randint(8,18,6))
print(x)
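# Compute the first-order difference of the elements in x
print(x.diff())
# Sort the elements of x in descending order
print(x.sort_values(ascending = False))
y = pd.Series(np.random.randint(8,16,100))
# Remove duplicates from y and convert the result to a list
print(y.unique().tolist())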
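To make the behavior of these functions easier to see, here is a short sketch on fixed values (hypothetical data, unlike the random series above). It shows that `diff` matches a subtraction against `shift(1)`, that `sort_values` keeps the original index labels, and that `unique` returns values in order of first appearance rather than sorted.

```python
import pandas as pd

x = pd.Series([9, 12, 8, 15])

# diff() is the same as subtracting the series shifted down by one position
print(x.diff().equals(x - x.shift(1)))

# sort_values() keeps the original index labels; reset_index(drop=True) renumbers them
x_sorted = x.sort_values(ascending=False).reset_index(drop=True)
print(x_sorted)

y = pd.Series([3, 1, 3, 2, 1])
# unique() returns the values in order of first appearance, not sorted
print(y.unique().tolist())
# drop_duplicates() does the same job but returns a Series with the original index
print(y.drop_duplicates())
```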