Chapter 2: numpy application

(1) Generating data with numpy

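1. Creating arrays
np.array
np.arange
2. reshape  # change the dimensions
t1.reshape(n, m)  # reshape to n rows and m columns
t1.reshape(n, )  # reshape to one dimension
t1.reshape(n, 1) or t1.reshape(1, n)  # both results are two-dimensional
n = t5.shape[0]*t5.shape[1]  # total number of elements of a 2-D array, e.g. for t5.reshape(n, )
t5.flatten()  # flatten to one dimension
3. Arithmetic
Arithmetic on arrays is applied element by element; pay attention to the broadcasting rule (broadcast).
Broadcasting compares the two shapes starting from the last dimension: each pair of dimensions must either be equal or contain a 1.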
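
The notes above can be tried out on a small array. The following is a minimal sketch, not part of the original notes; the array values and the variable names (a, m, col, row_shift, and so on) are made up for illustration.

```python
import numpy as np

# Create arrays from a list and from a range.
a = np.array([1, 2, 3, 4, 5, 6])
b = np.arange(6)            # integers 0..5

# reshape returns the same data in a new shape; the total size must match.
m = a.reshape(2, 3)         # 2 rows, 3 columns
col = a.reshape(6, 1)       # 2-D column, shape (6, 1)
flat = m.flatten()          # 1-D copy, shape (6,)

# Broadcasting compares shapes from the last dimension backwards.
# (2, 3) with (3,): the 1-D array is applied to every row.
row_shift = m + np.array([10, 20, 30])
# (2, 3) with (2, 1): the column is stretched across the 3 columns.
col_shift = m + np.array([[100], [200]])

# (2, 3) with (2,) fails because the last dimensions, 3 and 2, differ.
try:
    m + np.array([1, 2])
    broadcast_failed = False
except ValueError:
    broadcast_failed = True
```

Note that a shape such as (2,) does not line up with the rows of a (2, 3) array; it has to be reshaped to (2, 1) first.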


(2) Reading files and indexing data with numpy

import numpy as np

us_file_p1 = r'C:\Users\dell\Desktop\Python学习\14100_HM数据科学库课件\day03\code\youtube_video_data\GB_video_data_numbers.csv'
us_file_p2 = r'C:\Users\dell\Desktop\Python学习\14100_HM数据科学库课件\day03\code\youtube_video_data\GB_video_data_numbers.csv'

t1 = np.loadtxt(us_file_p1, delimiter=",", dtype='int')
t2 = np.loadtxt(us_file_p2, delimiter=",", dtype='int')
print(t1)
print('*'*100)
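# print(t2)
print(t1[[2, 8, 10]])  # select several non-contiguous rows: the inner list is an index array applied to the row axis
print(t1[:, 0:2])  # select the first two columns (a contiguous slice)
print(t1[:, 2:])  # all columns from the third one onward
print(t1[2, 3])  # value at row 3, column 4: a single number of type numpy.int32/int64
# rows 3 to 5, columns 2 to 4
b = t1[2:5, 1:4]
print(b)
# select several non-adjacent points
c = t1[[0, 2], [0, 1]]  # picks (0, 0) and (2, 1): the row and column indices are paired
t1[t1 < 10] = 3  # set every value below 10 to 3 by direct assignment
np.where(t1 <= 3, 100, 300)  # returns a new array: 100 where the condition holds, 300 elsewhere
# To set a value to NaN, first convert the array from int to float, because NaN is a float
t1 = t1.astype(float)
t1[3, 3] = np.nan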
# Replace NaN values: see fill_ndarry in section (4)
t1.sum(axis=0)  # column sums; a column that still contains NaN sums to NaN
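
Because the CSV file above is not available to every reader, the same indexing operations can be tried on a small in-memory table. This is a sketch under assumptions: the CSV text and all variable names are hypothetical and only stand in for the real data.

```python
import io
import numpy as np

# Hypothetical CSV content standing in for the real data file.
csv_text = "1,20,3\n40,5,60\n7,80,9\n100,11,12\n"
t = np.loadtxt(io.StringIO(csv_text), delimiter=",", dtype=int)

rows = t[[0, 2]]            # rows 0 and 2, shape (2, 3)
cols = t[:, 0:2]            # first two columns, shape (4, 2)
points = t[[0, 2], [0, 1]]  # elements (0, 0) and (2, 1)

# np.where returns a new array; t itself is unchanged.
flags = np.where(t < 10, 0, 1)

# Boolean-mask assignment modifies the array it is applied to.
clipped = t.copy()
clipped[clipped < 10] = 0

# NaN needs a float dtype; an int array cannot hold it.
f = t.astype(float)
f[3, 0] = np.nan
col_sums = f.sum(axis=0)          # NaN propagates into column 0
safe_sums = np.nansum(f, axis=0)  # NaN is treated as zero
```

np.nansum is one way to get a sum while NaN values are still present; filling them with a column mean is covered in section (4).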

(3) Concatenating data

import numpy as np

# Data files for each country
us_data = r'C:\Users\dell\Desktop\Python学习\14100_HM数据科学库课件\day03\code\youtube_video_data\GB_video_data_numbers.csv'
uk_data = r'C:\Users\dell\Desktop\Python学习\14100_HM数据科学库课件\day03\code\youtube_video_data\GB_video_data_numbers.csv'


# Load each country's data
us_data = np.loadtxt(us_data, delimiter=",", dtype='int')
uk_data = np.loadtxt(uk_data, delimiter=",", dtype='int')
# Build a column of zeros and a column of ones to label the country
zeros_data = np.zeros((us_data.shape[0], 1)).astype(int)
ones_data = np.ones((uk_data.shape[0], 1)).astype(int)
# Append the all-0 column to one table and the all-1 column to the other
us_data = np.hstack((us_data, zeros_data))
uk_data = np.hstack((uk_data, ones_data))
# Stack the two countries' data vertically
final_data = np.vstack((us_data, uk_data))
print(final_data)
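
The labelling-and-stacking pattern above can be checked on two tiny arrays. A minimal sketch, assuming made-up values; us, uk and the other names are hypothetical stand-ins for the loaded tables.

```python
import numpy as np

# Hypothetical stand-ins for the two loaded tables (same number of columns).
us = np.array([[1, 2], [3, 4], [5, 6]])
uk = np.array([[7, 8], [9, 10]])

# Label column: 0 for US rows, 1 for UK rows.
us_label = np.zeros((us.shape[0], 1), dtype=int)
uk_label = np.ones((uk.shape[0], 1), dtype=int)

# hstack joins arrays side by side (row counts must match);
# vstack stacks them on top of each other (column counts must match).
us_labeled = np.hstack((us, us_label))
uk_labeled = np.hstack((uk, uk_label))
combined = np.vstack((us_labeled, uk_labeled))
```

Passing dtype=int directly to np.zeros and np.ones avoids the extra astype(int) conversion.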

(4) Replacing nan with the column mean

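def fill_ndarry(t1):
    # Fill the NaN values of each column with the mean of that column's other values (modifies t1 in place)
    for i in range(t1.shape[1]):
        temp_col = t1[:, i]  # a view of column i, so writes go through to t1
        nan_num = np.count_nonzero(temp_col != temp_col)  # NaN is the only value not equal to itself
        if nan_num != 0:  # non-zero means the current column contains NaN
            temp_not_nan_col = temp_col[temp_col == temp_col]  # array of the non-NaN values
            temp_col[np.isnan(temp_col)] = temp_not_nan_col.mean()  # assign the mean of the other values to the NaN positions
    return t1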
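
The loop above can also be written without an explicit loop. The following is an alternative sketch, not the function from these notes: it swaps in np.nanmean and np.where, and the name fill_nan_with_col_mean and the sample data are hypothetical.

```python
import numpy as np

def fill_nan_with_col_mean(arr):
    """Return a new float array with each NaN replaced by its column's mean."""
    arr = np.asarray(arr, dtype=float)
    col_means = np.nanmean(arr, axis=0)  # per-column mean, ignoring NaN
    # The (n_cols,) means broadcast across the rows; they are used only where arr is NaN.
    return np.where(np.isnan(arr), col_means, arr)

data = np.array([[1.0, 2.0, np.nan],
                 [4.0, np.nan, 6.0],
                 [7.0, 8.0, 12.0]])
filled = fill_nan_with_col_mean(data)
```

Unlike fill_ndarry, this version leaves its input untouched. A column consisting only of NaN would stay NaN, since it has no mean to fill with.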

(5) Reading a file and plotting

import numpy as np
from matplotlib import pyplot as plt
from matplotlib import font_manager


plt.figure(figsize=(20, 8), dpi=80)  # set the figure size
us_file_p1 = r'C:\Users\dell\Desktop\Python学习\14100_HM数据科学库课件\day03\code\youtube_video_data\GB_video_data_numbers.csv'
us_file_p2 = r'C:\Users\dell\Desktop\Python学习\14100_HM数据科学库课件\day03\code\youtube_video_data\GB_video_data_numbers.csv'

t_us = np.loadtxt(us_file_p1, delimiter=",", dtype='int')
t_us_comments = t_us[:, -1]
t_us_comments = t_us_comments[t_us_comments <= 5000]

d = 250
bin_nums = (t_us_comments.max() - t_us_comments.min())//d
plt.hist(t_us_comments, bin_nums, density=True)  # density=True normalizes the histogram so that its total area is 1
plt.grid(True, linestyle="-.", alpha=0.5)
plt.show()
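
With an integer bin count, the bins are exactly d wide only when the data range is a multiple of d. One way around this is to pass explicit bin edges. The sketch below uses np.histogram so that it runs without a plotting window; the comment counts are made up, and the same edges array could be passed to plt.hist as its bins argument.

```python
import numpy as np

# Hypothetical comment counts standing in for the last CSV column.
comments = np.array([0, 10, 240, 250, 260, 499, 500, 900, 1000])

d = 250  # bin width
# Edges at steps of d, starting at the minimum and reaching at least the maximum.
edges = np.arange(comments.min(), comments.max() + d, d)

# Raw counts per bin, and the density-normalized version.
counts, _ = np.histogram(comments, bins=edges)
density, _ = np.histogram(comments, bins=edges, density=True)

# With density=True the bar areas (height * width) add up to 1.
total_area = (density * np.diff(edges)).sum()
```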

(6) Commonly used numpy statistical functions

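Sum: t.sum(axis=None)
Mean: t.mean(axis=None)  strongly affected by outliers
Median: np.median(t, axis=None)
Maximum: t.max(axis=None)
Minimum: t.min(axis=None)
Range: np.ptp(t, axis=None)  the difference between the maximum and the minimum
Standard deviation: t.std(axis=None)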
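
The functions listed above can be compared on a small array. A minimal sketch with made-up values; the variable names are for illustration only.

```python
import numpy as np

t = np.array([[1, 2, 3],
              [4, 5, 60]])  # 60 acts as an outlier in the second row

total = t.sum()                     # sum of all elements
col_sum = t.sum(axis=0)             # one sum per column
row_mean = t.mean(axis=1)           # one mean per row
row_median = np.median(t, axis=1)   # one median per row
largest = t.max()
smallest = t.min()
value_range = np.ptp(t)             # max - min over the whole array
col_range = np.ptp(t, axis=0)       # max - min per column
row_std = t.std(axis=1)             # population standard deviation (ddof=0)
```

In the second row the mean is 23 while the median is 5, which shows how strongly a single outlier pulls the mean.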
