Commonly used Python libraries for data analysis and visualization (in progress, being updated)

1. Connect to the database

pymysql

import pymysql

# Open a database connection (placeholder credentials)
host = '111.235.32'
user = 'user'
passwd = '123'
db_name = 'admin'
db = pymysql.connect(host=host, user=user, passwd=passwd,
                     db=db_name, charset="utf8", port=3306)

# Use the cursor() method to create a cursor object
cursor = db.cursor()

# Use the execute() method to run a SQL statement
sql = """CREATE TABLE EMPLOYEE (
         FIRST_NAME  CHAR(20) NOT NULL,
         LAST_NAME  CHAR(20),
         AGE INT,
         SEX CHAR(1),
         INCOME FLOAT )"""
cursor.execute(sql)

# fetchone(): fetch a single row of the result.
# fetchall(): fetch all remaining result rows.
# rowcount: read-only attribute, the number of rows affected by the last execute().
cursor.execute("SELECT VERSION()")
data = cursor.fetchone()
print("Database version : %s " % data)

# Example of using these together
try:
    # Execute the SQL statement
    cursor.execute(sql)
    # Commit to the database (needed after INSERT/UPDATE/DELETE;
    # without commit() the changes are not saved)
    db.commit()
    # Fetch all rows as a list of records
    results = cursor.fetchall()
    for row in results:
        fname = row[0]
        lname = row[1]
        age = row[2]
        sex = row[3]
        income = row[4]
        # Print the result
        print("fname=%s,lname=%s,age=%s,sex=%s,income=%s" %
              (fname, lname, age, sex, income))
except Exception:
    # Roll back on error
    db.rollback()
    print("Error: unable to fetch data")
# Close the cursor
cursor.close()
# Close the database connection
db.close()
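pymysql follows the Python DB-API 2.0 interface, so the same connect/cursor/execute/commit pattern can be tried without a MySQL server. A runnable sketch using the standard-library sqlite3 module as a stand-in (the table and row values are made up for illustration):

```python
import sqlite3

# An in-memory SQLite database stands in for the MySQL server
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

cursor.execute("""CREATE TABLE EMPLOYEE (
    FIRST_NAME TEXT NOT NULL,
    LAST_NAME  TEXT,
    AGE        INTEGER,
    SEX        TEXT,
    INCOME     REAL)""")

cursor.execute("INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?, ?)",
               ("Mac", "Mohan", 20, "M", 2000.0))
conn.commit()  # persist the insert, same as with pymysql

cursor.execute("SELECT * FROM EMPLOYEE")
rows = cursor.fetchall()
print(rows)  # [('Mac', 'Mohan', 20, 'M', 2000.0)]

cursor.close()
conn.close()
```

Because both drivers implement the same DB-API, code written this way ports to pymysql by swapping the connect call (and `?` placeholders for `%s`).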

If you only need to query data, you can also do it like this:

import pandas as pd

# Define a helper function
def read_data_sql(host, user, passwd, db, sql_query):
    conn = pymysql.connect(host=host, user=user, passwd=passwd,
                           db=db, charset="utf8", port=3306)
    df = pd.read_sql(sql_query, con=conn)
    return df

# Connection details and query (placeholder credentials)
host = '111.235.32'
user = 'user'
passwd = '123'
db = 'admin'
sql_query = """
    select id,email from cs where st='1'
"""
result = read_data_sql(host, user, passwd, db, sql_query)
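pandas.read_sql accepts any DB-API connection, so the pattern above can be exercised without MySQL. A sketch using an in-memory sqlite3 database in its place (table name and rows are invented for illustration):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cs (id INTEGER, email TEXT, st TEXT)")
conn.execute("INSERT INTO cs VALUES (1, 'a@example.com', '1')")
conn.execute("INSERT INTO cs VALUES (2, 'b@example.com', '0')")
conn.commit()

# Same query shape as in the article: only rows with st='1' come back
df = pd.read_sql("select id,email from cs where st='1'", con=conn)
print(df)
conn.close()
```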

2. Data processing

json & simplejson

Some posts claim that simplejson is faster; this article does not benchmark that claim and only lists commonly used functions.

The difference between dump, dumps, load, and loads:
the variants with an s operate on strings, while the ones without an s operate on file objects.

import json

# json.loads(): convert a JSON string to a Python object, e.g. <class 'dict'>
json.loads(json_str)
json_list = json.loads(json_str, strict=False)
# Note: the encoding parameter of json.loads() was deprecated and removed
# in Python 3.9; decode bytes yourself before calling loads().

# json.load(): read JSON from a file object and convert it to a Python object
# (it takes a file object, not a filename)
python_object = json.load(open('json.json'))

# json.dumps(): convert a dict to a JSON string, output is <class 'str'>
json.dumps(dct)
# json.dump(): serialize a dict as JSON and write it to a file object
json.dump(name_emb, open(emb_filename, "w"))

import simplejson
simplejson_list = simplejson.loads(simplejson_str, encoding='utf-8', strict=False)

# dumps means "dump": serialize from Python out to another format, i.e. dict -> json (str)
# loads means "load": parse another format into a Python object, i.e. json (str) -> dict

When converting a JSON string with loads, the string sometimes does not fully conform to the JSON format, which makes loads raise an error. In that case you can pass strict=False, which tells loads() not to check the format strictly (for example, it then allows raw control characters inside string values).
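A minimal runnable example of this behavior: a raw (unescaped) newline inside a JSON string value is rejected by the default parser but accepted with strict=False.

```python
import json

# A raw newline inside a string value is invalid strict JSON
s = '{"text": "line1\nline2"}'

try:
    json.loads(s)
    strict_ok = True
except json.JSONDecodeError:
    strict_ok = False

relaxed = json.loads(s, strict=False)
print(strict_ok)        # False: the default parser rejects the control character
print(relaxed["text"])  # the non-strict parser keeps the newline
```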

Normalization - sklearn.preprocessing

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

# Encode a categorical column as integers
le = LabelEncoder().fit_transform(df['register_year'])
df['year_encode'] = le

df1 = df.drop(columns=['u', 'r', 'y'])
scaled_df = StandardScaler().fit_transform(df1)  # standardize

features = ['t', 'r', 'u', 'f']  # must match the original columns one-to-one!

df2 = pd.DataFrame(scaled_df, columns=features)

Other scalers:
scale  # standardize the given data directly
StandardScaler  # stores the training set's mean and variance, so the same transform can later be applied to the test set
MinMaxScaler  # scales to a given min and max (usually 0-1); improves stability for features with very small variance and preserves zero entries in sparse matrices
Normalizer  # computes each sample's p-norm and divides every element by it, so each sample's p-norm (l1-norm, l2-norm) equals 1; useful if you later compute similarity between samples with quadratic forms
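What StandardScaler computes can be checked by hand: subtract the column mean and divide by the population standard deviation. A pure-Python sketch on a made-up column:

```python
# A made-up feature column
data = [2.0, 4.0, 6.0, 8.0]

mean = sum(data) / len(data)                          # 5.0
var = sum((x - mean) ** 2 for x in data) / len(data)  # population variance, 5.0
std = var ** 0.5

# z-score standardization, the same formula StandardScaler applies per column
scaled = [(x - mean) / std for x in data]
print(scaled)  # result has mean ~0 and standard deviation ~1
```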

datetime

Only the most easily confused functions, or ones the author has used recently, are listed here; see the official documentation for more details.

# Converting between string and datetime
from datetime import datetime  # import the datetime class from the datetime module

# Convert a str to a datetime (the format string is required)
datetime.strptime('2017-8-1 18:20:20', '%Y-%m-%d %H:%M:%S')
datetime.strptime('2017-8-1', '%Y-%m-%d')

# Format a datetime object as a string
now = datetime.now()
print(now.strftime('%a, %b %d %H:%M'))

# Adding and subtracting time
from datetime import datetime, timedelta
now = datetime.now()
now + timedelta(hours=10)
now - timedelta(days=1)
now + timedelta(days=2, hours=12)
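Putting strptime, timedelta, and strftime together in one runnable round trip:

```python
from datetime import datetime, timedelta

# Parse a string, shift it by 2 days and 12 hours, and format it back
dt = datetime.strptime('2017-8-1 18:20:20', '%Y-%m-%d %H:%M:%S')
later = dt + timedelta(days=2, hours=12)

result = later.strftime('%Y-%m-%d %H:%M:%S')
print(result)  # 2017-08-04 06:20:20
```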

3. Statistical Analysis

statsmodels

The regression routines in statsmodels are matrix-based, and in most cases you need to add a constant (intercept) column to the dataset yourself.

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.regression.linear_model import OLS, WLS
from statsmodels.tools.tools import add_constant  # also available as sm.add_constant

# sm.add_constant(X) prepends an intercept column to the design matrix X

dat = sm.datasets.get_rdataset("Guerry", "HistData").data
# Fit a regression model (using the natural log of one of the regressors);
# the formula API adds the intercept automatically
results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)', data=dat).fit()
# Inspect the results
print(results.summary())
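What the constant column buys you can be shown by hand: without an intercept, OLS forces the fitted line through the origin. A pure-Python sketch fitting y = a + b*x on made-up data via the closed-form normal equations:

```python
# Made-up data lying exactly on y = 1 + 2x
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

n = len(x)
sx, sy = sum(x), sum(y)
sxx = sum(xi * xi for xi in x)
sxy = sum(xi * yi for xi, yi in zip(x, y))

# Closed-form OLS solution for slope b and intercept a
b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
a = (sy - b * sx) / n
print(a, b)  # 1.0 2.0
```

These are the same coefficients statsmodels would report for this data; add_constant (or the formula API's implicit intercept) is what makes the `a` term part of the model.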

4. Visualization

seaborn

import seaborn as sns
sns.set()  # apply seaborn's default theme, sizes, and color palette
sns.set_style('whitegrid')  # background style: darkgrid, whitegrid, dark, white, ticks
sns.despine()  # remove the top and right spines, keeping only the X and Y axes

sns.countplot()  # show the count of each category as bars

sns.relplot(data=user.average_stars, kind='line')
# kind can be 'scatter' or 'line'

sns.distplot()
'''
The bars are a histogram (hist=True by default); the curve is a kernel
density estimate (kde=True by default).
bins: controls how the histogram intervals are divided
rug: whether to draw a small tick for each observation
fit: fits a parametric distribution and draws it (black curve) so you can judge
     how well it matches the observed data, e.g. fit=norm for a normal distribution
kde_kws={"label": "KDE"}
vertical=True
'''
# Heatmap: visualize numbers you already have
sns.heatmap(data=corr_u17, annot=True, square=True, cmap="Blues", mask=corr_u17 < 0.45)
# robust: reduce the influence of extreme values; square: force square cells;
# annot: write the number in each cell

matplotlib

import matplotlib.pyplot as plt

# Parameter settings
plt.rc("font", family="SimHei", size="12")  # fixes Chinese characters not displaying
plt.rcParams['figure.figsize'] = (8.0, 4.0)  # figure size in inches
plt.rcParams['figure.dpi'] = 300  # resolution
# Defaults: figsize [6.0, 4.0] at dpi 100 gives a 600x400-pixel image
# dpi=200 gives 1200x800 pixels
# dpi=300 gives 1800x1200 pixels
# Changing figsize changes the proportions without changing the resolution
plt.rcParams['image.cmap'] = 'gray'  # set the color style

# Plotting functions (use_index and kind are pandas DataFrame.plot() parameters)
df.plot(use_index=True)
# kind parameter: 'line', 'bar', 'barh', 'kde', 'hexbin', 'hist', 'box', 'area', 'pie', 'scatter'

In an IPython window:
 %matplotlib inline shows figures inline in the output
 %matplotlib qt5 shows figures in a separate window
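A minimal runnable sketch of the rcParams above; the Agg backend is forced so it also works headless (without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, renders straight to file
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (8.0, 4.0)
plt.rcParams['figure.dpi'] = 100

fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [1, 4, 9, 16])
fig.savefig("example.png")  # 8x4 inches at dpi 100 -> an 800x400-pixel image
plt.close(fig)
```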

Origin blog.csdn.net/weixin_43545069/article/details/103477689