X x wind courses from personal notes to organize your class talking about bad teachers seem to figure out the code did not, say some fur, mainly to learn some new ideas and expression
First, the basics of supplementary
works 1.try is this: When after the start of a try statement, python will be marked in the context of the current program, so that when an exception occurs can be back here, try clause is executed first, followed by What will happen depends on whether or not an exception occurs during execution.
If reading a file, hope that in the case of whether or not an abnormal occurrence are closing the file, how to do it? This can be accomplished using the finally block. Note that, in a try block, except clause may be used simultaneously and finally block. If you want to use them, then a need to embed another.
The main module is glob 2.glob method, which returns a list of all files matching path (List); the method used to develop requires a path string matching parameters (absolute path string may be a relative path may be),
it returns the file name includes only the current directory in the file name, not including subfolders inside.
https://blog.csdn.net/qq_40196164/article/details/83067846
3.sys os and set system parameters https://blog.csdn.net/zengxiantao1994/article/details/58188527 '' '
4.Python lower () method converts the string in all uppercase characters to lowercase. Syntax lower () method syntax: str.lower () Parameters None
The collection (set) of another form of a data structure of the python. Action: automatically remove duplicate element set type data (set), and sort the elements. Type of ordered set of elements is not repeated disorder.
6.python zip
https://www.cnblogs.com/wdz1226/p/10181354.html
7.glob.glob
main method is glob glob module, the method returns a list of all matching file path (list)
This method requires a parameter is used to specify the path string match (absolute path string may be a relative path may be ), which returns the name of the file contains only the current directory in the file name, not including subfolders inside.
https://blog.csdn.net/qq_17753903/article/details/82180227
8.map function
https://www.runoob.com/python/python-func-map.html
9.Numpy knowledge points to add: np.vstack () & np.hstack ()
https://www.jianshu.com/p/2469e0e2a1cf
10. The popular and appreciated OvO OVR
https://blog.csdn.net/alw_123/article/details/98869193
11.isinstance
https://www.runoob.com/python/python-func-isinstance.html
12.SVM is an acronym for Support Vector Machines (SVM), and can be used for classification and regression. SVC is a Type SVM, is used to do the classification, SVR is another Type SVM, it is used to do regression.
https://blog.csdn.net/u012331016/article/details/45223135
13.Python Dictionary (Dictionary) get () method
https://www.runoob.com/python/att-dictionary-get.html
14. In doing model training time, especially to do cross-validation on the training set, typically you want to save the model down, then put on an independent test set to test, here is the preservation and re-use in Python training model .
scikit-learn model has been persistent operation, can be introduced joblib
https://blog.csdn.net/helloxiaozhe/article/details/80658438
Second, the code part
'''
代码热身
没有的话需要安装
pip install pydub
pip install python_speech_features
'''
from pydub.audio_segment import AudioSegment #切割
from scipy.io import wavfile #mp3是压缩过后的音乐 失掉很多特征
from python_speech_features.base import mfcc
import pandas as pd
import numpy as np
song=AudioSegment.from_file("./cccc/abc.MP3",format="mp3")
song.export("./cccc/abc.wav",format="wav")
rate,data=wavefile.read("./cccc/abc.wav")
#print(data)
#print(rate)
#mfcc包含了傅里叶变换 和 梅尔倒谱系数
mf_feat=mfcc(data,rate,numcep=13,nfft=2048)
#13是维度 nfft是傅里叶转化时候的速率
#print(mf_feat)#打印出来是一个13维的向量
mm=np.mean(mf_feat,axis=0) #降维处理1*13
mf=np.transpose(mm)
mc.cov(mf)#我不需要行和行,要的是列之间的 关系 转置之后看
result=mm
for k in range(len(mm)):
result=np.append(result,np.diag(mc,k))
print(result)
features
#feature 老师用的python3 因为起名可以用中文...
import pandas as pd
import numpy as np
import glob
from pydub.audio_segment import AudioSegment
from scipy.io import wavfile
from python_speech_features.base import mfcc
import os
import sys
import time
#def 获取歌单(): 我看着难受...
def getMusicMenu():
data=pd.read_csv(./ccc)
data=data[["name","tag"]] #标签使csv文件中已经打好标签了清新摇滚。。
return data
def getMusicFeatures(file):
items=file.split(",")
file.format=items[-1].lower()
file_name=file[:-len(file.format)+1)]
if file_format!="wav":
#把mp3格式的文件转化为wav,保存至原文件夹
song=AudioSegment.from_file(file,format="mp3")
file=file_name+".wav" #我怎么记的这个不合适
song.export(file,format="wav")
#提取wav格式歌曲特征
try:
rate,data=wavefile.read(file)
mfcc_feas=mfcc(data,rate,numcep=13,nfft=2048)#卷积的算法降维
mm=np.transpose(mffc_feas)
mc=np.cov(mm)
result=mc
for i in range(mm.shape[0]):
result=np.append(result,np.diag(mc,i))
return result
except Exception as msg: #为了报错不影响往下进行
print(msg)
def reatureExtraction():
df= getMusicMenu()
name_label_list=np.array(df).tolist()
name_label_dict=dict(map(lambda t:(t[0],t[1]),name_label_list))
labels=set(ame_label_dict.values())#不要忘了.values()
label_index_dict=dict(zip(labels,np.arrange(len(labels))))
all_music_files=glob.glob(歌曲路径)
all_music_files.sort()
loop_count=0
flag=True
all _mfcc=np.array([])
for file_name in all_music_files:
print("开始处理"+file_name.replace("\xa0",""))
#xa0 https://blog.csdn.net/clovejava/article/details/89511172
#因为是文件夹下 music\下面的
music_name=file_name.split("\\")[-1].split(".")[-2].split("-")[-1]
music_name=music_name.strip()
if music_name in name_label_dict:
label_index=label_index_dict[name_label_dict[music_name]]
ff=getMusicFeatures(file_name)
ff=np.append(ff,label_index)
if flag:
all_mfcc=ff
flag=Flase
else:
print("无法处理"+file_name.replace("\xa0","")+",找不到对应的lable")
print(looping----%d" % loop_count)
print(all_mfcc.shape:",end="")
print(all_mfcc.shape)
loop_count+=1
label_index_list=[]
for k in label_index_dict:
label_index_list.append([k,label_index_dict[k])
pd.DataFrame(label_index_list).to_csv(数值化标签路径,header=None,\
index=False,encoding="utf-8")
pd.DataFrame(all_mfcc).to_csv(歌曲特征文件存放路径,header=None,\
index=False,encoding="utf-8")
return all_mfcc
if __name__="main":
歌曲路径="./data/music_info.csv"
歌曲源路径"./data/music/*.mp3"
数值化标签路径="./data/music_index_label.csv"
歌曲特征文件存放路径="./data/music_features.csv"
start=time.time()
reatureExtraction()
end=time.time()
print("总耗时%.2f秒%(end-start))
```bash
#acc 是老师自己写的 预测值和真实值的差异
def get(res,tes):
n=len(res)
truth=(res==tes)
pre=0
for flag in truth:
if flag:
pre+=1
return (pre*100)/n
from sklearn import svm
from sklearn.utils import shuffle #打乱,再训练,洗牌
from sklearn.model_selection import GridSearchCV,train_test_split
#网格交叉验证是调参的,交叉验证是评估模型
from sklearn.externals import joblib
import pandas as pd
import numpy as np
import acc #自己写的
import sys
import time
#选取最优的核函数 rbf高斯核函数 linear 是独立的 poly有互相交叉相乘 半正定
def internal_cross_validation(X,Y):
parameters={
“kernel":("linear","rbf","poly"),
"c":[0.1,1] #松弛因子,泛化能力
"probability":[True,False],
“decision_function_shape":["ovo","ovr"]
}
clf=GridSearchCV(svm.SVC(random_state=0),param_grid=parameters,cv=5)
print(begining...)
clf.fit(X,Y)
print("best parameter= ",end="")
print(clf.best_params_)
print("best accuracy= ",end="")
print(clf.best_score_)
def cross_validation(music_csv_file_path=None,data_percentage=0.7):
if not music_csv_file_path:
music_csv_file_path= 歌曲特征文件存放路径
print("begining read data"+music_csv_file_path)
data=pd.read_csv(music_csv_file_path,sep=",",header=None,ending="utf-8")
sample_fact=0.7 #感觉他这个写错了
if isinstance(data_percetage,float) and 0<data_percentage<1:
sample_fact=data_percentage
data=data.sample(frac=sample_fact).T
X=data[:-1].T #忘了他不包含右边界了
Y=np.array(data[-1:])[0]
#print(X)
#print(Y)
internal_cross_validation(X,Y)
def poly_model(X,Y):
#进行魔性训练,并且计算训练集上预测值与label的准确性
clf=svm.SVC(kernel="poly",C=0.1,probability=True,decision_function_shape="ovo",random_state=0)
clf.fit(X,Y)
res=clf.predict(X)
restrain=acc.get(res,Y)
return clf,restrain
def trainModels(train_percentage=0.7,fold=1,music_csv_file_path=None,model_out_f=None)
if not music_csv_file_path:
music_csv_file_path=歌曲特征文件存放路径
data=pd.read_csv(music_csv_file_path,sep=",",header=None, encoding="utf-8")
max_train_score=None
max_test_score=None
max_source=None
best_clf=None
flag=True
for index in range(1,int(fold)+1):
print(index)
shuffle_data=shuffle(data)
X=shuffle_data.T[:1].T
Y=np.array(shuffle_data.T[-1])[0]
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,train_size=0.3, random_state=0)
(clf,train_source)=poly_model(X_train,Y_train)
y_predict=clf.predict(x_test)
test_source=acc.get(y_predict,y_test)#测试集的准确率
source=0.35*train_source+0.65*test_source #模型综合准确率
if flag:
max_source=source
max_train_source=train_source
max_test_source=test_source
best_clf=clf
flag=False
else:
if max_source<source:
max_source=source
max_train_source=train_source
max_test_source=test_source
best_clf=clf
print("第%d次训练,训练集上的正确率为:%.2f,测试集上的正确率为:%.2f,
加权平均正确率为:%.2f"%(train_source,test_source,source)
print("最优训练模型,训练集上的正确率为:%.2f,测试集上的正确率为:%.2f,
加权平均正确率为:%.2f"%(max_train_source,max_test_source,max_source)
print("最优模型是:")
print(best_clf)
if not model_out_f:
model_out_f=模型保存路径
joblib.dump(best_clf,model_out_f)
if __name__="__main__":
print("="*30+" begining searching the most suitable model..." "+"+"*30
#打印30个等号
start=time.time()
cross_validation(music_csv_file_path=None,data_percentage=0.7)
end=time.time()
print("cost time%.2f" %(end-start))
#sys.exit(0)
print("="*30+" begining searching the most suitable model..." +"+"*30)
start=time.time()
trainModels(train_percentage=0.7,fold=1000,music_csv_file_path=None,model_out_f=None)
end=time.time()
print("cost time%.2f" %(end-start))
#svm main
import feature
import pandas as pd
import numpy as np
from sklearn.externals import joblib
import sys
import time
数值化路径="./data/xx.csv"
def load_model(model_f=None):
if not model_f:
model_f=模型保存路径
clf=joblib.load(model_f):
return clf
def 歌曲标签数值化(): #fetch_index_label
#从文件中读取index和label之间的映射关系,并返回dict
data=pd.read_csv(数值化路径,header=None,encoding="utf-8")
name_label_list=np.array(data).tolist()
index_label_dict=dict(map(lambda t:(t[1],t[0],name_label_list))
return index_label_dict
index_label_dict=歌曲标签数值化()
def predict_labels(clf,X):
label_index=clf.predict([x])#win7没有方括号跑不了
label=index_label_dict[label_index[0]]
return label
if __name__=="__main__":
数值化标签路径="./data/xx.csv"
模型保存路径=“./data/music_model.pkl"
clf=load_model()
parh=
music_feature=feature.获取歌曲特征
label=predict_labels(clf,music_feature)
print("预测标签为:%s"%label)