I scraped some data from the 2019 LPL professional league and use Baron/dragon counts, towers destroyed, and team kill counts to predict which team wins, so the project splits into two parts: the scraper and the KNN algorithm.
First, the scraper. Based on the page structure, I build one big list holding every URL to scrape, then loop over that list and fetch each page.
I wrote this function in a separate file, so it has to be imported below; if everything lives in one file, the import is unnecessary.
The first file, lol_game_url.py:
def make_url():
    l = []
    for i in range(68, 100):
        # the page with bmid=4273 fails to load, so skip it
        if i != 73:
            s = 'https://lpl.qq.com/es/stats.shtml?bmid=42%d' % i
            l.append(s)
    l1 = []
    for i in range(80):
        if i < 10:
            s1 = 'https://lpl.qq.com/es/stats.shtml?bmid=430%d' % i
        else:
            s1 = 'https://lpl.qq.com/es/stats.shtml?bmid=43%d' % i
        l1.append(s1)
    l2 = l + l1
    return l2
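The if/else in the second loop only zero-pads the page index, so the same list can be built more compactly with `%02d` formatting. A quick equivalence sketch, using the same bmid ranges as above:

```python
# Equivalent URL construction using zero-padded formatting.
base = 'https://lpl.qq.com/es/stats.shtml?bmid='

# bmid 4268..4299, skipping the broken page 4273.
urls_42 = [base + '42%d' % i for i in range(68, 100) if i != 73]
# bmid 4300..4379: '%02d' pads single digits, replacing the if/else branch.
urls_43 = [base + '43%02d' % i for i in range(80)]

urls = urls_42 + urls_43
print(len(urls))   # 31 + 80 = 111 URLs
print(urls[0])     # https://lpl.qq.com/es/stats.shtml?bmid=4268
print(urls[31])    # https://lpl.qq.com/es/stats.shtml?bmid=4300
```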
The second file, lol_game_detail.py:
# import the required modules
import sqlite3
import time
from selenium import webdriver
from lol_game_url import make_url

# build the URL list, connect to the database, and visit each page in turn
lol_url_list = make_url()
browser = webdriver.Firefox()
# conn = sqlite3.connect(r'D:\BaiduNetdiskDownload\navicat12\navicat\database\spider.db')
# c = conn.cursor()
for url in lol_url_list:
    browser.get(url)
    time.sleep(2)
    # the two team names
    team1 = browser.find_element_by_css_selector('span#teamA_name').text
    team2 = browser.find_element_by_css_selector('span#teamB_name').text
    # the list of games between the two teams; scrape each game in turn
    game_nums = browser.find_elements_by_css_selector('ul#smatch_bar>li')
    play_group = '%svs%s' % (team1, team2)
    for game_num in game_nums:
        num = game_num.find_element_by_css_selector('a')
        game_num1 = num.text
        if num:
            num.click()
            time.sleep(1)
            # team 1 and team 2 names
            team11 = browser.find_element_by_css_selector('p#teama-name').text
            team22 = browser.find_element_by_css_selector('p#teamb-name').text
            # Barons
            team1_big_dragon = browser.find_element_by_css_selector('span#game-b-dragon-num-left').text
            team2_big_dragon = browser.find_element_by_css_selector('span#game-b-dragon-num-right').text
            # dragons
            team1_small_dragon = browser.find_element_by_css_selector('span#game-s-dragon-num-left').text
            team2_small_dragon = browser.find_element_by_css_selector('span#game-s-dragon-num-right').text
            # towers destroyed
            team1_tower_num = browser.find_element_by_css_selector('span#game-tower-num-left').text
            team2_tower_num = browser.find_element_by_css_selector('span#game-tower-num-right').text
            # total gold
            team1_gold_num = browser.find_element_by_css_selector('span#game-gold-total-left').text
            team2_gold_num = browser.find_element_by_css_selector('span#game-gold-total-right').text
            # kills
            team1_kill = browser.find_element_by_css_selector('p#game-kda-k-total-num-left').text
            team2_kill = browser.find_element_by_css_selector('p#game-kda-k-total-num-right').text
            # team rosters
            team1_members = []
            for i in range(1, 6):
                selector = 'p#nr-game-player-name-left-%d' % i
                member = browser.find_element_by_css_selector(selector).text
                team1_members.append(member)
            team1_member = ','.join(team1_members)
            team2_members = []
            for i in range(1, 6):
                selector = 'p#nr-game-player-name-right-%d' % i
                member = browser.find_element_by_css_selector(selector).text
                team2_members.append(member)
            team2_member = ','.join(team2_members)
            print(play_group)
            print(game_num1)
            print('team 1:', team11)
            print('Barons:', team1_big_dragon)
            print('dragons:', team1_small_dragon)
            print('towers:', team1_tower_num)
            print('gold:', team1_gold_num)
            print('%s roster: %s' % (team11, team1_member))
            print('kills:', team1_kill)
            print('team 2:', team22)
            print('Barons:', team2_big_dragon)
            print('dragons:', team2_small_dragon)
            print('towers:', team2_tower_num)
            print('gold:', team2_gold_num)
            print('%s roster: %s' % (team22, team2_member))
            print('kills:', team2_kill)
            print('*' * 100)
            # insert the fields into the database
            sql = """insert into lpl_match(play_group,game_num,team11,team1_members,
            team1_big_dragon,team1_small_dragon,team1_tower_num,team1_gold_num,team1_kill,team22,
            team2_members,team2_big_dragon,team2_small_dragon,team2_tower_num,team2_gold_num,team2_kill)
            values(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)"""
            # params = (play_group, game_num1, team11, team1_member, team1_big_dragon, team1_small_dragon,
            #           team1_tower_num, team1_gold_num, team1_kill, team22, team2_member, team2_big_dragon,
            #           team2_small_dragon, team2_tower_num, team2_gold_num, team2_kill)
            # c.execute(sql, params)
            # conn.commit()
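The post never shows how the `lpl_match` table was created. Here is a minimal sketch of a schema matching the INSERT above; the column types are my guess (the scraped values arrive as strings anyway), and `victory` is the hand-labelled column the third file reads:

```python
import sqlite3

# In-memory database for illustration; the post uses a file path instead.
conn = sqlite3.connect(':memory:')
c = conn.cursor()

# Hypothetical schema: one TEXT column per field in the INSERT statement,
# plus the manually labelled 'victory' column used later by the KNN script.
c.execute("""create table lpl_match(
    play_group text, game_num text, team11 text, team1_members text,
    team1_big_dragon text, team1_small_dragon text, team1_tower_num text,
    team1_gold_num text, team1_kill text, team22 text, team2_members text,
    team2_big_dragon text, team2_small_dragon text, team2_tower_num text,
    team2_gold_num text, team2_kill text, victory integer)""")

# A made-up sample row, just to show the insert works.
params = ('TOPvsIG', 'Game 1', 'TOP', 'a,b,c,d,e', '0', '3', '4', '15000',
          '15', 'IG', 'f,g,h,i,j', '1', '1', '9', '18000', '22', 2)
c.execute('insert into lpl_match values(' + ','.join('?' * 17) + ')', params)
conn.commit()
print(c.execute('select count(*) from lpl_match').fetchone()[0])  # 1
```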
The third file, lol_machinelearning_predict.py:
# import the required modules
import sqlite3
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# connect to the database and read the table
conn = sqlite3.connect(r'D:\BaiduNetdiskDownload\navicat12\navicat\database\spider.db')
sql = """select * from lpl_match"""
lol = pd.read_sql(sql, conn)

def knn_sklearn():
    # the feature columns of our scraped data
    lol_content = lol.loc[:, ['team1_big_dragon', 'team1_small_dragon', 'team1_tower_num', 'team1_kill',
                              'team2_big_dragon', 'team2_small_dragon', 'team2_tower_num', 'team2_kill']]
    # the labels; I tagged these by hand because I forgot to scrape a result field
    # (only 200-odd rows, so it was bearable): 1 means team 1 won, 2 means team 2 won
    lol_victory = lol['victory']
    # split into training and test sets; test_size is the split ratio
    x_train, x_test, y_train, y_test = train_test_split(lol_content, lol_victory, test_size=0.25)
    # n_neighbors is the k value
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(x_train, y_train)
    l = knn.predict_proba(np.array([[0, 3, 4, 15, 1, 1, 9, 22]]))  # TOP vs IG, IG won
    # convert to a plain list
    l1 = l.tolist()[0]
    print(l1)
    if l1[0] > l1[1]:
        print('team 1 wins')
    else:
        print('team 2 wins')

knn_sklearn()
Output of the first file, lol_game_url.py (screenshot, excerpt):
Output of the second file, lol_game_detail.py (screenshot, excerpt):
Output of the third file, lol_machinelearning_predict.py (screenshot, excerpt):
Database contents (screenshot, excerpt):
I used the first row of data for the prediction and the result was correct. Total gold was left out as a feature because all the features above already influence gold. My understanding of KNN: compute the distance between the feature values of the sample you want to classify and the feature values of every labelled sample, sort those distances, and take the closest ones; how many you take is the k in k-NN.
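That description of KNN (measure the distance from the new sample to every labelled sample, sort, then vote among the k nearest) can be sketched in a few lines of NumPy, using toy data rather than the scraped table:

```python
import numpy as np
from collections import Counter

def knn_predict(x_train, y_train, x, k=5):
    # Euclidean distance from x to every training sample.
    dists = np.sqrt(((x_train - x) ** 2).sum(axis=1))
    # Indices of the k nearest neighbours.
    nearest = np.argsort(dists)[:k]
    # Majority vote among their labels.
    return int(Counter(y_train[nearest].tolist()).most_common(1)[0][0])

# Toy data: label 1 clusters near the origin, label 2 near (10, 10).
x_train = np.array([[0, 0], [1, 1], [0, 1], [10, 10], [9, 10], [10, 9]])
y_train = np.array([1, 1, 1, 2, 2, 2])
print(knn_predict(x_train, y_train, np.array([0.5, 0.5]), k=3))  # 1
print(knn_predict(x_train, y_train, np.array([9.5, 9.5]), k=3))  # 2
```

This is what `KNeighborsClassifier` does internally (plus smarter neighbour search and tie-breaking); the scikit-learn version above is the one actually used on the scraped data.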