I have long had my eye on the data in QQ Zone (Qzone). I had wanted to scrape it and study it for a long time, and finally got started when I had a few days of spare time...
The overall flow of the program is: log in --> get cookies --> get the QQ numbers of all friends --> iterate over those QQ numbers to request each friend's posts ("talk", 说说) --> obtain the post data of all friends
The program finished in a little over 20 minutes, covering 282 friends and collecting 60000+ posts in total.
I have blurred out some private information in the screenshots. Please don't mind that. Hehe
1. Login --> get cookies
Open http://i.qq.com/ , as shown below
But most of the time
Here we log in with the account number and password, which makes it easy to automate the login with selenium, a handy browser automation tool (for selenium usage, see https://my.oschina.net/u/3264690/blog/899229 ; I will not go into detail here)
The QQ account number and password are stored in the userinfo.ini file and read with configparser
configparser is a library for reading configuration files. The call format used here is get('section name, i.e. the value in square brackets in the configuration file', 'key')
The code that reads them is as follows
import configparser
config = configparser.ConfigParser(allow_no_value=False)
config.read('userinfo.ini')
self.__username =config.get('qq_info','qq_number')
self.__password=config.get('qq_info','qq_password')
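As a minimal sketch of the file layout this code expects: the section name qq_info and the keys qq_number and qq_password come from the calls above, while the sample values, the helper name read_credentials and the temporary file are made up for illustration.

```python
import configparser
import os
import tempfile

# Hypothetical contents of userinfo.ini; the values are placeholders.
SAMPLE_INI = """\
[qq_info]
qq_number = 123456789
qq_password = not-a-real-password
"""


def read_credentials(path):
    """Return (qq_number, qq_password) read from the [qq_info] section."""
    config = configparser.ConfigParser(allow_no_value=False)
    config.read(path, encoding='utf-8')
    return config.get('qq_info', 'qq_number'), config.get('qq_info', 'qq_password')


# Write the sample to a temporary file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    ini_path = os.path.join(tmp, 'userinfo.ini')
    with open(ini_path, 'w', encoding='utf-8') as f:
        f.write(SAMPLE_INI)
    credentials = read_credentials(ini_path)

print(credentials)  # ('123456789', 'not-a-real-password')
```

Note that configparser always returns strings, so the QQ number comes back as '123456789' rather than as an integer.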
Once the user information has been read, we can log in
When using selenium, you may find that some elements cannot be located. This often happens because the page places them inside an iframe.
Selenium can switch into an iframe by its id (switch_to.frame is the current spelling; the older switch_to_frame was removed in Selenium 4)
self.web.switch_to.frame('login_frame')
The code that logs in automatically and collects the cookies
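# Requires: import time; from selenium.webdriver.common.by import By
def login(self):
    # The login form is inside an iframe, so switch into it first.
    self.web.switch_to.frame('login_frame')
    # Click the switcher to show the account/password form.
    log = self.web.find_element(By.ID, 'switcher_plogin')
    log.click()
    time.sleep(1)
    username = self.web.find_element(By.ID, 'u')
    username.send_keys(self.__username)
    ps = self.web.find_element(By.ID, 'p')
    ps.send_keys(self.__password)
    btn = self.web.find_element(By.ID, 'login_button')
    time.sleep(1)
    btn.click()
    time.sleep(2)
    # Load the user's Qzone home page before reading the cookies.
    self.web.get('https://user.qzone.qq.com/{}'.format(self.__username))
    # Join all cookies into one "name=value;" string for the request headers.
    cookie = ''
    for elem in self.web.get_cookies():
        cookie += elem["name"] + "=" + elem["value"] + ";"
    self.cookies = cookie
    self.get_g_tk()
    self.headers['Cookie'] = self.cookies
    self.web.quit()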
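As an alternative sketch, the cookies can be kept in a dict, which allows looking up a single cookie such as p_skey by name instead of searching the joined string. The helper names and the sample cookie values below are made up; only the 'name' and 'value' keys match what get_cookies() returns.

```python
def cookies_to_dict(cookie_list):
    """Map cookie names to values from a list of Selenium-style cookie dicts."""
    return {c['name']: c['value'] for c in cookie_list}


def cookie_header(cookie_dict):
    """Build the value of a Cookie request header from a name -> value dict."""
    return '; '.join('{}={}'.format(name, value) for name, value in cookie_dict.items())


# Made-up cookies in the shape returned by webdriver.get_cookies().
sample = [
    {'name': 'uin', 'value': 'o0123456789'},
    {'name': 'p_skey', 'value': 'abc123'},
]
jar = cookies_to_dict(sample)
print(cookie_header(jar))  # uin=o0123456789; p_skey=abc123
print(jar['p_skey'])       # abc123
```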
2. Get the QQ_number of all friends
After a fair amount of digging, I found that on the permission settings page of the Qzone home page, selecting "QQ friends only" brings up the following page.
Pressing F12 and looking through the js files the page loads turns up the following file
This js file contains the friends' qq_number values
So requesting this file directly gives us the qq_number list
# Assumes: import os; from urllib import parse
def get_frends_url(self):
    url = 'https://h5.qzone.qq.com/proxy/domain/base.qzone.qq.com/cgi-bin/right/get_entryuinlist.cgi?'
    params = {"uin": self.__username,
              "fupdate": 1,
              "action": 1,
              "g_tk": self.g_tk}
    url = url + parse.urlencode(params)
    return url

def get_frends_num(self):
    t = True
    offset = 0
    url = self.get_frends_url()
    while t:
        # Request the page that starts at the current offset.
        url_ = url + '&offset=' + str(offset)
        page = self.req.get(url=url_, headers=self.headers)
        if "\"uinlist\":[]" in page.text:
            # An empty uinlist means there are no more friends to fetch.
            t = False
        else:
            if not os.path.exists("./frends/"):
                os.mkdir("frends/")
            # Save each page as ./frends/<offset>.json
            with open('./frends/' + str(offset) + '.json', 'w', encoding='utf-8') as w:
                w.write(page.text)
            offset += 50
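The paging logic can be sketched independently of the network. In the sketch below, fake_fetch is a hypothetical stand-in for self.req.get(...).text, and its responses only mimic the "uinlist" field that the loop checks; the real response format is not reproduced here.

```python
import json
from urllib import parse

BASE_URL = ('https://h5.qzone.qq.com/proxy/domain/base.qzone.qq.com'
            '/cgi-bin/right/get_entryuinlist.cgi?')


def build_url(uin, g_tk, offset):
    """Return the friend-list URL for the page starting at `offset`."""
    params = {'uin': uin, 'fupdate': 1, 'action': 1, 'g_tk': g_tk, 'offset': offset}
    return BASE_URL + parse.urlencode(params)


def iter_pages(fetch_page, uin, g_tk, step=50):
    """Yield (offset, body) pairs until a page reports an empty uinlist."""
    offset = 0
    while True:
        text = fetch_page(build_url(uin, g_tk, offset))
        if '"uinlist":[]' in text:
            break
        yield offset, text
        offset += step


def fake_fetch(url):
    """Hypothetical fetcher: two non-empty pages, then an empty one."""
    offset = int(parse.parse_qs(parse.urlsplit(url).query)['offset'][0])
    uinlist = [{'data': offset + 1, 'label': 'friend'}] if offset < 100 else []
    return json.dumps({'uinlist': uinlist}, separators=(',', ':'))


pages = list(iter_pages(fake_fetch, uin='123456789', g_tk=5381))
print([offset for offset, _ in pages])  # [0, 50]
```

Passing the fetcher in as an argument keeps the loop testable without logging in.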
The requests above need a g_tk parameter (self.g_tk), which is a hash computed from the p_skey cookie. The js file qzfl_v8_2.1.61.js contains the following piece of code
QZFL.pluginsDefine.getACSRFToken = function(url) {
    url = QZFL.util.URI(url);
    var skey;
    if (url) {
        if (url.host && url.host.indexOf("qzone.qq.com") > 0) {
            try {
                skey = parent.QZFL.cookie.get("p_skey");
            } catch (err) {
                skey = QZFL.cookie.get("p_skey");
            }
        } else {
            if (url.host && url.host.indexOf("qq.com") > 0) {
                skey = QZFL.cookie.get("skey");
            }
        }
    }
    if (!skey) {
        skey = QZFL.cookie.get("p_skey") || (QZFL.cookie.get("skey") || (QZFL.cookie.get("rv2") || ""));
    }
    return arguments.callee._DJB(skey);
};
QZFL.pluginsDefine.getACSRFToken._DJB = function(str) {
    var hash = 5381;
    for (var i = 0, len = str.length; i < len; ++i) {
        hash += (hash << 5) + str.charCodeAt(i);
    }
    return hash & 2147483647;
};
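Ported to Python, it looks like this
def get_g_tk(self):
    # Cut the p_skey value out of the "name=value;" cookie string.
    p_skey = self.cookies[self.cookies.find('p_skey=') + 7: self.cookies.find(';', self.cookies.find('p_skey='))]
    # Same hash as _DJB above: start at 5381, then h += (h << 5) + char code.
    h = 5381
    for i in p_skey:
        h += (h << 5) + ord(i)
    print('g_tk', h & 2147483647)
    # Keep only the low 31 bits.
    self.g_tk = h & 2147483647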
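The hash can also be checked on its own, without a browser session. The sketch below is a standalone version of the same computation; the function name g_tk_from_skey and the input strings are made up for illustration.

```python
def g_tk_from_skey(skey):
    """DJB-style hash: start at 5381, then h += (h << 5) + ord(ch) per character."""
    h = 5381
    for ch in skey:
        h += (h << 5) + ord(ch)
    # Keep only the low 31 bits, as the JavaScript version does.
    return h & 0x7FFFFFFF


print(g_tk_from_skey(''))   # 5381
print(g_tk_from_skey('a'))  # 177670
```

For the single character 'a' the steps are 5381 + (5381 << 5) + 97 = 5381 + 172192 + 97 = 177670, which is already below 2^31, so the mask leaves it unchanged.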
Because the friend information was saved as json files, those files now need to be parsed
#coding:utf-8
import json
import os

def get_Frends_list():
    k = 0
    file_list = [i for i in os.listdir('./frends/') if i.endswith('json')]
    frends_list = []
    for f in file_list:
        with open('./frends/{}'.format(f), 'r', encoding='utf-8') as w:
            # Cut off the fixed-length wrapper around the JSON list of friends.
            data = w.read()[95:-5]
            js = json.loads(data)
            # print(js)
            for i in js:
                k += 1
                frends_list.append(i)
    return frends_list

frends_list = get_Frends_list()
print(frends_list)
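Fixed slice offsets such as [95:-5] break as soon as the wrapper around the JSON changes length. A more tolerant sketch, assuming the response has the usual JSONP shape callback({...}); the callback name and the field layout in the sample are made up, not copied from a real response.

```python
import json


def strip_jsonp(text):
    """Return the parsed JSON found between the first '(' and the last ')'."""
    start = text.find('(')
    end = text.rfind(')')
    if start == -1 or end <= start:
        raise ValueError('not a JSONP response')
    return json.loads(text[start + 1:end])


# Made-up response; callback name and structure are assumptions.
sample = '_Callback({"data": {"uinlist": [{"data": 10001, "label": "Alice"}]}});'
payload = strip_jsonp(sample)
print(payload['data']['uinlist'])  # [{'data': 10001, 'label': 'Alice'}]
```

Unlike the slice above, this returns the whole object, so the friend list still has to be picked out of it by key.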
3. Get all friends' posts
As before, after opening a friend's posts page, I found that there is again a js file that returns all of the posts in json form
So I wrote similar code to fetch the posts (testing showed that the num parameter is best set to 20; other values gave unpredictable results...)
def get_mood_url(self):
    url = 'https://h5.qzone.qq.com/proxy/domain/taotao.qq.com/cgi-bin/emotion_cgi_msglist_v6?'
    params = {
        "sort": 0,
        "start": 0,
        "num": 20,
        "cgi_host": "http://taotao.qq.com/cgi-bin/emotion_cgi_msglist_v6",
        "replynum": 100,
        "callback": "_preloadCallback",
        "code_version": 1,
        "inCharset": "utf-8",
        "outCharset": "utf-8",
        "notice": 0,
        "format": "jsonp",
        "need_private_comment": 1,
        "g_tk": self.g_tk
    }
    url = url + parse.urlencode(params)
    return url

def get_mood_detail(self):
    from getFrends import frends_list
    url = self.get_mood_url()
    # Note: the [245:] slice skips the first 245 friends.
    for u in frends_list[245:]:
        t = True
        QQ_number = u['data']
        url_ = url + '&uin=' + str(QQ_number)
        pos = 0
        while t:
            url__ = url_ + '&pos=' + str(pos)
            mood_detail = self.req.get(url=url__, headers=self.headers)
            print(QQ_number, u['label'], pos)
            # Stop when there are no more posts, or when access is denied. The
            # Chinese message means "Sorry, the owner has set this to private;
            # you do not have permission to view it".
            if "\"msglist\":null" in mood_detail.text or "\"message\":\"对不起,主人设置了保密,您没有权限查看\"" in mood_detail.text:
                t = False
            else:
                if not os.path.exists("./mood_detail/"):
                    os.mkdir("mood_detail/")
                if not os.path.exists("./mood_detail/" + u['label']):
                    os.mkdir("mood_detail/" + u['label'])
                # One file per page: mood_detail/<label>/<qq_number>_<pos>.json
                with open('./mood_detail/' + u['label'] + "/" + str(QQ_number) + "_" + str(pos) + '.json', 'w', encoding='utf-8') as w:
                    w.write(mood_detail.text)
                pos += 20
        # Pause between friends.
        time.sleep(2)
Store the required data in the database
# Save to the database
# Assumes: import json, os, pymysql
def dataToMysql():
    con = pymysql.connect(
        host='127.0.0.1',
        user='root',
        password="×××",
        database='qq_z',
        port=3306,
    )
    cur = con.cursor()
    # Use %s placeholders so that pymysql quotes and escapes each value.
    sql = "insert into info (qq_number,created_time,content,commentlist,source_name,cmtnum,name) values (%s,%s,%s,%s,%s,%s,%s);"
    d = [i for i in os.listdir('mood_detail') if not i.endswith('.xls')]
    for ii in d:
        fl = [i for i in os.listdir('mood_detail/' + ii) if i.endswith('.json')]
        print('mood_detail/' + ii)
        for i in fl:
            # The files were written as UTF-8, so read them back as UTF-8.
            with open('mood_detail/' + ii + "/" + i, 'r', encoding='utf-8') as w:
                # Cut off the 17-character "_preloadCallback(" prefix and the 2-character suffix.
                s = w.read()[17:-2]
                js = json.loads(s)
                print(i)
                for s in js['msglist']:
                    if not s['commentlist']:
                        s['commentlist'] = list()
                    comments = str([(x['content'], x['createTime2'], x['name'], x['uin']) for x in list(s['commentlist'])])
                    # The QQ number is the part of the file name before the underscore.
                    cur.execute(sql, (int(i[:i.find('_')]), s['created_time'], str(s['content']), comments, str(s['source_name']), int(s['cmtnum']), str(s['name'])))
    con.commit()
    con.close()
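Post text can contain quotes and other special characters, so passing values as query parameters is safer than formatting them into the SQL string. A minimal sketch of that pattern, using the standard-library sqlite3 module with an in-memory database as a stand-in for MySQL; the cut-down table and the sample post are made up.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL `info` table; only three columns are shown.
con = sqlite3.connect(':memory:')
con.execute('create table info (qq_number integer, created_time integer, content text)')

# Made-up post whose text contains quotes that would break string formatting.
post = {'created_time': 1500000000, 'content': 'it\'s a "quoted" post'}

# sqlite3 uses ? placeholders; pymysql uses %s for the same purpose.
con.execute(
    'insert into info (qq_number, created_time, content) values (?, ?, ?)',
    (123456789, post['created_time'], post['content']),
)
con.commit()

rows = con.execute('select qq_number, content from info').fetchall()
print(rows)
con.close()
```

The driver handles quoting, so the text is stored exactly as it was passed in.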
Save the required data into Excel
# Assumes: import json, os, xlwt
def dataToExcel():
    d = [i for i in os.listdir('mood_detail') if not i.endswith('.xls')]
    for ii in d:
        # One workbook per friend directory.
        wb = xlwt.Workbook()
        sheet = wb.add_sheet('sheet1', cell_overwrite_ok=True)
        # Header row
        sheet.write(0, 0, 'content')
        sheet.write(0, 1, 'createTime')
        sheet.write(0, 2, 'commentlist')
        sheet.write(0, 3, 'source_name')
        sheet.write(0, 4, 'cmtnum')
        fl = [i for i in os.listdir('mood_detail/' + ii) if i.endswith('.json')]
        print('mood_detail/' + ii)
        k = 1  # next row to write
        for i in fl:
            # The files were written as UTF-8, so read them back as UTF-8.
            with open('mood_detail/' + ii + "/" + i, 'r', encoding='utf-8') as w:
                # Cut off the 17-character "_preloadCallback(" prefix and the 2-character suffix.
                s = w.read()[17:-2]
                js = json.loads(s)
                print(i)
                for s in js['msglist']:
                    m = -1
                    sheet.write(k, m + 1, str(s['content']))
                    sheet.write(k, m + 2, str(s['createTime']))
                    if not s['commentlist']:
                        s['commentlist'] = list()
                    sheet.write(k, m + 3, str([(x['content'], x['createTime2'], x['name'], x['uin']) for x in list(s['commentlist'])]))
                    sheet.write(k, m + 4, str(s['source_name']))
                    sheet.write(k, m + 5, str(s['cmtnum']))
                    k += 1
        if not os.path.exists('mood_detail/Excel/'):
            os.mkdir('mood_detail/Excel/')
        try:
            wb.save('mood_detail/Excel/' + ii + '.xls')
        except Exception:
            print("error")
4. Analyze the data
Number of posts by hour of the day
People posted more around noon and in the evening, and less in the early morning.
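A count like this can be produced with a few lines of standard-library code. The sketch below assumes that created_time is a Unix timestamp in seconds and that hours should be read in UTC+8; both are assumptions, and the sample timestamps are made up.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumption: created_time is a Unix timestamp in seconds, displayed in UTC+8.
CST = timezone(timedelta(hours=8))


def posts_per_hour(timestamps):
    """Return a list of 24 counts, one for each hour of the day."""
    counts = Counter(datetime.fromtimestamp(ts, CST).hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]


# Made-up timestamps: 08:00, 12:00 and 12:30 on 1970-01-01 (UTC+8).
sample = [0, 4 * 3600, 4 * 3600 + 1800]
hourly = posts_per_hour(sample)
print(hourly[8], hourly[12])  # 1 2
```

Using a fixed offset rather than the machine's local time zone keeps the result the same wherever the script runs.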
Top 20 friends with the most posts
Top 20 friends with the fewest posts
Sure enough, the more bored people are, the more they post... Hahaha
The distribution of posts by year, from 2000 to 2018, is as follows
It seems my friends had a lot of time on their hands when they were younger, and as they get older they post less and less...
Thanks to https://zhuanlan.zhihu.com/p/24656161 for the tips, which saved me a lot of detours.
The scraping is very fast: all 60000+ posts from my 282 friends were captured in about 20 minutes.
The project has been pushed to
Code Cloud: https://git.oschina.net/nanxun/QQ_zone.git
If you find it useful, please give it a star. Thanks...