Scraping 60,000+ QQ Zone posts (shuoshuo) and analyzing the data

I have had my eye on QQ Zone data for a long time and always wanted to scrape it for study. I finally got around to it over a few days of spare time.

The overall flow of the program is: log in --> get the cookies --> get the QQ numbers of all friends --> loop over those QQ numbers and request each friend's posts --> save all friends' post data

The program finished in a little over 20 minutes, covering 282 friends and 60,000+ posts in total.

I have blurred out some private details in the screenshots. Never mind those. Hehe.

1. Login --> get cookies

Open http://i.qq.com/ and you will see the login page (screenshot omitted).

But most of the time the page looks different (screenshot omitted).

Here we log in with the account and password, and to make that easy we use Selenium to automate the browser (for Selenium usage, see https://my.oschina.net/u/3264690/blog/899229 ; I will not go into detail here).

The QQ account and password are stored in the userinfo.ini file and read with configparser.

configparser is a library for reading configuration files. The call format is get('section name (the value in square brackets in the configuration file)', 'key').

The code that reads the file is as follows

import configparser
config = configparser.ConfigParser(allow_no_value=False)
config.read('userinfo.ini')
self.__username =config.get('qq_info','qq_number')
self.__password=config.get('qq_info','qq_password')
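
For reference, here is a minimal sketch of what userinfo.ini might look like and how it can be read. The section name qq_info and the keys qq_number and qq_password come from the code above; the values, the temporary file, and the helper name read_user_info are placeholders for illustration only.

```python
import configparser
import os
import tempfile

# Placeholder contents; the real file holds your own account details.
SAMPLE_INI = """\
[qq_info]
qq_number = 123456789
qq_password = not-a-real-password
"""

def read_user_info(path):
    """Return (qq_number, qq_password) from the [qq_info] section."""
    config = configparser.ConfigParser(allow_no_value=False)
    # read() returns the list of files it parsed; an empty list means the file is missing
    if not config.read(path, encoding='utf-8'):
        raise FileNotFoundError(path)
    return config.get('qq_info', 'qq_number'), config.get('qq_info', 'qq_password')

# Write the sample to a temporary directory and read it back
with tempfile.TemporaryDirectory() as tmp:
    ini_path = os.path.join(tmp, 'userinfo.ini')
    with open(ini_path, 'w', encoding='utf-8') as f:
        f.write(SAMPLE_INI)
    qq_number, qq_password = read_user_info(ini_path)
    print(qq_number)
```

Note that configparser returns every value as a string, so the QQ number comes back as '123456789' rather than an integer.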

Once the user information has been read, you can log in.

When using Selenium, you may find that some elements cannot be located. This happens when the page places them inside an iframe.

Selenium switches into the iframe by its id before locating the elements inside it

self.web.switch_to_frame('login_frame')

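The code below logs in automatically and collects the cookies. It uses the Selenium 3 API; in Selenium 4, switch_to_frame(...) is written switch_to.frame(...) and find_element_by_id(...) is written find_element(By.ID, ...).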

    def login(self):
        self.web.switch_to_frame('login_frame')
        log=self.web.find_element_by_id("switcher_plogin")
        log.click()
        time.sleep(1)
        username=self.web.find_element_by_id('u')
        username.send_keys(self.__username)
        ps=self.web.find_element_by_id('p')
        ps.send_keys(self.__password)
        btn=self.web.find_element_by_id('login_button')
        time.sleep(1)
        btn.click()
        time.sleep(2)
        self.web.get('https://user.qzone.qq.com/{}'.format(self.__username))
        cookie=''
        for elem in self.web.get_cookies():
            cookie+=elem["name"]+"="+ elem["value"]+";"
        self.cookies=cookie
        self.get_g_tk()
        self.headers['Cookie']=self.cookies
        self.web.quit()
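
The cookie handling in login() can also be sketched without a browser. The sample cookies below are made up, in the shape that Selenium's get_cookies() returns (a list of dicts with name and value keys); the helper names cookies_to_dict and cookie_header are hypothetical.

```python
def cookies_to_dict(cookies):
    """Map cookie name to value for a list of Selenium-style cookie dicts."""
    return {c['name']: c['value'] for c in cookies}

def cookie_header(cookie_dict):
    """Build the value of a Cookie request header: 'name=value; name=value'."""
    return '; '.join('{}={}'.format(k, v) for k, v in cookie_dict.items())

# Made-up cookies for illustration
sample = [
    {'name': 'uin', 'value': 'o0123456789'},
    {'name': 'p_skey', 'value': 'abc'},
]
jar = cookies_to_dict(sample)
headers = {'Cookie': cookie_header(jar)}
print(headers['Cookie'])  # uin=o0123456789; p_skey=abc
print(jar['p_skey'])      # abc
```

Keeping the cookies in a dict makes it easy to look up a single value such as p_skey later, without searching the joined string.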
        

2. Get the QQ_number of all friends

After a lot of digging, I found that if you open the permission settings page from the QQ Zone homepage and select "QQ friends only", a page listing your friends appears.

Press F12 and look through the requests the page makes, and you will find one particular file.

The response of this request contains the friends' QQ numbers.

So requesting this file gives us the QQ numbers.

    def get_frends_url(self):
        url='https://h5.qzone.qq.com/proxy/domain/base.qzone.qq.com/cgi-bin/right/get_entryuinlist.cgi?'
        params = {"uin": self.__username,
              "fupdate": 1,
              "action": 1,
              "g_tk": self.g_tk}
        url = url + parse.urlencode(params)
        return url

    def get_frends_num(self):
        t=True
        offset=0
        url=self.get_frends_url()
        while(t):
            url_=url+'&offset='+str(offset)
            page=self.req.get(url=url_,headers=self.headers)
            if "\"uinlist\":[]" in page.text:
                t=False
            else:

                if not os.path.exists("./frends/"):
                    os.mkdir("frends/")
                with open('./frends/'+str(offset)+'.json','w',encoding='utf-8') as w:
                    w.write(page.text)
                offset += 50

 

The code above uses self.g_tk, which is set by get_g_tk(). It is a hash computed from the p_skey cookie. The JS file qzfl_v8_2.1.61.js contains this piece of code

  QZFL.pluginsDefine.getACSRFToken = function(url) {
    url = QZFL.util.URI(url);
    var skey;
    if (url) {
      if (url.host && url.host.indexOf("qzone.qq.com") > 0) {
        try {
          skey = parent.QZFL.cookie.get("p_skey");
        } catch (err) {
          skey = QZFL.cookie.get("p_skey");
        }
      } else {
        if (url.host && url.host.indexOf("qq.com") > 0) {
          skey = QZFL.cookie.get("skey");
        }
      }
    }
    if (!skey) {
      skey = QZFL.cookie.get("p_skey") || (QZFL.cookie.get("skey") || (QZFL.cookie.get("rv2") || ""));
    }
    return arguments.callee._DJB(skey);
  };
  QZFL.pluginsDefine.getACSRFToken._DJB = function(str) {
    var hash = 5381;
    for (var i = 0, len = str.length;i < len;++i) {
      hash += (hash << 5) + str.charCodeAt(i);
    }
    return hash & 2147483647;
  };

Rewritten in Python, it looks like this

    def get_g_tk(self):
        p_skey = self.cookies[self.cookies.find('p_skey=')+7: self.cookies.find(';', self.cookies.find('p_skey='))]
        h=5381
        for i in p_skey:
            h+=(h<<5)+ord(i)
        print('g_tk',h&2147483647)
        self.g_tk=h&2147483647
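
The same hash can be sketched as a standalone function that is easy to test outside the class. The function names g_tk and extract_cookie and the sample cookie string are made up for illustration; the hash loop itself follows the _DJB function shown above.

```python
def g_tk(p_skey):
    """Hash a p_skey cookie value, following the _DJB function in the JS above."""
    h = 5381
    for ch in p_skey:
        h += (h << 5) + ord(ch)
    # Keep the low 31 bits, matching `hash & 2147483647` in the JS
    return h & 0x7FFFFFFF

def extract_cookie(cookie_string, name):
    """Return the value of `name` from a 'k=v;k=v' cookie string, or '' if absent."""
    for part in cookie_string.split(';'):
        key, sep, value = part.strip().partition('=')
        if sep and key == name:
            return value
    return ''

cookie_string = 'uin=o0123456789;p_skey=a;'  # made-up cookie string
print(g_tk(extract_cookie(cookie_string, 'p_skey')))  # 177670
```

Splitting on ';' avoids relying on a trailing semicolon after p_skey, which the find()-based slice above needs in order to locate the end of the value.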

Because the friend information was saved as JSON files, those files now need to be parsed

#coding:utf-8
import json
import os
def get_Frends_list():
    k = 0
    file_list=[i for i in os.listdir('./frends/') if i.endswith('json')]
    frends_list=[]
    for f in file_list:
        with open('./frends/{}'.format(f),'r',encoding='utf-8') as w:
            data=w.read()[95:-5]
            js=json.loads(data)
            # print(js)
            for i in js:
                k+=1
                frends_list.append(i)
    return frends_list


frends_list=get_Frends_list()
print(frends_list)
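
The fixed slice [95:-5] above depends on the exact length of the wrapper around the JSON. As an alternative, here is a minimal sketch that strips a JSONP wrapper with a regular expression. The helper name strip_jsonp and the sample response are hypothetical; only the uinlist, data and label field names are taken from the code in this article, and the real response may be nested differently.

```python
import json
import re

# Matches `callbackName( ... )` with an optional trailing semicolon
_JSONP_RE = re.compile(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', re.DOTALL)

def strip_jsonp(text):
    """Return the parsed JSON payload of a JSONP response."""
    match = _JSONP_RE.match(text)
    if match is None:
        raise ValueError('not a JSONP response')
    return json.loads(match.group(1))

# Made-up response for illustration
sample = '_Callback({"code": 0, "data": {"uinlist": [{"data": 10001, "label": "Alice"}]}});'
payload = strip_jsonp(sample)
print(payload['data']['uinlist'][0]['label'])  # Alice
```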

3. Get all friends' posts

As before, after opening a friend's posts page I found a similar request that returns all of the posts as JSON.

So I wrote the code to fetch the posts in the same way (after testing, it is best to set num to 20 in the parameters; other values give unpredictable results...)

    def get_mood_url(self):
        url='https://h5.qzone.qq.com/proxy/domain/taotao.qq.com/cgi-bin/emotion_cgi_msglist_v6?'
        params = {
            "sort": 0,
            "start": 0,
            "num": 20,
            "cgi_host": "http://taotao.qq.com/cgi-bin/emotion_cgi_msglist_v6",
            "replynum": 100,
            "callback": "_preloadCallback",
            "code_version": 1,
            "inCharset": "utf-8",
            "outCharset": "utf-8",
            "notice": 0,
            "format": "jsonp",
            "need_private_comment": 1,
            "g_tk": self.g_tk
        }
        url = url + parse.urlencode(params)
        return url


    def get_mood_detail(self):
        from getFrends import frends_list
        url = self.get_mood_url()
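        # NOTE: the [245:] slice skips the first 245 friends (presumably left over
        # from resuming an interrupted run); use frends_list to crawl everyone
        for u in frends_list[245:]: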
            t = True
            QQ_number=u['data']
            url_ = url + '&uin=' + str(QQ_number)
            pos = 0
            while (t):
                url__ = url_ + '&pos=' + str(pos)
                mood_detail = self.req.get(url=url__, headers=self.headers)
                print(QQ_number,u['label'],pos)
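                # Stop when there are no more posts, or when the server replies (in Chinese)
                # "Sorry, the owner has set this to private; you do not have permission to view it"
                if "\"msglist\":null" in mood_detail.text or "\"message\":\"对不起,主人设置了保密,您没有权限查看\"" in mood_detail.text: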
                    t = False
                else:
                    if not os.path.exists("./mood_detail/"):
                        os.mkdir("mood_detail/")
                    if not os.path.exists("./mood_detail/"+u['label']):
                        os.mkdir("mood_detail/"+u['label'])
                    with open('./mood_detail/'+u['label']+"/" +str(QQ_number)+"_"+ str(pos) + '.json', 'w',encoding='utf-8') as w:
                        w.write(mood_detail.text)
                    pos += 20
            time.sleep(2)
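
Both the friend list and the posts are fetched with the same pattern: request a page, stop when it comes back empty, otherwise advance the offset by the page size. Here is a minimal sketch of that loop with the HTTP request replaced by a stand-in function. The names fetch_all_pages and fake_fetch and the max_pages cap are assumptions added for illustration, not part of the original program.

```python
def fetch_all_pages(fetch_page, page_size=20, max_pages=1000):
    """Collect items page by page until an empty page comes back.

    fetch_page(pos) is a caller-supplied function returning a list of items.
    """
    items = []
    pos = 0
    for _ in range(max_pages):  # hard cap so a misbehaving server cannot loop forever
        page = fetch_page(pos)
        if not page:
            break
        items.extend(page)
        pos += page_size
    return items

# Stand-in for the HTTP request: 45 fake posts served 20 at a time
FAKE_POSTS = ['post {}'.format(n) for n in range(45)]

def fake_fetch(pos):
    return FAKE_POSTS[pos:pos + 20]

all_posts = fetch_all_pages(fake_fetch)
print(len(all_posts))  # 45
```

Passing the fetch function in as a parameter keeps the paging logic testable without touching the network.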

Store the required data in a MySQL database

# Store into the database
import json
import os
import pymysql

def dataToMysql():
    con=pymysql.connect(
        host='127.0.0.1',
        user='root',
        password="×××",
        database='qq_z',
        port=3306,
    )
    cur=con.cursor()
    # Parameterized query: the driver quotes and escapes each value
    sql="insert into info (qq_number,created_time,content,commentlist,source_name,cmtnum,name) values (%s,%s,%s,%s,%s,%s,%s);"

    d=[i for i in os.listdir('mood_detail') if not i.endswith('.xls')]
    for ii in d:
        fl=[i for i in os.listdir('mood_detail/'+ii) if i.endswith('.json')]
        print('mood_detail/'+ii)
        for i in fl:
            # The files were written as UTF-8, so read them back as UTF-8
            with open('mood_detail/'+ii+"/"+i,'r',encoding='utf-8') as w:
                # Strip the JSONP wrapper "_preloadCallback(" ... ");"
                text=w.read()[17:-2]
            js=json.loads(text)
            print(i)
            for s in js['msglist']:
                if not s['commentlist']:
                    s['commentlist']=list()
                comments=str([(x['content'],x['createTime2'],x['name'],x['uin']) for x in s['commentlist']])
                cur.execute(sql,(int(i[:i.find('_')]),s['created_time'],str(s['content']),comments,str(s['source_name']),int(s['cmtnum']),str(s['name'])))
    # Commit and close once, after every directory has been processed
    con.commit()
    con.close()

Save the required data into Excel

def dataToExcel():
    d=[i for i in os.listdir('mood_detail') if not i.endswith('.xls')]
    for ii in d:
        wb=xlwt.Workbook()
        sheet=wb.add_sheet('sheet1',cell_overwrite_ok=True)
        sheet.write(0,0,'content')
        sheet.write(0,1,'createTime')
        sheet.write(0,2,'commentlist')
        sheet.write(0,3,'source_name')
        sheet.write(0,4,'cmtnum')
        fl=[i for i in os.listdir('mood_detail/'+ii) if i.endswith('.json')]
        print('mood_detail/'+ii)
        k=1
        for i in fl:
            # The files were written as UTF-8, so read them back as UTF-8
            with open('mood_detail/'+ii+"/"+i,'r',encoding='utf-8') as w:
                s=w.read()[17:-2]
                js=json.loads(s)
                print(i)
                for s in js['msglist']:
                    m=-1
                    sheet.write(k,m+1,str(s['content']))
                    sheet.write(k,m+2,str(s['createTime']))
                    if not s['commentlist']:
                        s['commentlist']=list()
                    sheet.write(k,m+3,str([(x['content'],x['createTime2'],x['name'],x['uin']) for x in list(s['commentlist'])]))
                    sheet.write(k,m+4,str(s['source_name']))
                    sheet.write(k,m+5,str(s['cmtnum']))
                    k+=1
        if not os.path.exists('mood_detail/Excel/'):
            os.mkdir('mood_detail/Excel/')
        try:
            wb.save('mood_detail/Excel/'+ii+'.xls')
        except Exception:
            print("error")

4. Analyze the data

Number of posts by hour of the day

People post more around noon and in the evening, and less in the early morning.
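
A count like this can be produced with a few lines of Python. The sketch below assumes that created_time is a Unix timestamp in seconds and that hours should be read in China Standard Time (UTC+8); both are assumptions, and the sample timestamps and the name posts_per_hour are made up.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

CST = timezone(timedelta(hours=8))  # assumed time zone: China Standard Time

def posts_per_hour(timestamps, tz=CST):
    """Count posts for each hour of the day; index 0 is midnight, index 23 is 11 pm."""
    counts = Counter(datetime.fromtimestamp(ts, tz).hour for ts in timestamps)
    return [counts.get(h, 0) for h in range(24)]

# Made-up timestamps: 12:00, 13:00, 12:00 (next day) and 20:00 in UTC+8
sample = [1500004800, 1500008400, 1500091200, 1500033600]
hourly = posts_per_hour(sample)
print(hourly[12], hourly[13], hourly[20])  # 2 1 1
```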

 

 

Top 20 friends with the most posts

 

Top 20 friends with the fewest posts

 

Sure enough, the people with the most time on their hands post the most. Hahaha
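
Rankings like the two above can be sketched with collections.Counter. The input here is a made-up list with one friend name per post, and rank_posters is a hypothetical helper. Friends with no posts at all do not appear in the list, so they are not counted in the "fewest" ranking.

```python
from collections import Counter

def rank_posters(post_owners, n=20):
    """Return (most_active, least_active) as lists of (name, count), at most n each."""
    counts = Counter(post_owners)
    most = counts.most_common(n)
    least = sorted(counts.items(), key=lambda kv: kv[1])[:n]
    return most, least

# Made-up data: one entry per post, naming the friend who wrote it
sample = ['Alice', 'Bob', 'Alice', 'Carol', 'Alice', 'Bob']
most, least = rank_posters(sample, n=2)
print(most)   # [('Alice', 3), ('Bob', 2)]
print(least)  # [('Carol', 1), ('Bob', 2)]
```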

 

The distribution of posts by year, from 2000 to 2018, is as follows

It seems my friends had plenty of idle time when they were young, and they post less and less as they get older.

Thanks to https://zhuanlan.zhihu.com/p/24656161 for the tips, which saved me a lot of detours.

The scraping is very fast: all 60,000+ posts from my 282 friends were captured in about 20 minutes.

The project has been uploaded to

Code Cloud:  https://git.oschina.net/nanxun/QQ_zone.git

github : https://github.com/nanxung/QQ_zone.git

Friends, if you find it useful, please give it a star. Thanks!
