# Python 3 Weibo Crawler [requests + pyquery + selenium + MongoDB]


In the era of big data, data is the foundation of research, yet the huge volumes of data required cannot be gathered by hand, which is why web crawlers exist. Weibo, one of the most popular domestic social-media platforms, holds a large amount of user-behaviour and business data, so obtaining the data you need with a crawler is becoming an essential skill for researchers. Because Weibo's anti-crawling measures are fairly thorough, and the PC-side Weibo API is too restrictive and hard to crawl, this project crawls the mobile version of Weibo instead: it contains all the data needed and is much friendlier to crawlers.

The program was written on Ubuntu 18.04.5 with Python 3.6.7. Request headers, a proxy IP, and a proxy account are set up to disguise the crawler; Selenium drives Chrome to fetch the pages, pyquery parses the fetched pages to extract the data, and the non-relational database MongoDB stores the results.
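As a minimal sketch of this disguise, a plain requests call might look like the following; the proxy address is a placeholder and the User-Agent string is the one used later in the login code.

```python
import requests

# Placeholder proxy address and a Chrome User-Agent string; substitute your own values.
proxy = 'http://123.45.67.89:8888'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/76.0.3809.100 Safari/537.36'
}
proxies = {'http': proxy, 'https': proxy}

# Send a disguised request to the mobile Weibo site.
r = requests.get('https://weibo.cn', headers=headers, proxies=proxies, timeout=3)
print(r.status_code)
```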

## Main techniques

  1. Requests library: a third-party HTTP library used to fetch pages by simulating a browser sending requests to the server.

  2. Selenium library: an automated testing tool that works with the Chrome browser and ChromeDriver to simulate browser actions such as clicking and scrolling.

  3. pyquery library: a robust parsing library; once the page source has been crawled, CSS selectors are used to extract the required information from it.

  4. MongoDB database: a non-relational, key-value based database with no coupling between records; it is high-performance, stores JSON-like documents, and is very flexible.

  5. PyMongo library: the library used to exchange data between the program and MongoDB.

  6. Python threads: a thread is the smallest unit of execution that the operating system can schedule; it is contained within a process and represents a single sequential flow of control. The program implements waiting by putting the thread to sleep.

  7. A proxy IP and proxy accounts are required. It is best not to use free IP proxies such as Xici: most free proxies are unstable and slow, and a proxy that cannot load a page within the specified timeout causes the program to terminate abnormally.

     Weibo accounts can be purchased at http://www.xiaohao.fun/ or http://www.xiaohao.live/.

     IP proxy: http://h.zhimaruanjian.com/getapi/#obtain_ip

## Site Analysis

  1. First observe the URL: a string of digits stays the same while navigating between the user's pages. This string is the user's unique identifier, the uid, which uniquely identifies a user.

  2. The user's post count, following count, and follower count sit in the span with class tc under the div with class tip2.

  3. Each Weibo post lives in a div with class c.

  4. Inside a class-c div there are two further div tags: one displays the text and the other displays the picture.

  5. The total page count is in the div with id pagelist; once the total number of pages is known, the crawler can loop through and crawl every post.

  6. Observing the page-turning operation shows that only the page variable at the end of the URL changes, so paging can be implemented by varying it.

  7. To reach the user's details page, simply append the info suffix to the original URL.

  8. All of the user's personal information can then be extracted from the corresponding tags on the details page (the sketch after this list shows the URL patterns).
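As a concrete illustration of these URL patterns, here is a small sketch; the uid is a made-up placeholder.

```python
uid = '1234567890'   # placeholder: a real uid is the string of digits seen in the URL

home_url = 'http://weibo.cn/' + uid             # the user's home page
info_url = 'http://weibo.cn/' + uid + '/info'   # the user's details page (info suffix)
page_url = home_url + '?page=3'                 # page 3 of the user's posts (only page changes)

print(home_url, info_url, page_url, sep='\n')
```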

## Program flow chart

Flow chart drawn with ProcessOn (image not reproduced here).

## Programming

### Database Selection

The non-relational database MongoDB is used: it is key-value based, there is no coupling between records, performance is high, it stores JSON-like documents, and it is very flexible. When storing crawler data, some fields may be missing because extraction failed, because of network problems, or because the user never provided them, and the set of fields may also need adjusting over time. In addition, the data can contain nested relationships; a relational database would require designing tables in advance and serialising any nested data before it could be stored, which is inconvenient. A non-relational database avoids this trouble and is simpler and more efficient.

All text and image URLs are stored in the weibo database. Under this database a new collection is created for each user, named after the user's nickname, and each document in the collection is keyed by the post's publishing time. Images cannot be stored in the database directly, so they are downloaded separately into a folder named after the user's nickname; each image file is also named after the post's publishing time, which makes it easy to find the images belonging to a given post.
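The snippets that follow are assumed to share a single module; the import list below is a reconstruction of what they need rather than part of the original code.

```python
# Assumed shared imports for the snippets below.
import os
import time

import pymongo
import requests
from pyquery import PyQuery as pq
from selenium import webdriver
```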

```python
client = pymongo.MongoClient(host='localhost', port=27017)
db = client.weibo                  # select the weibo database
id = input('Enter the Weibo user id: ')
global collection                  # the collection is shared by the crawling functions below
collection = db[id]                # one collection per crawled user
```

### Proxy IP test

Weibo restricts crawlers fairly strictly. To avoid frequent requests to the Weibo servers during debugging being identified as crawler traffic and getting the local IP banned, a proxy IP is generally used. Most free proxies are shared by too many users, slow, and easily detected, so this program uses paid IPs from the Zhima HTTP proxy service. By visiting a site that reports the visitor's IP, we can both check the proxy's speed and confirm that the proxy is actually being used; an exception-handling block is also added to the code.

```python
# Test the proxy IP
def check_ip():
    print('Checking whether the proxy IP is usable...')

    proxies = {
        'http': proxy,
        'https': proxy,
    }

    print('Proxy IP under test: ' + proxy)
    print('...')
    print('Test result:')

    try:
        # Timeout to avoid sticking with an unstable IP
        r = requests.get('http://ip111.cn/',
                         proxies=proxies, timeout=3)
        r.encoding = 'utf-8'

        doc = pq(r.text)
        ip = doc('body > div.container '
                 '> div.card-deck.mb-3.text-center')

        ip = str(ip.find('div:nth-child(1) > div.card-header').text()) + \
             " : " + \
             str(ip.find('div:nth-child(1) > div.card-body > p:nth-child(1)').text())

        print(ip)
        print('The proxy IP works normally...')

    except Exception:
        print('Sorry, this IP cannot be used; please change the IP and retry')
        os._exit(0)
```
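A minimal usage example, assuming proxy is a module-level variable holding an address from the proxy provider (the value below is a placeholder):

```python
proxy = '123.45.67.89:8888'   # placeholder proxy in ip:port form
check_ip()                    # prints the detected IP, or exits if the proxy is unusable
```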

### Simulated login

The simulated-login module uses Selenium together with ChromeDriver to drive Chrome through the login steps. To cope with Weibo's anti-crawling measures, a proxy and a disguised request header are configured; to keep the main account from being banned, a purchased proxy account is used for crawling. Most sites ban the IPs of users who browse too quickly in order to reduce server load, so thread delays are inserted at appropriate points in the program; waiting in this way makes the crawler behave more like a human.

Whether the login succeeded is determined by trying to open the user's details page. If the login failed, the request is redirected to the login page, whose source contains a div with class login-wrapper. Finding this tag means the current proxy node is too slow or near the end of its lifetime, so the program prompts the user to change the proxy IP, closes the browser, and terminates.

The login function is not retried recursively here because this kind of failure is caused by a faulty proxy node: retrying would most likely fail again, and even if it succeeded, later page requests would still fail for the same network reason. Prompting for a new proxy IP and exiting is therefore the better choice.

If the local IP were used instead of a proxy, recursively retrying the login would be preferable: a local-IP login failure is most likely a momentary network problem, such as congestion that happened to coincide with the login, and it would not affect subsequent requests.

```python
def log_in():

    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument("--proxy-server={0}".format(proxy))  # route Chrome through the proxy
    chromeOptions.add_argument('lang=zh_CN.UTF-8')
    chromeOptions.add_argument(
        'User-Agent:"Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
        'AppleWebKit/537.36 (KHTML, like Gecko) '
        'Chrome/76.0.3809.100 Safari/537.36"'
    )

    global browser
    browser = webdriver.Chrome(chrome_options=chromeOptions)

    try:
        print('Logging in to the mobile version of Sina Weibo...')

        # The login URL
        url = 'https://passport.weibo.cn/signin/login'
        browser.get(url)
        time.sleep(3)

        # Find the username field, clear it, then type the account name
        username = browser.find_element_by_css_selector('#loginName')
        time.sleep(2)
        username.clear()
        username.send_keys('[email protected]')  # enter your own account here

        # Find the password field and type the password
        password = browser.find_element_by_css_selector('#loginPassword')
        time.sleep(2)
        password.send_keys('91ih6d7j')  # enter your own password here

        # Click the login button
        browser.find_element_by_css_selector('#loginAction').click()

        # The 15-second pause here is important: after clicking login, Weibo may show a
        # CAPTCHA, which is awkward to handle programmatically, so it is solved by hand.
        time.sleep(15)

    except Exception:
        print('Login failed; please check the network speed or whether the proxy IP is stable!!!')
        os._exit(0)

    browser.get('http://weibo.cn/' + id + '/info')
    doc = pq(browser.page_source, parser='html')

    if doc.find('.login-wrapper'):
        # The details page redirected back to the login page, so the login did not succeed
        print('Login failed; please check the network speed or whether the proxy IP is stable!!!')
        browser.close()
        os._exit(0)

    print('Login complete!')
```

### Fetching the user's detailed information

Observing the URL shows that appending the user's uid to the Weibo domain opens the user's home page, and appending info after the uid opens the details page, so the source of the details page is easy to obtain. Two places in this module can raise exceptions. One is requesting the details page, which can fail because of network problems such as a slow connection, or because the proxy node has expired or is unstable; these exceptions indicate that the network is down. The other is inserting into MongoDB; this exception can be skipped so that the remaining data is still crawled, without restarting the program from the beginning.

The module first builds an empty dictionary for temporary storage, drives Chrome to request the page, builds a pyquery object from the returned page source, and then parses the source with CSS selectors to extract the data.

```python
def get_basic_info(id):

    dict = {
        '_id': '基本信息'   # the profile document is stored under this fixed key ("basic info")
    }

    global url
    url = 'http://weibo.cn/' + id

    try:
        browser.get(url + '/info')  # the page may fail to load here

    except TimeoutError:
        print('Request timed out; the proxy node may be unstable, please switch nodes')
        os._exit(0)

    except Exception:
        print('An error occurred: invalid uid')
        os._exit(0)

    doc = pq(browser.page_source, parser='html')

    # If an extra element with class tm is present, the child indexes below shift by one
    if doc.find('.tm'): j = 1
    else: j = 0

    info = doc('body > div:nth-child(' + str(6 + j) + ')').text()

    nickname = str(info).split('\n', 1)[0]
    print(nickname)

    dict[nickname.split(':')[0]] = nickname.split(':')[1]

    # Create the folder (named after the nickname) that will hold this user's images
    make_dir(dict[nickname.split(':')[0]])

    dict['uid'] = id

    other_info = str(info).split('\n', 1)[1].strip()[:-5]

    img = doc('body > div:nth-child(' + str(3 + j) + ') > img').attr('src')
    img = '头像:' + str(img)       # "avatar: <url>"
    print(img)

    dict[img.split(':', 1)[0]] = img.split(':', 1)[1]

    rank = doc('body > div:nth-child(' + str(4 + j) + ')').text()
    rank = str(rank).split('\n', 1)[0].split(':')[1].strip()[:2]
    rank = '会员等级:' + rank       # "membership level: <rank>"
    print(rank)

    dict[rank.split(':')[0]] = rank.split(':')[1]

    other_info = other_info.replace('：', ':')   # normalise full-width colons to ASCII
    print(other_info)
    other_info = other_info.strip()

    for a in other_info.split('\n'):
        dict[a.split(':')[0]] = a.split(':')[1]

    browser.get(url)
    doc = pq(browser.page_source, parser='html')

    follow_and_fans = str(doc('body > div.u > div').text()).strip().split('分')[0]
    follow_and_fans = follow_and_fans.strip()    # strip leading/trailing whitespace

    for a in follow_and_fans.split():
        t = a.split('[')[1]
        dict[a.split('[')[0] + '数'] = t.split(']')[0]   # e.g. following count, fan count

    if save_image(nickname.split(':')[1] + '的头像', dict[img.split(':', 1)[0]]):
        print(nickname.split(':')[1] + "'s avatar downloaded successfully")

    try:
        collection.insert_one(dict)

    except Exception:
        print('Error while storing the basic information into MongoDB')
        os._exit(0)

    finally:
        tot_page = doc('#pagelist > form > div').text()
        tot_page = str(tot_page).split('/')[1][:-1]
        return tot_page
```
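get_basic_info and get_weibo (below) call make_dir and save_image, which are not shown in this post. A minimal sketch of what they might look like, assuming the images are fetched with requests and written into the nickname folder, is:

```python
def make_dir(folder_name):
    # Create the folder (named after the user's nickname) that will hold the images.
    global img_dir
    img_dir = folder_name
    if not os.path.exists(img_dir):
        os.makedirs(img_dir)


def save_image(name, img_url):
    # Download one image into the user's folder, naming it after the post time;
    # return True on success so callers can print a confirmation.
    try:
        r = requests.get(img_url, timeout=5)
        path = os.path.join(img_dir, str(name).replace('/', '-') + '.jpg')
        with open(path, 'wb') as f:
            f.write(r.content)
        return True
    except Exception:
        return False
```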

### Fetching all of the user's Weibo posts

Studying the source shows that each post lives in a div with class c, so the number of such tags gives the number of posts on a page, after excluding the first div and the last two. From each tag the post's text and images can be extracted. Whether an image is present changes where the like, repost, and comment counts are located, so the two cases are handled with an if-else, and the image is downloaded when present. The source offers two download addresses for each picture: the original-size image requires cookies, while a compressed version does not. Comparing the URL that the original-image link redirects to with the compressed-image URL shows that replacing wap180 with large in the compressed URL yields the original image, which is a neat way to obtain the full-size picture. By modifying the page parameter in the URL the crawler walks through every page of posts; after each page the thread sleeps for 0.5 s to mimic the network wait of a human reader.

```python
def get_weibo(tot_page):

    for k in range(1, int(tot_page) + 1):

        browser.get(url + '?page=' + str(k))
        doc = pq(browser.page_source, parser='html')

        c = doc('.c')
        lens = len(c)
        c = c.items()

        i = 0; j = 1

        for cc in c:

            dict = {}
            i = i + 1

            # Skip the first div.c and stop before the last two (they are not posts)
            if (i == 1): continue
            if (i == lens - 1): break

            print('Crawling page ' + str(k) + ', post ' + str(j) + '...')

            dict['_id'] = cc.find('div > span.ct').text()   # publishing time, used as the key
            dict['info'] = cc.find('.ctt').text()           # the post text
            dict['img'] = cc.find('div:nth-child(2) > a:nth-child(1) > img').attr('src')

            if dict['img']:
                # Swap wap180 for large to get the original-size image URL
                dict['img'] = dict['img'].replace('wap180', 'large')

                if save_image(dict['_id'], dict['img']):
                    print('Post image downloaded successfully...')

                # With an image present, likes / reposts / comments sit in different child tags
                dict['active'] = str(cc.find('div:nth-child(2) > a:nth-child(4)').text()) + ',' + \
                                 str(cc.find('div:nth-child(2) > a:nth-child(5)').text()) + ',' + \
                                 str(cc.find('.cc').text())
            else:
                dict['active'] = str(cc.find('div > a:nth-child(3)').text()) + ',' + \
                                 str(cc.find('div > a:nth-child(4)').text()) + ',' + \
                                 str(cc.find('.cc').text())

            try:
                collection.insert_one(dict)
                print('Page ' + str(k) + ', post ' + str(j) + ' stored successfully...')
            except Exception:
                print('Page ' + str(k) + ', post ' + str(j) + ' failed to store!!!')

            j = j + 1
            print('')
            time.sleep(0.5)
```
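The post does not show how the pieces are wired together; under the same assumptions as above, a minimal driver might look like this (the proxy value is a placeholder, and the database setup repeats the Database Selection snippet).

```python
if __name__ == '__main__':
    proxy = '123.45.67.89:8888'     # placeholder proxy address (ip:port)

    # Database setup, as in the Database Selection section
    client = pymongo.MongoClient(host='localhost', port=27017)
    db = client.weibo
    id = input('Enter the Weibo user id: ')
    collection = db[id]

    check_ip()                      # verify that the proxy works
    log_in()                        # log in to mobile Weibo through Selenium
    tot_page = get_basic_info(id)   # store the profile and get the total page count
    get_weibo(tot_page)             # crawl every page of posts
    browser.close()                 # shut down the Selenium-driven Chrome
```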

## Results

Because Robo 3T's query view only displays the first 50 records, you can run a query from the MongoDB command line to view all of the data.
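For example, all documents of one user can also be listed with PyMongo; the collection name below is a placeholder for the uid that was crawled.

```python
import pymongo

client = pymongo.MongoClient(host='localhost', port=27017)
db = client.weibo
for doc in db['1234567890'].find():   # placeholder collection name: the crawled user's id
    print(doc)
```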


Original post: www.cnblogs.com/sstealer/p/11498037.html