Boosting blog traffic with a Python requests + XPath crawler

This post analyzes a free IP-proxy site and uses the proxies it lists to visit a CSDN blog, so that the same blog is visited from different IPs. It is mainly for entertainment, so feel free to play with it.

First, as preparation, set the User-Agent header:

#1.headers
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0'}

Next, search for an IP-proxy site (I use https://www.kuaidaili.com/free), parse the page, and extract the IP, port, and type of each proxy into lists:

 

#1. Get the proxy IP addresses
html=requests.get('https://www.kuaidaili.com/free').content.decode('utf8')
tree = etree.HTML(html)
ip = tree.xpath("//td[@data-title='IP']/text()")
port=tree.xpath("//td[@data-title='PORT']/text()")
model=tree.xpath("//td[@data-title='类型']/text()")
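The requests library expects proxies as a dict mapping a scheme to an address. As a side note, here is a minimal sketch of how the three parallel lists scraped above could be combined into requests-style proxy dicts; the `build_proxies` name is my own, not from the original post, and it assumes the type column holds values like 'HTTP' or 'HTTPS':

```python
def build_proxies(models, ips, ports):
    """Hypothetical helper: zip the scraped type/ip/port lists into
    requests-style proxy dicts, e.g. {'http': 'http://1.2.3.4:8080'}."""
    proxy_dicts = []
    for model, ip, port in zip(models, ips, ports):
        scheme = model.strip().lower()  # 'HTTP' -> 'http'
        proxy_dicts.append({scheme: '{}://{}:{}'.format(scheme, ip, port)})
    return proxy_dicts
```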

Then parse the URL of each article on the personal blog and save them to a list:

 

#2. Get the CSDN article urls -> ChildrenUrl[]
url='https://blog.csdn.net/weixin_43576564'
response=requests.get(url,headers=headers)
Home=response.content.decode('utf8')
Home=etree.HTML(Home)
urls=Home.xpath("//div[@class='article-item-box csdn-tracking-statistics']/h4/a/@href")
ChildrenUrl=[]
for i in range(1, len(urls)):
    ChildrenUrl.append(urls[i])
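Scraped href lists can contain repeats or non-article links, which would waste proxy requests. As an illustration, a small hypothetical helper (the `clean_article_urls` name is mine, not from the original post) could deduplicate while preserving order:

```python
def clean_article_urls(hrefs):
    """Hypothetical helper: drop duplicate hrefs while keeping the
    original order, and skip anything that is not an http(s) link."""
    seen = set()
    cleaned = []
    for href in hrefs:
        if href.startswith('http') and href not in seen:
            seen.add(href)
            cleaned.append(href)
    return cleaned
```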

 

Finally, a for loop visits every article of the personal blog through each proxy IP in turn, so that one IP visits all the articles once. The total view count is read from the "my blog" page, so changes can be monitored in real time; a task count is set and its progress is displayed live; and the sleep interval is randomized with random.randint() to make the spider safer. The full code is as follows:

import os
import time
import random
import requests
from lxml import etree
#Preparation
#1.headers
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0'}

#1. Get the proxy IP addresses
html=requests.get('https://www.kuaidaili.com/free').content.decode('utf8')
tree = etree.HTML(html)
ip = tree.xpath("//td[@data-title='IP']/text()")
port=tree.xpath("//td[@data-title='PORT']/text()")
model=tree.xpath("//td[@data-title='类型']/text()")

#2. Get the CSDN article urls -> ChildrenUrl[]
url='https://blog.csdn.net/weixin_43576564'
response=requests.get(url,headers=headers)
Home=response.content.decode('utf8')
Home=etree.HTML(Home)
urls=Home.xpath("//div[@class='article-item-box csdn-tracking-statistics']/h4/a/@href")
ChildrenUrl=[]
for i in range(1, len(urls)):
    ChildrenUrl.append(urls[i])


oldtime = time.gmtime()

browses = int(input("Enter the desired number of visits: "))
Browse = 0
#3. Loop: disguise the ip and crawl the articles
for i in range(1, len(model)):
    # build the proxy for this ip
    proxies = {model[i]: '{}:{}'.format(ip[i], port[i])}
    for curl in ChildrenUrl:
        try:
            Browse += 1
            print("Progress: {}/{}".format(Browse, browses), end="\t")
            # visit the article
            response = requests.get(curl, headers=headers, proxies=proxies)
            # get the total view count
            look = etree.HTML(response.content)
            number = look.xpath("//div[@class='grade-box clearfix']/dl[2]/dd/text()")
            count = number[0].strip()
            print("Total views: {}".format(count), end="\t")
            '''
            to be re-implemented

            # check the current IP once per proxy
            if curl == ChildrenUrl[5]:
                ipUrl = 'http://www.ip138.com/'
                response = requests.get(ipUrl, proxies=proxies)
                iphtml = response.content
                ipHtmlTree = etree.HTML(iphtml)
                ipaddress = ipHtmlTree.xpath("//p[@class='result']/text()")
                print(ip[i], ipaddress)
            '''
            sec = random.randint(5, 30)
            print("Sleeping for {} seconds".format(sec), end="\t")
            time.sleep(sec)
            print("Currently visiting: {}".format(curl))
            if browses == Browse:
                print("Crawling task finished, took {}s in total".format(int(time.perf_counter())))
                os._exit(0)

        except:
            print('error')
            os._exit(0)

    # print the current proxy ip
    print(proxies)
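One caveat: the script above exits with os._exit(0) on the first exception, which is brittle because free proxies often time out. A minimal alternative sketch (the `fetch` helper is my own, not part of the original code) wraps requests.get with a timeout and a simple retry/backoff, so one dead proxy does not abort the whole run:

```python
import time
import requests

def fetch(url, headers=None, proxies=None, timeout=10, retries=3):
    """Hypothetical helper: try a request up to `retries` times and
    return None on repeated failure instead of crashing the run."""
    for attempt in range(retries):
        try:
            return requests.get(url, headers=headers,
                                proxies=proxies, timeout=timeout)
        except requests.RequestException:
            time.sleep(2 ** attempt)  # simple exponential backoff
    return None
```

The inner loop could then call fetch() and simply `continue` to the next article when it returns None, instead of exiting.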

Screenshot of an actual run:

 


Origin www.cnblogs.com/yxkj/p/11260383.html