[Study notes] Python3 crawler

Case 1: Python3 crawler proxy pool

1. Find a public proxy IP site

For example, the Xici free proxy site (xicidaili.com).

2. Write XPath expressions to filter out the IP, port, and protocol columns

//tr/td[2]/text()

//tr/td[3]/text()

//tr/td[6]/text()
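
These expressions can be sanity-checked before they go into the spider. Below is a minimal sketch with requests and lxml, assuming the listing page of the site used in the code below is still reachable:

import requests
from lxml import etree

# Quick check of the three column XPaths against one listing page.
html_text = requests.get('https://www.xicidaili.com/nn/1',
                         headers={'User-Agent': 'Mozilla/4.0'}).text
tree = etree.HTML(html_text)
ips = tree.xpath('//tr/td[2]/text()')       # IP column
ports = tree.xpath('//tr/td[3]/text()')     # port column
schemes = tree.xpath('//tr/td[6]/text()')   # HTTP / HTTPS column
print(list(zip(ips, ports, schemes))[:5])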

3. Write code

import requests
import os, time, random
from fake_useragent import UserAgent
from lxml import etree

class ProxySpider(object):
    def __init__(self):
        self.baseurl = 'https://www.xicidaili.com/nn/{}'
        self.xpathip = '//tr/td[2]/text()'
        self.xpathport = '//tr/td[3]/text()'
        self.xpathhttps = '//tr/td[6]/text()'
        self.ua = UserAgent()

    def request_html(self,url):
        try:
            header = {'User-Agent':'Mozilla/4.0'}
            html = requests.get(url=url, headers=header).text
            return html
        except Exception as e:
            print(e)
            return 'error'
    def proxy_request_html(self,url,ip,isHttps):
        time.sleep(random.randint(1, 2))
        proxy = {}
        if isHttps is True:
            proxy = {
                'https': ip
            }
        else:
            proxy = {
                'http': ip
            }
        try:
            header = {'User-Agent': self.ua.random}
            html = requests.get(url=url,headers=header, proxies=proxy, timeout=8)
            return True
        except Exception as e:
            print (ip, e)
            return False

    def get_html(self,url):
        print(url)
        html = self.request_html(url)
        self.parse_html(html)

    def parse_html(self,html):
        item_ip = []
        item_port = []
        item_http = []

        xpathobj = etree.HTML(html)
        iplist = xpathobj.xpath(self.xpathip)

        for ip in iplist:
            item_ip.append(ip)

        port_list = xpathobj.xpath(self.xpathport)
        for port in port_list:
            item_port.append(port)

        httpsStrs = xpathobj.xpath(self.xpathhttps)
        for is_https in httpsStrs:
            item_http.append(is_https)

        for li in range(len(item_ip)):
            test_ip = item_ip[li] + ":" + item_port[li]
            print('Start to detect ip')
            if item_http[li] == "HTTPS":
                self.test_proxy(test_ip, True)
            elif item_http[li] == "HTTP":
                self.test_proxy(test_ip, False)

    def test_proxy(self,proxy_address,isHttps):
        ret = self.proxy_request_html('https://www.baidu.com/',proxy_address,isHttps)
        if ret is True:
            with open('proxy.log','a+') as f:
                f.write(proxy_address+'\n')
            print('Proxy node available', proxy_address)

    def run(self):
        url = self.baseurl.format(1)
        self.get_html(url)

if __name__ == '__main__':
    spider = ProxySpider()
    spider.run()

4. Results

 

From the log it can be seen that the HTTP proxies are mostly usable, while the HTTPS ones are mostly not.
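
The verified nodes written to proxy.log can then be fed back into requests as a proxies dict. A minimal sketch, reusing the file written by test_proxy() above and the observation that the HTTP nodes are the usable ones (httpbin.org is just a public echo endpoint used for testing):

import random
import requests

# Pick a random verified node from the log written by ProxySpider.test_proxy().
with open('proxy.log') as f:
    nodes = [line.strip() for line in f if line.strip()]

proxies = {'http': random.choice(nodes)}
resp = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=8)
print(resp.text)   # should show the proxy's address rather than your own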

************************************************************************************************************************************************************************************

Case 2: Python3 crawler - Baidu Tieba - XPath

1. Use the XPath plugin to filter

Click the text or image you want to extract, right-click and choose the XPath option, then press F12 and tweak the generated expression as needed.

2. Write code

Response Content

We can read the content of the server’s response. Consider the GitHub timeline again:

>>> import requests

>>> r = requests.get('https://api.github.com/events')
>>> r.text
'[{"repository":{"open_issues":0,"url":"https://github.com/...

Requests will automatically decode content from the server. Most unicode charsets are seamlessly decoded.

When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property:

>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1' 

If you change the encoding, Requests will use the new value of r.encoding whenever you call r.text. You might want to do this in any situation where you can apply special logic to work out what the encoding of the content will be. For example, HTML and XML have the ability to specify their encoding in their body. In situations like this, you should use r.content to find the encoding, and then set r.encoding. This will let you use r.text with the correct encoding.
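
For instance, an HTML page may declare its charset in a meta tag. A rough sketch of sniffing it from r.content and then setting r.encoding (the regex is only an illustration, not a full HTML parser):

import re
import requests

r = requests.get('https://example.com/')

# Look for a charset=... declaration in the raw bytes of the body.
m = re.search(rb'charset=["\']?([\w-]+)', r.content)
if m:
    r.encoding = m.group(1).decode('ascii')

print(r.encoding)
print(r.text[:200])   # now decoded with the encoding declared in the body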

Requests will also use custom encodings in the event that you need them. If you have created your own encoding and registered it with the codecs module, you can simply use the codec name as the value of r.encoding and Requests will handle the decoding for you.
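
Purely to show the mechanism, the sketch below registers a codec under a made-up name (it simply reuses UTF-8) and assigns it to r.encoding:

import codecs
import requests

# A search function that maps a custom name onto an existing codec.
def _lookup(name):
    if name == 'my-custom-utf8':
        return codecs.lookup('utf-8')
    return None

codecs.register(_lookup)

r = requests.get('https://api.github.com/events')
r.encoding = 'my-custom-utf8'   # r.text is now decoded through the registered codec
print(r.text[:80])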

Binary Response Content

You can also access the response body as bytes, for non-text requests:

>>> r.content
b'[{"repository":{"open_issues":0,"url":"https://github.com/...

The gzip and deflate transfer-encodings are automatically decoded for you.

For example, to create an image from binary data returned by a request, you can use the following code:

>>> from PIL import Image
>>> from io import BytesIO
>>> i = Image.open(BytesIO(r.content))

JSON Response Content

There’s also a builtin JSON decoder, in case you’re dealing with JSON data:

>>> import requests

>>> r = requests.get('https://api.github.com/events')
>>> r.json()
[{'repository': {'open_issues': 0, 'url': 'https://github.com/...

In case the JSON decoding fails, r.json() raises an exception. For example, if the response gets a 204 (No Content), or if the response contains invalid JSON, attempting r.json() raises ValueError: No JSON object could be decoded.

It should be noted that the success of the call to r.json() does not indicate the success of the response. Some servers may return a JSON object in a failed response (e.g. error details with HTTP 500). Such JSON will be decoded and returned. To check that a request is successful, use r.raise_for_status() or check that r.status_code is what you expect.
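
A minimal sketch of that check before touching the JSON:

import requests

r = requests.get('https://api.github.com/events')

# Either raise on any 4xx/5xx status ...
r.raise_for_status()

# ... or branch on the status code explicitly before decoding.
if r.status_code == requests.codes.ok:
    events = r.json()
    print(len(events), 'events')
else:
    print('request failed:', r.status_code)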

Raw Response Content

In the rare case that you’d like to get the raw socket response from the server, you can access r.raw. If you want to do this, make sure you set stream=True in your initial request. Once you do, you can do this:

>>> r = requests.get('https://api.github.com/events', stream=True)
>>> r.raw
<urllib3.response.HTTPResponse object at 0x101194810>
>>> r.raw.read(10)
'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'

In general, however, you should use a pattern like this to save what is being streamed to a file:

with open(filename, 'wb') as fd:
    for chunk in r.iter_content(chunk_size=128):
        fd.write(chunk)

Using Response.iter_content will handle a lot of what you would otherwise have to handle when using Response.raw directly. When streaming a download, the above is the preferred and recommended way to retrieve the content. Note that chunk_size can be freely adjusted to a number that may better fit your use cases.
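
Putting the streaming pieces together, a small download helper might look like this (a sketch; the URL, filename and chunk size are arbitrary choices):

import requests

def download(url, filename, chunk_size=8192):
    # stream=True keeps the body out of memory until iter_content pulls it in chunks.
    with requests.get(url, stream=True, timeout=10) as r:
        r.raise_for_status()
        with open(filename, 'wb') as fd:
            for chunk in r.iter_content(chunk_size=chunk_size):
                fd.write(chunk)

download('https://api.github.com/events', 'events.json')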

from lxml import etree
import re,time,os,random
import requests
from urllib import parse
from fake_useragent import UserAgent

class BaiduTiebaSpider(object):
    def __init__(self):
        self.baseurl = r'http://tieba.baidu.com/f?kw={}&pn={}'
        self.title_baseurl = r'https://tieba.baidu.com{}'
        self.picXpath = r'//cc//img[@class="BDE_Image"]/@src'
        self.titleurlXpath = r'//li//a[@class="j_th_tit "]/@href'
        self.videoXpath = r'//div[@class="video_src_wrap_main"]/video/@src'
        self.ua = UserAgent()
        self.savePath = r'/home/user/work/spider/baidu/BaiduTieba/'

    def get_html(self,url):
        # header = {'User-Agent':self.ua.random}
        header = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko'}
        res = requests.get(url=url,headers=header).content
        return res

    def parse_html(self,html):
        parse = etree.HTML(html)
        titlelink_list = parse.xpath(self.titleurlXpath)
        for li in titlelink_list:
            titleurl = self.title_baseurl.format(li)
            print(titleurl)
            self.save_html(titleurl)
        time.sleep(random.randint(2, 3))

    def save_html(self,url):
        html = self.get_html(url)
        parse = etree.HTML(html)
        piclinks = parse.xpath(self.picXpath)
        for pics in piclinks:
            self.save_img(pics,self.savePath+pics[-10:])
        videolinks = parse.xpath(self.videoXpath)
        for videos in videolinks:
            self.save_img(videos,self.savePath+videos[-10:])

    def save_img(self,imgurl,filename):
        img = self.get_html(imgurl)
        with open(filename, 'wb') as f:
            f.write(img)
        print(filename, 'Download success')

    def run(self):
        name = input('Enter the name of the tieba to query> ')
        start = int(input('Start Page> '))
        end = int(input('End Page> '))
        for page in range(start, end + 1):
            # Tieba paginates with pn in steps of 50 posts per page
            mainurl = self.baseurl.format(parse.quote(name), (page - 1) * 50)
            print(mainurl)
            pagehtml = self.get_html(mainurl)
            self.parse_html(pagehtml)

if __name__ == '__main__':
    spider = BaiduTiebaSpider()
    spider.run()

Note the User-Agent header: the random fake_useragent value is commented out and a fixed, older IE-style UA is used instead, because Tieba can serve markup that the XPaths above will not match when other user agents are sent.

3. Results

 

*****************************************************************************************************************************************************************************************

Case 3: Python3 crawler - Lianjia second-hand housing - XPath

 

1. XPath is a language for finding information in XML documents; it can be used to traverse elements and attributes in an XML document.

XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions.

An understanding of XPath is therefore the foundation of many advanced XML applications.
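
A tiny, self-contained illustration of such expressions, using the same lxml library the spiders here rely on (the XML snippet is made up):

from lxml import etree

xml = '''<rooms>
  <room city="sz"><name>A</name><price>500</price></room>
  <room city="sz"><name>B</name><price>650</price></room>
</rooms>'''
tree = etree.fromstring(xml)

# Select text nodes and attributes by path, and filter with a predicate.
print(tree.xpath('//room/name/text()'))             # ['A', 'B']
print(tree.xpath('//room[price>600]/name/text()'))  # ['B']
print(tree.xpath('//room/@city'))                   # ['sz', 'sz']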

 

2. The XPath plugin is available in the Chrome Web Store and also in the 360 browser extension center.

Open a webpage and press F12; the XPath helper panel appears at the bottom of the developer tools.

3. XPath filtering

Test the expression in the plugin until it selects what you want, then put it into the code.
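
A quick way to iterate is to fetch the listing page in a small script and keep adjusting the expression there (a sketch; the class name comes from the spider below and may change whenever Lianjia updates its markup):

import requests
from lxml import etree

# Fetch once, then keep re-running the XPath against the source while tuning it.
html = requests.get('https://sz.lianjia.com/ershoufang/pg1/',
                    headers={'User-Agent': 'Mozilla/5.0'}).text
tree = etree.HTML(html)

rows = tree.xpath('//ul[@class="sellListContent"]/li')
print(len(rows), 'listings matched')   # 0 means the expression (or the page) changed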

 

4. Python3 code

 

import requests
import random,re,time
from fake_useragent import UserAgent
from lxml import etree

class LianJiaSpider(object):
    def __init__(self):
        self.baseurl = 'https://sz.lianjia.com/ershoufang/pg{}/'
        self.ua = UserAgent()


    def get_html(self,url):
        header = {'User-Agent':self.ua.random}
        html = requests.get(url,headers=header,timeout=5).text
        # html.encoding = 'utf-8'
        self.parse_html(html)

    def parse_html(self,html):
        parse = etree.HTML(html)

        li_list = parse.xpath('//ul[@class="sellListContent"]/li[@class="clear LOGVIEWDATA LOGCLICKDATA"]')

        item = {}
        for i in li_list:
            item['name'] = i.xpath('.//a[@data-el="region"]/text()')[0]
            info_list = i.xpath('.//div[@class="houseInfo"]/text()')[0].split('|')
            item['model'] = info_list[0].strip()
            item['area'] = info_list[1].strip()
            item['direction'] = info_list[2].strip()
            item['perfect'] = info_list[3].strip()
            item['floor'] = info_list[4].strip()
            item['age'] = info_list[5].strip()
            item['address'] = i.xpath('.//div[@class="positionInfo"]/a/text()')[0].strip()
            item['total'] = i.xpath('.//div[@class="totalPrice"]/span/text()')[0].strip()
            item['unit'] = i.xpath('.//div[@class="unitPrice"]/span/text()')[0].strip()[2:-4]

            print(item)

    def run(self):
        url = self.baseurl.format(1)
        self.get_html(url)

if __name__ == '__main__':
    spider = LianJiaSpider()
    spider.run()

5. Results

 

 

 

 

So expensive; I can't afford any of these.

 

********************************************************************************************************************************************************************************************************************

Case 4: Python3 crawler - Baidu Images

 

import requests
import re,time,random,os
from urllib import parse
from fake_useragent import UserAgent

class BaiduImgSpider(object):
    def __init__(self):
        self.baseurl = 'https://image.baidu.com/search/index?tn=baiduimage&word={}'
        self.count = 1
        self.ua = UserAgent()
        self.savepath = '/home/user/work/spider/day03/'
        self.re_str = r'{"thumbURL":"(.*?)","replaceUrl":'

    def get_html(self,name,orgname):
        header = {'User-Agent':self.ua.random}
        url = self.baseurl.format(name)
        html = requests.get(url=url, headers=header).text
        pattern = re.compile(self.re_str, re.S)
        img_list = pattern.findall(html)
        path = self.savepath+orgname
        if not os.path.exists(path):
            os.mkdir(path)
        for img_link in img_list:
            print(img_link)
            self.save_img(img_link,path)
            time.sleep(random.randint(1, 2))

    def save_img(self,url,path):
        header = {'User-Agent': self.ua.random}
        html = requests.get(url=url,headers=header).content
        filename = path+"/"+str(self.count)+'.jpg'
        with open(filename,'wb') as f:
            f.write(html)
            print ('Successfully downloaded', filename)
        self.count += 1

    def run(self):
        search_name = input('Enter the keyword to search> ')
        word = parse.quote(search_name)
        self.get_html(word, search_name)


if __name__ == '__main__':
    spider = BaiduImgSpider()
    spider.run()

 

The approach is straightforward: the image URLs are embedded in the search results page as JSON-like text, so a regular expression pulls the thumbURL values out directly.
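
A sketch of that expression on a made-up fragment shaped like the embedded data:

import re

page_fragment = ('... {"thumbURL":"https://img.example.com/a.jpg","replaceUrl": ...} '
                 '... {"thumbURL":"https://img.example.com/b.jpg","replaceUrl": ...}')

# Same pattern as self.re_str above: a non-greedy group between thumbURL and replaceUrl.
pattern = re.compile(r'{"thumbURL":"(.*?)","replaceUrl":', re.S)
print(pattern.findall(page_fragment))
# ['https://img.example.com/a.jpg', 'https://img.example.com/b.jpg']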

 

 

****************************************************************************************************************************************************************************************************************************

Case 5: Python3 crawler - Movie Paradise (dytt8.net)

from urllib import request
import re,time,random
from fake_useragent import UserAgent

class DyTTSpider(object):
    def __init__(self):
        self.base_url = 'https://www.dytt8.net'
        self.url_one = 'https://www.dytt8.net/html/gndy/dyzz/list_23_{}.html'
        self.ua = UserAgent()

    def get_html(self,url):
        header = {'User-Agent':self.ua.random}
        req = request.Request(url,headers=header)
        res = request.urlopen(req)
        ret = res.read().decode('gb2312','ignore')
        return ret

    def re_html(self, html, restr):
        pattern = re.compile(restr, re.S)
        ret = pattern.findall(html)
        return ret

    def parse_html(self, one_url):
        html_ret = self.get_html(one_url)
        re_str = r'<table .*?<td width="5%".*?<a href="(.*?)" class="ulink">.*?</table>'
        ret_list = self.re_html(html_ret, re_str)
        for link in ret_list:
            print(link)
            self.parse_second(self.base_url + link)
            time.sleep(random.randint(2, 3))

    def parse_second(self, detail_url):
        item = {}
        html_ret = self.get_html(detail_url)
        re_str = r'<div class="title_all".*?<font.*?>(.*?)</font>.*?<td style="WORD-WRAP:.*?gaygpquf="(.*?)" clickid.*?</td>'
        two_list = self.re_html(html_ret, re_str)
        # findall with two groups returns a list of (title, download link) tuples
        if two_list:
            item['name'] = two_list[0][0].strip()
            item['dlink'] = two_list[0][1].strip()
            print(item)

    def run(self):
        geturl = self.url_one.format(1)
        self.parse_html(geturl)

if __name__ == '__main__':
    dy = DyTTSpider()
    dy.run()

 ************************************************************************************************************************************************************

Case 6: Python3 crawler - Youdao Translate

import requests
import random,time
from hashlib import md5
from fake_useragent import UserAgent

'''
            var t = n.md5(navigator.appVersion)
          , r = "" + (new Date).getTime()
          , i = r + parseInt(10 * Math.random(), 10);
        return {
            ts: r,
            bv: t,
            salt: i,
            sign: n.md5("fanyideskweb" + e + i + "Nw(nmmbP%A-r6U3EUn]Aj")
        }
'''

class FanyiSpider(object):
    def __init__(self):
        self.baseurl = 'http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule'
        self.ua = UserAgent()

    def make_formdata_string(self,word):
        formdata = {
            "i": "",
            "from": "AUTO",
            "to": "AUTO",
            "smartresult": "dict",
            "client": "fanyideskweb",
            "salt": "",
            "sign": "",
            "ts": "",
            "bv": "37074a7035f34bfbf10d32bb8587564a",
            "doctype": "json",
            "version": "2.1",
            "keyfrom": "fanyi.web",
            "action": "FY_BY_REALTlME",
        }
        s = md5()
        formdata['i'] = word
        formdata['ts'] = str(int(time.time() * 1000))
        # formdata['bv'] = s.hexdigest()
        formdata['salt'] = formdata['ts'] + str(random.randint(0, 9))
        signstring = "fanyideskweb" + word + formdata['salt'] + "Nw(nmmbP%A-r6U3EUn]Aj"

        s.update(signstring.encode())
        formdata['sign'] = s.hexdigest()

        return formdata

    def make_headerString(self):
        headerdata = {
            "Accept": "application/json, text/javascript, */*; q=0.01",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.9",
            "Connection": "keep-alive",
            "Content-Length": "240",
            "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
            "Cookie": "DICT_UGC=be3af0da19b5c5e6aa4e17bd8d90b28a|; [email protected]; JSESSIONID=abcd4CqXd2rvOfBBNVfgx; OUTFOX_SEARCH_USER_ID_NCOO=928698907.9532578; _ntes_nnid=163fba552b6912766f975a5c9077e584,1587086791577; SESSION_FROM_COOKIE=fanyiweb; YOUDAO_FANYI_SELECTOR=OFF; ___rl__test__cookies=1587095527239",
            "Host": "fanyi.youdao.com",
            "Origin": "http://fanyi.youdao.com",
            "Referer": "http://fanyi.youdao.com/?keyfrom=fanyi-new.logo",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
            "X-Requested-With": "XMLHttpRequest",
        }
        # headerdata['User-Agent'] = self.ua.random;
        return headerdata;

    def request_str(self):
        pass

    def post_html(self,headerdata,formdata):
        ret = requests.post(url=self.baseurl,data=formdata,headers=headerdata)
        print(ret.text)


    def run(self):
        word = input('Enter the text to translate> ')
        formdata = self.make_formdata_string(word)

        self.post_html(self.make_headerString(),formdata)




if __name__ == '__main__':
    spider = FanyiSpider()
    spider.run()
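
For reference, the JS captured in the docstring boils down to a few lines of Python. This is only a condensed sketch of what make_formdata_string() above already computes; the secret suffix comes from the captured JS and will stop working whenever Youdao rotates it:

import time, random
from hashlib import md5

def make_sign(word):
    # ts: current time in milliseconds; salt: ts plus one random digit (mirrors the JS).
    ts = str(int(time.time() * 1000))
    salt = ts + str(random.randint(0, 9))
    # sign = md5("fanyideskweb" + word + salt + secret), as in the captured JS above.
    sign = md5(("fanyideskweb" + word + salt + "Nw(nmmbP%A-r6U3EUn]Aj").encode()).hexdigest()
    return {"ts": ts, "salt": salt, "sign": sign}

print(make_sign("hello"))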

 

 


Origin www.cnblogs.com/nightnine/p/12719581.html