Getting Started with Python Web Crawlers: A Summary

I. Introduction

        Web crawlers, also known as web spiders or web robots, are programs or scripts that automatically fetch information from the World Wide Web according to certain rules. Crawling data means writing a program that simulates a browser surfing the Internet and then letting it fetch the data it finds there. Crawlers can be classified by usage scenario:

  • General-purpose crawler: an important part of a crawling system; it fetches entire pages of data.
  • Focused crawler: built on top of a general-purpose crawler; it crawls only a specific portion of a page.
  • Incremental crawler: detects updates to a website's data and crawls only the newly updated data.

       Some data on a portal website may involve privacy, or the site may want to keep crawlers from doing damage, so a corresponding anti-crawling mechanism is established: strategies or technical means that prevent crawlers from scraping the site's data. Almost as soon as anti-crawling mechanisms appeared, the robots.txt protocol was born. robots.txt is a gentleman's agreement that explicitly states which data on a website may be crawled and which may not. To check a website's robots protocol, visit domain name/robots.txt; for example, Baidu's robots protocol is at https://www.baidu.com/robots.txt. With anti-crawling mechanisms came anti-anti-crawling: by adopting corresponding strategies or technical means, a crawler program can get around a portal's anti-crawling mechanism and obtain its data.
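       As a quick, hedged illustration (not one of the article's own examples), a site's robots.txt can be fetched with an ordinary GET request using the requests library:

# -*- coding:utf-8 -*-
import requests

# Minimal sketch: fetch and print a site's robots.txt to see its crawling rules.
if __name__ == '__main__':
    response = requests.get('https://www.baidu.com/robots.txt')
    print(response.text)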

       Since a crawler fetches information from the World Wide Web, it is necessary to understand the HTTP and HTTPS protocols first. HTTP (Hyper Text Transfer Protocol) is the transfer protocol used to deliver hypertext from a World Wide Web (WWW) server to a local browser; put simply, it is a form of data interaction between server and client. HTTPS (Hyper Text Transfer Protocol Secure) adds an SSL encryption layer on top of HTTP and encrypts the data, making it more secure than HTTP; the relationship between the two is HTTPS = HTTP + TLS/SSL. HTTP carries risks of eavesdropping, tampering, hijacking and so on, whereas HTTPS performs encryption, integrity verification and identity verification to reduce those risks. From the way HTTPS works and its relationship with HTTP, the following differences can be drawn:

  • Cost: HTTPS requires a certificate from a CA (certificate authority); free certificates are relatively rare, and a fee is usually required.
  • Transmission: HTTP transmits data in plain text, while HTTPS is a secure hypertext transfer protocol that transmits data over an encrypted SSL connection.
  • Port: HTTP and HTTPS use completely different connection methods and different ports: the former uses port 80, the latter port 443.
  • Security: an HTTP connection is simple and stateless, while HTTPS is a network protocol built from SSL + HTTP that supports encryption and authentication, and is therefore more secure.

During connection establishment, the headers carry a lot of important information. Common request headers: User-Agent, the identity of the request carrier; Connection, with which the browser tells the server whether to disconnect or keep the connection alive after the request. Common response headers: Content-Type, with which the server tells the browser the type of data being sent back.
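       As a small, hedged illustration (not from the original article), the snippet below shows how a custom User-Agent request header can be sent with requests and how the Content-Type response header can be read back; the User-Agent value is a placeholder to be copied from your own browser:

# -*- coding:utf-8 -*-
import requests

if __name__ == '__main__':
    headers = {
        'User-Agent': 'your User-Agent string',  # placeholder: copy the value from your browser
        'Connection': 'close'                    # ask the server to close the connection afterwards
    }
    response = requests.get('https://www.baidu.com/', headers=headers)
    # The server describes the type of data it returns in the Content-Type response header.
    print(response.headers.get('Content-Type'))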

 

II. Data Crawling

       Among the modules used for network requests, the main ones are the urllib module and the requests module. The latter is more concise and efficient and is now the more commonly used of the two, so data crawling here mainly uses requests. The requests module is a native network-request module in Python; it is powerful, simple and convenient to use, and efficient for crawler work. Its job is to simulate a browser sending a request: when you browse manually, you enter the URL in the address bar and press Enter to send the request, and after the request succeeds the browser displays the response data; requests can send the same request programmatically. The coding steps for requesting a web page are as follows:

  1. Specify the URL
  2. Send the request
  3. Get the response data
  4. Persist the data

       The following introduces data crawling step by step, from simple to more advanced. First, take requesting the Baidu homepage (www.baidu.com) as an example: use the requests library to request the Baidu homepage, print the returned result, and save the response data locally:

# -*-  coding:utf-8 -*-
import requests


# Goal: crawl the data of the Baidu homepage
if __name__ == '__main__':
    # step1: specify the URL
    url = 'https://www.baidu.com/'

    # step2: send the request
    # the get method returns a response object
    response = requests.get(url=url)

    # step3: get the response data
    # .text returns the response data as a string
    page_text = response.text
    print(page_text)

    # step4: persistent storage
    with open('./baidu.html','w',encoding='utf-8') as fp:
        fp.write(page_text)
    print('Finished crawling!')

       The code above obtains the page data of the Baidu homepage. Of course, for most people the important function of a browser is searching for information, so next we take the search keyword "你好" ("Hello") as an example and request the search results page:

# -*- coding:utf-8 -*-

import requests

if __name__ == '__main__':
    # UA disguise: wrap the corresponding User-Agent in a dictionary
    headers = {
        'User-Agent':'your User-Agent string'
    }

    # step1: specify the URL and the query parameters
    url = 'https://www.sogou.com/web'
    query = '你好'
    params = {
        'query': query
    }

    # step2: send the request
    response = requests.get(url=url,params=params,headers=headers)

    # step3: get the response data
    page_text = response.text

    # step4: persistent storage
    fileName = query+'.html'
    with open(fileName,'w',encoding='utf-8') as fp:
        fp.write(page_text)

    print(fileName,'saved successfully!')

       The code above does what a user would do by opening the browser, entering a keyword and searching, and it returns the data of the search results page. It differs from the previous code in two ways: it attaches parameters to the URL, and it sets header information for UA disguise. The UA, or User-Agent, is the identity of the request carrier. Some portals use UA detection as an anti-crawling mechanism: the server inspects the identity of the request carrier, and if it detects that the request comes from a browser it treats it as a normal request; otherwise, if the request is not browser-based, it is considered abnormal, for example a crawler program, and the server may refuse access so that the crawl fails. Therefore the User-Agent is set here as a UA disguise, so that the server believes this is a normal browser request.

       The examples above store the response data as an HTML file; the response data can also be stored as a JSON file with the json library. Take Baidu Translate as an example: when translation text is entered, only part of the page refreshes, which is Ajax at work. You can inspect the response data in the browser developer tools via Inspect -> Network -> XHR, where XHR corresponds to Ajax requests. After receiving the response data, save it as a JSON file. The code is as follows:

# -*- coding:utf-8 -*-

import requests
import json


if __name__ == '__main__':
    # 1. specify the URL
    url = 'https://fanyi.baidu.com/sug'
    # UA disguise: set the identity of the request carrier
    headers = {
        'User-Agent':'your User-Agent string'
    }
    # parameter handling
    kw = input('Enter the text to translate: ')
    data = {
        'kw':kw
    }

    # 2. send the request
    response = requests.post(url=url,data=data,headers=headers)

    # 3. get the response data; the json() method returns a JSON object (a dict)
    res_obj = response.json()
    print(res_obj)

    # 4. persistent storage
    fileName = kw+'.json'
    fp = open(fileName,'w',encoding='utf-8')
    json.dump(res_obj,fp=fp,ensure_ascii=False)
    fp.close()

    print(kw,'translation result saved!')

       In addition, unlike the two previous pieces of code, the file object here is assigned directly to fp during persistent storage, whereas the with keyword was used before. with is syntax introduced in Python 2.5 that implements the context management protocol. Its purpose is to take the try, except and finally keywords and the resource allocation and release code out of the main flow, simplifying the try/except/finally handling; the basic syntax is: with expression [as target]: followed by the with body. This example also shows that some of a page's data may be loaded dynamically: the data obtained by the code above is dynamic and is not obtained through the address in the browser's address bar. The examples so far fetch page data directly from a URL; if you want to capture the detailed data of a website, it may also involve processing the response data and requesting further detail pages, as in the example that follows the short sketch below:
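       A minimal, hedged sketch of the with statement (on a hypothetical local file, not part of the crawling flow): the two halves below are roughly equivalent, which is why with simplifies the try/finally handling.

# -*- coding:utf-8 -*-

# Without the context manager: allocate and release the resource explicitly.
fp = open('./demo.txt', 'w', encoding='utf-8')
try:
    fp.write('hello')
finally:
    fp.close()   # must run even if write() raises an exception

# With the context manager: the file is closed automatically on exit.
with open('./demo.txt', 'w', encoding='utf-8') as fp:
    fp.write('hello')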

# -*- coding:utf-8 -*-

import requests
import json

if __name__ == '__main__':
    # get the company ids
    # designate the home url
    home_url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList'

    # set the headers (UA disguise)
    headers = {
        'User-Agent': 'your User-Agent string'
    }

    # store company ids and legal-representative names
    id_list = []
    person_list = []

    # set the parameters and crawl the first five pages
    for page in range(1,6):
        page = str(page)
        params = {
            "on": "true",
            "page": page,
            "pageSize": 15,
            "productName": "",
            "conditionType": 1,
            "applyname": ""
        }
        # initiate the request
        res_json = requests.post(url=home_url,data=params,headers=headers).json()
        # print(res_json)
        for dic in res_json['list']:
            id_list.append(dic['ID'])

    # print(id_list)

    # access each legal-representative name according to its id
    # designate the detail url
    detail_url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById'

    # set the detail parameters
    for id in id_list:
        access_detail_data = {
            "id": id
        }
        # initiate the request and get the detail response
        detail_res = requests.post(url=detail_url,data=access_detail_data,headers=headers).json()
        person_list.append(detail_res['businessPerson'])

    print(person_list)

    # persistent storage
    fileName = 'personalname.txt'
    fp = open(fileName,'w',encoding='utf-8')
    json.dump(person_list,fp=fp,ensure_ascii=False)
    fp.close()

    print(fileName,'saved!')

         The code above takes the legal representatives of cosmetics companies published by the Food and Drug Administration as an example. It crawls several pages of the site in a loop, extracts the required company IDs from the crawled data, uses those IDs to build requests for the detail pages, crawls the detailed information, and finally extracts the needed field and saves it locally.

 

III. Data Parsing

       In many cases what we need is not the data of the entire webpage, but part of the useful information in it. Therefore, the acquired data has to be processed to parse out or extract the required information. The principle of data parsing is that the desired text content of a webpage is stored inside tags or as tag attributes; by locating the corresponding tag, the required content can be parsed out of the tag text or its attributes. Three parsing methods are summarized here: regular expression parsing, bs4 parsing and xpath parsing. Parsing is generally done after the response data is obtained and before persistent storage, so the crawling process described above can be extended to the following:

  • Specify the URL
  • Initiate the request
  • Get the response data
  • Parse the data
  • Persist the data

       The following introduces data parsing through three examples, using regular expressions, bs4 and xpath respectively.

 

1. Regular expression parsing

       The following example of crawling images from a Tieba post shows that, compared with the earlier data-crawling flow, there is one additional step: data parsing. A request is initiated to the specified page to obtain the response data, the response data is then parsed and filtered with a regular expression to obtain the required image URLs, and finally a request is made for each image URL to fetch the image and save it to a local folder. The principle of regular expression parsing in Python is to import the re module, define the corresponding regular expression, call the findall method to match it against the response data, and filter out the data that meets the requirements. Since version 1.5, Python has shipped the re module, which provides Perl-style regular expressions and gives the language full regular expression functionality. In the code below, the variable ex holds the regular expression, in which "." matches any character except a newline, "*" means any number of repetitions, and the trailing "?" makes the match non-greedy. The findall method matches the regular expression against the response text; its third argument is a modifier flag: re.I ignores case, re.M performs multi-line matching, and re.S lets "." also match newlines (single-line, or DOTALL, mode). In crawler parsing re.S is generally the one used, as in the code below. Flags are chosen as needed, and several modifiers can be combined to control the matching mode via the bitwise OR operator (|); for example, to match multi-line data case-insensitively, set both the I flag and the M flag (re.I | re.M). There are many other modes and modifiers; refer to the Python regular expression documentation.
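       Before the full Tieba example, here is a minimal, hedged sketch of findall and flag combination, using made-up sample text:

# -*- coding:utf-8 -*-
import re

# Made-up sample text just to demonstrate the flags.
text = 'IMG: a.jpg\nimg: b.jpg'
# re.I ignores case, re.M makes ^ match at the start of every line.
print(re.findall(r'^img: (\S+)', text, re.I | re.M))   # ['a.jpg', 'b.jpg']
# re.S lets "." also match the newline, so ".*" can span lines.
print(re.findall(r'IMG:(.*)', text, re.S))             # [' a.jpg\nimg: b.jpg']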

# -*- coding:utf-8 -*-

import requests
import re
import os

if __name__ == '__main__':
    # specify the URL
    url = 'https://tieba.baidu.com/p/6469380781'

    # UA disguise
    headers = {
        'User-Agent': 'your User-Agent string'
    }

    # initiate the request
    res_text = requests.get(url=url,headers=headers).text

    # print(res_text)
    # use focused crawling to parse / extract data from the page
    ex = '<img class="BDE_Image" src="(.*?)" .*?><br>'
    img_list = re.findall(ex,res_text,re.S)
    # print(img_list)

    # create a folder to save the data
    if not os.path.exists('./touxiang'):
        os.mkdir('./touxiang')

    for img in img_list:
        # get the binary image data
        img_data = requests.get(url=img,headers=headers).content
        # use the tail of the URL as the image name
        img_name = img[-10:]
        # save the image locally
        img_path = './touxiang/'+ img_name
        with open(img_path,'wb') as fp:
            fp.write(img_data)
            print(img_name,'downloaded successfully!')

2. bs4 parsing

       The regular expression method above is not limited to Python; it can also be used in other languages and is the more general parsing method. bs4 (BeautifulSoup), on the other hand, is specific to Python and can only be used there. This parsing method needs the lxml parser and an instantiated BeautifulSoup object, so to use bs4 for data parsing you must install the bs4 and lxml modules in your environment: pip install bs4 and pip install lxml. The parsing principle is to instantiate a BeautifulSoup object, load the data to be parsed into it, and then call BeautifulSoup attributes or methods to locate tags and extract data. The data loaded into the BeautifulSoup object can come from a local file or from the web. For a local file, open it with the open function and load it into the object: soup = BeautifulSoup(open('local file to be parsed'),'lxml'); for crawled network data, convert it into a string with the .text attribute and load that string: soup = BeautifulSoup('webpage data to be parsed','lxml'). After the object is created, locate tags or extract data as needed (a short sketch follows the list below):

  • Locating a tag: soup.tagName, for example soup.a.
  • Getting attributes: soup.tagName.attrs returns a dictionary containing all the attributes of the tag. To get an image's name: soup.img.attrs['alt'] or soup.img['alt'].
  • Getting content: soup.tagName.text, soup.tagName.string, soup.tagName.get_text(). string only returns the direct text content of the tag, while the other two return all the text content inside the tag.
  • Finding the first tag that meets the requirements: soup.find('tagName'[, title=""/alt=""/class_=""/id=""]).
  • Finding all tags that meet the requirements: soup.find_all('tagName') returns all tagName tags, soup.find_all(['tagName1','tagName2']) returns all tagName1 and tagName2 tags, and soup.find_all('tagName', limit=2) returns only the first two matches.
  • Selecting content with a selector: soup.select('tag_name/.class_name/#id_name/level selector') returns a list. In a level selector, ">" represents a single level and a space " " represents multiple levels.
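       Below is a minimal, hedged sketch of the operations listed above, run on a small made-up HTML fragment (the full example that follows then applies bs4 to a real page):

# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

# A small made-up HTML fragment just to demonstrate the bs4 API.
html = '''
<div class="book-mulu">
  <ul>
    <li><a href="/book/1.html" title="Chapter 1">Chapter 1</a></li>
    <li><a href="/book/2.html" title="Chapter 2">Chapter 2</a></li>
  </ul>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.a)                               # locate the first <a> tag
print(soup.a.attrs['href'])                 # get an attribute: /book/1.html
print(soup.a.string)                        # get the direct text: Chapter 1
print(soup.find('a', title='Chapter 2'))    # find one tag by attribute
print(soup.find_all('a', limit=2))          # find the first two <a> tags
print(soup.select('.book-mulu > ul > li'))  # level selector, returns a list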
# -*- coding:utf-8 -*-

import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    # step1: crawl the chapter list page
    # UA disguise
    headers = {
        'User-Agent': 'your User-Agent string'
    }
    # specify the URL
    chapter_url = 'http://www.shicimingju.com/book/hongloumeng.html'
    # initiate the request
    chapter_text = requests.get(url=chapter_url,headers=headers).text

    # step2: parse the chapter page to get the chapter titles and detail-page URLs, and save the crawled data locally
    # instantiate a BeautifulSoup object and load the chapter page source into it
    chapter_soup = BeautifulSoup(chapter_text,'lxml')
    # print(chapter_soup)
    chapter_list = chapter_soup.select('.book-mulu > ul > li')

    fp = open('./红楼梦.txt','w',encoding='utf-8')
    for li in chapter_list:
        # get the detail-page URL and the chapter title
        detail_url = 'http://www.shicimingju.com'+li.a['href']
        chapter_title = li.a.string

        # request the detail page to get the chapter content
        detail_text = requests.get(url=detail_url,headers=headers).text
        # parse the detail page to get the concrete content of each chapter
        detail_soup = BeautifulSoup(detail_text,'lxml')
        detail_content = detail_soup.find('div',class_='chapter_content').text
        fp.write(chapter_title+':'+detail_content)
        print(chapter_title,'downloaded!')
    fp.close()

 

3. XPath parsing

        XPath parsing is the most commonly used, most convenient and most efficient parsing method, and it is highly general. It has much in common with bs4 above. First the environment needs to be installed, because this method relies on the lxml parser and its etree module to instantiate an etree object: pip install lxml. The principle is to instantiate an etree object, load the source data to be parsed into it, and then call the object's xpath method with an xpath expression to locate tags and extract the specified data. The data loaded into etree again falls roughly into two categories: local file data and crawled network data. For a local file, call the parse method to instantiate the etree object: tree = etree.parse(filename); for network source code fetched by a general crawler, convert the page source into a string with .text and load it with the HTML method: tree = etree.HTML(page source string). After the etree object is instantiated, tags are located and data extracted by calling the xpath method with xpath expressions. Common xpath expressions are listed below (a short sketch follows the list):

  • / starts positioning from the root node and represents one level; // represents multiple levels and can start positioning from any position.
  • Attribute positioning: //div[@class="class_name"] finds the div tags whose class attribute value is class_name.
  • Level and index positioning: //div[@class="class_name"]/tag1/tag2[1]/tag3 finds the direct child tag3 under the first tag2 under tag1, where tag1 is a direct child of the div whose class attribute value is class_name.
  • Taking attributes: //div[@class="class_name"]//tag1[3]/a/@href gets the href attribute value of the a tag under the third tag1 descendant of the div whose class attribute value is class_name.
  • Taking text: //div[@class="class_name"]/text() gets the direct text content of the div whose class attribute value is class_name; //div[@class="class_name"]//text() gets all the text content inside that div tag.
  • Fuzzy matching: //div[contains(@class, "ng")] finds the div tags whose class attribute value contains "ng"; //div[starts-with(@class, "ta")] finds the div tags whose class attribute value starts with "ta".
  • Logical operations: //a[@href="" and @class="du"] finds the a tags whose href value is empty and whose class value is du.
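       Here is a minimal, hedged sketch of a few of these expressions, run on a small made-up HTML fragment (the full example that follows applies xpath to a real page):

# -*- coding:utf-8 -*-
from lxml import etree

# A small made-up HTML fragment just to demonstrate xpath expressions.
html = '''
<div class="slist">
  <ul>
    <li><a href="/pic/1.html"><img src="/img/1.jpg" alt="pic one"/></a></li>
    <li><a href="/pic/2.html"><img src="/img/2.jpg" alt="pic two"/></a></li>
  </ul>
</div>
'''

tree = etree.HTML(html)
print(tree.xpath('//div[@class="slist"]/ul/li'))       # level positioning, returns a list of li elements
print(tree.xpath('//li[1]/a/@href'))                   # take an attribute: ['/pic/1.html']
print(tree.xpath('//li/a/img/@alt'))                   # take attributes: ['pic one', 'pic two']
print(tree.xpath('//div[starts-with(@class, "sl")]'))  # fuzzy matching on the class value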
# -*- coding:utf-8 -*-

import requests
from lxml import etree
import os

if __name__ == '__main__':
    # step1: get the source data of the page containing the images

    # UA disguise
    headers = {
        'User-Agent': 'your User-Agent string'
    }
    # specify the URL
    url = 'http://pic.netbian.com/4kfengjing/'
    # initiate the request
    home_text = requests.get(url=url,headers=headers).text
    # print(home_text)

    # step2: parse the fetched source data
    home_etree = etree.HTML(home_text)
    li_list = home_etree.xpath('//div[@class="slist"]/ul/li')

    # step3: create a folder to save the downloaded image files
    if not os.path.exists('./风景图'):
        os.mkdir('./风景图')

    # print(li_list)
    for li in li_list:
        # get the address and the name of the image
        img_url = 'http://pic.netbian.com'+li.xpath('./a/img/@src')[0]
        # the site's pages are gbk-encoded, so re-encode the alt text to avoid garbled names
        img_name = li.xpath('./a/img/@alt')[0].encode('iso-8859-1').decode('gbk')+'.jpg'
        # print(img_name)
        # request the image address to get the binary image data
        img_content = requests.get(url=img_url,headers=headers).content

        # persist the fetched image
        save_path = './风景图/'+img_name
        with open(save_path,'wb') as fp:
            fp.write(img_content)
            print(img_name,'downloaded successfully!')

 

 

 

