Crawling asynchronously loaded dynamic web pages: Ctrip review data

The traditional web crawling approach works well in many cases: you only need the URL of a page, and when you need to turn pages you can usually find a pattern in the URL and fetch the required information by iterating over each page. However, as web technology has developed, many websites load content asynchronously, especially when turning pages. As a result, page-turning requests carry irregular parameters in their URLs; these parameters may be related to timestamps, request identifiers, or other factors, and it is difficult for users to deduce the rules behind them.

Faced with this situation, we need a smarter and more flexible approach. This article takes crawling the reviews of "Orange Island Head" on the Ctrip website as an example to show in detail how to handle dynamic URLs, how to parse the crawled JSON data, and how to save the final data to an XLS file for later analysis and use.

 

1. Crawling static pages (requesting pages and finding element paths)

Traditional crawlers are usually based on HTTP requests and HTML parsing. They are relatively simple to use, and basic crawling can be implemented with a small amount of code. For most static web pages, the traditional approach is sufficient and can directly obtain the required content, such as text and images, usually quite quickly, because no complex page rendering or dynamic loading is involved. We take crawling the Douban Books Top 250 list as an example.

1. Basic preparation (importing libraries, defining the URL and request headers)

First import the libraries required by the crawler.

The requests library is a commonly used HTTP library for sending requests to websites and receiving responses. With it we can set request headers and fetch page content.

lxml is another commonly used XML and HTML parsing library in Python. It provides efficient, flexible parsers for XML and HTML documents, parses very quickly, and supports XPath and CSS selectors, which makes locating elements and extracting data easier and more convenient.

In short, the requests library is used to fetch the web page content, and the lxml library is used to extract the data we need.

from lxml import etree
import requests
import csv
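
To get a feel for how lxml extracts data once a page has been fetched, here is a minimal sketch that parses a hard-coded HTML snippet (the markup is invented purely for illustration):

from lxml import etree

html_text = '<div class="book"><a title="Some Book" href="https://example.com/1">Some Book</a><span>9.1</span></div>'
selector = etree.HTML(html_text)

# XPath expressions pick out the attribute and the text node we care about
title = selector.xpath('//div[@class="book"]/a/@title')[0]
rating = selector.xpath('//div[@class="book"]/span/text()')[0]
print(title, rating)  # -> Some Book 9.1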

 

Create a new file named 'douban250.csv' to save the data crawled from the web page. Since we will crawl the book names, links, authors and so on, we first write these column names to the first line of the file.

[Screenshot: the CSV file with its header row]

fp = open("douban250.csv",'wt',newline='',
         encoding = 'utf-8')
writer = csv.writer(fp)
writer.writerow(('name','url','author','publisher','date','price','rate','comment'))

 

Define the request headers and the URL used to request the web page content.

First, right-click the element you want to inspect and click 'Inspect', or press F12 (on a Mac, hold fn in the lower-left corner and press F12 on the touch bar). The developer tools panel opens on the right, where we look for the headers and the URL.

[Screenshot: the developer tools panel opened next to the page]

A URL is an address that identifies a resource on the Internet. Request headers pass additional information to the server or control the behavior of requests and responses. The steps to find them are as follows.

① Find the User-Agent request header as follows:
 

[Screenshots: locating the User-Agent in the request headers of the Network panel]

 

headers = {
    'User-Agent': 'your user-agent'  # paste the User-Agent string copied from the browser
}

② Find the URL:

[Screenshot: the request URL shown in the developer tools]

The URL of the first page is ‘https://book.douban.com/top250?start=0’

The URL of the second page is ‘https://book.douban.com/top250?start=25’

The URL of the third page is ‘https://book.douban.com/top250?start=50’

We find that each time we turn a page, the start parameter in the URL increases by 25, so the full list of page URLs can be built directly:

urls = ['https://book.douban.com/top250?start={}'.format(str(i)) for i in range(0, 250, 25)]

 

2. Locating an element in the page source with XPath

Right-click a web page element and click Inspect to jump directly to that element's location in the page source.

[Screenshot: the element highlighted in the page source]

Right-click that node in the developer tools and choose Copy XPath to get

//*[@id="content"]/div/div[1]/div/table[1]/tbody/tr/td[2]/p[1] ,

which is the path of this piece of information in the page source. Paths copied this way are absolute and rather brittle, so the code below uses shorter relative XPath expressions that match each book entry instead.

[Screenshot: Copy XPath in the developer tools]

for url in urls:
    html = requests.get(url, headers=headers)
    selector = etree.HTML(html.text)
    # each book entry on the Douban Books Top 250 page is a <tr class="item"> row
    infos = selector.xpath('//tr[@class="item"]')
    for info in infos:
        name = info.xpath('td/div/a/@title')[0]
        book_url = info.xpath('td/div/a/@href')[0]
        # the publication info is one "/"-separated string: author/.../publisher/date/price
        book_infos = info.xpath('td/p/text()')[0]
        author = book_infos.split('/')[0]
        publisher = book_infos.split('/')[-3]
        date = book_infos.split('/')[-2]
        price = book_infos.split('/')[-1]
        rate = info.xpath('td/div/span[2]/text()')[0]
        comments = info.xpath('td/p/span/text()')
        comment = comments[0] if len(comments) != 0 else "空"  # "空" means "empty"
        writer.writerow((name, book_url, author, publisher, date, price, rate, comment))

fp.close()

In the first part of this code, we use request headers to simulate a browser when sending the request for the target page. Setting appropriate header fields such as User-Agent makes the request look more like one sent by a real browser, which helps avoid being intercepted by the website's anti-crawler mechanism, so we can successfully obtain the original HTML content of the page.

Next, we use the lxml library to parse the HTML document and locate the element information with predefined XPath expressions. XPath is a language for locating nodes in XML and HTML documents; in this example it specifies where each target element sits so the required data can be extracted from the whole document.

 

2. Dynamic pages: crawling asynchronously loaded content

Many modern web pages load content dynamically through JavaScript. Traditional crawlers often cannot capture this content directly and therefore cannot obtain the complete page data.

1. How asynchronous loading behaves

Asynchronous loading is a way of loading web page content in which the browser does not have to wait for the current content to finish loading before it continues loading other content. This makes pages faster and smoother for users to browse.

When a piece of content on a page is loaded asynchronously, the page does not fetch it during the initial load. Instead, it dynamically requests the new content from the server when the user scrolls the page or performs certain actions, and then displays it on the page.

Imagine you are visiting a web page that contains many images and text. With the traditional synchronous loading approach, the browser loads the content in the order it appears on the page. If one image loads slowly, the rest of the page is blocked and cannot be displayed until that image finishes. This can make pages load very slowly.

Asynchronous loading is an improvement on this. When certain content needs to be loaded, the page issues a separate request for it and continues loading everything else without waiting for that request to complete. Other parts of the page can therefore be loaded and displayed first instead of being blocked by one slow resource.

Take Ctrip’s website as an example: ‘https://you.ctrip.com/sight/changsha148/9010.html#ctm_ref=www_hp_his_lst’

During the page-turning process here, the web page URL does not change, and the XPath of the elements on different pages does not change either, so the URL-pattern approach used above no longer applies. A quick check, sketched below, also shows that the review text is not present in the static HTML returned for this URL.
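
A minimal sketch of that check, reusing the requests library and the headers defined earlier; the snippet variable is a placeholder you would replace with text copied from a review that is visible in the browser:

import requests

page_url = 'https://you.ctrip.com/sight/changsha148/9010.html'
resp = requests.get(page_url, headers=headers)

# If the reviews were rendered on the server, the copied snippet would appear in resp.text;
# for asynchronously loaded reviews it typically does not.
snippet = '风景不错'  # placeholder; paste a phrase from a review you can see in the browser
print(snippet in resp.text)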

 

2. How asynchronous loading is implemented

In practice, asynchronous loading is usually implemented with JavaScript. The JavaScript code in the page initiates asynchronous requests when needed, fetches the resources or data, and displays them on the page once loading completes.

In general, asynchronous loading is a technique for optimizing page loading: by loading specific content only when necessary, it lets users reach the page faster and browse its content more smoothly.

Therefore, you need to start with the JavaScript code to fully understand the loading process of the web page and the data generated at different stages.

JavaScript code runs in the browser and can dynamically modify the DOM (Document Object Model) structure to produce dynamic effects and load data into the page. This dynamically generated content may include comments, recommendations, user-interaction data and so on. Working only from the original HTML source, you therefore may not be able to obtain the complete page data.

To capture data from such web applications completely, we can simulate browser behavior: execute the JavaScript, render the page fully, and extract the required data afterwards (a sketch of this option follows). Alternatively, and this is the route taken in the rest of this article, we can find the underlying data request and call it directly.
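
For completeness, here is a minimal sketch of the browser-driving option using Selenium; this is not what the article uses, it assumes a local Chrome and chromedriver installation, and the fixed wait time is an arbitrary placeholder:

from selenium import webdriver
import time

driver = webdriver.Chrome()  # assumes chromedriver is installed and on the PATH
driver.get('https://you.ctrip.com/sight/changsha148/9010.html')
time.sleep(3)  # crude wait for the JavaScript-rendered content to appear

rendered_html = driver.page_source  # the HTML after JavaScript has run
driver.quit()

# rendered_html can now be parsed with lxml exactly like a static page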

 

3. Parsing an asynchronously loaded page

After performing a page-turn, a batch of new requests appears in the Network panel. Among these newly generated requests we find the one that fetches the next page of reviews; its request URL is

‘https://m.ctrip.com/restapi/soa2/13444/json/getCommentCollapseList?_fxpcqlniredt=09031098111600950761&x-traceID=09031098111600950761-1690373857939-9219508’

This URL contains parameters that identify the specific request. Let's break down its important parts:

https://m.ctrip.com/restapi/soa2/13444/json/getCommentCollapseList: this part is the server-side address, indicating the specific resource or service we want to access.

_fxpcqlniredt and x-traceID: these two parameters are identifiers carried by the request for tracing or verification. There is no need to work out how they are generated here; we can keep going.

[Screenshot: the page-turn request in the Network panel]

Since the page-turn data cannot be obtained through a predictable URL, we can instead look at the request information the page sends in its payload. In modern web applications, pages often use JavaScript to generate request parameters dynamically when turning pages or loading content asynchronously, and these parameters may be timestamps, request identifiers and the like. If we can reproduce the information in the page's payload, we can construct the complete page-turn request ourselves.

[Screenshot: the request payload of the page-turn request]

headers = {
    'User-Agent': 'your user-agent',
    # 'Cookie': 'your_cookie_value_here',
    # other request headers can be added as needed
}

# base_url is the base URL: the server-side address of the resource we want to access,
# the same as the front part of the page-turn request URL found above
base_url = 'https://m.ctrip.com/restapi/soa2/13444/json/getCommentCollapseList'
payload = {
    "arg": {
        "channelType": 2,
        "collapseType": 0,
        "commentTagId": 0,
        "pageIndex": 2,
        "pageSize": 10,
        "poiId": 77604,
        "sourceType": 1,
        "sortType": 3,
        "starType": 0
    },
    "head": {
        "cid": "09031098111600950761",
        "ctok": "",
        "cver": "1.0",
        "lang": "01",
        "sid": "8888",
        "syscode": "09",
        "auth": "",
        "xsid": "",
        "extension": []
    }
}

base_url is the base URL without any query parameters. It is the server-side address indicating the specific resource or service we want to access, identical to the front part of the request URL found above.
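
If the server were to reject requests sent to the bare base_url, the two query-string identifiers observed in the captured URL could be attached through the params argument of requests; this is a sketch under that assumption, with the values simply copied from the captured request:

# Optional: attach the query-string identifiers seen in the captured request URL.
# Whether they are actually required is an assumption; try the bare base_url first.
params = {
    '_fxpcqlniredt': '09031098111600950761',
    'x-traceID': '09031098111600950761-1690373857939-9219508',
}
response = requests.post(base_url, params=params, json=payload, headers=headers)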

 

4. Crawling the asynchronously loaded pages

import json
from bs4 import BeautifulSoup

response = requests.post(base_url, data=json.dumps(payload), headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
soup.text  # the response body is JSON text, not HTML
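
A side note: since this endpoint returns JSON, BeautifulSoup is not strictly needed; the same body can be decoded directly, as in this small alternative sketch:

# Alternative: requests can decode the JSON body itself, skipping BeautifulSoup entirely.
data = requests.post(base_url, data=json.dumps(payload), headers=headers).json()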

We can use Python's built-in json module to parse this JSON data and obtain the comment content. What the code above returns is a block of JSON text, shown in the screenshots below. We only need to locate the part we need within it and then access it with ordinary dictionary lookups.

[Screenshots: the raw JSON text returned by the request]

As the figure shows, the raw crawled text is cluttered and it is difficult to extract the required data from it directly, so we parse it with the json library. The json library is one of Python's standard libraries for processing JSON (JavaScript Object Notation) data, and it helps us parse the response efficiently and obtain the data we want to crawl.

Using the json library, we can easily convert the JSON data into Python dictionaries and lists, which makes extracting and processing the contents much easier.

import xlwt
import json

# create an XLS workbook to hold the comments
comment = xlwt.Workbook(encoding='utf-8')
sheet = comment.add_sheet('sheet1')

# parse the JSON data
data = json.loads(soup.text)

Next, locate the information we need to crawl; it sits in the "content" field:

[Screenshot: the "content" field in the JSON response]

content sits inside the items list under result:

[Screenshot: the result → items → content hierarchy]

There is nothing else we need at the result level. We read it with data.get("result", {}); what we get back is a dictionary, which is stored in comments.

Then take the comment list via comments['items'] and read the value of the "content" field from each comment object.

# extract the "content" field of every comment
comments = data.get("result", {})
items = comments['items']

i = 0  # row index in the worksheet
for item in items:
    k = item['content']
    j = 0  # column index
    sheet.write(i, j, k)
    i = i + 1
comment.save('评论.xls')  # the file name '评论' means 'comments'

Assembling the code above: to crawl data from different pages, we only need to change pageIndex in the payload:

# full version: write both the comment text and the reviewer's nickname

book = xlwt.Workbook(encoding='utf-8')
sheet = book.add_sheet('sheet1')
header = ['评论', '用户']  # column headers: 'comment', 'user'
for h in range(len(header)):
    sheet.write(0, h, header[h])

for index in range(1, 30):
    base_url = 'https://m.ctrip.com/restapi/soa2/13444/json/getCommentCollapseList'
    payload = {
        "arg": {
            "channelType": 2,
            "collapseType": 0,
            "commentTagId": 0,
            "pageIndex": index,  # the page number is the only field that changes
            "pageSize": 10,
            "poiId": 77604,
            "sourceType": 1,
            "sortType": 3,
            "starType": 0
        },
        "head": {
            "cid": "09031098111600950761",
            "ctok": "",
            "cver": "1.0",
            "lang": "01",
            "sid": "8888",
            "syscode": "09",
            "auth": "",
            "xsid": "",
            "extension": []
        }
    }

    response = requests.post(base_url, data=json.dumps(payload), headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    # parse the JSON data and get the comment list under result -> items
    data = json.loads(soup.text)
    result = data.get("result", {})
    items = result['items']

    # each page holds 10 comments; row 0 is the header row
    i = (index - 1) * 10 + 1
    for item in items:
        content = item['content']
        name = item['userInfo']['userNick']
        j = 0
        sheet.write(i, j, content)
        sheet.write(i, j + 1, name)
        i = i + 1

book.save('评论.xls')  # the file name '评论' means 'comments'
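
The loop above assumes every page returns an items list. A slightly more defensive variant, sketched here with an assumed one-second pause between requests (the pause and the .get() fallbacks are our own additions, not part of the original code), reuses base_url, payload, headers, sheet and book from above:

import time

i = 1  # first data row (row 0 holds the header written above)
for index in range(1, 30):
    payload["arg"]["pageIndex"] = index
    data = requests.post(base_url, data=json.dumps(payload), headers=headers).json()

    # skip pages that come back without an items list instead of crashing
    items = (data.get("result") or {}).get("items") or []
    for item in items:
        sheet.write(i, 0, item.get('content', ''))
        sheet.write(i, 1, (item.get('userInfo') or {}).get('userNick', ''))
        i = i + 1

    time.sleep(1)  # assumed pause to be gentle on the server

book.save('评论.xls')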

 

 

 
