Dynamic web crawler solution

Preface

While reviewing my earlier notes on downloading videos, I found that some key points were not clear, so I am organizing them here.

Summary of crawler practical experience

If you are not proficient with dynamic web pages, a crawler is of limited use.

A little story about crawlers (can be skipped)

Suppose there is a table and a chair, and they follow a set of default rules: when a human sits in the chair, the table provides services according to those rules.
Normally, a human sits in the chair and says, "I want a pen," and a pen rises from the surface of the table. But this human needs 100 pens at once, which the table cannot provide. So the human opens the table's drawer, flips through it, notes where the hundred pens are kept, and tells its robot servant.
Static webpage: the human orders the robot to take the pens out one by one from the location it was told. The human leaves the chair, the robot sits in the chair, and the task is completed.

Dynamic webpage: the human still orders the robot to take the pens from the location it was given, the human leaves the chair, and the robot sits in the chair.
This time, to prevent robot servants from taking pens out of the drawer, the table does not allow non-humans to see inside the drawer unless the chair approves. So, to have the robot fetch the pens anyway, the human tells the robot the shape of its own backside and has it pretend "it's me"; the robot sits on the chair and is now allowed to open the drawer and look inside.

But the robot still cannot find the pens:
**selenium:** it turns out the robot cannot just reach into the drawer; a human has to pull the drawer open to find the pens' position. So the robot imitates the human, pulls the drawer open, sees where the pens are, and completes the task.
**interface:** the robot does not want to imitate the human. It notices a rope in the drawer where the pens should be, and following the rope it finds a safe. Inside the safe are countless little spirits, and they have the pens it needs (the spirits twist countless pieces of data into a string and pass it to the drawer). The safe has many knobs, and turning them to the designated positions makes the corresponding pen come out. The robot notices that the knob settings for each pen are identical except for one or two.

Dynamic web crawler solution

Nowadays, most web pages are dynamic. Classic static cases like the Douban Top 250 ranking list are now hard to find even with a lantern.

There are only two solutions:

| Method | Advantage | Disadvantage | Practice |
| --- | --- | --- | --- |
| Analyze the interface | The data can be requested directly, with no page rendering to parse. Less code, high performance | Analyzing the interface is more complicated, especially interfaces obfuscated by js; requires some basic knowledge of js; more easily spotted as a crawler | Youdao interface, Bilibili video interface, weather app interface |
| selenium | Directly simulates browser behavior: whatever the browser can request, selenium can request too. The crawler is more stable | More code, lower performance | scrapy + selenium |

Obtaining data through the interface

First example: the Youdao dictionary
Second example: Bilibili (Station B)
Third example: Baidu images (this article only covers the dynamic image page)

Youdao and Bilibili illustrate an important difference: POST requests versus GET requests.

Youdao uses POST: the browser sends the headers first, then the form data, and the data contains encrypted parameters. The encryption has to be cracked in a js file, the one the website uses when it packages and submits the request to its server; the file is minified, so it must be beautified before it can be read. Analyzing the js reveals where the encrypted parameters come from and how they are generated, and the same mechanism can be re-implemented in your own code to replace the js encryption.
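To make this concrete, here is a sketch of what re-implementing such an encryption scheme in Python typically looks like. The client name, secret constant, and parameter names below are hypothetical placeholders, not values taken from Youdao's actual js file; only the pattern (a salt built from a timestamp, an md5 over concatenated fields) reflects the common shape of these schemes:

```python
import hashlib
import random
import time

def make_sign(word, client="hypothetical_client", secret="hypothetical_key"):
    """Rebuild the signature the beautified js file computes.

    Assumes the js does: salt = millisecond timestamp + random digit,
    sign = md5(client + word + salt + secret). Real sites will differ;
    read the actual js to find the real concatenation order and key.
    """
    salt = str(int(time.time() * 1000)) + str(random.randint(0, 9))
    raw = client + word + salt + secret
    sign = hashlib.md5(raw.encode("utf-8")).hexdigest()
    return {"i": word, "salt": salt, "sign": sign}

payload = make_sign("hello")
print(payload["sign"])  # a 32-character hex digest sent alongside the word
```

POSTing this payload together with the other form fields then mimics what the site's own js submits.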

The Bilibili interface: look for the playurl file in the network panel. It turns out it appears for some videos but not for others.


The interface found on the Internet is like this

'https://api.bilibili.com/x/player/playurl?' + 'bvid=' + bvid +'&cid=' + str(cid) + '&qn=64&type=&otype=json'
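A minimal sketch of assembling and calling that interface with only the standard library. The query parameters are exactly the ones in the URL above; the User-Agent and Referer headers are assumptions (many endpoints reject requests without them), and the example bvid/cid values are made up:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def build_playurl(bvid, cid, qn=64):
    """Assemble the playurl interface URL for a video (bvid) and part (cid)."""
    params = {"bvid": bvid, "cid": str(cid), "qn": qn, "type": "", "otype": "json"}
    return "https://api.bilibili.com/x/player/playurl?" + urlencode(params)

def fetch_playurl(bvid, cid):
    # The Referer header is an assumption; endpoints often require one.
    req = Request(build_playurl(bvid, cid),
                  headers={"User-Agent": "Mozilla/5.0",
                           "Referer": "https://www.bilibili.com"})
    with urlopen(req, timeout=10) as resp:
        return json.load(resp)

print(build_playurl("BV1xx411c7mD", 12345))
```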

For different types of Bilibili videos, the data shown in the network panel also differs.
Packet capture is actually a very important skill; it expands your ability to obtain data.

The difficulty with the Youdao dictionary is that its parameters are encrypted.
The difficulty with Bilibili is that the video cannot be fetched in a single request; it has to be downloaded in segments. Without one direct URL, this seems to be the only way. And since the request is GET rather than POST, no form parameters are needed.
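The segmented download can be sketched as follows. The response shape assumed here, a data.durl list whose entries each carry a url, is hypothetical for illustration; check the actual playurl JSON before relying on it:

```python
def segment_urls(playurl_json):
    """Pull the per-segment URLs out of a parsed playurl response.

    Assumes (hypothetically) the shape {"data": {"durl": [{"url": ...}, ...]}}.
    """
    return [seg["url"] for seg in playurl_json.get("data", {}).get("durl", [])]

def merge_segments(chunks):
    # Concatenate the downloaded byte chunks, in order, into one video body.
    return b"".join(chunks)

sample = {"data": {"durl": [{"url": "https://cdn.example/seg-1.flv"},
                            {"url": "https://cdn.example/seg-2.flv"}]}}
print(segment_urls(sample))
```

Each URL is then fetched in turn and the chunks are merged into a single file.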

The third example: obtaining Baidu images.
First open the Baidu Images page you want to crawl, right-click, choose Inspect, click Network, and look for an interface (a data packet) that contains the image URLs.
I did not find it. This is a situation I often run into: in other people's examples the corresponding js data packet is found this way, but when I try it myself I cannot find it.

Even so, finding the interface that contains the required content is very important (I did find it later).

The interface of Baidu Pictures is

http://image.baidu.com/search/acjson?

with the following parameters:

param = 'tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord={}&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=&hd=1&latest=0&copyright=0&word={}&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&cg=star&pn={}&rn=30&gsm=78&1557125391211='.format(keyword, keyword, page * 30)

The three {} slots are queryWord and word (both filled with the search term) and pn (the result offset; rn=30 returns 30 results per page, so pn advances in steps of 30). The variable names keyword and page here are just whatever your own code uses; the original snippet was truncated after .format(.

You can get the image url by visiting this interface
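A hedged sketch of calling the interface and pulling the image URLs out of its JSON. The thumbURL field name is an assumption based on commonly observed responses, and the parsing is demonstrated on a made-up sample rather than live data:

```python
import json
from urllib.request import Request, urlopen

BASE = "http://image.baidu.com/search/acjson?"

def extract_image_urls(acjson):
    """Collect picture URLs from one parsed acjson page.

    Assumes each item in 'data' may carry a 'thumbURL' key; empty
    trailing items (the response often ends with one) are skipped.
    """
    return [item["thumbURL"] for item in acjson.get("data", [])
            if item.get("thumbURL")]

def fetch_page(param):
    # param is the long query string from the text, already .format()-ed.
    req = Request(BASE + param, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=10) as resp:
        return json.load(resp)

sample = {"data": [{"thumbURL": "http://img0.example/a.jpg"}, {}]}
print(extract_image_urls(sample))
```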

The response is much easier to read after pasting it into a js/JSON formatting site. After that, downloading and saving are routine operations.

A frequent error is FileNotFoundError: [Errno 2] No such file or directory: 'D:/base/0.jpg'. It occurs when the target directory does not exist, so create it first:

cwd = os.getcwd()
file_name = os.path.join(cwd, keyword)
if not os.path.exists(file_name):
    os.mkdir(file_name)
for index, url in enumerate(image_url, start=1):
    with open(os.path.join(file_name, '{}.jpg'.format(index)), 'wb') as f:
        f.write(requests.get(url).content)
    if index % 30 == 0:
        print('{}: page {} downloaded'.format(keyword, index // 30))

So downloading pictures through the interface is very smooth

Get data through selenium

First example: Tencent Comics, crawling the whole site's comics
Second example: Baidu images (dynamic version)
Third example: clicking "more" to get more articles

Tencent Animation

Crawl all station comics

Baidu Pictures

Go to the web page and observe: as you scroll, pictures keep appearing without the page refreshing, so it is a dynamic web page.
After trying the ordinary xpath method to no avail, the selenium part of the code is as follows:

subjects = driver.find_elements_by_xpath("//div[@class='imgbox']/a/img")
for subject in subjects:
    a = subject.get_attribute("data-imgurl")
    print(a)
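Because the pictures only load as the page scrolls, the scrolling itself can be automated before collecting the img elements. The scroll loop below is a common pattern (compare the page height before and after each scroll, stop when it no longer grows) rather than code from the original post:

```python
import time

def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Keep scrolling until the page height stops growing (lazy loading done)."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load the next batch
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
```

Call scroll_to_bottom(driver) first, then run the find_elements_by_xpath collection above.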

Jianshu

Open the page, look at its source code, right-click to inspect, and test expressions with xpath helper.

It turns out this page has an anti-crawling mechanism: the class names are dynamic. Even using relative-position lookups in your xpath you may not get the content you want, and working out a correct xpath expression becomes very troublesome.

The goal now is to crawl the links or text of the topics this article is included in.
Trying it in code: sure enough, the result is empty and the specified content cannot be obtained.

In addition, the original page requires the user to click "more" to load content, so selenium solves both of these problems at once.

Look at the code

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(executable_path=r"D:\ProgramApp\chromedriver\chromedriver.exe")

url = "https://www.jianshu.com/p/7e2b63ed0292"
driver.get(url)

WebDriverWait(driver,5).until(
    EC.element_to_be_clickable((By.XPATH,"//section[position()=2]/div/div"))
)

while True:
    try:
        next_btn = driver.find_element_by_xpath("//section[position()=2]/div/div")
        driver.execute_script("arguments[0].click();",next_btn)
    except Exception as e:
        break

subjects = driver.find_elements_by_xpath("//section[position()=2]/div[position()=1]/a")
for subject in subjects:
    print(subject.text)

The result is achieved.
Summary:
Observation 1: Ordinary methods can crawl some things but not others. What can be crawled is present in the page source; what cannot be crawled is absent from it.
Speculation 1: Anything not hidden from the page source can be crawled with ordinary methods. (Note: "source" here does not count extremely long obfuscated strings.)
Observation 2: On the Jianshu page, the topic names are not in the source, but they can be obtained by packet capture.
Speculation 2: What is not in the source can be obtained by capturing packets.
Unexpected gain: packet capture can also obtain all the comments, avatars, IDs, and other information.
Observation 3: On Jianshu, the topic names cannot be crawled with the ordinary xpath method; use the interface, or use selenium.
Speculation 3: Content hidden from the source can only be crawled through the interface or with selenium.
Practical conclusions: selenium makes locating elements easy, and dynamic attributes act as anti-crawling against ordinary methods; with selenium, first get the element, then get its attribute value.


Origin blog.csdn.net/qq_51598376/article/details/113811040