Crawling Gracefully with Python

Disclaimer: this article is for learning purposes only and provides no commercial value.

Background

I want to fetch the news and run it through TTS so I can listen to it on my way to work every day. I'll share the full plan in a later post. First, let's look at the universal old way I like: fetch the HTML content -> parse it with a Python library, extract the text from the elements, and done.
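The "old way" can be sketched with nothing but the standard library. The HTML here is inlined in place of a real download (in practice you would fetch it with `urllib` or `requests` first), and the `item-title` class name is borrowed from later in this article:

```python
# A minimal sketch of the classic approach: take HTML content and pull
# the text out of elements carrying a target class.
from html.parser import HTMLParser

HTML = '<ul><li class="item-title">News A</li><li class="item-title">News B</li></ul>'

class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # Enter "capture" mode on elements carrying the item-title class
        if ("class", "item-title") in attrs:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data)
            self.in_title = False

parser = TitleParser()
parser.feed(HTML)
print(parser.titles)  # ['News A', 'News B']
```

This works fine as long as the content is in the HTML the server sends back, which is exactly the assumption that breaks next.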

Well, the crawl failed. The page greeted me with a pile of annoying JS code. One look at the page revealed that the news was actually fetched through an API and then inserted into the document by JS, so I went after the API instead.

Stumped again! What is this pageCallback parameter on the API? From experience, it is the product of some convoluted JS encryption, because without that parameter you could easily pull the data you want straight from the API.

If that parameter didn't exist, I could do whatever I wanted. Reverse-engineering how pageCallback is generated would be very time-consuming, and I had no intention of studying it, so I decided to take another route.

Time to bring out my ultimate move: Selenium. If I simulate a real user's actions, surely the site can't block me.

Crawler 2.0

Use Selenium to simulate a user, crawl the page content, and write it to a file. For an introduction to what Selenium is, see this article: Selenium Python Tutorial. Here I'll only walk through my main implementation.

First of all, since this is a utility script, we probably don't want a browser window popping up, unless you need to watch the program run. So I enabled headless mode.

# Headless mode: run Chrome without a visible window
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)

With the all-important driver in hand, the rest of the process is simple. Just like an ordinary requests-based spider, you fetch the page source, locate the relevant elements, and read their attributes or text.

# Titles
from selenium.webdriver.common.by import By

title_elems = driver.find_elements(by=By.CLASS_NAME, value="item-title")
titles = [title_elem.text for title_elem in title_elems]

Isn't it neat? Seeing By.CLASS_NAME, didn't CSS suddenly come to mind? Yes, your hunch was correct. If the above already surprised you, keep reading.

# All update times
related_elems = driver.find_elements(by=By.CSS_SELECTOR, value="div.item-related > span.time")
relateds = [related_elem.text for related_elem in related_elems]
# All descriptions
desc_elems = driver.find_elements(by=By.CSS_SELECTOR, value="div.item-desc > span")
# Strip the trailing (...) note from the end of each news summary;
# guard against summaries with no '(' at all (rfind would return -1)
descs = [
    desc_item.text[:desc_item.text.rfind('(')] if '(' in desc_item.text else desc_item.text
    for desc_item in desc_elems
]
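The stated goal was to write the scraped content to a file. One way to finish the job, sketched with sample data standing in for the live Selenium lists above (the output filename and line format are my own choices):

```python
# Zip the three scraped lists together and write them to a text file,
# one news item per block: "title (time)" then the summary.
import os
import tempfile

titles = ["News A", "News B"]
relateds = ["08-07 10:00", "08-08 09:30"]
descs = ["Summary of news A", "Summary of news B"]

out_path = os.path.join(tempfile.gettempdir(), "news.txt")
with open(out_path, "w", encoding="utf-8") as f:
    for title, related, desc in zip(titles, relateds, descs):
        f.write(f"{title} ({related})\n{desc}\n\n")

print(out_path)
```

From there the file can be fed to whatever TTS tool you like.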

That's right. What kind of selector is "div.item-related > span.time"? A child selector. Nice: it supports all CSS selectors.

A little aside: which CSS selectors do you know?

  • element selector: p, div
  • class selector: .highlight
  • ID selector: #id
  • attribute selector: [type='text']
  • descendant selector: ul li
  • child selector: ul > li
  • adjacent sibling selector: h2 + p
  • universal selector: *
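To see what the class and attribute selectors actually match on, here is a toy illustration using only the standard library (no browser needed); the HTML snippet and the `SelectorDemo` class are made up for this demo:

```python
# How ".highlight" and "[type='text']" map onto element attributes,
# demonstrated with html.parser from the standard library.
from html.parser import HTMLParser

HTML = """
<div>
  <p class="highlight">important</p>
  <p>plain</p>
  <input type="text" name="q">
  <input type="submit">
</div>
"""

class SelectorDemo(HTMLParser):
    def __init__(self):
        super().__init__()
        self.by_class = []   # tags matched by ".highlight"
        self.by_attr = []    # tags matched by "[type='text']"

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "highlight" in attrs.get("class", "").split():
            self.by_class.append(tag)
        if attrs.get("type") == "text":
            self.by_attr.append(tag)

demo = SelectorDemo()
demo.feed(HTML)
print(demo.by_class)  # ['p']
print(demo.by_attr)   # ['input']
```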

Don't think I'm being long-winded: knowing these selectors makes you practically unstoppable when crawling pages. In addition, Selenium offers several locator strategies:

class By:
    """Set of supported locator strategies."""

    ID = "id"
    XPATH = "xpath"
    LINK_TEXT = "link text"
    PARTIAL_LINK_TEXT = "partial link text"
    NAME = "name"
    TAG_NAME = "tag name"
    CLASS_NAME = "class name"
    CSS_SELECTOR = "css selector"

The commonly used ones are XPATH, TAG_NAME, CLASS_NAME, and CSS_SELECTOR; if you're interested, explore the others yourself.
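XPath in particular is not Selenium-specific: Python's standard-library ElementTree supports a useful subset of it, which makes a handy playground for practicing queries before pointing them at a real page. The markup below mirrors the item-related/time structure from earlier:

```python
# XPath practice without a browser: ElementTree supports a subset of
# XPath, enough for tag, attribute, and nesting queries.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<div>'
    '<div class="item-related"><span class="time">08-07</span></div>'
    '<div class="item-related"><span class="time">08-08</span></div>'
    '</div>'
)

# Equivalent in spirit to the CSS selector "div.item-related > span.time"
times = [
    span.text
    for span in doc.findall('./div[@class="item-related"]/span[@class="time"]')
]
print(times)  # ['08-07', '08-08']
```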

Finally, a word of caution. As a back-end developer, I genuinely want my APIs and websites to stay accessible and provide users with stable service. Crawlers, however, can do real harm to a website: a machine is many times faster than a human, so a crawler can abruptly pile load onto the server, much like a DoS attack. Once crawlers hijack the traffic, other users can no longer access the site normally.

That's why backend APIs are generally designed with rate limiting, even though that also degrades the user experience. So crawl only for study, and do it responsibly. Stay on the right side of the law too; as the joke goes, master Python crawling and all your meals are included, in prison.
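The polite counterpart to server-side rate limiting is a client-side delay between requests. A minimal sketch (the 0.2-second interval is an arbitrary example value; pick something respectful for the site you're visiting):

```python
# A minimal client-side throttle: guarantee at least `interval` seconds
# between consecutive calls to wait().
import time

class Throttle:
    def __init__(self, interval: float):
        self.interval = interval
        self.last = 0.0  # monotonic timestamp of the previous call

    def wait(self):
        elapsed = time.monotonic() - self.last
        if elapsed < self.interval:
            time.sleep(self.interval - elapsed)
        self.last = time.monotonic()

throttle = Throttle(interval=0.2)
start = time.monotonic()
for _ in range(3):
    throttle.wait()      # here you would fire the actual request
elapsed = time.monotonic() - start
print(f"{elapsed:.2f}s for 3 throttled calls")
```

The first call passes immediately; the remaining two each wait out the interval, so three calls take at least 0.4 seconds.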

Origin blog.csdn.net/weixin_55768452/article/details/132177141