Disclaimer: this article is for learning purposes only and has no commercial intent.
Background

I want to fetch the news and run it through TTS, so I can listen to it on my way to work every day. I will share the full plan later; first, let's look at the tried-and-true approach I like: fetch the HTML -> parse it with a Python library -> pull the content out of the elements, and we're done.
Well, guys, my crawl failed. The page was a pile of annoying JS: one look showed that the news is actually fetched through an API and then inserted into the document by JS, so I dug into the API instead.

Stumped again! What is this pageCallback parameter? In my experience it is the product of some convoluted JS encryption, because without this parameter you could easily get the data you want straight from the API. If it did not exist, I could do whatever I wanted. Reverse-engineering how pageCallback is generated would be very time-consuming, and I had no intention of studying it, so I decided to take another route.

Time to bring out my ultimate move: Selenium. If I simulate a real user's actions, surely the site can't block me.
Crawler 2.0
Use Selenium to simulate a user, crawl the page content, and write it out to a file. For an introduction to what Selenium is, see this article: Selenium Python Tutorial. Here I will only cover my own implementation.

First, for a utility script we probably don't want a browser window popping up, unless you need to watch the program run. So I turned on headless mode.
# headless browser (no visible window)
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)
With the all-important driver in hand, the rest of the process is simple. Just like an ordinary requests-based spider, you fetch the page source, then locate the relevant elements and read their attributes or text.
from selenium.webdriver.common.by import By

# titles
title_elems = driver.find_elements(by=By.CLASS_NAME, value="item-title")
titles = [title_elem.text for title_elem in title_elems]
Isn't it neat? Seeing By.CLASS_NAME, did CSS suddenly come to mind? Yes, your hunch is correct. If that already surprised and delighted you, keep reading:
# all the update times
related_elems = driver.find_elements(by=By.CSS_SELECTOR, value="div.item-related > span.time")
relateds = [related_elem.text for related_elem in related_elems]
# all the description snippets
desc_elems = driver.find_elements(by=By.CSS_SELECTOR, value="div.item-desc > span")
# strip the trailing (...) from the end of each news summary
descs = [desc_item.text[:desc_item.text.rfind('(')] for desc_item in desc_elems]
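One caveat about the slice `desc_item.text[:desc_item.text.rfind('(')]`: when a summary contains no `(` at all, `rfind` returns -1 and the slice silently drops the last character. A small helper (my own addition, not part of the original script) that only trims when a parenthesis is actually present:

```python
def strip_trailing_paren(text: str) -> str:
    """Remove a trailing "(...)" segment from a news summary, if present.

    str.rfind returns -1 when '(' is absent; slicing with [:-1] would then
    wrongly chop the last character, so guard against that case.
    """
    idx = text.rfind('(')
    if idx == -1:
        return text
    return text[:idx].rstrip()

# e.g. descs = [strip_trailing_paren(e.text) for e in desc_elems]
print(strip_trailing_paren("Market rallies on rate news (Reuters)"))  # -> Market rallies on rate news
print(strip_trailing_paren("No source suffix here"))                  # -> No source suffix here
```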
So what kind of selector is "div.item-related > span.time"? That is the child combinator: it matches span.time elements that are direct children of div.item-related. Nice, Selenium supports the full range of CSS selectors.

A little interlude: which CSS selectors do you know?
- element selector: p, div
- class selector: .highlight
- ID selector: #id
- attribute selector: [type='text']
- descendant selector: ul li
- child selector: ul > li
- adjacent sibling selector: h2 + p
- universal selector: *
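Selenium evaluates these selectors inside the browser, but the idea behind the class selector can be sketched with nothing more than the stdlib html.parser. This is a toy illustration of matching by class attribute, not how Selenium or a browser engine actually works:

```python
from html.parser import HTMLParser

class ClassCollector(HTMLParser):
    """Collect the text of elements whose class attribute contains a target
    class, mimicking what the CSS selector ".item-title" would match."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0   # > 0 while inside a matching element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if self.depth or self.target_class in classes:
            self.depth += 1
            if self.depth == 1:
                self.texts.append("")  # start a new matched element

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data

doc = ('<div><p class="item-title">First</p>'
       '<p class="other">noise</p>'
       '<p class="item-title">Second</p></div>')
collector = ClassCollector("item-title")
collector.feed(doc)
print(collector.texts)  # -> ['First', 'Second']
```

In Selenium you would get the same result with `driver.find_elements(By.CSS_SELECTOR, ".item-title")`, with the browser doing all of this work for you.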
Don't think I'm padding the article: know these selectors and you are basically unstoppable when scraping pages. On top of them, Selenium offers several locator strategies:
class By:
"""Set of supported locator strategies."""
ID = "id"
XPATH = "xpath"
LINK_TEXT = "link text"
PARTIAL_LINK_TEXT = "partial link text"
NAME = "name"
TAG_NAME = "tag name"
CLASS_NAME = "class name"
CSS_SELECTOR = "css selector"
The commonly used ones are XPATH, TAG_NAME, CLASS_NAME, and CSS_SELECTOR; if you're interested, explore the others yourself.
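To get a feel for XPath syntax without launching a browser, you can try it against a small document with the stdlib xml.etree.ElementTree, which supports a limited XPath subset. The markup below is my own invention, mirroring the item-related/time structure used earlier:

```python
import xml.etree.ElementTree as ET

page = """
<root>
  <div class="item-related"><span class="time">10:00</span></div>
  <div class="item-related"><span class="time">11:30</span></div>
</root>
"""
root = ET.fromstring(page)

# XPath: every <span class="time"> that is a child of a div with class "item-related"
times = [span.text
         for span in root.findall(".//div[@class='item-related']/span[@class='time']")]
print(times)  # -> ['10:00', '11:30']
```

The Selenium equivalent would be `driver.find_elements(By.XPATH, "//div[@class='item-related']/span[@class='time']")`, evaluated against the live page instead of a string.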
Finally, a word of caution. As a back-end developer, I genuinely want my APIs and sites to stay up and serve users reliably. Crawlers, however, are hard on a website: a machine issues requests many times faster than a human, which suddenly piles load onto the server, much like a DoS attack. Once crawlers hijack the traffic, ordinary users can no longer get through. That is why back-end APIs generally apply rate limiting, even though it also degrades the user experience. So study responsibly, scrape gently, and stay on the right side of the law; as the joke goes, "learn your crawling too well and you'll be eating prison meals early."