Getting started with Python crawlers using requests and bs4


Now, let me briefly write up what I've learned over the past few months~~~
A crawler really just fetches a page's source code and then filters that source for the parts we want~
Sometimes regular expressions come in handy ~ for
example, for pulling out some text, the src of an image, or the href of a link~~
Some pages have lazy-loaded images in them, and for those you need selenium webdriver or something similar. I haven't studied that yet, so please don't flame me.
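
To make that concrete, here is a minimal regex sketch (the HTML snippet is made up just for illustration ~ bs4, which we use below, is usually the friendlier tool):

import re

html = '<img src="a.jpg"><a href="https://example.com">a link</a>'
print(re.findall(r'src="([^"]+)"', html))   # ['a.jpg']
print(re.findall(r'href="([^"]+)"', html))  # ['https://example.com']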

Another answer above also mentioned that writing crawlers with requests and bs4 really is foolproof-level stuff.
It's very easy~~ But whatever program you write, you should read the documentation first. I personally find both sets of docs quite friendly~
Requests documentation: Quickstart - Requests 2.10.0 documentation
BeautifulSoup documentation: Beautiful Soup 4.4.0 documentation

First, import these two modules:
from bs4 import BeautifulSoup
import requests

Then give requests a URL and tell it which page's source code you want to crawl. If it's this very question, that would be
url = 'https://www.zhihu.com/question/20899988'
Sometimes you also need to fake a header and send it to the server along with the request.
User-Agent identifies your browser version, and Cookie holds data stored on your machine.
You can see both by opening F12, switching to the Network tab, and clicking Doc ~ they appear in the request headers there.
headers = {
    'User-Agent': '',  # paste your browser's UA string here
    'Cookie': ''       # paste the Cookie value from the same request
}
Then you can fetch the page with requests:
data = requests.get(url, headers=headers)
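(A quick aside: before going on, you can check whether the request actually succeeded.)

data.raise_for_status()  # raises an error on a 4xx/5xx response
print(data.status_code)  # 200 means everything went fine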
This data is actually a Response object; you
need its .text before handing it over to bs4:
soup = BeautifulSoup(data.text, 'lxml')
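
One note: the 'lxml' parser needs the lxml package installed. If you don't have it, Python's built-in parser works too, just a bit more slowly:

soup = BeautifulSoup(data.text, 'html.parser')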

After that, you can use soup.select to pick things out.
If you can't write a CSS selector yourself, the easiest way is to open F12, right-click the element, and choose Copy > Copy selector.
For example, here's a simple way to grab the images:
imgs = soup.select('div.zm-editable-content > img')

This soup.select returns a list, so you need a for loop ~ for
example, to put all the image links in a list:
img_link = []
for i in imgs:
    img_link.append(i.get('data-actualsrc'))  # zhihu keeps the real URL of a lazy-loaded image here
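
The same loop can also be written as a one-line list comprehension, skipping any tag that lacks the attribute:

img_link = [i.get('data-actualsrc') for i in imgs if i.get('data-actualsrc')]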

So that's what these links are for~~ Now we can use urllib.request.urlretrieve (it was urllib.urlretrieve back in Python 2) to download them!!
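A minimal download sketch, assuming each link is a full http(s) URL (the imgs folder and the numbered filenames are just my choice for illustration):

import os
from urllib.request import urlretrieve

os.makedirs('imgs', exist_ok=True)
for n, link in enumerate(img_link):
    urlretrieve(link, os.path.join('imgs', '%d.jpg' % n))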
If you run into anti-crawling measures, you can also import time and let the program sleep for a while:
import time
time.sleep(4)
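
In practice the sleep goes inside the download loop, so there's a pause between requests rather than one pause up front:

for n, link in enumerate(img_link):
    urlretrieve(link, os.path.join('imgs', '%d.jpg' % n))
    time.sleep(4)  # be polite: pause between downloads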

The source code is here~~

I also have a small Python crawler repo of my own with assorted examples, still in its early days~~ but stars and issues are welcome ha~~
