The website we will analyze is:
the Shaanxi Provincial Government Procurement Network.
The website looks like this
What we want are
the detail announcements in the bottom-right section
After opening this page,
right-click
and select "Inspect"
Whichever browser you use, choose "Inspect" at this step
Google Chrome is recommended here
(ready to insert a link to install Google Chrome)
You will then see the following interface
Next, select the Network tab circled in red
and you get
The part circled in yellow is empty, so click the part circled in red to refresh the page
Entries now appear in the yellow-circled area
For this site, only one entry appears in that area after refreshing; click it to get
If many entries appear,
click the link below
(A link is going to be inserted here)
Click the new entry
and you will find this
Look at the URL of this entry
The previous URL:
http://113.200.80.230/notice/list.do?noticetype=3&index=3&province=province
The URL of this entry:
http://113.200.80.230/notice/noticeaframe.do?noticetype=3&isgovertment=
Surprisingly, not only has the "shape" of the web page changed,
the URL has changed as well
This is called asynchronous loading
If you find that the URL does not change after the steps above,
that is called synchronous loading
(Insert an explanation link for asynchronous loading and synchronous loading here)
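The two URLs captured above can also be compared programmatically. Here is a minimal sketch using only Python's standard library; it splits each URL into its path and query parameters, which makes the difference between the page URL and the real data URL obvious:

```python
from urllib.parse import urlsplit, parse_qs

# The URL seen in the address bar, and the real URL found in the Network tab
list_url = "http://113.200.80.230/notice/list.do?noticetype=3&index=3&province=province"
frame_url = "http://113.200.80.230/notice/noticeaframe.do?noticetype=3&isgovertment="

for url in (list_url, frame_url):
    parts = urlsplit(url)
    # keep_blank_values=True preserves the empty isgovertment parameter
    print(parts.path, parse_qs(parts.query, keep_blank_values=True))
```

The paths differ (`list.do` vs `noticeaframe.do`), which is exactly the asynchronous-loading clue described above.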
Then, to our surprise again:
even though the URL changed and the "shape" of the page changed,
the information we want to crawl
is still on this page
so
once the real URL has been found,
you can send a request with the requests library
The real URL is the one we just dug out:
http://113.200.80.230/notice/noticeaframe.do?noticetype=3&isgovertment=
Sending the request needs the following two libraries
import requests  # Python's basic crawling library
from lxml import etree  # converts the fetched HTML into an element tree we can query
Copy these two lines of code and run them
If they error out,
then most likely you do not have the libraries installed.
You need to run
these two commands
pip install requests
pip install lxml
Then wait patiently
About five minutes later (or less, depending on internet speed, luck, and the mood of your computer)
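If you are not sure whether the installation succeeded, a small check like the one below tells you which libraries still cannot be imported (a sketch; `missing_libraries` is a hypothetical helper name, not part of any library):

```python
import importlib.util

def missing_libraries(names):
    """Return the subset of names that cannot be imported on this machine."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Anything printed here still needs `pip install <name>`
print(missing_libraries(["requests", "lxml"]))
```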
Then run these two lines again
import requests  # Python's basic crawling library
from lxml import etree  # converts the fetched HTML into an element tree we can query
and they should basically succeed
If they still error,
it is recommended to retype the commands by hand on your own machine,
or click the two links below for details
If the requests library fails to install, click this
If the lxml library fails to install, click here
After installation,
before sending the request,
we need to disguise our request behavior a little
to fool the site
In general, fooling a website
can be approached from
three angles: headers, cookies, and the Referer
The most basic of these is the headers
(The role and meaning of header)
Ordinary websites can usually be handled with headers alone
and this site is one of them
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36'}
This one line of code solves it
You can find your local User-Agent like this
Crawler basics: the role and meaning of headers, and how to find yours http://t.csdn.cn/8DRLp
If you need a large number of headers, you can build fake headers
(insert a link)
Of course, you can also just use the header provided in the code above
url2 = "http://113.200.80.230/notice/noticeaframe.do?noticetype=3&isgovertment="  # the real URL found above
response2 = requests.get(url=url2, headers=headers)
response2.encoding = 'utf-8'
wb_data_2 = response2.text
html = etree.HTML(wb_data_2)
Then
type
html
and run it; if it prints an Element object, the request succeeded
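To see what a successful parse looks like without hitting the live site, here is a minimal sketch using a stand-in HTML snippet (in the tutorial code, the real page text would come from `response2.text`):

```python
from lxml import etree

# A tiny HTML snippet standing in for the page we fetched
sample = "<html><body><ul><li>Notice A</li><li>Notice B</li></ul></body></html>"
html = etree.HTML(sample)

# A successful parse gives an Element we can run XPath queries against
print(html)  # prints an Element object, e.g. <Element html at 0x...>
titles = html.xpath("//li/text()")
print(titles)  # ['Notice A', 'Notice B']
```

The same `html.xpath(...)` pattern is what you would use next to pull the announcement links out of the real page.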