Regular expressions: a simple crawler example (4K beauty wallpapers)
Section 1: Statement
These are my own study notes; nothing crawled here will be used commercially.
Section 2: Download link analysis
First of all, we have to find the detail page of the 4K beauty images we want to crawl.
The pictures here are what we want.
Next, we need to look at the source code of this page for further analysis. (Right-click and choose Inspect, or use the shortcut Ctrl+Shift+I.)
Take the first picture as an example (the code is as follows):
The source-file attribute src appears in the tag. We want to take the path after it, join it with 'http://pic.netbian.com' into a full link, but if you try that link in the browser first, this is the result you get.
(Why join it with http://pic.netbian.com? Because that is the link to the site's homepage, haha.)
It's a thumbnail, not the high-definition image we want.
Let's take a closer look. There is also a hyperlink tag wrapping the source-file tag. Click it and we come to this page:
Then we inspect this page as well.
Another source file turns up, so let's try joining it with 'http://pic.netbian.com'!
Ohhhh, this is the high-definition image we want!
So the next step is to sort out the approach and start crawling!
Section 3: Code writing
1. Import section
import requests
import re
import os
2. Construct the request
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36'
}
url = 'http://pic.netbian.com/4kmeinv/'
response_1 = requests.get(url=url, headers=headers)
html_1 = response_1.text
For learning the Requests library, I put together some notes earlier (I didn't really know how to write back then, so they're a bit messy):
Web crawler: Introduction to the Requests library
3. Constructing the regular expressions
Because we have to go from the list page through a detail page before reaching the high-definition image, the first regular expression we construct targets the detail-page links.
What we need is the content after the hyperlink tag's href attribute, so the regular expression can be written like this:
<a.*?href.+?(/tupian.+?html)
For the commonly used regex symbols, refer to Regular expressions: commonly used matching rules.
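To see what this pattern matches, here is a quick run against a hypothetical snippet modeled on the list page's HTML (the sample markup is an assumption for illustration, not copied from the site):

```python
import re

# A made-up list-page snippet in the style of pic.netbian.com.
sample = '<li><a href="/tupian/26783.html" target="_blank"><img src="/uploads/allimg/thumb.jpg"></a></li>'

# The group (/tupian.+?html) captures just the relative detail-page path.
links = re.findall('<a.*?href.+?(/tupian.+?html)', sample, re.S)
print(links)  # ['/tupian/26783.html']
```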
4. Data processing
After constructing the regex, we use it to extract the data:
image_urls = re.findall('<a.*?href.+?(/tupian.+?html)', html_1, re.S)
We can print it and take a look at the result.
We can see that it is a list: re.findall returns every match as a list, which is worth noting here. So we need to traverse this list in order to use each link.
for image_url in image_urls:
    picture_urls = 'http://pic.netbian.com' + image_url
Because the extracted results are not complete links, I prepend the site root while traversing. Then we print it and check the result.
We get a series of links. Clicking the first one opens exactly the page we inspected just now, which means we've succeeded.
Then we repeat the steps above: request this detail page and extract the data.
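Since the extracted href values are site-relative paths, plain string concatenation works here. For what it's worth, `urllib.parse.urljoin` is a more general alternative that also handles relative and absolute links correctly; a minimal sketch:

```python
from urllib.parse import urljoin

base = 'http://pic.netbian.com/4kmeinv/'
path = '/tupian/26783.html'  # a hypothetical extracted href

# Both approaches give the same result for a root-relative path.
print('http://pic.netbian.com' + path)  # plain concatenation
print(urljoin(base, path))              # urljoin resolves against the site root
```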
for image_url in image_urls:
    picture_urls = 'http://pic.netbian.com' + image_url
    # print(picture_urls)
    response_2 = requests.get(url=picture_urls, headers=headers)
    html_2 = response_2.text
    pictures = re.findall('<div.*?photo-pic.*?<img src="(/uploads.+?jpg).*?alt.+?"(.*?)"', html_2, re.S)
The regex here, rebuilt for the detail page, extracts two things: the source-file path and the content after alt (the image title).
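To see both capture groups in action, here is a run on a hypothetical detail-page snippet (the markup is an assumption modeled on the site's structure):

```python
import re

# A made-up detail-page snippet in the style of pic.netbian.com.
sample = '<div class="photo-pic"><a href=""><img src="/uploads/allimg/210101/pic.jpg" alt="4K美女壁纸"></a></div>'

# findall returns a list of (src_path, alt_text) tuples, one per match.
pictures = re.findall('<div.*?photo-pic.*?<img src="(/uploads.+?jpg).*?alt.+?"(.*?)"', sample, re.S)
print(pictures)  # [('/uploads/allimg/210101/pic.jpg', '4K美女壁纸')]
```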
for picture in pictures:
    picture_url = picture[0]
    picture_src = 'http://pic.netbian.com' + picture_url  # source-file link of the HD image
    picture_name = picture[1] + '.jpg'  # build the file name for the picture to be saved
    picture_name = picture_name.encode('iso-8859-1').decode('gbk')  # avoid garbled characters in the picture name
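Why the encode/decode dance works: this site serves GBK-encoded pages, but when a response declares no charset, requests falls back to decoding text as ISO-8859-1. Re-encoding that mis-decoded string back to its original bytes and decoding them as GBK recovers the Chinese name. A small offline demonstration of the same round trip (the sample string simulates the mojibake, it is not taken from the site):

```python
# Simulate what response.text would contain: GBK bytes mis-decoded as ISO-8859-1.
garbled = '美女'.encode('gbk').decode('iso-8859-1')
print(garbled)  # mojibake such as 'ÃÀÅ®'

# Reverse the mis-decoding: back to the original bytes, then decode as GBK.
fixed = garbled.encode('iso-8859-1').decode('gbk')
print(fixed)  # 美女
```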
5. Save the data
First, we have to create a folder
if not os.path.exists('D:/4K美女'):
    os.mkdir('D:/4K美女')
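As a side note, `os.makedirs` with `exist_ok=True` is a slightly more robust alternative: it creates intermediate directories too and never raises if the folder already exists, so the `os.path.exists` check becomes unnecessary. A sketch on a temporary path (the nested folder names are hypothetical):

```python
import os
import tempfile

base = tempfile.mkdtemp()                      # a throwaway base directory
target = os.path.join(base, '4K', 'pictures')  # hypothetical nested path

os.makedirs(target, exist_ok=True)
os.makedirs(target, exist_ok=True)  # calling it again is safe, no exception
print(os.path.isdir(target))
```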
Next, do the final processing of the data we obtained:
picture_data = requests.get(url=picture_src, headers=headers).content  # the content to write to the file, i.e. the HD image we want
picture_path = 'D:/4K美女/' + picture_name  # build the storage path for the picture
Finally, write the file; the with statement saves and closes it automatically:
with open(picture_path, 'wb') as f:
    f.write(picture_data)
    print(picture_path, '下载完成')
6. Complete code
import requests
import re
import os

# create the folder
if not os.path.exists('D:/4K美女'):
    os.mkdir('D:/4K美女')

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36'
}
url = 'http://pic.netbian.com/4kmeinv/'
response_1 = requests.get(url=url, headers=headers)
html_1 = response_1.text
image_urls = re.findall('<a.*?href.+?(/tupian.+?html)', html_1, re.S)
# print(image_urls)
for image_url in image_urls:
    picture_urls = 'http://pic.netbian.com' + image_url
    # print(picture_urls)
    response_2 = requests.get(url=picture_urls, headers=headers)
    html_2 = response_2.text
    pictures = re.findall('<div.*?photo-pic.*?<img src="(/uploads.+?jpg).*?alt.+?"(.*?)"', html_2, re.S)
    for picture in pictures:
        picture_url = picture[0]
        picture_src = 'http://pic.netbian.com' + picture_url  # source-file link of the HD image
        picture_name = picture[1] + '.jpg'  # build the file name for the picture to be saved
        picture_name = picture_name.encode('iso-8859-1').decode('gbk')  # avoid garbled characters in the picture name
        picture_data = requests.get(url=picture_src, headers=headers).content  # content to write to the file, i.e. the HD image
        picture_path = 'D:/4K美女/' + picture_name  # build the storage path
        # save the picture
        with open(picture_path, 'wb') as f:
            f.write(picture_data)
            print(picture_path, '下载完成')
Section 4: Supplement (multi-page crawling)
If you want to crawl multiple pages, you only need to wrap the first URL in a for loop, i.e.:
for i in range(2, 5):
    url = 'http://pic.netbian.com/4kmeinv/index_{}.html'.format(i)
    response_1 = requests.get(url=url, headers=headers)
    html_1 = response_1.text
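Note that pages 2 and up follow the index_N.html pattern, while the first page lives at the bare directory URL (the page-1 special case is an assumption based on how such list pages are usually laid out), so building the full URL list can be sketched as:

```python
base = 'http://pic.netbian.com/4kmeinv/'

# Page 1 is the directory itself; pages 2+ use the index_{n}.html pattern.
page_urls = [base] + [base + 'index_{}.html'.format(i) for i in range(2, 5)]
for u in page_urls:
    print(u)
```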
So let's go!!!!!
This is the first crawler example I have completed. I feel I've written it pretty well; could you give me a thumbs-up to support it, hehe?
I know my terminology may not be very standard in places, and I hope more experienced folks can give me some suggestions.