1. Crawling target
The target of this crawl is the 4K high-definition pictures of young ladies on a certain website:
2. Realize the effect
Batch-download pictures matching a specified keyword and store them in a specified folder:
3. Preparation work
Python: 3.10
Editor: PyCharm
Third-party modules (install them yourself):
pip install requests  # fetch web page data
pip install lxml  # extract data from the page
4. Proxy IP
4.1 What are the benefits of using a proxy?
The benefits of crawlers using proxy IPs are as follows:
- Rotate IP address: Use proxy IP to rotate IP addresses, reduce the risk of being banned, and maintain the continuity and stability of crawling.
- Improve collection speed: a proxy service provides multiple IP addresses, allowing the crawler to use multiple threads at the same time and thus speed up data collection (see the sketch after this list).
- Bypass anti-crawler mechanism: Many websites adopt various anti-crawler mechanisms, such as IP bans, verification codes, request frequency limits, etc. Using proxy IP can help crawlers bypass these mechanisms and maintain normal data collection.
- Protect personal privacy: Using proxy IP can help hide the real IP address and protect personal identity and privacy information.
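To illustrate the second point, here is a minimal sketch of combining proxies with a thread pool. It assumes the get_ip() helper defined later in section 4.3 returns a fresh proxy for each task, and the page URLs are only placeholders:

from concurrent.futures import ThreadPoolExecutor
import requests

def fetch(url):
    # each task pulls its own proxy IP, so requests go out from different addresses
    proxies = get_ip()  # helper defined in section 4.3
    resp = requests.get(url, proxies=proxies, timeout=10)
    return url, resp.status_code

# placeholder list of pages to collect
urls = [f'https://pic.netbian.com/4kmeinv/index_{i}.html' for i in range(2, 6)]
with ThreadPoolExecutor(max_workers=4) as pool:
    for url, status in pool.map(fetch, urls):
        print(url, status)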
The blogger usually writes crawler code with high-anonymity proxy IPs from Juliang IP, which offers 1,000 free IPs every day: Click for a free trial
4.2 Get free proxies
1. Open the Juliang IP official website: Juliang IP official website
2. Enter account information to register:
3. Real-name authentication is required here. If you don't know how, you can read the personal registration real-name verification tutorial:
4. Enter the member center and click to claim today’s free IP:
5. For detailed steps, please refer to the official tutorial document on claiming the Juliang HTTP free proxy IP package. After claiming, it looks like this:
6. Click Product Management > Dynamic Proxy (time-based package) and you can see the free IP information we just claimed:
7. Add your computer's IP to the whitelist before you can extract the proxy IPs. Click Authorization Information:
8. Click Modify Authorization > Quick Add > Confirm:
9. After the addition is completed, click to generate the extraction link:
10. Set the quantity for each extraction, click Generate Link, and copy the link:
11. Copy the link to the address bar and you can see the proxy IP we obtained:
4.3 Obtain and use the proxy in code
After obtaining the image links, we need to send requests again to download the images. Since the request volume is usually large, we need a proxy IP. We obtained a proxy IP manually above; now let's see how to attach the proxy IP to a request in Python:
1. Use the crawler to fetch a proxy IP from the API interface (note: replace the proxy URL below with your own API link from section 4.2):
import requests
import time

def get_ip():
    url = "put your own API link here"
    while 1:
        try:
            r = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        ip = r.text.strip()
        if '请求过于频繁' in ip:  # the API answers "requests too frequent" in Chinese
            print('IP request too frequent')
            time.sleep(1)
            continue
        break
    proxies = {
        'https': '%s' % ip
    }
    return proxies

if __name__ == '__main__':
    proxies = get_ip()
    print(proxies)
From the running results, you can see that the proxy IP in the interface is returned:
2. Next, when writing the crawler, we can attach the proxy IP to every request simply by passing proxies as a parameter to requests.get when requesting other URLs:
requests.get(url, headers=headers, proxies=proxies)
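As a quick sanity check (a minimal sketch, not part of the original tutorial; httpbin.org is just a convenient echo service), you can send a request through the proxy to a site that returns the IP it sees and confirm it is the proxy's address rather than your own:

import requests

proxies = get_ip()  # helper from the step above
headers = {'User-Agent': 'Mozilla/5.0'}
# httpbin echoes back the client IP it sees; it should be the proxy's IP
resp = requests.get('https://httpbin.org/ip', headers=headers, proxies=proxies, timeout=10)
print(resp.text)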
5. Proxy in practice
5.1 Import module
import requests          # basic Python crawling library
from lxml import etree   # converts the page into an Element object for XPath parsing
import time              # sleep between requests to avoid crawling too fast
import os                # create the output folder
5.2 Handle pagination
First, let's look at how the site paginates; there are 62 pages in total:
First page link:
https://pic.netbian.com/4kmeinv/index.html
Second page link:
https://pic.netbian.com/4kmeinv/index_2.html
Third page link:
https://pic.netbian.com/4kmeinv/index_3.html
As you can see, starting from the second page the URL simply appends _<page number> to index, so we use a loop to construct the links for every page:
if __name__ == '__main__':
    # number of pages
    page_number = 1
    # build the link for each page in a loop
    for i in range(1, page_number + 1):
        # the first page is fixed; later pages append the page number
        if i == 1:
            url = 'https://pic.netbian.com/4kmeinv/index.html'
        else:
            url = f'https://pic.netbian.com/4kmeinv/index_{i}.html'
5.3 Get image link
You can see that all image URLs sit under the ul tag > li tag > a tag > img tag:
We create a get_imgurl_list(url, imgurl_list) function that takes the page link, fetches the page source, and uses XPath to locate the link of each image:
def get_imgurl_list(url, imgurl_list):
    """Get the image links on one page"""
    # request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}
    # send the request
    response = requests.get(url=url, headers=headers)
    # get the page source
    html_str = response.text
    # convert the HTML string into an etree object so we can parse it with XPath
    html_data = etree.HTML(html_str)
    # use XPath to grab all li tags
    li_list = html_data.xpath("//ul[@class='clearfix']/li")
    # print the number of li tags to check it matches the number of images per page
    print(len(li_list))  # prints 20, as expected
    for li in li_list:
        imgurl = li.xpath(".//a/img/@src")[0]
        # build the full URL
        imgurl = 'https://pic.netbian.com' + imgurl
        print(imgurl)
        # append to the list
        imgurl_list.append(imgurl)
Running result:
Click on a picture link to see:
OK no problem!!!
5.4 Download pictures
Now we have the image links and the proxy IPs, so we can download the images. Define a get_down_img(imgurl_list)
function that takes the list of image links, traverses it, switches the proxy for every image downloaded, and saves all images to the specified folder:
def get_down_img(imgurl_list):
    # create a folder in the current directory to store the images
    os.makedirs("小姐姐", exist_ok=True)  # avoid an error if the folder already exists
    # image counter used for file names
    n = 0
    for img_url in imgurl_list:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}
        # call get_ip() to obtain a proxy IP
        proxies = get_ip()
        # switch the proxy IP on every request to avoid being banned
        img_data = requests.get(url=img_url, headers=headers, proxies=proxies).content
        # build the file path and name
        img_path = './小姐姐/' + str(n) + '.jpg'
        # write the image to the target location
        with open(img_path, 'wb') as f:
            f.write(img_data)
        # increment the image counter
        n = n + 1
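The time module is imported above so we can "sleep a second to avoid crawling too fast", but get_down_img never actually uses it. A minimal, hedged sketch of how the download step could be throttled and retried (the download_one helper and its parameters are illustrative assumptions, not part of the original code):

import time
import requests

def download_one(img_url, img_path, proxies, headers):
    # retry once if the proxied request fails, then pause before the next image
    for attempt in range(2):
        try:
            img_data = requests.get(url=img_url, headers=headers,
                                    proxies=proxies, timeout=10).content
            with open(img_path, 'wb') as f:
                f.write(img_data)
            break
        except requests.RequestException:
            time.sleep(1)  # back off briefly before retrying
    time.sleep(1)  # throttle: wait one second between images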
5.5 Call the main function
Here we can set the number of pages to crawl:
if __name__ == '__main__':
    # 1. set the number of pages to crawl
    page_number = 63
    imgurl_list = []  # stores all the image links
    # 2. build the link for each page in a loop
    for i in range(1, page_number + 1):
        # the first page is fixed; later pages append the page number
        if i == 1:
            url = 'https://pic.netbian.com/4kmeinv/index.html'
        else:
            url = f'https://pic.netbian.com/4kmeinv/index_{i}.html'
        # 3. get the image links
        get_imgurl_list(url, imgurl_list)
    # 4. download the images
    get_down_img(imgurl_list)
5.6 Complete source code
Note: replace the proxy URL below with your own API link; see section 4.2:
import requests          # basic Python crawling library
from lxml import etree   # converts the page into an Element object for XPath parsing
import time              # sleep between requests to avoid crawling too fast
import os                # create the output folder

def get_ip():
    url = "put your own API link here"
    while 1:
        try:
            r = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        ip = r.text.strip()
        if '请求过于频繁' in ip:  # the API answers "requests too frequent" in Chinese
            print('IP request too frequent')
            time.sleep(1)
            continue
        break
    proxies = {
        'https': '%s' % ip
    }
    return proxies

def get_imgurl_list(url, imgurl_list):
    """Get the image links on one page"""
    # request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}
    # send the request
    response = requests.get(url=url, headers=headers)
    # get the page source
    html_str = response.text
    # convert the HTML string into an etree object so we can parse it with XPath
    html_data = etree.HTML(html_str)
    # use XPath to grab all li tags
    li_list = html_data.xpath("//ul[@class='clearfix']/li")
    # print the number of li tags to check it matches the number of images per page
    print(len(li_list))  # prints 20, as expected
    for li in li_list:
        imgurl = li.xpath(".//a/img/@src")[0]
        # build the full URL
        imgurl = 'https://pic.netbian.com' + imgurl
        print(imgurl)
        # append to the list
        imgurl_list.append(imgurl)

def get_down_img(imgurl_list):
    # create a folder in the current directory to store the images
    os.makedirs("小姐姐", exist_ok=True)  # avoid an error if the folder already exists
    # image counter used for file names
    n = 0
    for img_url in imgurl_list:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}
        # call get_ip() to obtain a proxy IP
        proxies = get_ip()
        # switch the proxy IP on every request to avoid being banned
        img_data = requests.get(url=img_url, headers=headers, proxies=proxies).content
        # build the file path and name
        img_path = './小姐姐/' + str(n) + '.jpg'
        # write the image to the target location
        with open(img_path, 'wb') as f:
            f.write(img_data)
        # increment the image counter
        n = n + 1

if __name__ == '__main__':
    # 1. set the number of pages to crawl
    page_number = 50
    imgurl_list = []  # stores all the image links
    # 2. build the link for each page in a loop
    for i in range(1, page_number + 1):
        # the first page is fixed; later pages append the page number
        if i == 1:
            url = 'https://pic.netbian.com/4kmeinv/index.html'
        else:
            url = f'https://pic.netbian.com/4kmeinv/index_{i}.html'
        # 3. get the image links
        get_imgurl_list(url, imgurl_list)
    # 4. download the images
    get_down_img(imgurl_list)
Running result:
The download succeeded without any errors; the quality of these proxy IPs is quite good!!!
5.7 What should I do if the proxy IP is not enough?
What if 1,000 free proxy IPs per day are not enough? Friends who write crawler code often and need a lot of proxy IPs may want to try Juliang IP's unlimited proxy IP package; an IP validity period of 30-60 seconds is enough: Click to buy
There are 5 proxy pools by default, with at most 50 IPs per extraction and one extraction per second; if you extract 1 IP at a time, you can extract up to 50 times per second. If 50 proxy IPs at a time is not enough, you can enlarge the IP pool.
With the default five pools, 50 proxy IPs can be extracted per second. A day has 86,400 seconds, so 50 x 86,400 = 4,320,000 proxy IPs can be extracted per day. So the blogger decisively bought himself an annual package, and it is hard to overstate how comfortable that is:
6. Summary
Proxy IPs are indispensable for crawlers: a proxy IP helps the crawler hide its real IP address. Friends who need proxy IPs can try Juliang IP: Juliang IP official website