Article directory
1. Crawling target
The target this time is 4K high-definition girl wallpapers from a certain website:
2. Expected result
Batch-download images matching specified keywords and save them to a specified folder:
3. Preparation work
Python:3.10
Editor:PyCharm
Third-party modules (install them yourself):
pip install requests # fetch web pages
pip install lxml # parse and extract data from HTML
4. Obtain free proxy IP
4.1 What are the benefits of using a proxy?
The benefits of crawlers using proxy IPs are as follows:
- Rotate IP addresses: rotating proxy IPs reduces the risk of being banned and keeps crawling continuous and stable.
- Faster collection: a proxy pool provides many IP addresses, so the crawler can run multiple threads at the same time and collect data faster.
- Bypass anti-crawler mechanisms: many websites use IP bans, CAPTCHAs, request-rate limits, and similar defenses. Proxy IPs help the crawler work around these and keep collecting data normally.
- Protect personal privacy: a proxy hides your real IP address, protecting your identity and private information.
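To illustrate the IP-rotation point above, here is a minimal sketch that cycles through a pool of proxies in round-robin order so that consecutive requests go out through different IPs. The pool addresses are made-up placeholders (TEST-NET addresses), not real proxies:

```python
from itertools import cycle

# hypothetical proxy pool -- these addresses are placeholders, not real proxies
PROXY_POOL = [
    {'https': 'http://203.0.113.10:8080'},
    {'https': 'http://203.0.113.11:8080'},
    {'https': 'http://203.0.113.12:8080'},
]

_rotation = cycle(PROXY_POOL)

def next_proxy():
    """Return the next proxy dict in round-robin order."""
    return next(_rotation)

# each call yields a different proxy; after the pool is exhausted it wraps around
first = next_proxy()
second = next_proxy()
print(first)   # the first proxy in the pool
print(second)  # a different proxy for the next request
```

In a real crawler you would pass the returned dict to `requests.get(..., proxies=next_proxy())` on each request.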
The blogger recently found a pretty good high-anonymity proxy service, Yipin HTTP, which gives you 1 GB of traffic for free after registration: click for a free trial.
4.2 Get a free proxy
1. Open the Yipin HTTP official website: click Free Trial
2. All proxy IPs require real-name authentication before they can be used. If you are not sure how, see the real-name tutorial: Real-name Tutorial
3. Select Whitelist > Add whitelist:
4. Click API Extraction > select direct-connection extraction:
5. Select traffic extraction and click to generate an API link (the 1 GB of free traffic is used here):
If you still cannot extract a free proxy IP, ask customer service in the lower-left corner:
4.3 Obtain a proxy in Python
After obtaining the image links, we need to send requests again to download the images. Since the request volume is large, we use proxy IPs. We obtained a proxy API manually above; now let's see how Python attaches a proxy IP to a request:
1. Fetch a proxy IP from the API endpoint with a crawler (note: replace the URL below with your own API link from section 4.2):
import requests
import time

def get_ip():
    url = "put your own API link here"
    while True:
        try:
            r = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        ip = r.text.strip()
        if '请求过于频繁' in ip:  # the API answered "requests too frequent"
            print('IP requests too frequent, retrying...')
            time.sleep(1)
            continue
        break
    proxies = {
        'https': ip
    }
    return proxies

if __name__ == '__main__':
    proxies = get_ip()
    print(proxies)
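The exact response format varies by provider. Assuming the API returns one ip:port per line, as the code above expects, a small helper like the following (hypothetical, not part of the tutorial code) can turn the raw text into the proxies dict that requests expects, including the scheme prefix:

```python
def build_proxies(api_text):
    """Turn an API response like '1.2.3.4:8080' into a requests proxies dict.

    Assumes the provider returns one ip:port per line; only the first
    entry is used here.
    """
    ip = api_text.strip().splitlines()[0].strip()
    # route both http and https traffic through the same proxy
    return {
        'http': 'http://' + ip,
        'https': 'http://' + ip,
    }

print(build_proxies('203.0.113.10:8080\n203.0.113.11:8080'))
# {'http': 'http://203.0.113.10:8080', 'https': 'http://203.0.113.10:8080'}
```

Mapping both the 'http' and 'https' keys ensures the proxy is used regardless of the scheme of the target URL.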
5. Proxy crawling in practice
5.1 Import modules
import requests        # basic Python crawling library
from lxml import etree # converts HTML into an Element object for XPath parsing
import time            # sleep between requests to avoid crawling too fast
import os              # create folders
5.2 Handle pagination
First, let's analyze the site's pagination. There are 10 pages in total:
First page link:
https://www.moyublog.com/95-2-2-0.html
Second page link:
https://www.moyublog.com/95-2-2-1.html
Third page link:
https://www.moyublog.com/95-2-2-2.html
As you can see, every link shares the prefix 95-2-2-, and the suffix starts at 0 on the first page and increases by 1 per page, so we can build all the page links with a loop:
if __name__ == '__main__':
    # number of pages
    page_number = 10
    # build each page's link in a loop
    for i in range(page_number):
        # the page suffix starts at 0, so 10 pages are suffixes 0 through 9
        url = f'https://www.moyublog.com/95-2-2-{i}.html'
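As a quick sanity check, the same construction can be written as a list comprehension; since the first page ends in 0, the 10 pages correspond to suffixes 0 through 9:

```python
page_number = 10
# build the 10 page links; suffixes run from 0 (first page) to 9 (last page)
urls = [f'https://www.moyublog.com/95-2-2-{i}.html' for i in range(page_number)]
print(urls[0])    # https://www.moyublog.com/95-2-2-0.html
print(urls[-1])   # https://www.moyublog.com/95-2-2-9.html
print(len(urls))  # 10
```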
5.3 Get image links
You can see that all the image URLs sit under the ul tag > li tag > a tag > img tag.
We create a get_imgurl_list(url, imgurl_list) function that takes a page link, fetches the page source, and uses XPath to locate each image link:
def get_imgurl_list(url, imgurl_list):
    """Get the image links from one page"""
    # request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}
    # send the request
    response = requests.get(url=url, headers=headers)
    # get the page source
    html_str = response.text
    # convert the HTML string into an etree object so it can be parsed with XPath
    html_data = etree.HTML(html_str)
    # use XPath to grab all the li tags
    li_list = html_data.xpath("//ul[@class='clearfix']/li")
    # print the number of li tags to check it matches the number of images per page
    print(len(li_list))  # prints 20, as expected
    for li in li_list:
        imgurl = li.xpath(".//a/img/@data-original")[0]
        print(imgurl)
        # append to the list
        imgurl_list.append(imgurl)
Output:
Open one of the image links to check it; OK, no problem:
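To see how the XPath expressions above behave, here is a self-contained sketch that runs the same //ul[@class='clearfix']/li and .//a/img/@data-original extraction against a small inline HTML snippet (the snippet and its URLs are invented for illustration):

```python
from lxml import etree

# a tiny made-up fragment mimicking the page structure described above
html_str = """
<ul class="clearfix">
  <li><a href="/p/1.html"><img data-original="https://example.com/img/1.jpg"></a></li>
  <li><a href="/p/2.html"><img data-original="https://example.com/img/2.jpg"></a></li>
</ul>
"""

html_data = etree.HTML(html_str)
li_list = html_data.xpath("//ul[@class='clearfix']/li")
print(len(li_list))  # 2

imgurl_list = []
for li in li_list:
    # @data-original holds the real image URL -- lazy-loading pages often
    # keep it there instead of in @src
    imgurl_list.append(li.xpath(".//a/img/@data-original")[0])

print(imgurl_list)
```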
5.4 Download images
Now we have both the image links and the proxy IPs, so we can download the images. Define a get_down_img(imgurl_list)
function that takes the list of image links, traverses it, switches to a fresh proxy for each image, and downloads everything into the specified folder:
def get_down_img(imgurl_list):
    # create the folder that stores the images (no error if it already exists)
    os.makedirs("小姐姐", exist_ok=True)
    # image counter used in the file names
    n = 0
    for img_url in imgurl_list:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}
        # call get_ip() to fetch a fresh proxy IP
        proxies = get_ip()
        # switch the proxy IP on every request to avoid being banned
        img_data = requests.get(url=img_url, headers=headers, proxies=proxies).content
        # build the image path and file name
        img_path = './小姐姐/' + str(n) + '.jpg'
        # write the image to disk
        with open(img_path, 'wb') as f:
            f.write(img_data)
        # increment the counter
        n = n + 1
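The code above hardcodes the .jpg extension. If the site serves mixed formats, a small helper like this (stdlib only; the function name is my own, not part of the tutorial code) derives the extension from the URL itself, falling back to .jpg when none is present:

```python
import os
from urllib.parse import urlparse

def img_filename(img_url, n, default_ext='.jpg'):
    """Build a numbered file name, keeping the extension from the URL."""
    # urlparse strips the query string, so '?w=1920' does not leak into the name
    path = urlparse(img_url).path
    ext = os.path.splitext(path)[1] or default_ext
    return str(n) + ext

print(img_filename('https://example.com/pics/abc.png?w=1920', 3))  # 3.png
print(img_filename('https://example.com/pics/abc', 4))             # 4.jpg
```

Inside get_down_img you would then write `img_path = './小姐姐/' + img_filename(img_url, n)` instead of hardcoding '.jpg'.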
5.5 Call the main function
Here we set how many pages to crawl:
if __name__ == '__main__':
    page_number = 10   # number of pages to crawl
    imgurl_list = []   # stores the image links
    # 1. build each page's link in a loop
    for i in range(page_number):
        # the page suffix starts at 0
        url = f'https://www.moyublog.com/95-2-2-{i}.html'
        print(url)
        # 2. get the image links
        get_imgurl_list(url, imgurl_list)
    # 3. download the images
    get_down_img(imgurl_list)
5.6 Complete source code
Note: replace the proxy URL below with your own API link from section 4.2 (1 GB of free proxy traffic):
import requests        # basic Python crawling library
from lxml import etree # converts HTML into an Element object for XPath parsing
import time            # sleep between requests to avoid crawling too fast
import os              # create folders

def get_ip():
    url = "put your own API link here"
    while True:
        try:
            r = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        ip = r.text.strip()
        if '请求过于频繁' in ip:  # the API answered "requests too frequent"
            print('IP requests too frequent, retrying...')
            time.sleep(1)
            continue
        break
    proxies = {
        'https': ip
    }
    return proxies

def get_imgurl_list(url, imgurl_list):
    """Get the image links from one page"""
    # request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}
    # send the request
    response = requests.get(url=url, headers=headers)
    # get the page source
    html_str = response.text
    # convert the HTML string into an etree object so it can be parsed with XPath
    html_data = etree.HTML(html_str)
    # use XPath to grab all the li tags
    li_list = html_data.xpath("//ul[@class='clearfix']/li")
    # print the number of li tags to check it matches the number of images per page
    print(len(li_list))  # prints 20, as expected
    for li in li_list:
        imgurl = li.xpath(".//a/img/@data-original")[0]
        print(imgurl)
        # append to the list
        imgurl_list.append(imgurl)

def get_down_img(imgurl_list):
    # create the folder that stores the images (no error if it already exists)
    os.makedirs("小姐姐", exist_ok=True)
    # image counter used in the file names
    n = 0
    for img_url in imgurl_list:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}
        # call get_ip() to fetch a fresh proxy IP
        proxies = get_ip()
        # switch the proxy IP on every request to avoid being banned
        img_data = requests.get(url=img_url, headers=headers, proxies=proxies).content
        # build the image path and file name
        img_path = './小姐姐/' + str(n) + '.jpg'
        # write the image to disk
        with open(img_path, 'wb') as f:
            f.write(img_data)
        # increment the counter
        n = n + 1

if __name__ == '__main__':
    page_number = 10   # number of pages to crawl
    imgurl_list = []   # stores the image links
    # 1. build each page's link in a loop
    for i in range(page_number):
        # the page suffix starts at 0
        url = f'https://www.moyublog.com/95-2-2-{i}.html'
        print(url)
        # 2. get the image links
        get_imgurl_list(url, imgurl_list)
    # 3. download the images
    get_down_img(imgurl_list)
Output:
All downloads succeeded with no errors; the quality of the proxy IPs held up well!
6. Summary
Proxy IPs are indispensable for crawlers: they hide the crawler's real IP address. If you need proxy IPs, you can try Yipin HTTP: click for a free trial
Book recommendations
Customer retention data analysis and prediction (data science and big data technology)
For any business that relies on recurring revenue and repeat sales, keeping customers active and buying is essential. Customer churn is a costly and frustrating event that can often be prevented. Using the techniques described in this book, you can spot the early warning signs of churn and learn to identify and retain customers before they leave.
Customer Retention Data Analysis and Prediction teaches developers and data scientists proven techniques and methods to stop customer churn before it happens. The book contains many real-world examples of how to transform raw data into measurable behavioral indicators, calculate customer lifetime value, and use demographic data to improve churn predictions. By following the approach of Zuora's Chief Data Scientist Carl Gold, you'll reap the benefits of high customer retention.
Main contents:
● Calculate churn indicators
● Predict customer churn through customer behavior
● Use customer segmentation strategies to reduce customer churn
● Apply churn analysis technology to other business areas
● Use artificial intelligence technology for accurate customer churn prediction
JD.com link: https://item.jd.com/13999686.html