Foreword
This article shows how to use Python web-crawling techniques to collect proxy IPs and save them to a file. Using the third-party requests library to send HTTP requests and the lxml library to parse HTML, we extract IP, port, and address information from multiple web pages. The article walks through each part of the code step by step to help readers understand how the crawler works.
Import dependent libraries
import requests
from lxml import etree
The requests library is imported for sending HTTP requests, and the lxml library for parsing HTML.
Open the file ready to write data
with open('IP代理.txt','w',encoding='utf-8') as f:
The open function creates a file object f for the file named 'IP代理.txt', opened in write mode ('w') with the encoding set to 'utf-8'.
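As a minimal, self-contained sketch of the same context-manager pattern (using a throwaway temp-file path rather than the article's 'IP代理.txt'):

```python
import os
import tempfile

# Hypothetical throwaway path, just for illustration
path = os.path.join(tempfile.gettempdir(), 'proxy_demo.txt')

# The with statement closes the file automatically, even if an error occurs
with open(path, 'w', encoding='utf-8') as f:
    f.write('IP地址:1.2.3.4\n')

with open(path, encoding='utf-8') as f:
    print(f.read())
```

Because the file is closed as soon as the with block exits, every buffered write is flushed to disk even if the crawl raises an exception partway through.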
Loop to crawl multiple pages
for i in range(1, 10):
    url = f'http://www.66ip.cn/{i}.html'
    print(f'正在获取{url}')
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36'
    }
    resp = requests.get(url, headers=headers)
    resp.encoding = 'gbk'
    e = etree.HTML(resp.text)
    ips = e.xpath('//div[1]/table//tr/td[1]/text()')
    ports = e.xpath('//div[1]/table//tr/td[2]/text()')
    addrs = e.xpath('//div[1]/table//tr/td[3]/text()')
    for ip, port, addr in zip(ips, ports, addrs):
        f.write(f'IP地址:{ip}----port端口号:{port}-----地址:{addr}\n')
This part of the code loops over multiple pages of proxy information. The loop variable i ranges from 1 to 9. For each page, the full URL http://www.66ip.cn/{i}.html is constructed, where {i} is the page number, and print reports the URL currently being fetched.
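To see exactly which URLs the loop visits, the same f-string pattern can be evaluated on its own:

```python
# The same URL pattern used in the article's loop
urls = [f'http://www.66ip.cn/{i}.html' for i in range(1, 10)]
print(urls[0], urls[-1])
```

Note that range(1, 10) excludes the upper bound, so exactly nine pages (1 through 9) are fetched.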
Next, to disguise the request as coming from a browser, a headers dictionary is defined containing a browser User-Agent string.
A GET request is then sent with the requests library, passing the headers dictionary along. The response is stored in the variable resp.
The response encoding is set to 'gbk' because the target website uses GBK encoding.
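Why the encoding matters: if GBK-encoded bytes are decoded with the wrong codec, Chinese text turns into mojibake. A small sketch with sample text (no network involved):

```python
raw = '地址:北京'.encode('gbk')  # bytes as a GBK server would send them
good = raw.decode('gbk')         # what resp.encoding = 'gbk' achieves
bad = raw.decode('latin-1')      # decoding with the wrong codec garbles the text
print(good)
print(bad)
```

Setting resp.encoding before touching resp.text tells requests which codec to use when turning the response body into a string.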
The response text is parsed into a navigable HTML object with the etree.HTML function from the lxml library and assigned to the variable e.
Lists of IPs, ports, and addresses are then extracted from the HTML object with XPath expressions: the IP list is stored in ips, the port list in ports, and the address list in addrs.
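The extraction step can be reproduced on a tiny hand-written HTML fragment (hypothetical markup, shaped like the target page's proxy table):

```python
from lxml import etree

# Hypothetical fragment mimicking the target page's table layout
html = '''
<div>
  <table>
    <tr><td>1.2.3.4</td><td>8080</td><td>Beijing</td></tr>
    <tr><td>5.6.7.8</td><td>3128</td><td>Shanghai</td></tr>
  </table>
</div>
'''
e = etree.HTML(html)
# td[1], td[2], td[3] select the first, second, and third column of each row
ips = e.xpath('//div[1]/table//tr/td[1]/text()')
ports = e.xpath('//div[1]/table//tr/td[2]/text()')
addrs = e.xpath('//div[1]/table//tr/td[3]/text()')
print(ips, ports, addrs)
```

Note that XPath indices are 1-based, unlike Python's 0-based list indices.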
The zip function pairs the three lists element by element, and a for loop traverses the paired tuples. Inside the loop, the file object's write method writes each proxy entry to the file in the format 'IP地址:{ip}----port端口号:{port}-----地址:{addr}\n'.
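The pairing-and-formatting step can be checked in isolation with small sample lists (hypothetical values):

```python
# Sample data standing in for the scraped lists
ips = ['1.2.3.4', '5.6.7.8']
ports = ['8080', '3128']
addrs = ['Beijing', 'Shanghai']

# zip stops at the shortest list, so ragged scrapes cannot mis-pair columns
lines = [f'IP地址:{ip}----port端口号:{port}-----地址:{addr}\n'
         for ip, port, addr in zip(ips, ports, addrs)]
print(''.join(lines))
```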
Altogether, the code crawls the IP, port, and address information from multiple web pages and saves the results to a file named 'IP代理.txt'.
Full code
import requests
from lxml import etree

# Open the file that will hold the results
with open('IP代理.txt', 'w', encoding='utf-8') as f:
    # Loop over multiple pages
    for i in range(1, 10):
        # Build the full URL
        url = f'http://www.66ip.cn/{i}.html'
        print(f'正在获取{url}')
        # Disguise the request with a browser User-Agent header
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36'
        }
        # Send the GET request
        resp = requests.get(url, headers=headers)
        # The target site is GBK-encoded
        resp.encoding = 'gbk'
        # Parse the HTML
        e = etree.HTML(resp.text)
        # Extract the IP, port, and address columns
        ips = e.xpath('//div[1]/table//tr/td[1]/text()')
        ports = e.xpath('//div[1]/table//tr/td[2]/text()')
        addrs = e.xpath('//div[1]/table//tr/td[3]/text()')
        # Write each proxy entry to the file
        for ip, port, addr in zip(ips, ports, addrs):
            f.write(f'IP地址:{ip}----port端口号:{port}-----地址:{addr}\n')
Running result
Conclusion
With the Python crawling techniques introduced in this article, you can easily collect proxy IPs and save them to a file. This is very useful for data collection, working around anti-crawler measures, or any other scraping application that needs proxy IPs. I hope this article helps you better understand how crawlers work and proves useful in real projects.
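Once entries are saved, they are typically fed back into requests via its proxies parameter. A minimal sketch (the proxy address is made up, and the actual network call is left commented out):

```python
def make_proxies(ip, port):
    """Build the scheme -> proxy-URL mapping that requests expects."""
    proxy_url = f'http://{ip}:{port}'
    return {'http': proxy_url, 'https': proxy_url}

proxies = make_proxies('1.2.3.4', '8080')  # hypothetical values
print(proxies)
# import requests
# resp = requests.get('http://example.com', proxies=proxies, timeout=5)
```

Free proxies from lists like this are often dead or slow, so in practice each one should be tested with a short timeout before being relied on.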