[Play with Python series] [Must-see for beginners] Use Python crawler technology to obtain proxy IPs and save them to a file

Foreword

This article introduces how to use Python crawler technology to obtain proxy IPs and save them to a file. Using the third-party requests library to send HTTP requests and the lxml library to parse HTML, we can extract the IP, port, and address information from multiple web pages. The article walks through each part of the code step by step to help readers better understand how the crawler works.

Import dependent libraries

import requests
from lxml import etree

Import the requests library for sending HTTP requests and the lxml library for parsing HTML.
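
If the two libraries are not installed yet, they can be added with pip install requests lxml. As a quick sanity check that both imports work together, here is a minimal sketch, assuming network access and using example.com purely as a stand-in test page:

import requests
from lxml import etree

# Fetch a simple page and read its <title> to confirm requests and lxml
# are installed and working together.
resp = requests.get('https://example.com', timeout=10)
doc = etree.HTML(resp.text)
print(doc.xpath('//title/text()'))  # e.g. ['Example Domain']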

Open the file ready to write data

with open('IP代理.txt','w',encoding='utf-8') as f:

Use the open function to create a file object f, specifying the file name 'IP代理.txt' and opening the file in write mode with the encoding set to 'utf-8'.
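
Note that mode 'w' overwrites IP代理.txt on every run, and the with statement closes the file automatically when the block ends. A small sketch of the same pattern, plus an append-mode variant in case earlier results should be kept (the append mode is added here for illustration and is not part of the original script):

# 'w' truncates any existing IP代理.txt; the with-block closes the file
# automatically, even if an exception is raised inside it.
with open('IP代理.txt', 'w', encoding='utf-8') as f:
    f.write('example line\n')

# Hypothetical variant: open in append mode to keep results from earlier runs.
with open('IP代理.txt', 'a', encoding='utf-8') as f:
    f.write('another line\n')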

Loop to crawl multiple pages

for i in range(1, 10):
    url = f'http://www.66ip.cn/{i}.html'
    print(f'正在获取{url}')
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36'
    }
    resp = requests.get(url, headers=headers)
    resp.encoding = 'gbk'
    e = etree.HTML(resp.text)
    ips = e.xpath('//div[1]/table//tr/td[1]/text()')
    ports = e.xpath('//div[1]/table//tr/td[2]/text()')
    addrs = e.xpath('//div[1]/table//tr/td[3]/text()')

    for ip, port, addr in zip(ips, ports, addrs):
        f.write(f'IP地址:{ip}----port端口号:{port}-----地址:{addr}\n')

This part of the code uses a loop to crawl the proxy information from multiple pages. The loop variable i ranges from 1 to 9. For each page, the full URL is first constructed as http://www.66ip.cn/{i}.html, where {i} is the page number. Then the print function prints the URL of the page being fetched.

Next, to disguise the request as coming from a regular browser, a headers dictionary is defined containing the browser's User-Agent string.
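
A single hard-coded User-Agent is enough for this script; if you want to vary it between requests, one common variation is to pick a User-Agent at random from a small pool. A sketch of that idea, with a made-up pool that is not part of the original code:

import random

# A small, made-up pool of User-Agent strings to rotate through.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1 Safari/605.1.15',
]

headers = {'User-Agent': random.choice(USER_AGENTS)}
print(headers)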

A GET request is sent through the requests library, passing the headers dictionary with the User-Agent information. The response is saved in the variable resp.
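
The request in the article has no timeout or error handling, so a slow or blocked page can stall the script. A sketch of a slightly more defensive variant (the 10-second timeout and the raise_for_status call are added here for illustration):

import requests

url = 'http://www.66ip.cn/1.html'
headers = {'User-Agent': 'Mozilla/5.0'}

try:
    # timeout makes the request fail fast instead of hanging;
    # raise_for_status() turns HTTP error codes (403, 503, ...) into exceptions.
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()
except requests.RequestException as exc:
    print(f'请求失败: {exc}')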

The encoding of the response is set to 'gbk' because the target website uses GBK encoding.
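
Hard-coding 'gbk' is fine when the site's encoding is known. If it is not, requests can guess the encoding from the response body; a sketch of that alternative (apparent_encoding is only a best-effort guess, so the explicit value remains the safer choice here):

import requests

resp = requests.get('http://www.66ip.cn/1.html', timeout=10)
# Let requests guess the encoding from the response body instead of
# hard-coding it; fall back to 'gbk' if detection returns nothing.
resp.encoding = resp.apparent_encoding or 'gbk'
print(resp.encoding)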

The response text is parsed into a queryable HTML object with the etree.HTML function of the lxml library and assigned to the variable e.

XPath expressions extract the lists of IPs, ports, and addresses from the HTML object. The IP list is stored in ips, the port list in ports, and the address list in addrs.
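
The three separate XPath queries rely on the table columns staying aligned across the whole page. An alternative sketch walks the table row by row, so an IP, port, and address always come from the same <tr> (the XPath assumes the same page layout as the original expressions, and the inline HTML below is only a stand-in for resp.text):

from lxml import etree

# A tiny stand-in for resp.text, just to make the sketch runnable.
html = '''<div><table>
<tr><td>1.2.3.4</td><td>8080</td><td>Shanghai</td></tr>
<tr><td>5.6.7.8</td><td>3128</td><td>Beijing</td></tr>
</table></div>'''
e = etree.HTML(html)

# Walk the table row by row so that IP, port and address always come from
# the same <tr>; rows without three <td> cells (e.g. header rows) are skipped.
for row in e.xpath('//div[1]/table//tr'):
    cells = row.xpath('./td/text()')
    if len(cells) >= 3:
        ip, port, addr = cells[:3]
        print(ip, port, addr)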

The zip function pairs the three lists element by element, and a for loop traverses the zipped data. In the loop, the write method of the file object f writes each piece of proxy information into the file in the format 'IP地址:{ip}----port端口号:{port}-----地址:{addr}\n'.
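
Keep in mind that zip stops at the shortest of the three lists, so mismatched column lengths silently drop entries. If a more structured output is preferred, the same zipped triples could be written as CSV instead of the dashed text format; a sketch of that variation (the csv module and the file name proxies.csv are additions for illustration):

import csv

ips = ['1.2.3.4', '5.6.7.8']
ports = ['8080', '3128']
addrs = ['Shanghai', 'Beijing']

# Write the zipped (ip, port, address) triples as CSV rows with a header.
with open('proxies.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['ip', 'port', 'address'])
    writer.writerows(zip(ips, ports, addrs))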

Taken together, the code crawls the IP, port, and address information from multiple web pages and saves the results to a file named 'IP代理.txt'.
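
Free proxy lists often contain dead or very slow entries, so before relying on the saved file it can be worth testing each proxy. A sketch of such a check, which is not part of the original script (the test URL httpbin.org/ip and the 5-second timeout are assumptions):

import requests

def proxy_works(ip, port, timeout=5):
    """Return True if the proxy answers a simple HTTP request in time."""
    proxies = {
        'http': f'http://{ip}:{port}',
        'https': f'http://{ip}:{port}',
    }
    try:
        r = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=timeout)
        return r.status_code == 200
    except requests.RequestException:
        return False

print(proxy_works('1.2.3.4', '8080'))  # most free proxies will fail this check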

Full code

import requests
from lxml import etree

# Open the file that will hold the results
with open('IP代理.txt', 'w', encoding='utf-8') as f:
    # Loop over multiple pages
    for i in range(1, 10):
        # Build the full URL for this page
        url = f'http://www.66ip.cn/{i}.html'
        print(f'正在获取{url}')

        # Request headers that disguise the script as a regular browser
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36'
        }

        # Send the GET request
        resp = requests.get(url, headers=headers)

        # The target site uses GBK encoding
        resp.encoding = 'gbk'

        # Parse the HTML
        e = etree.HTML(resp.text)

        # Extract the IP, port and address columns
        ips = e.xpath('//div[1]/table//tr/td[1]/text()')
        ports = e.xpath('//div[1]/table//tr/td[2]/text()')
        addrs = e.xpath('//div[1]/table//tr/td[3]/text()')

        # Write each proxy entry to the file
        for ip, port, addr in zip(ips, ports, addrs):
            f.write(f'IP地址:{ip}----port端口号:{port}-----地址:{addr}\n')

Running result

When the script runs, it prints 正在获取 followed by each page URL, and the collected proxy entries are written line by line to IP代理.txt.

Conclusion

With the Python crawler technique introduced in this article, you can easily obtain proxy IPs and save them to a file. This is useful for data collection, working around anti-crawler measures, and other web-scraping applications that need proxy IPs. I hope this article helps you better understand how crawlers work and proves useful in real projects.

Origin blog.csdn.net/qq_33681891/article/details/132003374