Minimalist Proxy IP Code: Crawling Free Proxy IPs with Python

Two days ago I picked up crawler knowledge that I had not touched for a long time. Relatives and friends were asking for votes on WeChat Moments, so I clicked through and saw that voting needed no login or registration, which meant it would not be complicated. Sometimes I really do get the urge to practice, only to find that I have forgotten most of what I once knew.

A quick analysis showed that voting is just a POST request, and all the information it needs is on the page. The only problem is that the site restricts by IP: each IP can vote only once.

I looked at a highly starred proxy IP pool project on GitHub, but its crawler does not distinguish between HTTP and HTTPS proxies, so the share of proxies that are actually usable drops even further.

I looked over the sites that are commonly crawled for proxy IPs. One of them lists HTTP-type proxies on a dedicated page, so after a little analysis of that page, and relying on my half-forgotten knowledge, I wrote a minimalist crawler. The code is as follows:

import requests
from bs4 import BeautifulSoup

def proxy_list():
    # Listing page for HTTP-type proxies
    url = 'https://www.xicidaili.com/wt'
    # The site answers 503 unless a browser-like User-Agent is sent
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
    r = requests.get(url=url, headers=headers, timeout=10)
    s = BeautifulSoup(r.text, 'lxml')
    # Select the table rows whose class attribute is exactly "odd"
    tr_list = s.select('tr[class="odd"]')
    proxies = []
    for tr in tr_list:
        tds = tr.select('td')
        ip = tds[1].text.strip()    # second column: IP address
        port = tds[2].text.strip()  # third column: port
        proxies.append('http://' + ip + ':' + port)
    return proxies

The site only checks the User-Agent: without one the request gets a 503, and with one it goes through. Of course, not every proxy crawled this way actually works, so a further validation step is needed.
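The validation step mentioned above could look roughly like the sketch below. It is not part of the original code. The names is_alive and filter_alive, the test URL http://httpbin.org/ip, the timeout, and the worker count are all assumptions chosen for illustration. To keep the sketch free of extra dependencies it uses the standard library's urllib instead of requests.

```python
import http.client
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def is_alive(proxy, test_url='http://httpbin.org/ip', timeout=5):
    """Return True if test_url can be fetched through the given HTTP proxy.

    test_url and timeout are assumed values; any stable plain-HTTP page works.
    """
    handler = urllib.request.ProxyHandler({'http': proxy})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, malformed reply, ...
        return False

def filter_alive(proxies, check=is_alive, workers=10):
    """Keep only the proxies for which check(proxy) is true, preserving order."""
    proxies = list(proxies)
    # Dead proxies mostly cost waiting time, so check them in parallel threads
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(check, proxies))
    return [p for p, ok in zip(proxies, results) if ok]
```

Calling filter_alive(proxy_list()) would return the subset that answered within the timeout. With requests, the same check can be written as requests.get(test_url, proxies={'http': proxy}, timeout=timeout) wrapped in a try/except for requests.RequestException.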

Pagination needs little explanation; it is fairly simple and you can add it yourself. I have to say, requests plus BeautifulSoup is a pleasant combination and a must-have for beginners.
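For readers who want a starting point for pagination, here is a minimal sketch. It assumes that page N of the listing lives at the base URL followed by /N, which should be checked against the site's actual pagination links. The names page_url and crawl_pages are hypothetical, and fetch_page stands for any function that takes a URL and returns a list of proxy strings, for example a variant of proxy_list that accepts the URL as a parameter.

```python
import time

BASE_URL = 'https://www.xicidaili.com/wt'  # listing URL used in proxy_list

def page_url(page):
    # Assumption: page N is served at BASE_URL/N
    return '{}/{}'.format(BASE_URL, page)

def crawl_pages(fetch_page, first=1, last=3, delay=1.0):
    """Collect proxies from pages first..last, dropping duplicates.

    fetch_page is a hypothetical callable: URL in, list of proxy strings out.
    """
    seen = set()
    result = []
    for page in range(first, last + 1):
        for proxy in fetch_page(page_url(page)):
            if proxy not in seen:
                seen.add(proxy)
                result.append(proxy)
        if page < last:
            time.sleep(delay)  # pause between pages to go easy on the site
    return result
```

The pause between requests is a design choice rather than a requirement of the site; free proxy lists tend to block clients that fetch pages too quickly.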


Origin www.cnblogs.com/mathbox/p/11089424.html