0x012. Python Learning: HTTP, URL Scanner

Python Internet modules

Some important modules used in Python network programming are listed below:

Protocol   Function                                      Port   Python module(s)
HTTP       Web page access                               80     httplib, urllib, xmlrpclib
NNTP       Reading and posting news articles ("posts")   119    nntplib
FTP        File transfer                                 21     ftplib, urllib
SMTP       Sending email                                 25     smtplib
POP3       Receiving email                               110    poplib
IMAP4      Retrieving email                              143    imaplib
Telnet     Command line                                  23     telnetlib
Gopher     Information lookup                            70     gopherlib, urllib

If you only use Python 3.x, you can skip the Python 2 discussion below; just remember that there is a urllib package.

Python 2.x has these library names available: urllib, urllib2, urllib3, httplib, httplib2, requests

Python 3.x has these library names available: urllib, urllib3, httplib2, requests

Both versions have urllib3 and requests, neither of which is a standard library. urllib3 provides a thread-safe connection pool and file POST support, and has little to do with urllib or urllib2. requests calls itself "HTTP for Humans" and is more concise and convenient to use.
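
As a quick taste of that conciseness, here is roughly what a parameterized GET looks like with requests (the URL is just a placeholder, and requests must be installed separately, e.g. via pip):

import requests

# one call builds the query string, sends the request, and decodes the body
r = requests.get('http://www.example.com/search', params={'q': 'python'})
print(r.status_code)   # HTTP status code, e.g. 200
print(r.text[:300])    # decoded response body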

For Python 2.x:

The main differences between urllib and urllib2:

  1. urllib2 can accept a Request object, which lets you set headers for a URL, modify the user agent, set cookies, and so on; urllib can only accept a plain URL string.
  2. urllib provides some primitive, basic methods that urllib2 lacks, such as urlencode (both points are combined in the sketch below).
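
A minimal sketch (Python 2, against a placeholder URL) combining the two points: urllib supplies urlencode(), while urllib2 accepts a Request object carrying custom headers:

import urllib
import urllib2

params = urllib.urlencode({'q': 'python'})            # urllib has urlencode()
req = urllib2.Request('http://www.example.com/search?%s' % params,
                      headers={'User-Agent': 'Mozilla/5.0'})  # urllib2 takes a Request with headers
f = urllib2.urlopen(req)
print f.read()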

A few examples from the official urllib documentation:

Retrieving a URL with GET and query parameters
>>> import urllib
>>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
>>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)
>>> print f.read()
Using the POST method
>>> import urllib
>>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
>>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query", params)
>>> print f.read()
Using an HTTP proxy (redirects are followed automatically)
>>> import urllib
>>> proxies = {'http': 'http://proxy.example.com:8080/'}
>>> opener = urllib.FancyURLopener(proxies)
>>> f = opener.open("http://www.python.org")
>>> f.read()
Without a proxy
>>> import urllib
>>> opener = urllib.FancyURLopener({})
>>> f = opener.open("http://www.python.org/")
>>> f.read()

Python urllib2: a GET request with a cookie

    url="http://www.baidu.com"
    HEADERS = {"Cookie":cookies}
    request = urllib2.Request(url=url,headers=HEADERS)
    socket = urllib2.urlopen(request)
    return = socket.read()
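
For anything beyond a one-off request, a more robust approach (still Python 2) is to let urllib2 manage cookies automatically via cookielib; a minimal sketch:

import cookielib
import urllib2

jar = cookielib.CookieJar()                                  # stores cookies between requests
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
f = opener.open("http://www.baidu.com")                      # response cookies land in the jar
print f.read()[:200]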

A few examples from the official urllib2 documentation:

GET a URL
>>> import urllib2
>>> f = urllib2.urlopen('http://www.python.org/')
>>> print f.read()

Using basic HTTP authentication
import urllib2
auth_handler = urllib2.HTTPBasicAuthHandler()
auth_handler.add_password(realm='PDQ Application',
                          uri='https://mahler:8092/site-updates.py',
                          user='klem',
                          passwd='kadidd!ehopper')
opener = urllib2.build_opener(auth_handler)
urllib2.install_opener(opener)
urllib2.urlopen('http://www.example.com/login.html')
build_opener() provides many handlers by default, including a proxy handler; the proxy defaults to whatever the environment variables specify.

An example using a proxy
proxy_handler = urllib2.ProxyHandler({'http': 'http://www.example.com:3128/'})
proxy_auth_handler = urllib2.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')

opener = urllib2.build_opener(proxy_handler, proxy_auth_handler)
opener.open('http://www.example.com/login.html')

Adding HTTP request headers
import urllib2
req = urllib2.Request('http://www.example.com/')
req.add_header('Referer', 'http://www.python.org/')
r = urllib2.urlopen(req)

Changing the User-agent
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
opener.open('http://www.example.com/')
httplib and httplib2: httplib implements the HTTP client protocol and is usually not used directly; urllib is built on top of httplib. httplib2 is a third-party library with more features than httplib.

httplib is fairly low-level; for ordinary use, urllib and urllib2 are enough.
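
For completeness, a minimal sketch of what the lower-level httplib interface looks like in Python 2 (example.com is a placeholder host):

import httplib

conn = httplib.HTTPConnection('www.example.com')   # you manage the connection yourself
conn.request('GET', '/')                           # and build each request by hand
resp = conn.getresponse()
print resp.status, resp.reason
print resp.read()[:200]
conn.close()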

For Python 3.x:

Here urllib has become a package, split into several modules:

urllib.request, for opening and reading URLs,
urllib.error, for handling the exceptions raised by urllib.request,
urllib.parse, for parsing URLs,
urllib.robotparser, for parsing robots.txt files.

Python 2.x's urllib.urlopen() was dropped; urllib2.urlopen() in Python 2.x corresponds to urllib.request.urlopen() in Python 3.x.
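
A small sketch tying three of these modules together: urllib.parse builds the query string, urllib.request sends it, and urllib.error distinguishes HTTP errors from network failures (the URL is a placeholder):

import urllib.error
import urllib.parse
import urllib.request

url = 'http://www.example.com/?' + urllib.parse.urlencode({'q': 'python'})
try:
    with urllib.request.urlopen(url) as f:
        print(f.read(200))
except urllib.error.HTTPError as e:    # the server answered with a 4xx/5xx status
    print(e.code, e.reason)
except urllib.error.URLError as e:     # the request never got a response (DNS failure, refused, ...)
    print(e.reason)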

A few official examples:

GET a URL
>>> import urllib.request
>>> with urllib.request.urlopen('http://www.python.org/') as f:
...     print(f.read(300))

PUT a request
import urllib.request
DATA=b'some data'
req = urllib.request.Request(url='http://localhost:8080', data=DATA,method='PUT')
with urllib.request.urlopen(req) as f:
    pass
print(f.status)
print(f.reason)

Basic HTTP authentication
import urllib.request
auth_handler = urllib.request.HTTPBasicAuthHandler()
auth_handler.add_password(realm='PDQ Application',
                          uri='https://mahler:8092/site-updates.py',
                          user='klem',
                          passwd='kadidd!ehopper')
opener = urllib.request.build_opener(auth_handler)
urllib.request.install_opener(opener)
urllib.request.urlopen('http://www.example.com/login.html')

Using a proxy
proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')

opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)
opener.open('http://www.example.com/login.html')

Adding headers
import urllib.request
req = urllib.request.Request('http://www.example.com/')
req.add_header('Referer', 'http://www.python.org/')
r = urllib.request.urlopen(req)

Changing the User-agent
import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
opener.open('http://www.example.com/')

Setting URL parameters with GET
>>> import urllib.request
>>> import urllib.parse
>>> params = urllib.parse.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
>>> url = "http://www.musi-cal.com/cgi-bin/query?%s" % params
>>> with urllib.request.urlopen(url) as f:
...     print(f.read().decode('utf-8'))
...

Setting parameters with POST
>>> import urllib.request
>>> import urllib.parse
>>> data = urllib.parse.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
>>> data = data.encode('ascii')
>>> with urllib.request.urlopen("http://requestb.in/xrbl82xr", data) as f:
...     print(f.read().decode('utf-8'))
...

Specifying a proxy (note that FancyURLopener is deprecated since Python 3.3; the ProxyHandler approach shown above is preferred)
>>> import urllib.request
>>> proxies = {'http': 'http://proxy.example.com:8080/'}
>>> opener = urllib.request.FancyURLopener(proxies)
>>> with opener.open("http://www.python.org") as f:
...     f.read().decode('utf-8')
...
Without a proxy, overriding any proxy set in environment variables
>>> import urllib.request
>>> opener = urllib.request.FancyURLopener({})
>>> with opener.open("http://www.python.org/") as f:
...     f.read().decode('utf-8')
...

httplib from Python 2.x was renamed to http.client in Python 3.

When converting source code with the 2to3 tool, imports of these libraries are rewritten automatically.

In general, use Python 3 and remember that the standard library only has urllib; if you want something more concise and convenient, use requests, though as a third-party library it is not available everywhere.

A simple URL scanner (Python 2) that extracts the real result links from a Baidu search page:

import re
import urllib
import urllib2

def getResult(word, page=0):
    result = []
    # pn= is the result offset; appending &rn=50 would set how many results are returned per page
    url = 'http://www.baidu.com/s?wd=%s&pn=%s' % (urllib.quote(word), page * 10)
    html = urllib2.urlopen(url).read()
    # grab the redirect link out of each result container
    m = re.findall(r'result c-container.*?href="(.*?)"', html, re.S)
    for x in m:
        try:
            res = urllib2.urlopen(x)        # follow Baidu's redirect
            result.append(res.geturl())     # record the final URL
        except Exception as e:
            if hasattr(e, 'url'):           # some errors still carry the final URL
                result.append(e.url)
    return result

print getResult('a', 2)
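
Since the advice above is to prefer Python 3, here is a rough Python 3 port of the same scanner using only the urllib package; Baidu's markup may have changed since this was written, so the regex is illustrative:

import re
import urllib.parse
import urllib.request

def get_result(word, page=0):
    result = []
    url = 'http://www.baidu.com/s?wd=%s&pn=%s' % (urllib.parse.quote(word), page * 10)
    html = urllib.request.urlopen(url).read().decode('utf-8', 'ignore')
    for link in re.findall(r'result c-container.*?href="(.*?)"', html, re.S):
        try:
            res = urllib.request.urlopen(link)   # follow the redirect
            result.append(res.geturl())          # record the final URL
        except Exception as e:
            if hasattr(e, 'url'):                # HTTPError carries the final URL too
                result.append(e.url)
    return result

print(get_result('a', 2))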

