Python web crawler notes (6): GET and POST requests

urllib.urlencode()

urllib and urllib2 are both modules for working with URL requests, but they provide different functionality. The most notable differences are as follows:
  • urllib can only accept a plain URL; it cannot create Request instances with custom headers set, whereas urllib2 can;

  • urllib provides the urlencode() method for generating GET query strings, while urllib2 does not. (This is the main reason why urllib and urllib2 are often used together.)

  • Encoding is done with urllib's urlencode() function, which converts key:value pairs into a "key=value" query string; decoding can be done with urllib's unquote() function. (Note: there is no urllib2.urlencode().)

# Test results in IPython2
In [1]: import urllib

In [2]: word = {"wd" : "传智播客"}

# urllib.urlencode() converts the dict's key-value pairs into URL-encoded form so the web server can accept them.
In [3]: urllib.urlencode(word)  
Out[3]: "wd=%E4%BC%A0%E6%99%BA%E6%92%AD%E5%AE%A2"

# urllib.unquote() converts a URL-encoded string back into the original string.
In [4]: print urllib.unquote("wd=%E4%BC%A0%E6%99%BA%E6%92%AD%E5%AE%A2")
wd=传智播客
Generally, data submitted in an HTTP request needs to be URL-encoded first; it is then either appended to the url (for GET) or passed to the Request object as the data parameter (for POST).
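
As a minimal sketch (www.example.com is just a placeholder here), the same URL-encoded string can be used in both ways:

import urllib
import urllib2

params = urllib.urlencode({"key": "value"})   # -> "key=value"

# 1. GET: append the encoded string to the url as a query string
get_request = urllib2.Request("http://www.example.com/s" + "?" + params)

# 2. POST: pass the encoded string to Request as the data parameter
post_request = urllib2.Request("http://www.example.com/post", data=params)

# either request is then sent with urllib2.urlopen(request)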

GET method

GET requests are generally used to retrieve data from the server. For example, searching Baidu for 传智播客: https://www.baidu.com/s?wd=传智播客

The URL in the browser's address bar changes to:

https://www.baidu.com/s?wd=%E4%BC%A0%E6%99%BA%E6%92%AD%E5%AE%A2

We can see that the request URL, http://www.baidu.com/s? followed by a long encoded string, contains the keyword we want to query, so we can try sending this request with the default GET method.

# urllib2_get.py

import urllib      # handles url encoding
import urllib2

url = "http://www.baidu.com/s"
word = {"wd":"传智播客"}
word = urllib.urlencode(word) # convert to url-encoded format (a string)
newurl = url + "?" + word    # the first separator in a url is ?

headers={ "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"}

request = urllib2.Request(newurl, headers=headers)

response = urllib2.urlopen(request)

print response.read()

Batch-crawling Baidu Tieba page data

First we create a Python file, tiebaSpider.py. What we want to do is take a Baidu Tieba address as input, for example:

First page of the LOL forum on Baidu Tieba: http://tieba.baidu.com/f?kw=lol&ie=utf-8&pn=0

Second page: http://tieba.baidu.com/f?kw=lol&ie=utf-8&pn=50

Third page: http://tieba.baidu.com/f?kw=lol&ie=utf-8&pn=100

Notice the pattern: the only difference between pages is the pn value at the end of the url, and everything else stays the same. Page n corresponds to pn = (n - 1) * 50, so pages 1, 2 and 3 map to pn = 0, 50 and 100. We can exploit this rule.

Let's write a small crawler program to crawl every page of the Baidu Tieba LOL forum.
  • First write a main that prompts the user for the name of the forum to crawl, transcodes it with urllib.urlencode(), and combines it with the base url. Assuming the input is lol, the combined url is: http://tieba.baidu.com/f?kw=lol
# simulate a main function
if __name__ == "__main__":

    kw = raw_input("请输入需要爬取的贴吧:")
    # read the start page and end page, converting str to int
    beginPage = int(raw_input("请输入起始页:"))
    endPage = int(raw_input("请输入终止页:"))

    url = "http://tieba.baidu.com/f?"
    key = urllib.urlencode({"kw" : kw})

    # example of the combined url: http://tieba.baidu.com/f?kw=lol
    url = url + key
    tiebaSpider(url, beginPage, endPage)
  • Next we write the Baidu Tieba crawler interface. It takes 3 parameters: the url combined in main, plus the start and end page numbers, which give the range of pages to crawl.
def tiebaSpider(url, beginPage, endPage):
    """
        作用:负责处理url,分配每个url去发送请求
        url:需要处理的第一个url
        beginPage: 爬虫执行的起始页面
        endPage: 爬虫执行的截止页面
    """


    for page in range(beginPage, endPage + 1):
        pn = (page - 1) * 50

        filename = "第" + str(page) + "页.html"
        # combine into the full url; pn grows by 50 for each page
        fullurl = url + "&pn=" + str(pn)
        #print fullurl

        # call loadPage() to send the request and fetch the HTML page
        html = loadPage(fullurl, filename)
        # write the fetched HTML page to a local disk file
        writeFile(html, filename)
  • We have already written code that crawls a single web page; now we wrap it in a small function, loadPage, so it can be reused.
def loadPage(url, filename):
    '''
        Purpose: send a request to the given url and get the server's response
        url: the url to crawl
        filename: the file name
    '''
    print "正在下载" + filename

    headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}

    request = urllib2.Request(url, headers = headers)
    response = urllib2.urlopen(request)
    return response.read()
  • Finally, to store what is crawled from each page on the local disk, we simply write a file-saving interface.
def writeFile(html, filename):
    """
        作用:保存服务器响应文件到本地磁盘文件里
        html: 服务器响应文件
        filename: 本地磁盘文件名
    """
    print "正在存储" + filename
    with open(filename, 'w') as f:
        f.write(html)
    print "-" * 20

In fact, many websites work this way: the page numbers of the HTML pages correspond to a sequence value in the URL. As long as you find the rule, the pages can be crawled in batches.
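
As a minimal sketch (the helper name and parameter names are hypothetical, not part of the code above), the rule can be generalized into a function that builds the url for each page:

# Hypothetical helper: a site whose pages differ only by one numeric query parameter.
def makePageUrls(baseurl, param, beginPage, endPage, step):
    """Build one url per page, assuming page n maps to param = (n - 1) * step."""
    urls = []
    for page in range(beginPage, endPage + 1):
        urls.append(baseurl + "&" + param + "=" + str((page - 1) * step))
    return urls

# e.g. makePageUrls("http://tieba.baidu.com/f?kw=lol", "pn", 1, 3, 50)
# -> urls ending in &pn=0, &pn=50, &pn=100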


POST method:

We said above that the Request object has a data parameter, which is what POST uses: the data we want to send is passed through this parameter. It starts out as a dictionary of key-value pairs and must be urlencode()d into a string before being passed in.

Take the Youdao dictionary translation site as an example:

Enter some test text and observe the traffic with Fiddler. One of the requests is a POST, and the data it sends to the server is not in the url, so we can try to simulate this POST request.

import urllib
import urllib2

# target URL of the POST request
url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=null"

headers={"User-Agent": "Mozilla...."}

formdata = {
    "type":"AUTO",
    "i":"i love python",
    "doctype":"json",
    "xmlVersion":"1.8",
    "keyfrom":"fanyi.web",
    "ue":"UTF-8",
    "action":"FY_BY_ENTER",
    "typoResult":"true"
}

data = urllib.urlencode(formdata)

request = urllib2.Request(url, data = data, headers = headers)
response = urllib2.urlopen(request)
print response.read()
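
Since the form asks for doctype json, the response body should be a JSON string. A small sketch (the exact fields returned depend on the Youdao interface) of parsing it with the standard json module instead of printing the raw text:

import json

# parse the JSON response body into Python objects
result = json.loads(response.read())
print result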
When sending a POST request, pay special attention to a few of the request headers:

Content-Length: 144: the length of the form data being sent is 144, i.e. the encoded string is 144 characters long.

X-Requested-With: XMLHttpRequest: Represents an Ajax asynchronous request.

Content-Type: application/x-www-form-urlencoded: the encoding a browser uses when submitting a web form; the form data is encoded as name1=value1&name2=value2 key-value pairs.
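
As a small illustration (name1/value1 are just placeholders), urllib.urlencode() produces exactly this name1=value1&name2=value2 format, and the length of that string is what Content-Length reports; when the data parameter is supplied, urllib2 normally fills in Content-Type and Content-Length for you.

import urllib

# a list of tuples keeps the pair order stable
data = urllib.urlencode([("name1", "value1"), ("name2", "value2")])
print data        # name1=value1&name2=value2
print len(data)   # 25 -> this is the value sent as Content-Length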
