Reading notes on "Web Scraping with Python" (《Python网络数据采集》), Chapter 4: APIs

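An API can deliver files over HTTP, the same protocol a browser uses when it fetches data from a website by URL, and it can do almost anything you would otherwise do on the web. Two things make an API an API rather than a website: first, API requests follow a very strict syntax; second, an API represents its data as JSON or XML rather than HTML.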

An API generates data according to a well-defined set of rules, and the data it produces is organized in an equally standard way.

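APIs usually authenticate callers with something like a token, which is sent to the server on every API call. The token is either assigned to the user at registration or issued at call time. It may be a long-lived fixed value, or it may change frequently, in which case the server generates it from a combination of the username and password.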

Besides being passed in the URL, the token can also be sent to the server in a cookie in the request headers.
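The two ways of passing a token can be sketched as follows. This is only an illustration: the endpoint `api.example.com`, the parameter name `token`, and the cookie name are all hypothetical, and a real API documents its own names. No request is actually sent; the code only builds the request objects so the difference is visible.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and token, for illustration only.
BASE_URL = "http://api.example.com/v1/users/1234"
TOKEN = "my-secret-token"

def request_with_url_token(base_url, token):
    """Pass the token as a query parameter in the URL."""
    return Request(base_url + "?" + urlencode({"token": token}))

def request_with_cookie_token(base_url, token):
    """Pass the token in a Cookie request header instead of the URL."""
    return Request(base_url, headers={"Cookie": "token=" + token})

url_req = request_with_url_token(BASE_URL, TOKEN)
cookie_req = request_with_cookie_token(BASE_URL, TOKEN)

print(url_req.full_url)                 # token is visible in the URL
print(cookie_req.full_url)              # URL is unchanged
print(cookie_req.get_header("Cookie"))  # token travels in the header
```

Either request object could then be handed to `urlopen`. The header variant keeps the token out of the URL itself.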

An important feature of APIs is that they respond with well-formatted data. Most responses come back as XML or JSON.
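To make the two formats concrete, here is a small sketch that parses the same record from a JSON body and from an XML body using only the standard library. Both payloads are made up for this example and do not come from any real API.

```python
import json
import xml.etree.ElementTree as ET

# Made-up response bodies describing the same user in both formats.
json_body = '{"user": {"id": 1234, "name": "Alice"}}'
xml_body = "<user><id>1234</id><name>Alice</name></user>"

# JSON maps directly onto Python dicts, lists, strings and numbers.
user_from_json = json.loads(json_body)["user"]

# XML yields an element tree; every value arrives as text and has to be
# converted by hand.
root = ET.fromstring(xml_body)
user_from_xml = {
    "id": int(root.find("id").text),
    "name": root.find("name").text,
}

print(user_from_json)
print(user_from_xml)
```

Both dictionaries end up with the same content; the JSON version simply takes less code to get there.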

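Call syntax varies widely from one API to another, but a few conventions are common. When you fetch data with a GET request, the URL path describes the scope of the data you want, and the query parameters act as filters or additional requests.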
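The convention can be sketched with a small URL builder. The helper `build_get_url`, the host `api.example.com`, and the `from`/`to` filter names are assumptions chosen for illustration, not part of any particular API.

```python
from urllib.parse import urlencode, quote, urlsplit, parse_qs

def build_get_url(base, path_parts, filters=None):
    """Join path segments (the scope of the data) and append the
    query parameters (the filters). Hypothetical helper."""
    path = "/".join(quote(str(part), safe="") for part in path_parts)
    url = base.rstrip("/") + "/" + path
    if filters:
        url += "?" + urlencode(filters)
    return url

# Scope: the posts of user 1234. Filters: a date range.
url = build_get_url(
    "http://api.example.com",
    ["users", 1234, "posts"],
    {"from": "08012014", "to": "08312014"},
)
print(url)

# The same URL can be taken apart again on the receiving side.
parts = urlsplit(url)
print(parts.path)
print(parse_qs(parts.query))
```

The path alone identifies which resource is wanted; dropping the query string would still give a valid request, just an unfiltered one.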

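The following script collects the IP addresses of anonymous editors from Wikipedia revision history pages and looks up the country each one comes from: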

from urllib.request import urlopen
from urllib.error import URLError
from bs4 import BeautifulSoup
import json
import random
import re

# Seed from system entropy. Seeding with a datetime object, as the
# original did, raises TypeError on Python 3.11 and later.
random.seed()

def getLinks(articleUrl):
    # Return every internal article link (no namespace colon) in the page body
    html = urlopen("http://en.wikipedia.org"+articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    return bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))

def getHistoryIPs(pageUrl):
    # Format of history pages is: http://en.wikipedia.org/w/index.php?title=Title_in_URL&action=history
    pageUrl = pageUrl.replace("/wiki/", "")
    historyUrl = "http://en.wikipedia.org/w/index.php?title="+pageUrl+"&action=history"
    print("history url is: "+historyUrl)
    html = urlopen(historyUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    # Finds only the links with class "mw-anonuserlink", which hold IP addresses instead of usernames
    ipAddresses = bsObj.findAll("a", {"class":"mw-anonuserlink"})
    addressList = set()
    for ipAddress in ipAddresses:
        addressList.add(ipAddress.get_text())
    return addressList

def getCountry(ipAddress):
    # Note: freegeoip.net may no longer be available; any geolocation service
    # that returns JSON with a "country_code" field could take its place.
    try:
        response = urlopen("http://freegeoip.net/json/"+ipAddress).read().decode('utf-8')
    except URLError:  # also covers HTTPError, which is a subclass
        return None
    responseJson = json.loads(response)
    return responseJson.get("country_code")

links = getLinks("/wiki/Python_(programming_language)")

while len(links) > 0:
    for link in links:
        print("-------------------")
        historyIPs = getHistoryIPs(link.attrs["href"])
        for historyIP in historyIPs:
            country = getCountry(historyIP)
            if country is not None:
                print(historyIP+" is from "+country)

    # Jump to a random linked article and repeat
    newLink = links[random.randint(0, len(links)-1)].attrs["href"]
    links = getLinks(newLink)

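getLinks returns the list of article links found on a page. getHistoryIPs opens that article's history page, searches it for all links with the class mw-anonuserlink, and returns the IP addresses they contain as a set. getCountry then sends each IP address to a geolocation API and reads the country code from the JSON response.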
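The regular expression used in getLinks is the least obvious part of the script, so here is a short offline sketch of what it accepts and rejects. The sample hrefs are invented for illustration; only the pattern itself is taken from the code above.

```python
import re

# Same pattern as in getLinks: the href must start with /wiki/ and must
# not contain a colon anywhere after that prefix.
ARTICLE_LINK = re.compile(r"^(/wiki/)((?!:).)*$")

# Invented sample hrefs of the kinds found on a Wikipedia page.
sample_hrefs = [
    "/wiki/Guido_van_Rossum",                    # ordinary article
    "/wiki/Category:Programming_languages",      # namespace page (colon)
    "/wiki/File:Python_logo.svg",                # file page (colon)
    "//en.wikipedia.org/wiki/Python",            # does not start with /wiki/
    "/w/index.php?title=Python&action=history",  # history page
]

article_links = [href for href in sample_hrefs if ARTICLE_LINK.search(href)]
print(article_links)
```

Only the ordinary article link survives the filter, which is why the crawler hops from article to article instead of wandering into category, file, or history pages.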
