Python web crawling with requests and bs4: scraping flight attendant pictures

As in several previous articles (A Detailed Look at the Python Crawler Framework Scrapy, and Crawling a Large Number of Qiushibaike Jokes with Scrapy), I used the Scrapy framework to crawl Qiushibaike jokes into MongoDB.

Scrapy is a good framework and provides many extension points; you can write your own middleware to process Scrapy's Request and Response objects. But when it comes to customization and control, a crawler you write yourself is still more flexible.

If you want to write a simpler, more controllable crawler, I recommend two third-party Python libraries: requests and bs4.

### Crawling flight attendant pictures from kongjie.com with requests and bs4

#### requests

requests is a very popular third-party Python library for handling network requests. Compared with the built-in urllib and urllib2 modules, requests offers a more concise and richer API. Some requests examples:

import requests
import json

# HTTP request types
# GET
r = requests.get('https://github.com/timeline.json')
# POST
r = requests.post("http://m.ctrip.com/post")
# PUT
r = requests.put("http://m.ctrip.com/put")
# DELETE
r = requests.delete("http://m.ctrip.com/delete")
# HEAD
r = requests.head("http://m.ctrip.com/head")
# OPTIONS
r = requests.options("http://m.ctrip.com/get")

# Reading the response body
print r.content #raw bytes; non-ASCII text shows up as byte escapes
print r.text #decoded text

# Passing URL parameters
payload = {'keyword': '日本', 'salecityid': '2'}
r = requests.get("http://m.ctrip.com/webapp/tourvisa/visa_list", params=payload)
print r.url #e.g. http://m.ctrip.com/webapp/tourvisa/visa_list?salecityid=2&keyword=日本

# Reading/changing the response encoding
r = requests.get('https://github.com/timeline.json')
print r.encoding
r.encoding = 'utf-8'

# JSON handling
r = requests.get('https://github.com/timeline.json')
print r.json() #built-in JSON decoding of the response body

# Custom request headers
url = 'http://m.ctrip.com'
headers = {'User-Agent' : 'Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 4 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19'}
r = requests.post(url, headers=headers)
print r.request.headers

# More complex POST requests
url = 'http://m.ctrip.com'
payload = {'some': 'data'}
r = requests.post(url, data=json.dumps(payload)) #to send a JSON string instead of form data, serialize the dict with json.dumps first

# POSTing a multipart-encoded file
url = 'http://m.ctrip.com'
files = {'file': open('report.xls', 'rb')}
r = requests.post(url, files=files)

# Response status code
r = requests.get('http://m.ctrip.com')
print r.status_code

# Response headers
r = requests.get('http://m.ctrip.com')
print r.headers
print r.headers['Content-Type']
print r.headers.get('content-type') #two ways to access part of the response headers

# Cookies
url = 'http://example.com/some/cookie/setting/url'
r = requests.get(url)
r.cookies['example_cookie_name']    #read cookies

url = 'http://m.ctrip.com/cookies'
cookies = dict(cookies_are='working')
r = requests.get(url, cookies=cookies) #send cookies

# Setting a timeout
r = requests.get('http://m.ctrip.com', timeout=0.001)

# Using a proxy
proxies = {
    "http": "http://10.10.10.10:8888",
    "https": "http://10.10.10.100:4444",
}
r = requests.get('http://m.ctrip.com', proxies=proxies)

With requests we can easily send GET, POST, PUT, DELETE and other requests and read the response data.

#### bs4

bs4 refers to BeautifulSoup 4.x. Compared with BeautifulSoup 2.x and 3.x, 4.x offers a richer and friendlier API. With BeautifulSoup we can easily navigate to the HTML elements we want and read their values. For example:

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u'The Dormouse's story'

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# u'title'

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
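
For reference, the soup object used in the snippet above can be built from an HTML string like this (a minimal sketch using the "three sisters" example document from the BeautifulSoup documentation):

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')  # or 'lxml' if it is installed
print soup.title.string  # u'The Dormouse's story'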

### Examining the page structure of kongjie.com

First, let's look at the page structure of kongjie.com and find each user's album page. Browsing around kongjie.com a bit, you will find the popular albums page, as shown:

Then analyze the structure of this page in order to extract the link to each user's album, as shown in the figure:

Page structure

Inside the div whose class attribute is ptw, each li in the ul is one user's album cover. By extracting the link in each li, we can enter that user's album.

### Start crawling

#### Extracting album links

The CSS selector for extracting each user's album link from this page is div.ptw li.d. We can then use this selector directly in BeautifulSoup.

For example,

import requests
from bs4 import BeautifulSoup

# headers and start_url are module-level globals: headers is copied from the
# browser's Request Headers, start_url is the popular-albums page of kongjie.com

def parse_album_url(url):
    """
    Parse out the album URLs, then enter each album and crawl its images.
    """
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    people_list = soup.select('div.ptw li.d')
    for people in people_list:
        save_images_in_album(people.div.a['href'])

    # Crawl the next page
    next_page = soup.select_one('a.nxt')
    if next_page:
        parse_album_url(next_page['href'])

if __name__ == '__main__':
    parse_album_url(start_url)

Now that we have each user's album link, the next step is to write the save_images_in_album() method to go into each album and grab its images. One more note: after extracting all the album links on a page, we also parse the page's "next page" link so that the crawler can page through automatically. The HTML structure of the "next page" link is as follows:

Next Page

The CSS selector a.nxt is enough to extract this next-page link.

#### Entering an album and extracting images

Before writing the crawler, we should grab the Request Headers that the browser sends when visiting this page; with them we can get past some simple anti-crawling measures.
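
The headers global used by the crawler is just a dict built from those Request Headers. A minimal sketch; the values below are placeholders, and in practice you would copy whatever your own browser actually sends:

# Assumed module-level headers, copied from the browser's developer tools
# (Network tab -> Request Headers); the values here are placeholders
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Referer': 'http://www.kongjie.com/',
}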

Then go inside an album and look at its page structure, as shown:

Picture Album List

We can see that inside the div with id photo_pic and class c, the img tag in the first hyperlink is the full-size picture, so that is the link we extract. We use soup.find('div', id='photo_pic', class_='c') to locate that div, and then image_div.a.img['src'] gives us the image link.

After getting one picture's link, we need to move on to the next picture. Below the full-size picture, the right-pointing arrow at the far end is the "next picture" button. We grab that button's link with the CSS selector div.pns.mlnv.vm.mtm.cl a.btn[title="下一张"], and then repeat the two steps above to crawl every photo in the album.

#### Deduplicating images

At this point, how do we know that all the images in an album have been crawled?

We use redis to store the id of every crawled picture. If a picture's id already exists in redis, we skip it, which also makes it easy to tell whether an album has been finished: if every picture id in the album is already in redis, the album has been fully crawled.
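
The code below assumes a module-level redis connection, redis_con. A minimal sketch, assuming a redis server running locally on the default port:

import redis

# Assumed setup: a local redis server; crawled picture ids are stored in a hash
redis_con = redis.StrictRedis(host='localhost', port=6379, db=0)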

So we write the save_images_in_album() method as follows:

def save_images_in_album(album_url):
    """
    Enter a kongjie.com user's album and save its pictures one by one.
    """
    # Parse out uid and picid, which are used to name the saved image
    uid_picid_match = uid_picid_pattern.search(album_url)
    if not uid_picid_match:
        return
    else:
        uid = uid_picid_match.group(1)
        picid = uid_picid_match.group(2)

    response = requests.get(album_url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    image_div = soup.find('div', id='photo_pic', class_='c')
    # Only download and record the picture if this uid:picid has not been seen yet
    if image_div and not redis_con.hexists('kongjie', uid + ':' + picid):
        image_src = domain_name + image_div.a.img['src']
        save_img(image_src, uid, picid)
        redis_con.hset('kongjie', uid + ':' + picid, '1')

    # Find the "next picture" button and follow it
    next_image = soup.select_one('div.pns.mlnv.vm.mtm.cl a.btn[title="下一张"]')
    if not next_image:
        return
    # Parse the next picture's picid; only crawl it if it has not been crawled before
    next_image_url = next_image['href']
    next_uid_picid_match = uid_picid_pattern.search(next_image_url)
    if not next_uid_picid_match:
        return
    next_uid = next_uid_picid_match.group(1)
    next_picid = next_uid_picid_match.group(2)
    if not redis_con.hexists('kongjie', next_uid + ':' + next_picid):
        save_images_in_album(next_image_url)

Here, the user id and the id of each picture are parsed out of the album URL with a regular expression, uid_picid_pattern = re.compile(r'.*?uid=(\d+).*?picid=(\d+).*?'); those two ids are then used as the key for deduplication in redis.
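
To make the deduplication concrete, here is a small sketch of how such a pattern behaves; the sample album URL below is made up purely for illustration and only needs to carry uid and picid as query parameters:

import re

# Capture the uid and picid query parameters from an album URL
uid_picid_pattern = re.compile(r'.*?uid=(\d+).*?picid=(\d+).*?')

# Hypothetical album URL, only for demonstrating the capture groups
sample_url = 'http://www.kongjie.com/home.php?mod=space&uid=123456&do=album&picid=7890'
match = uid_picid_pattern.search(sample_url)
if match:
    print match.group(1), match.group(2)  # prints: 123456 7890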

#### Downloading the images

In the function above we obtained the link to each full-size picture, the image_src variable. Next we write the save_img() method to save the pictures. For example:

# os is imported and save_folder is defined at module level
def save_img(image_url, uid, picid):
    """
    Save the image into the global save_folder directory, named "uid_picid.ext",
    where uid is the user id, picid is the kongjie.com picture id and ext is the
    image's file extension.
    """
    try:
        response = requests.get(image_url, stream=True)
        # Work out the file extension from the URL
        file_name_prefix, file_name_ext = os.path.splitext(image_url)
        save_path = os.path.join(save_folder, uid + '_' + picid + file_name_ext)
        with open(save_path, 'wb') as fw:
            fw.write(response.content)
        print uid + '_' + picid + file_name_ext, 'image saved!', image_url
    except IOError as e:
        print 'save error!', e, image_url

### Results

Finally, run python kongjiewang.py from the command line and look at the results:

operation result

We're done!

If you're interested, you can check out:

github address
If you like this, you can also follow my WeChat official account.


## Reference

  1. My own Toutiao article: Crawling a Large Number of Qiushibaike Jokes with the Python Crawler Framework Scrapy


Origin blog.csdn.net/c315838651/article/details/72773602