Python crawler (1): scraping Beijing short-term rental listings

Required libraries

requests library

The requests module is a highly encapsulated package built on top of Python's built-in networking modules; it makes issuing network requests from Python much simpler and more user-friendly.
For more, see the detailed documentation of the requests library.
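
As a minimal illustration (the URL here is a placeholder, not part of the original script), a GET request with a custom header looks like this:

import requests

headers = {'user-agent': 'Mozilla/5.0'}  # pretend to be a browser
resp = requests.get('http://example.com', headers=headers)
print(resp.status_code)  # 200 on success
print(resp.text[:100])   # first 100 characters of the returned HTML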

BeautifulSoup library

Beautiful Soup is a Python library whose most important feature is extracting data from crawled pages.
It is a toolbox: it parses the document and hands the user the data they need, so only a little code is required instead of a complete parsing application. Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8.
For more, see the detailed documentation of the BeautifulSoup library.
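
A minimal sketch of the parsing workflow used throughout this post (the HTML snippet is made up for illustration):

from bs4 import BeautifulSoup

html = '<ul id="page_list"><li><a href="/room/1">Room 1</a></li></ul>'
soup = BeautifulSoup(html, 'lxml')  # parse with the fast, fault-tolerant lxml parser
for link in soup.select('#page_list > li > a'):  # CSS selectors, as in the crawler below
    print(link.get('href'), link.get_text())  # -> /room/1 Room 1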

Prerequisite knowledge

1. Request headers: headers
2. Python's os library
3. Python's time library

from bs4 import BeautifulSoup  # used to parse HTML tags
import requests  # library for requesting web pages and fetching data
import time  # Python's standard library for handling time
import os

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/79.0.3945.130 Safari/537.36'}
x = 0  # global counter used to name the downloaded images (declared global inside get_info, where it is assigned)
jpg_dir = "jpg_cxk/"
if not os.path.exists(jpg_dir):  # create the folder if it does not exist yet
    os.mkdir(jpg_dir)

def judgment_sex(class_name):  # determine the host's gender from the avatar's CSS class
    if class_name == ['member_ico']:
        return '男'  # male
    else:
        return '女'  # female
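
# Quick sanity check of the helper. The site marks male hosts with the
# 'member_ico' class; the exact class used for female hosts is an assumption
# here, since the function treats any other class list as female:
# judgment_sex(['member_ico'])   -> '男' (male)
# judgment_sex(['member_ico1'])  -> '女' (female)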


def get_link(url):
    web_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(web_data.text, 'lxml')
    links = soup.select('#page_list > ul > li > a')  # links is the list of detail-page URLs: be clear about what you are scraping
    # print(links)
    for link in links:
        href = link.get("href")  # read the href attribute of each <a> tag
        get_info(href)  # call get_info() on every URL in turn
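
# get_link() assumes every request succeeds. A more defensive variant (a sketch,
# not part of the original script) checks the HTTP status and sets a timeout:
def get_link_safe(url):
    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code != 200:  # skip listing pages that failed to load
        print('request failed: %s -> %d' % (url, resp.status_code))
        return
    soup = BeautifulSoup(resp.text, 'lxml')
    for link in soup.select('#page_list > ul > li > a'):
        get_info(link.get('href'))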


# Information to scrape: title, address, price, host name, host gender, and host avatar
def get_info(url):
    global x  # x is assigned in this function, so it must be declared global here
    wb_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(wb_data.text, 'lxml')  # BeautifulSoup easily parses pages fetched with requests; 'lxml' is fast and fault-tolerant
    titles = soup.select('div.pho_info > h4')  # CSS selector for the listing title
    # print(titles)
    addresses = soup.select('span.pr5')  # address
    # print(addresses)
    prices = soup.select('#pricePart > div.day_l > span')  # price: the selector must match the page exactly
    imgs = soup.select('#floatRightBox > div.js_box.clearfix > div.member_pic > a > img')  # host avatar image
    names = soup.select('#floatRightBox > div.js_box.clearfix > div.w_240 > h6 > a')  # host name
    # print(prices)  # confirms the prices were extracted
    sexs = soup.select('#floatRightBox > div.js_box.clearfix > div.member_pic > div')  # host gender marker
    for img in imgs:
        try:
            cxk = img.get('src')  # avatar image URL
            print('downloading: %s' % cxk)
            ir = requests.get(cxk)
            with open(jpg_dir + '%d.jpeg' % x, 'wb') as f:  # open the file and write the image bytes
                f.write(ir.content)
            x += 1
        except requests.RequestException:  # skip any image that fails to download
            continue
    for title, address, price, img, name, sex in zip(titles, addresses, prices, imgs, names, sexs):
        data = {
            'title': title.get_text().strip(),  # get_text() extracts the tag's text content
            'address': address.get_text().strip(),
            'price': price.get_text(),
            'img': img.get("src"),
            'name': name.get_text(),
            'sex': judgment_sex(sex.get("class"))
        }
        print(data)


if __name__ == '__main__':
    urls = ['http://bj.xiaozhu.com/search-duanzufang-p{}-0/'.format(number) for
            number in range(1, 14)]  # listing pages 1 through 13
    for single_url in urls:
        get_link(single_url)  # process each listing page in turn
        time.sleep(5)  # pause 5 seconds between pages to avoid hammering the server
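
The script only prints each data dictionary. A natural next step is to persist the records, for example with Python's built-in csv module; the sketch below assumes you first collect the dictionaries into a list, and the filename is arbitrary:

import csv

def save_records(records, path='xiaozhu.csv'):
    # records: a list of the dicts produced by get_info()
    fields = ['title', 'address', 'price', 'img', 'name', 'sex']
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()  # column header row
        writer.writerows(records)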
