Scraping Anjuke Broker Information with Python

Environment: Python 2.7.15
Today we'll scrape the broker (经纪人) listings on Anjuke. This time we won't use regular expressions; we'll use BeautifulSoup instead. If you're new to it, skim the documentation first so the code below is easier to follow: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

1. Fetching the Page Source

for page in range(1, 8):
    url = "https://beijing.anjuke.com/tycoon/p" + str(page) + "/"   # broker list pages 1-7
    response = urllib2.urlopen(url)
    content = response.read()   # raw HTML of one list page

The same old urllib2 routine (the import is shown in the complete code in section 4).
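
One caveat: Anjuke may reject requests that carry urllib2's default User-Agent. A minimal sketch of sending a browser-like header instead (the header value is only an example, not part of the original post):

req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})   # hypothetical UA string
content = urllib2.urlopen(req, timeout=10).read()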

2. Using bs4

First, inspect the page source and locate the tags that hold the broker information, then parse the page with BeautifulSoup. The html.parser argument selects the parser, here the one bundled with Python's standard library (bs4 also accepts alternatives such as lxml).

    soup = BeautifulSoup(content,'html.parser')
    a = soup.find_all('h3')
    b = soup.find_all(class_=re.compile("brokercard-sd-cont clearfix"))
    c = soup.find_all("p", attrs={"class": "jjr-desc"})
    d = soup.find_all("p", attrs={"class": "jjr-desc xq_tag"})
    e = soup.find_all(class_=re.compile("broker-tags clearfix"))

a, b, c, d, e hold each broker's name, rating, store, familiar areas, and specialties, respectively; each is a list. Note that the jjr-desc query in c also matches the jjr-desc xq_tag paragraphs, because bs4 matches each class of a multi-valued class attribute individually; that is why the loop below indexes c with 2*n. Now loop over the lists and output each record:

    n = 0
    for jjr in a:
        o = jjr.get_text(strip=True).encode('utf-8')      # name
        p = b[n].get_text(strip=True).encode('utf-8')     # rating
        q = c[2*n].get_text(strip=True).encode('utf-8')   # store: even indices are the plain jjr-desc entries
        r = d[n].get_text(strip=True).encode('utf-8')     # familiar areas
        s = e[n].get_text(strip=True).encode('utf-8')     # specialties
        n += 1
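
For reference, the same pairing can be written without the manual counter by slicing c to drop the interleaved xq_tag entries (a sketch, equivalent to the indexing above):

    for name, rating, store, area, biz in zip(a, b, c[::2], d, e):
        print name.get_text(strip=True).encode('utf-8')   # likewise for rating, store, area, biz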

Watch out for encoding here. BeautifulSoup parses the document into Unicode, so printing it directly on a non-UTF-8 console can produce garbled output, and in Python 2 a Unicode string can't be written to a file or database as-is; append encode('utf-8') to re-encode the text.
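
A quick illustration of the distinction in Python 2 (types noted in the comments):

text = a[0].get_text(strip=True)   # type: unicode
data = text.encode('utf-8')        # type: str (UTF-8 bytes), safe to print, write, or insert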

3. Writing to the Database

        insert_agent = ("INSERT INTO AGENT(姓名,评价,门店,熟悉,业务) "
                        "VALUES(%s,%s,%s,%s,%s)")
        data_agent = (o, p, q, r, s)
        cursor.execute(insert_agent, data_agent)

Remember to open the database connection and create the destination table first; the complete code below shows both.
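
If you would rather batch the writes, MySQLdb's cursor.executemany can insert a whole page of rows in one call. A sketch, assuming the tuples are first collected into a list named rows (a helper name introduced here, not in the original code):

rows = []                                  # one (o, p, q, r, s) tuple per broker
# inside the parsing loop: rows.append((o, p, q, r, s))
cursor.executemany(insert_agent, rows)     # after the loop, instead of per-row execute()
conn.commit()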

4. Complete Code

# coding=utf-8
from bs4 import BeautifulSoup
import urllib2
import re
import MySQLdb

# Connect to the local MySQL server; charset='utf8' so Chinese text round-trips.
conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="199855pz", db="pz", charset='utf8')
print 'Connected successfully'
cursor = conn.cursor()
cursor.execute("DROP TABLE IF EXISTS AGENT")
sql = '''CREATE TABLE AGENT(姓名 char(4), 评价 char(50), 门店 char(50), 熟悉 char(50), 业务 char(50))'''
cursor.execute(sql)

for page in range(1, 8):
    url = "https://beijing.anjuke.com/tycoon/p" + str(page) + "/"
    response = urllib2.urlopen(url)
    content = response.read()
    soup = BeautifulSoup(content, 'html.parser')
    a = soup.find_all('h3')                                               # names
    b = soup.find_all(class_=re.compile("brokercard-sd-cont clearfix"))   # ratings
    c = soup.find_all("p", attrs={"class": "jjr-desc"})                   # stores (also matches the xq_tag paragraphs)
    d = soup.find_all("p", attrs={"class": "jjr-desc xq_tag"})            # familiar areas
    e = soup.find_all(class_=re.compile("broker-tags clearfix"))          # specialties

    n = 0
    for jjr in a:
        o = jjr.get_text(strip=True).encode('utf-8')
        p = b[n].get_text(strip=True).encode('utf-8')
        q = c[2*n].get_text(strip=True).encode('utf-8')   # even indices: the plain jjr-desc entries
        r = d[n].get_text(strip=True).encode('utf-8')
        s = e[n].get_text(strip=True).encode('utf-8')
        n += 1
        insert_agent = ("INSERT INTO AGENT(姓名,评价,门店,熟悉,业务) "
                        "VALUES(%s,%s,%s,%s,%s)")
        data_agent = (o, p, q, r, s)
        cursor.execute(insert_agent, data_agent)
conn.commit()
cursor.close()
conn.close()

PS: Anjuke has since updated its site, so the page source differs a little, but the same approach to extracting the information still works.

Reposted from blog.csdn.net/memoirs_pz/article/details/83718484