Crawler Exercise -- Lianjia

Note: please do not crawl too much data; this exercise is for learning purposes only.

Analysis:

  1. Analyze the business requirement (here: rental housing listings)
  2. Find the relevant pages (Lianjia is used as the example)
  3. Analyze the URLs, identify the content we need, and establish the connection
  4. Locate the data on the page
  5. Store the data

First, open the Lianjia homepage, click the rentals link, and press F12 to inspect the page for the information we need, as shown in the screenshot:

URL of page 1: https://bj.lianjia.com/zufang/

URL of page 2: https://bj.lianjia.com/zufang/pg2/

Then locate the fields we need, as shown in the screenshot below.
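Judging from these two URLs, pagination follows a simple pattern: page 1 is served at the bare `/zufang/` path, and later pages append a `pgN/` suffix. A short sketch to generate the page URLs:

```python
# Generate the paginated listing URLs from the observed pattern.
base = 'https://bj.lianjia.com/zufang/'
urls = [base if page == 1 else '{}pg{}/'.format(base, page) for page in range(1, 4)]
print(urls)
```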

Now we turn the analysis into code: fetch the pages, then locate the data we want.

Key code:

```python
# Build the URL for each page number
url = 'https://bj.lianjia.com/zufang/pg{}'.format(page)
```

```python
# Use XPath to locate the data
...
html_pipei = html_ele.xpath('//ul[@id="house-lst"]/li')

for pipei_one in html_pipei:
    title = pipei_one.xpath('./div[2]/h2/a')[0].text
    region = pipei_one.xpath('./div[2]/div[1]/div[1]/a/span')[0].text
    ...
```

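To see how these relative XPath expressions walk the tree, here is a hand-written fragment mimicking the listing markup (an assumption for illustration; the real page has more nesting and attributes):

```python
from lxml import etree

# Hypothetical fragment resembling the listing structure the XPath targets.
sample = '''
<ul id="house-lst">
  <li>
    <div class="pic-panel"></div>
    <div class="info-panel">
      <h2><a>Sample listing title</a></h2>
      <div class="col-1">
        <div class="where"><a><span>Sample Community</span></a></div>
      </div>
      <div class="col-3">
        <div class="price"><span>5000</span></div>
      </div>
    </div>
  </li>
</ul>
'''

html_ele = etree.HTML(sample)
for li in html_ele.xpath('//ul[@id="house-lst"]/li'):
    title = li.xpath('./div[2]/h2/a')[0].text    # second <div> is the info panel
    region = li.xpath('./div[2]/div[1]/div[1]/a/span')[0].text
    price = li.xpath('.//div[@class="price"]/span')[0].text
    print(title, region, price)
```

Each `<li>` is one listing; `./div[2]` steps into its second child `<div>` (the info panel), and the rest of the path descends by position, while the price uses a class-based search with `.//`.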
The full code:

```python
import requests
from lxml import etree
import pymysql


class Mysql(object):
    '''Wrapper class for database operations'''
    def __init__(self):
        '''Connect to the database and create a cursor'''
        self.db = pymysql.connect(host="localhost", user="root", password="8888", database="test")
        self.cursor = self.db.cursor()

    def mysql_op(self, sql, data):
        '''Execute a MySQL statement'''
        self.cursor.execute(sql, data)
        self.db.commit()

    def __del__(self):
        '''Close the cursor and the connection'''
        self.cursor.close()
        self.db.close()


# Database helper
Insert = Mysql()
# The SQL statement to execute
sql = '''INSERT INTO lianjia (title, region, zone, meters, location, price) VALUES(%s, %s, %s, %s, %s, %s)'''

# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}


def download_msg():
    for page in range(1, 2):
        url = 'https://bj.lianjia.com/zufang/pg{}'.format(page)
        responses = requests.get(url, headers=headers)
        html = responses.text
        # Parse the page with lxml so we can use XPath
        html_ele = etree.HTML(html)

        html_pipei = html_ele.xpath('//ul[@id="house-lst"]/li')
        # print(html_pipei)
        for pipei_one in html_pipei:
            # ./li/div[2]/a
            title = pipei_one.xpath('./div[2]/h2/a')[0].text
            # print(title)
            region = pipei_one.xpath('./div[2]/div[1]/div[1]/a/span')[0].text
            # print(region)
            zone = pipei_one.xpath('./div[2]/div[1]/div[1]/span[1]/span')[0].text
            # print(zone)
            meters = pipei_one.xpath('./div[2]/div[1]/div[1]/span[2]')[0].text
            # print(meters)
            location = pipei_one.xpath('./div[2]/div[1]/div[1]/span[3]')[0].text
            # print(location)
            price = pipei_one.xpath('.//div[@class="price"]/span')[0].text
            # print(price)
            data = (title, region, zone, meters, location, price)
            Insert.mysql_op(sql, data)


if __name__ == '__main__':
    download_msg()
```
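The INSERT statement assumes a `lianjia` table already exists in the `test` database. The original post does not give its schema, so the DDL below is a hypothetical sketch: only the column names come from the code, and the types are assumptions. Run it once (e.g. via `cursor.execute(CREATE_SQL)`) before the first insert.

```python
# Hypothetical DDL for the `lianjia` table the crawler inserts into;
# column names come from the INSERT statement, types are assumptions.
CREATE_SQL = '''
CREATE TABLE IF NOT EXISTS lianjia (
    title    VARCHAR(255),
    region   VARCHAR(64),
    zone     VARCHAR(32),
    meters   VARCHAR(32),
    location VARCHAR(64),
    price    VARCHAR(32)
)
'''

print(CREATE_SQL)
```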


Reposted from blog.csdn.net/Lujuntong/article/details/82142466