Simple Python Crawler 6 - 代码天地


Category: Database | Published: 2019-01-31 17:00:49

Scraping all Harbin bus route information with XPath

Each bus route is stored in a MongoDB database as its own collection, named after the route.
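To make the storage layout concrete, here is a minimal sketch of the documents that end up in one route's collection. The helper name `build_stop_documents` and the sample stop names are hypothetical and used only for illustration; the field names '序号' (sequence number) and '站名' (stop name) are the ones the script uses.

```python
def build_stop_documents(stop_names):
    """Turn an ordered list of stop names into one document per stop.

    '序号' holds the stop's position on the route (starting at 0),
    '站名' holds the stop name.
    """
    return [{'序号': index, '站名': stop} for index, stop in enumerate(stop_names)]


if __name__ == '__main__':
    # Hypothetical stop names, not taken from the real site.
    docs = build_stop_documents(['Stop A', 'Stop B', 'Stop C'])
    for doc in docs:
        print(doc)
    # With a running MongoDB and pymongo installed, a list like this could be
    # written to the route's collection in one call, for example:
    #     db[str(route_name)].insert_many(docs)
```

Building the whole list first means each route needs only one write to the database instead of one write per stop.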

from lxml import etree
import requests
import pymongo as py

# Browser-like User-Agent sent with every request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

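def get_stop_name(name, url, headers, path=None):
    """Fetch one route page and store its stops in the BUS database.

    Each route gets its own collection, named after the route.
    `path` is unused; it is kept so that four-argument calls still work.
    """
    myclient = py.MongoClient('localhost', 27017)
    mydb = myclient.BUS
    mycollection = mydb[str(name)]

    # headers must be passed by keyword: the second positional
    # argument of requests.get is params, not headers
    wb_data = requests.get(url, headers=headers).text
    html = etree.HTML(wb_data)
    datas = html.xpath('/html/body/div[5]/div[2]/div[3]/ul[1]/li/a/text()')

    # One document per stop: '序号' is the stop's position on the route,
    # '站名' is the stop name
    infos = [{'序号': cnt, '站名': data} for cnt, data in enumerate(datas)]

    # Collection.insert was removed in PyMongo 4; insert_many replaces it
    # and rejects an empty list, hence the check
    if infos:
        mycollection.insert_many(infos)

    myclient.close()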


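def get_bus_num(url, headers):
    """Fetch the route index page and store the stops of every listed route."""
    # headers must be passed by keyword, otherwise requests treats it as params
    bus_data = requests.get(url, headers=headers).text
    html = etree.HTML(bus_data)

    names = html.xpath('/html/body/div[5]/div[2]/div[1]/div[2]/ul/li/a/text()')
    urls = html.xpath('/html/body/div[5]/div[2]/div[1]/div[2]/ul/li/a/@href')

    # Left over from the earlier text-file version: get_stop_name accepts it
    # as its fourth argument but no longer writes to it
    path = 'stop.txt'
    for name, route_url in zip(names, urls):
        get_stop_name(name, route_url, headers, path)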

if __name__ == '__main__':

    url = 'http://haerbin.gongjiao.com/lines_all.html'

    get_bus_num(url, headers)
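The script reads each route's link from an `@href` attribute and requests it directly. The source does not say whether those links are absolute or site-relative, so the following is only a sketch under the assumption that either form may occur. The helper name `pair_routes` and the sample names and hrefs are hypothetical; `urljoin` is from the standard library and leaves absolute URLs unchanged.

```python
from urllib.parse import urljoin

INDEX_URL = 'http://haerbin.gongjiao.com/lines_all.html'


def pair_routes(index_url, names, hrefs):
    """Pair each route name with an absolute URL.

    zip stops at the shorter list, so a missing name or href cannot
    cause an IndexError the way indexing by range(len(names)) could.
    """
    return [(name.strip(), urljoin(index_url, href))
            for name, href in zip(names, hrefs)]


if __name__ == '__main__':
    # Hypothetical values standing in for the two XPath results.
    sample_names = ['Route 1', 'Route 2']
    sample_hrefs = ['/lines_1.html', 'http://haerbin.gongjiao.com/lines_2.html']
    for name, route_url in pair_routes(INDEX_URL, sample_names, sample_hrefs):
        print(name, route_url)
```

If the site already emits absolute links, this step changes nothing, which makes it a low-risk safeguard.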


Reposted from blog.csdn.net/mqc925900181/article/details/86497586
