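Recently the database behind my JavaWeb project had too few books in it, so I decided to scrape some data from the Douban book lists.

The first step is to fetch the pages and extract the data, storing each record as a dict for now.

Here is the code: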

import requests
from pyquery import PyQuery as pq


class DoubanSpider:  # class name is assumed; the original shows only the methods
    def __init__(self, keyword):
        self.keyword = keyword
        # Tag listing page, e.g. https://book.douban.com/tag/<keyword>
        self.url = "https://book.douban.com/tag/" + self.keyword
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400"
        }

    def get_page(self, start):
        # The listing is paged by offset in steps of 20
        params = {
            "start": start * 20,
            "type": "T"
        }
        response = requests.get(self.url, params=params,
                                headers=self.headers).text
        return response

    def get_book(self, html):
        doc = pq(html)
        for items in doc("li.subject-item").items():
            book = items.find("h2").text()
            message = items.find("div.pub").text()
            score = items.find("span.rating_nums").text()
            # Drop the first and last character (the surrounding parentheses)
            number = items.find("span.pl").text()[1:-1]
            yield {
                "book": book,
                "message": message,
                "score": score,
                "number": number
            }

Each item yielded is a dict.

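Once the data is out, the message field in the dict looks messy, but it follows a pattern.
Splitting it on / gives a list of the separated strings.
Because message is short, there is no need to loop over the list: the fields can be read directly by index,
which gives the author and the price.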
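
The split described above can be sketched as follows. This is a minimal sketch, not the original code: the helper name `parse_message` and the sample string are made up for illustration, and it assumes the author is the first field and the price is the last one.

```python
def parse_message(message):
    """Split a 'div.pub' string on '/' and pick the author and price by index.

    Assumes the author comes first and the price comes last.
    """
    parts = [part.strip() for part in message.split("/")]
    author = parts[0]
    price = parts[-1]
    return author, price


# Made-up sample in the "author / publisher / date / price" shape
sample = "余华 / 作家出版社 / 2012-8 / 20.00元"
author, price = parse_message(sample)
```

Entries with a translator or a missing price would shift the positions, so real data may need an extra check on the number of fields.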

At this point
we have the title, author, and price.

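There was still a small problem here.
The scraped price comes in the form "xx元",
while the database column is an int.
So call split('元') and take the first item of the resulting list, which gives a string containing only the number. The rest is left to the cursor: MySQL will convert it to int on insert.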
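
As a small illustration of that step, the sketch below strips the "元" suffix. The function names are hypothetical. The second helper is an optional variant, not part of the original approach, that converts to int in Python instead of leaving the conversion to MySQL.

```python
def price_digits(price):
    """Return the numeric part of a price such as '20.00元' as a string."""
    return price.split("元")[0].strip()


def price_as_int(price):
    """Optional variant: convert in Python, dropping any fractional part."""
    return int(float(price_digits(price)))
```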

The next idea is to write to the database while scraping.

Connecting to the database from Python is fairly simple:

import pymysql

conn = pymysql.connect(
    host='localhost',
    port=10047,
    user='root',
    passwd='qaz1234567',
    db='Store',
    charset='utf8'
)

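Once you have the connection, you can run SQL statements through a cursor.

I also ran into a few problems with inserts.

Python supports parameterized statements for MySQL too. Unlike JDBC, which uses ?,
the placeholder is %s.
Also, the table has an ID column that auto-increments.
In JavaWeb this was handled through a Bean object.
When inserting from Python, the ID needs a %s placeholder as well; just pass 0 for it
and MySQL will assign the next ID automatically (its default behavior for an AUTO_INCREMENT column).
Inserting through the cursor also needs some care.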
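
To make the placeholder rule concrete, here is a minimal sketch of a statement and its matching argument tuple. The table name `book` and its column names are assumptions, since the original does not show the schema. The sample values are made up.

```python
# Hypothetical table and columns; the real schema is not shown in this post.
INSERT_SQL = "INSERT INTO book (id, name, author, price) VALUES (%s, %s, %s, %s)"


def build_insert_args(name, author, price):
    """Build the argument tuple, passing 0 for the auto-increment id."""
    return (0, name, author, price)


args = build_insert_args("活着", "余华", "20.00")
```

There is one value per %s, in the same order as the column list.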

Inserting a single row

def db_insert_data(self, sql, cur, *args):
    try:
        # print(args)
        # args is a tuple holding one value per %s placeholder in sql
        result = cur.execute(sql, args)
        print('Insert succeeded, rows affected:', result)
    except Exception as e:
        print('db_insert_data error: ', e.args)

args is a variadic parameter.

Just pass the values for all the placeholders together.
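
The sketch below shows how the variadic call packs its values. It is a simplified standalone version of the function (no `self`, no error handling), and `RecordingCursor` is a made-up stand-in for a real cursor so that the example runs without a database. The SQL and values are illustrative only.

```python
class RecordingCursor:
    """Stand-in for a real cursor: records calls instead of talking to MySQL."""

    def __init__(self):
        self.calls = []

    def execute(self, sql, args):
        self.calls.append((sql, args))
        return 1  # pretend one row was affected


def db_insert_data(sql, cur, *args):
    # Everything after cur is collected into the tuple args
    return cur.execute(sql, args)


cur = RecordingCursor()
sql = "INSERT INTO book VALUES (%s, %s, %s, %s)"  # hypothetical table
db_insert_data(sql, cur, 0, "活着", "余华", "20.00")
```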

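With these problems solved,
the print output said the inserts had gone through,
but nothing showed up in MySQL.
Always remember the cleanup at the end:
the transaction has to be committed on the connection before it is closed.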
cur.close()

conn.commit()

conn.close()

Once the program finishes running, the data is in the database.
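
The cleanup steps can also be wrapped in a helper so the commit is not forgotten. This is a sketch under assumptions: `save_rows` is a hypothetical name, and `conn` is expected to be a connection such as the one returned by `pymysql.connect`. The rollback on error is an addition that the original code does not have.

```python
def save_rows(conn, sql, rows):
    """Insert every row, commit once, and always release the resources."""
    cur = conn.cursor()
    try:
        for row in rows:
            cur.execute(sql, row)
        # Without commit the inserts are discarded when the connection closes
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
        conn.close()
```

Committing once after the loop keeps all inserts in a single transaction. Committing after each row would also work, at the cost of more round trips.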
