Saving data scraped with Python to a database

There are many ways to save scraped data to a database. A convenient one is the sqlite3 module that ships with Python.

# Import the required modules
import urllib.request
import re
import sqlite3


# Fetch one page of search results for a keyword and extract the matches
def get_content(page, key):
    url = 'https://search.51job.com/list/010000%252C020000%252C030200%252C040000,000000,0000,00,9,99,' + key + ',2,' + str(page) + '.html'
    # The with block closes the HTTP response once the page has been read
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode('gbk')
    # Keep only postings in the four cities whose salary is given as a range;
    # each match is a tuple (city, low, high, unit, period)
    lst = re.findall(r'<span class="t3">(北京|上海|广州|深圳).*?</span>\s+<span class="t4">(\d+\.?\d?)-(\d+\.?\d?)(万|千)/(年|月)</span>', html)
    return lst


# Connect to the database with sqlite3 and create the jobs table
conn = sqlite3.connect('51.db')
c = conn.cursor()
c.execute('''CREATE TABLE IF NOT EXISTS jobs
        (key text, addr text, min float, max float)''')
c.execute('''delete from jobs''')  # clear rows left over from a previous run
conn.commit()  # commit the transaction

# Write the data to both 51.txt and the database
# (explicit encoding, because the city names are Chinese)
with open('51.txt', 'w', encoding='utf-8') as f:
    f.write('%s\t%s\t%s\t%s\n' % ('key', 'addr', 'min', 'max'))
    for key in ('python', 'java'):
        for page in range(1, 11):
            for items in get_content(page, key):
                # low/high instead of min/max, so the built-ins are not shadowed
                low = float(items[1])
                high = float(items[2])
                if items[3] == "千":    # thousands -> ten-thousands, so all rows share one unit
                    low /= 10
                    high /= 10
                if items[4] == "年":    # yearly -> monthly
                    low /= 12
                    high /= 12
                f.write('%s\t%s\t%s\t%s\n' % (key, items[0], round(low, 2), round(high, 2)))
                c.execute("INSERT INTO jobs VALUES (?,?,?,?)", (key, items[0], round(low, 2), round(high, 2)))
conn.commit()
conn.close()

# When run as a script, also print the first page of 'python' results as a quick check
if __name__ == '__main__':
    lst = get_content(1, 'python')
    print(lst)
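
The unit conversion inside the loop is easier to check by hand when it is pulled out into a small helper. The following is only a sketch: the function name `to_wan_per_month` is hypothetical and not part of the original script, and the sample inputs are made up for illustration. It takes the four salary fields in the same shape the regular expression produces (strings) and returns the range in 万 (ten thousand yuan) per month.

```python
def to_wan_per_month(low, high, unit, period):
    """Convert a salary range to 万 (ten thousand yuan) per month.

    unit   -- '万' (ten thousand) or '千' (thousand)
    period -- '月' (per month) or '年' (per year)
    """
    low, high = float(low), float(high)
    if unit == '千':      # thousands -> ten-thousands
        low /= 10
        high /= 10
    if period == '年':    # yearly -> monthly
        low /= 12
        high /= 12
    return round(low, 2), round(high, 2)


# Made-up sample inputs, one per conversion path
print(to_wan_per_month('1', '1.5', '万', '月'))   # -> (1.0, 1.5), already in the target unit
print(to_wan_per_month('8', '10', '千', '月'))    # -> (0.8, 1.0)
print(to_wan_per_month('12', '24', '万', '年'))   # -> (1.0, 2.0)
```

With a helper like this, the body of the innermost loop would shrink to one call followed by the file write and the INSERT.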

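sqlite3 and pymysql differ in several ways. First, sqlite3 is part of Python's standard library and works with an embedded database stored in a local file, so it can be imported directly with nothing extra to install. pymysql is a third-party package: it has to be installed separately with pip (pip install pymysql) and needs a MySQL server to connect to, so there is more that can go wrong during setup. One further detail: the parameter placeholder in sqlite3 is ?, while in pymysql it is %s.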
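
The placeholder difference can be seen in a short sketch. It uses an in-memory SQLite database so that it runs without touching 51.db, and the two rows are made-up sample values. The pymysql variant appears only as a comment, since it needs an installed package and a running MySQL server; the `cursor` name there is an assumption.

```python
import sqlite3

# In-memory database, so nothing is written to disk
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE jobs (key text, addr text, min float, max float)')

# sqlite3, positional style: one ? per value, values passed as a tuple
row = ('python', '北京', 1.0, 1.5)
cur.execute('INSERT INTO jobs VALUES (?, ?, ?, ?)', row)

# sqlite3 also accepts named placeholders with a dict
cur.execute(
    'INSERT INTO jobs VALUES (:key, :addr, :min, :max)',
    {'key': 'java', 'addr': '上海', 'min': 0.8, 'max': 1.0},
)
conn.commit()

rows = cur.execute('SELECT key, addr, min, max FROM jobs ORDER BY rowid').fetchall()
print(rows)
conn.close()

# With pymysql the same insert would use %s for every value, whatever its type:
#   cursor.execute('INSERT INTO jobs VALUES (%s, %s, %s, %s)', row)
```

In both libraries the values are passed separately from the SQL string rather than formatted into it, so the driver handles quoting and escaping.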
