Data storage of Python crawler

How should data be stored after crawling? This article walks through the common ways of saving crawled data to txt, csv, and a database.

1. Result output

So far the results are only printed to the console. How do we persist them?

# iterate over the parsed fields in parallel and print each record
for title, actor, time, score, count, comment in zip(titles, actors, times, scores, counts, comments):
    actor = actor.strip()               # drop surrounding whitespace
    time = time.strip().split()[0]      # keep only the first token (the release date)
    print(title, actor, time, score, count, comment)

2. Data storage

mode	description
w	Open a file for writing only. If the file already exists, it is truncated, i.e. the original content is deleted. If the file does not exist, a new file is created.
wb	Open a file in binary format for writing only. If the file already exists, it is truncated. If the file does not exist, a new file is created.
w+	Open a file for reading and writing. If the file already exists, it is truncated. If the file does not exist, a new file is created.
wb+	Open a file in binary format for reading and writing. If the file already exists, it is truncated. If the file does not exist, a new file is created.
a	Open a file for appending. If the file already exists, the file pointer is placed at the end, so new content is written after the existing content. If the file does not exist, a new file is created for writing.
ab	Open a file in binary format for appending. If the file already exists, the file pointer is placed at the end. If the file does not exist, a new file is created for writing.
a+	Open a file for reading and appending. If the file already exists, the file pointer is placed at the end and writes go to the end. If the file does not exist, a new file is created for reading and writing.
ab+	Open a file in binary format for reading and appending. If the file already exists, the file pointer is placed at the end. If the file does not exist, a new file is created for reading and writing.
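
A quick way to see the difference between 'w' (truncate) and 'a' (append); a throwaway sketch, the file name is arbitrary:

# 'w' truncates: after the second open, only 'second' remains
with open('demo.txt', 'w') as f:
    f.write('first\n')
with open('demo.txt', 'w') as f:
    f.write('second\n')

# 'a' appends after the existing content
with open('demo.txt', 'a') as f:
    f.write('third\n')

with open('demo.txt') as f:
    print(f.read())  # second\nthird\n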

2.1 Store data to txt

with open('text.txt', 'w', encoding='utf-8') as f:
    # write a single string
    f.write('first line\n')
    # write multiple strings (note: writelines does not add newlines for you)
    f.writelines(['second line\n', 'third line\n'])
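
The same pattern persists the crawl results from section 1; a minimal sketch, assuming the titles, actors, times, scores, counts and comments lists from that snippet are already populated:

# a sketch: assumes the lists from section 1 are in scope
with open('douban.txt', 'a', encoding='utf-8') as f:
    for title, actor, time, score, count, comment in zip(
            titles, actors, times, scores, counts, comments):
        record = [title, actor.strip(), time.strip().split()[0], score, count, comment]
        f.write('\t'.join(map(str, record)) + '\n')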

2.2 Store data to csv

import csv

with open('bilibili.csv', 'w', encoding='utf-8', newline='') as f:
    # create a writer object
    writer = csv.writer(f)
    # write a single row
    writer.writerow(['name', 'score'])
    # write multiple rows
    writer.writerows([('a', 1), ('b', 2), ('c', 3)])
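
Applied to the crawl results from section 1, a header row plus all records can be written in one go; again a sketch that assumes those lists are in scope:

import csv

# a sketch: assumes the lists from section 1 are already populated
with open('douban.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'actor', 'release_time', 'score', 'count', 'comment'])
    writer.writerows(zip(titles, actors, times, scores, counts, comments))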

2.3 Data storage to the database

For more complex data, we can store it in a database. Taking MySQL as an example, first install the pymysql driver:

pip install pymysql

2.3.1 Database connection

import pymysql

# general form (keyword arguments; recent pymysql versions no longer accept positional ones)
db = pymysql.connect(host='IP', user='username', password='passwd', database='DATABASE')

# connect to MySQL (no database selected)
db = pymysql.connect(host='localhost', user='root', password='123456')

# connect to a specific database on MySQL
db = pymysql.connect(host='localhost', user='root', password='123456', database='database')
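
A trivial query is a quick way to confirm the connection works; a sketch, the credentials above are placeholders:

cursor = db.cursor()
cursor.execute('SELECT VERSION()')
print(cursor.fetchone())  # e.g. ('8.0.33',)
cursor.close()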

2.3.2 Create a database, create a data table

# connect to the database server first (no database selected yet)
db = pymysql.connect(host='localhost', user='root', password='123456')

# create a cursor
cursor = db.cursor()

# create the database
cursor.execute("CREATE DATABASE doubanTop250")

# select the new database, then create the table in it
db.select_db('doubanTop250')
cursor.execute('''CREATE TABLE movie
               (`id` INT AUTO_INCREMENT PRIMARY KEY,
                `title` VARCHAR(100) NOT NULL,
                `actor` VARCHAR(100),
                `release_time` VARCHAR(100),
                `score` VARCHAR(100) NOT NULL,
                `count` VARCHAR(100) NOT NULL,
                `comment` VARCHAR(100) NOT NULL)
                DEFAULT CHARSET=utf8;''')
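
If the script may be re-run, IF NOT EXISTS keeps both statements from failing on a second run:

# re-runnable variants: skip creation when the object already exists
cursor.execute("CREATE DATABASE IF NOT EXISTS doubanTop250")
db.select_db('doubanTop250')
# the same clause works for the table: CREATE TABLE IF NOT EXISTS movie (...)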

2.3.3 Insert data

import pymysql

db = pymysql.connect(host='localhost', user='root', password='123456',
                     database='doubanTop250', charset='utf8')
cursor = db.cursor()

# one row of illustrative values, in the same order as the columns
data = ('The Shawshank Redemption', 'Frank Darabont', '1994', '9.7', '2000000', 'Hope sets you free')
sql = '''insert into movie(title,actor,release_time,score,count,comment) values (%s,%s,%s,%s,%s,%s)'''
cursor.execute(sql, data)

# commit the transaction
db.commit()
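
When there are many rows, cursor.executemany sends them all in one call; a sketch with made-up rows:

# batch insert: each tuple matches the %s placeholders in order
rows = [
    ('Movie A', 'Director A', '1994', '9.7', '1000000', 'comment a'),
    ('Movie B', 'Director B', '1993', '9.6', '900000', 'comment b'),
]
cursor.executemany(sql, rows)
db.commit()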

Putting it all together, take the Douban Top250 movie list as an example:

import time

import pymysql
import requests
from lxml import etree


class MovieSpider(object):
    def __init__(self):
        self.headers = {
            'user-agent': 'Mozilla/5.0'
        }
        self.url = 'https://movie.douban.com/top250?start={}&filter='
        self.db = pymysql.connect(host='localhost', user='root',
                                  password='123456', database='doubanTop250')
        self.cursor = self.db.cursor()

    def get_html(self, url):
        # download one page and hand it to the parser
        resp = requests.get(url, headers=self.headers)
        html = resp.text
        self.parse_html(html)

    def parse_html(self, html):
        # extract the fields of each movie with XPath
        xp_html = etree.HTML(html)
        titles = xp_html.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[1]/a/span[1]/text()')
        scores = xp_html.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[2]/div/span[2]/text()')
        counts = xp_html.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[2]/div/span[4]/text()')
        comments = xp_html.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[2]/p[2]/span/text()')

        # insert one row per movie
        for title, score, count, comment in zip(titles, scores, counts, comments):
            data = [title, score, count, comment]
            sql = '''insert into movie(title,score,count,comment) values (%s,%s,%s,%s)'''
            self.cursor.execute(sql, data)

    def main(self):
        start_time = time.time()

        # the list is paginated 25 movies per page: start=0, 25, ..., 225
        for i in range(0, 250, 25):
            url = self.url.format(i)
            self.get_html(url)

        self.db.commit()
        end_time = time.time()
        print('Total time:', end_time - start_time)


if __name__ == '__main__':
    spider = MovieSpider()
    spider.main()
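
To verify what was stored, the rows can be read back with a SELECT; a sketch reusing the same connection settings:

import pymysql

db = pymysql.connect(host='localhost', user='root', password='123456', database='doubanTop250')
cursor = db.cursor()
cursor.execute('SELECT title, score FROM movie LIMIT 10')
for title, score in cursor.fetchall():
    print(title, score)
db.close()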

Recommended reading:

  1. Crawling data with XPath
  2. Using Jupyter Notebook
  3. Crawling the Douban Top250 movies with BeautifulSoup
  4. Mastering the requests module in one article
  5. Python web crawler basics: BeautifulSoup

That's all. If this article helped you, feel free to like and follow; your support means a lot to me.

Origin blog.csdn.net/qq_45176548/article/details/112222080