Python3: database table design and data storage for building your own website! There's an Easter egg at the end of the article, and honestly, I'm getting envious~

Building your own website is one of the signs that a programmer has made it. What about the other signs of success?
Hey... Bai Fumei on the left arm, a little barbecue skewer in the right hand, a Santana under your feet...
um~
Such a carefree life starts right here, with database table design and data storage!

1. Crawl data

Some of you will say: Uncle Yu, isn't this post about database table design? Why are you throwing in a crawler as a freebie?
Haha~
Buy two, get one free...
We crawl the data and store it straight into the database, which saves us the trouble of making up data by hand~
What we're crawling today is this website:

url = https://arxiv.org/list/cs/recent

It's a collection of god-tier works by the big names abroad, and it's free!
If you can't read them, Google Translate will help~!
Let's see what this website looks like:
[screenshot of the arXiv CS listing page]

As usual, on to the code:

# -*- coding: utf-8 -*-
"""
@ auth : carl_DJ
@ time : 2020-8-26
"""

import requests
import csv
import traceback
from bs4 import BeautifulSoup
from PaperWeb.Servers.utils.requests import get_http_session

def gen_paper(skip):
    try:
        # The skip parameter pages through the listing, 100 papers per request
        url = f'https://arxiv.org/list/cs/pastweek?skip={skip}&show=100'
        # Set the request headers
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
        }
        # Set a proxy; if you haven't bought one, you don't need to fill this in
        proxies = {
            'http': "http://127.0.0.1:1092",
            'https': "http://127.0.0.1:1092"
        }
        # Use the http session/proxy pool so resources don't get exhausted
        r = get_http_session().get(url, headers=headers, proxies=proxies)
        # If you are not using a proxy, comment out proxies=proxies and use:
        # r = get_http_session().get(url, headers=headers)
        # Check the response status
        if r.status_code != 200:
            return False, 'Failed to fetch the page'
        # lxml is used because it parses quickly
        soup = BeautifulSoup(r.text, 'lxml')

        # The <dt> tags hold the paper URL information
        all_dt = soup.find_all('dt')
        # The <dd> tags hold the paper title and authors
        all_dd = soup.find_all('dd')

        # zip the dt and dd tags and loop over them in pairs
        for dt, dd in zip(all_dt, all_dd):
            # class is a Python keyword, so BeautifulSoup uses class_ instead
            # Get the paper URL from the href of the <a> tag under that class
            api_url = dt.find(class_='list-identifier').find('a').get('href')
            root_url = 'https://arxiv.org'
            full_url = root_url + api_url

            """
            最开始使用contents,是直接获取到title的内容,text获取的内容很多,多而乱
            """

            #获取title,通过class标签下的contents获取,返回list
            title = dd.find(class_ = 'list-title mathjax').contents
            #对内容进行判断,如果titles的长度>=3,就截取第2部分
            if len(title) >=3:
                title = title[2]
            else:
                #text,返回的是str,
                title = dd.find(class_ = 'list-title mathjax').text

            #论文作者
            authors = dd.find(class_='descriptor').text
            #由于一行很多作者,所以要进行切分
            authors = authors.split(':')[1].replace('\n','')

            #生成器
            yield title,full_url,authors
    except Exception as e:
        # Write the traceback to the error log
        log(traceback.format_exc(), 'error', 'gen_paper.log')
        error_str = f'full_url:[{full_url}], authors:[{authors}], title:[{title}]'
        log(error_str, 'error', 'gen_paper.log')
        # # Print the error instead of logging it:
        # print(e)
        # traceback.print_exc()

def log(content, level, filepath):
    """
    :param content: the error message to write
    :param level: error level ('error' or 'fail')
    :param filepath: path of the log file
    :return:
    """
    if level == 'error':
        with open(filepath, 'a') as f:
            f.write(content)
    elif level == 'fail':
        with open(filepath, 'a') as f:
            f.write(content)

def run_main():
    '''
    Save the crawled papers into a csv file
    :return:
    '''
    # Define an empty list for the results
    resl = []
    # Loop over all the data: about 1273 papers in total, 100 per page
    for i in range(0, 1273, 100):
        # Here i is the skip offset; gen_paper yields (title, full_url, authors)
        for title, full_url, authors in gen_paper(i):
            # Add each paper to the list
            resl.append([title, full_url, authors])
            print(title, 'done')

    # Open the file and save everything once all pages are crawled
    with open('paper.csv', 'w', newline='') as f:
        cw = csv.writer(f)
        # Write each paper as one row of the csv
        for row in resl:
            cw.writerow(row)


if __name__ == '__main__':
    run_main()

There isn't much explanation of the parsing here; if something is unclear, read the comments.
Personally I'd say that while not every single line has a comment, it's far from uncommented!
If you still don't follow, read Xiaoyu's article "Python3, Multi-threaded Crawling of Station B's Video Danmaku and Comments"; everything that can be explained is explained there!
Take a look, this is the proxy setup used:

        # Set a proxy; if you haven't bought one, you don't need to fill this in
        proxies = {
            'http': "http://127.0.0.1:1092",
            'https': "http://127.0.0.1:1092"
        }
        # Use the http session/proxy pool so resources don't get exhausted; pass in the three arguments
        r = get_http_session().get(url, headers=headers, proxies=proxies)

Proxies cost money. If crawling isn't how you earn your living, or you're not a die-hard enthusiast, just grab a few free proxies; they're unstable, but they'll do~
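By the way, the get_http_session() helper is imported from the project's own PaperWeb.Servers.utils.requests module and isn't shown in this post. Here is a minimal sketch of what such a helper might look like, assuming it simply hands back a shared requests.Session with connection pooling and retries; the pool sizes and retry policy below are my assumptions, not the author's code.

# A minimal sketch of what get_http_session() might look like. The real helper
# lives in PaperWeb.Servers.utils.requests and is not shown in the post, so the
# pool sizes and retry policy here are assumptions, not the author's code.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

_session = None

def get_http_session():
    """Return a shared requests.Session with connection pooling and retries."""
    global _session
    if _session is None:
        _session = requests.Session()
        retry = Retry(total=3, backoff_factor=0.5,
                      status_forcelist=(500, 502, 503, 504))
        adapter = HTTPAdapter(pool_connections=10, pool_maxsize=10, max_retries=retry)
        _session.mount('http://', adapter)
        _session.mount('https://', adapter)
    return _session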
Running results:
[screenshot of the crawl output]

Why do I want to show the results? Because, honestly, isn't this crawl's output satisfying?

2. Create a database

2.1 Create a database table

One approach is to write the table-creation SQL in the PyCharm project,
as follows:
create.sql

-- papers table

CREATE TABLE `papers`(
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `title` varchar(500) NOT NULL DEFAULT '' COMMENT 'paper title',
    `url` varchar(200) NOT NULL DEFAULT '' COMMENT 'paper url',
    `authors` varchar(200) NOT NULL DEFAULT '' COMMENT 'paper authors',
    `create_time` int(20) NOT NULL DEFAULT 0 COMMENT 'creation time',
    `update_time` int(20) NOT NULL DEFAULT 0 COMMENT 'update time',
    `is_delete` tinyint(4) NOT NULL DEFAULT 0 COMMENT 'whether deleted',
    PRIMARY KEY (`id`),
    KEY `title_IDX` (`title`) USING BTREE,
    KEY `authors_IDX` (`authors`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='papers table';

What's shown here is just MySQL's basic table creation. If that's new to you, read Xiaoyu's "SQL Basic Usage One".
If you still don't understand it, just look at your bank card balance, then at your left and right hands~
it'll probably come to you!
After writing it, don't forget to actually execute it in the database, otherwise how would the table ever get created~
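You can run create.sql with the mysql command-line client, or, if you prefer, from Python. Here is a quick sketch of the latter; it assumes a local MySQL with the same root/123456 credentials used in mysql.py below, and that the papers database itself has already been created.

# A quick sketch of running create.sql from Python. Assumes a local MySQL with
# the same root/123456 credentials used in mysql.py below, and that the
# `papers` database itself has already been created.
import pymysql

def run_create_sql(path='create.sql'):
    conn = pymysql.connect(host='localhost', port=3306, user='root',
                           password='123456', db='papers', charset='utf8')
    try:
        with open(path, 'r', encoding='utf8') as f:
            ddl = f.read()
        # create.sql contains a single CREATE TABLE statement, so one execute is enough
        with conn.cursor() as cursor:
            cursor.execute(ddl)
        conn.commit()
    finally:
        conn.close()

if __name__ == '__main__':
    run_create_sql()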

2.2 Connect to the database

We've created the database table; now we have to connect to the database through pymysql so we can actually use it~
On with the code:
mysql.py

# -*- coding: utf-8 -*-
"""
@ auth : carl_DJ
@ time : 2020-8-26
"""

import pymysql
import logging

'''
Create the MySQL database connection
'''

class MySQL(object):
    # Set up the basic connection info for the database
    def __init__(self, host='localhost', port=3306, user='root', password='123456', db='test'):
        # cursorclass = pymysql.cursors.DictCursor makes queries return dicts --> the default return type is tuple
        self.conn = pymysql.connect(
            host=host,
            port=port,
            user=user,
            password=password,
            db=db,
            charset='utf8',
            cursorclass=pymysql.cursors.DictCursor
        )
        # Define the logger
        self.log = logging.getLogger(__name__)

    def execute(self, sql, kwargs=None):
        # Execute a statement on the MySQL connection
        try:
            # Get a cursor
            cursor = self.conn.cursor()
            # Run the statement through the cursor
            cursor.execute(sql, kwargs)
            # Commit inserts, deletes and other write operations
            self.conn.commit()
            return cursor

        except Exception as e:
            # Log the error
            self.log.error(f'mysql execute error:{e}', exc_info=True)
            raise e

    def query(self, sql, kwargs=None):
        # Run a query
        cursor = None
        try:
            cursor = self.execute(sql, kwargs)

            if cursor:
                # Return everything the query found
                return cursor.fetchall()
            else:
                raise Exception(f'sql error:{sql}')
        except Exception as e:
            self.log.error(e)
            raise e
        finally:
            # Close the cursor if it was opened
            if cursor:
                cursor.close()

    def insert(self, sql, kwargs=None):
        # Insert data
        cursor = None
        try:
            cursor = self.execute(sql, kwargs)
            if cursor:
                # Get the id of the last inserted row
                row_id = cursor.lastrowid
                return row_id
            else:
                raise Exception(f'sql error:{sql}')
        except Exception as e:
            self.log.error(e)
            raise e
        finally:
            if cursor:
                cursor.close()

    def escape_string(self, _):
        # Escape the contents of the data file to guard against special characters
        return pymysql.escape_string(_)

db = MySQL(user='root',password='123456',db='papers')


The library used here is pymysql. If you don't know how to use it, see Xiaoyu's article "Python3 Linking MySQL Database".
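Before moving on, here is a tiny usage sketch of this db helper. The values are made-up examples, and it assumes the papers table from section 2.1 already exists; pymysql fills in the %s placeholders itself, so no manual escaping is needed.

from PaperWeb.libs.mysql import db  # same import path that csv_to_mysql.py uses below

# Parameterized insert: the tuple is passed straight through to cursor.execute()
row_id = db.insert(
    "INSERT INTO papers (title, url, authors, create_time, update_time) "
    "VALUES (%s, %s, %s, %s, %s)",
    ('An Example Paper', 'https://arxiv.org/abs/0000.00000', 'A. Author', 0, 0),
)

# query returns a list of dicts because cursorclass is DictCursor
rows = db.query("SELECT id, title, authors FROM papers WHERE id = %s", (row_id,))
print(rows)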

3. Data storage

Now we store the crawled data in the database.
As before, straight to the code:

csv_to_mysql.py

# -*- coding: utf-8 -*-
"""
@ auth : carl_DJ
@ time : 2020-8-26
"""

import csv
import time
from PaperWeb.libs.mysql import db

def get_csv_info(path='paper.csv'):
    # Open the csv file and yield one row at a time
    with open(path, 'r') as csvfile:
        reader = csv.reader(csvfile)
        for item in reader:
            yield item

def get_insert_sql():
    # Insert the csv data into the database
    items = []
    # Get the current timestamp
    _time = int(time.time())
    for item in get_csv_info():
        # Escape the characters read from the csv file
        item = [db.escape_string(_) for _ in item]
        # Build one VALUES tuple per row, matching the columns of the papers table
        items.append(f"('{item[0]}','{item[1]}','{item[2]}',{_time},{_time})")
    # Join all the VALUES tuples
    values = ','.join(items)

    # Execute the INSERT statement
    sql = f'''
        INSERT INTO
         papers (`title`, `url`, `authors`, `create_time`, `update_time`)
        VALUES
         {values}
    '''
    row_id = db.insert(sql)
    print(f'insert finished, last row id: {row_id}')


if __name__ == '__main__':
    get_insert_sql()


After it runs, you can go into the database and check the results!
If they weren't perfect, Xiaoyu would think that impossible~
And if they are perfect, then grinding away on this until midnight wasn't a waste of Xiaoyu's liver!
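One more note: building the VALUES string by hand (and escaping each field yourself) works, but here is a hedged alternative sketch using pymysql's parameter placeholders and executemany, which handles the quoting for you. This is not the author's code, just another way to do the same insert against the same papers table.

# Not the author's code: an alternative bulk insert that lets pymysql do the
# escaping via %s placeholders instead of hand-built VALUES strings.
import csv
import time

from PaperWeb.libs.mysql import db

def insert_csv_with_executemany(path='paper.csv'):
    # Current timestamp used for both create_time and update_time
    _time = int(time.time())
    with open(path, 'r') as f:
        # Each csv row is [title, url, authors]; skip any blank lines
        rows = [(row[0], row[1], row[2], _time, _time)
                for row in csv.reader(f) if row]
    sql = ("INSERT INTO papers (title, url, authors, create_time, update_time) "
           "VALUES (%s, %s, %s, %s, %s)")
    # executemany sends the parameters separately, so no manual escaping is needed
    with db.conn.cursor() as cursor:
        cursor.executemany(sql, rows)
    db.conn.commit()
    print(f'{len(rows)} rows inserted')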

After recording a lesson I write up a summary,
drink some coffee, take the baby for a walk, and once the
baby is asleep, I get back to grinding out code.

4. Easter eggs

After the website is built, Xiaoyu has two goals:
1. Write some blog posts on Python data analysis;
2. Share his experience interviewing at Ali and finally landing the offer, along with the interview questions.

Haha, consider that a spoiler given in advance~

Why do I want to write about the Ali interview experience?
Because recently a friend was preparing for an Ali interview and asked me a bunch of questions about it. Rather than answering them one by one, it's better to write everything up and share it with more people!

That way, Xiaoyu saves time and gets more things done!!

Okay, finally, let me borrow a line from Wang Er to end today's technology sharing:

If it weren't for the pressures of life, who would choose to be this talented!


Origin blog.csdn.net/wuyoudeyuer/article/details/108232508