A first look at Python: writing a simple crawler

Preparation

This article is a summary of my own first exploration of Python, so the tutorial is aimed at beginners and contains nothing advanced. Python comes preinstalled on most Linux distributions, and the official website provides a convenient installer for Windows. There are plenty of tutorials online covering installation and configuration, so I won't repeat them here. The Python version I use is 3.6.4, and the code below is written for Python 3.

Analyze requirements

Even a small crawler needs to fetch web pages and store the matched content. So first we install requests, the old friend of every Python crawler: `pip install requests`. Then we install the pymysql extension so we can insert the matched content into a MySQL database: `pip install pymysql`.

Step 1: Fetch the web page content

One interesting thing about Python is that you import whatever you need. It's not like PHP, where fetching a page is a single call to file_get_contents and you're done. Without further ado, here is the code:

# -*- coding:utf-8 -*-

# Load the requests module
import requests
# Fetch a Response object with a GET request
response = requests.get('https://www.xxx.com/')
if response:
    # Print the HTML to the console
    print(response.text)
else:
    # Print an error message
    print('requests error')

Indentation in Python is strict: a statement block is indented by 4 spaces. A common beginner mistake is mixing the Tab key and the space key, which produces inconsistent indentation. If you see the error message IndentationError: unexpected indent, your indentation is inconsistent. If you have no coding background, I recommend reading the basic concepts of Python first: http://www.kuqin.com/abyteofpython_cn/ch04.html . If you have a coding background but haven't thought much about indentation, have a look at Python's indentation rules: http://www.kuqin.com/abyteofpython_cn/ch04s09.html
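As a minimal illustration, every line in a block must be indented by the same amount (four spaces is the convention), and tabs must not be mixed with spaces:

```python
# Consistent four-space indentation: each line of the if/else
# bodies lines up, and only spaces are used.
def classify(n):
    if n % 2 == 0:
        return 'even'
    else:
        return 'odd'

print(classify(4))  # even
```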

OK, with the code written, let's run it and check the console: the HTML is printed out perfectly.

Step 2: Match content with regular expressions

Now that we can get the HTML, we need to extract the part we want, and that is where regular expressions come in. Python has shipped the re module since version 1.5, which provides Perl-style regular expressions. The details are covered in the Runoob tutorial: http://www.runoob.com/python/python-reg-expressions.html . Again, here is the code:

# -*- coding:utf-8 -*-
# Load the requests module
import requests
# Load the re module
import re

response = requests.get('https://www.xxx.com/')
# Match the text with a regular expression
match = re.findall(r'<p><!--markdown-->([\s\S]*?)</p>', response.text)
if match:
    # Print the matched content to the console
    print(match[0])
else:
    # Print the HTML to the console
    print(response.text)


Note: the original URL displays a random sentence, which changes on every refresh.
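To see what re.findall actually returns, here is a small self-contained sketch run against a hard-coded HTML fragment (the pattern is the same one used above; the sample text is made up):

```python
import re

# A made-up HTML fragment in the shape the pattern expects
html = '<p><!--markdown-->Hello, world</p><p><!--markdown-->Second one</p>'

# findall returns a list with one entry per match, each entry
# being the text captured by the ([\s\S]*?) group
match = re.findall(r'<p><!--markdown-->([\s\S]*?)</p>', html)
print(match)  # ['Hello, world', 'Second one']
```

The non-greedy `*?` is what stops each match at the first `</p>` instead of swallowing the whole page.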

Step 3: Match in a loop and insert into the database

First, let's get the database and table ready; they can be created with an SQL statement:

CREATE DATABASE IF NOT EXISTS `sentence`;
USE `sentence`;

CREATE TABLE IF NOT EXISTS `sexy` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `content` varchar(50) NOT NULL,
  `datetime` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  UNIQUE KEY `content` (`content`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

Here, content is declared as a UNIQUE KEY to ensure that the captured content is never duplicated: if the value already exists, the insert is simply skipped.
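With that unique key in place, a plain INSERT of an existing sentence raises a duplicate-key error, which the Python code below handles with an exception handler. If you would rather have MySQL skip duplicates silently, MySQL's INSERT IGNORE does exactly that (a sketch against the same table; the sentence is a placeholder):

```sql
-- Duplicate `content` values are silently skipped instead of raising an error
INSERT IGNORE INTO `sexy` (`content`) VALUES ('example sentence');
```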

# -*- coding:utf-8 -*-
# Load the requests module
import requests
# Load the re module
import re
# Load the pymysql module
import pymysql
# Load the time module so we can pause between requests
import time

# Open the database connection (keyword arguments, since newer
# pymysql versions no longer accept positional arguments)
db = pymysql.connect(host='127.0.0.1', user='root', password='root',
                     database='sentence', charset='utf8')
# Get a cursor with the cursor() method
cursor = db.cursor()

# Loop forever
while True:
    response = requests.get('https://www.xxx.com/')
    # Match the text with the regular expression
    match = re.findall(r'<p><!--markdown-->([\s\S]*?)</p>', response.text)
    if match:
        # Parameterized query: quotes in the matched text cannot break the SQL
        sql = "INSERT INTO `sexy` (`content`) VALUES (%s)"
        try:
            # Execute the SQL statement
            cursor.execute(sql, (match[0],))
            # Commit to the database
            db.commit()
        except pymysql.MySQLError:
            # Roll back on error (e.g. a duplicate sentence)
            db.rollback()
        # Print the statement and the captured text to the console
        print(sql, match[0])
    else:
        # Print the HTML to the console
        print(response.text)
    # Be polite to the server: wait a second before the next request
    time.sleep(1)

Running the demo prints each executed statement to the console, and checking the database confirms the sentences were inserted. (The original post showed screenshots of the console output and the database contents here.)

Summary

Python is a good thing; you can build anything on it. I find tutorial posts hard to write: there are details everywhere, and covering them all makes the article tedious, while keeping it brief makes it hard for beginners to follow. My respect to those who write them well. Note: since the original URL is not convenient to publish, all URLs in the code are replaced with xxx.com. The source code is on GitHub: https://github.com/st1ven/Python-Spider-Demo , and Stars are welcome.
