A complete tutorial: crawling asynchronously loaded web pages with Scrapy+Selenium and deploying to a Linux (Debian) server

A few days ago a project called for this, so I spent three days writing a crawler and deploying it to a server. I had never touched a Linux server before, so I consulted a lot of blog posts and wrote up this complete tutorial.

First, my basic environment:

Windows 11, Python 3.9, MySQL, Debian 11, Google Chrome.

Now let's get to the point. I'll take the Jianshu website as an example.

1. Write the crawler locally (Scrapy+Selenium)

1. Install Scrapy

pip install scrapy

2. Create a Scrapy crawler project

I'll use crawling Jianshu's paid serialized books as the example. The page is loaded asynchronously, so pay special attention here: if you are sure your crawler code is fine but you still can't extract any data from the page, don't hesitate; the site loads its content asynchronously, and Scrapy alone is not enough. It has to be combined with Selenium or another approach.

First, be clear about what to crawl. This time we crawl the book title, the author, and the read count, then store all of it in a MySQL database. If the data you crawl contains timestamps, they are handled the same way as the fields above.

Open cmd, change into the folder where you want to store the crawler project, and enter:

scrapy startproject jianshuSpider

Replace jianshuSpider with your own project name, then cd into the project and run the command that creates the spider. Note that the project name and the spider name are two different things, and they cannot be the same!

cd jianshuSpider
# scrapy genspider <spider name> <allowed domain>
scrapy genspider jianshu jianshu.com

The result is shown in the figure:

3. Write a crawler

Open the project in PyCharm; the structure is as follows:
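(This is the standard layout generated by scrapy startproject, with jianshu.py added by scrapy genspider.)

jianshuSpider/
    scrapy.cfg
    jianshuSpider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            jianshu.py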

jianshu.py: the spider's logic: what to crawl, how to extract data from the page, and so on; we write this ourselves;

items.py: the item classes that hold the scraped data (you can define several); we write this ourselves;

middlewares.py: middlewares that sit in the request/response flow; usually little needs to change here;

pipelines.py: pipelines for data persistence; the database insert/update/delete/query code goes here;

settings.py: configuration; mostly left at the defaults, just pay attention to enabling a few options.

3.1 Modify settings.py

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# export encoding (added by me)
FEED_EXPORT_ENCODING = 'utf-8'

DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True   # add this line yourself

# Disable cookies (enabled by default)
COOKIES_ENABLED = False   # keeps the server from tracking us

# Override the default request headers:  
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36'
}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'jianshuSpider.pipelines.JianshuspiderPipeline': 300,
}

To find your own User-Agent for this setting: enter about:version in the browser's address bar, and the User-Agent shown there is your browser's User-Agent.

3.2 Write items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class JianshuspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    author = scrapy.Field()
    readtimes = scrapy.Field()
    pass

3.3 Write jianshu.py

This step is all about extracting data from the page. I use XPath selectors, which are very easy to work with, and Chrome can copy an XPath for you: right-click the page, click Inspect, select the element whose data you need, then right-click it and choose Copy > Copy XPath.
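If you want to sanity-check an XPath expression before wiring it into the spider, you can run it against an HTML fragment with Scrapy's Selector. A minimal sketch (the HTML below is made up purely for illustration and is not Jianshu's real markup):

# xpath_check.py: try an XPath expression outside the spider (made-up HTML fragment)
from scrapy.selector import Selector

html = '''
<div id="book-waterfall">
  <div><div><div class="book-info">
    <p>Sample Book Title</p>
    <div><span><span>Sample Author</span></span><span> 12345 reads</span></div>
  </div></div></div>
</div>
'''
sel = Selector(text=html)
for book in sel.xpath('//*[@id="book-waterfall"]/div'):
    print(book.xpath('.//p/text()').get())          # Sample Book Title
    print(book.xpath('.//span/span/text()').get())  # Sample Author

The spider's parse method below uses the same kind of copied expressions: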

import scrapy
from jianshuSpider.items import JianshuspiderItem


class JianshuSpider(scrapy.Spider):
    name = 'jianshu'
    allowed_domains = ['jianshu.com']

    # start_urls can hold several URLs; the spider visits them one by one (separate them with commas)
    start_urls = ['https://www.jianshu.com/mobile/books?category_id=284']

    def parse(self, response):
        books = response.xpath('//*[@id="book-waterfall"]/div')
        for book in books:
            bookitem = JianshuspiderItem()  # create a fresh item for every book
            bookitem['name'] = book.xpath('./div/div[2]/p/text()').get()
            bookitem['author'] = book.xpath('./div/div[2]/div/span[1]/span/text()').get()
            bookitem['readtimes'] = (book.xpath('./div/div[2]/div/span[2]/text()').get()).lstrip()
            print("Title:", bookitem['name'])
            print("Author:", bookitem['author'])
            print("Reads:", bookitem['readtimes'])

Now you can run the crawler: enter scrapy crawl jianshu in the PyCharm terminal and press Enter. Alternatively, create a new start.py file in the project. Pay attention: it must sit at the same level as the spiders directory!

From then on you only need to run this one file. The start.py code is as follows:

from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'jianshu', '--nolog'])

After running it, we find that we got no data at all!

That's because the site loads its content asynchronously, which we can solve with Selenium.

3.4 Selenium solves the asynchronous loading problem

 Install Selenium: pip install selenium

To install the Chrome browser driver (chromedriver), refer to this article; there is no need to add it to the environment variables.

A quick word on what these two pieces do. Selenium is a staple of software testing: it automatically executes scripts that we define, and together with the browser driver it can drive a browser instance to simulate human behaviour such as clicking and dragging on a page. Be aware that when it runs on a Linux server, startup can be very slow, sometimes five or six minutes.
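Once both are installed, a quick sanity check helps confirm that Chrome, chromedriver, and Selenium can talk to each other. A minimal sketch (it assumes chromedriver is on your PATH, or that a recent Selenium manages the driver for you; the options mirror the ones used in the spider below):

# selenium_check.py: confirm Chrome + chromedriver + Selenium work together
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')    # no visible browser window
options.add_argument('--no-sandbox')  # needed when running as root on the server
bro = webdriver.Chrome(options=options)
bro.get('https://www.jianshu.com/mobile/books?category_id=284')
print(bro.title)  # if a title is printed, the driver works
bro.quit()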

After everything is installed, you can write a complete crawler code.

Complete jianshu.py

# jianshu.py
import scrapy
from selenium import webdriver
from jianshuSpider.items import JianshuspiderItem
from selenium.webdriver.chrome.options import Options

class JianshuSpider(scrapy.Spider):
    name = 'jianshu'
    allowed_domains = ['jianshu.com']

    # start_urls can hold several URLs; the spider visits them one by one (separate them with commas)
    start_urls = ['https://www.jianshu.com/mobile/books?category_id=284']

    # instantiate one browser object for the whole spider
    def __init__(self):
        # options that help keep the site from detecting Selenium
        options = Options()
        options.add_argument('--no-sandbox')
        options.add_argument("--headless")
        options.add_experimental_option('excludeSwitches', ['enable-automation'])
        options.add_experimental_option('useAutomationExtension', False)
        options.add_argument('--ignore-certificate-errors')
        options.add_argument('--ignore-ssl-errors')
        self.bro = webdriver.Chrome(options=options)
        super().__init__()

    def parse(self, response):
        books = response.xpath('//*[@id="book-waterfall"]/div')
        for book in books:
            bookitem = JianshuspiderItem()  # create a fresh item for every book
            bookitem['name'] = book.xpath('./div/div[2]/p/text()').get()
            bookitem['author'] = book.xpath('./div/div[2]/div/span[1]/span/text()').get()
            bookitem['readtimes'] = (book.xpath('./div/div[2]/div/span[2]/text()').get()).lstrip()
            yield bookitem

    # method added to the spider: close the browser when the spider finishes
    def closed(self, spider):
        print("spider closed")
        print("Browser closed")
        self.bro.quit()



Next, modify middlewares.py. According to posts online, there are two ways to use Selenium in Scrapy: override process_request or override process_response. The difference is that the former opens only one browser window, while the latter may open several browser windows depending on the code, which is slower. So I use the first method; the code is as follows:

# 完整的middlewares.py
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals
import time
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
from scrapy.http import HtmlResponse

class JianshuspiderSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class JianshuspiderDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        bro = spider.bro
        bro.get(request.url)  # load each request's URL in the spider's single browser instance
        # scroll to the bottom of the page, with pauses so the asynchronously loaded content can appear
        bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
        time.sleep(1)
        bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
        time.sleep(1)
        text = bro.page_source
        response = HtmlResponse(url=request.url, body=text.encode('utf-8'), encoding='utf-8', status=200)
        print("Visiting: {0}".format(request.url))
        return response

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
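One thing to double-check, since the settings.py snippet shown earlier only enables ITEM_PIPELINES: the downloader middleware also has to be enabled in settings.py, otherwise its process_request is never called. Something along these lines (543 is the priority Scrapy's own template suggests):

# settings.py: enable the downloader middleware so process_request runs
DOWNLOADER_MIDDLEWARES = {
   'jianshuSpider.middlewares.JianshuspiderDownloaderMiddleware': 543,
}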

3.5 Connect to the database for storage

Install pymysql: pip install pymysql

First, in Navicat on the local machine, create a new database jianshu and a new table books with three fields: name, author, and readtimes.
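If you would rather create them with a script than click through Navicat, here is a sketch using pymysql (the column types are my own assumption; adjust them to your data):

# create_db.py: create the jianshu database and books table (column types are an assumption)
import pymysql

conn = pymysql.connect(host="localhost", user="root", passwd="your_password", charset="utf8")
cursor = conn.cursor()
cursor.execute("CREATE DATABASE IF NOT EXISTS jianshu DEFAULT CHARACTER SET utf8")
cursor.execute("USE jianshu")
cursor.execute("""
    CREATE TABLE IF NOT EXISTS books (
        name VARCHAR(255) NOT NULL PRIMARY KEY,  -- a key lets REPLACE INTO overwrite duplicates
        author VARCHAR(255),
        readtimes VARCHAR(64)
    ) DEFAULT CHARSET=utf8
""")
conn.commit()
cursor.close()
conn.close()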

Then modify pipelines.py

import pymysql
class JianshuspiderPipeline:
    def process_item(self, item, spider):
        conn = pymysql.connect(
            host="...",
            user="...",
            passwd="...",
            charset="utf8",
            use_unicode=False
        )
        cursor = conn.cursor()
        cursor.execute("USE jianshu")
        sql = "REPLACE INTO books(name, author, readtimes)" \
                   "VALUES(%s, %s, %s)"
        try:
            cursor.execute(sql,
                            (item['name'], item['author'], item['readtimes']))
            conn.commit()
            print("=================正在写入数据库==================")
        except BaseException as e:
            print("错误在这里>>>>>>>>>>>>>", e, "<<<<<<<<<<<<<错误在这里")
        conn.close()
        return item
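Opening a new database connection for every single item works, but it is slow. As a side note, here is a sketch of the same pipeline that opens the connection once when the spider starts and closes it when the spider ends (assuming the same jianshu database and books table):

# alternative pipelines.py: one connection per spider run instead of one per item
import pymysql

class JianshuspiderPipeline:
    def open_spider(self, spider):
        self.conn = pymysql.connect(host="...", user="...", passwd="...",
                                    db="jianshu", charset="utf8")
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        sql = "REPLACE INTO books(name, author, readtimes) VALUES(%s, %s, %s)"
        self.cursor.execute(sql, (item['name'], item['author'], item['readtimes']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()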

Run start.py, and you can see the results in Navicat:

That means the local crawler is done! Next, deploy it to the Linux server.

2. Deploy to the server

1. Purchase a server

Choose whatever fits your budget; new-user offers are very cheap, and there are free trials as well. I use an Alibaba Cloud server running Debian 11.

2. Upload the project to the server

Download FileZilla, a completely free tool for transferring files to and from the server.

In FileZilla, enter the public IP of the server you just purchased as the host, then the username, password, and port (enter 22), and click Quickconnect; the connection succeeds.

Then, in the remote (right-hand) panel, right-click home, create a directory named python projects, and click OK.

Select the python projects folder, then in the local (left-hand) panel select the project to upload, right-click, and choose Upload.

The upload succeeds; if a transfer fails, just upload it again. Note! After debugging the code locally, remember to upload it to the server again to overwrite the old copy!

3. Linux server environment configuration 

0. Preparation

Use the FTP software to upload the local crawler project to the server's /home directory. I then use PuTTY to connect to the server remotely instead of controlling it through the web-based terminal; either works.

Follow the steps below to configure the server:

1. apt update

2. apt upgrade -y

3. apt install mysql-server

If an error is reported, this article covers the fix:

Install MySQL on Linux (solving "E: Package 'mysql-server' has no installation candidate" and "ERROR 1698 (28000)") - Programmer Sought

After installation, enter the following commands to allow remote connections to the database:

mysql -u root -p

select host,user,password from mysql.user;

GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'your password (change this)' WITH GRANT OPTION;

To set the MySQL port, you can refer to this article:

https://www.cnblogs.com/linjiqin/p/5270938.html

Then go to the server's web console, edit the security group rules, and open port 3306 (MySQL).
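You can then confirm from your local machine that remote access works with a short pymysql test (the IP and password below are placeholders, not real values):

# test_remote_mysql.py: check that the server's MySQL accepts remote connections
import pymysql

conn = pymysql.connect(
    host="your.server.public.ip",  # the server's public IP
    port=3306,
    user="root",
    passwd="your_password",
    charset="utf8",
)
with conn.cursor() as cursor:
    cursor.execute("SHOW DATABASES")
    print(cursor.fetchall())  # if this prints, remote access works
conn.close()

Remember that the jianshu database and books table need to exist on whichever MySQL instance the pipeline's host setting points to.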

4. If sudo is not available: install it as root with apt install sudo.

5. The system already comes with Python 3; you may still need pip and the Python dependencies (for example, apt install python3-pip, then pip3 install scrapy selenium pymysql).

6. Install Google Chrome, driver, selenium

Reference: Install chromedriver, chrome and run selenium on Ubuntu 16.04

7. Server firewall settings

apt install ufw

systemctl enable ufw

systemctl start ufw

ufw allow ssh

ufw allow http

ufw allow 3306

8. Test run: enter the crawler directory

cd /home/Catch_Data

python3 start.py

It runs fine.

9. Write a shell script

Go to the server's main directory, create a new scripts folder, and create the spider.sh file:

mkdir scripts

cd scripts

cat > spider.sh

Type the following into spider.sh:

cd /home/Catch_Data

python3 start.py

Press Ctrl+D to save and exit.

Now make spider.sh executable with the chmod command:

chmod +x spider.sh

Finally, run the shell script by prefixing spider.sh with "bash":

bash /scripts/spider.sh

10. Set up a scheduled run

Edit the crontab file with crontab -e and add a line for the job (for example, 0 2 * * * bash /scripts/spider.sh runs the crawler at 2:00 a.m. every day).

Refer to the following articles:

Linux Crontab Timing Task | Novice Tutorial

Crontab Timing Task Getting Started Tutorial, Practical Examples_Errors_In_Life's Blog-CSDN Blog

Finally, make sure the cron service is running:

service cron start

Congratulations, you now know how to deploy a crawler on a Linux server and set it up to run on a schedule!


Origin blog.csdn.net/linxi4165/article/details/125274771