A few days ago a project required a crawler, so I spent three days writing one and deploying it to a server. I had never touched a Linux server before; after consulting a lot of blog posts, I wrote up this complete tutorial.
First, my basic environment:
Windows 11, Python 3.9, MySQL, Debian 11, Google Chrome.
Now let's get to the point. I'll use the Jianshu website as an example:
1. Write the crawler locally (Scrapy + Selenium)
1. Install Scrapy
pip install scrapy
2. Create a Scrapy crawler project
I'll use Jianshu's paid serials as the example. The page is loaded asynchronously; pay special attention here. If you are sure your crawler code is fine but you still can't extract any data from the page, don't hesitate: the site loads asynchronously, and Scrapy alone is not enough. You must combine it with Selenium or a similar tool.
First, clarify what to crawl. This time we collect each work's title, author, and read count, then store everything in a MySQL database. If the data you crawl includes timestamps, handle them the same way as the fields above.
Use cmd to enter the folder where you want to store the crawler project, then run:
scrapy startproject jianshuSpider
Replace jianshuSpider with your own project name, then cd into the project and run the command that creates the spider. Note that the project name and the spider name are two different concepts, and they cannot be the same!
cd jianshuSpider
# scrapy genspider <spider name> <allowed domain>
scrapy genspider jianshu jianshu.com
The result is shown in the figure:
3. Write a crawler
Open the project in PyCharm; the structure is as follows:
jianshu.py: the spider's logic — what to crawl, how to extract data from the page, and so on; we implement this ourselves;
items.py: the classes that hold scraped data items (you can define several); we implement this ourselves;
middlewares.py: middleware sitting in the request/response flow; usually little needs to change;
pipelines.py: pipelines with the persistence code — database inserts, updates, deletes, and queries;
settings.py: configuration; mostly defaults, but a few options must be switched on.
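For reference, a freshly generated project (after `scrapy genspider`) has this standard layout:

```
jianshuSpider/
├── scrapy.cfg
└── jianshuSpider/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── jianshu.py
```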
3.1 Modify settings.py
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Export encoding (added)
FEED_EXPORT_ENCODING = 'utf-8'
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True  # add this line yourself
# Disable cookies (enabled by default)
COOKIES_ENABLED = False  # keeps the server from tracking us via cookies
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36'
}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'jianshuSpider.pipelines.JianshuspiderPipeline': 300,
}
To find your own User-Agent when modifying this setting: type about:version into the browser's address bar; the User-Agent shown there is your browser's User-Agent.
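If you later want to rotate User-Agents instead of hard-coding one, here is a minimal sketch. The UA strings below are illustrative examples, and wiring the result into requests (e.g. setting `request.headers['User-Agent']` in a downloader middleware's `process_request`) is left out; only the selection logic is shown.

```python
import random

# A small pool of example User-Agent strings (illustrative, not exhaustive).
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36',
]

def random_user_agent():
    """Pick a random User-Agent string from the pool."""
    return random.choice(USER_AGENTS)
```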
3.2 Write items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class JianshuspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    author = scrapy.Field()
    readtimes = scrapy.Field()
3.3 Write jianshu.py
This step involves extracting data from the page. I use XPath selectors, which are very simple to use. Chrome can copy an XPath for you: right-click the page and choose Inspect, select the element whose data you need, then right-click it in the inspector and choose Copy → Copy XPath.
class JianshuSpider(scrapy.Spider):
    name = 'jianshu'
    allowed_domains = ['jianshu.com']
    # start_urls can hold several URLs (comma-separated); the spider visits them one by one
    start_urls = ['https://www.jianshu.com/mobile/books?category_id=284']

    def parse(self, response):
        books = response.xpath('//*[@id="book-waterfall"]/div')
        for book in books:
            bookitem = JianshuspiderItem()  # create a fresh item for each book
            bookitem['name'] = book.xpath('./div/div[2]/p/text()').get()
            bookitem['author'] = book.xpath('./div/div[2]/div/span[1]/span/text()').get()
            bookitem['readtimes'] = book.xpath('./div/div[2]/div/span[2]/text()').get().lstrip()
            print("Title:", bookitem['name'])
            print("Author:", bookitem['author'])
            print("Reads:", bookitem['readtimes'])
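The readtimes value is only `lstrip()`-ed, so it stays a string. If you want an integer instead, here is a hypothetical normalizer; it assumes Jianshu displays large counts with the 万 (×10,000) suffix, which you should verify against the live page before relying on it:

```python
def normalize_readtimes(raw):
    """Turn a scraped read-count string like ' 8.9万' into an integer.
    The 万 (×10,000) suffix is an assumption about Jianshu's display format."""
    text = raw.strip()
    if text.endswith('万'):
        return int(float(text[:-1]) * 10000)
    return int(float(text))
```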
At this point you can run the spider: type scrapy crawl jianshu in the PyCharm terminal and press Enter. Alternatively, create a start.py file in the project. Careful: it must sit at the same level as the crawler directory! Then you only need to run this file each time. The start.py code is as follows:
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'jianshu', '--nolog'])
After running it, we find that we got no data at all!
That's because the site loads asynchronously, which we can solve with Selenium.
3.4 Selenium solves the asynchronous loading problem
Install Selenium: pip install selenium
To install the Chrome driver, refer to this article; there is no need to add environment variables.
A brief word on what these two pieces do. Selenium is a staple of software testing: it automatically executes scripts we define, and together with the browser driver it can control a browser, simulating human actions such as clicking and dragging on a page. Be warned that when run on a Linux server, startup can be very slow — five or six minutes.
After everything is installed, you can write a complete crawler code.
Complete jianshu.py
# jianshu.py
import scrapy
from selenium import webdriver
from jianshuSpider.items import JianshuspiderItem
from selenium.webdriver.chrome.options import Options

class JianshuSpider(scrapy.Spider):
    name = 'jianshu'
    allowed_domains = ['jianshu.com']
    # start_urls can hold several URLs (comma-separated); the spider visits them one by one
    start_urls = ['https://www.jianshu.com/mobile/books?category_id=284']

    # instantiate one browser object for the whole spider
    def __init__(self):
        # keep the site from detecting Selenium
        options = Options()
        options.add_argument('--no-sandbox')
        options.add_argument('--headless')
        options.add_experimental_option('excludeSwitches', ['enable-automation'])
        options.add_experimental_option('useAutomationExtension', False)
        options.add_argument('--ignore-certificate-errors')
        options.add_argument('--ignore-ssl-errors')
        self.bro = webdriver.Chrome(options=options)  # chrome_options= is deprecated
        super().__init__()

    def parse(self, response):
        books = response.xpath('//*[@id="book-waterfall"]/div')
        for book in books:
            bookitem = JianshuspiderItem()  # create a fresh item for each book
            bookitem['name'] = book.xpath('./div/div[2]/p/text()').get()
            bookitem['author'] = book.xpath('./div/div[2]/div/span[1]/span/text()').get()
            bookitem['readtimes'] = book.xpath('./div/div[2]/div/span[2]/text()').get().lstrip()
            yield bookitem

    # extra method added to the spider: close the browser when the spider ends
    def closed(self, spider):
        print("spider closed")
        print("browser closed")
        self.bro.quit()
Next, modify middlewares.py. There are said to be two ways to use Selenium inside Scrapy: override process_request, or override process_response. The difference is that the former opens only one browser window, while the latter opens several (one per response), which is slower. So I use the first method; the code is as follows:
# complete middlewares.py
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals
import time
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
from scrapy.http import HtmlResponse

class JianshuspiderSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
class JianshuspiderDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        bro = spider.bro
        bro.get(request.url)  # fetch each request with the spider's shared browser
        # scroll the browser to the bottom, pausing between scrolls
        bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
        time.sleep(1)
        bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
        time.sleep(1)
        text = bro.page_source
        response = HtmlResponse(url=request.url, body=text.encode('utf-8'), status=200)
        print("visiting: {0}".format(request.url))
        return response

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
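The two fixed scrolls in process_request are enough for this page, but a more robust pattern — a sketch, not part of the original middleware — is to scroll until the document height stops growing. `scroll_to_bottom` is a hypothetical helper that works with any object exposing `execute_script()` (such as a Selenium WebDriver):

```python
import time

def scroll_to_bottom(driver, pause=1.0, max_rounds=10):
    """Scroll until document.body.scrollHeight stops growing,
    instead of a fixed number of scrolls. `driver` is any object
    with an execute_script() method (e.g. a Selenium WebDriver)."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give lazy-loaded content time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new loaded; we are at the bottom
        last_height = new_height
```

Inside process_request you would call `scroll_to_bottom(spider.bro)` in place of the two hard-coded scrolls.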
3.5 Connect to the database for storage
Install pymysql: pip install pymysql
First use Navicat locally to create a new database jianshu, create a table books in it, and add three columns: name, author, and readtimes.
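If you prefer SQL over Navicat's GUI, an equivalent sketch of the schema (the column types are my assumption; note that REPLACE INTO only deduplicates if the table has a primary or unique key, which is why name is declared as the primary key here):

```sql
CREATE DATABASE IF NOT EXISTS jianshu DEFAULT CHARACTER SET utf8mb4;
USE jianshu;
CREATE TABLE IF NOT EXISTS books (
    name      VARCHAR(255) NOT NULL PRIMARY KEY,  -- REPLACE INTO needs a unique key
    author    VARCHAR(255),
    readtimes VARCHAR(64)
);
```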
Then modify pipelines.py
import pymysql

class JianshuspiderPipeline:
    def process_item(self, item, spider):
        conn = pymysql.connect(
            host="...",
            user="...",
            passwd="...",
            charset="utf8",
            use_unicode=False
        )
        cursor = conn.cursor()
        cursor.execute("USE jianshu")
        sql = "REPLACE INTO books(name, author, readtimes) " \
              "VALUES(%s, %s, %s)"
        try:
            cursor.execute(sql, (item['name'], item['author'], item['readtimes']))
            conn.commit()
            print("================= writing to database ==================")
        except BaseException as e:
            print("error here >>>>>>>>>>>>>", e, "<<<<<<<<<<<<< error here")
        conn.close()
        return item
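One caveat about the pipeline above: it opens a new database connection for every item. Scrapy pipelines can instead open the connection once in open_spider and close it in close_spider. Here is a hedged sketch of that pattern; the connection factory and INSERT statement are injected, so it is not tied to pymysql specifically (placeholder style is %s for pymysql, ? for sqlite3):

```python
import sqlite3  # used only in the usage example below; swap in pymysql on the server

class OneConnectionPipeline:
    """Sketch: one DB connection per spider run instead of one per item.
    `connect` is any zero-argument function returning a DB-API connection;
    `insert_sql` must use the placeholder style of that driver."""

    def __init__(self, connect, insert_sql):
        self.connect = connect
        self.insert_sql = insert_sql

    def open_spider(self, spider):
        # called once when the spider starts
        self.conn = self.connect()
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        self.cursor.execute(self.insert_sql,
                            (item['name'], item['author'], item['readtimes']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # called once when the spider ends
        self.conn.close()
```

For a quick local check you can drive it with an in-memory sqlite3 database standing in for MySQL; on the server you would pass a pymysql connection factory instead.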
Run start.py, and you can see the results in Navicat:
That means the local crawler is done! Now let's deploy it to a Linux server!
2. Deploy to the server
1. Purchase a server
Buy whatever fits your budget; new-user deals are very cheap, and free trials exist too. I use an Alibaba Cloud server running Debian 11.
2. Upload the project to the server
Download FileZilla, a completely free tool made for transferring files to and from a server.
On the screen above, enter the public IP of the server we just bought as the host, then the user name, password, and port (enter 22), click Quickconnect, and the connection succeeds.
Then, in the upper-right panel, right-click home, create a directory named python projects, and click OK.
Select the python projects folder, go to the upper-left panel, select the project we need to upload, right-click and choose Upload.
The upload succeeds; if a transfer fails, just upload it again. Note! After debugging code locally, remember to upload it to the server again to overwrite the old copy!!!
3. Linux server environment configuration
0. Preparation
Use the FTP software to upload the local crawler project to the server's /home directory. I use PuTTY to connect to the server remotely; controlling it through the web-based terminal works just as well.
Follow the steps below to configure the server:
1. apt update
2. apt upgrade -y
3. apt install mysql-server
If an error is reported: Debian 11's default repositories do not carry a mysql-server package; installing default-mysql-server (which provides MariaDB) instead solves it.
After installation, enter the following command to set up a remote connection to the database:
mysql -u root -p
USE mysql;
SELECT host, user FROM user;
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'your password (change this)' WITH GRANT OPTION;
Set the port of mysql:
You can refer to this article:
https://www.cnblogs.com/linjiqin/p/5270938.html
Then go to the server's web console, edit the security-group rules, and open port 3306 (MySQL).
4. If sudo is missing:
Solve it with: apt install sudo
5. The system ships with Python 3 preinstalled.
6. Install Google Chrome, driver, selenium
Install chromedriver, chrome and run selenium on Ubuntu 16.04
7. Server firewall settings
apt install ufw
systemctl enable ufw
systemctl start ufw
ufw allow ssh
ufw allow http
ufw allow 3306
8. Test run: enter the crawler directory
cd /home/Catch_Data
python3 start.py
It runs fine.
9. Write a shell script
Go to the server's main path, create a scripts folder, and create the spider.sh file:
mkdir scripts
cd scripts
cat > spider.sh
Type into spider.sh:
cd /home/Catch_Data
python3 start.py
Press Ctrl+D to save and exit.
Now make spider.sh executable using the chmod command:
chmod +x spider.sh
Finally, run your shell script by prefixing spider.sh with "bash":
bash /scripts/spider.sh
10. Set up a scheduled start
Edit the crontab file: crontab -e
Refer to the following articles:
Linux Crontab scheduled tasks | Runoob tutorial
Getting started with crontab scheduled tasks, with practical examples — Errors_In_Life's blog, CSDN
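A typical crontab entry, assuming the spider.sh path from the previous step and a daily run at 02:30 (the log path is just an example):

```
# m h dom mon dow  command
30 2 * * * /bin/bash /scripts/spider.sh >> /var/log/spider.log 2>&1
```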
Finally, start the cron service:
service cron start