Introduction
In web crawling we often encounter pages that require user authentication, such as login or registration verification. This article introduces how to use Scrapy-Selenium to handle such pages and implement automated login and crawling.
Overview
Scrapy-Selenium combines two powerful crawling tools, Scrapy and Selenium, letting you simulate browser operations within the Scrapy framework to handle pages that require authentication. This is especially useful for crawling websites that require a login.
Implementation
In practice, many websites require users to log in before data can be accessed. Scrapy-Selenium can simulate the user's login steps so that the crawler can reach pages that require authentication.
First, we need to configure the Selenium-related settings and middleware in the project's settings.py, along with the proxy settings:
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = '/path/to/chromedriver'
SELENIUM_DRIVER_ARGUMENTS = ['--headless']  # Optional: run the browser in headless mode

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
    'your_project_name.middlewares.ProxyMiddleware': 750
}

# Proxy settings (Yiniuyun / 16yun proxy service)
PROXY_HOST = "www.16yun.cn"
PROXY_PORT = "3111"
PROXY_USER = "16YUN"
PROXY_PASS = "16IP"
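One caveat worth noting: request.meta['proxy'] only applies to requests that Scrapy downloads itself; pages fetched through Selenium go out through the browser, which needs its own proxy flag. A minimal sketch, assuming Chrome and the settings above, builds the browser-side --proxy-server argument from the same values:

```python
# Sketch: derive Chrome's proxy argument from the proxy settings above.
# Note: --proxy-server does not accept embedded credentials; an
# authenticated proxy usually needs a browser extension or a local forwarder.
PROXY_HOST = "www.16yun.cn"
PROXY_PORT = "3111"

proxy_arg = f"--proxy-server=http://{PROXY_HOST}:{PROXY_PORT}"

# Would sit in settings.py alongside '--headless'
SELENIUM_DRIVER_ARGUMENTS = ['--headless', proxy_arg]
print(proxy_arg)
```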
Then write the proxy middleware in middlewares.py:
class ProxyMiddleware:
    def __init__(self, proxy_host, proxy_port, proxy_user, proxy_pass):
        self.proxy_host = proxy_host
        self.proxy_port = proxy_port
        self.proxy_user = proxy_user
        self.proxy_pass = proxy_pass

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            proxy_host=crawler.settings.get('PROXY_HOST'),
            proxy_port=crawler.settings.get('PROXY_PORT'),
            proxy_user=crawler.settings.get('PROXY_USER'),
            proxy_pass=crawler.settings.get('PROXY_PASS')
        )

    def process_request(self, request, spider):
        request.meta['proxy'] = (
            f'http://{self.proxy_user}:{self.proxy_pass}'
            f'@{self.proxy_host}:{self.proxy_port}'
        )
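To see what the middleware actually attaches to a request, here is a standalone sketch that exercises the same logic with stand-ins for Scrapy's settings and Request objects, so it runs outside a project:

```python
# Stand-ins for Scrapy's crawler settings and Request, for illustration only.
class FakeSettings(dict):
    pass  # dict already provides the .get(key) interface used below

class FakeRequest:
    def __init__(self):
        self.meta = {}

# Same logic as the ProxyMiddleware above, minus the from_crawler hook.
class ProxyMiddleware:
    def __init__(self, proxy_host, proxy_port, proxy_user, proxy_pass):
        self.proxy_host = proxy_host
        self.proxy_port = proxy_port
        self.proxy_user = proxy_user
        self.proxy_pass = proxy_pass

    def process_request(self, request, spider):
        request.meta['proxy'] = (
            f'http://{self.proxy_user}:{self.proxy_pass}'
            f'@{self.proxy_host}:{self.proxy_port}'
        )

settings = FakeSettings(PROXY_HOST="www.16yun.cn", PROXY_PORT="3111",
                        PROXY_USER="16YUN", PROXY_PASS="16IP")
mw = ProxyMiddleware(
    proxy_host=settings.get('PROXY_HOST'),
    proxy_port=settings.get('PROXY_PORT'),
    proxy_user=settings.get('PROXY_USER'),
    proxy_pass=settings.get('PROXY_PASS'),
)
request = FakeRequest()
mw.process_request(request, spider=None)
print(request.meta['proxy'])  # http://16YUN:16IP@www.16yun.cn:3111
```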
Next, we can create a Spider that performs the login. Suppose we want to crawl a website that requires login; the following is sample code:
import scrapy
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By


class LoginSpider(scrapy.Spider):
    name = 'login_spider'
    start_urls = ['https://example.com/login']

    def parse(self, response):
        yield SeleniumRequest(
            url=response.url,
            callback=self.login,
            wait_time=5  # Wait to make sure the page has finished loading
        )

    def login(self, response):
        # scrapy-selenium exposes the browser via the request meta
        driver = response.request.meta['driver']
        driver.find_element(By.ID, 'username').send_keys('your_username')
        driver.find_element(By.ID, 'password').send_keys('your_password')
        driver.find_element(By.ID, 'login_button').click()
        yield SeleniumRequest(
            url='https://example.com/data_page',
            callback=self.parse_data
        )

    def parse_data(self, response):
        # Parse the data here...
        pass
In the code above, we first visit the login page, then use Selenium to type the username and password and click the login button. After a successful login, we can continue on to pages that require authentication and crawl their data.
Case Study
Suppose we want to crawl a website that requires login: we use Scrapy-Selenium for automated login and data crawling, then store the data in a MongoDB database.
import scrapy
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
import pymongo


class LoginAndScrapeSpider(scrapy.Spider):
    name = 'login_scrape'
    start_urls = ['https://example.com/login']

    def parse(self, response):
        yield SeleniumRequest(
            url=response.url,
            callback=self.login,
            wait_time=5
        )

    def login(self, response):
        driver = response.request.meta['driver']
        driver.find_element(By.ID, 'username').send_keys('your_username')
        driver.find_element(By.ID, 'password').send_keys('your_password')
        driver.find_element(By.ID, 'login_button').click()
        yield SeleniumRequest(
            url='https://example.com/data_page',
            callback=self.parse_data
        )

    def parse_data(self, response):
        data = response.xpath('//div[@class="data"]/text()').get()
        # Store the data in MongoDB
        client = pymongo.MongoClient(host='localhost', port=27017)
        db = client['scraped_data']
        collection = db['data_collection']
        collection.insert_one({'data': data})
        client.close()
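Opening a new MongoDB connection for every parsed page, as parse_data does above, is wasteful. A more idiomatic Scrapy design moves storage into an item pipeline that connects once per spider run; the sketch below reuses the same hypothetical database and collection names, and would be enabled via ITEM_PIPELINES in settings.py:

```python
# Sketch of an item pipeline that stores scraped items in MongoDB.
# Database and collection names match the illustrative ones used above.
class MongoPipeline:
    def __init__(self, mongo_uri='mongodb://localhost:27017',
                 db_name='scraped_data', collection_name='data_collection'):
        self.mongo_uri = mongo_uri
        self.db_name = db_name
        self.collection_name = collection_name
        self.client = None

    def open_spider(self, spider):
        import pymongo  # connect once, when the spider starts
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.db_name]

    def process_item(self, item, spider):
        # Each item yielded by the spider becomes one document
        self.db[self.collection_name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()
```

With this in place, parse_data would simply yield a dict such as {'data': data} and let the pipeline handle persistence.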
Conclusion
With Scrapy-Selenium, we can easily handle pages that require login or registration before they can be crawled. This article showed how to configure Selenium within Scrapy, how to write a Spider that performs automated authentication and data crawling, and how to add proxy settings. Together these can greatly improve a crawler's efficiency and capability.
By combining Selenium and Scrapy, we can handle a wide range of crawling tasks more flexibly and efficiently, especially when user authentication is involved, which opens up more possibilities for data collection work.