[Scrapy Tutorial 9] Sending Crawled Data as Attachments with the Scrapy Framework and Gmail


When collecting data with a Python web crawler, besides storing the data in a database or exporting it to a file, one of the other most common scenarios is "message notification": after the Python web crawler has collected the required data, the results are pushed to the user through a notification channel.

For example, the article [Python Crawler] Python crawler combined with LINE Notify to create an automated message notification service integrates the LINE Notify service to notify users of price-cut news obtained by the Python web crawler. This article shares another notification channel with you: e-mail.

This article continues from the eighth article, [Scrapy Tutorial 8] Detailed explanation of the practical skills of crawling paginated data with the Scrapy framework: after saving the crawled results into a CSV file, we mail the file to the user as a Gmail attachment. Before starting, you can refer to the second step of the article [Python Practical Application] Practical teaching of sending Gmail with Python to obtain a Gmail application password, which lets us send mail through Gmail's SMTP (Simple Mail Transfer Protocol) service. The key points of this article include:

Scrapy web crawler framework process

Scrapy web crawler project review

Scrapy MailSender combines Gmail to send mail

1. Scrapy web crawler framework process

First, let's review the five execution modules and the architecture of the Scrapy framework shared in the first article, [Python] Quick start with the Scrapy web crawler framework, as shown below:

(Figure: the five execution modules and data flow of the Scrapy framework)

As the figure shows, if you want to post-process the data obtained by the Scrapy web crawler, then after the SPIDERS crawler program obtains the response result (6), the crawled data temporarily stored in the ITEMS data model must be passed on to the ITEM PIPELINE (7, 8), where the logic for subsequent data processing is customized.

It follows that if you want to export the crawled results to a CSV file and send it as a Gmail attachment, this logic must be written in the ITEM PIPELINE, that is, in the pipelines.py file of the Scrapy project, as sketched below.
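As a reminder of what such a pipeline looks like, here is a minimal sketch (the class name is illustrative) of the three hook methods Scrapy calls on any pipeline registered in the project:

class ExamplePipeline:
    def open_spider(self, spider):
        pass  # called once when the spider starts, e.g. to open a file

    def process_item(self, item, spider):
        return item  # called for every crawled item passed in from SPIDERS

    def close_spider(self, spider):
        pass  # called once when the spider finishes, e.g. to send the mail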

2. Review of Scrapy web crawler project

Next, review the three parts of the current Scrapy project, as follows:

"SPIDERS crawler program (inside.py)"

import scrapy


class InsideSpider(scrapy.Spider):
    name = 'inside'
    allowed_domains = ['www.inside.com.tw']
    start_urls = ['https://www.inside.com.tw/tag/ai']
    count = 1  # number of executions

    def parse(self, response):

        yield from self.scrape(response)  # scrape the page content

        # locate the "next page" button element
        next_page_url = response.xpath(
            "//a[@class='pagination_item pagination_item-next']/@href")

        if next_page_url:

            url = next_page_url.get()  # get the URL of the next page

            InsideSpider.count += 1

            if InsideSpider.count <= 3:
                yield scrapy.Request(url, callback=self.parse)  # send the request

    def scrape(self, response):

        # scrape the article titles
        post_titles = response.xpath(
            "//h3[@class='post_title']/a[@class='js-auto_break_title']/text()"
        ).getall()

        # scrape the publish dates
        post_dates = response.xpath(
            "//li[@class='post_date']/span/text()"
        ).getall()

        # scrape the authors
        post_authors = response.xpath(
            "//span[@class='post_author']/a/text()"
        ).getall()

        for data in zip(post_titles, post_dates, post_authors):
            NewsScraperItem = {
                "post_title": data[0],
                "post_date": data[1],
                "post_author": data[2],
            }

            yield NewsScraperItem

The above Scrapy web crawler crawls the article information of the first three pages of AI news on the INSIDE technology media website. For the implementation details, please refer to [Scrapy Tutorial 8] Detailed use of the Scrapy framework to crawl paginated data.

"ITEMS data model (items.py)"

import scrapy


class NewsScraperItem(scrapy.Item):
    # define the fields for your item here, for example:
    # name = scrapy.Field()
    post_title = scrapy.Field()   # article title
    post_date = scrapy.Field()    # publish date
    post_author = scrapy.Field()  # article author

It contains the three fields "article title" (post_title), "publish date" (post_date) and "article author" (post_author) that will later be exported to the CSV file.
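For instance, a populated item behaves like a dictionary whose keys are the declared fields; a short sketch with purely illustrative values:

# Illustrative values only, using the NewsScraperItem class defined above
item = NewsScraperItem(post_title="An example AI article",
                       post_date="2021/03/10",
                       post_author="An example author")
print(item["post_title"])  # each declared field becomes one CSV column later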

"ITEM PIPELINE data model pipeline (pipelines.py)"

from itemadapter import ItemAdapter
from scrapy.exporters import CsvItemExporter


class CsvPipeline:
    def __init__(self):
        self.file = open('posts.csv', 'wb')
        # a Chinese encoding (Big5 assumed here) so Excel displays Chinese correctly
        self.exporter = CsvItemExporter(self.file, encoding='big5')
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

The above comes from the seventh article, [Python] Teach you how to export CSV files from the Scrapy framework to improve data-processing efficiency, which imports the data crawled by the Scrapy web crawler into a CSV file; here we additionally attach that CSV file to a Gmail message. (PS. CsvItemExporter defaults to UTF-8 encoding. If the exported CSV file is to be opened in Microsoft Excel, a Chinese encoding, such as the Big5 used above, needs to be set, otherwise garbled characters will appear.)
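If readers prefer to stay with Unicode, one alternative I would expect to work is the 'utf-8-sig' codec, which writes a UTF-8 byte-order mark that Microsoft Excel uses to detect the encoding:

from scrapy.exporters import CsvItemExporter

file = open('posts.csv', 'wb')
# 'utf-8-sig' prefixes a BOM, letting Excel detect UTF-8 and avoid garbled text
exporter = CsvItemExporter(file, encoding='utf-8-sig')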

3. Scrapy MailSender combined with Gmail to send mail

Within the Scrapy web crawler framework, the built-in MailSender module can be used to send e-mail with only some basic settings. It is based on the Twisted framework's non-blocking IO, so sending mail does not block the crawl, and unexpected errors while sending will not jam the code.
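A quick way to see the non-blocking behaviour: inside a running crawl, send() returns a Twisted Deferred immediately instead of waiting for delivery, so callbacks can be attached if needed. A minimal sketch (account and recipient addresses are placeholders):

from scrapy.mail import MailSender

mailer = MailSender(smtphost="smtp.gmail.com", smtpport=587, smtptls=True,
                    smtpuser="your_account@gmail.com", smtppass="your_app_password")
deferred = mailer.send(to=["example@gmail.com"], subject="test", body="")
# send() returns immediately; the mail is delivered in the background by the
# Twisted reactor, so the crawl is never blocked.
deferred.addErrback(lambda failure: print("mail failed:", failure))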

Open the settings.py configuration file of the Scrapy project and add the following Gmail SMTP settings:

MAIL_HOST = "smtp.gmail.com"
MAIL_PORT = 587
MAIL_FROM = "the email account used to apply for the Gmail application password"
MAIL_PASS = "the Gmail application password"
MAIL_TLS = True  # enable a secure (TLS) connection
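Before wiring these values into Scrapy, it can be worth confirming that the Gmail application password actually works. A small standalone sketch using Python's built-in smtplib (account and password are placeholders):

import smtplib

# Placeholders: fill in the Gmail account and application password from settings.py
with smtplib.SMTP("smtp.gmail.com", 587) as server:
    server.starttls()  # upgrade the connection to TLS, like MAIL_TLS = True
    server.login("your_account@gmail.com", "your_app_password")
    print("Gmail SMTP login succeeded")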

In addition, make sure the CsvPipeline created in the seventh article is enabled in the pipeline settings, as in the following example:

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'news_scraper.pipelines.CsvPipeline': 500,
}

After the settings are complete, open the ITEM PIPELINE (pipelines.py) file and import the Scrapy project's configuration file together with the MailSender module, as in the following example:

from itemadapter import ItemAdapter
from news_scraper import settings
from scrapy.exporters import CsvItemExporter
from scrapy.mail import MailSender

Since the email has to be sent after the Scrapy web crawler has exported the data to the CSV file, the Scrapy MailSender object is created in the close_spider() method of the CsvPipeline class, using the values just set in the settings.py file. Pay special attention to the keyword arguments (smtphost, smtpport, smtpuser, smtppass, smtptls), which must be written exactly as MailSender expects, as in the following example:

class CsvPipeline:
    def __init__(self):
        self.file = open('posts.csv', 'wb')
        self.exporter = CsvItemExporter(self.file, encoding='big5')
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

        mail = MailSender(smtphost=settings.MAIL_HOST,
                          smtpport=settings.MAIL_PORT,
                          smtpuser=settings.MAIL_FROM,
                          smtppass=settings.MAIL_PASS,
                          smtptls=settings.MAIL_TLS)
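As an aside, MailSender also accepts a mailfrom argument that sets the "From:" header (it defaults to 'scrapy@localhost'); when authenticating against Gmail, the sender address is generally rewritten to the authenticated account anyway, so the example omits it.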

Next, specify the Gmail attachment, including the "attachment display name (attach_name)", "Internet media type (mime_type)" and "file object (file_object)", and finally send the CSV file exported by the web crawler through the send() method of the Scrapy MailSender module; its keyword arguments likewise need to be written exactly as expected. The complete example is as follows:

class CsvPipeline:
    def __init__(self):
        self.file = open('posts.csv', 'wb')
        self.exporter = CsvItemExporter(self.file, encoding='big5')
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

        mail = MailSender(smtphost=settings.MAIL_HOST,
                          smtpport=settings.MAIL_PORT,
                          smtpuser=settings.MAIL_FROM,
                          smtppass=settings.MAIL_PASS,
                          smtptls=settings.MAIL_TLS)

        attach_name = "posts.csv"  # display name of the attachment
        mime_type = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
        file_object = open("posts.csv", "rb")  # read the exported CSV file

        # send the mail
        return mail.send(to=["example@gmail.com"],  # recipients
                         subject="news",  # mail subject
                         body="",  # mail body
                         attachs=[(attach_name, mime_type, file_object)])  # attachment

4. Summary

In practice, exporting the data obtained by a Python web crawler to a file and mailing it to the user is a very common application. The Scrapy web crawler framework provides the MailSender module, which lets developers easily combine an SMTP (Simple Mail Transfer Protocol) service such as Gmail to send crawled data files with just a few simple settings, achieving the effect of message notification. I hope this article is helpful to readers who want to add an email function. Welcome to leave a message below and share with me~


Origin blog.csdn.net/wlcs_6305/article/details/114632311