The Most Complete Docker Tutorial: From Theory to Practice (XIX)

Python is currently the fastest-growing major programming language and the second most loved language among developers (see the Stack Overflow 2019 Developer Survey). I suggest that .NET and Java developers pick up Python as a second language: on the one hand, Python is genuinely strong in certain areas (web crawlers, algorithms, artificial intelligence, and so on); on the other hand, believe me, Python has almost no barrier to entry, and you do not even need to buy a book!

Preparing for the recent 4.21 Changsha Developers Conference has taken up a lot of my time. We have invited Zhang Shanyou (Tencent senior technical expert, .NET Chinese community leader, and Microsoft MVP); Liangtong Ming (author of the open-source 52ABP framework and Microsoft MVP); the well-known technical author Wang Peng; Zhuo Wei, a senior member of staff at Tencent; Hu Li Wei, Senior Product Manager at Tencent Cloud; and others. Readers who would like to attend can register through the official account menu [Contact us] ==> [registration]. Technology knows no language and no boundaries, and we look forward to sharing and exchanging ideas with you!

Table of Contents

About Python

Official Image

Crawling a Blog List with Python

Requirements

Getting to Know Beautiful Soup

Analyzing the Page and Deriving the Crawl Rules

Writing the Code for the Crawl Logic

Writing the Dockerfile

Running and Viewing the Crawl Results

 

About Python

Python is a computer programming language. It is a dynamic, object-oriented scripting language that was originally designed for writing automation scripts (shell). As new versions and language features have been added, it has increasingly been used to develop standalone, large-scale projects. Python is currently the fastest-growing major programming language and the second most loved language among developers (see the Stack Overflow 2019 Developer Survey).

Python is an interpreted scripting language that can be used in the following areas:

  • Web and Internet Development
  • Scientific Computing and Statistics
  • Education
  • Desktop interface development
  • Software Development
  • Back-end development

Python has almost no barrier to entry, and with it you can learn and master machine learning, and even deep learning, in less time and more efficiently. For most people, though, Python alone is not enough; ideally you should also know a statically typed language (.NET / Java). Likewise, I recommend that .NET and Java developers pick up Python as a second language: on the one hand, Python is genuinely strong in certain areas (web crawlers, algorithms, artificial intelligence, and so on); on the other hand, Python has almost no barrier to entry, and you do not even need to buy a book!

 

Official Image

Official image address: https://hub.docker.com/_/python

Note: be sure to use the image that Docker Hub marks as official.

 

Crawling a Blog List with Python

Requirements

In this article we use Python to crawl my blog on cnblogs (Blog Park) and print each post's title, link, date, and summary.

Blog address: http://www.cnblogs.com/codelove/


 

Getting to Know Beautiful Soup

Beautiful Soup is a Python library that extracts data from HTML or XML files and supports multiple parsers. Put simply, it is a flexible and convenient web page parsing library and a powerful tool for crawling. In this tutorial we use Beautiful Soup to fetch the blog data.

Beautiful Soup official website: https://beautifulsoup.readthedocs.io

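The main parsers are:

  • Python standard library (html.parser): built into Python, so no extra package is needed; moderate speed
  • lxml HTML parser (lxml): very fast and lenient; requires the lxml package
  • lxml XML parser (lxml-xml or xml): very fast XML parsing; requires the lxml package
  • html5lib (html5lib): the most lenient parser, it parses pages the way a web browser does; very slow; requires the html5lib package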
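The parser is selected by the second argument passed to BeautifulSoup. As a minimal sketch, the snippet below parses a small hand-written HTML string (not taken from the real blog) with the built-in html.parser, which needs no extra package. Passing 'html5lib' or 'lxml' instead should select those parsers, provided the corresponding package is installed.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup, used only for illustration.
SAMPLE_HTML = (
    "<html><body><h1>Hello</h1>"
    "<p class='intro'>Beautiful <b>Soup</b></p></body></html>"
)

# The second argument names the parser; "html.parser" ships with Python.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Text of the first <h1> element.
heading = soup.h1.get_text()

# get_text() joins the text of nested tags, so the <b> content is included.
intro = soup.find("p", class_="intro").get_text()

print(heading)
print(intro)
```

The crawler later in this article uses html5lib instead, trading speed for tolerance of messy markup.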

Analyzing the Page and Deriving the Crawl Rules

First, we open the following address in the Chrome browser: http://www.cnblogs.com/codelove/

Then we press F12 to open the Developer Tools and use them to work out the following rules:

  • Blog block (div.day)

  • Blog title (div.postTitle a)

  • The other fields (date, blog link, and summary) are located in the same way; we do not include screenshots for them here.
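To make these rules concrete, here is a small sketch that applies the same CSS selectors to a hypothetical HTML fragment. The fragment only mimics the structure described above (the real page contains far more markup), and the example.com links and post texts are made up. The .dayTitle and .postCon selectors are the ones used by the crawler code later in this article.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the blog list structure; not real page content.
SAMPLE_HTML = """
<div class="forFlow">
  <div class="day">
    <div class="dayTitle"><a href="https://example.com/day-1">2019-04-01</a></div>
    <div class="postTitle"><a href="https://example.com/post-1.html">First post</a></div>
    <div class="postCon"><div>Summary of the first post.</div></div>
  </div>
  <div class="day">
    <div class="dayTitle"><a href="https://example.com/day-2">2019-03-28</a></div>
    <div class="postTitle"><a href="https://example.com/post-2.html">Second post</a></div>
    <div class="postCon"><div>Summary of the second post.</div></div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

posts = []
for day in soup.select("div.day"):                 # one block per blog entry
    link = day.select_one("div.postTitle a")       # the title link
    posts.append({
        "title": link.get_text(strip=True),
        "url": link["href"],
        "date": day.select_one("div.dayTitle a").get_text(strip=True),
        "summary": day.select_one("div.postCon > div").get_text(strip=True),
    })

for post in posts:
    print(post)
```

select returns a list of every match, while select_one returns only the first match (or None), which is convenient when a block is expected to contain exactly one title or date.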

 

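Then, by observing the blog's URLs, we work out the pagination pattern: page N of the post list is served at https://www.cnblogs.com/codelove/default.html?page=N.

With this analysis in hand, we can start coding with confidence.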
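The pagination pattern can be sketched as a small helper that fills the page number into a URL template. The template is the one used by the crawler code in this article; the helper name page_url and the range check are assumptions added here for illustration.

```python
# URL template from the crawler code; {page} is the 1-based page number.
URL_TEMPLATE = "https://www.cnblogs.com/codelove/default.html?page={page}"


def page_url(page: int) -> str:
    """Return the list-page URL for the given 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return URL_TEMPLATE.format(page=page)


# Build the URLs of the first three list pages.
first_three = [page_url(n) for n in range(1, 4)]
for u in first_three:
    print(u)
```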

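Writing the Code for the Crawl Logic

Before coding, please read the official BeautifulSoup documentation. Then, based on the requirements, we write the following Python code: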

# For details on BeautifulSoup, see the official documentation: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/#id52
from bs4 import BeautifulSoup
import requests

url = "https://www.cnblogs.com/codelove/default.html?page={page}"

# Number of pages fetched so far; starts at 0
page = 0
while True:
    page += 1
    request_url = url.format(page=page)
    # The timeout keeps a stalled connection from hanging the script
    response = requests.get(request_url, timeout=10)
    # Parse the HTML with BeautifulSoup's html5lib parser (best compatibility)
    html = BeautifulSoup(response.text, 'html5lib')

    # Select all blog post elements on the current page
    blog_list = html.select(".forFlow .day")

    # Stop looping once a page yields no more posts
    if not blog_list:
        break

    print("fetch: ", request_url)

    for blog in blog_list:
        # Title; get_text() also works when the link contains nested tags,
        # where .string would return None
        title = blog.select(".postTitle a")[0].get_text(strip=True)
        print('--------------------------' + title + '--------------------------')

        # Blog link
        blog_url = blog.select(".postTitle a")[0]["href"]
        print(blog_url)

        # Blog date
        date = blog.select(".dayTitle a")[0].get_text()
        print(date)

        # Blog summary
        des = blog.select(".postCon > div")[0].get_text()
        print(des)

        print('-------------------------------------------------------------------------------------')

  

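As the code above shows, we page through the blog according to the rules we derived, extract the post information we need from the HTML of each page, and print it. The code is commented, so we will not go into further detail here.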
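The loop's stop rule (keep requesting pages until one comes back with no posts) can be separated from the network code, which makes it easy to try out offline. The sketch below is a variation under stated assumptions, not the script above: crawl_pages, fake_fetch, FAKE_SITE, and the max_pages safety cap are all hypothetical names introduced here, and the fake data stands in for the real fetch-and-parse step.

```python
from typing import Callable, Dict, Iterator, List

Post = Dict[str, str]


def crawl_pages(
    fetch_page: Callable[[int], List[Post]], max_pages: int = 50
) -> Iterator[Post]:
    """Yield posts page by page until a page is empty or max_pages is reached."""
    for page in range(1, max_pages + 1):
        posts = fetch_page(page)
        if not posts:
            # Same stop rule as the script: an empty page ends the crawl.
            return
        yield from posts


# Stand-in for the network and parsing step: two pages of fake posts, then nothing.
FAKE_SITE = {
    1: [{"title": "Post A"}, {"title": "Post B"}],
    2: [{"title": "Post C"}],
}


def fake_fetch(page: int) -> List[Post]:
    """Return the fake posts for a page, or an empty list past the last page."""
    return FAKE_SITE.get(page, [])


titles = [post["title"] for post in crawl_pages(fake_fetch)]
print(titles)
```

In a real crawler, fetch_page would download and parse one list page. The max_pages cap is a design choice that guards against looping forever if the site keeps returning posts for every page number.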

 

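Writing the Dockerfile

With the code finished, we follow our usual practice and use Docker for local development without installing an SDK, so we write the following Dockerfile: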

# Use the official image
FROM python:3.7-slim

# Set the working directory
WORKDIR /app

# Copy the current directory into the image
COPY . /app

# Install the required modules
RUN pip install --trusted-host pypi.python.org -r requirements.txt

# Run app.py when the container launches
CMD ["python", "app.py"]

  

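Note that because we use third-party libraries such as beautifulsoup, we need to install the corresponding modules. The contents of requirements.txt are as follows (one package per line):

html5lib
beautifulsoup4
requests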

 

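Running and Viewing the Crawl Results

Once the image is built, we run a container from it. For each page it prints the fetched URL, followed by the title, link, date, and summary of every post on that page.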

 

Author: 雪雁
Source: http://www.cnblogs.com/codelove/


Origin www.cnblogs.com/zdalongren/p/12213087.html